By Abe 2012-09-12 18:23:49

Is there a way to trigger a method in a Spider class just before it terminates?

I can terminate the spider myself, like this:

from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    # Config stuff goes here...

    def quit(self):
        # Do some stuff...
        raise CloseSpider('MySpider is quitting now.')

    def my_parser(self, response):
        if termination_condition:
            self.quit()

        # Parsing stuff goes here...

But I can't find any information on how to determine when the spider is about to quit naturally.

5 Answers

@Chris 2013-09-19 22:17:43

For me the accepted answer did not work, or is at least outdated for Scrapy 0.19. I got it to work with the following, though:

from scrapy import signals
from scrapy.contrib.spiders import CrawlSpider
from scrapy.signalmanager import SignalManager
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        SignalManager(dispatcher.Any).connect(
            self.closed_handler, signal=signals.spider_closed)

    def closed_handler(self, spider):
        # do stuff here
        pass

@slavugan 2017-04-05 16:04:36

If you have many spiders and want to do something before each of them closes, it may be convenient to add a stats collector to your project.

In settings:

STATS_CLASS = 'scraper.stats.MyStatsCollector'

And the collector:

from scrapy.statscollectors import StatsCollector

class MyStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # do something here; this runs once when each spider closes
        pass
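For instance, a minimal sketch that dumps the final stats to a JSON file when each spider closes (the file name is an arbitrary choice for illustration):

import json

from scrapy.statscollectors import StatsCollector

class MyStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # 'stats' is a plain dict of the stats collected for this spider
        with open('%s_stats.json' % spider.name, 'w') as f:
            json.dump(stats, f, default=str)  # default=str handles datetime values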

@Levon 2016-10-12 09:45:11

For Scrapy version 1.0.0+ (it may also work for older versions).

from scrapy import signals
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        print('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        print('Closing {} spider'.format(spider.name))

One good use case is adding a tqdm progress bar to a Scrapy spider.

# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tqdm import tqdm

from myproject.items import MyItem  # adjust to your project's items module


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['somedomain.comm']
    start_urls = ['http://www.somedomain.comm/ccid.php']

    rules = (
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccds.php\?id=.*'),
             callback='parse_item',
             ),
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccid.php$',
                           restrict_xpaths='//table/tr[contains(., "SMTH")]'), follow=True),
    )

    def parse_item(self, response):
        self.pbar.update()  # update progress bar by 1
        item = MyItem()
        # parse response
        return item

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.pbar = tqdm()  # initialize progress bar
        self.pbar.clear()
        self.pbar.write('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        self.pbar.clear()
        self.pbar.write('Closing {} spider'.format(spider.name))
        self.pbar.close()  # close progress bar
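A note on the design: pbar.write() prints messages without corrupting the progress bar, which is why the open/close handlers use it instead of a plain print().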

@An Se 2018-01-31 10:12:28

This should be the selected answer, thanks Levon.

@not2qubit 2018-10-02 13:12:48

This is the new method! Although it looks less transparent, its advantage is that it removes the extra clutter of using def __init__(self): ... and the PyDispatcher import from scrapy.xlib.pydispatch import dispatcher.

@THIS USER NEEDS HELP 2015-10-23 22:29:51

Just to update, you can simply define a closed method, like this:

class MySpider(CrawlSpider):
    def closed(self, reason):
        # do something here; Scrapy calls closed() automatically on shutdown
        pass
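Scrapy's base Spider class wires the spider_closed signal to this method for you, so no manual signal handling is needed. A minimal sketch (the log message is illustrative):

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'

    def closed(self, reason):
        # 'reason' is a string such as 'finished', 'cancelled' or 'shutdown'
        self.logger.info('%s closed: %s', self.name, reason)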

@Aminah Nuraini 2015-11-02 20:48:59

In my Scrapy it's def close(self, reason):, not closed.

@El Ruso 2016-01-29 23:14:56

@AminahNuraini In Scrapy 1.0.4 it's def closed(self, reason).

@dm03514 2012-09-12 18:40:11

It looks like you can register a signal listener through dispatcher.

I would try something like:

from scrapy import signals
from scrapy.spiders import CrawlSpider
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # second param is the instance of the spider about to be closed
        pass

@Abe 2012-09-12 18:52:20

Works perfectly. But I'd suggest naming the method MySpider.quit() or something similar, to avoid confusion with the signal name. Thanks!

@Daniel Werner 2012-09-13 19:23:36

Excellent solution. And yes, the example should work exactly the same with a CrawlSpider.

@not2qubit 2014-01-04 20:06:28

This solution also works fine on Scrapy 0.20.0, contrary to what @Chris said below.

@shellbye 2014-12-25 02:44:57

This solution also works fine on Scrapy 0.24.4, contrary to what @Chris said below.

@chishaku 2015-03-09 09:26:25

I'm confused by why the second parameter of spider_closed is necessary. Isn't the spider to be closed self?

@Desprit 2016-09-16 12:14:56

Doesn't work with v1.1 because xlib.pydispatch was deprecated. Instead, they recommend using PyDispatcher. I couldn't make it work yet, though...
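For reference, a hedged sketch of what that might look like with the standalone PyDispatcher package (pip install PyDispatcher), which is the library scrapy.xlib.pydispatch used to bundle; untested, per the comment above:

from pydispatch import dispatcher  # external PyDispatcher package
from scrapy import signals
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # connect to the spider_closed signal via the external dispatcher
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        pass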

@wj127 2017-03-15 14:51:00

Fabulous! This is exactly what I was looking for! And it works perfectly fine! Great input, mate! And thanks :3

@not2qubit 2018-10-02 13:06:59

This still works in Python 3.6.4, with Scrapy 1.5.1 and PyDispatcher 2.0.5, even if you also have a def spider_closed(..) in some pipeline class in your pipelines.py. However, it is also deprecated as shown here, so use the new method as explained by @Levon.
