By Eugene Nagorny


2013-02-08 17:18:45 8 Comments

I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run Scrapy from my own script. Here is part of it:

    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
    print "It can't be printed out!"

It works as it should: it visits pages, scrapes the needed info and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can see that by the item count in the output JSON), execution of my script doesn't resume. It's probably not a Scrapy problem; the answer should be somewhere in Twisted's reactor. How can I release the thread execution?

2 Answers

@Medeiros 2013-09-27 21:39:34

In scrapy 0.19.x you should do this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal is sent

Note these lines:

settings = get_project_settings()
crawler = Crawler(settings)

Without them your spider won't use your settings and will not save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.
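For instance, if your feed export is configured in the project's settings.py, get_project_settings() is what pulls those values in. As a rough sketch of the equivalent explicit form (assuming the 0.16/0.19-era Settings class, which accepts a plain dict of values; the file name items.json is just an example):

    from scrapy.settings import Settings
    from scrapy.crawler import Crawler

    # Sketch: pass the feed settings explicitly instead of reading settings.py.
    # FEED_URI and FEED_FORMAT are the settings behind the -o/-t options.
    settings = Settings({'FEED_URI': 'items.json', 'FEED_FORMAT': 'json'})
    crawler = Crawler(settings)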

One more way to do it is to just call the command directly from your script:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  # followall is the spider's name
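If you take the cmdline route, the usual command-line options can go straight into the string; note that cmdline.execute() hands control to Scrapy and exits the process when the crawl ends, so it should be the last thing your script does. A sketch (output.json is an illustrative file name):

    from scrapy import cmdline

    # Equivalent to running "scrapy crawl followall -o output.json -t json"
    # in a shell; execute() calls sys.exit() internally and does not return.
    cmdline.execute("scrapy crawl followall -o output.json -t json".split())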

@Steven Almeroth 2013-02-10 20:59:35

You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:

from twisted.internet import reactor

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher

from testspiders.spiders.followall import FollowAllSpider

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')

And the command line log output might look something like:

/srv/scrapy/testspiders$ ./api
2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 23934,...}
2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
/srv/scrapy/testspiders$

@Eugene Nagorny 2013-02-13 11:59:29

It should definitely be described in the documentation. Thanks.

@Sjaak Trekhaak 2013-06-26 06:54:41

I've submitted a pull request to the Scrapy documentation describing how to stop the reactor; it should be in soon :)

@Medeiros 2013-09-25 21:44:16

When running Scrapy from a script like that, how do I pass arguments to Scrapy, like -o output.json -t json?

@Steven Almeroth 2013-09-26 00:29:54

see argparse
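The suggestion here is to parse your own script's arguments with argparse and map them onto the corresponding Scrapy settings yourself. A minimal sketch (the flag names and defaults are illustrative; FEED_URI/FEED_FORMAT are the settings behind -o/-t in Scrapy of that era):

    import argparse

    from scrapy.settings import Settings

    # Mimic scrapy's -o/-t options in your own script, then build the
    # Settings object the Crawler will be constructed with.
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', default='output.json')
    parser.add_argument('-t', '--format', default='json')
    args = parser.parse_args()

    settings = Settings({'FEED_URI': args.output, 'FEED_FORMAT': args.format})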

@William Kinaan 2014-02-09 18:06:16

where should I put that script please?

@Marco Dinatsoli 2014-08-17 21:19:43

could you help me here please? stackoverflow.com/questions/25353650/…

@arcolife 2014-09-05 07:42:11

Instead of defining an additional stop_reactor function, this works: crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
