By Coding_Rabbit


2016-04-03 10:25:38 8 Comments

I'm using Scrapy to get data and I want to use the Flask web framework to show the results in a web page. But I don't know how to call the spiders in the Flask app. I've tried to use CrawlerProcess to call my spiders, but I got an error like this:

ValueError
ValueError: signal only works in main thread

Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
process = CrawlerProcess()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
install_shutdown_handlers(self._signal_shutdown)
File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
reactor._handleSignals()
File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
_SignalReactorMixin._handleSignals(self)
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread

My Scrapy code looks like this:

class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()

class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]

    db = DB_Con()
    collection = db.getcollection(name, term)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            self.collection.update({"genID":item['genID']}, dict(item),  upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

My Flask code looks like this:

@app.route('/', methods=['GET', 'POST'])
def index():
    process = CrawlerProcess()
    process.crawl(EPGD_spider)
    return redirect(url_for('details'))


@app.route('/details', methods = ['GET'])
def epgd():
    if request.method == 'GET':
        results = db['EPGD_test'].find()
        json_results= []
        for result in results:
            json_results.append(result)
        return toJson(json_results)

How can I call my Scrapy spiders when using the Flask web framework?

2 Answers

@Pawel Miech 2016-05-17 08:04:17

Adding an HTTP server in front of your spiders is not that easy. There are a couple of options.

1. Python subprocess

If you are really limited to Flask and can't use anything else, the only way to integrate Scrapy with Flask is to launch an external process for every spider crawl, as the other answer recommends (note that your subprocess needs to be spawned in the proper Scrapy project directory).

The directory structure for all examples should look like this (I'm using the dirbot test project):

> tree -L 1                                                                                                                                                              

├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py

Here's a code sample that launches Scrapy in a new process:

# server.py
import subprocess

from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    """
    Run the spider in another process and store the items in a file. Simply issue the command:

    > scrapy crawl dmoz -o "output.json"

    wait for this command to finish, and then return the contents of output.json to the client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()

if __name__ == '__main__':
    app.run(debug=True)

Save the above as server.py and visit localhost:5000; you should be able to see the scraped items.

2. Twisted-Klein + Scrapy

Another, better way is to use an existing project that integrates Twisted with Werkzeug and exposes an API similar to Flask, e.g. Twisted-Klein. Twisted-Klein allows you to run your spiders asynchronously in the same process as your web server. It's better in that it won't block on every request, and it allows you to simply return Scrapy/Twisted Deferreds from the HTTP route handler.

The following snippet integrates Twisted-Klein with Scrapy. Note that you need to create your own subclass of CrawlerRunner so that the crawler collects the items and returns them to the caller. This option is a bit more advanced: you're running the Scrapy spiders in the same process as the Python server, and the items are kept in memory rather than written to a file (so there is no disk writing/reading as in the previous example). Most importantly, it's asynchronous and it all runs in one Twisted reactor.

# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (same as in the base CrawlerRunner)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done, call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)

Save the above in a file called server.py, place it in your Scrapy project directory, and open localhost:8080; it will launch the dmoz spider and return the scraped items as JSON to the browser.

3. ScrapyRT

Some problems arise when you try to put an HTTP app in front of your spiders. For example, you sometimes need to handle spider logs (you may need them in some cases), you need to handle spider exceptions somehow, etc. There are projects that let you add an HTTP API to your spiders more easily, e.g. ScrapyRT. This is an app that adds an HTTP server to your Scrapy spiders and handles all of those problems for you (e.g. logging, spider errors, etc.).

So after installing ScrapyRT you only need to run:

> scrapyrt 

in your Scrapy project directory, and it will launch an HTTP server listening for requests. You can then visit http://localhost:9080/crawl.json?spider_name=dmoz&url=http://alfa.com and it should launch your spider, crawling the given url.
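For example, the Flask app from the question could simply forward requests to a running ScrapyRT instance instead of starting a crawler itself. The snippet below is only a minimal sketch of that idea, not official ScrapyRT usage: it assumes scrapyrt is already running in the project directory on its default port 9080, that the requests library is installed, and that ScrapyRT's JSON response carries the scraped items under an "items" key.

# flask_scrapyrt.py - hypothetical glue between Flask and a running ScrapyRT instance
import requests
from flask import Flask, jsonify

app = Flask(__name__)

SCRAPYRT_ENDPOINT = "http://localhost:9080/crawl.json"  # assumed default ScrapyRT port

@app.route('/crawl')
def crawl():
    # Ask ScrapyRT to run the spider and wait for its JSON response
    params = {"spider_name": "dmoz", "url": "http://alfa.com"}
    resp = requests.get(SCRAPYRT_ENDPOINT, params=params)
    resp.raise_for_status()
    # Pass the scraped items back to the browser as JSON
    return jsonify(items=resp.json().get("items", []))

if __name__ == '__main__':
    app.run(debug=True)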

Disclaimer: I'm one of the authors of ScrapyRT.

@Coding_Rabbit 2016-05-17 09:55:19

Thanks for your detailed reply! I've solved this problem with the method from the Scrapy docs, runner.crawl. But there is still a warning that sometimes (not every time) occurs: ReactorNotRestartable. I'm still working on this. I'll try your method; maybe it will solve the sporadic problem!
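For reference, the pattern from the Scrapy docs that this comment refers to looks roughly like the sketch below (run as a standalone script, using the question's EPGD_spider). The Twisted reactor can only be started once per process, so triggering this again from a later Flask request raises the ReactorNotRestartable error mentioned above.

# Sketch of the CrawlerRunner pattern from the Scrapy docs, as a standalone script
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# EPGD_spider is the spider class from the question (import it from wherever it is defined)
configure_logging()
runner = CrawlerRunner()
d = runner.crawl(EPGD_spider)         # returns a Deferred that fires when the crawl finishes
d.addBoth(lambda _: reactor.stop())   # stop the reactor once the spider is done
reactor.run()                         # blocks; the reactor cannot be started again afterwards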

@nirvana-msu 2017-10-21 01:27:22

Could you please elaborate on the reasoning behind "only way to integrate Scrapy with Flask is by launching external process for every spider crawl"? Being a newbie to both Flask and Scrapy, the first naive approach that comes to mind is to just start CrawlerProcess within the same thread handling the HTTP request. What issues could that lead to, and why is it a bad idea? Is it documented somewhere? Thanks!

@nirvana-msu 2017-10-21 01:37:31

Is the only problem that it will block the process (which can be solved by just having more workers serving requests for a web server - assuming of course you don't need to run too many spiders simultaneously)? Or will it actually fail to run for whatever reason?

@pgwalsh 2016-04-04 13:41:17

This only works if you're using a crawler in a self-contained manner. How about using the subprocess module with subprocess.call()?

I changed your spider in the following manner and it worked. I don't have the database set up, so those lines have been commented out.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.selector import Selector
from scrapy import Request


class EPGD(scrapy.Item):
    genID = scrapy.Field()
    genID_url = scrapy.Field()
    taxID = scrapy.Field()
    taxID_url = scrapy.Field()
    familyID = scrapy.Field()
    familyID_url = scrapy.Field()
    chromosome = scrapy.Field()
    symbol = scrapy.Field()
    description = scrapy.Field()

class EPGD_spider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]


    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            #self.collection.update({"genID":item['genID']}, dict(item),  upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"



process = CrawlerProcess()
process.crawl(EPGD_spider)
process.start()

You should be able to run the above with:

subprocess.check_output(['scrapy', 'runspider', "epgd.py"])

@Coding_Rabbit 2016-04-05 02:02:05

Do you have any documents or examples that can show me how to do this?

@pgwalsh 2016-04-05 14:34:46

subprocess.call(["scrapy", "crawl", "your_crawler_name"]) I wouldn't have the process in your crawler template or spider.

@Coding_Rabbit 2016-04-06 00:49:47

When I tried to use the subprocess method, I got an error like this: Scrapy 1.0.5 - no active project Unknown command: crawl Use "scrapy" to see available commands. I know this is because I didn't run the crawl in the right folder or path, but I don't know how to switch to the right one. How do I deal with this?
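One way around the "no active project" error is to make the subprocess run from the Scrapy project directory (the folder containing scrapy.cfg). A minimal sketch, assuming the project lives at the placeholder path /path/to/scrapy_project; the cwd argument of subprocess controls where the scrapy command is executed:

import subprocess

# Run `scrapy crawl EPGD` from the project directory instead of wherever
# the Flask app was started; /path/to/scrapy_project is a placeholder.
project_dir = "/path/to/scrapy_project"
subprocess.check_output(
    ["scrapy", "crawl", "EPGD", "-o", "output.json"],
    cwd=project_dir,
)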

@pgwalsh 2016-04-06 13:13:33

My apologies, I wasn't paying attention and realized you weren't using the crawler in a self-contained manner. What are your imports? I was able to get your spider working, but I had to change several settings.
