By G Gill


2012-03-13 09:11:31 8 Comments

I want to use scrapy for crawling web pages. Is there a way to pass the start URL from the terminal itself?

The documentation says that either the name of the spider or the URL can be given, but when I give the URL it throws an error:

(The name of my spider is example, but I am giving the URL instead of the spider name. It works fine if I give the spider name.)

scrapy crawl example.com

ERROR:

File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.14.1-py2.7.egg/scrapy/spidermanager.py", line 43, in create raise KeyError("Spider not found: %s" % spider_name) KeyError: 'Spider not found: example.com'

How can I make Scrapy use my spider on the URL given in the terminal?

6 comments

@Mayur Koshti 2015-08-28 12:20:45

You can also try this:

$ scrapy view http://www.sitename.com

It will open the requested URL in a browser window.

@Steven Almeroth 2015-02-16 18:20:53

Sjaak Trekhaak has the right idea and here is how to allow multiples:

import scrapy

class MySpider(scrapy.Spider):
    """
    This spider will try to crawl whatever is passed in `start_urls` which
    should be a comma-separated string of fully qualified URIs.

    Example: start_urls=http://localhost,http://example.com
    """
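    def __init__(self, name=None, **kwargs):
        # Pop start_urls before calling the parent, so the raw string
        # is not set as an attribute and does not overwrite the list.
        if 'start_urls' in kwargs:
            self.start_urls = kwargs.pop('start_urls').split(',')
        super(MySpider, self).__init__(name, **kwargs)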

@glindste 2013-03-08 10:34:25

An even easier way to allow multiple url-arguments than what Peter suggested is by giving them as a string with the urls separated by a comma, like this:

-a start_urls="http://example1.com,http://example2.com"

In the spider you would then simply split the string on ',' and get an array of urls:

self.start_urls = kwargs.get('start_urls').split(',')
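The split itself can be tried outside Scrapy. Below is a minimal sketch, assuming a hypothetical helper named parse_start_urls (it is not part of Scrapy) that also strips whitespace and drops empty entries, which a bare split(',') would not do:

```python
def parse_start_urls(value):
    """Split a comma-separated string of URLs into a clean list.

    `parse_start_urls` is a hypothetical helper, not a Scrapy API.
    """
    if not value:
        # Covers both a missing argument (None) and an empty string.
        return []
    # Strip surrounding whitespace and skip empty pieces such as a trailing comma.
    return [url.strip() for url in value.split(',') if url.strip()]


urls = parse_start_urls("http://example1.com, http://example2.com,")
print(urls)
```

Note that a URL which itself contains a comma would be split in the wrong place, so this only suits URLs without commas.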

@pemistahl 2012-10-05 15:51:05

This extends the approach given by Sjaak Trekhaak in this thread, which so far only works if you provide exactly one URL. If you try to pass more than one URL like this:

-a start_url=http://url1.com,http://url2.com

then Scrapy (I'm using the current stable version 0.14.4) will terminate with the following exception:

error: running 'scrapy crawl' with more than one spider is no longer supported

However, you can circumvent this problem by choosing a different variable for each start url, together with an argument that holds the number of passed urls. Something like this:

-a start_url1=http://url1.com 
-a start_url2=http://url2.com 
-a urls_num=2

You can then do the following in your spider:

class MySpider(BaseSpider):

    name = 'my_spider'    

    def __init__(self, *args, **kwargs): 
        super(MySpider, self).__init__(*args, **kwargs) 

        urls_num = int(kwargs.get('urls_num'))

        start_urls = []
        for i in xrange(1, urls_num + 1):  # inclusive upper bound, so the last URL is not skipped
            start_urls.append(kwargs.get('start_url{0}'.format(i)))

        self.start_urls = start_urls

This is a somewhat ugly hack, but it works. Of course, it is tedious to write out all the command-line arguments for each URL by hand, so it makes sense to wrap the scrapy crawl command in a Python subprocess and generate the arguments in a loop.
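The wrapper idea can be sketched roughly as follows. This is only an illustration under assumptions: build_crawl_command is a hypothetical helper name, the spider is assumed to be called my_spider, and the argument names start_url1, start_url2, ... and urls_num follow the convention used above rather than anything built into Scrapy.

```python
import subprocess


def build_crawl_command(spider_name, urls):
    """Build the argument list for `scrapy crawl` with one -a option per URL.

    Hypothetical helper; the start_urlN / urls_num names follow the
    convention from the answer above and are not Scrapy built-ins.
    """
    cmd = ["scrapy", "crawl", spider_name]
    # Number the URLs from 1 so they match start_url1, start_url2, ...
    for i, url in enumerate(urls, start=1):
        cmd += ["-a", "start_url{0}={1}".format(i, url)]
    # Tell the spider how many start_urlN arguments to read.
    cmd += ["-a", "urls_num={0}".format(len(urls))]
    return cmd


cmd = build_crawl_command("my_spider", ["http://url1.com", "http://url2.com"])
print(" ".join(cmd))
# To actually start the crawl, run this from inside the Scrapy project:
# subprocess.call(cmd)
```

Passing the command as a list rather than a single shell string avoids quoting problems with characters such as & or ? in the URLs.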

Hope it helps. :)

@mmv-ru 2015-11-02 22:11:59

If I call Scrapy 0.24.4 like this: scrapy crawl MySpider -a start_urls=http://example.com/ -o - -t json then everything works well. Initially I put the options between -o and - and got the same error as you.

@Subhash 2012-03-15 11:49:45

Use the scrapy parse command. It parses a URL with your spider, and the URL is passed on the command line.

$ scrapy parse http://www.example.com/ --spider=spider-name

http://doc.scrapy.org/en/latest/topics/commands.html#parse

@dan3 2013-02-24 07:28:16

Unfortunately, scrapy parse doesn't seem to have options to save results to a file (in various formats) like scrapy crawl does.

@jeffjv 2016-04-20 00:10:29

If you just want to debug why your spider is failing on a particular URL, this is an easy option.

@Citricguy 2017-05-22 12:34:26

Can't save/export to file easily. Otherwise this would have been perfect.

@Sjaak Trekhaak 2012-03-13 11:00:35

I'm not really sure about the commandline option. However, you could write your spider like this.

class MySpider(BaseSpider):

    name = 'my_spider'    

    def __init__(self, *args, **kwargs): 
      super(MySpider, self).__init__(*args, **kwargs) 

      self.start_urls = [kwargs.get('start_url')] 

And start it like: scrapy crawl my_spider -a start_url="http://some_url"
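One caveat with the snippet above: if -a start_url=... is omitted, kwargs.get('start_url') returns None and start_urls becomes [None]. Here is a minimal sketch of a guard, assuming a hypothetical helper named single_start_url that the spider's __init__ could call instead of building the list inline:

```python
def single_start_url(kwargs):
    """Return a one-element start_urls list from spider keyword arguments.

    Hypothetical helper, not a Scrapy API; it raises early when the
    start_url argument was not passed with -a.
    """
    url = kwargs.get('start_url')
    if not url:
        raise ValueError("pass the URL with: -a start_url=http://...")
    return [url]


print(single_start_url({'start_url': 'http://some_url'}))
```

In the spider this would replace the last line of __init__ with self.start_urls = single_start_url(kwargs).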

@G Gill 2012-03-13 11:26:07

Thank you so much, this is exactly what I was looking for. It worked fine for me :)

@pemistahl 2012-10-05 16:11:23

This approach only works for exactly one url. If you want to provide more than one url, see my approach in this thread.

@Steven Almeroth 2015-02-16 18:20:27

For multiple URLs: self.start_urls = kwargs.pop('start_urls').split(',') which is run before the super().
