By no1


2011-01-17 06:17:01 8 Comments

How do you utilize proxy support with the Python web-scraping framework Scrapy?

7 Answers

@Amom 2013-12-16 10:25:22

Single Proxy

  1. Enable HttpProxyMiddleware in your settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
    
  2. Pass the proxy to the request via request.meta (use a full URL, including the scheme):

    request = Request(url="http://example.com")
    request.meta['proxy'] = "http://host:port"
    yield request
    

You can also pick a proxy address at random if you have a pool of addresses, like this:

Multiple Proxies

import random

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # pool of full proxy URLs, e.g. 'http://host:port'
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req

@Rafael T 2014-12-22 20:16:41

The documentation says that HttpProxyMiddleware sets the proxy inside every request's meta attribute, so enabling the proxy middleware AND setting it manually would make no sense

@Thamme Gowda 2017-07-21 03:48:12

I should have copied this code. I glanced at it and then coded it myself, but the proxy functionality was not working. Now I see the proxy value was set on request.headers instead of request.meta. Stupid me (facepalm)! I went to look at the HttpProxyMiddleware code: it skips the request if someone has already set request.meta['proxy'], so there is no need to list it in the settings. github.com/scrapy/scrapy/blob/master/scrapy/…

@Shahryar Saljoughi 2015-04-18 10:46:02

1. Create a new file called “middlewares.py” in your Scrapy project and add the following code to it:

import base64
class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # setup basic authentication for the proxy
        encoded_user_pass = base64.encodestring(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2. Open your project’s settings file (./project_name/settings.py) and add the following code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now your requests should be passed through this proxy. Simple, isn’t it?

@ccdpowell 2015-05-07 01:09:38

I implemented your solution, which looks correct, but I keep getting a Twisted error: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]. Any advice?

@Greg Sadetsky 2016-02-28 03:03:22

Take care to use base64.b64encode instead of base64.encodestring, as the latter adds a newline character to the encoded base64 result! See stackoverflow.com/a/32243566/426790
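For reference, here is a minimal sketch of the corrected header construction; the credentials are placeholders and the helper name is hypothetical, and b64encode works on bytes, so the value is encoded and then decoded back to text:

import base64

def proxy_auth_header(user="USERNAME", password="PASSWORD"):
    # base64.b64encode, unlike the old encodestring, adds no trailing newline
    creds = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8")).decode("ascii")
    return "Basic " + creds

# inside a downloader middleware's process_request:
# request.headers['Proxy-Authorization'] = proxy_auth_header()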

@Ekrem Gurdal 2018-07-06 07:37:48

How can we change the proxy after every 20 requests so as not to be banned?

@ephemient 2011-01-17 06:29:08

From the Scrapy FAQ,

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy for visiting HTTPS sites, set the environment variable https_proxy as follows:

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port

@no1 2011-01-17 11:59:19

Thanks ... So I need to set this var before running the Scrapy crawler; is it not possible to set it or change it from the crawler code?

@Pablo Hoffman 2011-01-25 19:35:58

You can even set the proxy on a per-request basis with: request.meta['proxy'] = 'your.proxy.address'
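A minimal sketch of what that per-request form might look like inside a spider (the spider name, URL, and proxy address below are placeholders):

import scrapy

class PerRequestProxySpider(scrapy.Spider):
    name = "per_request_proxy"

    def start_requests(self):
        # each request can carry its own proxy in its meta dict
        yield scrapy.Request(
            "http://example.com",
            meta={"proxy": "http://your.proxy.address:port"},  # placeholder proxy URL
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)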

@Lionel 2011-11-20 16:59:40

How do you authenticate the proxy?

@ocean800 2017-06-19 22:58:14

@ephemient How can we tell if scrapy is using the proxy?

@Shannon Cole 2018-06-24 12:53:50

@ocean800 I use scrapy to scrape a website that shows your current IP to see whether it's using the proxy. That way I can load the page in Chrome to see my actual IP and compare it to what scrapy sees on the same page.
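A rough sketch of that kind of check, assuming a service such as httpbin.org/ip that echoes back the requesting IP (the spider name and proxy address are placeholders):

import scrapy

class IpCheckSpider(scrapy.Spider):
    name = "ip_check"

    def start_requests(self):
        yield scrapy.Request(
            "https://httpbin.org/ip",
            meta={"proxy": "http://host:port"},  # placeholder proxy
        )

    def parse(self, response):
        # if the proxy is in use, this logs the proxy's IP, not yours
        self.logger.info("Origin IP reported: %s", response.text)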

@Niranjan Sagar 2015-12-01 01:58:10

There is a nice middleware someone has already written for this: https://github.com/aivarsk/scrapy-proxies ("Scrapy proxy middleware")
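Roughly, its configuration goes into settings.py; the setting names and priorities below are recalled from that project's README, so treat them as assumptions and check the repository for the exact keys:

# settings.py -- names as recalled from the scrapy-proxies README (verify before use)
RETRY_TIMES = 10  # proxies fail often, so retry more than the default

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# one proxy URL per line, e.g. http://user:pass@host:port
PROXY_LIST = '/path/to/proxy/list.txt'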

@pinkvoid 2015-11-18 07:58:32

As I've had trouble setting the environment variable in /etc/environment, here is what I've put in my spider (Python):

os.environ["http_proxy"] = "http://localhost:12345"
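A minimal sketch of where that line could live, assuming it runs before the crawl starts so that HttpProxyMiddleware sees the variable when it is initialized (the spider name, URL, and proxy address are placeholders):

import os
import scrapy

# set the proxy before the crawler builds its downloader middlewares
os.environ["http_proxy"] = "http://localhost:12345"

class EnvProxySpider(scrapy.Spider):
    name = "env_proxy"
    start_urls = ["http://example.com"]

    def parse(self, response):
        self.logger.info("Fetched %s via the proxy from the environment", response.url)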

@SIM 2018-04-22 19:42:20

Best approach I've found so far.

@Andrea Ianni ௫ 2015-10-27 13:20:01

In Windows I put together a couple of the previous answers and it worked. I simply did:

C:\> set http_proxy=http://username:password@proxy:port

and then I launched my spider:

C:/.../RightFolder> scrapy crawl dmoz

where "dmoz" is the spider name (I'm writing it because it's the one you find in the tutorial on the internet, and if you're here you've probably started from that tutorial).

@laurent alsina 2013-01-18 14:58:29

that would be:

export http_proxy=http://user:password@proxy:port

@Allan Ruin 2014-03-30 15:41:59

I used this, yet I just received [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]

@Andrea Ianni ௫ 2015-10-27 15:26:04

In Windows: "set http_proxy=user:password@proxy:port";
