By Saw


2018-07-14 09:41:28 8 Comments

I've used CrawlSpider successfully before. But after I changed the code to integrate with Redis and added my own middlewares to set the User-Agent and cookies, the spider no longer parses the responses. It therefore generates no new requests and closes soon after starting.

Here are the running stats.

Even if I add

def parse_start_url(self, response):
    return self.parse_item(response)

it only parses the response from the first URL (parse_start_url is only called for the start_urls responses; pages extracted by the rules should go through the rule's callback).

Here's my code.

Spider:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from yydzh.items import YydzhItem
from scrapy.spiders import Rule, CrawlSpider


class YydzhSpider(CrawlSpider):
    name = 'yydzhSpider'
    allowed_domains = ['yydzh.com']
    start_urls = ['http://www.yydzh.com/thread.php?fid=198']
    rules = (
        Rule(
            LinkExtractor(
                allow=r'thread\.php\?fid=198&page=([1-9]|1[0-9])#s',
                restrict_xpaths="//div[@class='pages']",
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    # def parse_start_url(self, response):
    #     return self.parse_item(response)

    def parse_item(self, response):
        for each in response.xpath(
            "//*[@id='ajaxtable']//tr[@class='tr2'][last()]"
            "/following-sibling::tr[@class!='tr2']"
        ):
            item = YydzhItem()  # create a fresh item per row instead of reusing one
            item['title'] = each.xpath("./td[2]/h3[1]/a//text()").extract_first()
            item['author'] = each.xpath("./td[3]/a//text()").extract_first()
            item['category'] = each.xpath("./td[2]/span[1]//text()").extract_first()
            item['url'] = each.xpath("./td[2]/h3[1]//a/@href").extract_first()
            yield item

Settings I think are crucial:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
DOWNLOADER_MIDDLEWARES = {
    'yydzh.middlewares.UserAgentmiddleware': 500,
    'yydzh.middlewares.CookieMiddleware': 600,
}
COOKIES_ENABLED = True
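
For reference, scrapy-redis also reads a couple of persistence-related settings that are not set above, so the defaults apply. A sketch of the relevant knobs (the names are real scrapy-redis settings; the values shown are the defaults):

# scrapy-redis persistence settings (not set in my project, so defaults apply)
SCHEDULER_PERSIST = False         # True keeps the Redis queue/dupefilter between runs
SCHEDULER_FLUSH_ON_START = False  # True flushes them every time the spider starts

These control whether the request queue and the duplicate filter in Redis survive between runs, which turns out to matter here (see the answer below).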

Middlewares: UserAgentmiddleware changes the user agent randomly so the requests are less likely to be flagged by the server.

CookieMiddleware attaches cookies to requests for pages that require a login to view.

import json
import logging
import random

import redis
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

# `agents`, REDIS_HOST / REDIS_PORT / REDIS_PASS, and the init_cookie /
# remove_cookie / update_cookie helpers are defined elsewhere in the project.

logger = logging.getLogger(__name__)


class UserAgentmiddleware(UserAgentMiddleware):

    def process_request(self, request, spider):
        # pick a random User-Agent from the pool for every request
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent


class CookieMiddleware(RetryMiddleware):

    def __init__(self, settings, crawler):
        RetryMiddleware.__init__(self, settings)
        self.rconn = redis.Redis(host=REDIS_HOST, port=REDIS_PORT,
                                 password=REDIS_PASS, db=1, decode_responses=True)
        init_cookie(self.rconn, crawler.spider.name)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_request(self, request, spider):
        # attach a randomly chosen stored cookie set for this spider
        redisKeys = self.rconn.keys()
        while len(redisKeys) > 0:
            elem = random.choice(redisKeys)
            if spider.name + ':Cookies' in elem:
                cookie = json.loads(self.rconn.get(elem))
                request.cookies = cookie
                request.meta["accountText"] = elem.split("Cookies:")[-1]
                break
            else:
                redisKeys.remove(elem)

    def process_response(self, request, response, spider):
        # the site returns this message when the session is not logged in
        # ("You are not logged in or you do not have permission to view this page")
        if '您没有登录或者您没有权限访问此页面' in response.text:
            accountText = request.meta["accountText"]
            remove_cookie(self.rconn, spider.name, accountText)
            update_cookie(self.rconn, spider.name, accountText)
            logger.warning("Cookie refreshed successfully! (account: %s)" % accountText)
            return request

        return response
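
The init_cookie / remove_cookie / update_cookie helpers are not shown in the question. A minimal, hypothetical sketch of what they presumably do, inferred only from how the middleware calls them, assuming cookies are stored as JSON under keys like '<spider>:Cookies:<account>' (the pattern the middleware matches on); ACCOUNTS and login_and_get_cookies are assumed names:

import json

def init_cookie(rconn, spider_name):
    # ensure a '<spider>:Cookies:<account>' entry exists for every account
    for account, password in ACCOUNTS.items():  # ACCOUNTS: assumed credential map
        key = "%s:Cookies:%s" % (spider_name, account)
        if rconn.get(key) is None:
            cookies = login_and_get_cookies(account, password)  # assumed helper
            rconn.set(key, json.dumps(cookies))

def remove_cookie(rconn, spider_name, account):
    # drop the stale cookie set for this account
    rconn.delete("%s:Cookies:%s" % (spider_name, account))

def update_cookie(rconn, spider_name, account):
    # log in again and store a fresh cookie set
    cookies = login_and_get_cookies(account, ACCOUNTS[account])
    rconn.set("%s:Cookies:%s" % (spider_name, account), json.dumps(cookies))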

1 comment

@user10084120 2018-07-15 12:40:35

Found the problem: all the URLs had already been recorded by the Redis dupefilter during previous runs, so every new request was being filtered as a duplicate. Restarting (flushing) the Redis server solves it.
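
A quicker way to clear the stale state without restarting Redis, assuming the default scrapy-redis key names ('<spider>:dupefilter' and '<spider>:requests'); adjust host/port/db to your setup:

# Clear the scrapy-redis state for this spider so the crawl can start over.
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
r.delete('yydzhSpider:dupefilter')  # fingerprints of already-seen requests
r.delete('yydzhSpider:requests')    # pending request queue, if persisted

Alternatively, setting SCHEDULER_FLUSH_ON_START = True (see the settings note above) flushes these keys automatically every time the spider starts.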
