python - Dynamically change Scrapy Request scheduler priority


I'm using Scrapy to perform tests on an internal web app. Once the tests are done, I use a CrawlSpider to check everywhere, and for each response I run an HTML validator and check for 404 media files.

It works, except for this: at the end of the crawl, the requests come back in random order, so the URLs that perform a delete operation are being executed before the other operations.

I would like to schedule all the deletes at the end. I tried many ways, such as this kind of scheduler:

    from scrapy import log

    class DeleteDelayer(object):
        def enqueue_request(self, spider, request):
            # Try to push 'delete' URLs to the back by raising their priority
            if request.url.find('delete') != -1:
                log.msg("delay %s" % request.url, log.DEBUG)
                request.priority = 50

But it does not work: I see the deletes being "delayed" in the log, yet they are still executed during the crawl.

I thought of using a middleware that piles up the delete URLs in memory and, when the spider_idle signal is fired, puts them back in, but I'm not sure how to do this.
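What I have in mind is something like the following rough sketch, which I have not verified. The name DelayDeletesMiddleware is my own, and crawler.engine.crawl(request, spider) is the signature available in the Scrapy versions that still ship scrapy.log:

    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class DelayDeletesMiddleware(object):
        """Spider middleware: hold back 'delete' requests and only feed
        them to the engine once the spider has run out of other work."""

        def __init__(self, crawler):
            self.crawler = crawler
            self.delayed = []
            crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_spider_output(self, response, result, spider):
            for request_or_item in result:
                if hasattr(request_or_item, 'url') and 'delete' in request_or_item.url:
                    # Pile the delete request up in memory instead of yielding it
                    self.delayed.append(request_or_item)
                else:
                    yield request_or_item

        def spider_idle(self, spider):
            # The crawl has drained: now schedule the deferred deletes
            if self.delayed:
                for request in self.delayed:
                    self.crawler.engine.crawl(request, spider)
                self.delayed = []
                # Prevent the spider from closing before the deletes run
                raise DontCloseSpider()

It would be enabled under SPIDER_MIDDLEWARES in the settings; raising DontCloseSpider in the spider_idle handler should be what keeps the spider alive long enough for the re-injected requests to run.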

What is the best way to achieve this?

  1. The default priority of a request is 0, and setting the priority to 50 will not delay it: in Scrapy, requests with a higher priority value are executed earlier, not later.
  2. You can use a downloader middleware to collect the 'delete' requests (insert them into your own queue, e.g. a Redis set) and ignore them (raise an IgnoreRequest exception).
  3. Start a second crawl with the requests loaded from the queue in step 2, as sketched below.
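For example, a rough sketch of steps 2 and 3 against a modern Scrapy, assuming a local Redis server and the redis-py client; the middleware and spider names are made up for illustration:

    import redis
    import scrapy
    from scrapy.exceptions import IgnoreRequest

    class CollectDeletesMiddleware(object):
        """Downloader middleware: divert 'delete' requests into a Redis
        set instead of downloading them (steps 1 and 2)."""

        def __init__(self):
            self.redis = redis.Redis()  # assumes Redis on localhost:6379

        def process_request(self, request, spider):
            if 'delete' in request.url:
                self.redis.sadd('delete_urls', request.url)
                raise IgnoreRequest("deferred to second crawl: %s" % request.url)
            # Returning None lets every other request proceed normally


    class DeleteSpider(scrapy.Spider):
        """Second crawl (step 3): replay the collected delete URLs."""
        name = 'delete_spider'

        def start_requests(self):
            client = redis.Redis()
            for url in client.smembers('delete_urls'):
                yield scrapy.Request(url.decode('utf-8'), callback=self.parse)

        def parse(self, response):
            self.logger.info("deleted: %s", response.url)

Enable CollectDeletesMiddleware under DOWNLOADER_MIDDLEWARES for the first crawl, then run the second one afterwards (e.g. scrapy crawl delete_spider).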
