python - Dynamically change Scrapy Request scheduler priority
I'm using Scrapy to run tests on an internal web app. Once the tests are done, I use a CrawlSpider to check everywhere, and for each response I run an HTML validator and check for 404 media files.
It works except for this: at the end, the crawl fetches things in random order, so the URL that performs a delete operation ends up being executed before the other operations.
I would like to schedule the deletes at the end. I tried many ways, such as this kind of scheduler:
    from scrapy import log

    class DeleteDelayer(object):
        def enqueue_request(self, spider, request):
            # intended to push the delete requests to the end of the queue
            if request.url.find('delete') != -1:
                log.msg("delay %s" % request.url, log.DEBUG)
                request.priority = 50
But it does not work... I see the "delay" messages in the log, but the delete requests are still executed during the crawl.
I thought of using a middleware that piles up the delete URLs in memory and, when the spider_idle signal is fired, puts them back in, but I'm not sure how to do this.
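A minimal sketch of that idea, assuming a spider middleware; the class name DelayDeleteMiddleware and the 'delete' URL test are illustrative, not from the original post:

    from scrapy import Request, signals
    from scrapy.exceptions import DontCloseSpider

    class DelayDeleteMiddleware(object):
        """Spider middleware: hold back 'delete' requests until the spider idles."""

        def __init__(self, crawler):
            self.crawler = crawler
            self.delayed = []
            crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_spider_output(self, response, result, spider):
            for item_or_request in result:
                if isinstance(item_or_request, Request) and 'delete' in item_or_request.url:
                    self.delayed.append(item_or_request)  # hold it back for later
                else:
                    yield item_or_request

        def spider_idle(self, spider):
            # fired when the scheduler runs dry; re-inject the held-back deletes
            if self.delayed:
                for request in self.delayed:
                    # recent Scrapy versions take only the request here
                    self.crawler.engine.crawl(request, spider)
                self.delayed = []
                raise DontCloseSpider  # keep the spider alive to process them

It would also have to be enabled in SPIDER_MIDDLEWARES in the project settings.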
What is the best way to achieve this?
- The default priority of a request is 0, and Scrapy executes requests with a higher priority value first, so setting the priority to 50 will not delay the deletes.
- You can use a middleware to collect the 'delete' requests (insert them into your own queue, e.g. a Redis set) and ignore them (raise an IgnoreRequest exception); see the middleware sketch after this list.
- Start a 2nd crawl with the requests loaded from the queue in step 2, as in the spider sketch below.
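A hedged sketch of the collecting middleware from step 2, assuming the redis package and a local Redis instance; the class name and the pending_delete_urls key are illustrative:

    import redis
    from scrapy.exceptions import IgnoreRequest

    class CollectDeleteMiddleware(object):
        """Downloader middleware: divert 'delete' requests into a Redis set."""

        def __init__(self):
            self.redis = redis.Redis()        # assumes Redis on localhost:6379
            self.key = 'pending_delete_urls'  # arbitrary key name

        def process_request(self, request, spider):
            if 'delete' in request.url:
                self.redis.sadd(self.key, request.url)  # remember for the 2nd crawl
                raise IgnoreRequest("deferred delete: %s" % request.url)
            return None  # let every other request through

This would be enabled via DOWNLOADER_MIDDLEWARES in the project settings.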
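And a sketch of the 2nd crawl from step 3, reading the queued URLs back in start_requests; the spider name and key are again illustrative assumptions:

    import redis
    import scrapy

    class DeleteSpider(scrapy.Spider):
        name = 'delete_spider'

        def start_requests(self):
            r = redis.Redis()
            # smembers returns bytes; decode before building requests
            for url in r.smembers('pending_delete_urls'):
                yield scrapy.Request(url.decode('utf-8'), callback=self.parse)

        def parse(self, response):
            self.logger.info('delete executed: %s', response.url)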