python - Re-feed some already crawled URLs to the spider/scheduler
A URL (domain.com/list) lists 10 links that need to be crawled periodically. These links change roughly every 30 seconds, so I need to re-crawl domain.com/list to check for new links. Crawling the links takes more than 30 seconds because of their size, so I cannot cron the script every 30 seconds since I would end up with several concurrent spiders. Missing some links because the spider takes too long during the first run is an acceptable situation, though.
I wrote a spider middleware to remove already-visited links (for the cases where the links change only partially). In process_spider_output I tried to include a new request for domain.com/list with dont_filter=True so the list gets fed to the scheduler again, but I end up with tons of requests. The code is:
    from collections import deque
    from urllib import unquote
    from scrapy.http import Request

    def process_spider_output(self, response, result, spider):
        for i in result:
            if isinstance(i, Request):
                state = spider.state.get('crawled_links', deque([]))
                if unquote(i.url) in state or i.url in state:
                    print "removed %s" % i.url
                    continue
            yield i
        yield spider.make_requests_from_url('http://domain.com/list')
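Stripped of the Scrapy plumbing, the deduplication in that middleware amounts to comparing each URL, both raw and percent-decoded, against the remembered state. A minimal framework-free sketch of that logic (the `filter_new_links` helper and the sample URLs are hypothetical):

```python
from collections import deque

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote        # Python 2

def filter_new_links(urls, state):
    # keep a URL only if neither its raw nor its decoded form was seen before
    for url in urls:
        if unquote(url) in state or url in state:
            continue          # already crawled; drop it
        state.append(url)     # remember it for the next pass
        yield url

seen = deque(['http://domain.com/a'])
new = list(filter_new_links(['http://domain.com/a', 'http://domain.com/c'], seen))
# 'http://domain.com/a' is filtered out; only the unseen link survives
```

Note that the state only grows here, just like in the middleware above; for links that rotate every 30 seconds it may need pruning eventually.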
This seems pretty ugly, and I am not sure it works as intended. I also tried hooking the spider idle and closed signals to try to re-crawl the site, without success.
What's the best way to re-crawl specific URLs to monitor changes that occur often, without closing the spider in use?
Thanks in advance.
"Crawling the links takes more than 30 seconds because of their size, so I cannot cron the script every 30 seconds since I would end up with several concurrent spiders."
There's a common practice of using a file containing the process PID as a mutual-exclusion lock: bail out if the file exists and the process it names is still running. If you put your spidering code into a program with this sort of structure...
    import sys
    import os

    pidfile = '/tmp/mycrawler.pid'

    def do_the_thing():
        # <your spider code here>
        pass

    def main():
        # check if we're already running
        if os.path.exists(pidfile):
            pid = int(open(pidfile, 'r').read())
            try:
                os.kill(pid, 0)
                print "we're already running as pid %d" % pid
                sys.exit(1)
            except OSError:
                pass

        # write the pid file
        open(pidfile, 'w').write(str(os.getpid()))

        # do the thing, ensuring we delete the pid file when done
        try:
            do_the_thing()
        finally:
            os.unlink(pidfile)

    if __name__ == '__main__':
        main()
...then you can run it from cron as often as you like, and it'll simply bail out until the last instance has finished, so only one spider runs at a time.