python - Re-feed some already crawled URLs to the spider/scheduler


There is a URL (domain.com/list) that lists 10 links I need to crawl periodically. These links change roughly every 30 seconds, so I need to re-crawl domain.com/list to check for new links. Crawling the links takes more than 30 seconds because of their size, so I cannot cron the script every 30 seconds, since I would end up with several concurrent spiders. Missing some links because the spider takes too long during the first run is an acceptable situation, though.

I wrote a spider middleware to remove already-visited links (for the cases where the links change only partially). I tried to include in process_spider_output a new request for domain.com/list with dont_filter=True, so the list gets fed to the scheduler again, but I end up with tons of requests. The code is:

from collections import deque
from urllib import unquote

from scrapy.http import Request


def process_spider_output(self, response, result, spider):
    for i in result:
        if isinstance(i, Request):
            state = spider.state.get('crawled_links', deque([]))
            if unquote(i.url) in state or i.url in state:
                print "removed %s" % i.url
                continue
        yield i
    yield spider.make_requests_from_url('http://domain.com/list')

This seems pretty ugly and I'm not sure whether it works as intended. I also tried hooking the spider_idle and spider_closed signals to try to re-crawl the site, without success.

What's the best way to re-crawl specific URLs to monitor changes that occur often, without closing the spider in use?

Thanks in advance.

Crawling the links takes more than 30 seconds because of their size, so I cannot cron the script every 30 seconds, since I would end up with several concurrent spiders.

There's a common practice of using a file containing the process PID as a mutual-exclusion lock, and bailing out if the file exists and the process is still running. If you put your spidering code into a program with this sort of structure...

import sys
import os

pidfile = '/tmp/mycrawler.pid'


def do_the_thing():
    # <your spider code here>
    pass


def main():
    # check if we're already running
    if os.path.exists(pidfile):
        pid = int(open(pidfile, 'r').read())
        try:
            os.kill(pid, 0)
            print "we're already running as pid %d" % pid
            sys.exit(1)
        except OSError:
            pass

    # write the pid file
    open(pidfile, 'w').write(str(os.getpid()))

    # do the thing, ensuring we delete the pid file when done
    try:
        do_the_thing()
    finally:
        os.unlink(pidfile)


if __name__ == '__main__':
    main()

...then you can run it from cron as often as you like, and it'll wait until the last instance has finished before running again.
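The trick that makes the lock work is that os.kill(pid, 0) delivers no signal at all; it only checks whether the process exists, raising OSError if it doesn't. A minimal standalone sketch of that liveness check (the function name is illustrative, not part of the answer above):

```python
import os


def is_running(pid):
    """Return True if a process with the given PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only probes for existence
        return True
    except OSError:
        # Raised when no such process exists. Note: a permissions error
        # (EPERM) also lands here, but for a pid file written by the same
        # user that case doesn't arise.
        return False


print(is_running(os.getpid()))  # -> True: the current process always exists
```

If the stale pid has since been reused by an unrelated process, the check will report it as running; for a crawler that exits cleanly and removes its pid file in a finally block, that window is small.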

