Scrapy multiple Rules for SgmlLinkExtractor don't work

August 15, 2010

I want to crawl an entire site and extract links conditionally.

As suggested in a linked post, I tried using multiple rules, but it doesn't work: Scrapy doesn't crawl the pages.

I tried the code below, but it doesn't scrape the details.

    # Import paths as used by Scrapy releases of that era.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    # BusinesslistItem comes from the project's items module (import not shown in the question).

    class BusinesslistSpider(CrawlSpider):
        name = 'businesslist'
        allowed_domains = ['www.businesslist.ae']
        start_urls = ['http://www.businesslist.ae/']

        rules = (
            Rule(SgmlLinkExtractor()),
            Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('Hi, this is an item page! %s' % response.url)
            hxs = HtmlXPathSelector(response)
            i = BusinesslistItem()
            company = hxs.select('//div[@class="text companyname"]/strong/text()').extract()[0]
            address = hxs.select('//div[@class="text location"]/text()').extract()[0]
            location = hxs.select('//div[@class="text location"]/a/text()').extract()[0]
            i['url'] = response.url
            i['company'] = company
            i['address'] = address
            i['location'] = location
            return i

In this case it doesn't apply the second rule, so it doesn't parse the detail pages.

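The first rule, Rule(SgmlLinkExtractor()), matches every link, so Scrapy ignores the second one. When several rules match the same link, CrawlSpider uses only the first of them, in the order they are defined in rules.

Try the following: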

    ...
    start_urls = ['http://www.businesslist.ae/sitemap.html']
    ...
    # Rule(SgmlLinkExtractor()),
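
To see why the order of the rules matters, here is a rough sketch in plain Python. It does not use Scrapy: the dispatch function and the sample links are made up for illustration, and each rule is reduced to a (regex, callback) pair. It only models the "first matching rule claims the link" behaviour described above, under the assumption that an empty pattern stands in for SgmlLinkExtractor() with no arguments.

```python
import re

# Simplified model of first-match-wins rule dispatch.
# This is an illustration only, not Scrapy's actual implementation.
def dispatch(links, rules):
    """Map each link to the callback of the first rule that matches it."""
    claimed = {}
    for pattern, callback in rules:
        for link in links:
            # A link already claimed by an earlier rule is skipped.
            if link not in claimed and re.search(pattern, link):
                claimed[link] = callback
    return claimed

# Hypothetical links found on one page.
links = [
    "http://www.businesslist.ae/company/123/",
    "http://www.businesslist.ae/category/hotels",
]

# Rule order from the question: the catch-all (empty pattern) comes first.
question_rules = [(r"", None), (r"company/(\d)+/", "parse_item")]

# The answer's change: the catch-all rule is commented out.
answer_rules = [(r"company/(\d)+/", "parse_item")]

# Possible alternative: keep both rules, but put the specific one first.
reordered_rules = [(r"company/(\d)+/", "parse_item"), (r"", None)]

broken = dispatch(links, question_rules)
fixed = dispatch(links, answer_rules)
reordered = dispatch(links, reordered_rules)

print(broken[links[0]])     # None
print(fixed[links[0]])      # parse_item
print(reordered[links[0]])  # parse_item
```

With the question's ordering the company link is claimed by the catch-all rule, which has no callback, so parse_item never runs. Removing the catch-all, or moving it after the company rule, lets the company rule claim the link.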
