Scrapy multiple Rules for SgmlLinkExtractor don't work
I want to crawl an entire site and extract links conditionally.
As suggested in the link, I tried multiple rules, but that doesn't work: Scrapy doesn't crawl the pages.
I tried the code below, but it doesn't scrape the details.
    # Imports assume the old Scrapy 0.x module layout that provides SgmlLinkExtractor.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    # BusinessListItem is the project's Item class (definition not shown).

    class BusinessListSpider(CrawlSpider):
        name = 'businesslist'
        allowed_domains = ['www.businesslist.ae']
        start_urls = ['http://www.businesslist.ae/']

        rules = (
            Rule(SgmlLinkExtractor()),
            Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('hi, item page! %s' % response.url)
            hxs = HtmlXPathSelector(response)
            i = BusinessListItem()
            company = hxs.select('//div[@class="text companyname"]/strong/text()').extract()[0]
            address = hxs.select('//div[@class="text location"]/text()').extract()[0]
            location = hxs.select('//div[@class="text location"]/a/text()').extract()[0]
            i['url'] = response.url
            i['company'] = company
            i['address'] = address
            i['location'] = location
            return i

In this case it doesn't apply the second rule, so it doesn't parse the detail pages.
The first rule, Rule(SgmlLinkExtractor()), matches every link, so Scrapy never applies the second one to the company pages.
Try the following:

    ...
    start_urls = ['http://www.businesslist.ae/sitemap.html']
    ...
    # Rule(SgmlLinkExtractor()),
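The first-match behaviour described in this answer can be illustrated without Scrapy. The sketch below is a simplified model, not Scrapy's actual implementation: `assign_rules` is a hypothetical helper, each rule is reduced to an (allow pattern, callback name) pair, and the company URL is a made-up example. It only shows why a catch-all rule placed first keeps the company pages from ever reaching `parse_item`.

```python
import re

# Hypothetical stand-in for CrawlSpider rules: (allow pattern, callback name).
# An empty pattern matches every URL, like a link extractor with no arguments.
def assign_rules(urls, rules):
    """Give each URL to the first rule whose pattern matches it."""
    assigned = {}
    for url in urls:
        for pattern, callback in rules:
            if re.search(pattern, url):
                assigned[url] = callback
                break  # first matching rule wins; later rules never see this URL
    return assigned

# Example URLs; the company ID is invented for illustration.
urls = [
    "http://www.businesslist.ae/company/123/",
    "http://www.businesslist.ae/sitemap.html",
]

# Order from the question: the catch-all rule comes first and claims every link.
with_catch_all = assign_rules(
    urls, [(r"", None), (r"company/(\d)+/", "parse_item")]
)

# Catch-all rule commented out, as the answer suggests.
without_catch_all = assign_rules(urls, [(r"company/(\d)+/", "parse_item")])

print(with_catch_all)
print(without_catch_all)
```

With the catch-all rule in place, the company URL is assigned no callback; once it is removed, the same URL is handed to `parse_item`. Starting from the sitemap page then matters because, in this model, non-company links are no longer followed at all.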