R: XPath expression returns links outside of selected element
I am using R to scrape the links from the main table on that page, using XPath syntax. The main table is the third one on the page, and I want only the links that point to a magazine article.
My code follows:
    require(XML)
    (x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
    (y = xpathApply(x, "//table")[[3]])
    (z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
    (links = unique(z))
If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in the third line by asking object y to include only the third table.
What am I doing wrong? What is the correct or more efficient way to code this with XPath?
Note: an XPath novice is writing.
Answered (really quickly), thanks very much! My solution is below.
    extract <- function(x) {
      message(x)
      # Parse results page number x for the keyword 'hadopi'
      html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
      # Keep only the third table (the main table)
      html = xpathApply(html, "//table")[[3]]
      # The leading '.' restricts the search to the selected table
      html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
      # Strip the comment anchor so duplicate links collapse
      html = gsub("#ac_newscomment", "", html)
      html = unique(html)
    }
    d = lapply(1:125, extract)
    d = unlist(d)
    write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)
This saves the links to all news items with the keyword 'Hadopi' on this website.
You need to start the pattern with . if you want to restrict the search to the current node. A leading / goes back to the start of the document (even if the root node is not in y).
    xpathSApply(y, ".//a/@href")
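To see the difference without hitting the website, here is a minimal sketch on a made-up two-table document. The HTML string and the names `doc`, `main`, `abs_links` and `rel_links` are assumptions for illustration only; the second table stands in for the "main" table.

```r
library(XML)

# Hypothetical page: the first table plays the sidebar, the second the main table
html <- '<html><body>
  <table><tr><td><a href="/sidebar/1">side</a></td></tr></table>
  <table><tr><td><a href="/magazine/1">main</a></td></tr></table>
</body></html>'

doc  <- htmlParse(html, asText = TRUE)
main <- getNodeSet(doc, "//table")[[2]]

# Absolute path: starts again from the document root, so the sidebar link is found too
abs_links <- as.character(xpathSApply(main, "//a/@href"))

# Relative path: the leading '.' keeps the search inside 'main'
rel_links <- as.character(xpathSApply(main, ".//a/@href"))

print(abs_links)
print(rel_links)
```

Printing both vectors side by side should make it clear why the original query picked up sidebar links.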
Alternatively, you can extract the third table directly with XPath:
    xpathApply(x, "//table[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
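One caveat worth checking against the real page: in XPath 1.0, `//table[3]` selects any table that is the third table child of its own parent, whereas `(//table)[3]` selects the third table in document order, which is what `xpathApply(x, "//table")[[3]]` picks. The two agree only when the tables are siblings. A small sketch on a made-up document (the HTML and the `id` values are assumptions for illustration):

```r
library(XML)

# Hypothetical layout: three tables spread over two parent <div> elements
html <- '<html><body>
  <div><table id="t1"><tr><td>a</td></tr></table></div>
  <div><table id="t2"><tr><td>b</td></tr></table>
       <table id="t3"><tr><td>c</td></tr></table></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Third table *within its parent*: no parent here holds three tables
per_parent <- xpathSApply(doc, "//table[3]", xmlGetAttr, "id")

# Third table *in the whole document*
in_document <- xpathSApply(doc, "(//table)[3]", xmlGetAttr, "id")

print(length(per_parent))
print(in_document)
```

If the tables on the target page are nested in different containers, the parenthesised form is the safer equivalent of indexing with `[[3]]` in R.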