R: XPath expression returns links outside of selected element


I am using R to scrape the links from the main table on a page, using XPath syntax. The main table is the third table on the page, and I want only the links that point to magazine articles.

My code follows:

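    library(XML)
    # Parse the page, then keep the third table in document order
    (x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
    (y = xpathApply(x, "//table")[[3]])
    # Intended: article links inside y only (this is the query that misbehaves)
    (z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
    (links = unique(z))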

If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in the third line by asking for object y to include only the third table.

What am I doing wrong? What is the correct or more efficient way to code this with XPath?

Note: XPath novice writing.

Answered (really quickly), thanks very much! My solution is below.

    extract <- function(x) {
        message(x)  # show progress: current results page number
        html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
        html = xpathApply(html, "//table")[[3]]
        # The leading "." restricts the search to the selected table
        html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
        html = gsub("#ac_newscomment", "", html)  # drop comment anchors
        html = unique(html)
    }

    d = lapply(1:125, extract)
    d = unlist(d)
    write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)

This saves all links to news items with the keyword 'hadopi' on this website.

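You need to start the pattern with . if you want to restrict the search to the current node. A leading / goes back to the start of the document, even though y is not the root node, so //table//a searches the whole page again.

    xpathSApply(y, ".//a/@href")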
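The difference between the two forms can be checked without touching the network. The following is a minimal sketch on a made-up two-table page (the HTML, the `id` values and the variable names are assumptions for illustration, not taken from the real site):

```r
library(XML)

# Hypothetical miniature page: a "main" table followed by a "side" table.
doc <- htmlParse(
  '<html><body>
     <table id="main"><tr><td><a href="/magazine/1">one</a></td></tr></table>
     <table id="side"><tr><td><a href="/magazine/2">two</a></td></tr></table>
   </body></html>',
  asText = TRUE
)

# Select the first table as the context node.
main <- xpathApply(doc, "//table")[[1]]

# Absolute path: evaluated from the document root, so the sidebar link leaks in.
absolute <- unname(xpathSApply(main, "//a/@href"))

# Relative path: evaluated from the context node, so only its own link is found.
relative <- unname(xpathSApply(main, ".//a/@href"))

print(absolute)
print(relative)
```

Here `absolute` should hold both links while `relative` holds only the one inside the selected table, which mirrors the behaviour seen in the question.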

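Alternatively, you can extract the third table directly in the XPath expression. The parentheses make [3] count tables in document order, which matches the [[3]] used above:

    xpathApply(x, "(//table)[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")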
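One caveat about positional predicates is worth checking on a small example: `//table[3]` matches any table that is the third table child of its own parent, whereas `(//table)[3]` picks the third table in the whole document. The sketch below uses an invented page layout (the `div` nesting and the `id` values are assumptions) where the two expressions disagree:

```r
library(XML)

# Hypothetical layout: two tables in one div, a third table in another div.
doc <- htmlParse(
  '<html><body>
     <div>
       <table id="t1"><tr><td>a</td></tr></table>
       <table id="t2"><tr><td>b</td></tr></table>
     </div>
     <div>
       <table id="t3"><tr><td>c</td></tr></table>
     </div>
   </body></html>',
  asText = TRUE
)

# No parent element has three table children, so this matches nothing.
per_parent <- xpathSApply(doc, "//table[3]", xmlGetAttr, "id")

# Third table in document order, regardless of nesting.
in_document <- xpathSApply(doc, "(//table)[3]", xmlGetAttr, "id")

print(length(per_parent))
print(in_document)
```

On a page where all tables are siblings the two forms select the same node, so which one is needed depends on how the real page nests its tables.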
