R: XPath expression returns links outside of selected element

June 15, 2010

I am using R to scrape the links from the main table on that page, using XPath syntax. The main table is the third table on the page, and I want only the links that point to a magazine article.

My code follows:

require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))

If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in the third line by asking for object y to include only the third table.

What am I doing wrong? What is the correct/more efficient way to code this with XPath?

Note: XPath novice writing.

Answered (really quickly), thanks very much! My solution is below.

extract <- function(x) {
  message(x)
  html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
  html = xpathApply(html, "//table")[[3]]
  html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
  html = gsub("#ac_newscomment", "", html)
  html = unique(html)
}

d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)

This saves all the links to news items with the keyword 'hadopi' on the website.

You need to start the pattern with . if you want to restrict the search to the current node. / goes back to the start of the document (even if the root node is not in y).

xpathSApply(y, ".//a/@href")
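To make the difference concrete, here is a minimal sketch on a made-up two-table page (the HTML, the hrefs, and the variable names are assumptions for illustration, not taken from the real site). It runs the same query on a selected table node, once with an absolute path and once with a relative one.

```r
library(XML)

# Hypothetical page: a sidebar table followed by the main table.
doc <- htmlParse('<html><body>
  <table><tr><td><a href="/sidebar/1">side</a></td></tr></table>
  <table><tr><td><a href="/magazine/1">main</a></td></tr></table>
</body></html>', asText = TRUE)

# Select the second table as a node, the same way y is selected in the question.
main <- xpathApply(doc, "//table")[[2]]

# "//a" is absolute: it searches the whole document that owns the node.
absolute <- as.character(xpathSApply(main, "//a/@href"))

# ".//a" is relative: it searches only the descendants of the selected node.
relative <- as.character(xpathSApply(main, ".//a/@href"))
```

On this toy page, absolute should still pick up the sidebar link, while relative should contain only the link inside the selected table.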

Alternatively, you can extract the third table directly with XPath:

xpathApply(x, "//table[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
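One caveat worth checking against the real page: in XPath 1.0, //table[3] selects every table that is the third table child of its own parent, whereas (//table)[3] selects the third table in document order, which is what xpathApply(x, "//table")[[3]] does. The two agree when the tables are siblings. The sketch below uses a made-up page (the HTML and names are assumptions) where they differ.

```r
library(XML)

# Hypothetical page: the third table has a different parent from the first two.
doc <- htmlParse('<html><body>
  <div>
    <table><tr><td>one</td></tr></table>
    <table><tr><td>two</td></tr></table>
  </div>
  <div>
    <table><tr><td><a href="/magazine/1">main</a></td></tr></table>
  </div>
</body></html>', asText = TRUE)

# //table[3]: tables that are the third table child of their parent (none here).
per_parent <- xpathSApply(doc, "//table[3]//a/@href")

# (//table)[3]: the third table in document order, whatever its parent.
in_doc_order <- as.character(xpathSApply(doc, "(//table)[3]//a/@href"))
```

If the tables on the target page are not siblings, the parenthesised form is probably the safer choice.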
