web scraping - Extracting the same data from various HTML documents

February 15, 2011

Let's say I have several HTML pages from unrelated websites that all contain the same overall information. I want to extract that information in a flexible manner, i.e. I want to have to write only a small number of data extractors for all of the pages (ideally, one). The fields (to use a blog example) are author, date, title, and text. The classes of the HTML tags that denote these are totally different on each page, but they still display on the page in the same way. For example, take this post on CNN and this post on Gawker. Both contain the same information, the information I want, somewhere on the page when displayed. Is there a nice way to extract this data? Writing separate extractors is an option, but not a good one; there are a thousand styles of documents in the dataset I want to use.

The only way you can do this is by finding a common element in all of the websites (e.g. they share the same DOM structure, or have the same id, or the content is preceded by the same content in a previous tag such as <h1>).
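As a rough illustration of the "common element" idea, here is a minimal sketch using only Python's standard library. It assumes (and this is an assumption, not something every site does) that the title sits in the first <h1>, the author in a <meta name="author"> tag, and the date in a <time datetime="..."> element. The names CommonFieldParser and extract_common_fields are made up for this example.

```python
from html.parser import HTMLParser


class CommonFieldParser(HTMLParser):
    """Collects fields that pages may mark up the same way, whatever their CSS classes."""

    def __init__(self):
        super().__init__()
        self.fields = {}      # field name -> first value found
        self._capture = None  # field whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and "title" not in self.fields:
            # Assumption: the first <h1> on the page is the post title.
            self._capture, self._buffer = "title", []
        elif tag == "meta" and attrs.get("name") == "author":
            self.fields.setdefault("author", (attrs.get("content") or "").strip())
        elif tag == "time" and attrs.get("datetime"):
            self.fields.setdefault("date", attrs["datetime"])

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._capture == "title":
            # Join text from nested tags and collapse runs of whitespace.
            self.fields["title"] = " ".join("".join(self._buffer).split())
            self._capture = None


def extract_common_fields(html):
    """Return whichever of title/author/date could be found in the page."""
    parser = CommonFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Fields that a page does not expose in one of these shared forms are simply missing from the returned dict, which makes it easy to see which pages still need their own rules.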

Otherwise, you will need to write different rules or regular expressions for each case.
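If you do end up with per-site rules, keeping them in one table at least keeps the extractor code itself small. The following is only a sketch: the hostnames and the patterns in SITE_RULES are hypothetical placeholders, and regular expressions on HTML are fragile, so treat it as a starting point.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> {field: regex with one capture group}.
SITE_RULES = {
    "news.example.com": {
        "title": re.compile(r'<h1 class="headline">(.*?)</h1>', re.S),
        "author": re.compile(r'<span class="byline">(.*?)</span>', re.S),
    },
    "blog.example.org": {
        "title": re.compile(r'<h2 id="post-title">(.*?)</h2>', re.S),
        "author": re.compile(r'<a rel="author"[^>]*>(.*?)</a>', re.S),
    },
}


def extract_with_site_rules(url, html):
    """Apply the rule set registered for the URL's hostname to the page."""
    host = urlparse(url).hostname
    rules = SITE_RULES.get(host)
    if rules is None:
        raise KeyError(f"no extraction rules registered for {host!r}")
    result = {}
    for field, pattern in rules.items():
        match = pattern.search(html)
        # A field whose pattern does not match is reported as None.
        result[field] = match.group(1).strip() if match else None
    return result
```

Adding support for a new site then means adding one entry to the table rather than writing a new extractor.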

Unless, of course, you write an algorithm intelligent enough to recognize content by its intention/meaning across different HTML, which is neither simple nor quick to write.
