web scraping - Extracting the same data from various HTML documents


Let's say I have several HTML pages from unrelated websites that contain the same overall information. I want to extract this information in a flexible manner, i.e. I want to have to write only a small number of data extractors for all of the pages (ideally, one). The fields are (to use a blog example) author, date, title, and text. The classes of the HTML tags that denote these are totally different on each page, but they still display on the page in the same way. For example, take this post on CNN and this post on Gawker. Both contain the same information - the information I want - somewhere on the page when displayed. Is there a nice way to extract this data? Writing separate extractors is an option, but not a good one; there are a thousand styles of documents in the dataset I want to use.

The only way you can do this is by finding a common element in all of the websites (e.g. they share the same DOM structure, or have the same id, or are preceded by the same content in a previous tag such as <h1>).
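
As a rough illustration of the "common element" idea, here is a minimal sketch that ignores class names entirely and keys only on structure: it takes the first <h1> as the title and every <p> that follows it as the text. That structural assumption, and the names `H1Extractor` and `extract`, are made up for this example and will not hold for every site.

```python
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collect the text of the first <h1> and of each <p> that follows it."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.paragraphs = []
        self._tag = None  # tag currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Classes and ids are deliberately ignored; only the tag name matters.
        if tag == "h1" and self.title is None:
            self._tag, self._buf = "h1", []
        elif tag == "p" and self.title is not None and self._tag is None:
            self._tag, self._buf = "p", []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse runs of whitespace left over from the markup.
        text = " ".join("".join(self._buf).split())
        if tag == "h1":
            self.title = text
        elif text:
            self.paragraphs.append(text)
        self._tag = None


def extract(html):
    """Return the title and body paragraphs found by the structural rule."""
    parser = H1Extractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": parser.paragraphs}
```

The same function can then be run over pages whose class names differ, as long as they share the <h1>-then-<p> layout; pages that do not share it need a different rule.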

Otherwise, you will need to write different rules or regular expressions for each case.
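
One way to keep per-site rules manageable is to hold them in a table keyed by hostname, so that adding a site means adding data rather than code. The sketch below assumes this layout; the hostnames, the patterns, and the names `RULES` and `extract_with_rules` are all hypothetical placeholders, and regular expressions over HTML are fragile compared with a real parser.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: one regular expression per field, per host.
RULES = {
    "news.example.com": {
        "title": re.compile(r'<h1 class="headline">(.*?)</h1>', re.S),
        "author": re.compile(r'<span class="byline">(.*?)</span>', re.S),
    },
    "blog.example.org": {
        "title": re.compile(r'<h2 id="post-title">(.*?)</h2>', re.S),
        "author": re.compile(r'<a rel="author"[^>]*>(.*?)</a>', re.S),
    },
}


def extract_with_rules(url, html):
    """Pick the rule set for the page's host and apply each field pattern."""
    host = urlparse(url).hostname
    rules = RULES.get(host)
    if rules is None:
        raise KeyError(f"no extraction rules for {host!r}")
    result = {}
    for field, pattern in rules.items():
        match = pattern.search(html)
        # A missing field is reported as None rather than raising.
        result[field] = match.group(1).strip() if match else None
    return result
```

With a thousand document styles this table would get large, which is the cost the question is trying to avoid.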

Unless, of course, you write an algorithm intelligent enough to be capable of recognizing the content's intention/meaning across different HTML - which is neither simple nor quick to write in any way.

