java - Extracting certain text from an HTML file
I want to extract the text in an HTML file that is placed between paragraph (p) and link (a href) tags. I want to do it without Java regex or HTML parsers. I thought of this:
    while ((word = reader.readLine()) != null) { // iterate to the end of the file
        if (word.contains("<p>")) { // catching the p tag
            while (!word.contains("</p>")) { // iterate to the end of the tag
                try { // start writing
                    out.write(word);
                } catch (IOException e) {
                }
            }
        }
    }
But it's not working. The code seems pretty valid to me. How can the reader catch the "p" and "a href" tags?
The problems start when you have <p>blah</p> on a single line. One simple solution is to change each < to \n<, like this:
    boolean insidePar = false;
    while ((line = reader.readLine()) != null) {
        for (String word : line.replaceAll("<", "\n<").split("\n")) {
            if (word.contains("<p>")) {
                insidePar = true;
            } else if (word.contains("</p>")) {
                insidePar = false;
            }
            if (insidePar) {
                // write word
            }
        }
    }
Still, I'd recommend using a parser library, as @hovercraftfullofeels suggested.
Edit: I've updated the code so it's a bit closer to a working version, but there will be more problems along the way.
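For what it's worth, here is a rough sketch of one way to do this with plain String.indexOf and no regex or parser. It is a different technique from the line-splitting idea above, not a finished solution. The class name TagTextExtractor and the method extract are made up for illustration. It assumes the whole file has already been read into one String, that tag names are lowercase, that the same tag is not nested inside itself, and that no attribute value contains a > character.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTextExtractor {

    // Hypothetical helper: returns the text between each <tag ...> and the next </tag>.
    // Assumes lowercase tag names and no nesting of the same tag.
    public static List<String> extract(String html, String tag) {
        List<String> results = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);
            if (start < 0) break;                  // no more opening tags
            int afterName = start + open.length();
            if (afterName >= html.length()) break; // input ends right after the tag name
            char next = html.charAt(afterName);
            // Skip longer tag names that only start with ours, e.g. <pre> when looking for <p>.
            if (next != '>' && !Character.isWhitespace(next)) {
                pos = afterName;
                continue;
            }
            int openEnd = html.indexOf('>', afterName); // end of the opening tag, attributes included
            if (openEnd < 0) break;
            int end = html.indexOf(close, openEnd + 1);
            if (end < 0) break;                    // unclosed tag: stop instead of looping forever
            results.add(html.substring(openEnd + 1, end).trim());
            pos = end + close.length();
        }
        return results;
    }

    public static void main(String[] args) {
        String html = "<html><body>\n<p>first\nparagraph</p><pre>skip</pre>\n"
                + "<a href=\"http://example.com\">a link</a><p class=\"x\">second</p>\n</body></html>";
        System.out.println(extract(html, "p")); // paragraphs, including one that spans two lines
        System.out.println(extract(html, "a")); // link text
    }
}
```

Because it scans one string instead of going line by line, it does not matter whether <p> and </p> sit on the same line or several lines apart. Any markup inside a paragraph (for example a link) is returned as-is, so real-world pages will still hit the kind of edge cases a parser library handles for you.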