VBScript - Extract text between HTML tags
I have many HTML files that I need to extract text from. If the text is all on one line, I can do this quite easily, but if the tag wraps around or spans multiple lines, I can't figure out how to do it. Here's what I mean:
<section id="mysection">
some text here
line here <br>
last line of text.
</section>
I'm not concerned about the <br> tags, unless they help to wrap the text around. The area I want always begins with "mysection" and ends with </section>. What I'd like to end up with is this:
some text here line here last line of text.
I'd prefer a VBScript or command-line option (sed?), but I'm not sure where to begin. Any help?
Normally you'd use the Internet Explorer COM object for this:
    root = "C:\base\dir"

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ie = CreateObject("InternetExplorer.Application")

    For Each f In fso.GetFolder(root).Files
      ie.Navigate "file:///" & f.Path
      ' Wait until the page has finished loading
      While ie.Busy : WScript.Sleep 100 : Wend

      text = ie.document.getElementById("mysection").innerText
      WScript.Echo Replace(text, vbNewLine, "")
    Next
However, the <section> tag is not supported prior to IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, because getElementById("mysection") returns only the opening tag:
    >>> wsh.echo ie.document.getElementById("mysection").outerHTML
    <section id=mysection>
You could use a regular expression instead, though:
    root = "C:\base\dir"

    Set fso = CreateObject("Scripting.FileSystemObject")

    ' Matches the section, including content that spans several lines
    Set re1 = New RegExp
    re1.Pattern = "<section id=""mysection"">([\s\S]*?)</section>"
    re1.Global = False
    re1.IgnoreCase = True

    ' Matches runs of <br> tags and whitespace (including line breaks)
    Set re2 = New RegExp
    re2.Pattern = "(<br>|\s)+"
    re2.Global = True
    re2.IgnoreCase = True

    For Each f In fso.GetFolder(root).Files
      html = fso.OpenTextFile(f.Path).ReadAll

      Set m = re1.Execute(html)
      If m.Count > 0 Then
        text = Trim(re2.Replace(m(0).SubMatches(0), " "))
        WScript.Echo text
      End If
    Next
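To try the regular-expression approach without touching any files, the same two patterns can be wrapped in a function and run against the sample from the question. This is only a sketch: the function name ExtractSection and the sample variable are made up for illustration, and it assumes the opening tag is written exactly as <section id="mysection"> (same attribute order and quoting).

```vbscript
' Hypothetical helper (not part of the original answer): returns the text
' inside <section id="mysection">...</section>, collapsed onto one line.
' Returns an empty string when the section is not found.
Function ExtractSection(html)
  Dim reSection, reGap, matches

  ' [\s\S]*? matches any characters, including line breaks, non-greedily
  Set reSection = New RegExp
  reSection.Pattern = "<section id=""mysection"">([\s\S]*?)</section>"
  reSection.IgnoreCase = True
  reSection.Global = False

  ' Runs of <br> tags and whitespace are collapsed into a single space
  Set reGap = New RegExp
  reGap.Pattern = "(<br>|\s)+"
  reGap.IgnoreCase = True
  reGap.Global = True

  ExtractSection = ""
  Set matches = reSection.Execute(html)
  If matches.Count > 0 Then
    ExtractSection = Trim(reGap.Replace(matches(0).SubMatches(0), " "))
  End If
End Function

' Sample input modelled on the question: the section spans several lines
Dim sample
sample = "<section id=""mysection"">" & vbNewLine & _
         "some text here" & vbNewLine & _
         "line here <br>" & vbNewLine & _
         "last line of text." & vbNewLine & _
         "</section>"

WScript.Echo ExtractSection(sample)
' Prints: some text here line here last line of text.
```

Saved as, say, extract.vbs (the file name is arbitrary), it can be run with cscript //nologo extract.vbs. Returning an empty string for files without the section makes it easy to skip them in the folder loop.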