VBScript - Extract text between HTML tags


I have many HTML files that I need to extract text from. If the text is on one line, I can do it quite easily, but if the tag wraps around or spans multiple lines I can't figure out how to do this. Here's what I mean:

<section id="mysection"> some text here
line here <br>
last line of text. </section>

I'm not concerned about the <br> tags within the text, except where the text wraps around. The area I want begins at "mysection" and ends at </section>. I'd like to end up with this:

some text here  line here  last line of text. 

I'd prefer VBScript or a command-line option (sed?), but I'm not sure where to begin. Any help?

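Normally you'd use the Internet Explorer COM object for this:

root = "C:\base\dir"

Set fso = CreateObject("Scripting.FileSystemObject")
Set ie  = CreateObject("InternetExplorer.Application")

For Each f In fso.GetFolder(root).Files
  ' Load the file and wait until IE has finished with it
  ie.Navigate "file:///" & f.Path
  While ie.Busy : WScript.Sleep 100 : Wend

  text = ie.Document.getElementById("mysection").innerText

  ' Join the lines of the section into a single line
  WScript.Echo Replace(text, vbNewLine, "")
Next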

However, the <section> tag is not supported prior to IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, because getElementById("mysection") returns only the opening tag:

>>> wsh.echo ie.document.getElementById("mysection").outerHTML
<section id=mysection>

You could use a regular expression instead, though:

root = "C:\base\dir"

Set fso = CreateObject("Scripting.FileSystemObject")

' Matches the section; [\s\S] also matches line breaks, which "." does not
Set re1 = New RegExp
re1.Pattern = "<section id=""mysection"">([\s\S]*?)</section>"
re1.Global  = False
re1.IgnoreCase = True

' Matches runs of <br> tags and whitespace (including line breaks)
Set re2 = New RegExp
re2.Pattern = "(<br>|\s)+"
re2.Global  = True
re2.IgnoreCase = True

For Each f In fso.GetFolder(root).Files
  html = fso.OpenTextFile(f.Path).ReadAll

  Set m = re1.Execute(html)
  If m.Count > 0 Then
    ' Collapse each run of <br>/whitespace into a single space
    text = Trim(re2.Replace(m(0).SubMatches(0), " "))
    WScript.Echo text
  End If
Next
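To see what the two patterns do without touching any files, here is a minimal sketch that applies them to a hard-coded string. The sample string is an assumption modeled on the question (a section that spans several lines), and the variable names are only illustrative. Note the capital S in [\s\S]: that class matches any character, including line breaks, whereas [\s\s] would match whitespace only.

```vbscript
Option Explicit

Dim html, reSection, reSpace, matches, text

' Hypothetical sample input: a section spread over three lines
html = "<section id=""mysection""> some text here" & vbNewLine & _
       "line here <br>" & vbNewLine & _
       "last line of text. </section>"

' Capture everything between the opening and closing section tags
Set reSection = New RegExp
reSection.Pattern = "<section id=""mysection"">([\s\S]*?)</section>"
reSection.IgnoreCase = True
reSection.Global = False

' Match runs of <br> tags and whitespace, including line breaks
Set reSpace = New RegExp
reSpace.Pattern = "(<br>|\s)+"
reSpace.IgnoreCase = True
reSpace.Global = True

Set matches = reSection.Execute(html)
If matches.Count > 0 Then
  ' Replace each run with one space, then strip leading/trailing spaces
  text = Trim(reSpace.Replace(matches(0).SubMatches(0), " "))
  WScript.Echo text
End If
```

Saved under a name of your choice (say, test.vbs) and run with cscript //nologo test.vbs, this should print: some text here line here last line of text.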
