Parsing large XML data using Python's ElementTree
I'm learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.
My code is right below, followed by a bit of the XML data.
    import xml.etree.ElementTree as ET
    tree = ET.fromstring("c:\pbc.xml")
    root = tree.getroot()

    for article in root.findall('article'):
        print ' '.join([t.text for t in pub.findall('title')])
        for author in article.findall('author'):
            print 'author name: {}'.format(author.text)
        for journal in article.findall('journal'):  # venue tags with id attribute
            print 'journal'
    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE dblp SYSTEM "dblp.dtd">
    <dblp>
      <article mdate="2002-01-03" key="persons/codd71a">
        <author>e. f. codd</author>
        <title>further normalization of data base relational model.</title>
        <journal>ibm research report, san jose, california</journal>
        <volume>rj909</volume>
        <month>august</month>
        <year>1971</year>
        <cdrom>ibmtr/rj909.pdf</cdrom>
        <ee>db/labs/ibm/rj909.html</ee>
      </article>
      <article mdate="2002-01-03" key="persons/hall74">
        <author>patrick a. v. hall</author>
        <title>common subexpression identification in general algebraic systems.</title>
        <journal>technical rep. uksc 0060, ibm united kingdom scientific centre</journal>
        <month>november</month>
        <year>1974</year>
      </article>
You are using .fromstring() instead of .parse():
    import xml.etree.ElementTree as ET

    # parse() takes a filename (or file object) and returns an ElementTree.
    tree = ET.parse("c:\\pbc.xml")
    root = tree.getroot()
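.fromstring() expects to be given the XML data itself in a string, not a filename. When you pass it a path, the parser tries to read the path as XML and fails right away.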
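To make the difference concrete, here is a minimal sketch that runs both calls on the same data. The XML_TEXT sample and the temporary file are made up for illustration and stand in for the real pbc.xml. Note that .fromstring() returns the root Element directly, so there is no .getroot() step after it.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample data, not the real DBLP file.
XML_TEXT = "<dblp><article><title>Example title</title></article></dblp>"

# fromstring() parses XML text and returns the root Element itself.
root_from_text = ET.fromstring(XML_TEXT)

# parse() reads from a filename (or file object) and returns an ElementTree,
# so getroot() is needed to reach the root Element.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(XML_TEXT)
    path = handle.name
try:
    root_from_file = ET.parse(path).getroot()
finally:
    os.remove(path)

# Passing a path to fromstring() makes the parser treat the path as XML text.
try:
    ET.fromstring("c:\\pbc.xml")
    path_error = None
except ET.ParseError as exc:
    path_error = exc

print(root_from_text.tag, root_from_file.tag)
print("fromstring on a path failed with:", path_error)
```

Both roots are the same dblp element; only the call with the path raises ParseError.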
If the document is large (many megabytes or more), you should use the ET.iterparse() function instead and clear the elements you have already processed:
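    # The standard library's iterparse() has no tag filter (lxml's accepts tag=...),
    # so check the tag of each finished element ourselves.
    for event, article in ET.iterparse('c:\\pbc.xml'):
        if article.tag != 'article':
            continue
        for title in article.findall('title'):
            print('title: {}'.format(title.text))
        for author in article.findall('author'):
            print('author name: {}'.format(author.text))
        for journal in article.findall('journal'):
            print('journal')
        # Drop the children and text of this article to keep memory use low.
        article.clear()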
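The same streaming idea can be tried without the real file. The sketch below is only an illustration: SAMPLE is invented data, and iter_articles is a hypothetical helper name, not part of ElementTree. It waits for the "end" event of each article element, so the child elements are complete when they are read, and clears the element afterwards.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a large DBLP file.
SAMPLE = b"""<dblp>
<article key="a1"><author>A. Author</author><title>First title.</title></article>
<article key="a2"><author>B. Author</author><title>Second title.</title></article>
</dblp>"""


def iter_articles(source):
    """Yield (key, authors, title) for each <article>, clearing it afterwards."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue  # skip the end events of <author>, <title>, <dblp>, ...
        authors = [a.text for a in elem.findall("author")]
        title = elem.findtext("title")
        yield elem.get("key"), authors, title
        # Runs when the consumer asks for the next record.
        elem.clear()


records = list(iter_articles(io.BytesIO(SAMPLE)))
for key, authors, title in records:
    print(key, authors, title)
```

The cleared article elements still hang off the root as empty shells. For very large files it may also be worth removing them from the root, but that is left out here to keep the sketch short.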