Parsing large XML data using Python's ElementTree

March 15, 2010

I'm learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.

My code is right below, and a bit of the XML data is after the code.

    import xml.etree.ElementTree as ET

    tree = ET.fromstring("c:\pbc.xml")
    root = tree.getroot()

    for article in root.findall('article'):
        print(' '.join([t.text for t in article.findall('title')]))
        for author in article.findall('author'):
            print('Author name: {}'.format(author.text))
        for journal in article.findall('journal'):  # venue tags with an id attribute
            print('journal')

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE dblp SYSTEM "dblp.dtd">
    <dblp>
    <article mdate="2002-01-03" key="persons/codd71a">
    <author>E. F. Codd</author>
    <title>Further Normalization of the Data Base Relational Model.</title>
    <journal>IBM Research Report, San Jose, California</journal>
    <volume>RJ909</volume>
    <month>August</month>
    <year>1971</year>
    <cdrom>ibmtr/rj909.pdf</cdrom>
    <ee>db/labs/ibm/rj909.html</ee>
    </article>

    <article mdate="2002-01-03" key="persons/hall74">
    <author>Patrick A. V. Hall</author>
    <title>Common Subexpression Identification in General Algebraic Systems.</title>
    <journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
    <month>November</month>
    <year>1974</year>
    </article>

You are using .fromstring() instead of .parse():

    import xml.etree.ElementTree as ET

    # parse() takes a filename (or file object) and returns an ElementTree
    tree = ET.parse("c:\\pbc.xml")
    root = tree.getroot()

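.fromstring() expects to be given the XML data itself as a string, not a filename. Passing it a path means the path text is parsed as if it were XML, which is what raises the ParseError.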
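To make the difference concrete, here is a minimal sketch. The sample XML string and the temporary file are made up for illustration and stand in for the real pbc.xml:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample data, standing in for the real file.
XML_TEXT = "<dblp><article><author>E. F. Codd</author></article></dblp>"

# fromstring() parses XML text and returns the root Element directly.
root_from_text = ET.fromstring(XML_TEXT)

# parse() takes a filename (or file object) and returns an ElementTree;
# call getroot() on it to reach the root Element.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(XML_TEXT)
    xml_path = handle.name
try:
    root_from_file = ET.parse(xml_path).getroot()
finally:
    os.remove(xml_path)

# Handing a path to fromstring() treats the path text itself as XML.
try:
    ET.fromstring(xml_path)
    path_error = None
except ET.ParseError as exc:
    path_error = exc

print(root_from_text.tag, root_from_file.tag, type(path_error).__name__)
```

Both calls end up at the same root element; only parse() goes through an ElementTree object first, so getroot() is needed there and not after fromstring().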

If the document is large (many megabytes or more), you should use the ET.iterparse() function instead, and clear the elements you have processed:

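    import xml.etree.ElementTree as ET

    for event, article in ET.iterparse('c:\\pbc.xml'):
        # the standard library iterparse() has no tag filter, so skip other elements
        if article.tag != 'article':
            continue
        for title in article.findall('title'):
            print('Title: {}'.format(title.text))
        for author in article.findall('author'):
            print('Author name: {}'.format(author.text))
        for journal in article.findall('journal'):
            print('journal')

        # free the memory held by this article's children and text
        article.clear()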
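As a rough sketch of the same streaming idea in self-contained form, the example below reads from an in-memory buffer instead of a file. The iter_articles helper and the SAMPLE data are assumptions introduced for illustration, not part of the original answer:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory sample; a real run would pass a filename instead.
SAMPLE = b"""<dblp>
<article key="persons/codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
</article>
<article key="persons/hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
</article>
</dblp>"""


def iter_articles(source):
    """Yield (title, authors) per <article>, clearing each one after use."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue  # children stay attached until their <article> ends
        title = elem.findtext("title")
        authors = [author.text for author in elem.findall("author")]
        yield title, authors
        elem.clear()  # drop the article's children and text to free memory


articles = list(iter_articles(io.BytesIO(SAMPLE)))
for title, authors in articles:
    print("{} ({})".format(title, ", ".join(authors)))
```

The values are copied out before clear() runs, so nothing yielded depends on the element afterwards. The emptied article elements stay attached to the root, so for very large files it may also be worth clearing the root from time to time.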
