regex - re.findall in Python 3 -

January 15, 2010

I wanted to use the function re.findall(), which searches through a webpage for a pattern:

    from urllib.request import Request, urlopen
    import re

    url = Request('http://www.cmegroup.com/trading/products/#sortfield=oi&sortasc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/20.0.1'})
    webpage = urlopen(url).read()

    findrows = re.compile('<td class="cmetablecenter">(.*)</td>')
    row_array = re.findall(findrows, webpage)  # error here

I get this error:

    TypeError: can't use a string pattern on a bytes-like object

urllib.request.urlopen(...).read() returns a bytes object, not a (Unicode) string. You should decode it before trying to match anything against it. For example, if you know the page is in UTF-8:

    webpage = urlopen(url).read().decode('utf8')
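To see both the failure and the fix without touching the network, here is a minimal sketch. The `sample_bytes` value is a made-up stand-in for what `urlopen(url).read()` might return; it is not the real page content.

```python
import re

# Hypothetical stand-in for urlopen(url).read(); the real page is not fetched here.
sample_bytes = (
    b'<tr><td class="cmetablecenter">Corn</td></tr>\n'
    b'<tr><td class="cmetablecenter">Gold</td></tr>'
)

findrows = re.compile('<td class="cmetablecenter">(.*)</td>')

# A str pattern applied to bytes raises TypeError.
try:
    re.findall(findrows, sample_bytes)
    raised = False
except TypeError:
    raised = True

# Decoding first (assuming the data is UTF-8) makes the same pattern usable.
webpage = sample_bytes.decode('utf8')
row_array = findrows.findall(webpage)
print(raised, row_array)
```

The pattern and the text it is matched against must be the same type, so decoding once right after reading keeps the rest of the code working with ordinary strings.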

Better HTTP libraries will do this for you automatically, but determining the right encoding isn't always trivial or even possible, so Python's standard library doesn't.
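If you want to stay with the standard library, one common heuristic is to use the charset the server declares in its Content-Type header and fall back to a default when it is missing or wrong. The sketch below is only that, a heuristic: the helper names `decode_body` and `fetch_text`, the UTF-8 fallback, and the User-Agent value are all assumptions made for illustration, and a page can still declare its encoding incorrectly.

```python
from typing import Optional
from urllib.request import Request, urlopen


def decode_body(body: bytes, charset: Optional[str], fallback: str = 'utf-8') -> str:
    """Decode body with the declared charset, else with the fallback (hypothetical helper)."""
    try:
        return body.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: decode leniently with the fallback.
        return body.decode(fallback, errors='replace')


def fetch_text(url: str) -> str:
    """Fetch a URL and decode it using the charset from the response headers, if any."""
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(request) as response:
        # get_content_charset() returns None when the header declares no charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling `fetch_text(url)` then returns a str that a string pattern can be matched against.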

Another option is to use a bytes regex instead:

    findrows = re.compile(b'<td class="cmetablecenter">(.*)</td>')

This is useful if you don't know the encoding either, and don't mind working with bytes objects throughout your code.
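A small sketch of the bytes-only route, again using a made-up `sample_bytes` value in place of the downloaded page. Note that the captured groups come back as bytes too, so any later comparison or formatting has to use bytes literals or decode at that point.

```python
import re

# Hypothetical stand-in for urlopen(url).read(); never decoded in this variant.
sample_bytes = (
    b'<td class="cmetablecenter">Corn</td>\n'
    b'<td class="cmetablecenter">Gold</td>'
)

# A bytes pattern matches bytes input directly.
findrows = re.compile(b'<td class="cmetablecenter">(.*)</td>')
row_array = findrows.findall(sample_bytes)

# The results are bytes objects, not str.
print(row_array)
```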
