regex - re.findall in Python 3


I wanted to use the function re.findall(), which searches through a webpage for a pattern:

from urllib.request import Request, urlopen
import re
url = Request('http://www.cmegroup.com/trading/products/#sortfield=oi&sortasc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/20.0.1'})
webpage = urlopen(url).read()
findrows = re.compile('<td class="cmetablecenter">(.*)</td>')
row_array = re.findall(findrows, webpage)  # error here

I get this error:

TypeError: can't use a string pattern on a bytes-like object

urllib.request.urlopen(...).read() returns a bytes object, not a (Unicode) string. You should decode it before trying to match anything. For example, if you know the page is in UTF-8:

webpage = urlopen(url).read().decode('utf8') 
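
To see the difference without touching the network, here is a minimal sketch that uses an in-memory bytes value as a stand-in for urlopen(url).read(). The sample markup and the names mismatch_raised and text are made up for illustration.

```python
import re

# Stand-in for urlopen(url).read(): a small UTF-8 encoded page held in memory.
# (Hypothetical sample markup, not the real page.)
webpage = '<td class="cmetablecenter">Café</td>'.encode('utf8')

findrows = re.compile('<td class="cmetablecenter">(.*)</td>')

# A str pattern applied to bytes raises TypeError.
try:
    findrows.findall(webpage)
    mismatch_raised = False
except TypeError:
    mismatch_raised = True

# Decoding first gives the regex a str to work on.
text = webpage.decode('utf8')
row_array = findrows.findall(text)
print(mismatch_raised, row_array)
```

After decoding, the captured groups are ordinary str values, so the rest of the code can treat them as text.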

Better HTTP libraries will do this automatically for you, but determining the right encoding isn't always trivial or even possible, so Python's standard library doesn't.
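
If the server declares a charset in its Content-Type header, you can use that instead of hard-coding UTF-8. The following is only a rough sketch: decode_body and fetch_text are hypothetical helper names, the User-Agent value is a placeholder, and falling back to UTF-8 with errors='replace' is an assumption that will not suit every page. It does not look at <meta> tags or try to sniff the encoding from the content.

```python
from email.message import Message
from urllib.request import Request, urlopen


def decode_body(body, headers, fallback='utf-8'):
    """Decode bytes using the charset from the Content-Type header, if any."""
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or fallback
    try:
        return body.decode(charset, errors='replace')
    except LookupError:
        # The server named a codec that Python does not know.
        return body.decode(fallback, errors='replace')


def fetch_text(url):
    """Fetch a URL and return its body as str (not called in this demo)."""
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # placeholder value
    with urlopen(request) as response:
        # response.headers is an email.message.Message subclass.
        return decode_body(response.read(), response.headers)


# Offline demonstration with hand-built headers instead of a real response.
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
demo_text = decode_body('Café'.encode('latin-1'), headers)
print(demo_text)
```

With this in place, webpage = fetch_text(url) would hand re.findall a str, provided the declared charset is actually correct.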

Another option is to use a bytes regex instead:

findrows = re.compile(b'<td class="cmetablecenter">(.*)</td>') 

This is useful if you don't know the encoding either, and don't mind working with bytes objects throughout your code.
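
As a small illustration of the bytes approach, here is a sketch that runs the bytes pattern over an in-memory value. The sample markup is invented, and decoding each match at the end with errors='replace' is an optional step that assumes the cell contents are close to UTF-8.

```python
import re

# Stand-in for urlopen(url).read(); hypothetical markup, one cell per line.
webpage = (b'<td class="cmetablecenter">CL</td>\n'
           b'<td class="cmetablecenter">NG</td>\n')

# Both the pattern and the input are bytes, so no TypeError is raised.
findrows = re.compile(b'<td class="cmetablecenter">(.*)</td>')
row_array = findrows.findall(webpage)

# The matches are bytes too; decode them only where text is really needed.
labels = [row.decode('utf-8', errors='replace') for row in row_array]
print(row_array, labels)
```

Keep in mind that everything compared against row_array must also be bytes: b'CL' == 'CL' is False in Python 3.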

