regex - re.findall in Python 3
I wanted to use the function re.findall(), which searches through a webpage for a pattern:
from urllib.request import Request, urlopen
import re

url = Request('http://www.cmegroup.com/trading/products/#sortfield=oi&sortasc=false&venues=3&page=1&cleared=1&group=1', headers={'user-agent': 'mozilla/20.0.1'})
webpage = urlopen(url).read()
findrows = re.compile('<td class="cmetablecenter">(.*)</td>')
row_array = re.findall(findrows, webpage)  # error here
I get this error:
TypeError: can't use a string pattern on a bytes-like object
urllib.request.urlopen(...).read() returns a bytes object, not a (Unicode) string. You should decode it before trying to match anything against it. For example, if you know your page is in UTF-8:
webpage = urlopen(url).read().decode('utf8')
Better HTTP libraries do this automatically for you, but determining the right encoding isn't always trivial, or even possible, so Python's standard library doesn't.
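If the encoding is not known in advance, one option is to try the charset the server declares and fall back to UTF-8. The sketch below is only an illustration under assumptions: `decode_body` is a hypothetical helper, not part of the standard library, the sample bytes stand in for a real response so the snippet runs offline, and the non-greedy `(.*?)` is my substitution for the original `(.*)`.

```python
import re
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode a response body with the declared charset, else fall back to UTF-8."""
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that are invalid in that charset:
        # decode as UTF-8 and substitute U+FFFD for undecodable bytes.
        return raw.decode("utf-8", errors="replace")


# With a live response this would look roughly like:
#   with urlopen(url) as response:
#       charset = response.headers.get_content_charset()  # may be None
#       webpage = decode_body(response.read(), charset)

# Offline stand-in for a page served as Latin-1 (assumed sample data).
sample = '<td class="cmetablecenter">café</td>'.encode("latin-1")
webpage = decode_body(sample, "latin-1")

# The pattern is now a str, and so is the text it is matched against.
rows = re.findall(r'<td class="cmetablecenter">(.*?)</td>', webpage)
print(rows)
```

The declared charset can be wrong or missing, so this fallback is a best effort rather than a guarantee of correct text.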
Another option is to use a bytes regex instead:
findrows = re.compile(b'<td class="cmetablecenter">(.*)</td>')
This is useful if you don't know the encoding either, and don't mind working with bytes objects throughout your code.
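As a rough illustration of the bytes approach, the snippet below matches a bytes pattern against an in-memory bytes sample, so it runs without a network connection. The sample markup is made up for the example, and the non-greedy `(.*?)` is my substitution for `(.*)` so that two cells on one line are captured separately.

```python
import re

# Assumed sample data standing in for urlopen(url).read().
webpage = (
    b'<tr><td class="cmetablecenter">A</td>'
    b'<td class="cmetablecenter">B</td></tr>'
)

# A bytes pattern (note the b prefix) can be matched against bytes input.
find_rows = re.compile(rb'<td class="cmetablecenter">(.*?)</td>')
row_array = find_rows.findall(webpage)

# Every captured group is itself a bytes object.
print(row_array)

# Decode only at the point where text is actually needed.
texts = [cell.decode("ascii", errors="replace") for cell in row_array]
print(texts)
```

Mixing the two types fails in either direction: a str pattern cannot be applied to bytes, and a bytes pattern cannot be applied to str.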