HTTP error 403 in Python 3 Web Scraping
I was trying to scrape a website for practice, but I kept on getting HTTP Error 403 (does it think I'm a bot)?
Here is my code:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re

webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:on|off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)

print(len(row_array))
iterator = []
The error is:

  File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
This is probably because of mod_security or some similar server security feature that blocks known spider/bot user agents (urllib sends something like python urllib/3.3.0, which is easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think it's just a typo.
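If you end up using the requests package that is commented out at the top of your script, the same User-Agent trick applies there too. A minimal sketch, assuming requests is installed (I have not tested it against that particular server):

import requests

url = 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'
# Send a browser-like User-Agent, just as with urllib above.
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
webpage = resp.text      # decoded HTML as str (urlopen(...).read() returns bytes)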
Tip: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for a reason...