Problem HTTP error 403 in Python 3 Web Scraping
This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3, which is easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen
req = Request(
url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.
TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...
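To see why the default client is easy to spot, here is a minimal sketch that compares the User-Agent urllib would send by default with the one set on a Request. It makes no network call. The helper names default_user_agent and make_browser_request are illustrative only and are not part of urllib; the URL is just the host from the example above.

```python
from urllib.request import Request, build_opener


def default_user_agent():
    """Return the User-Agent urllib attaches when none is supplied."""
    opener = build_opener()
    # opener.addheaders is a list of (name, value) tuples
    return dict(opener.addheaders).get("User-agent")


def make_browser_request(url, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request carrying a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = make_browser_request("http://www.cmegroup.com/")
print(default_user_agent())          # something starting with "Python-urllib/"
print(req.get_header("User-agent"))  # Request stores header names capitalized
```

Note that Request normalizes header names, so the value is looked up as "User-agent" even though it was set as "User-Agent".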
Web Scraping getting error (HTTP Error 403: Forbidden) using urllib
I had no problems using the requests package. I did need to add a user-agent header, since without it I was getting the same issue as you. Try this:
import requests
test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'
def get_data(link):
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
    req = requests.get(link, headers=hdr)
    content = req.content
    return content
data = get_data(test_URL)
Web scraping when goes to 403 page
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Pass the headers as a dict; the original code was missing the ':' between key and value
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
}
def main(url):
    # Include the headers in the request
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # Should print <Response [200]>
    print(r)
    # pandas can parse the response text fetched with the same headers, but here
    # read_html raises ValueError: No tables found, because the page is rendered
    # by JavaScript behind Cloudflare protection. Try Selenium instead.
    df = pd.read_html(r.text)
    print(df)
main('https://mirror-h.org/archive/page/1')
Web scraping using python: urlopen returns HTTP Error 403: Forbidden
You might need to specify more headers, try this:
import urllib.request
url = "https://www.fragrantica.com/perfume/Tom-Ford/Tobacco-Vanille-1825.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
request = urllib.request.Request(url=url, headers=headers)  # The assembled request
response = urllib.request.urlopen(request)
data = response.read()  # The data you need
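If several pages need the same headers, one option is to attach them to an opener once instead of building a Request each time. The sketch below is only an illustration: BROWSER_HEADERS and install_browser_opener are made-up names, and the header set is a trimmed copy of the one above that may need adjusting per site. Nothing is fetched here.

```python
import urllib.request

# Assumed header set, trimmed from the example above; adjust per site.
BROWSER_HEADERS = [
    ("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) "
                   "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                   "Version/13.0.4 Safari/605.1.15"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.8"),
]


def install_browser_opener(headers=BROWSER_HEADERS):
    """Install a global opener whose default headers replace Python-urllib's."""
    opener = urllib.request.build_opener()
    opener.addheaders = list(headers)  # overwrites the default User-agent entry
    urllib.request.install_opener(opener)
    return opener


opener = install_browser_opener()
# After this, a plain urllib.request.urlopen(url) call goes through this opener.
```

Headers set on an individual Request still take precedence over the opener's defaults, so per-request overrides keep working.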
Python Web Scrape - 403 Error
Try using session() from requests, as below:
import requests
my_session = requests.session()
for_cookies = my_session.get("https://www.cubesmart.com")
cookies = for_cookies.cookies
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code) # 200
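A Session already keeps the cookies it receives and can hold default headers, so a slightly shorter variant is possible. This is a minimal sketch, not a guaranteed fix for any particular site; it only prepares a request without sending it, to show which User-Agent would go out.

```python
import requests

session = requests.Session()
# Default headers set here are merged into every request the session makes.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) "
                  "Gecko/20100101 Firefox/57.0"
})

# Build the request without sending it, to inspect the merged headers.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.cubesmart.com")
)
print(prepared.headers["User-Agent"])
```

With the headers stored on the session, later calls such as session.get(my_url) would reuse both the headers and any cookies set by earlier responses.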