Problem HTTP error 403 in Python 3 Web Scraping

This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib identifies itself with something like Python-urllib/3.3, which is easily detected). Try setting a known browser User-Agent with:

from urllib.request import Request, urlopen

req = Request(
    url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

This works for me.

By the way, your code is missing the () after .read in the urlopen line, but I assume that's a typo.
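For illustration, here is the difference the () makes (a minimal sketch; example.com stands in for your target page):

from urllib.request import Request, urlopen

req = Request(url='http://example.com', headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req).read    # missing (): this is the bound method object, not the page
page = urlopen(req).read()  # with (): actually reads the response body as bytes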

Tip: since this is an exercise, choose a different, less restrictive site. Maybe they are blocking urllib for some reason...

Web Scraping getting error (HTTP Error 403: Forbidden) using urllib

I had no problems using the requests package. I did need to add a User-Agent header, though; without it I was getting the same issue as you. Try this:

import requests 

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

def get_data(link):
    # send a browser-like User-Agent; without it the server answers 403
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}

    req = requests.get(link, headers=hdr)
    content = req.content

    return content

data = get_data(test_URL)
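As a small addition, you may want to fail fast instead of silently returning an error page; requests provides raise_for_status() for that (a sketch reusing the same header):

import requests

hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
resp = requests.get('https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt', headers=hdr)
resp.raise_for_status()  # raises requests.exceptions.HTTPError if the server still answers 403
print(resp.content[:200])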

Web scraping when the page returns 403

import requests
import pandas as pd
from bs4 import BeautifulSoup

# make sure you pass the headers as a dict; the : was missing in your original code
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
}

def main(url):
    # include the headers in the request
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # response 200
    print(r)

    # this is how you can use pandas with the previous headers to get the 200 response text;
    # you will still get "ValueError: No tables found" because this is a JS-rendered site
    # behind Cloudflare protection - try Selenium instead!
    df = pd.read_html(r.text)
    print(df)


main('https://mirror-h.org/archive/page/1')
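Since the page is JS-rendered behind Cloudflare, here is a minimal Selenium sketch (an assumption on my part: you have selenium installed and a working Chrome/ChromeDriver setup; Cloudflare may still challenge automated browsers):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes a working Chrome/ChromeDriver install
driver.get('https://mirror-h.org/archive/page/1')
soup = BeautifulSoup(driver.page_source, 'lxml')  # parse the fully rendered HTML
print(soup.title)
driver.quit()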

Web scraping using python: urlopen returns HTTP Error 403: Forbidden

You might need to specify more headers; try this:

import urllib.request

url = "https://www.fragrantica.com/perfume/Tom-Ford/Tobacco-Vanille-1825.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

request = urllib.request.Request(url=url, headers=headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()  # the data you need
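If the site still answers 403, you can catch the error explicitly instead of letting the script crash (a minimal sketch continuing from the request assembled above):

import urllib.error

try:
    response = urllib.request.urlopen(request)
    data = response.read()
except urllib.error.HTTPError as e:
    print('Server returned', e.code, e.reason)  # e.g. 403 Forbidden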

Python Web Scrape - 403 Error

Try using a Session object from requests, as below:

import requests

my_session = requests.Session()
for_cookies = my_session.get("https://www.cubesmart.com")  # first hit just to collect cookies
cookies = for_cookies.cookies
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'

response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code)  # 200
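Note that a requests Session already persists cookies between requests on its own, so passing cookies=cookies explicitly is belt-and-braces here; the session would resend whatever the first GET collected anyway.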

