urllib2.HTTPError: HTTP Error 403: Forbidden

urllib2.HTTPError: HTTP Error 403: Forbidden

By adding a few more headers I was able to get the data:

import urllib2

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()  # the server's error page often explains the block
    raise

content = page.read()
print content

Actually, it works with just this one additional header:

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
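For reference, a minimal, untested sketch of the same request with only that one header (same urllib2 / Python 2 setup as above):

import urllib2

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
# Only the Accept header, nothing else:
req = urllib2.Request(site, headers={'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
print urllib2.urlopen(req).read()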

urllib.error.HTTPError: HTTP Error 403: Forbidden with urllib.request

The server at prntscr.com is actively rejecting your request. There are many reasons why that could be; some sites check the caller's User-Agent. In my case, I used httpie to test whether the site would let me download through a non-browser app, and it worked. So I simply made up a User-Agent header to see if the problem was just the missing user-agent.

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyApp/1.0')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")

It worked! I don't know exactly what logic the server uses; for instance, a standard Mozilla/5.0 did not work. You won't always run into this issue (most sites are pretty lax about what they allow as long as you are reasonable), but when you do, try playing with the User-Agent. If nothing works, try using the same User-Agent string as your browser.
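If you want to probe this systematically, a rough sketch of a trial loop (the candidate strings are just examples, not anything the server documents):

import urllib.request
import urllib.error

candidates = ['MyApp/1.0', 'Mozilla/5.0', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)']
url = "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png"

for ua in candidates:
    req = urllib.request.Request(url, headers={'User-Agent': ua})
    try:
        urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        print(ua, '->', e.code)   # e.g. 403 when the server rejects this agent
    else:
        print(ua, '-> accepted')
        break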

Problem HTTP error 403 in Python 3 Web Scraping

This is probably because of mod_security or some similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3, which is easily detected). Try setting a known browser user agent with:

from urllib.request import Request, urlopen

req = Request(
    url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

This works for me.

By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.
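For what it's worth, here is the difference the missing parentheses make (a quick sketch against example.com):

from urllib.request import Request, urlopen

page = urlopen(Request('http://example.com', headers={'User-Agent': 'Mozilla/5.0'}))
print(page.read)    # <bound method HTTPResponse.read of ...> -- the method object itself
print(page.read())  # the actual response body, as bytes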

TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...
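As an aside on the "easily detected" default mentioned above, you can inspect what urllib identifies itself as (the exact version string depends on your interpreter):

import urllib.request

opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.3')]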

Web Scraping getting error (HTTP Error 403: Forbidden) using urllib

I had no problems using the requests package. I did need to add a User-Agent header; without it, I was getting the same issue as you. Try this:

import requests

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

def get_data(link):
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}

    req = requests.get(link, headers=hdr)
    content = req.content

    return content

data = get_data(test_URL)
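get_data returns raw bytes; to peek at the filing or keep it around you could, for example (filename taken from the URL, adjust as needed):

print(data[:200].decode('utf-8', errors='replace'))  # quick look at the start

with open('0000950170-98-000413.txt', 'wb') as f:    # save the raw bytes
    f.write(data)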

urllib.error.HTTPError: HTTP Error 403: Forbidden for urlretrieve

Maybe this helps you:

import requests
from bs4 import BeautifulSoup

def save_image(response, name, path=""):
    with open(path + name, "wb") as f:
        f.write(response.content)

session = requests.Session()

headers = {
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
}
session.headers = headers
response = session.get("https://manganelo.com/chapter/oa919470/chapter_23")
if response.status_code == 200:
    # the image host checks the Referer; send the chapter page we came from
    headers["Referer"] = "https://manganelo.com/chapter/oa919470/chapter_23"
    session.headers = headers
    soup = BeautifulSoup(response.text, "lxml")
    for img in soup.find_all("img"):
        url = img["src"]
        name = url.split("/")[-1]

        print(url)
        r = session.get(url)
        print(r)
        if r.status_code == 200:
            save_image(r, name, path="")
        else:
            print("Houston, we have a problem...")
else:
    print(response)

For more details, see requests.get returns 403 while the same url works in browser.
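If you specifically need urlretrieve, as in the question title, the same Referer trick should carry over to urllib; an untested sketch (img_url is a placeholder for one of the scraped src values):

import urllib.request

img_url = "https://example.com/page_image.png"  # placeholder: one of the <img src> URLs

opener = urllib.request.build_opener()
opener.addheaders = [('Accept', '*/*'),
                     ('Referer', 'https://manganelo.com/chapter/oa919470/chapter_23')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(img_url, img_url.split("/")[-1])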

Beautiful Soup - urllib.error.HTTPError: HTTP Error 403: Forbidden

Don't use urllib.request.urlretrieve. Instead, use the requests library like this:

import requests

url = 'https://goodlogo.com/images/logos/small/nike_classic_logo_2355.gif'
path = "D:\\Test.gif"

response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

with open(path, "wb") as file:
    file.write(response.content)


Hope that this helps!


