urllib2.HTTPError: HTTP Error 403: Forbidden
By adding a few more headers I was able to get the data:
# Python 2 (urllib2)
import urllib2, cookielib

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()

content = page.read()
print content
Actually, it works with just this one additional header:
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
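For reference, here is a Python 3 sketch of the same idea using urllib.request, under the assumption that the server really only checks the Accept header (the URL is the one from the answer above):

```python
from urllib.request import Request, urlopen

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"

# Only the Accept header is set; everything else stays at urllib's defaults.
req = Request(site, headers={
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
})

# Request normalizes header names to "Capitalized" form internally.
print(req.get_header('Accept'))
# content = urlopen(req).read()  # performs the actual download (needs network)
```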
urllib.error.HTTPError: HTTP Error 403: Forbidden with urllib.request
The server at prntscr.com
is actively rejecting your request. There are many reasons why that could be. Some sites check the caller's user agent to decide whether to serve the request. In my case, I used httpie to test whether the site would allow a download from a non-browser app. It worked. So I simply made up a User-Agent header to see whether the problem was just the lack of one.
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyApp/1.0')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")
It worked! Now I don't know what logic the server uses. For instance, I tried a standard Mozilla/5.0
string and that did not work. You won't always encounter this issue (most sites are pretty lax about what they allow, as long as you are reasonable), but when you do, try playing with the User-Agent. If nothing else works, try using the same User-Agent string as your browser.
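One way to automate that experimentation is a small helper that tries several User-Agent strings in turn. This is only a sketch, not part of the original answer; fetch_with_fallback and the candidate strings below are made up for illustration:

```python
import urllib.request
from urllib.error import HTTPError

def fetch_with_fallback(url, user_agents):
    """Try each User-Agent in turn until one is not rejected with 403."""
    for ua in user_agents:
        req = urllib.request.Request(url, headers={'User-Agent': ua})
        try:
            return urllib.request.urlopen(req).read()
        except HTTPError as e:
            if e.code != 403:
                raise  # a different error: don't keep retrying
    raise HTTPError(url, 403, 'all user agents rejected', None, None)

# Usage (network required):
# data = fetch_with_fallback(
#     "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
#     ['MyApp/1.0', 'Mozilla/5.0'])
```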
Problem HTTP error 403 in Python 3 Web Scraping
This is probably because of mod_security
or some similar server security feature which blocks known spider/bot user agents (urllib
sends something like Python-urllib/3.3
by default, which is easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen
req = Request(
    url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()
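To see why that default is so easy to detect, you can inspect the headers urllib sends when none are supplied (the exact version number depends on your interpreter):

```python
import urllib.request

# build_opener() pre-populates addheaders with urllib's default identity,
# e.g. ('User-agent', 'Python-urllib/3.11') -- exactly the string that
# mod_security-style filters match on.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers['User-agent'])  # e.g. Python-urllib/3.11
```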
This works for me.
By the way, in your code you are missing the ()
after .read
in the urlopen
line, but I think that it's a typo.
TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib
for some reason...
Web Scraping getting error (HTTP Error 403: Forbidden) using urllib
I had no problems using the requests
package. I did need to add a user-agent header; without it, I was getting the same issue as you. Try this:
import requests

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

def get_data(link):
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
    req = requests.get(link, headers=hdr)
    content = req.content
    return content

data = get_data(test_URL)
urllib.error.HTTPError: HTTP Error 403: Forbidden for urlretrieve
Maybe this helps you:
import requests
from bs4 import BeautifulSoup

def save_image(response, name, path=""):
    with open(path + name, "wb") as f:
        f.write(response.content)

session = requests.Session()
headers = {
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
}
session.headers = headers

response = session.get("https://manganelo.com/chapter/oa919470/chapter_23")

if response.status_code == 200:
    # Some image hosts also check the Referer header, so send it on
    # the follow-up image requests.
    headers["Referer"] = "https://manganelo.com/chapter/oa919470/chapter_23"
    session.headers = headers
    soup = BeautifulSoup(response.text, "lxml")
    for img in soup.findAll("img"):
        url = img["src"]
        name = url.split("/")[-1]
        print(url)
        r = session.get(url)
        print(r)
        if r.status_code == 200:
            save_image(r, name, path="")
        else:
            print("Houston, we have a problem...")
else:
    print(response)
For more details, see: requests.get returns 403 while the same url works in browser
Beautiful Soup - urllib.error.HTTPError: HTTP Error 403: Forbidden
Don't use urllib.request.urlretrieve
. Instead, use the requests
library like this:
import requests

url = 'https://goodlogo.com/images/logos/small/nike_classic_logo_2355.gif'
path = "D:\\Test.gif"

response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

with open(path, "wb") as file:
    file.write(response.content)
Hope that this helps!