Changing User Agent in Python 3 for urllib.request.urlopen
From the Python docs:
import urllib.request

url = 'https://www.example.com/'  # placeholder; substitute the page you want to fetch
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
Changing user agent on urllib2.urlopen
Setting the User-Agent from everyone's favorite Dive Into Python.
The short story: You can use Request.add_header to do this.
You can also pass the headers as a dictionary when creating the Request itself, as the docs note:
headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).
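As a minimal sketch of the two options described above, written for Python 3's urllib.request rather than urllib2: the URL and the User-Agent string are placeholders chosen only for illustration, and no request is actually sent.

```python
import urllib.request

# Placeholder URL and User-Agent string, chosen only for illustration.
url = "https://www.example.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64)"

# Option 1: add the header after creating the Request.
req_a = urllib.request.Request(url)
req_a.add_header("User-Agent", user_agent)

# Option 2: pass the headers as a dictionary to the constructor.
req_b = urllib.request.Request(url, headers={"User-Agent": user_agent})

# urllib stores header names capitalized as "User-agent",
# so that is the spelling to use when reading them back.
print(req_a.get_header("User-agent"))
print(req_b.get_header("User-agent"))
```

Both requests end up with the same header, so which form to use is mostly a matter of taste.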
urlopen via urllib.request with valid User-Agent returns 405 error
You get the 405 response because you are sending a POST request instead of a GET request. "Method Not Allowed" has nothing to do with your User-Agent header. It means the HTTP request was sent with a method (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE) that the server does not accept for that URL.
urllib sends a POST because you include the data argument in the Request constructor, as is documented here:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
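The rule quoted above can be checked without sending anything over the network. This is a small sketch with a placeholder URL and query; it only builds Request objects and asks each one which method it would use.

```python
import urllib.parse
import urllib.request

# Placeholder URL and query parameters, for illustration only.
url = "https://www.example.com/search"
params = urllib.parse.urlencode({"q": "stackoverflow"})

# No data argument: the request defaults to GET.
# Query parameters belong in the URL itself.
get_req = urllib.request.Request(url + "?" + params)

# A data argument (bytes) switches the default method to POST.
post_req = urllib.request.Request(url, data=params.encode("ascii"))

# An explicit method argument overrides the default.
head_req = urllib.request.Request(url, method="HEAD")

for req in (get_req, post_req, head_req):
    print(req.get_method(), req.full_url)
```

So if a GET is intended, either leave data out and put the parameters in the query string, or pass method='GET' explicitly.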
It's highly recommended to use the requests library instead of urllib, because it has a much more sensible API.
import requests

response = requests.get('https://google.com/search', params={'q': 'stackoverflow'})
response.raise_for_status()  # raise exception if status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
    fp.write(response.text)
https://github.com/requests/requests
urlopen of urllib.request cannot open a page in python 3.7
urllib is a fairly old and small module. For web scraping, the requests module is recommended.
You can check out this answer for additional information.
Web scraping using python: urlopen returns HTTP Error 403: Forbidden
You might need to specify more headers. Try this:
import urllib.request

url = "https://www.fragrantica.com/perfume/Tom-Ford/Tobacco-Vanille-1825.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
request = urllib.request.Request(url=url, headers=headers)  # The assembled request
response = urllib.request.urlopen(request)
data = response.read()  # The data you need
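If the server still refuses the request, it can help to catch the error and look at the status code instead of letting the traceback end the script. The sketch below is one possible way to do that; fetch and its opener parameter are hypothetical names introduced here (opener exists only so the function can be exercised without network access), not part of urllib.

```python
import urllib.error
import urllib.request


def fetch(url, headers, opener=urllib.request.urlopen):
    """Return (status, body) for url, or (code, None) on an HTTP error.

    Hypothetical helper: `opener` defaults to urllib.request.urlopen and
    can be swapped for a stand-in when testing offline.
    """
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (e.g. 403),
        # err.reason is the accompanying phrase (e.g. "Forbidden").
        print(f"Server refused the request: {err.code} {err.reason}")
        return err.code, None
```

Calling fetch(url, headers) with the url and headers defined above would then return either the page body or the error code, which makes it easier to tell a 403 apart from a 405 or a 404.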
Problems with user agent on urllib
What you did above is clearly a mess. The code should not run at all. Try the approach below instead.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

URL = "https://hsreplay.net/meta/#tab=matchups&sortBy=winrate"
req = Request(URL, headers={"User-Agent": "Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res, "lxml")
name = soup.find("h1").text
print(name)
Output:
HSReplay.net
Btw, you can scrape the few items on that page that are not generated by JavaScript. However, the core content of that page is generated dynamically, so you can't grab it using urllib and BeautifulSoup. To get it, you need a browser simulator such as selenium.
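To illustrate the difference between static and dynamically generated content, here is a rough sketch that uses the standard library's html.parser instead of BeautifulSoup. The HTML string is a made-up stand-in for what urlopen(req).read() might return, not the real page source, and TextByTag is a hypothetical helper class.

```python
from html.parser import HTMLParser

# Made-up static HTML, standing in for the bytes a plain HTTP fetch returns.
# In a browser, the script would fill the empty div; urllib never runs it.
STATIC_HTML = """
<html><body>
<h1>HSReplay.net</h1>
<div id="matchups"></div>
<script>/* fills #matchups in the browser */</script>
</body></html>
"""


class TextByTag(HTMLParser):
    """Collect the text found directly inside each tag name."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open tags, innermost last
        self.text = {}    # tag name -> accumulated text

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        self.text.setdefault(tag, "")

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            self.text[self._stack[-1]] += data.strip()


parser = TextByTag()
parser.feed(STATIC_HTML)
print(parser.text["h1"])         # the heading is present in the static HTML
print(repr(parser.text["div"]))  # the div is empty until a browser runs the script
```

The heading comes through because it is part of the HTML the server sends, while the div stays empty because nothing executes the script.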