Changing User Agent in Python 3 for urllib.request.urlopen

From the Python docs:

import urllib.request

url = 'https://www.example.com/'  # placeholder: substitute the page you want to fetch

req = urllib.request.Request(
    url,
    data=None,  # no request body, so urllib sends a GET
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))

Changing user agent on urllib2.urlopen

Setting the User-Agent from everyone's favorite Dive Into Python.

The short story: You can use Request.add_header to do this.
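As a minimal sketch of the add_header route (the URL and the User-Agent string here are placeholders, not values from the original answer), something like this should work with Python 3's urllib.request:

```python
import urllib.request

url = "https://www.example.com/"  # placeholder URL, substitute your own
req = urllib.request.Request(url)

# Attach the header after the Request has been created.
req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")

# urllib stores header names via str.capitalize(), so the key reads "User-agent".
print(req.get_header("User-agent"))

# Uncomment to actually send the request:
# response = urllib.request.urlopen(req)
```

Nothing is sent over the network until urlopen is called, so the header can be inspected or changed before then.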

You can also pass the headers as a dictionary when creating the Request itself, as the docs note:

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).
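In Python 3 the same default lives on the opener object. The sketch below (the replacement User-Agent string is just an illustrative value) reads the default that urllib would send and then swaps it for every request made through that opener:

```python
import urllib.request

# A fresh opener carries the default headers urllib adds to each request.
opener = urllib.request.build_opener()
default_agent = next(
    value for name, value in opener.addheaders if name.lower() == "user-agent"
)
print(default_agent)  # "Python-urllib/" followed by the running Python version

# Replace the default headers for all requests made through this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")]

# Optional: make plain urllib.request.urlopen() calls use this opener too.
urllib.request.install_opener(opener)
```

This can be handy when many urlopen calls are scattered through a script and building a Request with headers for each one would be tedious.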

urlopen via urllib.request with valid User-Agent returns 405 error

You get the 405 response because you are sending a POST request instead of a GET request. A 405 (Method Not Allowed) should not have anything to do with your User-Agent header. It means the server does not accept the HTTP method you used (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE) for that URL.

urllib sends a POST because you pass the data argument to the Request constructor, as documented here:

https://docs.python.org/3/library/urllib.request.html#urllib.request.Request

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
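The rule quoted above can be checked without touching the network, since get_method() only inspects the Request object. This is a small sketch with a placeholder URL and query; it also shows one way to keep search parameters in a GET by encoding them into the URL instead of passing data:

```python
import urllib.parse
import urllib.request

url = "https://www.example.com/search"  # placeholder URL
query = {"q": "stackoverflow"}
encoded = urllib.parse.urlencode(query)

# No data argument: parameters go in the query string and urllib issues a GET.
get_req = urllib.request.Request(url + "?" + encoded)

# data supplied (it must be bytes): urllib switches to POST.
post_req = urllib.request.Request(url, data=encoded.encode("ascii"))

# An explicit method overrides the default chosen from data.
put_req = urllib.request.Request(url, data=encoded.encode("ascii"), method="PUT")

print(get_req.get_method(), post_req.get_method(), put_req.get_method())
# GET POST PUT
```

So if a server answers 405 to a request that carries data, dropping the data argument (or moving the parameters into the URL as above) is the first thing to try.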

It's highly recommended to use the requests library instead of urllib, because it has a much more sensible API.

import requests

# params is encoded into the query string, so this stays a GET request
response = requests.get('https://google.com/search', params={'q': 'stackoverflow'})
response.raise_for_status()  # raise an exception if the status code is 4xx or 5xx
with open('googlesearch.txt', 'w', encoding='utf-8') as fp:
    fp.write(response.text)

https://github.com/requests/requests

urlopen of urllib.request cannot open a page in python 3.7

urllib is a fairly old and minimal module. For web scraping, the requests module is recommended.
You can check out this answer for additional information.

Web scraping using python: urlopen returns HTTP Error 403: Forbidden

You might need to specify more headers, try this:

import urllib.request

url = "https://www.fragrantica.com/perfume/Tom-Ford/Tobacco-Vanille-1825.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'identity',  # ask for an uncompressed body; urllib does not decompress gzip for you
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

request = urllib.request.Request(url=url, headers=headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()  # the response body, as bytes
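To see why a bare urlopen call can get a 403 while a browser-like User-Agent gets through, here is a self-contained sketch that runs a tiny local server and requests it both ways. PickyHandler and fetch_status are hypothetical names made up for this demo, and the filtering rule is an assumption: a real site may check other headers, cookies, or the client's IP as well.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical server that rejects urllib's default User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if agent.startswith("Python-urllib"):
            self.send_error(403, "Forbidden")
            return
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


# ProxyHandler({}) ignores proxy environment variables so localhost is reached directly.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(url, headers=None):
    """Return the HTTP status code, whether the request succeeds or fails."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as HTTPError


server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

default_status = fetch_status(url)
spoofed_status = fetch_status(url, {"User-Agent": "Mozilla/5.0"})

server.shutdown()
server.server_close()
print(default_status, spoofed_status)
```

The first call goes out with urllib's default User-Agent and is expected to be refused, while the second one carries the custom header. Catching urllib.error.HTTPError like this is also a convenient way to log the status code when a real site rejects a request.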

Problems with user agent on urllib

The code in the question should not run at all as written. Try the approach below instead.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

URL = "https://hsreplay.net/meta/#tab=matchups&sortBy=winrate"

req = Request(URL, headers={"User-Agent": "Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res, "lxml")  # requires the lxml parser to be installed
name = soup.find("h1").text
print(name)

Output:

HSReplay.net

By the way, you can scrape the few items on that page that are not rendered by JavaScript. However, the core content of the page is generated dynamically, so you can't grab it with urllib and BeautifulSoup. To get it, you need a browser automation tool such as Selenium.


