Changing user agent on urllib2.urlopen
Setting the User-Agent from everyone's favorite Dive Into Python.
The short story: You can use Request.add_header to do this.
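For instance, a minimal sketch using Python 3's urllib.request, where Request.add_header behaves the same as in urllib2 (the httpbin URL is just an example target):

```python
import urllib.request

# Build the request, then attach the User-Agent with add_header().
# (With urllib2 on Python 2 the calls are identical; only the import differs.)
req = urllib.request.Request("http://httpbin.org/headers")
req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) MyScript/1.0")

# Header keys are stored capitalized internally, so the lookup key
# is 'User-agent' rather than 'User-Agent'.
print(req.get_header("User-agent"))  # -> Mozilla/5.0 (X11; Linux x86_64) MyScript/1.0
```

No network call happens until urlopen(req); the header is simply attached to the request object.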
You can also pass the headers as a dictionary when creating the Request itself, as the docs note:
headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to "spoof" the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).
Changing User Agent in Python 3 for urllib.request.urlopen
From the Python docs:
import urllib.request

req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
Custom user-agent with urllib2, python 2.7
Call this url - http://httpbin.org/headers
The source code will have your user agent. :-)
You can embed this in your code however you want.
For now, all I want to show in the code below is that this URL will tell you your user agent:
import urllib2

stuff = urllib2.urlopen("http://httpbin.org/headers").read()
print stuff
{
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/2.7",
    "X-Request-Id": "43jhc13b-3dj4-4eb5-8780-ad7cfs4790cd"
  }
}
Hope that answers your question.
Python3 - Urllib2 | Need to remove the User-Agent header completely
Doing the following fixed my problem.
headers = {
"User-Agent": None
}
Unfortunately I had to switch from urllib to the "requests" module, because with urllib, using "None" threw an error.
Thanks anyway for all the replies!
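For reference, requests documents this behavior: when a header value is set to None, that header is dropped during header merging rather than sent. A minimal sketch that prepares the request locally so no network call is made (the httpbin URL is just an example):

```python
import requests

# requests normally injects a default User-Agent ("python-requests/x.y.z")
# from the session headers. Passing None for a header value tells the
# merge step to remove that header entirely.
session = requests.Session()
req = requests.Request("GET", "http://httpbin.org/headers",
                       headers={"User-Agent": None})
prepared = session.prepare_request(req)

print("User-Agent" in prepared.headers)  # -> False: the header was dropped
```

Inspecting the PreparedRequest is a convenient way to confirm what will actually go over the wire before sending anything.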
urlopen via urllib.request with valid User-Agent returns 405 error
You get the 405 response because you are sending a POST request instead of a GET request. "Method not allowed" should have nothing to do with your User-Agent header; it's about sending an HTTP request with a method the server does not accept (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE).
urllib sends a POST because you include the data argument in the Request constructor, as is documented here:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
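That rule is easy to verify locally, since the chosen method is visible on the Request object before anything is sent (the httpbin URLs are just stand-ins):

```python
import urllib.request

# Same idea, with and without a data payload: providing data flips the
# default method from GET to POST, which is what triggers the 405 above.
get_req = urllib.request.Request("http://httpbin.org/get")
post_req = urllib.request.Request("http://httpbin.org/post", data=b"q=test")

print(get_req.get_method())   # -> GET
print(post_req.get_method())  # -> POST
```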
It's highly recommended to use the requests library instead of urllib, because it has a much more sensible api.
import requests
response = requests.get('https://google.com/search', params={'q': 'stackoverflow'})
response.raise_for_status() # raise exception if status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
fp.write(response.text)
https://github.com/requests/requests
HTTP403 Error urllib2.urlopen(URL)
Google is using User-Agent filtering to prevent bots from interacting with its search service. You can observe this by comparing these results with curl(1), optionally using the -A flag to change the User-Agent string:
$ curl -I 'http://www.google.com/search?q=something%20unusual'
HTTP/1.1 403 Forbidden
...
$ curl -I 'http://www.google.com/search?q=something%20unusual' -A 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
HTTP/1.1 200 OK
You should probably instead be using the Google Custom Search service to automate Google searches. Alternatively, you could set your own User-Agent header with the urllib2 library (instead of the default of something like "Python-urllib/2.6"), but this may contravene Google's terms of service.
Problems with user agent on urllib
What you did above is clearly a mess; the code should not run at all. Try the way below instead.
from bs4 import BeautifulSoup
from urllib.request import Request,urlopen
URL = "https://hsreplay.net/meta/#tab=matchups&sortBy=winrate"
req = Request(URL,headers={"User-Agent":"Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res,"lxml")
name = soup.find("h1").text
print(name)
Output:
HSReplay.net
Btw, you can scrape a few items that are not JavaScript-generated from that page. However, the core content of the page is generated dynamically, so you can't grab it using urllib and BeautifulSoup. To get it you need a browser simulator like selenium.