Changing User Agent on Urllib2.Urlopen

Setting the User-Agent from everyone's favorite Dive Into Python.

The short story: You can use Request.add_header to do this.
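A minimal sketch of that approach (the agent string here is made up; in Python 2 the class lives in `urllib2`, in Python 3 in `urllib.request`, which is used below):

```python
import urllib.request  # Python 2: import urllib2 instead

# Build the request first, then attach the header with add_header().
req = urllib.request.Request('http://httpbin.org/headers')
req.add_header('User-Agent', 'MyScript/1.0')  # hypothetical agent string

# urllib capitalizes stored header names, so look it up as 'User-agent'.
print(req.get_header('User-agent'))  # MyScript/1.0
```

Nothing is sent until you pass the request to `urlopen`, so you can inspect the header before any network traffic happens.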

You can also pass the headers as a dictionary when creating the Request itself, as the docs note:

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).

Changing User Agent in Python 3 for urllib.request.urlopen

From the Python docs:

import urllib.request
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))

Custom user-agent with urllib2, python 2.7

Call this URL: http://httpbin.org/headers

The response body will contain your user agent. :-)

You can embed this in your code as needed. For now, all the code below shows is that this URL reports back the user agent you sent:

import urllib2

stuff = urllib2.urlopen("http://httpbin.org/headers").read()
print stuff
{
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/2.7",
    "X-Request-Id": "43jhc13b-3dj4-4eb5-8780-ad7cfs4790cd"
  }
}
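To make that httpbin check report a custom agent instead of the default, one option is an opener with addheaders (a sketch; the agent string is made up, and in Python 2.7 you would import urllib2 instead):

```python
import urllib.request  # Python 2.7: import urllib2 and use urllib2.*

# Every urlopen() call made after install_opener() will send this
# User-Agent, so httpbin would report it instead of Python-urllib.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyScript/1.0')]  # hypothetical name
urllib.request.install_opener(opener)
```

This sets a process-wide default, which is handy when you make many urlopen calls and don't want to build a Request object for each one.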

Hope that answers your question

Python3 - Urllib2 | Need to remove the User-Agent header completely

Doing the following fixed my problem.

headers = {
    "User-Agent": None
}

Unfortunately, I had to switch from urllib to the "requests" module, because urllib raises an error when a header value is None.

Thanks anyway for all the replies!
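For reference, a sketch of the requests approach that worked: preparing the request (no network needed) shows that a header set to None really is dropped.

```python
import requests

session = requests.Session()
req = requests.Request('GET', 'http://httpbin.org/headers',
                       headers={'User-Agent': None})
prepared = session.prepare_request(req)

# requests deletes headers whose value is None when merging with the
# session defaults, so no User-Agent header is sent at all.
print('User-Agent' in prepared.headers)  # False
```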

urlopen via urllib.request with valid User-Agent returns 405 error

You get the 405 response because you are sending a POST request instead of a GET request. "Method not allowed" should not have anything to do with your User-Agent header; it means you sent an HTTP request with a method (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE) that the endpoint does not accept.

urllib sends a POST because you include the data argument in the Request constructor, as documented here:

https://docs.python.org/3/library/urllib.request.html#urllib.request.Request

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.

It's highly recommended to use the requests library instead of urllib, because it has a much more sensible api.

import requests

response = requests.get('https://google.com/search', params={'q': 'stackoverflow'})
response.raise_for_status()  # raise an exception if the status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
    fp.write(response.text)

https://github.com/requests/requests

HTTP403 Error urllib2.urlopen(URL)

Google is using User-Agent filtering to prevent bots from interacting with its search service. You can observe this by comparing these results with curl(1) and optionally using the -A flag to change the User-Agent string:

$ curl -I 'http://www.google.com/search?q=something%20unusual'
HTTP/1.1 403 Forbidden
...

$ curl -I 'http://www.google.com/search?q=something%20unusual' -A 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
HTTP/1.1 200 OK

You should probably instead be using the Google Custom Search service to automate Google searches. Alternatively, you could set your own User-Agent header with the urllib2 library (instead of the default of something like "Python-urllib/2.6"), but this may contravene Google's terms of service.
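A sketch of that urllib2 route, written for Python 3's urllib.request (the same idea works with urllib2 in Python 2): quote the query with urlencode and attach a browser-like agent. The agent string below is the Firefox one from the curl example.

```python
import urllib.parse
import urllib.request  # Python 2: urllib2.Request / urllib.urlencode

# Build a properly quoted search URL.
params = urllib.parse.urlencode({'q': 'something unusual'})
url = 'http://www.google.com/search?' + params

# Browser-like agent, as in the curl -A example above.
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) '
                  'Gecko/20100101 Firefox/21.0',
})
print(req.full_url)  # http://www.google.com/search?q=something+unusual
```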

Problems with user agent on urllib

The code above will not run as written. Try the approach below instead.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

URL = "https://hsreplay.net/meta/#tab=matchups&sortBy=winrate"

req = Request(URL, headers={"User-Agent": "Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res, "lxml")
name = soup.find("h1").text
print(name)

Output:

HSReplay.net

By the way, you can scrape a few items from that page that are not rendered by JavaScript. However, the core content of the page is generated dynamically, so you can't grab it using urllib and BeautifulSoup; to get it, you need a browser automation tool such as Selenium.


