Does python urllib2 automatically uncompress gzip data fetched from webpage?
- How can I tell if the data at a URL is gzipped?
This checks if the content is gzipped and decompresses it (note the import of urllib2, which the original snippet omitted):
import urllib2
import gzip
from StringIO import StringIO

request = urllib2.Request('http://example.com/')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
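The same decompression step can be tried offline, without a server; a minimal sketch in Python 3, where io.BytesIO takes the place of StringIO and the compressed bytes stand in for response.read():

```python
import gzip
import io

# Simulate a gzip-encoded response body (no network needed).
original = b'hello gzip world'
body = gzip.compress(original)

# The same decompression step as above: wrap the raw bytes in a
# file-like buffer and let GzipFile inflate them.
buf = io.BytesIO(body)
data = gzip.GzipFile(fileobj=buf).read()
print(data)  # b'hello gzip world'
```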
- Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?
No. urllib2 does not automatically uncompress the data, because the 'Accept-Encoding' header is not set by urllib2 itself; you set it yourself with: request.add_header('Accept-Encoding', 'gzip, deflate'). The body you read back is a byte string (str in Python 2), still compressed if the server sent gzip.
python urllib2 returns garbage
This page is returned in gzip encoding. (Try printing out response.headers['content-encoding'] to verify this.)
Most likely the web site does not respect the 'Accept-Encoding' field in the request and assumes that the client supports gzip (most modern browsers do). urllib2 does not decompress responses for you, but you can use the gzip module for that, as described e.g. in this thread: Does python urllib2 automatically uncompress gzip data fetched from webpage?
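On Python 3 the gzip module can inflate such a body in a single call; a minimal offline sketch, where the compressed bytes stand in for what response.read() would return:

```python
import gzip

# Stand-in for the compressed bytes a server sends when it returns
# gzip regardless of the Accept-Encoding header.
compressed_body = gzip.compress(b'<html>page content</html>')

# gzip.decompress handles the gzip header and checksum for you.
page = gzip.decompress(compressed_body)
print(page)  # b'<html>page content</html>'
```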
Urllib2 get garbled string instead of page source
Here is a possible way to get the source information using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

# URL to request
url = "http://finance.sina.com.cn/china/20150905/065523161502.shtml"
r = requests.get(url)

# Use BeautifulSoup to parse the requested content
soup = BeautifulSoup(r.content, "lxml")
print soup
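If network access is unavailable, BeautifulSoup itself works on any HTML string; a small sketch using the stdlib 'html.parser' backend instead of lxml (the inline document here is a made-up stand-in for r.content):

```python
from bs4 import BeautifulSoup

# A small inline document standing in for r.content.
html = '<html><head><title>Demo</title></head><body><p>Hi</p></body></html>'

# 'html.parser' is the stdlib backend; no lxml install required.
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # Demo
```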
In Python, how do I decode GZIP encoding?
I use zlib to decompress gzipped content from the web.
import zlib
import urllib.request

f = urllib.request.urlopen(url)
# 16 + MAX_WBITS tells zlib to expect a gzip header
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
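The 16 + zlib.MAX_WBITS argument is what makes zlib accept a gzip wrapper; a quick offline comparison of the wbits values (the variable names are mine, not from the answer):

```python
import gzip
import zlib

payload = b'wbits demo'

# A gzip stream: header + deflate data + CRC trailer.
gz = gzip.compress(payload)

# 16 + MAX_WBITS: expect a gzip header.
assert zlib.decompress(gz, 16 + zlib.MAX_WBITS) == payload

# 32 + MAX_WBITS: auto-detect gzip or zlib headers.
assert zlib.decompress(gz, 32 + zlib.MAX_WBITS) == payload

# Plain MAX_WBITS expects a zlib header, so it rejects gzip data.
try:
    zlib.decompress(gz, zlib.MAX_WBITS)
except zlib.error:
    print('plain zlib wbits cannot read a gzip stream')
```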
Why does python urllib2 urlopen return something different from the browser on API call
You are retrieving gzip-compressed data; the server expressly tells you so with the Content-Encoding: gzip header. Either use the zlib library to decompress the data:
import zlib
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
data = decomp.decompress(val)
or use a library such as requests, which transparently decompresses the response when the headers indicate compression has been used.
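The decompressobj form also works incrementally, which matters for large responses; a sketch feeding the stream in chunks, with a locally generated gzip body standing in for network data:

```python
import gzip
import zlib

# A gzip stream standing in for a large response body.
stream = gzip.compress(b'streamed ' * 1000)

# 16 + MAX_WBITS: the object expects a gzip header, as above.
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)

# Feed the data in small chunks instead of all at once.
out = b''
for i in range(0, len(stream), 64):
    out += decomp.decompress(stream[i:i + 64])
out += decomp.flush()

assert out == b'streamed ' * 1000
```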
Urllib2 request for data from an HTTPS RSS feed returns garbage characters
Well, I didn't solve the problem with urllib2, but I figured out that you can use requests without specifying authorization, like this:
import requests

r = requests.get('https://api.github.com', verify=False)
print r.text
and that gets rid of the error, so you can read the data without a problem. (Note that verify=False disables SSL certificate verification; only use it for testing.)