Does Python Urllib2 Automatically Uncompress Gzip Data Fetched from Webpage

Does python urllib2 automatically uncompress gzip data fetched from webpage?

  1. How can I tell if the data at a URL is gzipped?

This checks whether the response is gzip-encoded and decompresses it if so (Python 2):

import urllib2
from StringIO import StringIO
import gzip

request = urllib2.Request('http://example.com/')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()

  1. Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

No. urllib2 does not automatically uncompress the data, because it never sends the 'Accept-Encoding' header on its own; you have to add it yourself with request.add_header('Accept-Encoding', 'gzip, deflate'). And yes, response.read() always returns a string, but if the response was compressed, that string is the raw compressed bytes, not the decoded page.
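The same check-then-decompress pattern can be exercised without any network access. This is a Python 3 sketch (bytes and io.BytesIO instead of StringIO); `gunzip_if_needed` is a hypothetical helper name, not part of any library:

```python
import gzip
import io

def gunzip_if_needed(body, content_encoding):
    # Decompress only when the server said Content-Encoding: gzip;
    # otherwise the body is already plain and is returned untouched.
    if content_encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body

# Simulate a gzipped HTTP body (no network needed).
payload = b'<html>hello</html>'
compressed = gzip.compress(payload)

assert gunzip_if_needed(compressed, 'gzip') == payload
assert gunzip_if_needed(payload, None) == payload
```

The key point is the one the answer above makes: nothing decompresses for you, so your code must branch on the Content-Encoding header itself.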

python urllib2 returns garbage

This page is returned in gzip encoding.

(Try printing out response.headers['content-encoding'] to verify this.)

Most likely the web-site doesn't respect the 'Accept-Encoding' field in the request and simply assumes that the client supports gzip (most modern browsers do).

urllib2 doesn't decompress responses itself, but you can use the gzip module for that, as described e.g. in this thread: Does python urllib2 automatically uncompress gzip data fetched from webpage?
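For the deflate case specifically, the zlib module covers both variants servers send in practice. A minimal offline sketch (the payload here is made up for illustration):

```python
import zlib

payload = b'example deflated content'

# "deflate" in HTTP usually means the zlib format (RFC 1950 wrapper),
# which zlib.decompress handles with its default wbits.
zlib_wrapped = zlib.compress(payload)
assert zlib.decompress(zlib_wrapped) == payload

# Some servers instead send raw DEFLATE (RFC 1951, no header/checksum);
# a negative wbits tells zlib not to expect any wrapper.
raw = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_deflated = raw.compress(payload) + raw.flush()
assert zlib.decompress(raw_deflated, -zlib.MAX_WBITS) == payload
```

A robust client tries the wrapped form first and falls back to negative wbits when that fails.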

Urllib2 get garbled string instead of page source

Here is a possible way to get the source information using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# URL to request
url = "http://finance.sina.com.cn/china/20150905/065523161502.shtml"
r = requests.get(url)

# Use BeautifulSoup to parse the requested content
soup = BeautifulSoup(r.content, "lxml")
print soup

In Python, how do I decode GZIP encoding?

I use zlib to decompress gzipped content from the web (Python 3):

import urllib.request
import zlib

f = urllib.request.urlopen(url)
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)

Why does python urllib2 urlopen return something different from the browser on API call

You are retrieving GZIPped, compressed data; the server expressly tells you so with Content-Encoding: gzip. Either use the zlib library to decompress the data:

import zlib

# val holds the raw compressed response body read from the connection
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
data = decomp.decompress(val)

or use a library that supports transparent decompression if the response headers indicate compression has been used, like requests.
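The advantage of `zlib.decompressobj` over one-shot `zlib.decompress` is that it can consume the response incrementally, which matters for large bodies. A self-contained sketch feeding a gzipped buffer in small chunks:

```python
import gzip
import zlib

# Stand-in for a large compressed HTTP body.
gz = gzip.compress(b'a' * 10000)

decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
out = b''
for i in range(0, len(gz), 256):
    # Feed 256-byte slices, as if reading from a socket.
    out += decomp.decompress(gz[i:i + 256])
out += decomp.flush()  # drain any buffered tail

assert out == b'a' * 10000
```

With a real response you would replace the slicing loop with repeated `response.read(chunk_size)` calls.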

Urllib2 request for data from an HTTPS RSS feed returns garbage characters

Well, I didn't solve the problem with urllib2, but I figured out that you can use requests without specifying authorization, like this:

import requests
r = requests.get('https://api.github.com', verify=False)
print r.text

and that gets rid of the error, so you can read the data without a problem. (Note that verify=False disables SSL certificate verification; only use it if you accept that risk.)


