A Good Way to Get the Charset/Encoding of an Http Response in Python

A good way to get the charset/encoding of an HTTP response in Python

To parse an HTTP Content-Type header you could use cgi.parse_header():

_, params = cgi.parse_header('text/html; charset=utf-8')
print params['charset'] # -> utf-8
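
Note that the cgi module has since been deprecated (and removed in Python 3.13); a rough stand-in using email.message, assuming you only need the charset parameter:

from email.message import Message

msg = Message()
msg['Content-Type'] = 'text/html; charset=utf-8'
print(msg.get_content_type())    # -> text/html
print(msg.get_param('charset'))  # -> utf-8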

Or using the response object:

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)
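
The Python 3 equivalent, as a small sketch using urllib.request:

from urllib.request import urlopen

response = urlopen('http://example.com')
charset = response.headers.get_content_charset(failobj='utf-8')  # from Content-Type
html = response.read().decode(charset)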

In general the server may lie about the encoding, not report it at all (the default depends on the content type), or the encoding might only be specified inside the response body, e.g., in a <meta> element for HTML documents or in the XML declaration for XML documents. As a last resort the encoding can be guessed from the content itself.

You could use requests to get Unicode text:

import requests # pip install requests

r = requests.get(url)
unicode_str = r.text # may use `chardet` to auto-detect encoding

Or BeautifulSoup to parse html (and convert to Unicode as a side-effect):

from bs4 import BeautifulSoup # pip install beautifulsoup4

soup = BeautifulSoup(urllib2.urlopen(url)) # may use `cchardet` for speed
# ...

Or bs4.UnicodeDammit directly for arbitrary content (not necessarily an html):

from bs4 import UnicodeDammit

dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# -> Sacré bleu!
print(dammit.original_encoding)
# -> utf-8

python requests.get() returns improperly decoded text instead of UTF-8?

From the requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

Check the encoding requests used for your page, and if it's not the right one, force it to be the one you need.

Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding, that's all.

How does requests determine the encoding of a response?

Requests extracts the encoding from the charset parameter of the response's Content-Type header. If no charset is found in the header and the content type is a "text" type, ISO-8859-1 (latin-1) is assumed. Otherwise r.encoding is left unset and the response's apparent_encoding property is evaluated and used to decode r.text.

apparent_encoding is determined by using the chardet library to guess the encoding of the response body.

In the case of the URL in the question, the encoding is declared in the Content-Type header:

>>> r.headers['Content-Type']
'text/html; charset=gbk'

so r.apparent_encoding is not evaluated until it is explicitly accessed by executing print(r.apparent_encoding).

In this particular case, chardet seems to get it wrong: the response body can be decoded with the gbk codec, but not with GB2312, which is what chardet guesses.
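
If either the header or chardet gets it wrong, you can force the codec before touching r.text; a minimal sketch, with a placeholder URL standing in for the gbk page from the question:

import requests

r = requests.get('http://example.com/gbk-page')  # placeholder URL
print(r.encoding)            # e.g. 'gbk', taken from the Content-Type header
print(r.apparent_encoding)   # chardet's guess, computed lazily on access

r.encoding = 'gbk'           # force the codec used when accessing r.text
text = r.text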

scrape with correct character encoding (python requests + beautifulsoup)

In general, instead of using r.content, which is the raw byte string received, use r.text, which is the content decoded using the encoding determined by requests.

In this case requests will use UTF-8 to decode the incoming byte string because this is the encoding reported by the server in the Content-Type header:

import requests
from bs4 import BeautifulSoup  # needed for the soup examples below

r = requests.get('http://fm4-archiv.at/files.php?cat=106')

>>> type(r.content) # raw content
<class 'bytes'>
>>> type(r.text) # decoded to unicode
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'

>>> soup = BeautifulSoup(r.text, 'lxml')

That fixes the "Wildlöwenpfleger" problem; however, other parts of the page then break, for example:

>>> soup = BeautifulSoup(r.text, 'lxml')     # using decoded string... should work
>>> soup.find_all('a')[39]
<a href="details.php?file=1882">Der Wildlöwenpfleger</a>
>>> soup.find_all('a')[10]
<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)

This shows that "Wildlöwenpfleger" is fixed, but now "übergeben" and other text in the second link is broken.

It appears that multiple encodings are used in the one HTML document. The first link uses UTF-8 encoding:

>>> r.content[8013:8070].decode('iso-8859-1')
'<a href="details.php?file=1882">Der WildlÃ¶wenpfleger</a>'

>>> r.content[8013:8070].decode('utf8')
'<a href="details.php?file=1882">Der Wildlöwenpfleger</a>'

but the second link uses ISO-8859-1 encoding:

>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

Obviously it is incorrect to use multiple encodings in the same HTML document. Other than contacting the document's author and asking for a correction, there is not much that you can easily do to handle the mixed encoding. Perhaps you can run chardet.detect() over the data as you process it, but it's not going to be pleasant.
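
If you do want to try, a rough per-chunk sketch (the helper name and the UTF-8-first ordering are my own choices, not part of the original answer):

import chardet

def decode_best_effort(chunk):
    """Try UTF-8 first, then fall back to chardet's guess for this chunk."""
    try:
        return chunk.decode('utf-8')
    except UnicodeDecodeError:
        guess = chardet.detect(chunk)['encoding'] or 'iso-8859-1'
        return chunk.decode(guess, errors='replace')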

How to determine the encoding of text

EDIT: chardet seems to be unmaintained, but most of the answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding in all cases is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
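
A minimal usage sketch (the file name is just an example):

import chardet  # pip install chardet

raw = open('unknown.txt', 'rb').read()
guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')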

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

Parsing HTTP Response in Python

When I printed response.read() I noticed that b was prepended to the string (e.g. b'{"a":1,..). The "b" stands for bytes and indicates the type of object you're handling. Since I knew that a string could be converted to a dict using json.loads('string'), I just had to convert the byte type to a string type. I did this by decoding the response as UTF-8 with decode('utf-8'). Once it was a string, my problem was solved and I was easily able to iterate over the dict.

I don't know if this is the fastest or most 'Pythonic' way of writing this, but it works, and there's always time later for optimization and improvement! Full code for my solution:

from urllib.request import urlopen
import json

# Get the dataset
url = 'http://www.quandl.com/api/v1/datasets/FRED/GDP.json'
response = urlopen(url)

# Convert bytes to string type and string type to dict
string = response.read().decode('utf-8')
json_obj = json.loads(string)

print(json_obj['source_name']) # prints the string with 'source_name' key
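
As a side note, since Python 3.6 json.loads() also accepts bytes directly (it detects UTF-8/UTF-16/UTF-32 on its own), so the explicit decode step can be skipped:

json_obj = json.loads(urlopen(url).read())  # works on Python 3.6+ without .decode()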

what is the default encoding when python Requests post data is string type?

If you actually try your example you will find:

$ python
Python 3.7.2 (default, Jan 29 2019, 13:41:02)
[Clang 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> payload = '''
... 工作报告
... 总体情况:良好
... '''
>>> r = requests.post("http://127.0.0.1:8888/post", data=payload)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv/lib/python3.7/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/tmp/venv/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/tmp/venv/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/tmp/venv/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/tmp/venv/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/tmp/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/tmp/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/tmp/venv/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/tmp/venv/lib/python3.7/http/client.py", line 1274, in _send_request
    body = _encode(body, 'body')
  File "/tmp/venv/lib/python3.7/http/client.py", line 160, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 2-5: Body ('工作报告') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

As described in Detecting the character encoding of an HTTP POST request, the default encoding for an HTTP POST is ISO-8859-1, aka Latin-1. And as the error message at the end of the traceback tells you, you can force the issue by encoding the body to a UTF-8 byte string; but then of course your server needs to be expecting UTF-8 too, or it will read your UTF-8 bytes as Latin-1 and end up with mojibake.
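
A minimal sketch of the client-side fix the error message suggests, reusing the local test server from the example (the explicit Content-Type header is my addition, so the server knows what it is receiving):

import requests

payload = '工作报告\n总体情况:良好'
r = requests.post(
    'http://127.0.0.1:8888/post',
    data=payload.encode('utf-8'),                           # send UTF-8 bytes
    headers={'Content-Type': 'text/plain; charset=utf-8'},  # declare the charset
)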

There is no way in the POST interface itself to enforce this, but your server could require clients to explicitly specify their content encoding via the charset parameter, and perhaps return a specific 4xx error code with an explicit error message if it is missing.

A somewhat less disciplined option is to have your server attempt to decode incoming POST requests as UTF-8 and reject the POST if that fails.

How to download any(!) webpage with correct charset in python?

I would use html5lib for this.
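
A small sketch of that, passing the header charset through as the transport encoding (html5lib's parse() accepts a transport_encoding hint; the URL is a placeholder):

import html5lib  # pip install html5lib
from urllib.request import urlopen

with urlopen('http://example.com') as resp:
    # html5lib applies the HTML5 encoding-sniffing algorithm:
    # BOM, transport charset, <meta> prescan, then detection/fallback
    document = html5lib.parse(
        resp.read(),
        transport_encoding=resp.headers.get_content_charset(),
    )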

urllib2 getparam charset returns None for some sites

conn.headers.getparam('charset') doesn't parse the HTML content (the <meta> tag); it only looks at HTTP headers (e.g., Content-Type).

You could use an HTML parser to get the character encoding if it is not specified in the HTTP headers.
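
A rough sketch of that fallback (Python 3, placeholder URL), letting BeautifulSoup report the encoding it detected via original_encoding:

from urllib.request import urlopen
from bs4 import BeautifulSoup

resp = urlopen('http://example.com')
charset = resp.headers.get_content_charset()  # from the Content-Type header only
if charset is None:
    # let bs4/UnicodeDammit sniff <meta> declarations and the bytes themselves
    soup = BeautifulSoup(resp.read(), 'html.parser')
    charset = soup.original_encoding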


