Python requests.get() returns improperly decoded text instead of UTF-8

python requests.get() returns improperly decoded text instead of UTF-8?

From the requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

Check the encoding requests used for your page, and if it's not the right one, force it to the one you need.

Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.
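As a minimal sketch (the URL here is a placeholder), you can check the guess and override it:

import requests

r = requests.get('https://example.com/page')  # placeholder URL
print(r.encoding)       # the encoding requests guessed from the HTTP headers
r.encoding = 'utf-8'    # override the guess if it is wrong
text = r.text           # now decoded with the forced encoding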

UTF-8 text from website is decoded improperly when using python 3 and requests, works well with Python 2 and mechanize

I'm able to get the lyrics properly with this code in python3.x:

import requests
from bs4 import BeautifulSoup

url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text, 'html.parser').find('div', class_='olyrictext').get_text())

Printing (truncated)

>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
'扉開けば\u3000捻れた昼の夜\r\n昨日どうやって帰った\u3000体だけ...'

A few things strike me as odd there, notably the \r\n (Windows line ending) and \u3000 (IDEOGRAPHIC SPACE), but those are probably not the problem

The one thing I noticed that's odd about the form submission (and why the browser emulator probably succeeds) is that the form uses multipart instead of urlencoded form data (signified by enctype="multipart/form-data").

Sending multipart form data is a little bit strange in requests; I had to poke around a bit and eventually found this, which helps show how to format the multipart data in a way that the backing server understands. To do it you have to abuse files= but pass None as the filename. "for humans", hah!

url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
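# each (None, value) tuple is sent as a plain multipart field with no filename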
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})

And the text is not mangled now!

>>> BeautifulSoup(resp2.text).find('body').get_text()
'\n扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)...'

(Note that this code should work in either Python 2 or Python 3)

Python requests response encoded in utf-8 but cannot be decoded

The problem is this line in your request headers:

"accept-encoding": "gzip, deflate, br",

That br requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively.

You can pip install brotli and register the decompressor, or just call it manually on res.content. But a simpler solution is to just remove the br:

"accept-encoding": "gzip, deflate",

… and then you should get gzip, which you and requests already know how to handle.
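If you do want to keep br, here is a rough sketch of the manual route, assuming the third-party brotli package and a urllib3 that does not already decompress Brotli itself (newer urllib3 versions do when a brotli package is installed, in which case the manual call is unnecessary). The URL is a placeholder:

import brotli     # pip install brotli
import requests

url = 'https://example.com/page'   # placeholder URL
res = requests.get(url, headers={"accept-encoding": "gzip, deflate, br"})

if res.headers.get("Content-Encoding") == "br":
    # res.content holds raw Brotli bytes; decompress by hand
    body = brotli.decompress(res.content).decode("utf-8")
else:
    body = res.text   # gzip/deflate were already handled transparently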

There are weird characters even though it's encoded utf-8

The Problem:

import requests
r = requests.get('link')
print(r.encoding)

Output: ISO-8859-1

The server is not sending an appropriate charset header, and requests doesn't parse <meta charset="utf-8" />, so it defaults to ISO-8859-1.

Solution 1: Tell requests what encoding to use

r.encoding = 'utf-8'
html_text = r.text

Solution 2: Do the decoding yourself

html_text = r.content.decode('utf-8')

Solution 3: Have requests take a guess

r.encoding = r.apparent_encoding
html_text = r.text

In any case, html_text will now contain the (correctly decoded) html source and can be fed to BeautifulSoup.

The encoding setting of BeautifulSoup didn't help, because at that point you already had a wrongly decoded string!
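For instance, a minimal sketch of Solution 3 feeding into BeautifulSoup (the URL is a placeholder; assumes bs4 is installed):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com/page')   # placeholder URL
r.encoding = r.apparent_encoding               # let requests sniff the body
soup = BeautifulSoup(r.text, 'html.parser')    # r.text is now decoded correctly
print(soup.title)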

How to fix incorrectly UTF-8 decoded string?

The requests library tries to guess the encoding of the response.
It's possible requests is decoding the response as cp1252 (aka Windows-1252).

I'm guessing this because if you take that text and encode it back to cp1252 and then decode it as utf-8, you'll see the correct text:

>>> 'criança'.encode('cp1252').decode('utf-8')
'criança'

Based on that, I'd guess that if you ask your response object what encoding it guessed, it'll tell you cp1252:

>>> response.encoding
'cp1252'

Forcing requests to decode as utf-8 instead, like this, will probably fix your issue:

>>> response.encoding = 'utf-8'
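If you only have the mangled string (and not the response object), the same round-trip can be wrapped in a small hypothetical helper:

def fix_mojibake(s):
    # repair a string that was decoded as cp1252 but is really UTF-8
    try:
        return s.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # not a cp1252/UTF-8 mix-up; leave it unchanged

>>> fix_mojibake('criança')
'criança'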

response.text is printing only special symbols for a plain-text response

When handling a response from an arbitrary GET request, you should always inspect response.headers.

The header with key Content-Type tells you the MIME type of the response, like text/html or application/json, and often its encoding, like UTF-8.

In your case the result of response.headers['Content-Type'] probably would return "text/html; charset=UTF-8".

So you know that you need to decode the response from UTF-8, as Parvat. R commented, via r.content.decode('utf-8').

Here we can:

  • either use response.encoding to dynamically decode the response.text based on response's given encoding
  • or we can simply use response.content to get the bytes as binary representation (e.g. b'\x833\x01')

Since you claim the response was text/html (as seen in the browser), you can simply decode the textual representation and append it to the text file:

import requests

s = requests.Session()
r = s.get(url, headers=headers)  # url and headers as defined earlier
print(r.text)

if r.status_code == 200:
    print("Generated Successfully")

    # detect encoding and decode accordingly
    print("Response encoding:", r.encoding)
    body_text = r.content.decode(r.encoding)  # decode the raw bytes with the reported encoding
    with open("Alt.txt", 'a', encoding='utf-8') as f:
        f.write(body_text + '\n')  # append the decoded body to the file
else:
    print("BAD Request " + str(r.status_code))
    s.cookies.clear()

See also:
python requests.get() returns improperly decoded text instead of UTF-8?

How to get Chinese content using the requests lib

The content property returns bytes data, not text. You can turn it into text by decoding it:

result = con.content.decode('utf-8')

This will return unicode text.

Alternatively you can use the text property instead:

result = con.text

However, Baidu doesn't send a correct charset header, so con.text will use the wrong encoding and return garbage. You can fix this by manually setting the encoding property, though:

con.encoding = 'utf-8'
result = con.text
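Putting it together, a minimal sketch (Baidu's front page is just an example target):

import requests

con = requests.get('https://www.baidu.com')
con.encoding = 'utf-8'     # Baidu omits the charset, so set it explicitly
print(con.text[:200])      # the Chinese text now decodes correctly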

