python requests.get() returns improperly decoded text instead of UTF-8?
From requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check the encoding requests used for your page, and if it's not the right one - try to force it to be the one you need.
Regarding the differences between requests and urllib.urlopen: they probably just use different ways to guess the encoding. That's all.
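For what it's worth, you can see requests' header-based guess directly via requests.utils.get_encoding_from_headers. With no charset parameter it falls back to ISO-8859-1 for text/* content, which is a common source of mojibake:

```python
from requests.utils import get_encoding_from_headers

# No charset parameter: requests falls back to ISO-8859-1 for text/*,
# per the old HTTP/1.1 default.
assert get_encoding_from_headers({"content-type": "text/html"}) == "ISO-8859-1"

# An explicit charset in the header wins:
assert get_encoding_from_headers({"content-type": "text/html; charset=utf-8"}) == "utf-8"
```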
UTF-8 text from website is decoded improperly when using python 3 and requests, works well with Python 2 and mechanize
I'm able to get the lyrics properly with this code in Python 3.x:
import requests
from bs4 import BeautifulSoup

url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text, 'html.parser').find('div', class_='olyrictext').get_text())
Printing (truncated)
>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
'扉開けば\u3000捻れた昼の夜\r\n昨日どうやって帰った\u3000体だけ...'
A few things strike me as odd there, notably the \r\n (Windows line ending) and \u3000 (IDEOGRAPHIC SPACE), but that's probably not the problem.
The one thing I noticed that's odd about the form submission (and why the browser emulator probably succeeds) is that the form uses multipart rather than urlencoded form data (signified by enctype="multipart/form-data").
Sending multipart form data is a little strange in requests; I had to poke around a bit and eventually found this, which helps show how to format the multipart data in a way the backing server understands. To do this you have to abuse the files parameter but pass None as the filename. "For humans," hah!
url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})
And the text is not mangled now!
>>> BeautifulSoup(resp2.text).find('body').get_text()
'\n扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)...'
(Note that this code should work in either Python 2 or Python 3.)
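To see what that None filename actually does without hitting the network, you can prepare the request and inspect the multipart body requests builds (the URL below is just a placeholder):

```python
import requests

# Prepare the request offline to inspect the multipart payload.
req = requests.Request(
    "POST",
    "http://example.com/cgi-bin/index.cgi",  # placeholder URL
    files={"text": (None, "some lyrics"), "state": (None, "output")},
)
body = req.prepare().body

# A None filename yields a plain form field: the part header carries
# name="text" but no filename= attribute.
assert b'name="text"' in body
assert b"filename" not in body
```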
Python requests response encoded in utf-8 but cannot be decoded
The problem is this line in your request headers:
"accept-encoding": "gzip, deflate, br",
That br requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively; you're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively.
You can pip install brotli and register the decompressor, or just call it manually on res.content. But a simpler solution is to just remove the br:
"accept-encoding": "gzip, deflate",
… and then you should get gzip, which you and requests already know how to handle.
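A minimal sketch of the header fix (the header values here are illustrative placeholders, not from your request):

```python
# Headers copied from Chrome's devtools; values are illustrative.
headers = {
    "user-agent": "Mozilla/5.0",
    "accept-encoding": "gzip, deflate, br",
}

# Drop "br": the server will fall back to gzip/deflate, which requests
# already decompresses transparently.
headers["accept-encoding"] = "gzip, deflate"
# resp = requests.get(url, headers=headers)  # resp.text now decodes normally
```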
There are weird characters even though it's encoded utf-8
The Problem:
import requests
r = requests.get('link')
print(r.encoding)
Output: ISO-8859-1
The server is not sending the appropriate charset header, and requests doesn't parse <meta charset="utf-8" />, so it defaults to ISO-8859-1.
Solution 1: Tell requests what encoding to use
r.encoding = 'utf-8'
html_text = r.text
Solution 2: Do the decoding yourself
html_text = r.content.decode('utf-8')
Solution 3: Have requests take a guess
r.encoding = r.apparent_encoding
html_text = r.text
In any case, html_text will now contain the (correctly decoded) HTML source and can be fed to BeautifulSoup.
The encoding setting of BeautifulSoup didn't help, because at that point you already had a wrongly decoded string!
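As a quick illustration of solutions 1 and 2, here's a sketch that builds a Response by hand (no network; the sample string is made up). A real requests.get() result behaves the same way:

```python
import requests

# Construct a Response manually to simulate a server that omits the charset.
resp = requests.models.Response()
resp._content = "Olá, criança".encode("utf-8")
resp.encoding = "ISO-8859-1"     # requests' fallback without a charset header

mojibake = resp.text             # wrong codec -> garbled text
assert mojibake != "Olá, criança"

resp.encoding = "utf-8"          # Solution 1: tell requests the encoding
assert resp.text == "Olá, criança"

# Solution 2: decode the raw bytes yourself
assert resp.content.decode("utf-8") == "Olá, criança"
```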
How to fix incorrectly UTF-8 decoded string?
The requests library tries to guess the encoding of the response.
It's possible requests is decoding the response as cp1252 (aka Windows-1252). I'm guessing this because if you take that text and encode it back to cp1252 and then decode it as utf-8, you'll see the correct text:
>>> 'criança'.encode('cp1252').decode('utf-8')
'criança'
Based on that, I'd guess that if you ask your response object what encoding it guessed, it'll tell you cp1252:
>>> response.encoding
'cp1252'
Forcing requests to decode as utf-8 instead, like this, will probably fix your issue:
>>> response.encoding = 'utf-8'
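If you only have the already-mangled string (not the response object), you can reverse the bad decode by re-encoding with the wrong codec and decoding with the right one. Note this round trip only works when every byte of the original UTF-8 happens to map in cp1252:

```python
# Simulate the bug: UTF-8 bytes decoded with cp1252.
mangled = "criança".encode("utf-8").decode("cp1252")
assert mangled == "crianÃ§a"

# Reverse it: re-encode with the wrong codec, decode with the right one.
fixed = mangled.encode("cp1252").decode("utf-8")
assert fixed == "criança"
```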
response.text is printing only special symbols for a plain-text response
For evaluating a response from an arbitrary GET request, you should always evaluate response.headers.
The header with key Content-Type tells you the MIME type of the response, like text/html or application/json, and its encoding, like UTF-8.
In your case, response.headers['Content-Type'] would probably return "text/html; charset=UTF-8".
So you know that you need to decode the response from UTF-8, as Parvat. R commented, via r.content.decode('utf-8').
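If you want to pull the charset out of such a header value yourself, the stdlib's email.message can parse it; this is just one way to do it, using the example value from above:

```python
from email.message import Message

# Parse a Content-Type header value with the stdlib.
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"

assert msg.get_content_type() == "text/html"   # MIME type, lowercased
assert msg.get_param("charset") == "UTF-8"     # charset parameter
```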
Here we can:
- either use response.encoding to dynamically decode response.text based on the response's given encoding,
- or simply use response.content to get the bytes as a binary representation (e.g. b'\x833\x01').
Since you claim the response was text/html (as seen in the browser), you could simply decode the textual representation and append it to the text file:
s = requests.Session()
r = s.get(url, headers=headers)
print(r.text)
if r.status_code == 200:
    print("Generated Successfully")
    # r.text is already a decoded str in Python 3; calling .decode() on it
    # again would raise AttributeError
    print("Response encoding", r.encoding)
    body_text = r.text
    with open("Alt.txt", 'a') as f:
        f.write(body_text + '\n')  # append body as string to file
else:
    print("BAD Request " + str(r.status_code))
s.cookies.clear()
See also:
python requests.get() returns improperly decoded text instead of UTF-8?
How to get Chinese content using request lib
The content property returns bytes data, not text. You can turn it into text by decoding it:
result = con.content.decode('utf-8')
This will return unicode text.
Alternatively you can use the text property instead:
result = con.text
However, Baidu doesn't send a correct charset header, so con.text will use the wrong encoding and return garbage. You can fix this by manually setting the encoding property:
con.encoding = 'utf-8'
result = con.text