How to convert a string to utf-8 in Python
In Python 2
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3
All strings are Unicode. The unicode function does not exist anymore. See the answer from @Noumenon.
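In Python 3 the job of the Python 2 unicode() call is done by bytes.decode(), and the opposite direction is str.encode(). A minimal sketch, using an illustrative sample string that is not from the original answer:

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = "Hi! \u00e9"                       # sample text, "Hi! é" (illustrative)
utf8_bytes = text.encode("utf-8")         # str -> bytes
round_trip = utf8_bytes.decode("utf-8")   # bytes -> str

print(type(text).__name__, type(utf8_bytes).__name__)  # str bytes
print(utf8_bytes)                                      # b'Hi! \xc3\xa9'
```

Strictly speaking, a Python 3 string is never "in UTF-8"; only the bytes produced by encode() are.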
Convert string of unknown encoding to UTF-8
"TrÃ¤ume groÃŸ" is a hint that you got something originally encoded as UTF-8, but your process read it as cp1252.
A possible way is to encode your string back to cp1252 and then correctly decode it as UTF-8:
print('"TrÃ¤ume groÃŸ"'.encode('cp1252').decode('utf8'))
gives as expected:
"Träume groß"
But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.
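One common place where the wrong codec sneaks in is a file opened without an explicit encoding. The sketch below assumes that scenario (the temporary file is only there to make the example self-contained) and shows the same bytes read both ways:

```python
import os
import tempfile

original = "Tr\u00e4ume gro\u00df"   # "Träume groß"

# Write the text as UTF-8 bytes to a throwaway file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(original.encode("utf-8"))
    path = handle.name

try:
    # Wrong: the UTF-8 bytes are interpreted as cp1252, producing mojibake.
    with open(path, encoding="cp1252") as f:
        wrong = f.read()
    # Right: name the real encoding at the point where the bytes are read.
    with open(path, encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(path)

print(right == original)   # True
print(wrong == original)   # False
```

If the bytes come from somewhere else (a socket, a database driver, a subprocess), the same idea applies: fix the decoding step at the source rather than repairing the text afterwards.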
Python – How do I convert an ASCII string into UTF-8?
If the input string contains the raw byte ordinals (such as \xc3\xa9/Ã© instead of é), use latin1 to encode it to bytes verbatim, then decode with the desired encoding.
>>> "pasÃ©".encode('latin1').decode()
'pasé'
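The reason latin1 works "verbatim" is that it maps code points 0 to 255 onto the byte values 0 to 255 one-to-one. A small sketch that checks this property and then repeats the repair with the mojibake written as explicit escapes:

```python
# latin1 (ISO-8859-1) is a one-to-one mapping between bytes 0..255
# and the code points U+0000..U+00FF.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin1")
print([ord(ch) for ch in as_text] == list(range(256)))  # True
print(as_text.encode("latin1") == all_bytes)            # True

# The mojibake string from above, spelled with escapes: p a s U+00C3 U+00A9
mojibake = "pas\u00c3\u00a9"
raw = mojibake.encode("latin1")      # b'pas\xc3\xa9'
repaired = raw.decode("utf-8")       # 'pasé'
print(raw)
```

If the string contains any character above U+00FF, encode('latin1') raises UnicodeEncodeError, which is a sign that the text was garbled by a different codec (often cp1252).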
Converting unicode string to utf-8
After fighting with Python for over an hour, I decided to look for a solution in another language. This is how my goal can be achieved in C#:
var s = "\u00c4\u008d";
var newS = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
File.WriteAllText(@"D:\tmp\test.txt", newS, Encoding.UTF8);
Finally! The file now contains č.
I therefore got inspired by this approach in C# and managed to come up with the following (seemingly) equivalent solution in Python:
>>> s = u"\u00c4\u008d"
>>> arr = bytearray(map(ord, s))
>>> print arr.decode("utf-8")
č
I'm not sure how good this solution is but it seems to work in my case.
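The snippet above is Python 2 (print statement, u"" literal). A rough Python 3 equivalent, offered as a sketch rather than as the author's code, replaces bytearray with bytes; it also shows that the map(ord, ...) step does the same thing as encoding with latin-1:

```python
s = "\u00c4\u008d"            # two code points that are really UTF-8 byte values

# Turn each code point into the byte with the same value.
# bytes() raises ValueError if any code point is above 255.
raw = bytes(map(ord, s))      # b'\xc4\x8d'
decoded = raw.decode("utf-8") # the single character U+010D (č)

print(raw)                            # b'\xc4\x8d'
print(s.encode("latin-1") == raw)     # True
print(f"U+{ord(decoded):04X}")        # U+010D
```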
Decoding string to UTF-8 if string is already defined as r'string' instead of b'string'
Try this:
import re
re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), stringa.encode('ascii')).decode('utf-8')
Explanation:
We use the regular expression rb'\\([0-7]{3})' (which matches a literal backslash \ followed by exactly 3 octal digits) and replace each occurrence by taking the three-digit code (match[1]), interpreting that as a number written in octal (int(_, 8)), and then replacing the original escape sequence with a single byte (bytes([_])).
We need to operate on bytes because the escape codes represent raw bytes, not Unicode characters. Only after we have unescaped those sequences can we decode the UTF-8 to a string.
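To make the one-liner easier to try, here is the same substitution wrapped in a function and applied to a sample value. The function name unescape_octal and the sample string are illustrative assumptions; only the regular expression and the replacement logic come from the answer:

```python
import re


def unescape_octal(escaped: str) -> str:
    """Replace literal \\NNN octal escapes with raw bytes, then decode as UTF-8."""
    raw = re.sub(
        rb'\\([0-7]{3})',                         # backslash + exactly 3 octal digits
        lambda match: bytes([int(match[1], 8)]),  # e.g. b'303' -> 0xC3 -> b'\xc3'
        escaped.encode('ascii'),                  # work on bytes, not str
    )
    return raw.decode('utf-8')


# Sample input: the text "caf" followed by the escaped UTF-8 bytes of "é".
stringa = r'caf\303\251'
result = unescape_octal(stringa)
print(result == "caf\u00e9")   # True
```

Note that an escape above \377 (for example \777) does not fit in one byte, so bytes([...]) would raise ValueError for such input.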
Is Python 2.7 actually converting my string to UTF-8 or is the definition of isalnum() different across different machines?
Strings (type str) in Python 2.7 are bytes. When you read text from a file, you get bytes, with possibly the line endings changed. Therefore, s is not an instance of type unicode.
On a str, tests like isalnum() assume that the string is ASCII text. ASCII is defined only for codes 0 to 127. Python has no idea, and can have no idea, what characters are represented by values outside this range, because the encoding is not known. é is not an ASCII character and therefore is not considered alphanumeric.
What you want to do is decode the byte string you've read to a Unicode string:
u = s.decode("utf8")
(assuming the string is written to the file in UTF-8 encoding; if that doesn't work, you can try latin1 or cp437... the latter is what my terminal gives me on Windows 10)
When you do that, u[0].isalnum() is True and isinstance(u, unicode) is also True.
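The same contrast can be reproduced in Python 3, where bytes plays the role of the Python 2 str. This is a sketch with an illustrative sample word, not the data from the question:

```python
# UTF-8 bytes of "été" (illustrative sample): b'\xc3\xa9t\xc3\xa9'
raw = "\u00e9t\u00e9".encode("utf-8")

# On bytes, isalnum() only recognises ASCII letters and digits,
# so the first byte (0xC3) is not alphanumeric.
print(raw[:1].isalnum())        # False

# After decoding, the first element is the character é, which is a letter.
text = raw.decode("utf-8")
print(text[0].isalnum())        # True
print(isinstance(text, str))    # True
```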
Python 3 works a little differently. You tell Python what encoding to use when you open the file (if you don't, a platform-dependent default is used). Then it translates the strings to Unicode from that encoding as you read them. All strings in Python 3 are Unicode; there's a separate type, bytes, for byte strings. You probably ought to use Python 3 for a lot of different reasons, but its more coherent handling of text is certainly one of those reasons.
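A short sketch of that Python 3 behaviour: the same file yields bytes in binary mode and str in text mode. The file name sample.txt and its contents are hypothetical and exist only inside a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.txt"            # hypothetical file name
    path.write_bytes("\u00e9t\u00e9\n".encode("utf-8"))

    # Binary mode: no decoding happens, you get bytes.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with an explicit encoding: you get str (Unicode).
    with open(path, encoding="utf-8") as f:
        as_text = f.read()

print(type(as_bytes).__name__, type(as_text).__name__)  # bytes str
print(as_text[0].isalnum())                             # True
```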
How to convert a string of utf-8 bytes into a unicode emoji in python
Yes, I encountered the same problem when trying to decode a Facebook message dump. Here's how I solved it:
string = "\u00f0\u009f\u0098\u0086".encode("latin-1").decode("utf-8")
# '😆'
Here's why:
- This emoji takes 4 bytes to encode in UTF-8 (F0 9F 98 86).
- Facebook could have used UTF-8 for the JSON file but they instead chose printable ASCII only. So it encodes those 4 bytes as \u00F0\u009F\u0098\u0086.
- encode("latin-1") was a convenient way to convert these escapes back to the raw bytes.
- decode("utf-8") converts the raw bytes into a Unicode character.
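Applied to a JSON dump, the repair happens after parsing, on each string value. The sketch below assumes a made-up one-field document (the key name "content" is an assumption, not a statement about Facebook's actual schema):

```python
import json

# A hypothetical fragment of a dump: the emoji's UTF-8 bytes written as \u00XX escapes.
raw_json = '{"content": "\\u00f0\\u009f\\u0098\\u0086"}'

message = json.loads(raw_json)
broken = message["content"]          # four characters: U+00F0 U+009F U+0098 U+0086

# Map each character back to the byte of the same value, then decode as UTF-8.
fixed = broken.encode("latin-1").decode("utf-8")

print(len(broken), len(fixed))       # 4 1
print(f"U+{ord(fixed):04X}")         # U+1F606
```

Strings in the dump that were never garbled and contain characters above U+00FF would make encode("latin-1") raise UnicodeEncodeError, so a real script may want to catch that and leave such values untouched.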
Decode a utf8 string in python
Pretty unclear question. However, the following code snippet could help (inline comments show partial progress report):
receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
vietnamese_txt = (receive_string
.encode() # b"b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
.decode('unicode_escape') # "b'vÃ´ Ä\x91á»\x8bch thiÃªn háº¡'"
.encode('latin1').decode() # "b'vô địch thiên hạ'"
.lstrip('b').strip("'")) # 'vô địch thiên hạ'
print(vietnamese_txt) # vô địch thiên hạ
vô địch thiên hạ
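Since the received text is the repr of a bytes object, another option is to let Python parse it as a literal. This is an alternative sketch, not the approach used in the snippet above; it relies on ast.literal_eval, which only accepts literals and does not execute arbitrary code:

```python
import ast

receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"

# Parse the text "b'...'" into a real bytes object.
raw = ast.literal_eval(receive_string)
if not isinstance(raw, bytes):
    raise TypeError("expected the repr of a bytes object")

# Decode the UTF-8 bytes into text.
vietnamese_txt = raw.decode("utf-8")
print(vietnamese_txt == "v\u00f4 \u0111\u1ecbch thi\u00ean h\u1ea1")  # True
```

Unlike lstrip('b').strip("'"), this does not remove a leading b or quote characters that happen to be part of the message itself.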
How do I convert unicode string with cp1252 characters into UTF-8 with Python?
It seems your string was decoded with latin1 (as it is of type unicode).
1. To convert it back to the bytes it originally was, you need to encode using that encoding (latin1).
2. Then, to get text back (unicode), you must decode using the proper codec (cp1252).
3. Finally, if you want to get to UTF-8 bytes, you must encode using the UTF-8 codec.
In code:
>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June
Depending on whether the API takes text (unicode) or bytes, step 3 may not be necessary.
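The session above is Python 2. For reference, the same three steps in Python 3 can be sketched as follows (the variable names are illustrative):

```python
title = "There\x92s thirty days in June"   # text wrongly decoded as latin1

original_bytes = title.encode("latin1")    # step 1: back to the original bytes
text = original_bytes.decode("cp1252")     # step 2: decode with the proper codec
utf8_bytes = text.encode("utf-8")          # step 3: only if bytes are needed

print(original_bytes)   # b'There\x92s thirty days in June'
print(ascii(text))      # 'There\u2019s thirty days in June'
print(utf8_bytes)       # b'There\xe2\x80\x99s thirty days in June'
```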