How to Convert a String to UTF-8 in Python

How to convert a string to utf-8 in Python

In Python 2

>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)

^ This is the difference between a byte string (plain_string) and a unicode string.

>>> s = "Hello!"
>>> u = unicode(s, "utf-8")

^ Converting to unicode and specifying the encoding.

In Python 3

All strings are Unicode, and the unicode function no longer exists. To get UTF-8 bytes from a string, call its encode method, for example s.encode("utf-8").
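In Python 3, "converting to UTF-8" usually means turning a str into bytes. A minimal sketch, with illustrative variable names that are not from the original answer:

```python
# Python 3: str holds Unicode text; encode() produces the UTF-8 bytes.
text = "Hi! é"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)              # b'Hi! \xc3\xa9'

# decode() goes the other way, from bytes back to text.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)      # True
```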

Convert string of unknown encoding to UTF-8

"TrÃ¤ume groÃŸ" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.

A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:

print('"TrÃ¤ume groÃŸ"'.encode('cp1252').decode('utf8'))

gives as expected:

"Träume groß"

But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.
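To illustrate the difference between the workaround and the proper fix, here is a small sketch. The raw variable stands in for bytes that would normally come from a file or socket; that source is an assumption made for the example.

```python
# Bytes as they would arrive from the outside world (UTF-8 encoded).
raw = "Träume groß".encode("utf-8")

# Wrong: decoding UTF-8 bytes as cp1252 produces mojibake.
wrong = raw.decode("cp1252")

# Right: decode with the encoding the bytes were actually written in.
right = raw.decode("utf-8")

# Workaround for text that is already garbled: undo the bad decode, then redo it.
repaired = wrong.encode("cp1252").decode("utf-8")

print(wrong)
print(right)
print(repaired == right)
```

The round trip only works when every byte maps to a cp1252 character, which is another reason to fix the decoding at the point where the bytes are read.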

Python – How do I convert an ASCII string into UTF-8?

If the input string contains the raw byte ordinals (such as \xc3\xa9/Ã© instead of é) use latin1 to encode it to bytes verbatim, then decode with the desired encoding.

>>> "pasÃ©".encode('latin1').decode()
'pasé'

Converting unicode string to utf-8

After fighting with python for over an hour, I decided to look for a solution in another language. This is how my goal can be achieved in C#:

var s = "\u00c4\u008d";
var newS = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
File.WriteAllText(@"D:\tmp\test.txt", newS, Encoding.UTF8);

Finally! The file now contains č.

I therefore got inspired by this approach in C# and managed to come up with the following (seemingly) equivalent solution in Python:

>>> s = u"\u00c4\u008d"
>>> arr = bytearray(map(ord, s))
>>> print arr.decode("utf-8")
č

I'm not sure how good this solution is but it seems to work in my case.
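The snippet above is Python 2. A rough Python 3 equivalent, assuming every character in the string has a code point below 256 (which is what makes the byte-for-byte mapping possible), might look like this:

```python
# Each character's code point is really one byte of UTF-8 data.
s = "\u00c4\u008d"

# latin-1 maps code points 0-255 to the same byte values, so this recovers the bytes.
fixed = s.encode("latin-1").decode("utf-8")

# Same idea as the bytearray(map(ord, s)) approach, written for Python 3.
also_fixed = bytes(map(ord, s)).decode("utf-8")

print(fixed)                 # č
print(fixed == also_fixed)   # True
```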

Decoding string to UTF-8 if string is already defined as r'string' instead of b'string'

Try this:

import re
re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), stringa.encode('ascii')).decode('utf-8')

Explanation:

We use the regular expression rb'\\([0-7]{3})' (which matches a literal backslash \ followed by exactly 3 octal digits) and replace each occurrence by taking the three digit code (match[1]), interpreting that as a number written in octal (int(_, 8)), and then replacing the original escape sequence with a single byte (bytes([_])).

We need to operate on bytes because the escape codes represent raw bytes, not Unicode characters. Only after "unescaping" those sequences can we decode the UTF-8 bytes to a string.
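As a self-contained sketch of the same approach, the one-liner can be wrapped in a helper. The function name unescape_octal and the sample input are made up for illustration:

```python
import re

def unescape_octal(text):
    """Replace literal \\NNN octal escapes with raw bytes, then decode as UTF-8."""
    raw = text.encode("ascii")
    unescaped = re.sub(
        rb'\\([0-7]{3})',
        lambda match: bytes([int(match[1], 8)]),  # b'303' -> 0xC3 -> one byte
        raw,
    )
    return unescaped.decode("utf-8")

stringa = r'Tr\303\244ume'        # backslashes are literal characters here
print(unescape_octal(stringa))    # Träume
```

Note that encode("ascii") raises UnicodeEncodeError if the input already contains non-ASCII characters.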

Is Python 2.7 actually converting my string to UTF-8 or is the definition of isalnum() different across different machines?

Strings (type str) in Python 2.7 are bytes. When you read text from a file, you get bytes, with possibly the line endings changed. Therefore, s is not an instance of type unicode.

On a str, tests like isalnum() assume that the string is ASCII text. ASCII is defined only for codes 0 to 127. Python has no idea, and can have no idea, what characters are represented by values outside this range, because the encoding is not known. é is not an ASCII character and therefore is not considered alphanumeric.

What you want to do is decode the byte string you've read to a Unicode string:

u = s.decode("utf8")

(assuming the string was written to the file in UTF-8; if that doesn't work, you can try latin1 or cp437. The latter is what my terminal gives me on Windows 10.)

When you do that, u[0].isalnum() is True and isinstance(u, unicode) is also True.

Python 3 works a little differently. You have to tell Python what encoding to use when you open the file. Then it translates the strings to Unicode from that encoding as you read them. All strings in Python 3 are Unicode; there's a separate type, bytes, for byte strings. You probably ought to use Python 3 for a lot of different reasons, but its more coherent handling of text is certainly one of those reasons.
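A short sketch of the Python 3 behaviour described above. It writes a temporary file so the example is self-contained; the file contents are invented for illustration:

```python
import os
import tempfile

# Create a small UTF-8 file to read back.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("é1".encode("utf-8"))
    path = f.name

try:
    # Text mode with an explicit encoding yields str (Unicode).
    with open(path, encoding="utf-8") as fh:
        u = fh.read()
    # Binary mode yields bytes, with no decoding applied.
    with open(path, "rb") as fh:
        b = fh.read()
finally:
    os.remove(path)

print(u[0].isalnum())                        # True
print(type(u).__name__, type(b).__name__)    # str bytes
```

Passing encoding explicitly avoids depending on the platform default, which is what differs between machines.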

How to convert a string of utf-8 bytes into a unicode emoji in python

Yes, I encountered the same problem when trying to decode a Facebook message dump. Here's how I solved it:

string = "\u00f0\u009f\u0098\u0086".encode("latin-1").decode("utf-8")
# '😆'

Here's why:

  1. This emoji takes 4 bytes to encode in UTF-8 (F0 9F 98 86).
  2. Facebook could have used UTF-8 for the JSON file, but they chose printable ASCII only instead. So each of those 4 bytes is written as its own escape: \u00F0\u009F\u0098\u0086
  3. encode("latin-1") is a convenient way to convert these escapes back to the raw bytes.
  4. decode("utf-8") converts the raw bytes into a Unicode character.
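The steps above can be traced with a small sketch. The JSON fragment is a made-up stand-in for one value from such a dump:

```python
import json

# A JSON string value where each UTF-8 byte was written as a \u00XX escape.
dumped = '"\\u00f0\\u009f\\u0098\\u0086"'

mangled = json.loads(dumped)          # four characters, one per original byte
raw = mangled.encode("latin-1")       # back to the four raw bytes
emoji = raw.decode("utf-8")           # one Unicode character

print(len(mangled), raw, emoji)
```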

Decode a utf8 string in python

Pretty unclear question. However, the following code snippet could help (inline comments show partial progress report):

receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
vietnamese_txt = (receive_string
    .encode()                    # b"b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
    .decode('unicode_escape')    # "b'vÃ´ Ä\x91á»\x8bch thiÃªn háº¡'"
    .encode('latin1').decode()   # "b'vô địch thiên hạ'"
    .lstrip('b').strip("'"))     # 'vô địch thiên hạ'

print(vietnamese_txt) # vô địch thiên hạ
vô địch thiên hạ
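If the received text is exactly the repr of a bytes object, as it appears to be here, an alternative sketch is to let ast.literal_eval parse it back into bytes. This swaps in a different technique from the answer above and assumes the input is a well-formed bytes literal:

```python
import ast

receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"

# literal_eval safely evaluates the bytes literal without running arbitrary code.
raw = ast.literal_eval(receive_string)
vietnamese_txt = raw.decode("utf-8")

print(vietnamese_txt)   # vô địch thiên hạ
```

literal_eval raises ValueError or SyntaxError on malformed input, so the escape-based chain may still be needed for strings that are not clean literals.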

How do I convert unicode string with cp1252 characters into UTF-8 with Python?

It seems your string was decoded with latin1 (as it is of type unicode)

  1. To convert it back to the bytes it originally was, you need to encode using that encoding (latin1).
  2. Then, to get text back (unicode), you must decode using the proper codec (cp1252).
  3. Finally, if you want to get UTF-8 bytes, you must encode using the UTF-8 codec.

In code:

>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June

Depending on whether the API takes text (unicode) or bytes, step 3 may not be necessary.
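The session above is Python 2. The same three steps in Python 3 might look like the following sketch, where the intermediate variable names are illustrative:

```python
title = "There\x92s thirty days in June"

# Step 1: recover the original bytes (latin-1 maps code points 0-255 to bytes 1:1).
original_bytes = title.encode("latin-1")

# Step 2: decode with the codec the bytes were really in.
text = original_bytes.decode("cp1252")

# Step 3 (only if the consumer wants bytes): encode as UTF-8.
utf8_bytes = text.encode("utf-8")

print(text)         # There’s thirty days in June
print(utf8_bytes)   # b'There\xe2\x80\x99s thirty days in June'
```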


