How to Decode a Unicode Character in a String

How to decode a Unicode character in a string

Regex.Unescape did the trick:

System.Text.RegularExpressions.Regex.Unescape(@"Sch\u00f6nen");

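Note that you need to be careful when testing your variants or writing unit tests: in a regular string literal the compiler has already processed the escape, so "Sch\u00f6nen" is already "Schönen". You need @ in front of the string (a verbatim literal) for \u00f6 to stay in the string as six literal characters.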

How to decode escaped Unicode characters?

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string that contains only ascii-characters '\', 'u', '0', '0', 'c', etc. is converted to bytes using some not-too-crazy 8-bit encoding (doesn't really matter which one, as long as it treats ASCII characters properly)
  • Use a decoder that interprets the '\u00c3' escapes as unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this unicode code point has the right byte representation when again encoded with ISO-8859-1/'latin-1', so...
  • encode it again with 'latin-1'
  • Decode it "properly" this time, as UTF-8
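To see what each step produces, the chain can be unrolled into intermediate values. This is only a sketch for Python 3, and the variable names are made up for illustration:

```python
# Input: twelve ASCII characters spelling out two \uXXXX escapes.
escaped = "\\u00c3\\u00a4"

# Step 1: text -> bytes. Any ASCII-compatible 8-bit codec would do here.
raw = escaped.encode("latin-1")

# Step 2: interpret the escapes; yields the two code points U+00C3 and U+00A4.
mojibake = raw.decode("unicode_escape")

# Step 3: latin-1 maps each code point below U+0100 to the byte of the same value.
utf8_bytes = mojibake.encode("latin-1")

# Step 4: those two bytes are the UTF-8 encoding of U+00E4.
fixed = utf8_bytes.decode("utf-8")

print(fixed)
```

Printing raw, mojibake and utf8_bytes with repr() at each stage is a quick way to find out which step goes wrong on your own data.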

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.

How to decode a unicode string Python

You need to call encode, not decode, because item is already a decoded (unicode) string.

Like this:

decoded_value = item.encode('utf-8')
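In Python 3 the distinction is harder to get wrong: str is always decoded text and only offers encode, while bytes only offers decode. A small sketch, with item as a stand-in value rather than the asker's actual data:

```python
item = "Sch\u00f6nen"                 # str: already decoded text

encoded_value = item.encode("utf-8")  # str -> bytes
round_trip = encoded_value.decode("utf-8")  # bytes -> str

print(encoded_value)
print(round_trip == item)
print(hasattr(item, "decode"))        # str has no decode method in Python 3
```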

Decoding Unicode in Python

There are actually two types of strings we could be dealing with here.

The first is a Python Unicode string, where the string is already a sequence of Unicode code points.

This is what it looks like in Python:

>>> x = u"\u1129\u1129"
>>> x
u'\u1129\u1129'

You can actually just print this to the screen, because the Python print function usually uses an encoding that supports this. (I believe it is sys.stdout.encoding)

>>> print x
ᄩᄩ

If you wish to encode this, you should probably use the utf-8 encoding, which supports all known Unicode characters. However, you will still need the print function to print it as a readable character.

But, this kind of string is easy to print! I doubt you would have any trouble outputting this to the screen. Which is why I believe you have the second type of string:


The second type of string is a Unicode-escaped string, which can be found in things like Java .properties files (which traditionally have to be stored in a single-byte encoding, ISO-8859-1, so anything outside it is written as an escape). This is what it looks like in Python:

>>> escapedString = "\\u05D4\\u05D4\\u05D4"
>>> print escapedString
\u05D4\u05D4\u05D4

And then because whoever designed these files was ignorant of Unicode and the basic essentials of character encoding, it's our job to turn these escaped code points into readable characters.

>>> pythonUnicode = escapedString.decode("unicode-escape")
# This turns escaped unicode code points into Python unicode code points
>>> print pythonUnicode
ההה

And it looks like we have readable characters!
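The session above is Python 2, where str has a decode method. In Python 3 a str has to be turned into bytes first. A possible equivalent, assuming the escaped text is pure ASCII (the helper name decode_unicode_escapes is hypothetical):

```python
def decode_unicode_escapes(escaped: str) -> str:
    """Turn literal \\uXXXX sequences into the characters they name.

    Assumes the input is pure ASCII; encode() raises UnicodeEncodeError otherwise.
    """
    return escaped.encode("ascii").decode("unicode_escape")


escaped_string = "\\u05D4\\u05D4\\u05D4"
python_unicode = decode_unicode_escapes(escaped_string)
print(python_unicode)
```

Encoding with "ascii" rather than something more permissive is a deliberate choice here: if the input already contains real non-ASCII characters, it fails loudly instead of silently mangling them.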


However, you should be careful if you have characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). There are different ways to escape characters that no longer fit in four hex digits. For example:

Python escapes such characters with \U (note the capital U) followed by eight hex digits.

>>> print "\\U0001D11E".decode("unicode-escape")
𝄞
>>> print u"\U0001D11E"
𝄞

But the JSON RFC specifies a different kind of escape:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

So make sure you know where your data comes from!
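What happens with the surrogate-pair form depends on the decoder. In Python 3, as far as I can tell, unicode_escape leaves the two escapes as separate surrogate code points, while the json module combines them. A sketch, with names that are illustrative only:

```python
import json

escaped = "\\uD834\\uDD1E"  # JSON-style escape for U+1D11E

# unicode_escape decodes each \uXXXX independently -> two lone surrogates.
halves = escaped.encode("ascii").decode("unicode_escape")

# A UTF-16 round trip with surrogatepass joins the halves into one character.
joined = halves.encode("utf-16", "surrogatepass").decode("utf-16")

# The json module joins surrogate pairs while parsing a string literal.
from_json = json.loads('"' + escaped + '"')

print(len(halves), len(joined), joined == from_json)
```

If the data really is JSON, parsing it with a JSON parser is the simpler route; the UTF-16 round trip is only a fallback for text that uses the same escape convention outside of JSON.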

Python: How to translate UTF8 String containing unicode decoded characters ( Ok\u00c9 to Oké )

I've found the problem: the encoding/decoding was wrong. The text came in as Windows-1252.

I used

import chardet
chardet.detect(var3.encode())

to detect the proper encoding, and then did a

var3 = 'OK\u00c9'.encode('utf8').decode('Windows-1252').encode('utf8').decode('utf8')

conversion to eventually get it in the right format!

How to decode a string containing backslash-encoded Unicode characters?

You can use strconv.Unquote for this:

u := `M\u00fcnchen`
s, err := strconv.Unquote(`"` + u + `"`)
if err != nil {
// ..
}
fmt.Printf("%v\n", s)

Outputs:

München

