How to decode a Unicode character in a string
Regex.Unescape did the trick:
System.Text.RegularExpressions.Regex.Unescape(@"Sch\u00f6nen");
Be careful when testing your variants or writing unit tests: in a regular string literal, "Sch\u00f6nen" is already "Schönen", because the compiler resolves the escape. You need @ in front of the string (a verbatim string literal) so that \u00f6 is treated as part of the string.
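The same testing pitfall shows up outside C#. As a rough Python analogue (an illustration only, not part of the answer above; the variable names are made up), a raw string keeps the escape as literal characters, while a regular literal has already been decoded before your code runs:

```python
# Regular literal: the escape is resolved when the source is parsed.
already_decoded = "Sch\u00f6nen"

# Raw literal: the six characters \ u 0 0 f 6 stay in the string,
# which is what a decoder under test should actually receive.
still_escaped = r"Sch\u00f6nen"

# Decode the escapes explicitly; this works here because the input is pure ASCII.
decoded = still_escaped.encode("ascii").decode("unicode_escape")

print(len(already_decoded), len(still_escaped), decoded == already_decoded)
# prints: 7 12 True
```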
How to decode escaped Unicode characters?
The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):
("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
Outputs:
'ä'
This works as follows:
- The string, which contains only ASCII characters ('\', 'u', '0', '0', 'c', etc.), is converted to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly).
- A decoder interprets the '\u00c3' escapes as Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code this is nonsense, but this code point has the right byte representation when encoded again with ISO-8859-1/'latin-1', so...
- Encode it again with 'latin-1'.
- Decode it "properly" this time, as UTF-8.
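The four steps can be spelled out one at a time to see the intermediate values. This is a minimal sketch assuming Python 3; the names escaped and step1 to step4 are made up for illustration:

```python
escaped = "\\u00c3\\u00a4"   # 12 ASCII characters: \ u 0 0 c 3 \ u 0 0 a 4

# Step 1: ASCII text to bytes (any ASCII-compatible 8-bit encoding would do).
step1 = escaped.encode("latin-1")

# Step 2: interpret the \uXXXX escapes, giving U+00C3 and U+00A4.
step2 = step1.decode("unicode_escape")

# Step 3: latin-1 maps each code point below 256 to the byte with the same value.
step3 = step2.encode("latin-1")

# Step 4: those two bytes are the UTF-8 encoding of the intended character.
step4 = step3.decode("utf-8")

print(repr(step2), step3, repr(step4))
```

The intermediate string in step 2 is the usual mojibake form; only the last step yields 'ä'.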
Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
How to decode a unicode string Python
You need to call the encode function and not the decode function, as item is already decoded.
Like this:
decoded_value = item.encode('utf-8')
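To make the direction of each call concrete, here is a small sketch assuming Python 3, where text (str) and bytes are separate types. The value of item is hypothetical, since the original question's data is not shown:

```python
item = "Schönen"                       # a str: already decoded text

encoded = item.encode("utf-8")         # str -> bytes
round_trip = encoded.decode("utf-8")   # bytes -> str

# In Python 3, str has no decode method at all, so calling it fails loudly.
has_decode = hasattr(item, "decode")

print(encoded, round_trip == item, has_decode)
```

So encode goes from text to bytes and decode goes from bytes to text; calling decode on text that is already decoded is the mistake described above.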
Decoding Unicode in Python
There are actually two types of strings we could be dealing with here.
The first is a Python Unicode string, where the string is already a sequence of Unicode code points.
This is what it looks like in Python:
>>> x = u"\u1129\u1129"
>>> x
u'\u1129\u1129'
You can actually just print this to the screen, because the Python print function usually uses an encoding that supports this. (I believe it is sys.stdout.encoding)
>>> print x
ᄩᄩ
If you wish to encode this, you should probably use the utf-8 encoding, which supports all known Unicode characters. However, you will still need the print function to print it as a readable character.
But, this kind of string is easy to print! I doubt you would have any trouble outputting this to the screen. Which is why I believe you have the second type of string:
The second type of string is a Unicode-escaped string, which can be found in things like Java .properties files (where they force you to use some single-byte variant of ascii encoding). This is what it looks like in Python:
>>> escapedString = "\\u05D4\\u05D4\\u05D4"
>>> print escapedString
\u05D4\u05D4\u05D4
And then because whoever designed these files was ignorant of Unicode and the basic essentials of character encoding, it's our job to turn these escaped code points into readable characters.
>>> pythonUnicode = escapedString.decode("unicode-escape")
# This turns escaped unicode code points into Python unicode code points
>>> print pythonUnicode
ההה
And it looks like we have readable characters!
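The session above is Python 2, where a plain string has a decode method. A rough Python 3 equivalent, as a sketch that assumes the escaped text is pure ASCII (the variable names are illustrative), first turns the text into bytes and then decodes the escapes:

```python
escaped_string = "\\u05D4\\u05D4\\u05D4"   # backslash, 'u', four hex digits, three times

# str has no decode in Python 3, so go through bytes first.
python_unicode = escaped_string.encode("ascii").decode("unicode_escape")

print(len(escaped_string), len(python_unicode))
# prints: 18 3
```

If the input can contain non-ASCII characters alongside the escapes, this simple round trip is not safe, because unicode_escape treats the bytes as Latin-1.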
However, you should be careful if you have characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). There are different ways to encode characters that extend past the basic two bytes. For example:
Python escapes extended characters with \U (note the capital U) followed by eight hex digits.
>>> print "\\U0001D11E".decode("unicode-escape")
𝄞
>>> print u"\U0001D11E"
𝄞
But the JSON RFC specifies a different kind of escape:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
So make sure you know where your data comes from!
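Both escape styles can be handled in Python 3, but not by the same call. The sketch below is one possible approach, not the only one; the variable names are made up. It uses the standard json module for the surrogate-pair form, plus a UTF-16 round trip for the case where the surrogates were decoded separately:

```python
import json

# Python-style escape: \U followed by eight hex digits.
from_python_escape = "\\U0001D11E".encode("ascii").decode("unicode_escape")

# JSON-style escape: a UTF-16 surrogate pair. A JSON parser joins the pair.
from_json = json.loads('"\\uD834\\uDD1E"')

# unicode_escape decodes each \uXXXX on its own, so the two surrogates
# come back as separate code points; a round trip through UTF-16 with
# the "surrogatepass" error handler merges them into one character.
surrogates = "\\uD834\\uDD1E".encode("ascii").decode("unicode_escape")
repaired = surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

print(from_python_escape == from_json == repaired)
# prints: True
```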
Python: How to translate UTF8 String containing unicode decoded characters ( Ok\u00c9 to Oké )
I've found the problem: the encoding/decoding was wrong. The text came in with Windows-1252 encoding.
I used
import chardet
chardet.detect(var3.encode())
to detect the proper encoding, and then did a
var3 = 'OK\u00c9'.encode('utf8').decode('Windows-1252').encode('utf8').decode('utf8')
conversion to eventually get it in the right format!
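For comparison, here is a sketch of the most common form of this mismatch, using only the standard library. It assumes the symptom is UTF-8 bytes that were wrongly decoded as Windows-1252, which may not be exactly what happened in the case above; the sample text and variable names are hypothetical:

```python
# Hypothetical sample: "Oké" encoded as UTF-8, then wrongly decoded as Windows-1252.
garbled = "Oké".encode("utf-8").decode("cp1252")

# Undo the wrong step: recover the original bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled), repr(repaired))
```

The repair only works while every garbled character can be mapped back to a Windows-1252 byte, so it is worth checking the result rather than applying the conversion blindly.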
How to decode a string containing backslash-encoded Unicode characters?
You can use strconv.Unquote for this:
u := `M\u00fcnchen`
s, err := strconv.Unquote(`"` + u + `"`)
if err != nil {
// ..
}
fmt.Printf("%v\n", s)
Outputs:
München