How to Decode a String with Escaped Unicode

How to decode escaped Unicode characters?

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string that contains only ascii-characters '\', 'u', '0', '0', 'c', etc. is converted to bytes using some not-too-crazy 8-bit encoding (doesn't really matter which one, as long as it treats ASCII characters properly)
  • Use a decoder that interprets the '\u00c3' escapes as unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this unicode code point has the right byte representation when again encoded with ISO-8859-1/'latin-1', so...
  • encode it again with 'latin-1'
  • Decode it "properly" this time, as UTF-8

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.

How do I decode a string with escaped unicode?

Edit (2017-10-12):

@MechaLynx and @Kevin-Weber note that unescape() is deprecated from non-browser environments and does not exist in TypeScript. decodeURIComponent is a drop-in replacement. For broader compatibility, use the below instead:

decodeURIComponent(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'

Original answer:

unescape(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'

You can offload all the work to JSON.parse

How to decode partially escaped unicode string in python (mixed unicode and escaped unicode)?

A simple and fast solution is to use re.sub to match \u and exactly four hexadecimal digits, and convert those digits into a Unicode code point:

import re

s = r"blah bl\uah \u20ac € b\u20aclah\u12blah blah"
print(s)

s = re.sub(r'\\u([0-9a-fA-F]{4})',lambda m: chr(int(m.group(1),16)),s)
print(s)

Output:

blah bl\uah \u20ac € b\u20aclah\u12blah blah
blah bl\uah € € b€lah\u12blah blah

Changing string with escaped Unicode to normal Unicode

If I try reproducing your issue:

s="reb\\u016bke";
print(s);
# reb\u016bke
print(repr(s));
# 'reb\\u016bke'
print(s.encode().decode('unicode-escape'));
# rebūke

How to decode a string containing backslash-encoded Unicode characters?

You can use strconv.Unquote for this:

u := `M\u00fcnchen`
s, err := strconv.Unquote(`"` + u + `"`)
if err != nil {
// ..
}
fmt.Printf("%v\n", s)

Outputs:

München

Convert data with escaped unicode characters to string

Assuming your data has the same content as something like this:

let data = #"Pla\u010daj Izbri\u0161i"#.data(using: .utf8)!
print(data as NSData) //->{length = 24, bytes = 0x506c615c7530313064616a20497a6272695c753031363169}

You can decode it in this way:

    public func decode(data: Data) throws -> String {
guard let text = String(data: data, encoding: .utf8) else {
throw SomeError()
}

let transform = StringTransform(rawValue: "Any-Hex/Java")
return text.applyingTransform(transform, reverse: true) ?? text
}

But, if you really get this sort of data from the web api, you should better tell the api engineer to use some normal encoding scheme.

How to determine if a string is escaped unicode

str_escaped = u'"A\u0026B"'
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
return all(ord(c) < 128 for c in s)

def is_escaped_unicode(str):
#how do I determine if this is escaped unicode?
if is_ascii(str): # escaped unicode is ascii
return True
return False

for str in arr_all_strings:
if is_escaped_unicode(str):
str = str.decode("unicode-escape")
print str

The following code will work for your case.

Explain:

  • All string in str_escaped is in Ascii range.

  • Char in str_unicode do not contain in Ascii range.

Decoding escaped unicode in Python 3 from a non-ascii string

I was still very new to Python when I asked this question. Now I understand that these fallback mechanisms are just meant for handling unexpected errors, not something to save and restore data. If you really need a simple and reliable way to encode single unicode characters in ASCII, have a look at the quote and unquote functions from the urllib.parse module.



Related Topics



Leave a reply



Submit