How to Decode a String with Escaped Unicode

How to decode escaped Unicode characters?

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):




This works as follows:

  • The string, which contains only the ASCII characters '\', 'u', '0', '0', 'c', etc., is converted to bytes using any reasonable 8-bit encoding (it doesn't matter which one, as long as it handles ASCII characters properly).
  • A decoder then interprets each '\u00c3' escape as the Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code this is nonsense, but the code point has the right byte representation when encoded again with ISO-8859-1 ('latin-1'), so...
  • Encode it again with 'latin-1'.
  • Decode it "properly" this time, as UTF-8.
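
The four steps above can be sketched in Python. This is only an illustration, assuming the input is a str that holds literal backslash-u escapes whose code points are really the bytes of UTF-8 text. The function name repair_double_encoded and the sample input are made up for this example.

```python
def repair_double_encoded(escaped: str) -> str:
    """Undo text that was UTF-8 encoded, then written out as \\u00XX escapes."""
    # Step 1: the input is pure ASCII, so any ASCII-compatible encoding works.
    raw = escaped.encode("ascii")
    # Step 2: interpret each escape as a code point in U+0000..U+00FF (mojibake).
    mojibake = raw.decode("unicode_escape")
    # Step 3: latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF.
    utf8_bytes = mojibake.encode("latin-1")
    # Step 4: decode the recovered bytes properly, as UTF-8.
    return utf8_bytes.decode("utf-8")


if __name__ == "__main__":
    # C3 A9 is the UTF-8 encoding of 'é', here escaped byte by byte.
    print(repair_double_encoded(r"caf\u00c3\u00a9"))  # prints: café
```

Step 3 raises UnicodeEncodeError if the text contains a code point above U+00FF, which is a useful signal that the input was not broken in this particular way.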

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.

How do I decode a string with escaped unicode?

Edit (2017-10-12):

@MechaLynx and @Kevin-Weber note that unescape() is deprecated in non-browser environments and does not exist in TypeScript. decodeURIComponent is a drop-in replacement. For broader compatibility, use the following instead:

> ''

Original answer:

> ''

You can offload all the work to JSON.parse
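
The idea of handing the work to a JSON parser is not specific to JavaScript. As a rough Python analogue (not the answer's original snippet), the sketch below wraps the text in double quotes and parses it with json.loads. The helper name decode_json_escapes and the sample input are assumptions made for illustration.

```python
import json


def decode_json_escapes(escaped: str) -> str:
    """Decode \\uXXXX escapes by parsing the text as a JSON string literal."""
    # Wrapping in double quotes turns the text into a JSON string literal.
    # Assumption: the input contains no unescaped double quotes or raw
    # control characters; json.loads raises json.JSONDecodeError otherwise.
    return json.loads('"' + escaped + '"')


if __name__ == "__main__":
    print(decode_json_escapes(r"M\u00fcnchen"))  # prints: München
```

A side benefit of this route is that the JSON parser also combines surrogate pairs such as \ud83d\ude00 into a single character.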

How to decode partially escaped unicode string in python (mixed unicode and escaped unicode)?

A simple and fast solution is to use re.sub to match \u and exactly four hexadecimal digits, and convert those digits into a Unicode code point:

import re

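s = r"blah bl\uah \u20ac € b\u20aclah\u12blah blah"
print(s)

# Replace each \uXXXX (exactly four hex digits) with the character it names
s = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)
print(s)
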
blah bl\uah \u20ac € b\u20aclah\u12blah blah
blah bl\uah € € b€lah\u12blah blah

Changing string with escaped Unicode to normal Unicode

If I try reproducing your issue:

# reb\u016bke
# 'reb\\u016bke'
# rebūke
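
The three comment lines above look like the output of a short reproduction. A minimal sketch of what such a reproduction could look like in Python 3 follows, assuming the input is a str containing a literal backslash followed by u016b. The variable names are hypothetical.

```python
s = r"reb\u016bke"  # raw string: a literal backslash followed by u016b

print(s)        # reb\u016bke
print(repr(s))  # 'reb\\u016bke'

# Encode to ASCII bytes, then let the unicode_escape codec expand the escape.
decoded = s.encode("ascii").decode("unicode_escape")
print(decoded)  # rebūke
```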

How to decode a string containing backslash-encoded Unicode characters?

You can use strconv.Unquote for this:

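// Requires the "fmt" and "strconv" packages.
u := `M\u00fcnchen`
s, err := strconv.Unquote(`"` + u + `"`)
if err != nil {
	// handle the error (for example, a malformed escape sequence)
}
fmt.Printf("%v\n", s) // München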



Convert data with escaped unicode characters to string

Assuming your data has the same content as something like this:

import Foundation

let data = #"Pla\u010daj Izbri\u0161i"#.data(using: .utf8)!
print(data as NSData) //->{length = 24, bytes = 0x506c615c7530313064616a20497a6272695c753031363169}

You can decode it in this way:

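public func decode(data: Data) throws -> String {
    guard let text = String(data: data, encoding: .utf8) else {
        throw SomeError() // placeholder: substitute your own Error type
    }

    // "Any-Hex/Java" applied in reverse turns \uXXXX escapes back into characters
    let transform = StringTransform(rawValue: "Any-Hex/Java")
    return text.applyingTransform(transform, reverse: true) ?? text
}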

But if you really get this sort of data from a web API, you should ask the API engineer to use a standard encoding scheme instead.

How to determine if a string is escaped unicode

# Python 2
str_escaped = u'"A\u0026B"'
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

def is_escaped_unicode(s):
    # how do I determine if this is escaped unicode?
    if is_ascii(s):  # escaped unicode is ascii
        return True
    return False

for s in arr_all_strings:
    if is_escaped_unicode(s):
        s = s.decode("unicode-escape")
    print s

The code above will work for your case (it is written for Python 2):

  • All characters in str_escaped are in the ASCII range.

  • The characters in str_unicode are not all within the ASCII range.
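
For Python 3, the same heuristic could be sketched as follows. This is an assumption-laden variant rather than the answer's own code: it requires Python 3.7+ for str.isascii, it additionally checks that at least one \uXXXX escape is present, and the names ESCAPE_RE and decode_if_escaped are made up for this example.

```python
import re

# Matches a literal backslash, 'u', and exactly four hexadecimal digits.
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}")


def is_escaped_unicode(s: str) -> bool:
    """Heuristic: pure ASCII text that contains at least one \\uXXXX escape."""
    return s.isascii() and ESCAPE_RE.search(s) is not None


def decode_if_escaped(s: str) -> str:
    """Expand escapes only when the heuristic says the text is escaped."""
    if is_escaped_unicode(s):
        # unicode_escape also expands other backslash escapes such as \n.
        return s.encode("ascii").decode("unicode_escape")
    return s


if __name__ == "__main__":
    for text in [r'"A\u0026B"', '"Война́ и миръ"']:
        print(decode_if_escaped(text))
```

Plain ASCII text without any escape is left untouched, which avoids decoding strings that merely happen to be ASCII.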

Decoding escaped unicode in Python 3 from a non-ascii string

I was still very new to Python when I asked this question. Now I understand that these fallback mechanisms are just meant for handling unexpected errors, not something to save and restore data. If you really need a simple and reliable way to encode single unicode characters in ASCII, have a look at the quote and unquote functions from the urllib.parse module.
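
A brief sketch of the quote and unquote round trip mentioned above, assuming the default UTF-8 encoding; the sample text is arbitrary.

```python
from urllib.parse import quote, unquote

original = "rebūke €"
encoded = quote(original)   # non-ASCII characters become %XX UTF-8 bytes
decoded = unquote(encoded)  # reverses the percent-encoding

print(encoded)  # reb%C5%ABke%20%E2%82%AC
print(decoded)  # rebūke €
```

The encoded form is plain ASCII, and decoding it gives back the original string unchanged.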

