How to decode escaped Unicode characters?
The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):
("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
Outputs:
'ä'
This works as follows:
- The string, which contains only ASCII characters ('\', 'u', '0', '0', 'c', etc.), is converted to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly).
- Use a decoder that interprets the '\u00c3' escapes as Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this code point has the right byte representation when encoded again with ISO-8859-1 / 'latin-1', so...
- Encode it again with 'latin-1'.
- Decode it "properly" this time, as UTF-8.
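The four steps can also be traced one at a time. This is only an illustrative sketch: the names escaped and step1 to step4 are made up here, and it assumes the text was originally UTF-8 that got mis-read as Latin-1 before being escaped.

```python
# Twelve ASCII characters: \ u 0 0 c 3 \ u 0 0 a 4
escaped = "\\u00c3\\u00a4"

# Step 1: ASCII text to bytes (any ASCII-compatible 8-bit encoding would do).
step1 = escaped.encode("latin-1")

# Step 2: interpret the \uXXXX escapes; yields the mojibake 'Ã¤' (U+00C3, U+00A4).
step2 = step1.decode("unicode_escape")

# Step 3: Latin-1 maps U+00C3 and U+00A4 back to the single bytes 0xC3 and 0xA4.
step3 = step2.encode("latin-1")

# Step 4: those two bytes are the UTF-8 encoding of 'ä' (U+00E4).
step4 = step3.decode("utf-8")

print(step1, repr(step2), step3, repr(step4))
```

If the text was never double-encoded in the first place, steps 3 and 4 are not needed and would likely raise an error or garble the result.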
Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
How do I decode a string with escaped unicode?
Edit (2017-10-12):
@MechaLynx and @Kevin-Weber note that unescape() is deprecated in non-browser environments and does not exist in TypeScript. decodeURIComponent is a drop-in replacement. For broader compatibility, use the following instead:
decodeURIComponent(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'
Original answer:
unescape(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'
You can offload all the work to JSON.parse, which decodes the \uXXXX escapes when it parses the quoted string.
How to decode partially escaped unicode string in python (mixed unicode and escaped unicode)?
A simple and fast solution is to use re.sub to match \u and exactly four hexadecimal digits, and convert those digits into a Unicode code point:
import re
s = r"blah bl\uah \u20ac € b\u20aclah\u12blah blah"
print(s)
s = re.sub(r'\\u([0-9a-fA-F]{4})',lambda m: chr(int(m.group(1),16)),s)
print(s)
Output:
blah bl\uah \u20ac € b\u20aclah\u12blah blah
blah bl\uah € € b€lah\u12blah blah
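One case the four-digit pattern does not cover on its own is characters outside the Basic Multilingual Plane, which are usually escaped as a surrogate pair such as \ud83d\ude00. The sketch below is one possible way to handle that, not part of the original answer; the function name unescape_u is made up for illustration.

```python
import re

def unescape_u(text):
    # Replace each \uXXXX escape (exactly four hex digits) with its character;
    # malformed escapes such as \uah or \u12 are left untouched.
    decoded = re.sub(r'\\u([0-9a-fA-F]{4})',
                     lambda m: chr(int(m.group(1), 16)), text)
    # An escaped surrogate pair is now two lone surrogates. A round trip
    # through UTF-16 with 'surrogatepass' merges the pair into one code point.
    return decoded.encode('utf-16', 'surrogatepass').decode('utf-16', 'surrogatepass')

print(unescape_u(r"blah bl\uah \u20ac € b\u20aclah\u12blah blah"))
print(unescape_u(r"\ud83d\ude00") == "\U0001F600")
```

For input that never contains surrogate escapes, the extra round trip leaves the result unchanged, so the plain re.sub call is enough.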
Changing string with escaped Unicode to normal Unicode
If I try reproducing your issue:
s = "reb\\u016bke"
print(s)
# reb\u016bke
print(repr(s))
# 'reb\\u016bke'
print(s.encode().decode('unicode-escape'))
# rebūke
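Note that this round trip is only safe when the string is pure ASCII. The unicode-escape codec reads every non-escape byte as Latin-1, so real non-ASCII characters that are already in the string get garbled. A small sketch of the difference, with variable names chosen here for illustration:

```python
escaped_ascii = "reb\\u016bke"   # only ASCII characters plus an escape
mixed = "€ reb\\u016bke"         # a real '€' next to an escaped 'ū'

# Pure ASCII input: the escape is decoded as expected.
ok = escaped_ascii.encode().decode("unicode-escape")

# Mixed input: the three UTF-8 bytes of '€' are each read as a Latin-1 character.
garbled = mixed.encode().decode("unicode-escape")

print(ok)
print(repr(garbled))
```

For mixed input, replacing only the escape sequences (for example with a regular expression) avoids this problem.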
How to decode a string containing backslash-encoded Unicode characters?
You can use strconv.Unquote for this:
u := `M\u00fcnchen`
s, err := strconv.Unquote(`"` + u + `"`)
if err != nil {
	// ..
}
fmt.Printf("%v\n", s)
Outputs:
München
Convert data with escaped unicode characters to string
Assuming your data has the same content as something like this:
let data = #"Pla\u010daj Izbri\u0161i"#.data(using: .utf8)!
print(data as NSData) //->{length = 24, bytes = 0x506c615c7530313064616a20497a6272695c753031363169}
You can decode it in this way:
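import Foundation

public func decode(data: Data) throws -> String {
    guard let text = String(data: data, encoding: .utf8) else {
        throw SomeError() // placeholder: any type conforming to Error
    }
    // "Any-Hex/Java" applied in reverse turns \uXXXX escapes back into characters
    let transform = StringTransform(rawValue: "Any-Hex/Java")
    return text.applyingTransform(transform, reverse: true) ?? text
}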
But if you really get this sort of data from the web API, you had better ask the API engineer to use a standard encoding scheme.
How to determine if a string is escaped unicode
str_escaped = r'"A\u0026B"'  # raw string, so the \u0026 escape stays literal
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

def is_escaped_unicode(s):
    # Escaped Unicode consists only of ASCII characters
    return is_ascii(s)

for s in arr_all_strings:
    if is_escaped_unicode(s):
        # Python 3: str has no decode(), so go through bytes first
        s = s.encode('ascii').decode('unicode-escape')
    print(s)
The code above will work for your case.
Explanation:
All characters in str_escaped are in the ASCII range.
The characters in str_unicode are not in the ASCII range.
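The ASCII test alone also returns True for ordinary ASCII text that contains no escapes at all. A slightly stricter check, sketched here as an assumption rather than as part of the original answer, also requires at least one \uXXXX sequence. The names ESCAPE_RE and looks_escaped are hypothetical, and str.isascii() needs Python 3.7 or later.

```python
import re

# Matches a backslash, 'u', and exactly four hexadecimal digits.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def looks_escaped(s):
    # True only if the string is pure ASCII and contains a \uXXXX escape.
    return s.isascii() and ESCAPE_RE.search(s) is not None

print(looks_escaped(r'"A\u0026B"'))        # ASCII with an escape
print(looks_escaped('"Война́ и миръ"'))     # real non-ASCII characters
print(looks_escaped('plain ASCII text'))   # ASCII, but nothing to decode
```

This is still a heuristic: a string that legitimately contains the literal text \u0026 would also be reported as escaped.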
Decoding escaped unicode in Python 3 from a non-ascii string
I was still very new to Python when I asked this question. Now I understand that these fallback mechanisms are just meant for handling unexpected errors, not something to save and restore data. If you really need a simple and reliable way to encode single Unicode characters in ASCII, have a look at the quote and unquote functions from the urllib.parse module.
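A minimal sketch of that round trip, using "München" as sample input (the variable names are chosen here for illustration):

```python
from urllib.parse import quote, unquote

original = "München"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
encoded = quote(original)

# unquote() reverses the percent-encoding and decodes the bytes as UTF-8.
decoded = unquote(encoded)

print(encoded)
print(decoded == original)
```

By default both functions use UTF-8, so the encoded form is plain ASCII and the round trip restores the original string.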