Convert Escaped Unicode Character Back to Actual Character

Convert escaped Unicode character back to actual character

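Try StringEscapeUtils.unescapeJava from Apache Commons Lang:

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

Since Commons Lang 3.6 this class is deprecated in favor of org.apache.commons.text.StringEscapeUtils from Apache Commons Text, which offers the same method.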

Convert escaped Unicode character back to actual character in PostgreSQL

One old trick is to let the SQL parser do the work:

postgres=# select e'Telefon\u00ED kontakty';
?column?
-------------------
Telefoní kontakty
(1 row)

CREATE OR REPLACE FUNCTION public.unescape(text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
  EXECUTE format('SELECT e''%s''', $1) INTO result;
  RETURN result;
END;
$function$

It works, but it is vulnerable to SQL injection, so you must sanitize the input text first!

Here is a less readable but safe version. You have to specify one character to use as the escape symbol:

CREATE OR REPLACE FUNCTION public.unescape(text, text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
  -- $2 is the escape character; it replaces every '\u' before parsing
  EXECUTE format('SELECT U&%s UESCAPE %s',
                 quote_literal(replace($1, '\u', $2)),
                 quote_literal($2)) INTO result;
  RETURN result;
END;
$function$

Result

postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
unescape
-----------------
Odpovědná osoba
(1 row)

How can I convert a String in ASCII(Unicode Escaped) to Unicode(UTF-8) if I am reading from a file?

final String str = new String("Diogo Pi\u00e7arra - Tu E Eu".getBytes(), 
Charset.forName("UTF-8"));

You could call getBytes() without parameters (the default charset is then used), but the conversion is not necessary at all. In a string literal the compiler already replaces \u00e7 with ç:

final String str = "Diogo Pi\u00e7arra - Tu E Eu";

You'll get the same result.

How to decode escaped Unicode characters?

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • Convert the string, which contains only the ASCII characters '\', 'u', '0', '0', 'c', etc., to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly)
  • Use a decoder that interprets the '\u00c3' escapes as Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this code point has the right byte representation when encoded again with ISO-8859-1/'latin-1', so...
  • Encode it again with 'latin-1'
  • Decode it "properly" this time, as UTF-8
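The four steps above can be made visible by storing each intermediate value. This is only a sketch of the same chain with the calls split apart; the variable names are illustrative.

```python
# The input holds the twelve ASCII characters \u00c3\u00a4, not two letters.
escaped = "\\u00c3\\u00a4"

# Step 1: text -> bytes; any ASCII-compatible 8-bit encoding would do.
step1 = escaped.encode("latin-1")

# Step 2: interpret the \uXXXX escapes, giving U+00C3 and U+00A4.
step2 = step1.decode("unicode_escape")

# Step 3: latin-1 maps U+00C3 and U+00A4 to the bytes 0xC3 and 0xA4.
step3 = step2.encode("latin-1")

# Step 4: the bytes 0xC3 0xA4 are the UTF-8 encoding of U+00E4.
step4 = step3.decode("utf-8")

for value in (step1, step2, step3, step4):
    # ascii() keeps the output readable on any terminal encoding
    print(ascii(value))
```

If the text was not double-encoded in this particular way, step 3 or step 4 is likely to raise UnicodeEncodeError or UnicodeDecodeError, which is a useful signal that this repair does not apply.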

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.

Convert data with escaped unicode characters to string

Assuming your data has the same content as something like this:

let data = #"Pla\u010daj Izbri\u0161i"#.data(using: .utf8)!
print(data as NSData) //->{length = 24, bytes = 0x506c615c7530313064616a20497a6272695c753031363169}

You can decode it in this way:

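public func decode(data: Data) throws -> String {
    guard let text = String(data: data, encoding: .utf8) else {
        // SomeError stands in for an Error type of your own
        throw SomeError()
    }

    // Applying "Any-Hex/Java" in reverse turns \uXXXX escapes into characters
    let transform = StringTransform(rawValue: "Any-Hex/Java")
    return text.applyingTransform(transform, reverse: true) ?? text
}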

But if you really get this sort of data from a web API, you should ask the API engineer to use a normal encoding scheme.

How to replace escaped unicode characters with proper unicode characters?

Your regex extracts JSON strings from a webpage:

searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

Those " characters you removed were actually significant. The \uxxxx escape syntax here is specific to JSON (and JavaScript); it is closely related to Python's escape syntax but not identical (the difference is small, but it matters when you have non-BMP code points).

You can trivially decode them as JSON if you keep the quotes in there:

searched_results = map(json.loads, re.findall(r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")", results_source))
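To illustrate why the JSON decoder matters for non-BMP code points, here is a small sketch comparing json.loads with Python's unicode_escape codec on the same escaped text. The sample value (the surrogate pair for U+1F600) is an assumption chosen for the demonstration, not data from the page being scraped.

```python
import json

# A JSON string literal, quotes included, holding one surrogate pair.
raw = r'"\ud83d\ude00"'

# The JSON decoder combines the pair into the single code point U+1F600.
via_json = json.loads(raw)

# unicode_escape decodes each \uXXXX on its own and leaves two lone
# surrogates, which cannot be encoded as UTF-8 later on.
via_codec = raw.strip('"').encode("ascii").decode("unicode_escape")

print(len(via_json), len(via_codec))
```

For text that stays inside the Basic Multilingual Plane, both approaches give the same string.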

Better still would be to use an HTML library to parse the page. When using BeautifulSoup, you can get the data with:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(results_source, 'html.parser')
search_results = [json.loads(t.text)['ou'] for t in soup.select('.rg_meta')]

This loads the text contents of each <div class="rg_meta" ...> element as JSON data, and extracts the ou key from each of the resulting dictionaries. No regular expressions required.

Java doesn't decode passed string (with unicode)

In Java source code, \uD83D is an escape code: The compiler replaces it with one code unit.

If you see \uD83D in your database, it's not an escape code, it's the sequence of six individual characters '\' 'u' 'D' '8' '3' 'D'.

What's the right way to fix this and make sure you get the same output anyway?

One thing you must ask is why the text "\uD83D" got into the database in the first place. Text stored in a database should not be mangled in this way. It sounds like there is a bug at the data entry.

If there's no way to fix the data entry, and you want to replace the text "\uD83D" with a single character just like the Java compiler would, that has already been covered in other questions; see, for example, "Convert escaped Unicode character back to actual character".
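The difference between an escape in source code and six literal characters, and the replacement itself, can be sketched in a few lines. The sketch is in Python rather than Java, and unescape_unicode is a hypothetical helper written for this illustration; it is not how the Java compiler or Commons Lang implements the conversion.

```python
import re

# Six individual characters, as they would sit in the database column.
stored = "\\uD83D"
print(len(stored))  # 6

# Matches a backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_unicode(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they name."""
    replaced = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Each escape yields one UTF-16 code unit. A round trip through UTF-16
    # joins high/low surrogate pairs into one code point; an unpaired
    # surrogate makes decode() raise UnicodeDecodeError.
    return replaced.encode("utf-16", "surrogatepass").decode("utf-16")

print(ascii(unescape_unicode("Telefon\\u00ED kontakty")))
print(len(unescape_unicode("\\uD83D\\uDE00")))
```

The surrogate handling matters for values such as \uD83D, which is only half of a character: it needs the low surrogate that follows it before it can become a real code point.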


