Convert escaped Unicode character back to actual character
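Try StringEscapeUtils.unescapeJava from Apache Commons Lang:
str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);
In Commons Lang 3.6 and later this class is deprecated in favor of org.apache.commons.text.StringEscapeUtils from Apache Commons Text.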
Convert escaped Unicode character back to actual character in PostgreSQL
One old trick is to let the SQL parser do the work:
postgres=# select e'Telefon\u00ED kontakty';
?column?
-------------------
Telefoní kontakty
(1 row)
CREATE OR REPLACE FUNCTION public.unescape(text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
EXECUTE format('SELECT e''%s''', $1) INTO result;
RETURN result;
END;
$function$
It works, but it is vulnerable to SQL injection, so you must sanitize the input text first!
Here is a less readable but safe version. You have to specify one character as the escape symbol manually, and that character must not otherwise occur in the input:
CREATE OR REPLACE FUNCTION public.unescape(text, text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
-- replace each \u prefix with the caller's escape character ($2), then let U&'...' decode it
EXECUTE format('SELECT U&%s UESCAPE %s',
quote_literal(replace($1, '\u', $2)),
quote_literal($2)) INTO result;
RETURN result;
END;
$function$
Result:
postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
unescape
-----------------
Odpovědná osoba
(1 row)
How can I convert a String in ASCII(Unicode Escaped) to Unicode(UTF-8) if I am reading from a file?
final String str = new String("Diogo Pi\u00e7arra - Tu E Eu".getBytes(),
Charset.forName("UTF-8"));
Try using the getBytes() method without parameters (the default charset is used here). But it's not necessary; no conversion is required:
final String str = "Diogo Pi\u00e7arra - Tu E Eu";
You'll get the same result.
How to decode escaped Unicode characters?
The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):
("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
Outputs:
'ä'
This works as follows:
- The string, which contains only the ASCII characters '\', 'u', '0', '0', 'c', etc., is converted to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly).
- Use a decoder that interprets the '\u00c3' escapes as Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this Unicode code point has the right byte representation when encoded again with ISO-8859-1/'latin-1', so...
- Encode it again with 'latin-1'.
- Decode it "properly" this time, as UTF-8.
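The four steps can also be traced one at a time. The following is only an illustrative sketch in Python 3; the variable names are made up here, and the value after each step is noted in a comment.

```python
# Trace each step of the repair on the same input as above.
escaped = "\\u00c3\\u00a4"   # 12 ASCII characters: \ u 0 0 c 3 \ u 0 0 a 4

# Step 1: ASCII text -> bytes (any ASCII-compatible 8-bit encoding would do)
raw = escaped.encode("latin-1")           # b'\\u00c3\\u00a4'

# Step 2: interpret the \uXXXX escapes as code points U+00C3 and U+00A4
mojibake = raw.decode("unicode_escape")   # 'Ã¤'

# Step 3: map each code point below U+0100 back to the byte with the same value
utf8_bytes = mojibake.encode("latin-1")   # b'\xc3\xa4'

# Step 4: decode those two bytes as UTF-8, which is what they were all along
repaired = utf8_bytes.decode("utf-8")     # 'ä'
```

Step 3 only works while every code point produced in step 2 is below U+0100; anything higher makes the 'latin-1' encoder raise UnicodeEncodeError.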
Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
Convert data with escaped unicode characters to string
Assuming your data has the same content as something like this:
let data = #"Pla\u010daj Izbri\u0161i"#.data(using: .utf8)!
print(data as NSData) //->{length = 24, bytes = 0x506c615c7530313064616a20497a6272695c753031363169}
You can decode it in this way:
public func decode(data: Data) throws -> String {
guard let text = String(data: data, encoding: .utf8) else {
throw SomeError()
}
let transform = StringTransform(rawValue: "Any-Hex/Java")
return text.applyingTransform(transform, reverse: true) ?? text
}
But if you really get this sort of data from the web API, you had better ask the API engineer to use a normal encoding scheme.
How to replace escaped unicode characters with proper unicode characters?
Your regex extracts JSON strings from a webpage:
searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)
Those " characters you removed were actually significant. The \uxxxx escape syntax here is specific to JSON (and JavaScript); it is closely related to Python's escape syntax but different (not by much, but it matters when you have non-BMP code points).
You can trivially decode them as JSON if you keep the quotes in there:
searched_results = map(json.loads, re.findall(r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")", results_source))
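To illustrate the non-BMP difference, here is a small sketch assuming Python 3. The sample escape (U+1F600 written as a surrogate pair) and the variable names are chosen for this example and do not come from the page being scraped.

```python
import json

# JSON-escaped text for U+1F600, written as a UTF-16 surrogate pair
escaped = r"\ud83d\ude00"

# json.loads needs the surrounding quotes and joins the pair into one code point
via_json = json.loads('"' + escaped + '"')

# Python's unicode_escape codec decodes each escape on its own,
# leaving two lone surrogates that cannot be encoded as UTF-8
via_codec = escaped.encode("ascii").decode("unicode_escape")

print(len(via_json), len(via_codec))
```

The JSON result is a single character, while the codec result is two surrogate code points that fail as soon as you try to encode or print them.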
Better still would be to use a HTML library to parse the page. When using BeautifulSoup, you can get the data with:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(results_source, 'html.parser')
search_results = [json.loads(t.text)['ou'] for t in soup.select('.rg_meta')]
This loads the text contents of each <div class="rg_meta" ...> element as JSON data, and extracts the ou key from each of the resulting dictionaries. No regular expressions required.
Java doesn't decode passed string (with unicode)
In Java source code, \uD83D is an escape code: the compiler replaces it with one code unit.
If you see \uD83D in your database, it's not an escape code; it's the sequence of six individual characters '\' 'u' 'D' '8' '3' 'D'.
What's the right way to fix this and make sure you get the same output anyway?
One thing you must ask is why the text "\uD83D" got into the database in the first place. Text stored in a database should not be mangled in this way. It sounds like there is a bug at data entry.
If there's no way to fix the data entry, and you want to replace the text "\uD83D" with a single character just like the Java compiler would, that has already been covered in other questions; see for example Convert escaped Unicode character back to actual character.
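As a rough sketch of such a replacement, written in Python 3 to match the other Python examples on this page: the helper name unescape_u and the sample string are hypothetical, and only \uXXXX sequences are handled (no \n, \t or other escapes).

```python
import re

def unescape_u(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they name."""
    # Each match becomes one UTF-16 code unit, as the Java compiler would produce
    units = re.sub(r"\\u([0-9a-fA-F]{4})",
                   lambda m: chr(int(m.group(1), 16)), text)
    # Re-join surrogate pairs such as D83D + DE00 into a single code point
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

# The stored text holds backslash, 'u' and hex digits as plain characters
stored = r"Pi\u00e7arra \uD83D\uDE00"
result = unescape_u(stored)   # 'Piçarra 😀'
```

An unpaired surrogate in the input makes the final decode raise UnicodeDecodeError, which is usually preferable to silently storing broken text.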