Convert escaped Unicode characters back to actual characters

Try:

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

from Apache Commons Lang.
Convert escaped Unicode character back to actual character in PostgreSQL
One old trick is to use the parser for this purpose:
postgres=# select e'Telefon\u00ED kontakty';
?column?
-------------------
Telefoní kontakty
(1 row)
CREATE OR REPLACE FUNCTION public.unescape(text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
EXECUTE format('SELECT e''%s''', $1) INTO result;
RETURN result;
END;
$function$
It works, but it is vulnerable to SQL injection - so you should sanitize the input text first!
Here is a less readable but safe version - you have to manually specify one character as the escape symbol:
CREATE OR REPLACE FUNCTION public.unescape(text, text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
EXECUTE format('SELECT U&%s UESCAPE %s',
quote_literal(replace($1, '\u','^')),
quote_literal($2)) INTO result;
RETURN result;
END;
$function$
Result
postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
unescape
-----------------
Odpovědná osoba
(1 row)
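If the unescaping can happen outside the database, the same \uXXXX sequences can be decoded client-side instead. A minimal Python sketch (the unescape name is just for illustration, not part of any library):

```python
import re

def unescape(s):
    # Replace each \uXXXX sequence with the character it encodes.
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)), s)

print(unescape('Odpov\\u011Bdn\\u00E1 osoba'))  # Odpovědná osoba
```

This avoids dynamic SQL entirely, so there is nothing to sanitize.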
How can I convert a String in ASCII(Unicode Escaped) to Unicode(UTF-8) if I am reading from a file?
final String str = new String("Diogo Pi\u00e7arra - Tu E Eu".getBytes(),
Charset.forName("UTF-8"));
Try using the getBytes() method without parameters (the default charset will be used here). But it's not necessary - the conversion is not required at all:

final String str = "Diogo Pi\u00e7arra - Tu E Eu";

You'll get the same result.
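The same distinction exists in other languages: a \u escape in source code is resolved by the parser at compile time, while the same six characters read from a file must be decoded explicitly. A quick Python sketch of the difference:

```python
# In source code the parser resolves \u00e7 into the single character 'ç'.
literal = "Diogo Pi\u00e7arra - Tu E Eu"

# Text read from a file arrives as the literal characters '\', 'u', '0', '0', 'e', '7'.
from_file = r"Diogo Pi\u00e7arra - Tu E Eu"

# Only the second form needs an explicit decoding step.
decoded = from_file.encode('ascii').decode('unicode_escape')
print(decoded == literal)  # True
```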
How to escape unicode special chars in string and write it to UTF encoded file
Another solution, not relying on the built-in repr() but rather implementing it from scratch:
orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'
enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)
print(enc)
Differences:
- Encodes only using \u, never any other sequence, whereas repr() uses about a third of the alphabet (so for example the BEL character will be encoded as \u0007 rather than \a)
- Upper-case encoding, as specified (\u00FC rather than \u00fc)
- Does not handle Unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
- Does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:

enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
Automatically escape unicode characters
After digging into some documentation about iconv, I think you can accomplish this using only the base package. But you need to pay extra attention to the encoding of the string.
On a system with UTF-8 encoding:
> x <- "你好世界"
> stri_escape_unicode(x)
[1] "\\u4f60\\u597d\\u4e16\\u754c"
# use big endian
> iconv(x, "UTF-8", "UTF-16BE", toRaw=T)
[[1]]
[1] 4f 60 59 7d 4e 16 75 4c
> x <- "•"
> iconv(x, "UTF-8", "UTF-16BE", toRaw=T)
[[1]]
[1] 20 22
But if you are on a system with latin1 encoding, things may go wrong.
> x <- "•"
> y <- "\u2022"
> identical(x, y)
[1] FALSE
> stri_escape_unicode(x)
[1] "\\u0095" # <- oops!
# culprit
> Encoding(x)
[1] "latin1"
# and it causes problem for iconv
> iconv(x, Encoding(x), "Unicode")
Error in iconv(x, Encoding(x), "Unicode") :
unsupported conversion from 'latin1' to 'Unicode' in codepage 1252
> iconv(x, Encoding(x), "UTF-16BE")
Error in iconv(x, Encoding(x), "UTF-16BE") :
embedded nul in string: '\0•'
It is safer to cast the string into UTF-8 before converting to Unicode:
> iconv(enc2utf8(enc2native(x)), "UTF-8", "UTF-16BE", toRaw=T)
[[1]]
[1] 20 22
EDIT: This may cause some problems for strings already in UTF-8 encoding on some particular systems. Maybe it's safer to check the encoding before conversion.
> Encoding("•")
[1] "latin1"
> enc2native("•")
[1] "•"
> enc2native("\u2022")
[1] "•"
# on a Windows with default latin1 encoding
> Encoding("测试")
[1] "UTF-8"
> enc2native("测试")
[1] "" # <- BAD!
For some characters or languages, UTF-16 may not be enough, so you should probably use UTF-32, since:

The UTF-32 form of a character is a direct representation of its codepoint.

Based on the above trial and error, here is probably a safer escape function we can write:
unicode_escape <- function(x, endian="big") {
  if (Encoding(x) != 'UTF-8') {
    x <- enc2utf8(enc2native(x))
  }
  to.enc <- if (endian == 'big') 'UTF-32BE' else 'UTF-32LE'
  bytes <- strtoi(unlist(iconv(x, "UTF-8", to.enc, toRaw=T)), base=16)
  # there may be some better way to do this
  runes <- matrix(bytes, nrow=4)
  escaped <- apply(runes, 2, function(rb) {
    # put the bytes in big-endian order before printing them as hex
    if (endian != 'big') rb <- rev(rb)
    nonzero.bytes <- rb[rb > 0]
    if (length(nonzero.bytes) > 1) {
      # convert back to hex
      paste("\\u", paste(as.hexmode(nonzero.bytes), collapse=""), sep="")
    } else {
      rawToChar(as.raw(nonzero.bytes))
    }
  })
  paste(escaped, collapse="")
}
Tests:
> unicode_escape("•••ERROR!!!•••")
[1] "\\u2022\\u2022\\u2022ERROR!!!\\u2022\\u2022\\u2022"
> unicode_escape("Hello word! 你好世界!")
[1] "Hello word! \\u4f60\\u597d\\u4e16\\u754c!"
> "\u4f60\u597d\u4e16\u754c"
[1] "你好世界"