Converting a \u Escaped Unicode String to ASCII

Converting a \u escaped Unicode string to ASCII

Use parse, but don't evaluate the results:

x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1

Convert a Unicode string to an escaped ASCII string

This goes back and forth to and from the \uXXXX format.

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

class Program {
    static void Main( string[] args ) {
        string unicodeString = "This function contains a unicode character pi (\u03a0)";

        Console.WriteLine( unicodeString );

        string encoded = EncodeNonAsciiCharacters( unicodeString );
        Console.WriteLine( encoded );

        string decoded = DecodeEncodedNonAsciiCharacters( encoded );
        Console.WriteLine( decoded );
    }

    static string EncodeNonAsciiCharacters( string value ) {
        StringBuilder sb = new StringBuilder();
        foreach( char c in value ) {
            if( c > 127 ) {
                // This character is outside the ASCII range, so emit it as \uXXXX
                string encodedValue = "\\u" + ((int) c).ToString( "x4" );
                sb.Append( encodedValue );
            }
            else {
                sb.Append( c );
            }
        }
        return sb.ToString();
    }

    static string DecodeEncodedNonAsciiCharacters( string value ) {
        // Match only hex digits, so int.Parse cannot throw on input such as \u00zz
        return Regex.Replace(
            value,
            @"\\u(?<Value>[0-9a-fA-F]{4})",
            m => {
                return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
            } );
    }
}

Outputs:

This function contains a unicode character pi (Π)

This function contains a unicode character pi (\u03a0)

This function contains a unicode character pi (Π)
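For comparison, the same round trip can be sketched in Python. The function names `encode_non_ascii` and `decode_non_ascii` are made up for this illustration, and the sketch deliberately rejects characters above U+FFFF, since a Python string iterates over code points rather than UTF-16 code units.

```python
import re

def encode_non_ascii(value: str) -> str:
    """Replace every non-ASCII character in the BMP with a \\uXXXX escape."""
    out = []
    for ch in value:
        code = ord(ch)
        if code > 0xFFFF:
            # Assumption of this sketch: astral characters are out of scope.
            raise ValueError("characters above U+FFFF are not handled here")
        out.append(ch if code <= 127 else "\\u%04x" % code)
    return "".join(out)

def decode_non_ascii(value: str) -> str:
    """Turn \\uXXXX escapes (exactly four hex digits) back into characters."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        value,
    )

original = "This function contains a unicode character pi (\u03a0)"
encoded = encode_non_ascii(original)
decoded = decode_non_ascii(encoded)

print(encoded)
print(decoded == original)  # True
```

Restricting the pattern to hex digits means malformed input such as `\u00zz` is left untouched instead of raising an error.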

Convert non-escaped unicode string to unicode

These are essentially UTF-16 code units, so the following would do (this approach is not very efficient, but I assume optimization isn't the main goal):

Regex.Replace(
"u0393u03a5u039du0391u0399u039au0391",
"u[0-9a-f]{4}",
m => "" + (char) int.Parse(m.Value.Substring(1), NumberStyles.AllowHexSpecifier)
)

This can't resolve the ambiguity of un-escaped "regular" characters in the string: dufface would effectively get turned into d + \uffac + e, which is probably not right. It will correctly handle surrogates, though (ud83dudc96 is 💖).

Using the technique in this answer is another option:

Regex.Unescape(@"u0393u03a5u039du0391u0399u039au0391".Replace(@"\", @"\\").Replace("u", @"\u"))

The extra \ escaping is there just in case the string should contain any backslashes already, which could be wrongly interpreted as escape sequences.
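A rough Python counterpart of the regex approach above, as a sketch only. The helper name `decode_bare_u` is hypothetical. Unlike C#, Python strings are not UTF-16, so the sketch round-trips through UTF-16 to merge a surrogate pair into one code point. It shares the ambiguity described above for words like dufface.

```python
import re

def decode_bare_u(text: str) -> str:
    """Decode 'uXXXX' runs (no backslash) as UTF-16 code units."""
    # Step 1: turn each uXXXX into the single code unit it names.
    units = re.sub(
        r"u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    # Step 2: round-trip through UTF-16 so that a high/low surrogate pair
    # becomes one code point; "surrogatepass" lets lone surrogates through.
    raw = units.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")

print(decode_bare_u("u0393u03a5u039du0391u0399u039au0391"))
print(decode_bare_u("ud83dudc96") == "\U0001F496")  # True
```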

Convert UTF-8 Unicode string to ASCII Unicode escaped String

This is the kind of simple code Jon Skeet had in mind in his comment:

final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
final char ch = in.charAt(i);
if (ch <= 127) out.append(ch);
else out.append("\\u").append(String.format("%04x", (int)ch));
}
System.out.println(out.toString());

As Jon said, surrogate pairs will be represented as a pair of \u escapes.
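The surrogate-pair behaviour can be made visible with a small Python sketch. Python strings hold code points, so the sketch encodes each character as UTF-16 to obtain the code units that Java's `char` would expose. The function name `escape_non_ascii` is illustrative, and the input is assumed to contain no lone surrogates.

```python
def escape_non_ascii(text: str) -> str:
    """Escape non-ASCII characters as \\uXXXX, one escape per UTF-16 code unit."""
    out = []
    for ch in text:
        if ord(ch) <= 127:
            out.append(ch)
            continue
        # Big-endian UTF-16: 2 bytes for a BMP character,
        # 4 bytes (a surrogate pair) for anything above U+FFFF.
        data = ch.encode("utf-16-be")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "big")
            out.append("\\u%04x" % unit)
    return "".join(out)

print(escape_non_ascii("šđčćasdf"))    # \u0161\u0111\u010d\u0107asdf
print(escape_non_ascii("\U0001F496"))  # \ud83d\udc96
```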

C++ convert ASCII escaped unicode string into utf8 string

(\u03a0 is the Unicode code point for GREEK CAPITAL LETTER PI whose UTF-8 encoding is 0xCE 0xA0)

You need to:

  1. Get the number 0x03a0 from the string "\u03a0": drop the backslash and the u and parse 03a0 as hex, into a wchar_t. Repeat until you get a (wide) string.
  2. Convert 0x3a0 into UTF-8. C++11 has a codecvt_utf8 that may help (note that it is deprecated as of C++17).
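The two steps can be illustrated in Python, with the UTF-8 bit layout written out by hand so that it maps easily onto C++. Both function names are hypothetical, and the sketch assumes a single well-formed \uXXXX escape as input.

```python
def parse_escape(escape: str) -> int:
    """Step 1: '\\u03a0' -> 0x03a0 (drop the backslash and 'u', parse hex)."""
    if not escape.startswith("\\u") or len(escape) != 6:
        raise ValueError("expected a \\uXXXX escape")
    return int(escape[2:], 16)

def utf8_encode(cp: int) -> bytes:
    """Step 2: encode one code point as UTF-8 by hand."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

cp = parse_escape("\\u03a0")
encoded = utf8_encode(cp)
print(hex(cp), encoded.hex())  # 0x3a0 cea0
```

The result matches the 0xCE 0xA0 encoding mentioned above, and agrees with Python's built-in `chr(cp).encode("utf-8")`.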

How do you convert unicode string to escapes in bash?

A pure bash method:

echo ãçé |
while read -n 1 u
do [[ -n "$u" ]] && printf '\\u%04x' "'$u"
done

The leading apostrophe in "'$u" is a printf convention: it tells printf to use the numeric value of the character that follows.

From the GNU man page online:

If the leading character of a numeric argument is ‘"’ or ‘'’ then its value is the numeric value of the immediately following character. Any remaining characters are silently ignored if the POSIXLY_CORRECT environment variable is set; otherwise, a warning is printed. For example, ‘printf "%d" "'a"’ outputs ‘97’ on hosts that use the ASCII character set, since ‘a’ has the numeric value 97 in ASCII.

That lets us pass the character to printf for numeric interpretations such as %d or %03o, or here, %04x.

The [[ -n "$u" ]] test is needed because echo ends its output with a newline; read consumes it as the line delimiter and leaves u empty, which printf would otherwise render as \u0000.

Output:

$: echo ãçé |
> while read -n 1 u
> do [[ -n "$u" ]] && printf '\\u%04x' "'$u"
> done
\u00e3\u00e7\u00e9

Without the empty-string check:

$: echo ãçé | while read -n 1 u; do printf '\\u%04x' "'$u";done
\u00e3\u00e7\u00e9\u0000

Convert escaped Unicode character back to actual character

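Try:

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

This comes from Apache Commons Lang. In recent releases the class is deprecated in favor of org.apache.commons.text.StringEscapeUtils from Apache Commons Text, which provides the same unescapeJava method.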

How to escape unicode special chars in string and write it to UTF encoded file

Another solution, not relying on the built-in repr() but rather implementing it from scratch:

import re

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)

print(enc)

Differences:

  • Encodes only using \u, never any other sequence, whereas repr() uses about a third of the alphabet (so for example the BEL character will be encoded as \u0007 rather than \a)
  • Upper-case encoding, as specified (\u00FC rather than \u00fc)
  • Does not handle unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
  • It does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
    enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
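One way the last two points might be combined, as a sketch only. The source gives no spec for characters outside plane 0, so the 8-digit \UXXXXXXXX form used here is an assumption borrowed from Python's own string syntax, and the function name `escape_all` is made up for the example.

```python
import re

def escape_all(text: str) -> str:
    """Escape everything except printable ASCII; the backslash is escaped too."""
    def repl(m):
        cp = ord(m[0])
        # Assumption: code points above U+FFFF use the 8-digit \UXXXXXXXX form.
        return "\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp
    # Printable ASCII from space to "~", minus the backslash itself.
    return re.sub(r"[^ -\[\]-~]", repl, text)

sample = "l\u00f6schen \\ \U0001F496"
enc = escape_all(sample)
print(enc)  # l\u00F6schen \u005C \U0001F496

# Every backslash in the output starts a \u or \U escape,
# so the unicode_escape codec can reverse it.
assert enc.encode("ascii").decode("unicode_escape") == sample
```

Escaping the backslash itself is what makes the output unambiguous: a pre-existing \u00FC in the input comes out as \u005Cu00FC rather than being mistaken for an escape.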

How to decode escaped Unicode characters?

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string that contains only ascii-characters '\', 'u', '0', '0', 'c', etc. is converted to bytes using some not-too-crazy 8-bit encoding (doesn't really matter which one, as long as it treats ASCII characters properly)
  • Use a decoder that interprets the '\u00c3' escapes as unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this unicode code point has the right byte representation when again encoded with ISO-8859-1/'latin-1', so...
  • encode it again with 'latin-1'
  • Decode it "properly" this time, as UTF-8
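The four steps can be followed one at a time with intermediate values. This is the same chain as above, only split up; the variable names are illustrative.

```python
broken = "\\u00c3\\u00a4"                      # only ASCII characters

as_bytes = broken.encode("latin-1")            # b'\\u00c3\\u00a4'
mojibake = as_bytes.decode("unicode_escape")   # 'Ã¤' (U+00C3, U+00A4)
raw_utf8 = mojibake.encode("latin-1")          # b'\xc3\xa4'
repaired = raw_utf8.decode("utf-8")            # 'ä'

print(repaired)
```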

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.

Convert Python escaped Unicode sequences to UTF-8

Small demo using Python 3. If you don't dump to JSON using ensure_ascii=False, non-ASCII will be written to JSON with Unicode escape codes. That doesn't affect the ability to load the JSON, but it is less readable in the .json file itself.
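The effect of ensure_ascii on its own can be seen with `json.dumps`, without a file or BeautifulSoup. This is a minimal sketch with illustrative variable names; the interactive session that follows shows the file-based version.

```python
import json

text = "50\u20ac"  # '50€'

escaped = json.dumps(text)                       # ensure_ascii defaults to True
readable = json.dumps(text, ensure_ascii=False)  # keeps the euro sign as-is

print(escaped)   # "50\u20ac"
print(readable)  # "50€"

# Both forms load back to the same string.
print(json.loads(escaped) == json.loads(readable))  # True
```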

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> html = '<element>50\u20ac</element>'
>>> html
'<element>50€</element>'
>>> soup = BeautifulSoup(html,'html')
>>> soup.find('element').text
'50€'
>>> import json
>>> with open('out.json','w',encoding='utf8') as f:
... json.dump(soup.find('element').text,f,ensure_ascii=False)
...
>>> ^Z

Content of out.json (UTF-8-encoded):

"50€"

