Convert a Unicode String to an Escaped ASCII String

The program below converts a string to the \uXXXX escape format and back.

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

class Program {
    static void Main( string[] args ) {
        string unicodeString = "This function contains a unicode character pi (\u03a0)";

        Console.WriteLine( unicodeString );

        string encoded = EncodeNonAsciiCharacters( unicodeString );
        Console.WriteLine( encoded );

        string decoded = DecodeEncodedNonAsciiCharacters( encoded );
        Console.WriteLine( decoded );
    }

    static string EncodeNonAsciiCharacters( string value ) {
        StringBuilder sb = new StringBuilder();
        foreach( char c in value ) {
            if( c > 127 ) {
                // This character is outside the ASCII range; emit it as \uXXXX
                string encodedValue = "\\u" + ((int) c).ToString( "x4" );
                sb.Append( encodedValue );
            }
            else {
                sb.Append( c );
            }
        }
        return sb.ToString();
    }

    static string DecodeEncodedNonAsciiCharacters( string value ) {
        // Match \u followed by exactly four hex digits
        return Regex.Replace(
            value,
            @"\\u(?<Value>[a-fA-F0-9]{4})",
            m => {
                return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
            } );
    }
}

Outputs:

This function contains a unicode character pi (Π)

This function contains a unicode character pi (\u03a0)

This function contains a unicode character pi (Π)

How to convert an ascii string with escape characters to its unicode equivalent

'\x9a' doesn’t have any escape characters in it. The escape is part of the string literal, and the string it represents holds a single byte: [0x9a]. The encoding might be Windows-1252, because that’s common and has š at 0x9a, but you really have to know what it is. To decode as Windows-1252 (this is Python 2, where str is a byte string):

good_string = bad_string.decode('cp1252')

If what you actually have is '\\x9a' (one backslash, three other characters), then you’ll need to convert it to the above form first. The right way to do this depends on how the escapes managed to get there in the first place. If it’s from a Python string literal, use string-escape first:

good_string = bad_string.decode('string-escape').decode('cp1252')
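The calls above exist only in Python 2. As a rough Python 3 counterpart, the sketch below starts from a bytes object and uses the documented unicode_escape and latin-1 codecs in place of string-escape. It assumes the escaped input contains only \xNN escapes and that the data really is Windows-1252; the variable names are made up for illustration.

```python
# Python 3 sketch: bytes and text are separate types, so the input is bytes.
raw = b'\x9a'                    # a single byte, 0x9a
text = raw.decode('cp1252')      # assumption: the data is Windows-1252

# Escaped form: four ASCII bytes (backslash, 'x', '9', 'a').
escaped = b'\\x9a'
# unicode_escape turns the escape into U+009A; latin-1 maps that back to byte 0x9a.
unescaped = escaped.decode('unicode_escape').encode('latin-1')
text_from_escaped = unescaped.decode('cp1252')

# Both routes should end at U+0161 (LATIN SMALL LETTER S WITH CARON).
print(text == text_from_escaped == '\u0161')
```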

Convert non-escaped unicode string to unicode

These are essentially UTF-16 code units, so this would do (this approach is not very efficient, but I assume optimization isn't the main goal):

Regex.Replace(
    "u0393u03a5u039du0391u0399u039au0391",
    "u[0-9a-f]{4}",
    m => "" + (char) int.Parse(m.Value.Substring(1), NumberStyles.AllowHexSpecifier)
)

This can't deal with the ambiguity of un-escaped "regular" characters in the string: dufface would effectively get turned into d + \uffac + e, which is probably not right. It will correctly handle surrogates, though (ud83dudc96 is 💖).

Another option is to use Regex.Unescape:

Regex.Unescape(@"u0393u03a5u039du0391u0399u039au0391".Replace(@"\", @"\\").Replace("u", @"\u"))

The extra \ escaping is there just in case the string should contain any backslashes already, which could be wrongly interpreted as escape sequences.
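For comparison, here is a minimal Python sketch of the same regex idea, including the dufface ambiguity. The function name decode_bare_u is made up for illustration. Unlike C#, Python's chr() produces lone surrogates rather than UTF-16 code units, so the sketch adds a UTF-16 round trip with the surrogatepass error handler to join pairs; that step is an assumption of this sketch and has no counterpart in the C# code.

```python
import re

def decode_bare_u(text):
    """Turn each 'u' + four lowercase hex digits into the matching character."""
    # Same pattern as the C# version: it cannot tell escapes from ordinary text.
    units = re.sub(r'u([0-9a-f]{4})', lambda m: chr(int(m.group(1), 16)), text)
    # Join surrogate pairs into single code points via a UTF-16 round trip.
    return units.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')

greek = decode_bare_u('u0393u03a5u039du0391u0399u039au0391')
heart = decode_bare_u('ud83dudc96')   # surrogate pair, joined to U+1F496
oops = decode_bare_u('dufface')       # ambiguity: 'd' + U+FFAC + 'e'

print(len(greek), len(heart), len(oops))
```

An unpaired surrogate in the input would make the final decode raise UnicodeDecodeError, which is arguably the right outcome for malformed data.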

How can I convert a String in ASCII(Unicode Escaped) to Unicode(UTF-8) if I am reading from a file?

final String str = new String("Diogo Pi\u00e7arra - Tu E Eu".getBytes(), 
Charset.forName("UTF-8"));

Try calling getBytes() without parameters, as above (the default charset is used). But this round trip is not necessary, because the Java compiler already translates the \u00e7 escape in the string literal:

final String str = "Diogo Pi\u00e7arra - Tu E Eu";

You'll get the same result.

How to escape unicode special chars in string and write it to UTF encoded file

Another solution, not relying on the built-in repr() but rather implementing it from scratch:

import re

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

# Replace every character outside printable ASCII (space through ~) with \uXXXX
enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)

print(enc)

Differences:

  • Encodes only using \u, never any other sequence, whereas repr() uses about a third of the alphabet (so for example the BEL character will be encoded as \u0007 rather than \a)
  • Upper-case encoding, as specified (\u00FC rather than \u00fc)
  • Does not handle unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
  • It does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
    enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
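The last two points can be combined into one sketch. The original leaves the representation of characters outside plane 0 unspecified, so the version below assumes they should be written as UTF-16 surrogate pairs, the convention Java and JavaScript use. The function name escape_non_ascii is hypothetical.

```python
import re

def escape_non_ascii(text):
    """Escape everything outside printable ASCII, plus the backslash itself."""
    def repl(m):
        cp = ord(m[0])
        if cp <= 0xFFFF:
            return '\\u%04X' % cp
        # Assumption: represent higher code points as a UTF-16 surrogate pair.
        cp -= 0x10000
        return '\\u%04X\\u%04X' % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
    # Same character class as above, with '[' escaped for readability.
    return re.sub(r'[^ -\[\]-~]', repl, text)

print(escape_non_ascii('gr\u00fc\u00df \\ \U0001F604'))
# prints: gr\u00FC\u00DF \u005C \uD83D\uDE04
```

Because every backslash in the input becomes \u005C, any backslash left in the output starts a real escape, which keeps the encoding unambiguous.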

Convert Unicode to escaped Unicode programmatically

It seems you're looking for an escaping that first converts a Unicode code point (an integer value up to 0x10FFFF) to its UTF-16 encoding (one or two 16-bit values), which is the encoding Java uses internally for strings.

Each 16-bit value is then written with the \uXXXX escape syntax used by Java and JavaScript.

public static String encodeCodepoint(int codePoint) {
    // Split the code point into one or two UTF-16 code units
    char[] chars = Character.toChars(codePoint);
    StringBuilder sb = new StringBuilder();
    for (char ch : chars) {
        sb.append(String.format("\\u%04X", (int) ch));
    }
    return sb.toString();
}

The following code:

System.out.println(encodeCodepoint(0x1f604));

outputs:

\uD83D\uDE04

How to decode escaped Unicode characters?

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string, which contains only the ASCII characters '\', 'u', '0', '0', 'c', etc., is converted to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly).
  • A decoder then interprets each '\u00c3'-style escape as a Unicode code point, here U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code this is nonsense, but this code point has the right byte representation when encoded again with ISO-8859-1/'latin-1', so...
  • ...encode it again with 'latin-1'.
  • Decode it "properly" this time, as UTF-8.
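The four steps can be traced one at a time. The sketch below only splits the chained expression above into named intermediate values; the variable names are made up for illustration.

```python
escaped = "\\u00c3\\u00a4"                 # 12 ASCII characters

step1 = escaped.encode('latin-1')          # same characters, now as bytes
step2 = step1.decode('unicode_escape')     # two characters: U+00C3, U+00A4
step3 = step2.encode('latin-1')            # two bytes: 0xC3 0xA4
step4 = step3.decode('utf-8')              # one character: U+00E4

# ascii() keeps the trace printable on any console encoding.
for step in (step1, step2, step3, step4):
    print(ascii(step))
```

Step 3 is where the repair happens: latin-1 maps code points U+0000 to U+00FF directly onto bytes 0x00 to 0xFF, so the misread characters turn back into the original UTF-8 bytes.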

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.

Converting a \u escaped Unicode string to ASCII

Use parse, but don't evaluate the results:

x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1

C++ convert ASCII escaped unicode string into utf8 string

(\u03a0 is the Unicode code point for GREEK CAPITAL LETTER PI whose UTF-8 encoding is 0xCE 0xA0)

You need to:

  1. Get the number 0x03a0 from the string "\u03a0": drop the backslash and the u, and parse 03a0 as hex into a wchar_t. Repeat for each escape until you have a (wide) string.
  2. Convert 0x3a0 into UTF-8. C++11 has std::codecvt_utf8, which may help, though it has been deprecated since C++17.
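The two steps are language-independent, so here they are sketched in Python with the UTF-8 bit packing written out by hand instead of using a library converter. The function names parse_escape and utf8_encode are hypothetical, and the sketch does not reject surrogates or values above 0x10FFFF.

```python
def parse_escape(escape):
    # Step 1: drop the leading backslash and 'u', parse the rest as hex.
    return int(escape[2:], 16)

def utf8_encode(cp):
    # Step 2: pack the code point's bits into one to four UTF-8 bytes.
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

cp = parse_escape('\\u03a0')
print(hex(cp), utf8_encode(cp).hex())   # prints: 0x3a0 cea0
```

The same shifts and masks carry over to C++ directly, with the bytes appended to a std::string.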

