How to Convert Unicode Escape Sequences to Unicode Characters in a .NET String

How do I convert Unicode escape sequences to Unicode characters in a .NET string?

The answer is simple and works well with strings up to at least several thousand characters.

Example 1:

Regex rx = new Regex( @"\\[uU]([0-9A-Fa-f]{4})" );
result = rx.Replace( result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString() );

Example 2:

Regex rx = new Regex( @"\\[uU]([0-9A-Fa-f]{4})" );
result = rx.Replace( result, delegate (Match match) { return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString(); } );

Both examples need the System.Globalization and System.Text.RegularExpressions namespaces and assume 'result' is a string that already holds the text to convert. The first example makes the replacement with a lambda expression (C# 3.0); the second uses an anonymous method (the delegate syntax), which also works with C# 2.0.

To break down what's going on here, first we create a regular expression that matches a backslash, a 'u' or 'U', and four hexadecimal digits:

new Regex( @"\\[uU]([0-9A-Fa-f]{4})" );

Then we call Replace() with the string 'result' and a callback (a lambda expression in the first example, an anonymous method in the second; a regular named method would also work) that converts each match the regular expression finds in the string.

The Unicode escape is processed like this:

((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString()

Get the string representing the number part of the escape (skip the first two characters).

match.Value.Substring(2)

Parse that string using Int32.Parse(), which takes the string and the number format it should expect, in this case a hex number:

NumberStyles.HexNumber

Then we cast the resulting number to a Unicode character:

(char)

And finally we call ToString() on the character, which gives us its string representation; that is the value passed back to Replace():

.ToString()

Note: Instead of grabbing the text to be converted with a Substring call, you could use the match parameter's GroupCollection and the subexpression (capture group) in the regular expression to capture just the number (for example '2320'), but that's more complicated and less readable.
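For comparison, here is a minimal sketch of that capture-group variant. The class and method names (EscapeDecoder, Decode) are made up for illustration, and the pattern is assumed to accept both upper- and lowercase hex digits.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class EscapeDecoder
{
    // Group 1 captures the four hex digits that follow \u or \U.
    private static readonly Regex Rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces every escape with the character it encodes.
    public static string Decode(string input)
    {
        return Rx.Replace(input, match =>
            ((char)int.Parse(match.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }
}
```

With this in place, EscapeDecoder.Decode(@"caf\u00e9") returns "café". Text that contains no escapes passes through unchanged.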

Replace Unicode escape sequences in a string

You could use a regular expression to parse the file:

private static readonly Regex _regex = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);

// Replaces each \uXXXX escape with the character it encodes.
public string Decoder(string value)
{
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}

And then:

string data = Decoder(File.ReadAllText("test.txt"));

Convert Unicode escape sequences to string

Your escape sequences do not start with a \ (as they would in "\u00fd"), so your regex should be just

"[uU]([0-9A-Fa-f]{4})"

...

VB.NET, I can't convert Unicode escape sequences to text

You can use Regex.Unescape.

For example,

Dim s = "sa3444444d4ds\u0040outllok.com"
Console.WriteLine(Regex.Unescape(s))

outputs:

sa3444444d4ds@outllok.com

Credit to Tim Patrick for showing this in the Visual Studio Magazine article Overcoming Escape Sequence Envy in Visual Basic and C#.
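The same call is available from C#. The sketch below (the class and method names are made up for illustration) also shows one thing to watch for: Regex.Unescape converts every escape it recognizes, not only \uXXXX, so it fits best when the input contains no other backslashes you want to keep.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    public static void Run()
    {
        // The \u0040 escape becomes '@'.
        Console.WriteLine(Regex.Unescape(@"sa3444444d4ds\u0040outllok.com"));

        // Other escapes are converted too: the two characters \ and t become one tab.
        Console.WriteLine(Regex.Unescape(@"a\tb") == "a\tb"); // True
    }
}
```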

Convert non-escaped unicode string to unicode

These are essentially UTF-16 code units, so this would do (this approach is not very efficient, but I assume optimization isn't the main goal):

Regex.Replace(
    "u0393u03a5u039du0391u0399u039au0391",
    "u[0-9a-f]{4}",
    m => "" + (char) int.Parse(m.Value.Substring(1), NumberStyles.AllowHexSpecifier)
)

This can't deal with the ambiguity of unescaped "regular" characters in the string: dufface would effectively get turned into d + \uffac + e, which is probably not right. It will correctly handle surrogates, though (ud83dudc96 becomes 💖, U+1F496).
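Both behaviours can be checked with a small wrapper around the same Regex.Replace call. This is only a sketch; the names BareEscapeDecoder and Decode are hypothetical.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class BareEscapeDecoder
{
    // Decodes backslash-less sequences such as "u0393" one UTF-16 code unit at a time.
    public static string Decode(string input)
    {
        return Regex.Replace(
            input,
            "u[0-9a-f]{4}",
            m => ((char)int.Parse(m.Value.Substring(1), NumberStyles.AllowHexSpecifier)).ToString());
    }
}
```

Decode("ud83dudc96") yields the two code units D83D and DC96, which together form the single code point U+1F496. Decode("dufface") yields "d", then the character U+FFAC, then "e", because the pattern cannot tell an ordinary "u" followed by hex-looking letters from an escape.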

Using the technique in this answer is another option:

Regex.Unescape(@"u0393u03a5u039du0391u0399u039au0391".Replace(@"\", @"\\").Replace("u", @"\u"))

The extra \ escaping is there just in case the string should contain any backslashes already, which could be wrongly interpreted as escape sequences.

HOWTO : convert unicode character representation in string to the actual unicode character

You want to have a Char holding the value of the private-use code point U+F641.

You can do so by parsing it as the hexadecimal value it represents:

var input = "f641";
int p = int.Parse(input, System.Globalization.NumberStyles.HexNumber); // 63041

And then convert it to a Char:

char c = (char)p;

Depending on the range of possible code points, you may not have enough space in a char to store the code point, so as @Panagiotis indicates, use Char.ConvertFromUtf32(int):

string chars = Char.ConvertFromUtf32(p);

But then you'll have a string, not a single char.
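A minimal sketch of the difference, using a hypothetical helper named FromHex: a code point in the Basic Multilingual Plane comes back as a one-char string, while a code point above U+FFFF comes back as a surrogate pair of two chars.

```csharp
using System.Globalization;

public static class CodePointDemo
{
    // Hypothetical helper: turns hex text such as "f641" into a string holding that code point.
    public static string FromHex(string hex)
    {
        int codePoint = int.Parse(hex, NumberStyles.HexNumber);

        // Throws ArgumentOutOfRangeException for surrogate values and for values above 0x10FFFF.
        return char.ConvertFromUtf32(codePoint);
    }
}
```

Here FromHex("f641").Length is 1 and FromHex("1F496").Length is 2, which is why the cast to a single char only works for code points up to U+FFFF.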

how to parse string containing Unicode ID's as well as plain text for display in datagrid view

You can use Regex.Unescape() to convert Unicode-escaped characters (\uXXXX) in a string to the characters they represent.

If you receive \U instead of \u, you also need to perform that substitution, since \U is not recognized as a valid escape sequence.

Dim input As String = "Castle: \Ud83d\Udc40Jerusal\U00e9m.Miles"
Dim result As String = Regex.Unescape(input.Replace("\U", "\u"))

This prints (the result may depend on the font used):

Castle: 👀Jerusalém.Miles

As a note, you might also have used the wrong encoding when you decoded the input stream.
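For C# callers, the same substitution could be sketched as follows. UpperUDecoder and Decode are hypothetical names, and the sketch assumes the input contains no literal backslash followed by U that should be preserved (a Windows path such as C:\Users would be affected).

```csharp
using System.Text.RegularExpressions;

public static class UpperUDecoder
{
    // Hypothetical helper: rewrites \U to \u, then lets Regex.Unescape convert the escapes.
    public static string Decode(string input)
    {
        return Regex.Unescape(input.Replace(@"\U", @"\u"));
    }
}
```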

How to convert string with Unicode literal characters in it to a Unicode string

There are a number of ways to do this; this one might work for you.

Disclaimer: this assumes your string looks like this in your database: Universidad de M\u00e1laga

var test1 = "Universidad de M\\u00e1laga";  
var test2 = Regex.Unescape(test1);
Console.WriteLine(test1);
Console.WriteLine(test2);

Output

Universidad de M\u00e1laga
Universidad de Málaga

Note: This may point to an overall structural or design problem with the whole setup, though who knows what the APIs give you back.



