Converting a \u escaped Unicode string to ASCII
Use parse, but don't evaluate the results:
x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1
Convert a Unicode string to an escaped ASCII string
This goes back and forth to and from the \uXXXX format.
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;
class Program {
static void Main( string[] args ) {
string unicodeString = "This function contains a unicode character pi (\u03a0)";
Console.WriteLine( unicodeString );
string encoded = EncodeNonAsciiCharacters(unicodeString);
Console.WriteLine( encoded );
string decoded = DecodeEncodedNonAsciiCharacters( encoded );
Console.WriteLine( decoded );
}
static string EncodeNonAsciiCharacters( string value ) {
StringBuilder sb = new StringBuilder();
foreach( char c in value ) {
if( c > 127 ) {
// This character is too big for ASCII
string encodedValue = "\\u" + ((int) c).ToString( "x4" );
sb.Append( encodedValue );
}
else {
sb.Append( c );
}
}
return sb.ToString();
}
static string DecodeEncodedNonAsciiCharacters( string value ) {
return Regex.Replace(
value,
// Match exactly four hex digits; other letters would make int.Parse throw
@"\\u(?<Value>[0-9a-fA-F]{4})",
m => {
return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
} );
}
}
Outputs:
This function contains a unicode character pi (Π)
This function contains a unicode character pi (\u03a0)
This function contains a unicode character pi (Π)
Convert non-escaped unicode string to unicode
These are essentially UTF-16 code units, so this would do (this approach is not very efficient, but I assume optimization isn't the main goal):
Regex.Replace(
"u0393u03a5u039du0391u0399u039au0391",
"u[0-9a-f]{4}",
m => "" + (char) int.Parse(m.Value.Substring(1), NumberStyles.AllowHexSpecifier)
)
This can't deal with the ambiguity of un-escaped "regular" characters in the string: dufface would effectively get turned into d + \uffac + e, which is probably not right. It will correctly handle surrogates, though (ud83dudc96 is 💖, U+1F496).
Using the technique in this answer is another option:
Regex.Unescape(@"u0393u03a5u039du0391u0399u039au0391".Replace(@"\", @"\\").Replace("u", @"\u"))
The extra \ escaping is there just in case the string should contain any backslashes already, which could be wrongly interpreted as escape sequences.
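For comparison, the same idea can be sketched in Python. This is only an illustration under stated assumptions, not part of the original answer: the helper name decode_bare_escapes is hypothetical, and the input is assumed to contain well-formed surrogate pairs (a lone surrogate would make the final decode step raise an error).

```python
import re

def decode_bare_escapes(text: str) -> str:
    """Turn bare 'uXXXX' runs (no backslash) into the characters they name."""
    # Each match is one UTF-16 code unit, so chr() may yield a lone surrogate here.
    units = re.sub(r"u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text)
    # Round-trip through UTF-16 so a high/low surrogate pair merges into one code point.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

greek = decode_bare_escapes("u0393u03a5u039du0391u0399u039au0391")
print(ascii(greek))
# A surrogate pair becomes a single character outside plane 0.
print(decode_bare_escapes("ud83dudc96") == "\U0001F496")
# The ambiguity mentioned above: "dufface" is read as d + \uffac + e.
print(decode_bare_escapes("dufface") == "d\uffac" + "e")
```

Unlike the C# version, Python strings are sequences of code points rather than UTF-16 code units, which is why the extra UTF-16 round trip is needed to join surrogates.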
Convert UTF-8 Unicode string to ASCII Unicode escaped String
This is the kind of simple code Jon Skeet had in mind in his comment:
final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
final char ch = in.charAt(i);
if (ch <= 127) out.append(ch);
else out.append("\\u").append(String.format("%04x", (int)ch));
}
System.out.println(out.toString());
As Jon said, surrogate pairs will be represented as a pair of \u escapes.
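A rough Python counterpart makes the surrogate-pair behaviour explicit. Java's char is already a UTF-16 code unit, whereas Python iterates over code points, so this sketch encodes each non-ASCII character to UTF-16 first. The function name escape_non_ascii is hypothetical, and the input is assumed to contain no lone surrogates.

```python
import struct

def escape_non_ascii(text: str) -> str:
    """Escape each non-ASCII character as \\uXXXX, one escape per UTF-16 code unit."""
    out = []
    for ch in text:
        if ord(ch) <= 127:
            out.append(ch)
            continue
        # Characters above U+FFFF encode to two 16-bit units (a surrogate pair).
        raw = ch.encode("utf-16-be")
        for (unit,) in struct.iter_unpack(">H", raw):
            out.append("\\u%04x" % unit)
    return "".join(out)

print(escape_non_ascii("\u0161\u0111\u010d\u0107asdf"))  # \u0161\u0111\u010d\u0107asdf
print(escape_non_ascii("\U0001F496"))                    # \ud83d\udc96
```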
C++ convert ASCII escaped unicode string into utf8 string
(\u03a0 is the Unicode code point for GREEK CAPITAL LETTER PI whose UTF-8 encoding is 0xCE 0xA0)
You need to:
- Get the number 0x03a0 from the string "\u03a0": drop the backslash and the u and parse 03a0 as hex, into a wchar_t. Repeat until you get a (wide) string.
- Convert 0x03a0 into UTF-8. C++11 has a codecvt_utf8 that may help (note that it is deprecated as of C++17).
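The two steps can be sketched in Python to show what the conversion has to do at the bit level. This is an illustrative sketch rather than the C++ solution: the helper names utf8_encode and escaped_to_utf8 are hypothetical, and surrogate pairs are assumed not to occur in the input.

```python
import re

def utf8_encode(cp: int) -> bytes:
    """Pack one code point into UTF-8 by hand (surrogates are not checked)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

def escaped_to_utf8(text: str) -> bytes:
    out = bytearray()
    pos = 0
    # Step 1: find each backslash-u escape and parse its four hex digits.
    for m in re.finditer(r"\\u([0-9a-fA-F]{4})", text):
        out += text[pos:m.start()].encode("utf-8")
        # Step 2: convert the parsed number to UTF-8 bytes.
        out += utf8_encode(int(m.group(1), 16))
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)

print(escaped_to_utf8(r"\u03a0").hex())  # cea0
```

The hand-written encoder agrees with Python's built-in str.encode("utf-8") for non-surrogate code points; in real code the built-in (or, in C++, a library routine) is the better choice.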
How do you convert unicode string to escapes in bash?
An all-bash method:
echo ãçé |
while read -n 1 u
do [[ -n "$u" ]] && printf '\\u%04x' "'$u"
done
That leading apostrophe is a printf formatting/interpretation guide.
From the GNU man page online:
If the leading character of a numeric argument is ‘"’ or ‘'’ then its value is the numeric value of the immediately following character. Any remaining characters are silently ignored if the POSIXLY_CORRECT environment variable is set; otherwise, a warning is printed. For example, ‘printf "%d" "'a"’ outputs ‘97’ on hosts that use the ASCII character set, since ‘a’ has the numeric value 97 in ASCII.
That lets us pass the character to printf for numeric interpretations such as %d or %03o, or here, %04x.
The [[ -n "$u" ]] test is there because echo appends a trailing newline, which read returns as an empty string; without the test it would be printed as \u0000.
Output:
$: echo ãçé |
> while read -n 1 u
> do [[ -n "$u" ]] && printf '\\u%04x' "'$u"
> done
\u00e3\u00e7\u00e9
Without the empty-string check:
$: echo ãçé | while read -n 1 u; do printf '\\u%04x' "'$u";done
\u00e3\u00e7\u00e9\u0000
Convert escaped Unicode character back to actual character
Try unescapeJava from Apache Commons Lang:
str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);
How to escape unicode special chars in string and write it to UTF encoded file
Another solution, not relying on the built-in repr() but rather implementing it from scratch:
import re
orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'
enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)
print(enc)
Differences:
- Encodes only using \u, never any other sequence, whereas repr() also emits forms such as \n, \t and \xNN (so for example the BEL character will be encoded as \u0007 rather than \x07)
- Upper-case encoding, as specified (\u00FC rather than \u00fc)
- Does not handle Unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
- Does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
How to decode escaped Unicode characters?
The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):
("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
Outputs:
'ä'
This works as follows:
- The string that contains only ASCII characters '\', 'u', '0', '0', 'c', etc. is converted to bytes using some not-too-crazy 8-bit encoding (doesn't really matter which one, as long as it treats ASCII characters properly)
- Use a decoder that interprets the '\u00c3' escapes as Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this Unicode code point has the right byte representation when again encoded with ISO-8859-1/'latin-1', so...
- Encode it again with 'latin-1'
- Decode it "properly" this time, as UTF-8
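To make the four steps easier to follow, here is the same chain split into intermediate values. The variable names (broken, step1 through step4) are introduced only for this illustration.

```python
broken = "\\u00c3\\u00a4"                 # 12 ASCII characters: \ u 0 0 c 3 \ u 0 0 a 4

step1 = broken.encode("latin-1")          # same characters, now as bytes
step2 = step1.decode("unicode_escape")    # two code points: U+00C3, U+00A4
step3 = step2.encode("latin-1")           # two bytes: 0xC3 0xA4
step4 = step3.decode("utf-8")             # one code point: U+00E4

print(len(broken))                        # 12
print([hex(ord(c)) for c in step2])       # ['0xc3', '0xa4']
print(step3.hex())                        # c3a4
print(hex(ord(step4)))                    # 0xe4
```

The bytes 0xC3 0xA4 are the UTF-8 encoding of U+00E4 ('ä'), which is why the last step yields the intended character.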
Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
Convert Python escaped Unicode sequences to UTF-8
Small demo using Python 3. If you don't dump to JSON using ensure_ascii=False, non-ASCII will be written to JSON with Unicode escape codes. That doesn't affect the ability to load the JSON, but it is less readable in the .json file itself.
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> html = '<element>50\u20ac</element>'
>>> html
'<element>50€</element>'
>>> soup = BeautifulSoup(html,'html')
>>> soup.find('element').text
'50€'
>>> import json
>>> with open('out.json','w',encoding='utf8') as f:
... json.dump(soup.find('element').text,f,ensure_ascii=False)
...
>>> ^Z
Content of out.json (UTF-8-encoded):
"50€"