How to convert a Unicode character to its ASCII equivalent
Okay, let's elaborate. Both csgero and bzlm pointed in the right direction.
Because of bzlm's reply I looked up the Windows-1252 page on Wikipedia and found that it's called a code page. The Wikipedia article on code pages states the following:
No formal standard existed for these ‘extended character sets’; IBM merely referred to the variants as code pages, as it had always done for variants of EBCDIC encodings.
This led me to codepage 437:
In ASCII-compatible code pages, the lower 128 characters maintained their standard US-ASCII values, and different pages (or sets of characters) could be made available in the upper 128 characters. DOS computers built for the North American market, for example, used code page 437, which included accented characters needed for French, German, and a few other European languages, as well as some graphical line-drawing characters.
So, codepage 437 was the codepage I was calling 'extended ASCII'. It has ê as character 136, and the other characters I looked up match as well.
csgero came up with the Encoding.GetEncoding() hint, which I used to create the following statement that solves my problem:
byte[] bytes = Encoding.GetEncoding(437).GetBytes("ê");
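For comparison, here is a minimal Python 3 sketch of the same lookup. It assumes Python's built-in cp437 codec; the variable names are only illustrative.

```python
# Encode a character to its single-byte value in code page 437.
char = "ê"
encoded = char.encode("cp437")   # bytes object holding one byte
byte_value = encoded[0]          # integer value of that byte

print(byte_value)                # 136, the position of ê in code page 437

# Characters that the code page lacks raise UnicodeEncodeError unless an
# error handler such as "replace" is supplied; "replace" substitutes "?".
fallback = "€".encode("cp437", errors="replace")
print(fallback)                  # b'?'
```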
Is there a way to convert unicode to the nearest ASCII equivalent?
What you are asking is called transliteration.
Try the Unidecode library.
Convert single Unicode character to ASCII character
UTF-8 specific interpretation:
I assume you have the UTF-8 encoding of the character, written as hexadecimal digits, stored as a string in a variable (c), and you want to determine the corresponding character. The following code snippet shows how to do it:
>>> import binascii
>>> cp2chr = lambda c: binascii.unhexlify(c.zfill(len(c) + (len(c) & 1))).decode('utf-8')
>>> cp2chr('C484')
'Ą'
Explanation: zfill prepends a zero if the number of characters is odd. binascii.unhexlify takes the characters two at a time, interprets each pair as a hexadecimal number and turns it into one byte. All those bytes are merged into a bytes object. Finally, decode('utf-8') interprets those bytes as UTF-8 encoded data and returns the result as a string.
>>> cp2chr('00C4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 1: unexpected end of data
Your provided example, however, is not valid UTF-8 data. See Wikipedia's UTF-8 byte structure table to identify valid byte sequences. C4 has bit structure 11000100, so it is the leading byte of a two-byte sequence and requires a continuation byte (bit pattern 10xxxxxx) after it.
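To check the role of a byte yourself, the bit patterns from the UTF-8 byte structure table can be tested directly. This is a minimal sketch with a hypothetical helper name; it only looks at the leading bits and does not reject bytes such as C0 or C1, which match a lead-byte pattern but never occur in valid UTF-8.

```python
def classify_utf8_byte(byte):
    """Return the role a single byte value (0-255) plays in UTF-8."""
    if (byte & 0b10000000) == 0b00000000:
        return "single-byte (ASCII)"
    if (byte & 0b11000000) == 0b10000000:
        return "continuation byte"
    if (byte & 0b11100000) == 0b11000000:
        return "lead byte of a 2-byte sequence"
    if (byte & 0b11110000) == 0b11100000:
        return "lead byte of a 3-byte sequence"
    if (byte & 0b11111000) == 0b11110000:
        return "lead byte of a 4-byte sequence"
    return "invalid in UTF-8"

print(classify_utf8_byte(0xC4))  # lead byte of a 2-byte sequence
print(classify_utf8_byte(0x84))  # continuation byte
print(classify_utf8_byte(0x00))  # single-byte (ASCII)
```

This is why 'C484' decodes (lead byte, then continuation byte) while '00C4' fails: the lead byte C4 is the last byte, so the data ends before the sequence is complete.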
Encoding independent interpretation:
So you might be looking for an interpretation of Unicode code points that is independent of the encoding. Then you are looking for the raw_unicode_escape encoding:
>>> cp2chr = lambda c: (b'\\u' + c.encode('ascii')).decode('raw_unicode_escape')
>>> cp2chr('00C4')
'Ä'
Explanation: raw_unicode_escape converts the unicode escape sequences given in a byte string and returns the result as a string: b'\\u00C4'.decode('raw_unicode_escape') gives Ä. This is what Python does internally if you write \uSOMETHING in your source code.
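A simpler alternative, assuming the input is always a plain hexadecimal code point, is to parse the number and pass it to the built-in chr. The function name below is hypothetical; unlike the \u escape, this also handles code points with more than four hex digits.

```python
def codepoint_to_char(hex_codepoint):
    """Convert a hexadecimal code point such as '00C4' to its character."""
    # int(..., 16) parses the hex digits; chr maps the number to a character.
    return chr(int(hex_codepoint, 16))

print(codepoint_to_char("00C4"))   # Ä
print(codepoint_to_char("0104"))   # Ą
```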
How to convert unicode character to its escaped ascii equivalent in c#
They are pretty much the same, at least for display purposes. HttpUtility.HtmlEncode uses decimal encoding, which has the format &#DECIMAL; while your original version uses hexadecimal encoding, i.e. the format &#xHEX;. Since fc in hex is 252 in decimal, the two are equivalent.
If you really need the hex-encoded version, consider parsing out the decimal value and converting it to hex before putting it back into the &#xHEX; format. Something like
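string unicode = "ü";
string decimalEncoded = HttpUtility.HtmlEncode(unicode);    // "&#252;"
// Strip the leading "&#" and the trailing ";" to get the decimal digits.
int codePoint = int.Parse(decimalEncoded.Substring(2, decimalEncoded.Length - 3));
string hexEncoded = string.Format("&#x{0:X};", codePoint);  // "&#xFC;"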
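The equivalence of the two formats can be checked with a short Python 3 sketch. The helper names are hypothetical, and html.unescape is used only to show that both references decode to the same character.

```python
import html

def to_decimal_reference(char):
    """Build a decimal numeric character reference, e.g. &#252;"""
    return "&#{};".format(ord(char))

def to_hex_reference(char):
    """Build a hexadecimal numeric character reference, e.g. &#xFC;"""
    return "&#x{:X};".format(ord(char))

decimal_ref = to_decimal_reference("ü")
hex_ref = to_hex_reference("ü")

print(decimal_ref, hex_ref)   # &#252; &#xFC;
# Both references decode to the same character.
print(html.unescape(decimal_ref) == html.unescape(hex_ref))   # True
```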
Python - Unicode to ASCII conversion
The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
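To illustrate the reversibility, here is a Python 3 sketch (the snippets above use Python 2 syntax) that round-trips two of these encodings. The variable names are illustrative. Note that decoding with unicode_escape is only a safe inverse when the original text contains no literal backslashes.

```python
import html

s = "ABRAÃO JOSÉ"

# backslashreplace writes non-ASCII characters as \xNN escapes.
escaped = s.encode("ascii", errors="backslashreplace")
restored_from_escape = escaped.decode("unicode_escape")

# xmlcharrefreplace writes them as decimal character references.
xml_refs = s.encode("ascii", errors="xmlcharrefreplace")
restored_from_xml = html.unescape(xml_refs.decode("ascii"))

print(escaped)    # b'ABRA\\xc3O JOS\\xc9'
print(xml_refs)   # b'ABRA&#195;O JOS&#201;'
print(restored_from_escape == s, restored_from_xml == s)   # True True
```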
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
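A short Python 3 sketch of why the choice matters: the same text produces different bytes under different character sets, and decoding with the wrong one either corrupts the text or fails outright. The variable names are illustrative.

```python
s = "ABRAÃO JOSÉ"

utf8_bytes = s.encode("utf-8")      # two bytes each for Ã and É
cp1252_bytes = s.encode("cp1252")   # one byte each for Ã and É

print(len(utf8_bytes), len(cp1252_bytes))   # 13 11

# Decoding UTF-8 bytes as cp1252 succeeds but yields mojibake.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)                             # ABRAÃƒO JOSÃ‰

# Decoding cp1252 bytes as UTF-8 fails, because 0xC3 is not followed
# by a continuation byte.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)                        # True
```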
Converting one UNICODE character to two ASCII ones
Okay, a thing I wanted to do is to convert unicode to utf-8
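As a rough illustration of how one character turns into two, here is a Python 3 sketch, assuming the character lies in the U+0080 to U+07FF range, where UTF-8 uses two bytes. Reading those two bytes with a single-byte character set then shows two characters.

```python
char = "ê"                       # U+00EA

utf8_bytes = char.encode("utf-8")
print(len(utf8_bytes))           # 2
print(utf8_bytes.hex())          # c3aa

# Interpreting each byte as its own Latin-1 character gives two characters.
as_two_chars = utf8_bytes.decode("latin-1")
print(as_two_chars)              # Ãª
```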