How to Convert a Unicode Character to Its ASCII Equivalent

How to convert a Unicode character to its ASCII equivalent

Okay, let's elaborate. Both csgero and bzlm pointed in the right direction.

Because of bzlm's reply I looked up the Windows-1252 page on Wikipedia and found that it is called a codepage. The Wikipedia article for Code page states the following:

No formal standard existed for these ‘extended character sets’; IBM merely referred to the variants as code pages, as it had always done for variants of EBCDIC encodings.

This led me to codepage 437:

In ASCII-compatible code pages, the lower 128 characters maintained their standard US-ASCII values, and different pages (or sets of characters) could be made available in the upper 128 characters. DOS computers built for the North American market, for example, used code page 437, which included accented characters needed for French, German, and a few other European languages, as well as some graphical line-drawing characters.

So codepage 437 was the codepage I had been calling 'extended ASCII'. It has ê as character 136, so I looked up some other characters as well, and they seem right.

csgero came up with the Encoding.GetEncoding() hint, which I used to create the following statement that solves my problem:

byte[] bytes = Encoding.GetEncoding(437).GetBytes("ê");
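For comparison, the same lookup can be sketched in Python, assuming the standard "cp437" codec is an acceptable stand-in for .NET's code page 437. The variable names are only illustrative:

```python
# Encode a character with code page 437 and inspect the resulting byte value.
encoded = "ê".encode("cp437")   # a bytes object of length 1
byte_value = encoded[0]         # integer value of that single byte
print(byte_value)               # 136

# The reverse direction: byte 136 decoded with cp437 gives the character back.
decoded = bytes([136]).decode("cp437")
print(decoded)                  # ê
```

Characters that code page 437 does not contain raise UnicodeEncodeError in Python unless an errors= handler is passed.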

Is there a way to convert Unicode to the nearest ASCII equivalent?

What you are asking for is called transliteration.

Try the Unidecode library.
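If installing a third-party package is not an option, a rough approximation is possible with the standard library alone. The sketch below is not what Unidecode does; it only strips accents via Unicode decomposition, and the function name is made up for illustration:

```python
import unicodedata

def strip_accents_to_ascii(text: str) -> str:
    """Approximate ASCII transliteration using only Unicode decomposition."""
    # NFKD splits 'ê' into 'e' followed by a combining circumflex (U+0302).
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping every non-ASCII code point removes the combining marks.
    # It also silently drops characters that have no decomposition, e.g. 'ß'.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents_to_ascii("ê"))             # e
print(strip_accents_to_ascii("crème brûlée"))  # creme brulee
```

A real transliteration library covers far more cases (for example non-Latin scripts), so treat this as a fallback for accented Latin text only.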

Convert single Unicode character to ASCII character

UTF-8 specific interpretation:
I assume you have the UTF-8 encoding of a Unicode code point, written in hexadecimal and stored as a string in a variable (c), and you want to determine the corresponding character. The following code snippet shows how to do it:

>>> import binascii
>>> cp2chr = lambda c: binascii.unhexlify(c.zfill(len(c) + (len(c) & 1))).decode('utf-8')
>>> cp2chr('C484')
'Ą'

Explanation: zfill prepends a zero if the number of characters is odd. binascii.unhexlify takes the characters two at a time, interprets each pair as a hexadecimal number, and turns it into one byte. All those bytes are joined into a bytes object. Finally, bytes.decode('utf-8') interprets those bytes as UTF-8 encoded data and returns the result as a string.

>>> cp2chr('00C4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 1: unexpected end of data

Your provided example, however, is not valid UTF-8 data. See Wikipedia's UTF-8 byte structure table to identify valid byte sequences. C4 has the bit structure 11000100, so it is the leading byte of a two-byte sequence and requires a continuation byte after it.
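The byte structure table can be expressed as a few bit-mask checks. This is a minimal sketch with a hypothetical function name; it looks only at the high bits and does not reject bytes that match a lead-byte pattern but are still illegal in UTF-8 (such as 0xC0 or 0xF5):

```python
def utf8_byte_role(byte: int) -> str:
    """Classify one byte by its high bits, following the UTF-8 layout."""
    if byte & 0b10000000 == 0b00000000:
        return "single-byte (ASCII)"        # 0xxxxxxx
    if byte & 0b11000000 == 0b10000000:
        return "continuation"               # 10xxxxxx
    if byte & 0b11100000 == 0b11000000:
        return "lead of 2-byte sequence"    # 110xxxxx
    if byte & 0b11110000 == 0b11100000:
        return "lead of 3-byte sequence"    # 1110xxxx
    if byte & 0b11111000 == 0b11110000:
        return "lead of 4-byte sequence"    # 11110xxx
    return "invalid"                        # 11111xxx never occurs in UTF-8

print(utf8_byte_role(0xC4))  # lead of 2-byte sequence
print(utf8_byte_role(0x84))  # continuation
```

This matches the two examples: in 'C484' the lead byte C4 is followed by the continuation byte 84, while in '00C4' the lead byte C4 is the last byte, so the decoder reports an unexpected end of data.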

Encoding independent interpretation:
So you might be looking for an interpretation of Unicode code points that is independent of the encoding. In that case, you are looking for the raw_unicode_escape encoding:

>>> cp2chr = lambda c: (b'\\u' + c.encode('ascii')).decode('raw_unicode_escape') 
>>> cp2chr('00C4')
'Ä'

Explanation: raw_unicode_escape converts the Unicode escape sequences in a byte string and returns the result as a string: b'\\u00C4'.decode('raw_unicode_escape') gives Ä. This is essentially what Python does when you write \uSOMETHING in a string literal in your source code.
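A simpler route to the same result, assuming the input is just the code point written in hexadecimal, is to parse the number and pass it to chr. The function name here is illustrative:

```python
def codepoint_hex_to_char(hex_digits: str) -> str:
    """Turn a hexadecimal code point such as '00C4' into its character."""
    # int(..., 16) parses the hex digits; chr maps the number to a character.
    return chr(int(hex_digits, 16))

print(codepoint_hex_to_char("00C4"))  # Ä
print(codepoint_hex_to_char("0104"))  # Ą
```

Unlike the \u escape, which expects exactly four hex digits, this also works for code points above U+FFFF, for example codepoint_hex_to_char('1F600').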

How to convert a Unicode character to its escaped ASCII equivalent in C#

They are pretty much the same, at least for display purposes. HttpUtility.HtmlEncode is using decimal encoding, which is in the format &#DECIMAL; while your original version is in hexadecimal encoding, i.e. in the format &#xHEX;. Since fc in hex is 252 in decimal, the two are equivalent.

If you really need to get the hex-encoded version, then consider parsing out the decimal and converting it to hex before stuffing it back in to the &#xHEX; format. Something like

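// Requires: using System.Web;
string unicode = "ü";
string decimalEncoded = HttpUtility.HtmlEncode(unicode);   // e.g. "&#252;"
// Strip the leading "&#" and the trailing ";" to get the decimal digits.
// (The variable cannot be named "decimal", which is a C# keyword.)
int codePoint = int.Parse(decimalEncoded.Substring(2, decimalEncoded.Length - 3));
string hexEncoded = string.Format("&#x{0:X};", codePoint);  // "&#xFC;"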
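The equivalence of the two notations can be checked with a short Python sketch. The helper names are hypothetical, and html.unescape stands in for whatever a browser would do with the entity:

```python
import html

def to_decimal_entity(ch: str) -> str:
    """Format a single character as a decimal HTML character reference."""
    return f"&#{ord(ch)};"

def to_hex_entity(ch: str) -> str:
    """Format a single character as a hexadecimal HTML character reference."""
    return f"&#x{ord(ch):X};"

print(to_decimal_entity("ü"))  # &#252;
print(to_hex_entity("ü"))      # &#xFC;

# Both references decode to the same character.
print(html.unescape("&#252;") == html.unescape("&#xFC;"))  # True
```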

Python - Unicode to ASCII conversion

The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:

>>> s = u'ABRA\xc3O JOS\xc9'
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e

All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
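The session above is Python 2. As a hedged Python 3 sketch of the same idea, the following encodes the string four ways and reverses each one. The variable names are illustrative, and the reversal steps are one possible choice rather than the only one:

```python
import html

s = "ABRA\xc3O JOS\xc9"  # 'ABRAÃO JOSÉ'

# Four ASCII-only encodings of the same string (all are bytes objects).
backslashed = s.encode("ascii", errors="backslashreplace")
xml_refs = s.encode("ascii", errors="xmlcharrefreplace")
escaped = s.encode("unicode_escape")
puny = s.encode("punycode")

print(backslashed)  # b'ABRA\\xc3O JOS\\xc9'
print(xml_refs)     # b'ABRA&#195;O JOS&#201;'

# Reversing each one needs a matching decoder, not decode('ascii').
from_backslashed = backslashed.decode("unicode_escape")
from_xml_refs = html.unescape(xml_refs.decode("ascii"))
from_escaped = escaped.decode("unicode_escape")
from_puny = puny.decode("punycode")

print(from_backslashed == from_xml_refs == from_escaped == from_puny == s)  # True
```

One caveat: backslashreplace and xmlcharrefreplace leave existing backslashes and ampersands untouched, so text that already contains sequences like \xc3 or &#195; will not round-trip cleanly with these decoders.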

See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.


As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:

>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'

The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
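To see why the choice matters, here is a small Python 3 sketch (variable names are illustrative) of what happens when the producer and the consumer disagree about the character set:

```python
s = "ABRA\xc3O JOS\xc9"  # 'ABRAÃO JOSÉ'

utf8_bytes = s.encode("utf-8")      # b'ABRA\xc3\x83O JOS\xc3\x89'
cp1252_bytes = s.encode("cp1252")   # b'ABRA\xc3O JOS\xc9'

# UTF-8 bytes read as Windows-1252: no error, but the text is garbled.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # ABRAÃƒO JOSÃ‰

# Windows-1252 bytes read as UTF-8: the decoder rejects them outright.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True
```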

Converting one UNICODE character to two ASCII ones

Okay, a thing I wanted to do is to convert unicode to utf-8
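As a minimal illustration of what "one character to two" means under UTF-8, assuming the character in question lies between U+0080 and U+07FF, the following Python sketch shows the two bytes produced. Note that both bytes are above 127, so strictly speaking they are not ASCII characters:

```python
ch = "Ä"                         # U+00C4
utf8_bytes = ch.encode("utf-8")  # two bytes for this code point range
byte_values = list(utf8_bytes)   # integer value of each byte

print(len(utf8_bytes))  # 2
print(byte_values)      # [195, 132]

# Decoding the same two bytes as UTF-8 restores the single character.
print(utf8_bytes.decode("utf-8") == ch)  # True
```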


