Where Is Python's "Best Ascii for This Unicode" Database

Where is Python's best ASCII for this Unicode database?

Unidecode looks like a complete solution. It converts fancy quotes to ASCII quotes, accented Latin characters to their unaccented forms, and even attempts transliteration for characters that have no ASCII equivalent. That way your users don't have to see a bunch of ? characters when you have to pass their text through a legacy 7-bit ASCII system.

>>> from unidecode import unidecode
>>> print(unidecode(u"\u5317\u4EB0"))
Bei Jing

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/
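For comparison, here is a rough standard-library sketch of the "accented to unaccented" part only. It is not how Unidecode works internally; the function name strip_accents_to_ascii is made up for illustration. It decomposes the text with NFKD, drops combining marks, and replaces anything still outside ASCII with ?, which is exactly the fallback Unidecode's transliteration tables are meant to avoid.

```python
import unicodedata

def strip_accents_to_ascii(text):
    """Hypothetical helper: approximate ASCII using only the stdlib."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base characters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII becomes "?" (no transliteration here).
    return without_marks.encode("ascii", "replace").decode("ascii")

print(strip_accents_to_ascii("café"))
print(strip_accents_to_ascii("\u5317\u4EB0"))
```

Accented Latin text comes through cleanly, but the Chinese example above degrades to question marks, and curly quotes are not mapped to straight ones either.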

How to convert fancy/artistic unicode text to ASCII?

import unicodedata
strings = [
  ' ',
  ' ',
  ' ',
  ' ',
  'ｔｈｕｇ ｌｉｆｅ']
for x in strings:
  print(unicodedata.normalize('NFKC', x), x)

Output: .\62803325.py

thug life
thug life
thug life
thug life
thug life ｔｈｕｇ ｌｉｆｅ

Resources:

unicodedata — Unicode Database
Normalization forms for Unicode text
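A small sketch of the same NFKC idea, with the styled input built from explicit code points so it survives copy and paste. The helper name nfkc_to_ascii and the restyle function are assumptions for illustration; the offsets used are the Mathematical Bold small letters (starting at U+1D41A) and the fullwidth small letters (starting at U+FF41). The helper also reports whether the result is pure ASCII, since NFKC only helps for characters that have a compatibility decomposition.

```python
import unicodedata

def nfkc_to_ascii(text):
    """Hypothetical helper: NFKC-normalize and report whether the result is ASCII."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized, normalized.isascii()

def restyle(text, base):
    """Shift lowercase a-z onto a styled alphabet starting at code point `base`."""
    return "".join(
        chr(base + ord(c) - ord("a")) if "a" <= c <= "z" else c
        for c in text
    )

bold = restyle("thug life", 0x1D41A)       # Mathematical Bold small letters
fullwidth = restyle("thug life", 0xFF41)   # Fullwidth small letters

for styled in (bold, fullwidth):
    print(nfkc_to_ascii(styled), styled)
```

If the second element of the returned tuple is False, some characters had no compatibility mapping and need another approach, such as a transliteration table.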

Parse CSV files for Unicode values in Python

To collect all the non-ASCII characters in a file into a list you can do this:

non_ascii_chars = []
with open('myfile.csv') as f:
    for line in f:
        for char in line:
            if ord(char) > 127:
                non_ascii_chars.append(char)

The ord built-in function returns the Unicode codepoint of a character; ASCII characters have codepoints in the range 0 - 127.

A more succinct version, using a list comprehension:

with open('myfile.csv') as f:
    non_ascii_chars = [char for line in f for char in line if ord(char) > 127]

To write the collected characters to a file:

with open('non_ascii_chars.txt', 'w', encoding='utf-8') as f:
    f.write(''.join(non_ascii_chars))
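If a flat list is too noisy, a variation is to count each distinct non-ASCII character and look up its Unicode name. The sketch below assumes nothing beyond the standard library; summarize_non_ascii is a made-up name, and an in-memory io.StringIO stands in for the open file so the example runs on its own.

```python
import io
import unicodedata
from collections import Counter

def summarize_non_ascii(lines):
    """Hypothetical helper: (char, code point, name, count) per non-ASCII char."""
    counts = Counter(
        char for line in lines for char in line if ord(char) > 127
    )
    return [
        (char, f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN"), n)
        for char, n in counts.most_common()
    ]

# Stand-in for open('myfile.csv'); any iterable of text lines works.
sample = io.StringIO("name,city\nJosé,Zürich\nZoë,Kraków\n")
for row in summarize_non_ascii(sample):
    print(row)
```

Because the function accepts any iterable of lines, the file object from the with block above can be passed in directly.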

You need to find out which encoding was used for your data before it was inserted into the database. Let's assume it's UTF-8, since that's the most common.

In that case you will want to decode as UTF-8 instead of ASCII. You didn't provide any code, so I'm assuming you have "data".decode(). Try "data".decode("utf-8"); if your data was encoded as UTF-8, it will work.
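To make the difference concrete, here is a minimal Python 3 sketch, where decoding applies to bytes objects. The variable names and sample value are invented for illustration, and the data is assumed to be UTF-8 as in the answer above.

```python
# Bytes as they might come back from the database (assumed to be UTF-8).
raw = "café".encode("utf-8")

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # The bytes for "é" are outside the 7-bit range, so ASCII decoding fails.
    ascii_ok = False

# Decoding with the encoding the data was written in recovers the text.
text = raw.decode("utf-8")

# If the real encoding is uncertain, errors="replace" avoids an exception
# at the cost of substituting U+FFFD for undecodable bytes.
lenient = raw.decode("ascii", errors="replace")

print(ascii_ok, text, lenient)
```

If decoding as UTF-8 raises UnicodeDecodeError too, the data was most likely written in some other encoding, and that encoding needs to be identified first.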
