Non-Ascii Characters in Matplotlib

How to make the Python interpreter correctly handle non-ASCII characters in string operations?

Python 2 uses ASCII as the default encoding for source files, so you must declare another encoding at the top of the file to use non-ASCII characters in literals. Python 3 uses UTF-8 as the default encoding for source files, so this is less of an issue.

See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable UTF-8 source encoding, put this on the first or second line of the file:

# -*- coding: utf-8 -*-

The above is in the docs, but this also works:

# coding: utf-8
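As a rough check that both comment styles are recognized, the standard library's tokenize.detect_encoding can be pointed at a few in-memory byte strings. This is only a sketch: the helper name declared_encoding and the sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding that tokenize detects for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Hypothetical sample sources, one per declaration style.
emacs_style = b"# -*- coding: utf-8 -*-\nname = 'x'\n"
short_style = b"# coding: utf-8\nname = 'x'\n"
# The declaration is also honored on line 2, for example after a shebang.
second_line = b"#!/usr/bin/env python\n# coding: latin-1\nname = 'x'\n"

print(declared_encoding(emacs_style))
print(declared_encoding(short_style))
print(declared_encoding(second_line))
```

Note that detect_encoding reports a normalized codec name, so a latin-1 declaration comes back as iso-8859-1.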

Additional considerations:

  • The source file must be saved using the correct encoding in your text editor as well.

  • In Python 2, a Unicode literal must have a u prefix, as in s.replace(u"Â ", u""). In Python 3, plain quotes are enough. In Python 2, you can use from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware that this affects the entire current module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string.

  • str.replace returns a new string and does not edit in place, so make sure you use the return value as well.
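The last point is easy to trip over, so here is a minimal Python 3 sketch of it. The sample string is invented for illustration; it contains U+00A0 (NO-BREAK SPACE) written as an escape so that the example does not depend on how the file is saved.

```python
# Python 3: every str is Unicode, so no u prefix is needed.
s = "price:\u00a0100\u00a0EUR"  # made-up sample containing U+00A0

# The result is discarded here, so s itself is left untouched.
s.replace("\u00a0", " ")
still_has_nbsp = "\u00a0" in s

# Keep the returned string to actually get the cleaned text.
cleaned = s.replace("\u00a0", " ")

print(still_has_nbsp)
print(repr(cleaned))
```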

Non-ASCII characters in Matplotlib

This problem may actually have a couple of different causes:

The default font does not include these glyphs

You may change the default font using the following (before any plotting is done!)

matplotlib.rc('font', family='Arial')

In some versions of matplotlib you'll have to set the family:

matplotlib.rc('font', **{'sans-serif' : 'Arial',
'family' : 'sans-serif'})

(Note that the **{} syntax is actually necessary here: the key sans-serif contains a hyphen, so it cannot be written as a keyword argument.)

The 'sans-serif' entry changes the sans-serif font family to contain only one font (in my case it was Arial); the 'family' entry sets the default font family to sans-serif.

Other options are included in the documentation.

You have improperly created/passed string objects into Matplotlib

Even if the font contains the proper glyphs, if you forget the u prefix on a Unicode constant (in Python 2), Matplotlib receives a byte string and may render the text incorrectly:

plt.xlabel("Średnia odległość między stacjami wsparcia a modelowaną [km]")

So you need to add u:

plt.xlabel(u"Średnia odległość między stacjami wsparcia a modelowaną [km]")

Another possible cause is a missing UTF-8 magic comment at the top of the file (I read that this might be the source of the problem):

# -*- coding: utf-8 -*-

How to add non-ASCII characters in a Python list?

You want to mark these as Unicode strings.

mylist = [u"अ,ब,क"]

Depending on what you want to accomplish, if the data is just a single string, it might not need to be in a list. Or perhaps you want a list of strings?

mylist = [u"अ", u"ब", u"क"]
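If the data arrives as one comma-separated string, a list of strings can be built by splitting it. This is a small Python 3 sketch (so no u prefix), and the variable names are only illustrative.

```python
raw = "अ,ब,क"

# Split the single string on commas to get one list element per letter.
letters = raw.split(",")

print(letters)
print(len(raw), len(letters))
```

In Python 3, len counts code points rather than bytes, so the original string has length 5 (three letters plus two commas).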

Python 3 brings a lot of relief to working with Unicode (and doesn't need the u sigil in front of Unicode strings, because all strings are Unicode), and should definitely be your learning target unless you are specifically tasked with maintaining legacy software. Python 2 reached its official end of life on January 1, 2020.

Regardless of your Python version, there may still be issues with displaying Unicode on your system, in particular on older systems and on Windows.

If you are unfamiliar with encoding issues, you'll want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and perhaps the Python-specific Pragmatic Unicode.

Python 3.8: Escape non-ascii characters as unicode

You could do something like this:

charList = []
s1 = "Bürgerhaus"

for i in [ord(x) for x in s1]:
    # Keep ASCII characters; 'encode' other characters as their ordinal in hex
    if i < 128:  # not sure if that is right or can be made easier!
        charList.append(chr(i))
    else:
        charList.append('\\u%04x' % i)

res = ''.join(charList)
print(f"Mixed up string: {res}")

for myStr in (res, s1):
    if '\\u' in myStr:
        print(myStr.encode().decode('unicode-escape'))
    else:
        print(myStr)

Out:

Mixed up string: B\u00fcrgerhaus
Bürgerhaus
Bürgerhaus

Explanation:

We are going to convert each character to its corresponding Unicode code point.

print([(c, ord(c)) for c in s1])
[('B', 66), ('ü', 252), ('r', 114), ('g', 103), ('e', 101), ('r', 114), ('h', 104), ('a', 97), ('u', 117), ('s', 115)]

Regular ASCII characters have decimal values < 128; other characters, such as the euro sign or German umlauts, have values >= 128.

Now we 'encode' all characters >= 128 as their corresponding Unicode escape representation.
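The same idea can be wrapped in a small function. This is a sketch rather than a definitive implementation: the name escape_non_ascii is made up, and the \U branch is an addition for code points above U+FFFF, which do not fit in four hex digits.

```python
def escape_non_ascii(text):
    """Keep ASCII characters and write all others as \\uXXXX or \\UXXXXXXXX."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)
        elif code <= 0xFFFF:
            parts.append("\\u%04x" % code)
        else:
            # Code points beyond U+FFFF need the eight-digit \U form.
            parts.append("\\U%08x" % code)
    return "".join(parts)


escaped = escape_non_ascii("Bürgerhaus")
print(escaped)

# The escaped text is pure ASCII, so it can be turned back with unicode-escape.
restored = escaped.encode("ascii").decode("unicode-escape")
print(restored)
```

One caveat: if the input already contains backslashes, decoding with unicode-escape will interpret them as escapes too, so the round trip is only safe for text without literal backslashes.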

How to remove non-ascii characters from a list

If you want to post-process your list, you can apply encode('ascii', 'ignore') to each element:

my_list = [
    'Central Park\u202c',
    'Top of the Rock',
    'Statue of Liberty\u202c',
    'Brooklyn Bridge'
]
# Drop every non-ASCII character, then turn the ASCII bytes back into str
my_list = [e.encode('ascii', 'ignore').decode('ascii') for e in my_list]
print(my_list)

And the output should be:

['Central Park', 'Top of the Rock', 'Statue of Liberty', 'Brooklyn Bridge']
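Two variations on the same idea, sketched under assumptions: the function names are invented, str.isascii requires Python 3.7 or later, and the NFKD normalization step is a different technique from the answer above, shown only for cases where accented letters should keep their base letter instead of disappearing.

```python
import unicodedata


def strip_non_ascii(text):
    """Drop every non-ASCII character (same effect as encode('ascii', 'ignore'))."""
    return "".join(ch for ch in text if ch.isascii())


def fold_to_ascii(text):
    """Decompose accented letters first, so the base letter survives."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_non_ascii("Central Park\u202c"))
print(strip_non_ascii("Caf\u00e9"))
print(fold_to_ascii("Caf\u00e9"))
```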

Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
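To make the difference concrete, here are the two approaches side by side as complete functions. The function names and the sample string are made up for this sketch.

```python
import re


def replace_each(text):
    """Replace every non-ASCII character with its own space."""
    return "".join(ch if ord(ch) < 128 else " " for ch in text)


def replace_runs(text):
    """Replace each run of consecutive non-ASCII characters with one space."""
    return re.sub(r"[^\x00-\x7F]+", " ", text)


sample = "ABC\u00e9\u00e8DEF"  # two non-ASCII characters in a row

print(repr(replace_each(sample)))
print(repr(replace_runs(sample)))
```

The first function yields two spaces between ABC and DEF, the second only one.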

Python string / bytes encoding with non ascii characters

You could use something like this:

import json

json_file_path = 'your_json_file.json'

with open(json_file_path, 'r', encoding='utf-8') as j:
    # Remove problematic "b\ character sequence
    raw = j.read().replace('\"b\\', "")
    # Process json
    contents = json.loads(raw)

# Decode string to process correctly double backslashes
output = contents['content'].encode('utf-8').decode('unicode_escape')

print(output)
# Output
# Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
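The decoding step can be tried without a file. The JSON text below is a made-up stand-in for the real data: its value holds a literal backslash followed by u00e8, which is the double-backslash situation described above.

```python
import json

# Hypothetical input: in the JSON text, \\ is an escaped backslash,
# so the parsed value contains the six characters \u00e8 literally.
raw = r'{"content": "d\\u00e8s la R&D"}'

literal = json.loads(raw)["content"]
print(literal)

# The value is pure ASCII, so unicode_escape can resolve the escape safely.
decoded = literal.encode("ascii").decode("unicode_escape")
print(decoded)

# Caveat: unicode_escape treats non-escape bytes as Latin-1, so text that
# already holds real non-ASCII characters gets mangled by the same step.
mangled = "d\u00e8s".encode("utf-8").decode("unicode_escape")
print(mangled)
```

Because of that caveat, this approach fits best when the string is known to contain only ASCII plus escape sequences.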

