How to make the python interpreter correctly handle non-ASCII characters in string operations?
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ASCII Unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.
See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, this would go in one of the top two lines:
# -*- coding: utf-8 -*-
The above is in the docs, but this also works:
# coding: utf-8
Additional considerations:
The source file must be saved using the correct encoding in your text editor as well.
In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). But in Python 3, just use quotes. In Python 2, you can use from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
string.replace returns a new string and does not edit in place, so make sure you're using the return value as well.
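A minimal sketch of the last two points, assuming Python 3, where every str is already Unicode and no u prefix is needed. The sample text and variable names are made up for illustration.

```python
# Python 3: string literals are Unicode, so no u prefix is needed.
junk = "Â "                      # the stray pair we want to remove
s = "price:" + junk + "10"       # hypothetical sample text

# str.replace returns a new string; s itself is left untouched.
cleaned = s.replace(junk, "")
print(repr(s))
print(repr(cleaned))

# Bytes must be decoded to str before replacing with a str pattern.
raw = s.encode("utf-8")
cleaned_from_bytes = raw.decode("utf-8").replace(junk, "")
print(cleaned_from_bytes == cleaned)
```

Calling s.replace(...) without using the result leaves s unchanged, which is why the return value is assigned to a new name here.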
Non-ASCII characters in Matplotlib
This problem may actually have a couple of different causes:
The default font does not include these glyphs
You may change the default font using the following (before any plotting is done!)
matplotlib.rc('font', family='Arial')
In some versions of matplotlib you'll have to set the family:
matplotlib.rc('font', **{'sans-serif' : 'Arial', 'family' : 'sans-serif'})
(Note that because sans-serif contains a hyphen, it cannot be written as a keyword argument, so the **{} syntax is actually necessary.)
The first entry changes the sans-serif font family to contain only one font (in my case it was Arial), and the second sets the default font family to sans-serif.
Other options are included in the documentation.
You have improperly created/passed string objects into Matplotlib
Even if the font contains the proper glyphs, if you forget the u prefix when creating Unicode constants (Python 2), Matplotlib may not render a label like this one correctly:
plt.xlabel("Średnia odległość między stacjami wsparcia a modelowaną [km]")
So you need to add u:
plt.xlabel(u"Średnia odległość między stacjami wsparcia a modelowaną [km]")
Another cause is that you forgot to put a UTF-8 magic comment on top of the file (I read that this might be the source of the problem):
# -*- coding: utf-8 -*-
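The two fixes above can be combined in a short sketch, assuming Python 3 and an installed Matplotlib. The font choice (DejaVu Sans, which ships with Matplotlib) and the in-memory PNG output are assumptions made for illustration, not part of the original answer.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Assumption: DejaVu Sans (bundled with Matplotlib) contains the Polish glyphs.
# Set the font before any plotting is done.
matplotlib.rc("font", family="DejaVu Sans")

# Python 3: a plain string literal is already Unicode, no u prefix needed.
label = "Średnia odległość między stacjami wsparcia a modelowaną [km]"

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel(label)

# Render to an in-memory PNG to check that drawing the label succeeds.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

If the chosen font lacks a glyph, Matplotlib typically emits a missing-glyph warning and draws a placeholder box, which is a quick way to tell a font problem from a string problem.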
How to add non ASCII characters in a python list?
You want to mark these as Unicode strings.
mylist = [u"अ,ब,क"]
Depending on what you want to accomplish, if the data is just a single string, it might not need to be in a list. Or perhaps you want a list of strings?
mylist = [u"अ", u"ब", u"क"]
Python 3 brings a lot of relief to working with Unicode (and doesn't need the u sigil in front of Unicode strings, because all strings are Unicode), and should definitely be your learning target unless you are specifically tasked with maintaining legacy software now that Python 2 has been officially abandoned (it reached end of life on January 1, 2020).
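As a small sketch of the difference between the two lists above, assuming Python 3 (so the u prefix is dropped); the variable name single is made up for illustration:

```python
# One string that happens to contain commas: a list of length 1.
single = ["अ,ब,क"]

# Three separate strings: split the text on the comma.
mylist = "अ,ब,क".split(",")

print(len(single), len(mylist))
print(mylist)
# Each element is a single Unicode code point.
print([hex(ord(ch)) for ch in mylist])
```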
Regardless of your Python version, there may still be issues with displaying Unicode on your system, in particular on older systems and on Windows.
If you are unfamiliar with encoding issues, you'll want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and perhaps the Python-specific Pragmatic Unicode.
Python 3.8: Escape non-ascii characters as unicode
You could do something like this:
charList = []
s1 = "Bürgerhaus"
for i in [ord(x) for x in s1]:
    # Keep ASCII characters; 'encode' other characters as their ordinal in hex
    if i < 128:  # not sure if that is right or can be made easier!
        charList.append(chr(i))
    else:
        charList.append('\\u%04x' % i)
res = ''.join(charList)
print(f"Mixed up string: {res}")
for myStr in (res, s1):
    if '\\u' in myStr:
        print(myStr.encode().decode('unicode-escape'))
    else:
        print(myStr)
Out:
Mixed up string: B\u00fcrgerhaus
Bürgerhaus
Bürgerhaus
Explanation:
We are going to convert each character to its corresponding Unicode code point.
print([(c, ord(c)) for c in s1])
[('B', 66), ('ü', 252), ('r', 114), ('g', 103), ('e', 101), ('r', 114), ('h', 104), ('a', 97), ('u', 117), ('s', 115)]
Regular ASCII characters have decimal values < 128; other characters, such as the euro sign or German umlauts, have values >= 128.
Now we 'encode' every character with a value >= 128 as its corresponding Unicode escape sequence.
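The same idea can be wrapped in a function. This is only a sketch: the name escape_non_ascii is hypothetical, and the branch for code points above 0xFFFF is an addition not covered by the answer above, since a four-digit \u escape cannot represent them.

```python
def escape_non_ascii(text):
    """Return text with non-ASCII characters replaced by backslash escapes."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                 # plain ASCII stays as it is
        elif code <= 0xFFFF:
            parts.append("\\u%04x" % code)   # four hex digits
        else:
            parts.append("\\U%08x" % code)   # eight hex digits for larger code points
    return "".join(parts)


escaped = escape_non_ascii("Bürgerhaus")
print(escaped)

# Decoding reverses the escaping. This assumes the input contained no literal
# backslashes, which unicode_escape would otherwise interpret as escapes too.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored)
```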
How to remove non-ascii characters from a list
If you want to post-process your list, you can apply encode('ascii', 'ignore') over it:
my_list = [
    'Central Park\u202c',
    'Top of the Rock',
    'Statue of Liberty\u202c',
    'Brooklyn Bridge'
]
my_list = [e.encode('ascii', 'ignore').decode("utf-8") for e in my_list]
print(my_list)
print(my_list)
And the output should be:
['Central Park', 'Top of the Rock', 'Statue of Liberty', 'Brooklyn Bridge']
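One caveat worth sketching: encode('ascii', 'ignore') drops every non-ASCII character, including accented letters. If you would rather keep the base letter, one possible variation (not part of the answer above) is to decompose the text with unicodedata.normalize first. The helper names and the second sample string are hypothetical.

```python
import unicodedata


def strip_non_ascii(text):
    # Drop every character that is not ASCII.
    return text.encode("ascii", "ignore").decode("ascii")


def fold_to_ascii(text):
    # Split accented letters into base letter + combining mark (NFKD),
    # then drop whatever is still non-ASCII (the marks, control characters).
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


places = ["Central Park\u202c", "Caf\u00e9 Lalo"]
print([strip_non_ascii(p) for p in places])
print([fold_to_ascii(p) for p in places])
```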
Replace non-ASCII characters with a single space
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
Note the + there: it matches a run of one or more non-ASCII characters, so each run is replaced by a single space.
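The difference between the two approaches shows up on a run of consecutive non-ASCII characters. A short sketch, with hypothetical function names and sample text:

```python
import re


def replace_each(text):
    # One space per non-ASCII character.
    return ''.join(c if ord(c) < 128 else ' ' for c in text)


def replace_runs(text):
    # One space per run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)


text = "ABC\u00e9\u00e9DEF"      # two non-ASCII characters in a row
print(repr(replace_each(text)))
print(repr(replace_runs(text)))
```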
Python string / bytes encoding with non ascii characters
You could use something like this:
import json

json_file_path = 'your_json_file.json'
with open(json_file_path, 'r', encoding='utf-8') as f:
    # Remove the problematic "b\ sequence before parsing
    raw = f.read().replace('"b\\', '')
# Parse the JSON
contents = json.loads(raw)
# Decode the string so the double-backslash escapes are processed correctly
output = contents['content'].encode('utf-8').decode('unicode_escape')
print(output)
# Output
# Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
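To see the decoding step without a file on disk, here is a self-contained sketch. The inline JSON document is a made-up sample, and the backslashreplace variant is an assumption about how one might protect text that already contains real non-ASCII characters, since unicode_escape reads the bytes as Latin-1.

```python
import json

# Hypothetical sample: the JSON contains an escaped backslash followed by
# u00e8, so after parsing the value holds a literal backslash-u sequence.
raw = r'{"content": "d\\u00e8s la R&D"}'
content = json.loads(raw)["content"]
print(content)

# Interpret the literal escape sequence.
output = content.encode("utf-8").decode("unicode_escape")
print(output)

# If the text may also contain real non-ASCII characters, escaping them
# first keeps them from being mangled by the Latin-1 interpretation.
mixed = content + " \u00e9"
safe = mixed.encode("ascii", "backslashreplace").decode("unicode_escape")
print(safe)
```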