How to make the Python interpreter correctly handle non-ASCII characters in string operations?
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ASCII Unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.
See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, this would go in one of the top two lines:
# -*- coding: utf-8 -*-
The above is in the docs, but this also works:
# coding: utf-8
Additional considerations:
The source file must be saved using the correct encoding in your text editor as well.
In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). In Python 3, just use quotes. In Python 2, you can use from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
string.replace returns a new string and does not edit in place, so make sure you're using the return value as well.
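The last point is easy to miss, so here is a minimal Python 3 sketch of it. The variable names and the sample text are made up for illustration; the only thing it relies on is that str.replace returns a new string.

```python
# Python 3: every str is Unicode, so no u prefix is needed.
# The sample text is hypothetical; \u00a0 is a no-break space.
s = "caf\u00e9\u00a0au lait"

# str.replace returns a new string; s itself is left unchanged.
cleaned = s.replace("\u00a0", " ")

print(repr(s))
print(repr(cleaned))
```

Calling s.replace(...) without assigning the result would silently throw the cleaned string away.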
Replace non-ASCII characters with a single space
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
Note the + there.
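To see the difference between the two approaches side by side, here is a small sketch. The function names replace_each and replace_runs and the sample string are my own, not from the question.

```python
import re

def replace_each(text):
    # One space per non-ASCII character.
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_runs(text):
    # One space per run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = "ab\u00e9\u00e8cd"  # two non-ASCII characters in a row

print(repr(replace_each(sample)))  # 'ab  cd'
print(repr(replace_runs(sample)))  # 'ab cd'
```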
Convert non-ascii character to ascii character
test.encode('cp1252').decode('utf-8')
I've tried this and it works. It repairs text that was encoded as UTF-8 but mistakenly decoded as cp1252.
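A round trip makes it clearer what that one-liner undoes. This is only a sketch with a made-up sample word, and it assumes the text really was UTF-8 read as cp1252; a few byte values have no cp1252 mapping, so the trick can raise an error on other input.

```python
original = "caf\u00e9"

# Simulate the damage: UTF-8 bytes read back as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# Undo it: recover the raw bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))   # 'cafÃ©'
print(repr(repaired))  # 'café'
```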
How to check if a string in Python is in ASCII?
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
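For comparison, here are two other ways to get the same answer. The name is_ascii_by_encoding is my own; str.isascii() is built in, but only from Python 3.7 onward.

```python
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

def is_ascii_by_encoding(s):
    # Encoding to ASCII fails on the first non-ASCII character.
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

for sample in ["hello", "h\u00e9llo", ""]:
    # str.isascii() requires Python 3.7 or later.
    print(repr(sample), is_ascii(sample), is_ascii_by_encoding(sample), sample.isascii())
```

All three agree that the empty string counts as ASCII.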
Removing strings containing ASCII
The code below is equivalent to your current code, except that for a contiguous sequence of characters outside the range of US-ASCII, it will replace the whole sequence with a single space (ASCII 32).
import re
re.sub(r'[^\x00-\x7f]+', " ", inputString)
Do note that control characters are allowed by the code above, and also the code in the question.
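If the control characters should go too, one option is to narrow the range to printable ASCII. This is a sketch, not what the question asked for: the function name is mine, and it also replaces tabs and newlines, which may not be what you want.

```python
import re

def keep_printable_ascii(text):
    # \x20-\x7E runs from space through tilde, so control characters
    # (including tab and newline) are replaced along with non-ASCII.
    return re.sub(r'[^\x20-\x7E]+', ' ', text)

print(repr(keep_printable_ascii("a\tb\u00e9c")))  # 'a b c'
```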
python3-qt4 discards non-ascii characters from unicode strings in QApplication constructor
This looks like a bug in PyQt, as it is handled correctly by PySide.
And there seems to be no reason why the args can't be properly encoded before being passed to the application constructor:
>>> import os
>>> from PyQt4 import QtCore
>>> args = ['føø', 'bær']
>>> app = QtCore.QCoreApplication([os.fsencode(arg) for arg in args])
>>> app.arguments()
['føø', 'bær']
If you want to see this get fixed, please report it on the PyQt Mailing List.