How to Make the Python Interpreter Correctly Handle Non-ASCII Characters in String Operations

How to make the Python interpreter correctly handle non-ASCII characters in string operations?

Python 2 uses ASCII as the default encoding for source files, so you must declare another encoding at the top of the file before you can use non-ASCII characters in literals. Python 3 uses UTF-8 as the default source encoding, so this is less of an issue.

See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable UTF-8 source encoding, put this on the first or second line of the file:

# -*- coding: utf-8 -*-

That is the form shown in the docs, but this shorter form also works:

# coding: utf-8

Additional considerations:

  • The source file must be saved using the correct encoding in your text editor as well.

  • In Python 2, a unicode literal must have a u before it, as in s.replace(u"Â ", u""). In Python 3, plain quotes are enough. In Python 2, you can use from __future__ import unicode_literals to get the Python 3 behavior, but be aware that this affects every string literal in the current module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string, for example if it is a byte string that has not been decoded yet.

  • str.replace returns a new string and does not edit in place, so make sure you use the return value.
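The points above can be put together in a short Python 3 sketch. The helper name strip_stray and the sample text are made up for illustration, and UTF-8 is assumed for any byte input. Escape sequences stand in for the non-ASCII characters so the snippet does not depend on how the file was saved.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch. strip_stray and the sample data are hypothetical.

def strip_stray(s):
    """Return a copy of s with every U+00C2-plus-space pair removed."""
    if isinstance(s, bytes):
        # Byte strings must be decoded first; UTF-8 is an assumption here.
        s = s.decode("utf-8")
    # str.replace builds a new string, so the result must be returned.
    return s.replace("\u00c2 ", "")

original = "Hello \u00c2 world"
cleaned = strip_stray(original)
from_bytes = strip_stray(original.encode("utf-8"))

print(cleaned)               # the stray pair is gone
print(original == cleaned)   # False: the original string is unchanged
```

Calling s.replace(...) without assigning or returning the result would leave s exactly as it was, which is the mistake the last bullet warns about.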

Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

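Note the + there: it makes the pattern match a whole run of non-ASCII characters at once, so each run becomes a single space.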
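To see the difference between the two approaches side by side, here is a small sketch. The function names replace_each and replace_runs and the sample string are hypothetical.

```python
import re

def replace_each(text):
    # One space per non-ASCII character.
    return ''.join([c if ord(c) < 128 else ' ' for c in text])

def replace_runs(text):
    # One space per consecutive run of non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = "ab\u00e9\u00e8cd"  # two accented letters in a row

print(repr(replace_each(sample)))  # 'ab  cd' (two spaces)
print(repr(replace_runs(sample)))  # 'ab cd'  (one space)
```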

Convert non-ASCII characters to ASCII characters

test.encode('cp1252').decode('utf-8')

I've tried this and it works for text that was originally UTF-8 but had been decoded as cp1252.
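A minimal sketch of that round trip follows. Strictly speaking it repairs mis-decoded text rather than producing ASCII, so a second helper shows one common way to get plain ASCII by stripping accents with the standard unicodedata module. Both function names and the sample string are assumptions made for this example.

```python
import unicodedata

def fix_mojibake(text):
    # Undo "UTF-8 bytes decoded as cp1252"; raises an error if the
    # text was not garbled in exactly that way.
    return text.encode('cp1252').decode('utf-8')

def to_ascii(text):
    # Decompose accented letters, then drop whatever is not ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

garbled = "caf\u00c3\u00a9"   # how "caf\u00e9" looks after the bad decode
repaired = fix_mojibake(garbled)

print(repaired == "caf\u00e9")  # True
print(to_ascii(repaired))       # cafe
```

The 'ignore' error handler silently drops characters that have no ASCII decomposition, so to_ascii can lose information.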

How to check if a string in Python is in ASCII?

def is_ascii(s):
    return all(ord(c) < 128 for c in s)
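A quick usage sketch, together with an alternative that lets the codec do the check. The name is_ascii_via_encode is made up here, and both functions assume a Python 3 str as input.

```python
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

def is_ascii_via_encode(s):
    # Encoding to ASCII fails as soon as a non-ASCII character appears.
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

for candidate in ["hello", "f\u00f8\u00f8", ""]:
    print(repr(candidate), is_ascii(candidate), is_ascii_via_encode(candidate))
```

Both versions treat the empty string as ASCII, since it contains no character that fails the test.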

Removing strings containing ASCII

The code below is equivalent to your current code, except that for a contiguous sequence of characters outside the range of US-ASCII, it will replace the whole sequence with a single space (ASCII 32).

import re
re.sub(r'[^\x00-\x7f]+', " ", inputString)

Do note that control characters are allowed by the code above, and also the code in the question.
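If control characters should go as well, one possible variation is to narrow the character class to the printable ASCII range. This is only a sketch: replace_non_printable is a hypothetical name, and the stricter pattern also replaces tabs and newlines, which may or may not be what you want.

```python
import re

def replace_non_ascii(text):
    # Pattern from the answer: control characters are left alone.
    return re.sub(r'[^\x00-\x7f]+', ' ', text)

def replace_non_printable(text):
    # Stricter variant: keep only space (0x20) through tilde (0x7e).
    return re.sub(r'[^\x20-\x7e]+', ' ', text)

sample = "a\x07b\u00e9c"  # contains a BEL control character and an accented e

print(repr(replace_non_ascii(sample)))      # 'a\x07b c'
print(repr(replace_non_printable(sample)))  # 'a b c'
```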

python3-qt4 discards non-ASCII characters from unicode strings in QApplication constructor

This looks like a bug in PyQt, as it is handled correctly by PySide.

There seems to be no reason why the args can't be properly encoded before being passed to the application constructor:

>>> import os
>>> from PyQt4 import QtCore
>>> args = ['føø', 'bær']
>>> app = QtCore.QCoreApplication([os.fsencode(arg) for arg in args])
>>> app.arguments()
['føø', 'bær']

If you want to see this get fixed, please report it on the PyQt Mailing List.


