Stripping Non Printable Characters from a String in Python

Stripping non printable characters from a string in python

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python2

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use-cases, additional categories (e.g. all from the control group might be preferable, although this might slow down the processing time and increase memory usage significantly. Number of characters per category:

Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601

Edit Adding suggestions from the comments.

Remove non-ascii and special characters from a string Python

You can try using simple Regex and .replace() -

import re

my_string = "Bjørn 10.2.3"
new_string = re.sub('[^A-z0-9 -]', '', my_string).replace(" ", " ")
print (new_string)

Output:

Bjrn 1023

How can I remove non-ASCII characters but leave periods and spaces?

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:

''.join(filter(lambda x: x in printable, s))

How do I get rid of non-printable characters?

Instead of using:

with open('output.txt', 'w') as self._open_file:

Try using:

import codecs

with codecs.open('output.txt', 'w', 'utf-8')

This way the new file is opened with the correct utf-8 encoding.

Replace specific control/non-printable characters from string

Making the following change (replacing allSQL value with the output of .replace method of String object) produces the desired output:

NON_PRINTABLE = """\r\n\t"""
def parseSQLFile(filename):
    with open(filename, 'r') as sqlFile:
        allSQL = sqlFile.read()
        # replace specific control characters with spaces to prevent sql compiler errors
        for char in NON_PRINTABLE:
            allSQL = allSQL.replace(char,' ') #updating allSQL with returned value

    return allSQL

output:

'select  s.id  ,s.src_cnt  ,s.out_file  from  kpi_index_ros.composites s  ,kpi_index_ros.kpi_index_rosoards d where  1 = 1  and s.kpi_index_rosoard_id (+) = d.id  and d.active = 1 ;'

As of the second part of your question - regarding efficiency of such approach, you should probably refer to benchmark results in this answer.

Stripping Non Printable Characters from a String in Python