Stripping non-printable characters from a string in Python
Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
import unicodedata, re, itertools, sys

# build the character class by scanning every code point (slow, but general)
all_chars = (chr(i) for i in range(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently, list the two Cc ranges directly
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)
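As a quick check of the approach above, the following self-contained sketch rebuilds the control-character class from the two ranges and applies it to a sample string. The sample input is an assumption chosen for illustration.

```python
import itertools
import re

# Cc (control) code points: U+0000..U+001F and U+007F..U+009F.
control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))


def remove_control_chars(s):
    """Return s with every Cc (control) character removed."""
    return control_char_re.sub('', s)


sample = 'some\x00string. with\x15 funny characters'  # hypothetical input
cleaned = remove_control_chars(sample)
print(repr(cleaned))  # 'somestring. with funny characters'
```

Note that tab, newline, and carriage return are also in category Cc, so this pattern removes them as well.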
For Python 2:
import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)
For some use cases, additional categories (e.g. all from the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
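Some of these counts (Cf and Cn in particular) depend on the Unicode version bundled with the interpreter, so they may differ on your machine. A minimal sketch for recomputing them, and for building a pattern from more than one category, could look like this. The helper name `build_category_re` is hypothetical and not part of the original answer.

```python
import re
import sys
import unicodedata
from collections import Counter

# Count code points per category in the "C" (other) group.
counts = Counter(
    cat
    for cat in (unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1))
    if cat.startswith('C')
)
for cat, n in sorted(counts.items()):
    print(cat, n)


def build_category_re(categories):
    """Compile a character class matching every code point in the given categories."""
    chars = ''.join(
        chr(i)
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)) in categories
    )
    return re.compile('[%s]' % re.escape(chars))


# Control and format characters only; adding 'Cn' or 'Co' makes the class much larger.
cc_cf_re = build_category_re({'Cc', 'Cf'})
print(repr(cc_cf_re.sub('', 'zero\u200bwidth\x07')))  # 'zerowidth'
```

Scanning the full code point range takes noticeable time, so it is worth building the pattern once at import time and reusing it.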
Edit: Added suggestions from the comments.
Remove non-ASCII and special characters from a string in Python
You can try using a simple regex and .replace():
import re
my_string = "Bjørn 10.2.3"
# keep ASCII letters, digits, spaces and hyphens, then collapse double spaces
new_string = re.sub('[^A-Za-z0-9 -]', '', my_string).replace("  ", " ")
print(new_string)
Output:
Bjrn 1023
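One detail worth knowing about character classes of this kind: the range `A-z` covers code points 65 to 122, which includes the punctuation that sits between `Z` and `a`. The sketch below compares it with `A-Za-z`; the input string is a hypothetical one chosen to expose the difference.

```python
import re

my_string = "Bjørn_10.2.3 [draft]"  # hypothetical input with '_' and brackets

# 'A-z' spans code points 65..122, which also covers [ \ ] ^ _ and `
loose = re.sub(r'[^A-z0-9 -]', '', my_string)
# Listing both letter ranges keeps ASCII letters only.
strict = re.sub(r'[^A-Za-z0-9 -]', '', my_string)

print(loose)   # Bjrn_1023 [draft]
print(strict)  # Bjrn1023 draft
```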
How can I remove non-ASCII characters but leave periods and spaces?
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter returns an iterator rather than a string. The correct way to get a string back is:
''.join(filter(lambda x: x in printable, s))
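On Python 3 the same filtering can also be written with a generator expression, which some readers find easier to follow than filter with a lambda. This is a minimal sketch; the function name `keep_printable` is hypothetical.

```python
import string

printable = set(string.printable)


def keep_printable(s):
    """Drop every character that is not in string.printable (ASCII only)."""
    return ''.join(ch for ch in s if ch in printable)


s = "some\x00string. with\x15 funny characters"
print(keep_printable(s))        # somestring. with funny characters
print(keep_printable("Bjørn"))  # Bjrn, since string.printable is ASCII-only
```

Because string.printable contains only ASCII characters, this approach also strips accented letters and other non-ASCII text.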
How do I get rid of non-printable characters?
Instead of using:
with open('output.txt', 'w') as self._open_file:
Try using:
import codecs
with codecs.open('output.txt', 'w', 'utf-8') as self._open_file:
This way the new file is opened with the correct UTF-8 encoding.
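On Python 3 the built-in open() accepts an encoding argument, so codecs.open is not required there. The sketch below writes to a temporary directory and reads the text back; the sample text and variable names are assumptions for illustration.

```python
import os
import tempfile

text = "Bjørn 10.2.3"  # hypothetical content with a non-ASCII character

# Write into a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.txt')

    # Python 3: pass the encoding directly to the built-in open().
    with open(path, 'w', encoding='utf-8') as out_file:
        out_file.write(text)

    with open(path, 'r', encoding='utf-8') as in_file:
        round_trip = in_file.read()

print(round_trip == text)  # True
```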
Replace specific control/non-printable characters from string
Strings are immutable, so .replace() returns a new string instead of modifying allSQL in place. Making the following change (reassigning allSQL to the value returned by .replace()) produces the desired output:
NON_PRINTABLE = """\r\n\t"""

def parseSQLFile(filename):
    with open(filename, 'r') as sqlFile:
        allSQL = sqlFile.read()
    # replace specific control characters with spaces to prevent SQL compiler errors
    for char in NON_PRINTABLE:
        allSQL = allSQL.replace(char, ' ')  # update allSQL with the returned value
    return allSQL
output:
'select s.id ,s.src_cnt ,s.out_file from kpi_index_ros.composites s ,kpi_index_ros.kpi_index_rosoards d where 1 = 1 and s.kpi_index_rosoard_id (+) = d.id and d.active = 1 ;'
As for the second part of your question, regarding the efficiency of this approach, you should probably refer to the benchmark results in this answer.
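If efficiency matters, one alternative worth measuring on Python 3 is str.translate, which handles all listed characters in a single pass over the string. This is only a sketch: the names `TO_SPACE` and `replace_control_chars` and the sample SQL text are hypothetical, and whether it is faster for your data should be checked with a benchmark.

```python
NON_PRINTABLE = "\r\n\t"

# Map each listed character to a space; str.maketrans needs equal-length strings.
TO_SPACE = str.maketrans(NON_PRINTABLE, ' ' * len(NON_PRINTABLE))


def replace_control_chars(text):
    """Replace carriage returns, newlines, and tabs with spaces in one pass."""
    return text.translate(TO_SPACE)


sql = "select s.id\r\n\t,s.src_cnt\nfrom t ;"  # hypothetical SQL text
print(replace_control_chars(sql))  # select s.id   ,s.src_cnt from t ;
```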