Removing control characters from a string in python
There are hundreds of control characters in Unicode. If you are sanitizing data from the web or some other source that might contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".
This snippet removes all control characters from a string.
import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
Examples of unicode categories:
>>> from unicodedata import category
>>> category('\r') # carriage return --> Cc : control character
'Cc'
>>> category('\0') # null character ---> Cc : control character
'Cc'
>>> category('\t') # tab --------------> Cc : control character
'Cc'
>>> category(' ') # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A') # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',') # comma -----------> Po : punctuation
'Po'
>>>
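One consequence of the category test is that tabs and newlines are category Cc too, so the snippet removes them along with everything else. If some whitespace should survive, a minimal sketch along these lines may help. The keep parameter is a hypothetical addition for illustration, not part of the original snippet:

```python
import unicodedata

def remove_control_characters(s, keep="\n\t"):
    """Drop characters whose category starts with "C", except those in `keep`.

    `keep` is a hypothetical parameter added for illustration.
    """
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )

sample = "col1\tcol2\x00\u200b\r\nrow"
cleaned = remove_control_characters(sample)
print(repr(cleaned))  # 'col1\tcol2\nrow'
```

Passing keep="" gives the same behaviour as the original one-liner.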
Remove specific characters from a string in Python
Strings in Python are immutable (can't be changed). Because of this, the effect of line.replace(...) is just to create a new string, rather than changing the old one. You need to rebind (assign) it to line in order to have that variable take the new value, with those characters removed.
Also, the way you are doing it is going to be kind of slow, relatively. It's also likely to be a bit confusing to experienced pythonators, who will see a doubly-nested structure and think for a moment that something more complicated is going on.
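The point about immutability and rebinding can be seen in a short sketch (the sample string is made up for illustration):

```python
line = "he!!o w@rld#"

# replace() returns a new string; the original is untouched.
line.replace("!", "")
print(line)  # he!!o w@rld#

# Rebinding the name keeps the result of each call.
for ch in "!@#$":
    line = line.replace(ch, "")
print(line)  # heo wrld
```

This works, but it scans the string once per unwanted character, which is part of why the alternatives are usually preferred.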
Starting in Python 2.6 and newer Python 2.x versions *, you can instead use str.translate (see the Python 3 answer below):
line = line.translate(None, '!@#$')
or regular expression replacement with re.sub
import re
line = re.sub('[!@#$]', '', line)
The characters enclosed in brackets constitute a character class. Any characters in line which are in that class are replaced with the second parameter to sub: an empty string.
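If the set of unwanted characters comes from a variable, some of them (such as ], ^, - or a backslash) would change the meaning of the character class. A minimal sketch that builds the class safely with re.escape follows; strip_chars is a hypothetical helper name, not something from the answer:

```python
import re

def strip_chars(text, unwanted):
    """Remove every character in `unwanted` from `text` (hypothetical helper)."""
    if not unwanted:
        # An empty class "[]" is not a valid pattern, so return early.
        return text
    # re.escape protects characters that are special inside a pattern.
    pattern = "[" + re.escape(unwanted) + "]"
    return re.sub(pattern, "", text)

print(strip_chars("a-b]c^d!e", "-]^!"))  # abcde
print(strip_chars("1!2@3#4$", "!@#$"))   # 1234
```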
Python 3 answer
In Python 3, strings are Unicode. You'll have to translate a little differently. kevpie mentions this in a comment on one of the answers, and it's noted in the documentation for str.translate.
When calling the translate method of a Unicode string, you cannot pass the second parameter that we used above. You also can't pass None as the first parameter. Instead, you pass a translation table (usually a dictionary) as the only parameter. This table maps the ordinal values of characters (i.e. the result of calling ord on them) to the ordinal values of the characters which should replace them, or, usefully to us, None to indicate that they should be deleted.
So to do the above dance with a Unicode string you would call something like
translation_table = dict.fromkeys(map(ord, '!@#$'), None)
unicode_line = unicode_line.translate(translation_table)
Here dict.fromkeys and map are used to succinctly generate a dictionary containing
{ord('!'): None, ord('@'): None, ...}
Even simpler, as another answer puts it, create the translation table in place:
unicode_line = unicode_line.translate({ord(c): None for c in '!@#$'})
Or, as brought up by Joseph Lee, create the same translation table with str.maketrans:
unicode_line = unicode_line.translate(str.maketrans('', '', '!@#$'))
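As a quick check, the three ways of building the table shown above produce the same dictionary in Python 3. The variable names and the sample string below are only illustrative:

```python
unwanted = "!@#$"

table_fromkeys = dict.fromkeys(map(ord, unwanted), None)
table_comprehension = {ord(c): None for c in unwanted}
table_maketrans = str.maketrans("", "", unwanted)

# All three are plain dicts mapping code points to None.
print(table_maketrans == table_fromkeys == table_comprehension)  # True

unicode_line = "p@ss!w#rd$"
print(unicode_line.translate(table_maketrans))  # psswrd
```

Building the table once and reusing it avoids recreating the dictionary for every string.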
* For compatibility with earlier Pythons, you can create a "null" translation table to pass in place of None:
import string
line = line.translate(string.maketrans('', ''), '!@#$')
Here string.maketrans is used to create a translation table, which is just a string containing the characters with ordinal values 0 to 255.
Deleting specific control characters(\n \r \t) from a string
I think the fastest way is to use str.translate():
import string
s = "a\nb\rc\td"
# Python 2: both maketrans arguments must have the same length, hence three spaces.
print s.translate(string.maketrans("\n\t\r", "   "))
prints
a b c d
EDIT: As this once again turned into a discussion about performance, here are some numbers. For long strings, translate() is way faster than using regular expressions:
import re, string
s = "a\nb\rc\td " * 1250000
regex = re.compile(r'[\n\r\t]')
%timeit t = regex.sub(" ", s)
# 1 loops, best of 3: 1.19 s per loop
# three spaces: one replacement per character in the first argument
table = string.maketrans("\n\t\r", "   ")
%timeit s.translate(table)
# 10 loops, best of 3: 29.3 ms per loop
That's about a factor of 40.
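The numbers above come from Python 2 byte strings. A rough Python 3 equivalent, using str.maketrans and the timeit module instead of IPython's %timeit, might look like the sketch below. The string length and repeat count are arbitrary choices, and the ratio on Python 3 may differ from the one quoted, so it is worth measuring on your own data:

```python
import re
import timeit

s = "a\nb\rc\td " * 10_000

# In Python 3 the table comes from str.maketrans; both strings must have
# the same length, so three spaces map to the three characters.
table = str.maketrans("\n\t\r", "   ")
regex = re.compile(r"[\n\r\t]")

# Both approaches should produce the same result.
assert s.translate(table) == regex.sub(" ", s)

# Timings vary by machine and Python version, so none are quoted here.
t_translate = timeit.timeit(lambda: s.translate(table), number=20)
t_regex = timeit.timeit(lambda: regex.sub(" ", s), number=20)
print(f"translate: {t_translate:.4f}s  regex: {t_regex:.4f}s")
```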
Remove all special characters, punctuation and spaces from string
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.
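One thing to keep in mind is that in Python 3 str.isalnum() is Unicode-aware, so accented and non-Latin letters count as alphanumeric. If only ASCII letters and digits should remain, an explicit allow-list is one option. The sample text and variable names below are made up for illustration:

```python
import string

text = "Café 両 ok_1!"

# str.isalnum() keeps any Unicode letter or digit.
unicode_alnum = "".join(ch for ch in text if ch.isalnum())
print(unicode_alnum)  # Café両ok1

# An explicit allow-list keeps ASCII letters and digits only.
allowed = set(string.ascii_letters + string.digits)
ascii_alnum = "".join(ch for ch in text if ch in allowed)
print(ascii_alnum)  # Cafok1
```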
Stripping non printable characters from a string in python
Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
import unicodedata, re, itertools, sys
all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
    return control_char_re.sub('', s)
For Python 2:
import unicodedata, re, sys
all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
    return control_char_re.sub('', s)
For some use-cases, additional categories (e.g. all from the control group) might be preferable, although this might slow down the processing time and increase memory usage significantly. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
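The per-category counts can be recomputed with a short Python 3 sketch. The Cf and Cn figures depend on the Unicode version bundled with your interpreter (reported by unicodedata.unidata_version), so they may not match the numbers listed above:

```python
import sys
import unicodedata
from collections import Counter

# Tally the general category of every code point (takes a moment to run).
counts = Counter(
    unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1)
)

print("Unicode version:", unicodedata.unidata_version)
for cat in ("Cc", "Cf", "Cs", "Co", "Cn"):
    print(cat, counts[cat])
```

The large Co and Cn counts are the reason a character class covering the whole "C" group gets so much bigger than one covering Cc alone.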
Edit: Added suggestions from the comments.
How to remove special characters from a string before specific character?
You can use
df['NEW_EMAIL'] = df['EMAIL'].str.replace(r'[._-](?=[^@]*@)', '', regex=True)
See the regex demo. Details:
[._-] - a ., _ or - char
(?=[^@]*@) - a positive lookahead that requires the presence of any zero or more chars other than @ and then a @ char immediately to the right of the current location.
If you need to replace/remove any special char, you should use
df['NEW_EMAIL'] = df['EMAIL'].str.replace(r'[\W_](?=[^@]*@)', '', regex=True)
See a Pandas test:
>>> import pandas as pd
>>> df = pd.DataFrame({'EMAIL':['ab_cd_123@email.com', 'ab_cd.12-3@email.com']})
>>> df['EMAIL'].str.replace(r'[._-](?=[^@]*@)', '', regex=True)
0 abcd123@email.com
1 abcd123@email.com
Name: EMAIL, dtype: object
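The same pattern can be tried without pandas using the standard re module. In this sketch the second address is made up to show that characters after the @ are left alone:

```python
import re

# A '.', '_' or '-' that still has an '@' somewhere to its right.
local_part_specials = re.compile(r"[._-](?=[^@]*@)")

emails = ["ab_cd_123@email.com", "ab_cd.12-3@e-mail.co.uk"]
cleaned = [local_part_specials.sub("", e) for e in emails]
print(cleaned)
# ['abcd123@email.com', 'abcd123@e-mail.co.uk']
```

Once the scan passes the @, the lookahead can no longer find one to the right, so the domain part is never modified.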
How can I remove special characters from a list of elements in python?
Use the str.translate() method to apply the same translation table to all strings:
removetable = str.maketrans('', '', '@#%')
out_list = [s.translate(removetable) for s in my_list]
The str.maketrans() static method is a helpful tool to produce the translation map; the first two arguments are empty strings because you are not replacing characters, only removing. The third string holds all characters you want to remove.
Demo:
>>> my_list = ["on@3", "two#", "thre%e"]
>>> removetable = str.maketrans('', '', '@#%')
>>> [s.translate(removetable) for s in my_list]
['on3', 'two', 'three']
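If the goal is to strip all ASCII punctuation rather than a hand-picked set, the third argument could be string.punctuation. This is a sketch under that assumption, with an extra made-up list element:

```python
import string

# string.punctuation holds the ASCII punctuation characters.
removetable = str.maketrans("", "", string.punctuation)

my_list = ["on@3", "two#", "thre%e", "fo-ur!"]
out_list = [s.translate(removetable) for s in my_list]
print(out_list)
# ['on3', 'two', 'three', 'four']
```

Non-ASCII punctuation (for example curly quotes) is not in string.punctuation and would need to be added to the third argument explicitly.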