Stripping Everything But Alphanumeric Chars from a String in Python

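I timed a few approaches out of curiosity. Each test removes the non-alphanumeric characters from string.printable (a constant in the standard-library string module). A compiled '[\W_]+' pattern used as pattern.sub('', str) was the fastest. Note that the filter variant assumes Python 2, where filter returns a string; on Python 3 it returns an iterator, so wrap it in ''.join(...) to get a string.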

$ python -m timeit -s \
"import string" \
"''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
"import string" \
"filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
"import re, string; pattern = re.compile('[\W_]+')" \
"pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
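
The same comparison can be run from a script with the timeit module. This is only a minimal sketch for Python 3: the function names are made up for illustration, and the absolute numbers will differ by machine and Python version.

```python
import re
import string
import timeit

# Compiled once, as in the fastest command-line variant above.
PATTERN = re.compile(r'[\W_]+')

def strip_join(text):
    # Generator expression with str.isalnum
    return ''.join(ch for ch in text if ch.isalnum())

def strip_filter(text):
    # filter() yields an iterator on Python 3, so the result is joined
    return ''.join(filter(str.isalnum, text))

def strip_regex(text):
    # Precompiled regex substitution
    return PATTERN.sub('', text)

if __name__ == '__main__':
    for func in (strip_join, strip_filter, strip_regex):
        seconds = timeit.timeit(lambda: func(string.printable), number=10000)
        print(f'{func.__name__}: {seconds:.4f} s for 10000 calls')
```

All three functions return the same string for string.printable, so the timings compare like with like.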

Python, remove all non-alphabet chars from string

Use re.sub

import re

regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)

regex = re.compile(r'[,\.!?]')  # etc.
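
As a rough illustration of the difference, the sketch below removes only a small set of punctuation and leaves apostrophes alone. The sample sentence is made up for the example.

```python
import re

# Remove only commas, periods, exclamation marks and question marks;
# apostrophes and every other character are left untouched.
regex = re.compile(r'[,.!?]')

cleaned = regex.sub('', "Don't panic, it's fine!")
print(cleaned)  # Don't panic it's fine
```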

How to keep only alphanumeric and space, and also ignore non-ASCII?

re.sub(r'[^A-Za-z0-9 ]+', '', s)

(Edit) To clarify:
The [] create a character class. The ^ negates it. A-Za-z are the letters of the English alphabet, 0-9 are the digits, and the literal space before the ] is a space. Any run of one or more characters outside that class (that is, anything that is not A-Z, a-z, 0-9, or space) is replaced with the empty string.
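
A short sketch of how this behaves on input with punctuation and non-ASCII letters. The function name and sample string are assumptions for illustration only.

```python
import re

def keep_alnum_and_space(s):
    # Drop every run of characters that is not an ASCII letter, digit or space.
    return re.sub(r'[^A-Za-z0-9 ]+', '', s)

# The accented letters are non-ASCII, so they are removed along with the punctuation.
print(keep_alnum_and_space('Café #42, naïve test!'))  # Caf 42 nave test
```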

Most Pythonic way to strip all non-alphanumeric leading characters from string

If you want to remove leading non-alpha/numeric values:

while s and not s[0].isalnum(): s = s[1:]  # "s and" avoids IndexError when s is or becomes empty

If you want to remove only leading non-alphabet characters:

while s and not s[0].isalpha(): s = s[1:]  # "s and" avoids IndexError when s is or becomes empty

Sample:

s = '!@#yourname!@#'
while s and not s[0].isalpha(): s = s[1:]
print(s)

Output:

yourname!@#
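
An alternative, not taken from the answer above, is a single anchored regex substitution, which avoids building a new string on every loop iteration. This is a sketch with a hypothetical function name; note that it treats only ASCII letters and digits as alphanumeric, whereas str.isalnum also accepts other Unicode letters and digits.

```python
import re

def strip_leading_non_alnum(s):
    # ^ anchors the match at the start of the string, so only the
    # leading run of non-alphanumeric characters is removed.
    return re.sub(r'^[^a-zA-Z0-9]+', '', s)

print(strip_leading_non_alnum('!@#yourname!@#'))  # yourname!@#
```

An empty string, or one with no alphanumeric characters at all, simply comes back empty.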

Replace all non-alphanumeric characters in a string

Regex to the rescue!

import re

s = re.sub('[^0-9a-zA-Z]+', '*', s)

Example:

>>> re.sub('[^0-9a-zA-Z]+', '*', 'h^&ell`.,|o w]{+orld')
'h*ell*o*w*orld'

Pandas remove non-alphanumeric characters from string column

You can use regex for this.

df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']

Python regex to remove alphanumeric characters without removing words at the end of the string

If it should be the last word in a string and there are always multiple words, you might use:

[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$
  • [ \t]+ Match 1+ spaces or tabs
  • (?=[a-zA-Z0-9/]{5}) Assert at least 5 chars of any of the listed
  • [a-zA-Z/]* Match 0+ times any of the listed
  • [0-9] Match a digit
  • [a-zA-Z0-9/]* Match 0+ times any of the listed in the character class
  • [A-Za-z] Match a char a-zA-Z
  • $ End of string


In the replacement use an empty string.
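
Applied with re.sub, the pattern could be used roughly like this. The function name and the sample strings are made up for the example.

```python
import re

# Pattern from the answer above: a final whitespace-separated token of at
# least 5 characters that contains a digit and ends with a letter.
TRAILING_CODE = re.compile(
    r'[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$'
)

def drop_trailing_code(text):
    # Replace the matched trailing token (and the spaces before it) with nothing.
    return TRAILING_CODE.sub('', text)

print(drop_trailing_code('invoice paid ab12c'))  # invoice paid
print(drop_trailing_code('invoice paid today'))  # invoice paid today
```

The second string is left unchanged because its last word contains no digit.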

How to remove non-alpha-numeric characters from strings within a dataframe column in Python?

Use str.replace.

df
  strings
0  a#bc1!
1   a(b$c

df.strings.str.replace('[^a-zA-Z]', '', regex=True)  # regex=True is required on pandas 2.0+, where the default is a literal match
0 abc
1 abc
Name: strings, dtype: object

To retain alphanumeric characters (not just alphabets as your expected output suggests), you'll need:

df.strings.str.replace(r'\W', '', regex=True)  # note: \W also keeps underscores
0 abc1
1 abc
Name: strings, dtype: object

