Stripping everything but alphanumeric chars from a string in Python
I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module). The use of compiled '[\W_]+' and pattern.sub('', str) was found to be fastest.
$ python -m timeit -s \
"import string" \
"''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop
$ python -m timeit -s \
"import string" \
"filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop
$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop
$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop
$ python -m timeit -s \
"import re, string; pattern = re.compile('[\W_]+')" \
"pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
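The fastest variant above can be wrapped in a small helper. This is a minimal sketch: the names NON_ALNUM and strip_non_alnum are made up for illustration, and the timing it prints will differ from the figures above depending on machine and Python version.

```python
import re
import string
import timeit

# Compile once and reuse; [\W_]+ matches runs of characters that are
# neither letters nor digits (\W alone would leave underscores in place).
NON_ALNUM = re.compile(r"[\W_]+")

def strip_non_alnum(text):
    """Return text with every non-alphanumeric character removed."""
    return NON_ALNUM.sub("", text)

if __name__ == "__main__":
    print(strip_non_alnum(string.printable))
    # Rough timing only; results vary by machine and interpreter.
    runs = 10_000
    elapsed = timeit.timeit(lambda: strip_non_alnum(string.printable), number=runs)
    print(f"{elapsed / runs * 1e6:.2f} usec per call")
```

Note that in Python 3 the \W class is Unicode-aware for str patterns, so accented letters are kept unless re.ASCII is passed.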
Python, remove all non-alphabet chars from string
Use re.sub
import re
regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'
Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)
regex = re.compile('[,\.!?]') #etc.
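As a quick illustration of the narrower character set, here is a sketch that strips only a few punctuation marks and leaves apostrophes alone. The sample sentence and the variable names are invented for the example.

```python
import re

# Only these four punctuation marks are removed; apostrophes and
# every other character are left untouched.
punctuation = re.compile(r"[,.!?]")

cleaned = punctuation.sub("", "Don't stop, believing!")
print(cleaned)  # Don't stop believing
```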
How to keep only alphanumeric and space, and also ignore non-ASCII?
re.sub(r'[^A-Za-z0-9 ]+', '', s)
(Edit) To clarify:
The [] define a character class. The ^ negates it. A-Za-z is the English alphabet, 0-9 is the digits, and ' ' is a space. Any run of one or more characters outside the class (that is, anything that is not A-Z, a-z, 0-9, or space) is replaced with the empty string.
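A short sketch of that pattern applied to a string containing non-ASCII letters. The function name and the sample input are assumptions made for the example.

```python
import re

def keep_ascii_alnum_and_space(s):
    # Drop every run of characters that is not an ASCII letter,
    # an ASCII digit, or a space.
    return re.sub(r"[^A-Za-z0-9 ]+", "", s)

print(keep_ascii_alnum_and_space("naïve café 42!"))  # nave caf 42
```

Because the class lists explicit ASCII ranges, accented letters such as ï and é are removed rather than transliterated.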
Most Pythonic way to strip all non-alphanumeric leading characters from string
If you want to remove leading non-alpha/numeric values:
while s and not s[0].isalnum(): s = s[1:]
If you want to remove only leading non-alphabet characters:
while s and not s[0].isalpha(): s = s[1:]
Sample:
s = '!@#yourname!@#'
while s and not s[0].isalpha(): s = s[1:]
print(s)
Output:
yourname!@#
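A regex anchored at the start of the string is one possible alternative to the loop. It avoids repeated slicing and also copes with an empty or all-symbol string. This is a sketch, not the answer's own method, and strip_leading_non_alnum is a made-up name. Unlike str.isalnum(), it also treats the underscore as a character to strip.

```python
import re

def strip_leading_non_alnum(s):
    # ^ anchors the match at the start of the string, so only the
    # leading run of non-alphanumeric characters is removed.
    return re.sub(r"^[\W_]+", "", s)

print(strip_leading_non_alnum("!@#yourname!@#"))  # yourname!@#
```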
Replace all non-alphanumeric characters in a string
Regex to the rescue!
import re
s = re.sub('[^0-9a-zA-Z]+', '*', s)
Example:
>>> re.sub('[^0-9a-zA-Z]+', '*', 'h^&ell`.,|o w]{+orld')
'h*ell*o*w*orld'
Pandas remove non-alphanumeric characters from string column
You can use regex for this.
df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
Python regex to remove alphanumeric characters without removing words at the end of the string
If it should be the last word in a string and there are always multiple words, you might use:
[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$
[ \t]+ Match 1+ spaces or tabs
(?=[a-zA-Z0-9/]{5}) Assert at least 5 chars of any of the listed
[a-zA-Z/]* Match 0+ times any of the listed
[0-9] Match a digit
[a-zA-Z0-9/]* Match 0+ times any of the listed in the character class
[A-Za-z] Match a char a-zA-Z
$ End of string
In the replacement use an empty string.
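To see the pattern in action, here is a small sketch. The helper name and both sample strings are invented for illustration, since the original question's data is not shown.

```python
import re

# Trailing token of at least 5 chars that contains a digit and ends
# in a letter, together with the whitespace in front of it.
TRAILING_CODE = re.compile(
    r"[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$"
)

def remove_trailing_code(text):
    # Replace the match with an empty string, as described above.
    return TRAILING_CODE.sub("", text)

print(remove_trailing_code("some product AB12/CD"))  # some product
print(remove_trailing_code("keep these words"))      # keep these words
```

The second string is left alone because its last word contains no digit, so the [0-9] part of the pattern cannot match.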
How to remove non-alpha-numeric characters from strings within a dataframe column in Python?
Use str.replace.
df
strings
0 a#bc1!
1 a(b$c
df.strings.str.replace('[^a-zA-Z]', '', regex=True)
0 abc
1 abc
Name: strings, dtype: object
To retain alphanumeric characters (not just letters, as your expected output suggests), you'll need the following; note that \W also keeps underscores:
df.strings.str.replace(r'\W', '', regex=True)
0 abc1
1 abc
Name: strings, dtype: object
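Putting both variants together, here is a self-contained sketch using the two sample rows shown above. It assumes pandas is installed, and it passes regex=True explicitly because recent pandas versions treat the pattern as a literal string by default. The variable names letters_only and alnum_only are made up for the example.

```python
import pandas as pd

# The two sample rows from the answer above.
df = pd.DataFrame({"strings": ["a#bc1!", "a(b$c"]})

# Keep letters only.
letters_only = df["strings"].str.replace(r"[^a-zA-Z]", "", regex=True)

# Keep letters and digits; an explicit class also drops underscores,
# which \W would leave in place.
alnum_only = df["strings"].str.replace(r"[^a-zA-Z0-9]", "", regex=True)

print(letters_only.tolist())  # ['abc', 'abc']
print(alnum_only.tolist())    # ['abc1', 'abc']
```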