Python, remove all non-alphabet chars from string
Use re.sub
import re
regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'
Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)
regex = re.compile(r'[,\.!?]') #etc.
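The two patterns above can be folded into one small helper. This is only a sketch: the name strip_non_alpha and its keep parameter are made up for illustration and are not part of the re module.

```python
import re

def strip_non_alpha(text, keep=""):
    """Remove every character that is not an ASCII letter or listed in `keep`.

    `strip_non_alpha` and `keep` are illustrative names, not a library API.
    """
    # re.escape makes characters such as ']' or '-' safe inside the class.
    pattern = re.compile("[^a-zA-Z" + re.escape(keep) + "]")
    return pattern.sub("", text)

print(strip_non_alpha("ab3d*E"))                   # abdE
print(strip_non_alpha("don't stop!", keep="' "))   # don't stop
```

Building the character class from keep means the apostrophe case needs no separate regex.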
How to remove non-alpha-numeric characters from strings within a dataframe column in Python?
Use str.replace.
df
strings
0 a#bc1!
1 a(b$c
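# regex=True is needed in pandas 2.0+, where the pattern is treated as a literal string by default
df.strings.str.replace('[^a-zA-Z]', '', regex=True)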
0 abc
1 abc
Name: strings, dtype: object
To retain alphanumeric characters (not just alphabets as your expected output suggests), you'll need:
df.strings.str.replace(r'\W', '', regex=True)
0 abc1
1 abc
Name: strings, dtype: object
Remove non-alphabetic characters from a list of lists and maintain structure
You want to replace every character that is neither a space nor alphanumeric, then trim and lowercase the string. Regular expressions are efficient for this kind of replacement, and the result can be chained with str.strip.
Rebuild the nested lists in a double list comp:
import re
csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
result = [[re.sub(r"[^ \w]"," ",x).strip().lower() for x in y] for y in csvarticles]
print(result)
prints:
[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
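If the same cleanup is needed in more than one place, the comprehension can be wrapped in a function with a precompiled pattern. A minimal sketch, assuming every cell is a string; clean_rows and NON_WORD are hypothetical names chosen here:

```python
import re

# Anything that is neither a space nor a word character (letter, digit, underscore).
NON_WORD = re.compile(r"[^ \w]")

def clean_rows(rows):
    """Return a new list of lists with each cell cleaned; nesting is preserved."""
    return [[NON_WORD.sub(" ", cell).strip().lower() for cell in row] for row in rows]

rows = [["[Beta-blockers]", "Magic!"], ["Arterial hypertension.", ""]]
print(clean_rows(rows))
# [['beta blockers', 'magic'], ['arterial hypertension', '']]
```

The input list is left untouched, and empty cells stay empty so the column positions do not shift.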
If you're using Python 3.3 or later, replace lower with casefold to handle special locale characters.
Stripping everything but alphanumeric chars from a string in Python
I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module). Using a compiled '[\W_]+' pattern with pattern.sub('', str) was found to be fastest.
$ python -m timeit -s \
"import string" \
"''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop
$ python -m timeit -s \
"import string" \
"filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop
$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop
$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop
$ python -m timeit -s \
"import re, string; pattern = re.compile('[\W_]+')" \
"pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
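To repeat the comparison from inside a script, something like the following should work. It is a sketch rather than a rigorous benchmark: the candidates dictionary and the run count are arbitrary choices, and the numbers will differ by machine and Python version. Note that in Python 3 filter returns a lazy iterator, so it is wrapped in ''.join here to do the same work as the other candidates.

```python
import re
import string
import timeit

pattern = re.compile(r"[\W_]+")

# Each candidate returns string.printable with non-alphanumerics removed.
candidates = {
    "join + isalnum": lambda: "".join(ch for ch in string.printable if ch.isalnum()),
    "filter + join": lambda: "".join(filter(str.isalnum, string.printable)),
    "compiled regex": lambda: pattern.sub("", string.printable),
}

# All approaches must agree before their timings are worth comparing.
results = {name: func() for name, func in candidates.items()}
assert len(set(results.values())) == 1

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=1_000)
    print(f"{name}: {seconds:.4f} s for 1,000 runs")
```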
Most Pythonic way to strip all non-alphanumeric leading characters from string
If you want to remove leading non-alphanumeric characters (the `s and` check avoids an IndexError when nothing is left):
while s and not s[0].isalnum(): s = s[1:]
If you want to remove only leading non-alphabetic characters:
while s and not s[0].isalpha(): s = s[1:]
Sample:
s = '!@#yourname!@#'
while s and not s[0].isalpha(): s = s[1:]
print(s)
Output:
yourname!@#
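The slicing loop copies the string once per removed character. Two alternatives that avoid this are sketched below; they are not from the original answer, and the function names are made up. Both return an empty string when the input contains no matching character.

```python
import re
from itertools import dropwhile

def lstrip_non_alnum(s):
    """Drop leading characters until the first letter or digit."""
    # ^ anchors at the start; [\W_]+ is one or more non-alphanumeric characters.
    return re.sub(r"^[\W_]+", "", s)

def lstrip_non_alpha(s):
    """Drop leading characters until the first letter."""
    return "".join(dropwhile(lambda ch: not ch.isalpha(), s))

print(lstrip_non_alnum("!@#yourname!@#"))  # yourname!@#
print(lstrip_non_alpha("42!abc"))          # abc
print(repr(lstrip_non_alnum("!!!")))       # ''
```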
How to remove every word with non alphabetic characters
Using regular expressions to match only letters (and underscores), you can do this:
import re
s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()
tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
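Because [^\W\d] also accepts underscores, a token such as snake_case passes the filter above. If underscores should be rejected, str.isalpha is a possible alternative. The sketch below puts both behaviours side by side; keep_alpha_words and allow_underscore are hypothetical names used only for illustration.

```python
import re

# Letters or underscores only, up to the end of the token.
LETTERS_OR_UNDERSCORE = re.compile(r"[^\W\d]*$")

def keep_alpha_words(text, allow_underscore=False):
    """Keep whitespace-separated tokens made only of letters."""
    tokens = text.split()
    if allow_underscore:
        return [t for t in tokens if LETTERS_OR_UNDERSCORE.match(t)]
    return [t for t in tokens if t.isalpha()]

s = "asdf@gmail.com said: I've taken 2 snake_case reports"
print(keep_alpha_words(s))                         # ['taken', 'reports']
print(keep_alpha_words(s, allow_underscore=True))  # ['taken', 'snake_case', 'reports']
```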
How to remove every non-alphabetic character in Python 3
According to 3.3 docs:
str.isalpha()
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.
So isalpha() accepts accented and other non-ASCII letters as well as the ASCII letters you want.
The easiest way to isolate the ASCII letters may be to import string.ascii_letters, which is a string of all lower and upper case ASCII letters, then
>>> from string import ascii_letters
>>> for element in chars:  # chars is your input string
...     if element in ascii_letters:
...         print(element)
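To make the difference concrete, here is a short sketch that contrasts the ASCII-only filter with an isalpha-based one. The function names and the sample string are invented for this example.

```python
from string import ascii_letters

# A set gives constant-time membership tests instead of scanning the string.
ASCII_LETTERS = frozenset(ascii_letters)

def only_ascii_letters(text):
    """Keep a-z and A-Z only."""
    return "".join(ch for ch in text if ch in ASCII_LETTERS)

def only_unicode_letters(text):
    """Keep every character that str.isalpha() accepts."""
    return "".join(ch for ch in text if ch.isalpha())

sample = "Caf\u00e9 5, na\u00efve!"   # "Café 5, naïve!"
print(only_ascii_letters(sample))      # Cafnave
print(only_unicode_letters(sample))    # Cafénaïve
```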