Python, Remove All Non-Alphabet Chars from String

Use re.sub

import re

regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'

Alternatively, if you only want to remove a specific set of characters (an apostrophe might be acceptable in your input, for example):

regex = re.compile(r'[,.!?]')  # etc.
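As a minimal sketch of that second approach, the snippet below removes only a small, hand-picked set of punctuation marks and leaves apostrophes alone. The helper name strip_punctuation and the sample sentence are illustrative assumptions, not part of the original answer.

```python
import re

# Characters to remove; inside a character class the dot needs no escaping.
PUNCTUATION = re.compile(r"[,.!?]")

def strip_punctuation(text):
    """Remove commas, periods, exclamation marks and question marks only."""
    return PUNCTUATION.sub("", text)

print(strip_punctuation("Don't panic, it's fine!"))
# Don't panic it's fine
```

Extending the character class is usually enough if more characters need to go.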

How to remove non-alpha-numeric characters from strings within a dataframe column in Python?

Use str.replace.

df
  strings
0  a#bc1!
1   a(b$c

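# regex=True is required in pandas 2.0 and later, where str.replace matches literally by default
df.strings.str.replace('[^a-zA-Z]', '', regex=True)
0    abc
1    abc
Name: strings, dtype: object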

To retain alphanumeric characters (not just letters, as your expected output suggests), you'll need:

df.strings.str.replace(r'\W', '', regex=True)
0    abc1
1     abc
Name: strings, dtype: object
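One caveat with \W is that it treats the underscore and non-ASCII letters and digits as word characters, so they survive the replacement. The plain-re sketch below (no pandas needed; the sample string is made up for illustration) contrasts it with an explicit ASCII class:

```python
import re

sample = "a_b#c1! é"

# \W matches anything that is not a Unicode word character,
# so the underscore and the accented letter are kept.
print(re.sub(r"\W", "", sample))
# a_bc1é

# An explicit class keeps ASCII letters and digits only.
print(re.sub(r"[^a-zA-Z0-9]", "", sample))
# abc1
```

Either pattern can be passed to Series.str.replace in the same way as the ones above.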

Remove non-alphabetic characters from a list of lists and maintain structure

You want to replace every character that is neither a space nor alphanumeric, then trim and lowercase the string. Regular expressions are efficient for this kind of replacement and chain well with str.strip.

Rebuild the nested lists with a nested list comprehension:

import re

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]

result = [[re.sub(r"[^ \w]", " ", x).strip().lower() for x in y] for y in csvarticles]

print(result)

prints:

[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]

If you're using Python 3 (3.3 or later), replace lower with casefold to handle special locale characters.
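To see what difference that makes, here is a small sketch using the German sharp s (the sample word is just an illustration):

```python
# German sharp s: lower() leaves it alone, casefold() expands it.
word = "Straße"

print(word.lower())
# straße
print(word.casefold())
# strasse

# casefold() makes caseless comparison work where lower() does not.
print("STRASSE".lower() == word.lower())        # False
print("STRASSE".casefold() == word.casefold())  # True
```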

Stripping everything but alphanumeric chars from a string in Python

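I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the standard-library string module). A compiled '[\W_]+' pattern with pattern.sub('', str) was the fastest. Note that in Python 3 filter returns a lazy iterator rather than a string, so the filter variant there only measures creating the iterator unless it is wrapped in ''.join(...).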

$ python -m timeit -s \
"import string" \
"''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
"import string" \
"filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
"import re, string; pattern = re.compile('[\W_]+')" \
"pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
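If you'd rather rerun the comparison from a script than from the shell, something like the following sketch should work on Python 3. The function names are made up for illustration, and absolute timings will differ from the numbers above depending on machine and Python version.

```python
import re
import string
import timeit

pattern = re.compile(r"[\W_]+")

def with_join(text=string.printable):
    return "".join(ch for ch in text if ch.isalnum())

def with_filter(text=string.printable):
    # filter() is lazy in Python 3, so join is needed to build the string.
    return "".join(filter(str.isalnum, text))

def with_compiled_regex(text=string.printable):
    return pattern.sub("", text)

for func in (with_join, with_filter, with_compiled_regex):
    seconds = timeit.timeit(func, number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10,000 calls")
```

All three functions return the same string, so the comparison is like for like.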

Most Pythonic way to strip all non-alphanumeric leading characters from string

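If you want to remove leading non-alphanumeric characters (the "s and" check stops the loop once the string is empty instead of raising an IndexError):

while s and not s[0].isalnum(): s = s[1:]

If you want to remove only leading non-alphabetic characters:

while s and not s[0].isalpha(): s = s[1:]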

Sample:

s = '!@#yourname!@#'
while s and not s[0].isalpha(): s = s[1:]
print(s)

Output:

yourname!@#
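A single regular-expression substitution is another way to do this in one pass, and it copes with empty strings or strings containing no alphanumeric characters without any special-casing. This is only a sketch: the function name is an assumption, and [\W_] only roughly mirrors what isalnum() rejects.

```python
import re

def strip_leading_non_alnum(s):
    """Drop the leading run of characters that are not letters or digits."""
    # ^ anchors at the start; + consumes the whole leading run at once.
    return re.sub(r"^[\W_]+", "", s)

print(strip_leading_non_alnum("!@#yourname!@#"))
# yourname!@#
print(repr(strip_leading_non_alnum("!@#")))
# ''
```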

How to remove every word with non alphabetic characters

Using regular expressions to match only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
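If underscores should not count as letters either, a slightly stricter variant is possible. The sketch below (the names LETTERS_ONLY and keep_alphabetic_words are hypothetical) adds the underscore to the excluded set and uses fullmatch, so a token has to consist of letters from start to end:

```python
import re

# [^\W\d_] = a word character that is neither a digit nor an underscore,
# i.e. a letter; fullmatch requires the whole token to be letters.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")

def keep_alphabetic_words(text):
    return " ".join(t for t in text.split() if LETTERS_ONLY.fullmatch(t))

print(keep_alphabetic_words("asdf@gmail.com said: I've taken 2 reports to the boss"))
# taken reports to the boss
print(keep_alphabetic_words("snake_case and café stay or go"))
# and café stay or go
```

Accented letters still count as letters here, since \w is Unicode-aware for str patterns.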

How to remove every non-alphabetic character in Python 3

According to the Python 3.3 docs:

str.isalpha()
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.

So isalpha() returns True for accented and other non-ASCII letters as well as the ASCII letters you want.

The easiest way to isolate the ASCII letters may be to import string.ascii_letters, which is a string of all lowercase and uppercase ASCII letters, and then test membership:

>>> from string import ascii_letters
>>> for element in chars:  # chars is your input string
...     if element in ascii_letters:
...         print(element)
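If you want the cleaned string back rather than printing one character at a time, the same membership test can be folded into a join. A minimal sketch, with keep_ascii_letters as a made-up helper name:

```python
from string import ascii_letters

def keep_ascii_letters(text):
    """Return text with everything except a-z and A-Z removed."""
    allowed = set(ascii_letters)  # set lookup is cheaper than scanning the string
    return "".join(ch for ch in text if ch in allowed)

print("é".isalpha())
# True
print(keep_ascii_letters("Café #42, naïve!"))
# Cafnave
```

As the output shows, accented letters are dropped entirely rather than converted to their unaccented form.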

