Remove Characters Except Digits from String Using Python

Remove characters except digits from string using Python?

In Python 2.*, by far the fastest approach is the .translate method:

>>> x='aaa12333bb445bb54b5b52'
>>> import string
>>> all=string.maketrans('','')
>>> nodigs=all.translate(all, string.digits)
>>> x.translate(all, nodigs)
'1233344554552'
>>>

string.maketrans makes a translation table (a string of length 256) which in this case is the same as ''.join(chr(x) for x in range(256)) (just faster to make;-). .translate applies the translation table (which here is irrelevant since all essentially means identity) AND deletes characters present in the second argument -- the key part.

.translate works very differently on Unicode strings (and strings in Python 3 -- I do wish questions specified which major-release of Python is of interest!) -- not quite this simple, not quite this fast, though still quite usable.

Back to 2.*, the performance difference is impressive...:

$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)'
1000000 loops, best of 3: 1.04 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 7.9 usec per loop

Speeding things up by 7-8 times is hardly peanuts, so the translate method is well worth knowing and using. The other popular non-RE approach...:

$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())'
100000 loops, best of 3: 11.5 usec per loop

is 50% slower than RE, so the .translate approach beats it by over an order of magnitude.

In Python 3, or for Unicode, you need to pass .translate a mapping (with ordinals, not characters directly, as keys) that returns None for what you want to delete. Here's a convenient way to express this for deletion of "everything but" a few characters:

import string

class Del:
def __init__(self, keep=string.digits):
self.comp = dict((ord(c),c) for c in keep)
def __getitem__(self, k):
return self.comp.get(k)

DD = Del()

x='aaa12333bb445bb54b5b52'
x.translate(DD)

also emits '1233344554552'. However, putting this in xx.py we have...:

$ python3.1 -mtimeit -s'import re;  x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 8.43 usec per loop
$ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
10000 loops, best of 3: 24.3 usec per loop

...which shows the performance advantage disappears, for this kind of "deletion" tasks, and becomes a performance decrease.

Strip all non-numeric characters (except for .) from a string in Python

You can use a regular expression (using the re module) to accomplish the same thing. The example below matches runs of [^\d.] (any character that's not a decimal digit or a period) and replaces them with the empty string. Note that if the pattern is compiled with the UNICODE flag the resulting string could still include non-ASCII numbers. Also, the result after removing "non-numeric" characters is not necessarily a valid number.

>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>> non_decimal.sub('', '12.34fe4e')
'12.344'

Is there a way to remove all characters except letters in a string in Python?

Given

s = '@#24A-09=wes()&8973o**_##me'  # contains letters 'Awesome'    

You can filter out non-alpha characters with a generator expression:

result = ''.join(c for c in s if c.isalpha())

Or filter with filter:

result = ''.join(filter(str.isalpha, s))    

Or you can substitute non-alpha with blanks using re.sub:

import re
result = re.sub(r'[^A-Za-z]', '', s)

Removing non numeric characters from a string

The easiest way is with a regexp

import re
a = 'lkdfhisoe78347834 (())&/&745 '
result = re.sub('[^0-9]','', a)

print result
>>> '78347834745'

Removing all non-numeric characters from string in Python

>>> import re
>>> re.sub("[^0-9]", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'

How to remove all non-numeric characters (except operators) from a string in Python?

You can use regex and re.sub() to make substitutions.

For example:

expression = re.sub("[^\d+-/÷%\*]*", "", text)

will eliminate everything that is not a number or any of +-/÷%*. Obviously, is up to you to make a comprehensive list of the operators you want to keep.

That said, I'm going to paste here @KarlKnechtel's comment, literally:

Do not use eval() for anything that could possibly receive input from outside the program in any form. It is a critical security risk that allows the creator of that input to execute arbitrary code on your computer.

Remove all numbers except for the ones combined to string using python regex

Right, depending on your desired result showing underscores at all or not, try to use re.findall and raw-string notation. You currently use a character class that makes no sense:



\b(?!(?:ORIGIN|[_\d]+)\b)\w+

See an online demo



  • \b - Word-boundary;
  • (?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with nested non-capture group to match either ORIGIN or 1+ underscore/digit combinations before a trailing word-boundary;
  • \w+ - 1+ word-characters.


import re

text_file = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

new_txt=''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))

Prints:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

Remove all the characters and numbers except comma

Possible solution is the following:

# pip install pandas

import pandas as pd
pd.set_option('display.max_colwidth', 200)

# set test data and create dataframe
data = {"text": ['100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene','Polypropylene plastic', '100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene', 'Bamboo, Clear nitrocellulose lacquer', 'Willow, Stain, Solid wood, Polypropylene plastic, Stainless steel, Steel, Galvanized, Steel, 100% polypropylene', 'Banana fibres, Clear lacquer', 'Polypropylene plastic (min. 20% recycled)']}
df = pd.DataFrame(data)

def cleanup(txt):
re_pattern = re.compile(r"[^a-z, ()]", re.I)
return re.sub(re_pattern, "", txt).replace(" ", " ").strip()

df['text_cleaned'] = df['text'].apply(cleanup)
df

Returns

Sample Image

Removing numbers from string

Would this work for your situation?

>>> s = '12abcd405'
>>> result = ''.join([i for i in s if not i.isdigit()])
>>> result
'abcd'

This makes use of a list comprehension, and what is happening here is similar to this structure:

no_digits = []
# Iterate through the string, adding non-numbers to the no_digits list
for i in s:
if not i.isdigit():
no_digits.append(i)

# Now join all elements of the list with '',
# which puts all of the characters together.
result = ''.join(no_digits)

As @AshwiniChaudhary and @KirkStrauser point out, you actually do not need to use the brackets in the one-liner, making the piece inside the parentheses a generator expression (more efficient than a list comprehension). Even if this doesn't fit the requirements for your assignment, it is something you should read about eventually :) :

>>> s = '12abcd405'
>>> result = ''.join(i for i in s if not i.isdigit())
>>> result
'abcd'

Python Regex - remove all . and special characters EXCEPT the decimal point

Use a capturing group to capture only the decimal numbers and at the same time match special chars (ie. not of space and word characters).

Upon replacement, just refer to the capturing group in-order to make use of only the captured chars. ie. the whole match would be removed and replaced by the decimal number if exists.

s = 'What? The Census Says It’s Counted 99.9 Percent of Households. Don’t Be Fooled.'
import re
rgx = re.compile(r'(\d\.\d)|[^\s\w]')
rgx.sub(lambda x: x.group(1), s)
# 'What The Census Says Its Counted 99.9 Percent of Households Dont Be Fooled'

OR

Match all the dots except the one exists between the numbers and all chars except special chars and then finally replace those match chars with an empty string.

re.sub(r'(?!<\d)\.(?!\d)|[^\s\w.]', '', s)
# 'What The Census Says Its Counted 99.9 Percent of Households Dont Be Fooled'



Related Topics



Leave a reply



Submit