Remove characters except digits from string using Python?
In Python 2.*, by far the fastest approach is the .translate
method:
>>> x='aaa12333bb445bb54b5b52'
>>> import string
>>> all=string.maketrans('','')
>>> nodigs=all.translate(all, string.digits)
>>> x.translate(all, nodigs)
'1233344554552'
>>>
string.maketrans
makes a translation table (a string of length 256) which in this case is the same as ''.join(chr(x) for x in range(256))
(just faster to make;-). .translate
applies the translation table (which here is irrelevant since all
essentially means identity) AND deletes characters present in the second argument -- the key part.
.translate
works very differently on Unicode strings (and strings in Python 3 -- I do wish questions specified which major-release of Python is of interest!) -- not quite this simple, not quite this fast, though still quite usable.
Back to 2.*, the performance difference is impressive...:
$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)'
1000000 loops, best of 3: 1.04 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 7.9 usec per loop
Speeding things up by 7-8 times is hardly peanuts, so the translate
method is well worth knowing and using. The other popular non-RE approach...:
$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())'
100000 loops, best of 3: 11.5 usec per loop
is 50% slower than RE, so the .translate
approach beats it by over an order of magnitude.
In Python 3, or for Unicode, you need to pass .translate
a mapping (with ordinals, not characters directly, as keys) that returns None
for what you want to delete. Here's a convenient way to express this for deletion of "everything but" a few characters:
import string
class Del:
def __init__(self, keep=string.digits):
self.comp = dict((ord(c),c) for c in keep)
def __getitem__(self, k):
return self.comp.get(k)
DD = Del()
x='aaa12333bb445bb54b5b52'
x.translate(DD)
also emits '1233344554552'
. However, putting this in xx.py we have...:
$ python3.1 -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 8.43 usec per loop
$ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
10000 loops, best of 3: 24.3 usec per loop
...which shows the performance advantage disappears, for this kind of "deletion" tasks, and becomes a performance decrease.
Strip all non-numeric characters (except for .) from a string in Python
You can use a regular expression (using the re
module) to accomplish the same thing. The example below matches runs of [^\d.]
(any character that's not a decimal digit or a period) and replaces them with the empty string. Note that if the pattern is compiled with the UNICODE
flag the resulting string could still include non-ASCII numbers. Also, the result after removing "non-numeric" characters is not necessarily a valid number.
>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>> non_decimal.sub('', '12.34fe4e')
'12.344'
Is there a way to remove all characters except letters in a string in Python?
Given
s = '@#24A-09=wes()&8973o**_##me' # contains letters 'Awesome'
You can filter out non-alpha characters with a generator expression:
result = ''.join(c for c in s if c.isalpha())
Or filter with filter
:
result = ''.join(filter(str.isalpha, s))
Or you can substitute non-alpha with blanks using re.sub
:
import re
result = re.sub(r'[^A-Za-z]', '', s)
Removing non numeric characters from a string
The easiest way is with a regexp
import re
a = 'lkdfhisoe78347834 (())&/&745 '
result = re.sub('[^0-9]','', a)
print result
>>> '78347834745'
Removing all non-numeric characters from string in Python
>>> import re
>>> re.sub("[^0-9]", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'
How to remove all non-numeric characters (except operators) from a string in Python?
You can use regex and re.sub()
to make substitutions.
For example:
expression = re.sub("[^\d+-/÷%\*]*", "", text)
will eliminate everything that is not a number or any of +-/÷%*
. Obviously, is up to you to make a comprehensive list of the operators you want to keep.
That said, I'm going to paste here @KarlKnechtel's comment, literally:
Do not use eval() for anything that could possibly receive input from outside the program in any form. It is a critical security risk that allows the creator of that input to execute arbitrary code on your computer.
Remove all numbers except for the ones combined to string using python regex
Right, depending on your desired result showing underscores at all or not, try to use re.findall
and raw-string notation. You currently use a character class that makes no sense:
\b(?!(?:ORIGIN|[_\d]+)\b)\w+
See an online demo
\b
- Word-boundary;(?!(?:ORIGIN|[_\d]+)\b)
- Negative lookahead with nested non-capture group to match eitherORIGIN
or 1+ underscore/digit combinations before a trailing word-boundary;\w+
- 1+ word-characters.
import re
text_file = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//"""
new_txt=''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))
Prints:
malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
Remove all the characters and numbers except comma
Possible solution is the following:
# pip install pandas
import pandas as pd
pd.set_option('display.max_colwidth', 200)
# set test data and create dataframe
data = {"text": ['100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene','Polypropylene plastic', '100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene', 'Bamboo, Clear nitrocellulose lacquer', 'Willow, Stain, Solid wood, Polypropylene plastic, Stainless steel, Steel, Galvanized, Steel, 100% polypropylene', 'Banana fibres, Clear lacquer', 'Polypropylene plastic (min. 20% recycled)']}
df = pd.DataFrame(data)
def cleanup(txt):
re_pattern = re.compile(r"[^a-z, ()]", re.I)
return re.sub(re_pattern, "", txt).replace(" ", " ").strip()
df['text_cleaned'] = df['text'].apply(cleanup)
df
Returns
Removing numbers from string
Would this work for your situation?
>>> s = '12abcd405'
>>> result = ''.join([i for i in s if not i.isdigit()])
>>> result
'abcd'
This makes use of a list comprehension, and what is happening here is similar to this structure:
no_digits = []
# Iterate through the string, adding non-numbers to the no_digits list
for i in s:
if not i.isdigit():
no_digits.append(i)
# Now join all elements of the list with '',
# which puts all of the characters together.
result = ''.join(no_digits)
As @AshwiniChaudhary and @KirkStrauser point out, you actually do not need to use the brackets in the one-liner, making the piece inside the parentheses a generator expression (more efficient than a list comprehension). Even if this doesn't fit the requirements for your assignment, it is something you should read about eventually :) :
>>> s = '12abcd405'
>>> result = ''.join(i for i in s if not i.isdigit())
>>> result
'abcd'
Python Regex - remove all . and special characters EXCEPT the decimal point
Use a capturing group to capture only the decimal numbers and at the same time match special chars (ie. not of space and word characters).
Upon replacement, just refer to the capturing group in-order to make use of only the captured chars. ie. the whole match would be removed and replaced by the decimal number if exists.
s = 'What? The Census Says It’s Counted 99.9 Percent of Households. Don’t Be Fooled.'
import re
rgx = re.compile(r'(\d\.\d)|[^\s\w]')
rgx.sub(lambda x: x.group(1), s)
# 'What The Census Says Its Counted 99.9 Percent of Households Dont Be Fooled'
OR
Match all the dots except the one exists between the numbers and all chars except special chars and then finally replace those match chars with an empty string.
re.sub(r'(?!<\d)\.(?!\d)|[^\s\w.]', '', s)
# 'What The Census Says Its Counted 99.9 Percent of Households Dont Be Fooled'
Related Topics
Bloomberg Server API and Ruby/Python
What's the Ruby Equivalent of Python's Os.Walk
List Comprehension in Haskell, Python and Ruby
Pyobjc VS Rubycocoa for MAC Development: Which Is More Mature
Restrictons of Python Compared to Ruby: Lambda'S
Create Static Graphics Files (Png, Gif, Jpg) Using Ruby or Python
Efficient Ways to Duplicate Array/List in Python
Differencebetween Ruby and Python Versions Of"Self"
Python's Equivalent for Ruby's Define_Method
Python Equivalent of Ruby's Each_Slice(Count)
How Can One Find the Unicode Codepoints That a Font Has Glyphs For, on a Debian-Based System
Python Equivalent of Ruby's .Select
Different Yaml Array Representations