Pandas Extract Number from String

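Give it a regex capture group, and pass expand=False so that the result is a Series rather than a one-column DataFrame:

df.A.str.extract(r'(\d+)', expand=False)

This gives you:

0      1
1    NaN
2     10
3    100
4      0
Name: A, dtype: object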
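
As a rough, self-contained sketch of that call: the sample values below are hypothetical, since the original DataFrame is not shown, and first_number is just an illustrative name.

```python
import pandas as pd

# Hypothetical sample data; the source does not show the original column.
df = pd.DataFrame({"A": ["a1", "b", "c10", "d100", "e0"]})

# expand=False returns a Series. Rows without any digits come back as NaN,
# and the extracted values are still strings, not integers.
first_number = df["A"].str.extract(r"(\d+)", expand=False)
print(first_number)
```

Only the first run of digits in each string is captured; later numbers in the same string are ignored.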

How to Extract Numbers from String Column in Pandas with decimal?

If you want to match numbers followed by OZ, you could write the pattern as:

(\d*\.?\d+)\s*OZ\b

Explanation

  • ( Open capture group 1 (its value is what str.extract picks up)
  • \d*\.?\d+ Match optional digits, an optional dot, and 1+ digits
  • ) Close group 1
  • \s*OZ\b Match optional whitespace chars, then OZ followed by a word boundary

import pandas as pd

data = [
    "tld los 16OZ",
    "HSJ14 OZ",
    "hqk 28.3 OZ",
    "rtk .7 OZ",
    "ahdd .92OZ",
    "aje 0.22 OZ",
]

df = pd.DataFrame(data, columns=["Product"])
df['Numbers'] = df['Product'].str.extract(r'(\d*\.?\d+)\s*OZ\b')
print(df)

Output

        Product Numbers
0  tld los 16OZ      16
1      HSJ14 OZ      14
2   hqk 28.3 OZ    28.3
3     rtk .7 OZ      .7
4    ahdd .92OZ     .92
5   aje 0.22 OZ    0.22
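
Note that the Numbers column holds strings. If arithmetic is needed, a conversion step along these lines may help. This is a minimal sketch: the Ounces column name and the unmatched "no size listed" row are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Product": ["tld los 16OZ", "rtk .7 OZ", "no size listed"]})

# str.extract yields strings, or NaN where the pattern does not match.
ounces_text = df["Product"].str.extract(r"(\d*\.?\d+)\s*OZ\b", expand=False)

# pd.to_numeric turns the strings into floats and keeps NaN for missing rows.
df["Ounces"] = pd.to_numeric(ounces_text)
print(df)
```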

Extract int from string in Pandas

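You can convert the column to string and extract the integer with a regular expression. Note that astype(int) raises an error if any row contains no digits, because the missing match is NaN.

df['B'].astype(str).str.extract(r'(\d+)').astype(int)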
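
If some rows may contain no digits at all, one possible variant is to go through pd.to_numeric and pandas' nullable Int64 dtype. This is a sketch under that assumption; the sample values and the B_int column name are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one row that has no digits.
df = pd.DataFrame({"B": ["item 12", "no digits", "7 left"]})

digits = df["B"].str.extract(r"(\d+)", expand=False)

# "Int64" (capital I) is the nullable integer dtype, so the unmatched row
# becomes <NA> instead of raising an error.
df["B_int"] = pd.to_numeric(digits).astype("Int64")
print(df)
```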

Extract only numbers from string with python

Your regex doesn't do what you think it does. What you have is a character class, which matches any of the characters in the set ?: \t\r\n\f\v0-9+. So when the regex encounters the first non-matching character (P for your sample data), it stops. It's probably simpler to use replace to remove every character that is neither whitespace nor a digit:

df = pd.DataFrame({'data':['86531 86530 86529PIP 91897PIP']})
df['data'].str.replace(r'[^\s\d]', '', regex=True)

Which for your data will give:

86531 86530 86529 91897
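
If the goal is a list of the individual numbers rather than a cleaned string, Series.str.findall is one alternative. A minimal sketch, reusing the sample row above; the numbers column name is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"data": ["86531 86530 86529PIP 91897PIP"]})

# findall returns every non-overlapping match per row as a list of strings.
df["numbers"] = df["data"].str.findall(r"\d+")
print(df["numbers"].iloc[0])
```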

How do I extract numbers from the strings in a pandas column of 'object'?

I would use str.extract here:

df['x'] = pd.to_numeric(df['x'].str.extract(r'^(\d+)', expand=False))

The challenge with trying to use a pure substring approach is that we don't necessarily know how many characters to take. Regex gets around this problem.
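
To illustrate the point about substrings, here is a small sketch with hypothetical values where the leading number has a different width in each row. The names fixed_slice and leading are illustrative only.

```python
import pandas as pd

# Hypothetical data: the leading number is 1, 2, or 3 characters wide.
df = pd.DataFrame({"x": ["5 apples", "42 pears", "100 plums"]})

# A fixed-width slice is only right when every number has the same width.
fixed_slice = df["x"].str[:2]

# The anchored regex takes however many leading digits there are.
leading = pd.to_numeric(df["x"].str.extract(r"^(\d+)", expand=False))
print(fixed_slice.tolist(), leading.tolist())
```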

Extract only numbers and only string from pandas dataframe

Your code is on the right track; you just need to account for both decimals and integers:

df_num['colors_num'] = df_num.Colors.str.extract(r'(\d+[.\d]*)')
df_num['animals_num'] = df_num.Animals.str.extract(r'(\d+[.\d]*)')
df_num['colors_str'] = df_num.Colors.str.replace(r'(\d+[.\d]*)', '', regex=True)
df_num['animals_text'] = df_num.Animals.str.replace(r'(\d+[.\d]*)', '', regex=True)

       Colors     Animals  colors_num  animals_num  colors_str  animals_text
0     lila1.5      hu11nd         1.5           11        lila          hund
1     rosa2.5     12welpe         2.5           12        rosa         welpe
2     gelb3.5     13katze         3.5           13        gelb         katze
3       grün4  s14chlange           4           14        grün      schlange
4        rot5     vo15gel           5           15         rot         vogel
5    schwarz6   16papagei           6           16     schwarz       papagei
6       grau7       ku17h           7           17        grau           kuh
7       weiß8     18ziege           8           18        weiß         ziege
8      braun9     19pferd           9           19       braun         pferd
9  hellblau10      esel20          10           20    hellblau          esel

Extract numbers from strings in python

Assuming you expect only one number per value in the column, you could try using str.extract here:

df["some_col"] = df["some_col"].str.extract(r'(\d+(?:\.\d+)?)')
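
A quick sketch of how this pattern behaves, compared with the looser \d+[.\d]* pattern used elsewhere on this page. The sample strings are hypothetical and were chosen to include a trailing dot and a row with no number.

```python
import pandas as pd

df = pd.DataFrame({"some_col": ["weight 4.", "price 12.50 EUR", "qty 3", "n/a"]})

# Digits, then optionally a dot that must be followed by more digits.
strict = df["some_col"].str.extract(r"(\d+(?:\.\d+)?)", expand=False)

# Digits, then any run of dots and digits, so a trailing dot is captured too.
loose = df["some_col"].str.extract(r"(\d+[.\d]*)", expand=False)

print(strict.tolist())
print(loose.tolist())
```

The stricter form avoids captures such as "4.", which are awkward to read and easy to mistake for a decimal.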

How to extract numbers from a string in Python?

If you want to extract only positive integers, try the following:

>>> txt = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in txt.split() if s.isdigit()]
[23, 11, 2]

I would argue that this is better than the regex example because you don't need another module and it's more readable because you don't need to parse (and learn) the regex mini-language.

This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, jmnas's answer below will do the trick.
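
For floats and negative numbers, a regex-based variant is one possibility. The sketch below is not the answer referred to above; the pattern and the extra "-7" token in the sample text are assumptions for illustration, and hexadecimal values are still not handled.

```python
import re

txt = "h3110 23 cat 444.4 rabbit 11 2 dog -7"

# Optional minus sign, digits, optional fractional part.
# The lookbehind and lookahead reject digits attached to letters or dots,
# so "h3110" is skipped just as in the split()/isdigit() version.
number_pattern = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?![\w.])")

numbers = [float(match) for match in number_pattern.findall(txt)]
print(numbers)  # [23.0, 444.4, 11.0, 2.0, -7.0]
```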


