Extract Int from String in Pandas

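You can convert the column to string and extract the integer using a regular expression. Note that astype(int) raises an error if any row contains no digits, because the missing value cannot be converted to an integer.

df['B'].astype(str).str.extract(r'(\d+)', expand=False).astype(int)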

Pandas Extract Number from String

Give it a regex capture group:

df.A.str.extract(r'(\d+)', expand=False)

Gives you:

0      1
1    NaN
2     10
3    100
4      0
Name: A, dtype: object
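
The result above is still a column of strings, with NaN where nothing matched. As a minimal sketch (the sample values in column A are made up for illustration), the strings can be converted to pandas' nullable integer dtype so that the missing row survives the conversion:

```python
import pandas as pd

# Hypothetical sample data; the second entry contains no digits at all.
df = pd.DataFrame({"A": ["a1", "b", "c10", "d100", "e0"]})

# expand=False returns a Series of strings, NaN where the pattern did not match.
digits = df["A"].str.extract(r"(\d+)", expand=False)

# to_numeric parses the strings; "Int64" (capital I) is the nullable integer dtype,
# so the unmatched row becomes <NA> instead of raising an error.
numbers = pd.to_numeric(digits).astype("Int64")
print(numbers)
```

A plain astype(int) would fail here because of the unmatched row, which is why the nullable dtype is used in this sketch.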

How to Extract Numbers from String Column in Pandas with decimal?

If you want to match the numbers followed by OZ, you could write the pattern as:

(\d*\.?\d+)\s*OZ\b

Explanation

( Capture group 1 (the value will be picked up by str.extract)
\d*\.?\d+ Match optional digits, an optional dot and 1+ digits
) Close group 1
\s*OZ\b Match optional whitespace chars and then OZ followed by a word boundary

See a regex demo.

import pandas as pd

data = [
    "tld los 16OZ",
    "HSJ14 OZ",
    "hqk 28.3 OZ",
    "rtk .7 OZ",
    "ahdd .92OZ",
    "aje 0.22 OZ"
]

df = pd.DataFrame(data, columns=["Product"])
df['Numbers'] = df['Product'].str.extract(r'(\d*\.?\d+)\s*OZ\b')
print(df)

Output

        Product Numbers
0  tld los 16OZ      16
1      HSJ14 OZ      14
2   hqk 28.3 OZ    28.3
3     rtk .7 OZ      .7
4    ahdd .92OZ     .92
5   aje 0.22 OZ    0.22
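
The extracted values above are strings, not numbers. If you need to do arithmetic on them, one possible follow-up is pd.to_numeric. This is a sketch only: the "Ounces" column name and the "no size" row are assumptions added for illustration.

```python
import pandas as pd

# A few rows from the example, plus a hypothetical row without any OZ value.
df = pd.DataFrame({"Product": ["tld los 16OZ", "rtk .7 OZ", "aje 0.22 OZ", "no size"]})

# expand=False gives a Series of strings (NaN where the pattern does not match).
extracted = df["Product"].str.extract(r"(\d*\.?\d+)\s*OZ\b", expand=False)

# Parse the strings as floats; values such as ".7" are parsed as 0.7.
df["Ounces"] = pd.to_numeric(extracted)
print(df)
```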

How to extract numbers from a string in Python?

If you want to extract only non-negative integers, try the following:

>>> txt = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in txt.split() if s.isdigit()]
[23, 11, 2]

I would argue that this is better than the regex example because you don't need another module and it's more readable because you don't need to parse (and learn) the regex mini-language.

This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, a regex-based approach (such as jmnas's answer in the original thread) will do the trick.
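
For comparison, here is a rough regex-based sketch that keeps the token-by-token behaviour of the snippet above but also accepts signs and decimals. The pattern and the extract_numbers helper are illustrative assumptions, not taken from any particular answer, and hexadecimal or exponent notation is still not handled.

```python
import re

# Optional sign, then either digits with an optional fractional part, or a bare ".5" form.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")

def extract_numbers(text):
    """Return whitespace-separated tokens that are whole decimal numbers, as floats."""
    return [float(tok) for tok in text.split() if NUMBER.fullmatch(tok)]

txt = "h3110 23 cat 444.4 rabbit 11 2 dog -7 .5"
print(extract_numbers(txt))
```

Because fullmatch is applied to each token, something like h3110 is still skipped, just as in the isdigit version.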

How do I extract numbers from the strings in a pandas column of 'object'?

I would use str.extract here:

df['x'] = pd.to_numeric(df['x'].str.extract(r'^(\d+)', expand=False))

The challenge with trying to use a pure substring approach is that we don't necessarily know how many characters to take. Regex gets around this problem.
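
To make that concrete, here is a small sketch with made-up values in column x. The leading digits have a different length in each row, and one row does not start with a digit at all:

```python
import pandas as pd

# Hypothetical data: the number of leading digits varies from row to row.
df = pd.DataFrame({"x": ["12abc", "7 days", "x9", "305"]})

# ^ anchors the match at the start of the string; expand=False returns a Series,
# which is what pd.to_numeric expects.
df["x"] = pd.to_numeric(df["x"].str.extract(r"^(\d+)", expand=False))
print(df)
```

The row "x9" becomes NaN because its digits are not at the start of the string, so the column ends up as floats rather than integers.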

Extract numbers from strings in python

Assuming you expect only one number per cell, you could try using str.extract here:

df["some_col"] = df["some_col"].str.extract(r'(\d+(?:\.\d+)?)', expand=False)
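
To see what this pattern picks up, here is a small sketch using the standard re module on a few made-up strings. The first_number helper is hypothetical and only mirrors what str.extract does for a single value:

```python
import re

# Digits, optionally followed by a dot and more digits (non-capturing inner group).
pattern = re.compile(r"(\d+(?:\.\d+)?)")

def first_number(text):
    """Return the first number in text as a string, or None if there is none."""
    match = pattern.search(text)
    return match.group(1) if match else None

samples = ["price 12.50 usd", "qty 3", "no digits", "7. trailing dot"]
firsts = [first_number(s) for s in samples]
print(firsts)
```

Note that a trailing dot with no digits after it is not included in the match, and only the first number in each string is returned.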

Extract only numbers from string with python

Your regex doesn't do what you think it does. What you have is a character class, which matches any one of the characters in the set ?: \t\r\n\f\v0-9+. So when the regex encounters the first non-matching character (P for your sample data), it stops. It's probably simpler to use replace to remove every character that is neither whitespace nor a digit:

df = pd.DataFrame({'data':['86531 86530 86529PIP 91897PIP']})
df['data'].str.replace(r'[^\s\d]', '', regex=True)

Which for your data will give:

86531 86530 86529 91897
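
If you would rather have the numbers as separate items than as one cleaned-up string, a possible alternative is to collect every run of digits. The sketch below uses the standard re module on the same sample string; in pandas, Series.str.replace and Series.str.findall play the same roles.

```python
import re

raw = "86531 86530 86529PIP 91897PIP"

# Same idea as the replace above: drop everything that is neither whitespace nor a digit.
cleaned = re.sub(r"[^\s\d]", "", raw)

# Alternative: collect each run of digits as its own list item.
numbers = re.findall(r"\d+", raw)

print(cleaned)
print(numbers)
```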

Extract integers from string value in a pandas data frame cell

You can use str.extract to split the value from the unit, then str.contains with loc for boolean indexing:

df1 = df.AgeuponOutcome.str.extract(r'(\d+) (\w+)', expand=True)
df1.columns = ['a','b']
print(df1)
    a       b
0   1    year
1   1    year
2   2   years
3   3   weeks
4   2   years
5   1   month
6   3   weeks
7   3   weeks
8   5  months
9   1    year
10  2   years
11  2   years
12  4   years

print (df1.loc[df1.b.str.contains('month'), 'a'])
5    1
8    5
Name: a, dtype: object

print (df1.loc[df1.b.str.contains('year'), 'a'])
0     1
1     1
2     2
4     2
9     1
10    2
11    2
12    4
Name: a, dtype: object

If you need output as new columns:

df1['month'] = (df1.loc[df1.b.str.contains('month'), 'a'])
df1['year'] = (df1.loc[df1.b.str.contains('year'), 'a'])
df1['week'] = (df1.loc[df1.b.str.contains('week'), 'a'])
print (df1)
    a       b month year week
0   1    year   NaN    1  NaN
1   1    year   NaN    1  NaN
2   2   years   NaN    2  NaN
3   3   weeks   NaN  NaN    3
4   2   years   NaN    2  NaN
5   1   month     1  NaN  NaN
6   3   weeks   NaN  NaN    3
7   3   weeks   NaN  NaN    3
8   5  months     5  NaN  NaN
9   1    year   NaN    1  NaN
10  2   years   NaN    2  NaN
11  2   years   NaN    2  NaN
12  4   years   NaN    4  NaN

EDIT by comment:

You can use:

#convert to float (the divisions below produce fractional years)
df1['a'] = df1.a.astype(float)

#divide column a by a constant for months and weeks
df1.loc[df1.b.str.contains('month'), 'a'] = df1.loc[df1.b.str.contains('month'), 'a'] / 12
df1.loc[df1.b.str.contains('week'), 'a'] = df1.loc[df1.b.str.contains('week'), 'a'] / 52.1429
print (df1)
           a       b
0   1.000000    year
1   1.000000    year
2   2.000000   years
3   0.057534   weeks
4   2.000000   years
5   0.083333   month
6   0.057534   weeks
7   0.057534   weeks
8   0.416667  months
9   1.000000    year
10  2.000000   years
11  2.000000   years
12  4.000000   years
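
The same conversion can also be sketched with a lookup table instead of one loc assignment per unit. The sample strings, the column names value and unit, and the factor dictionary below are assumptions for illustration; the divisors are the ones used above.

```python
import pandas as pd

# Hypothetical sample in the same "<number> <unit>" shape as the data above.
ages = pd.Series(["1 year", "3 weeks", "5 months", "2 years"])

parts = ages.str.extract(r"(\d+) (\w+)")
parts.columns = ["value", "unit"]

# Years per unit, keyed by the singular form of the unit.
factor = {"year": 1.0, "month": 1 / 12, "week": 1 / 52.1429}

# Strip the plural "s" so that "years" and "year" share one key.
unit = parts["unit"].str.rstrip("s")
years = parts["value"].astype(float) * unit.map(factor)
print(years)
```

Units missing from the dictionary would map to NaN, so unexpected values show up as missing rather than being silently treated as years.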

Extract numbers from string column from Pandas DF

Just so I understand, you're trying to avoid capturing the decimal part as a separate group, right? (That's what the non-capturing (?:\.\d+)? part does.)

First off, you need to use pd.Series.str.extractall if you want all the matches; extract stops after the first.

Using your df, try this code:

# Get a multiindexed dataframe using extractall
expanded = df.Info.str.extractall(r"(\d+(?:\.\d+)?)")

# Pivot the index labels
df_2 = expanded.unstack()

# Drop the multiindex
df_2.columns = df_2.columns.droplevel()

# Add the columns to the original dataframe (inplace or make a new df)
df_combined = pd.concat([df, df_2], axis=1)
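
Here is a self-contained run of the steps above on a small made-up frame. The contents of the Info column and the num_0, num_1 column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: the middle row contains no numbers.
df = pd.DataFrame({"Info": ["a 1 b 2.5", "no numbers", "x 10"]})

# One row per match, indexed by (original row, match number).
expanded = df["Info"].str.extractall(r"(\d+(?:\.\d+)?)")

# Pivot the match number into columns, then drop the capture-group level.
df_2 = expanded.unstack()
df_2.columns = df_2.columns.droplevel()

# Give the match columns readable names (hypothetical naming scheme).
df_2.columns = [f"num_{i}" for i in df_2.columns]

# concat aligns on the index, so rows without any match get NaN.
df_combined = pd.concat([df, df_2], axis=1)
print(df_combined)
```

Rows with fewer matches than the widest row are padded with NaN, and the extracted values are still strings until you convert them.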
