Extract int from string in Pandas
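You can treat the column as strings and extract the integer with a regular expression. Note that astype(int) raises an error if any row contains no digits, because the missing match is NaN.
df['B'].str.extract(r'(\d+)').astype(int)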
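A minimal sketch of the line above on made-up data (the sample values and the B_int column name are assumptions, not taken from the question). Passing expand=False makes str.extract return a Series rather than a one-column DataFrame:

```python
import pandas as pd

# Hypothetical sample data; only the column name 'B' comes from the snippet above.
df = pd.DataFrame({"B": ["a1", "b22", "c333"]})

# expand=False returns a Series, which is convenient for assigning to one column.
digits = df["B"].str.extract(r"(\d+)", expand=False)

# Every row has a match here, so the cast to int succeeds.
df["B_int"] = digits.astype(int)
print(df)
```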
Pandas Extract Number from String
Give it a regex capture group:
df.A.str.extract(r'(\d+)', expand=False)
Gives you:
0      1
1    NaN
2     10
3    100
4      0
Name: A, dtype: object
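The result above is still made of strings, and row 1 is NaN because it has no digits. One possible way to get integers while keeping the missing value is the nullable Int64 dtype. This is only a sketch: the input strings below are invented to reproduce a similar result.

```python
import pandas as pd

# Hypothetical input; the second row deliberately contains no digits.
df = pd.DataFrame({"A": ["x1", "none", "y10", "z100", "w0"]})

extracted = df["A"].str.extract(r"(\d+)", expand=False)

# to_numeric parses the strings; Int64 (capital I) is pandas' nullable
# integer dtype, so the row without a match becomes <NA> instead of failing.
as_int = pd.to_numeric(extracted).astype("Int64")
print(as_int)
```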
How to Extract Numbers from String Column in Pandas with decimal?
If you want to match the numbers followed by OZ, you could write the pattern as:
(\d*\.?\d+)\s*OZ\b
Explanation
( Capture group 1 (the value will be picked up by str.extract)
\d*\.?\d+ Match optional digits, an optional dot and 1+ digits
) Close group 1
\s*OZ\b Match optional whitespace chars and then OZ, followed by a word boundary
import pandas as pd
data= [
"tld los 16OZ",
"HSJ14 OZ",
"hqk 28.3 OZ",
"rtk .7 OZ",
"ahdd .92OZ",
"aje 0.22 OZ"
]
df = pd.DataFrame(data, columns=["Product"])
df['Numbers'] = df['Product'].str.extract(r'(\d*\.?\d+)\s*OZ\b')
print(df)
Output
        Product Numbers
0  tld los 16OZ      16
1      HSJ14 OZ      14
2   hqk 28.3 OZ    28.3
3     rtk .7 OZ      .7
4    ahdd .92OZ     .92
5   aje 0.22 OZ    0.22
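The Numbers column above still holds strings such as '.7'. If numeric values are needed, a possible follow-up (a sketch that reuses the same sample data; the Ounces column name is an assumption) is to cast the extracted text to float:

```python
import pandas as pd

data = [
    "tld los 16OZ",
    "HSJ14 OZ",
    "hqk 28.3 OZ",
    "rtk .7 OZ",
    "ahdd .92OZ",
    "aje 0.22 OZ",
]
df = pd.DataFrame(data, columns=["Product"])

# expand=False yields a Series of strings; astype(float) parses them,
# including the forms without a leading zero such as '.7'.
extracted = df["Product"].str.extract(r"(\d*\.?\d+)\s*OZ\b", expand=False)
df["Ounces"] = extracted.astype(float)
print(df)
```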
How to extract numbers from a string in Python?
If you want to extract only positive integers, try the following:
>>> txt = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in txt.split() if s.isdigit()]
[23, 11, 2]
I would argue that this is better than the regex example because you don't need another module and it's more readable because you don't need to parse (and learn) the regex mini-language.
This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, jmnas's answer below will do the trick.
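For comparison, here is a hedged sketch of a regex variant that also picks up floats and negative numbers. The pattern is one possible choice, not the one from the answer referred to above; the lookarounds restrict matches to whitespace-delimited tokens, so the digits inside h3110 are skipped just as in the split() version.

```python
import re

txt = "h3110 23 cat 444.4 rabbit 11 2 dog"

# (?<!\S) and (?!\S) require whitespace (or the string edge) on both sides,
# so only standalone numeric tokens match. An optional minus sign and an
# optional fractional part are allowed.
pattern = re.compile(r"(?<!\S)-?\d+(?:\.\d+)?(?!\S)")

tokens = pattern.findall(txt)
numbers = [float(t) for t in tokens]
print(tokens)
```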
How do I extract numbers from the strings in a pandas column of 'object'?
I would use str.extract here (expand=False returns a Series, which is what pd.to_numeric expects):
df['x'] = pd.to_numeric(df['x'].str.extract(r'^(\d+)', expand=False))
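As a sketch of how this behaves, assuming a made-up column x (the sample strings and the x_num column name are hypothetical): the ^ anchor means only digits at the very start of each string are taken, and rows that do not start with a digit become NaN.

```python
import pandas as pd

# Hypothetical data: two values start with digits, one does not.
df = pd.DataFrame({"x": ["12abc", "7 days", "abc9"]})

# expand=False gives a Series, which pd.to_numeric accepts directly.
df["x_num"] = pd.to_numeric(df["x"].str.extract(r"^(\d+)", expand=False))
print(df)
```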
The challenge with trying to use a pure substring approach is that we don't necessarily know how many characters to take. Regex gets around this problem.
Extract numbers from strings in python
Assuming you expect only one number per value in the column, you could try using str.extract here:
df["some_col"] = df["some_col"].str.extract(r'(\d+(?:\.\d+)?)', expand=False)
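A small sketch of that pattern on invented values (the strings and the some_col_num column name are assumptions). The optional non-capturing group (?:\.\d+)? lets the same pattern match both whole numbers and decimals:

```python
import pandas as pd

# Hypothetical data: a decimal, an integer, and a value with no number.
df = pd.DataFrame({"some_col": ["price 12.50 USD", "7 items", "n/a"]})

# Extract the first number in each value as text, then parse it.
text = df["some_col"].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
df["some_col_num"] = pd.to_numeric(text)
print(df)
```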
Extract only numbers from string with python
Your regex doesn't do what you think it does. What you have is a character class, which matches any of the characters in the set ?: \t\r\n\f\v0-9+. So when the regex encounters the first non-matching character (P for your sample data) it stops. It's probably simpler to use replace to get rid of every character that is neither whitespace nor a digit:
df = pd.DataFrame({'data':['86531 86530 86529PIP 91897PIP']})
df['data'].str.replace(r'([^\s\d])', '', regex=True)
Which for your data will give:
86531 86530 86529 91897
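If the goal is the individual numbers rather than a cleaned string, another option is str.findall, which returns every run of digits per row. This is a sketch of an alternative technique, not part of the answer above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"data": ["86531 86530 86529PIP 91897PIP"]})

# Approach from the answer: delete everything that is not whitespace or a digit.
cleaned = df["data"].str.replace(r"[^\s\d]", "", regex=True)

# Alternative: collect each run of digits as a list of strings per row.
found = df["data"].str.findall(r"\d+")
print(cleaned.iloc[0])
print(found.iloc[0])
```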
Extract integers from string value in a pandas data frame cell
You can use str.extract with contains and loc with boolean indexing:
df1 = (df.AgeuponOutcome.str.extract(r'(\d+) (\w+)', expand=True))
df1.columns = ['a','b']
print (df1)
a b
0 1 year
1 1 year
2 2 years
3 3 weeks
4 2 years
5 1 month
6 3 weeks
7 3 weeks
8 5 months
9 1 year
10 2 years
11 2 years
12 4 years
print (df1.loc[df1.b.str.contains('month'), 'a'])
5 1
8 5
Name: a, dtype: object
print (df1.loc[df1.b.str.contains('year'), 'a'])
0 1
1 1
2 2
4 2
9 1
10 2
11 2
12 4
Name: a, dtype: object
If you need output as new columns:
df1['month'] = (df1.loc[df1.b.str.contains('month'), 'a'])
df1['year'] = (df1.loc[df1.b.str.contains('year'), 'a'])
df1['week'] = (df1.loc[df1.b.str.contains('week'), 'a'])
print (df1)
a b month year week
0 1 year NaN 1 NaN
1 1 year NaN 1 NaN
2 2 years NaN 2 NaN
3 3 weeks NaN NaN 3
4 2 years NaN 2 NaN
5 1 month 1 NaN NaN
6 3 weeks NaN NaN 3
7 3 weeks NaN NaN 3
8 5 months 5 NaN NaN
9 1 year NaN 1 NaN
10 2 years NaN 2 NaN
11 2 years NaN 2 NaN
12 4 years NaN 4 NaN
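The three column assignments can also be written as a loop with Series.where, which keeps a value where the condition holds and puts NaN elsewhere. This is a sketch on a shortened, made-up df1 that mirrors the a/b layout above, not the original data:

```python
import pandas as pd

# Shortened stand-in for df1: 'a' holds the number as text, 'b' the unit.
df1 = pd.DataFrame({"a": ["1", "3", "5"], "b": ["year", "weeks", "months"]})

# For each unit, keep 'a' only on the rows whose unit text contains it.
for unit in ("month", "year", "week"):
    df1[unit] = df1["a"].where(df1["b"].str.contains(unit))
print(df1)
```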
EDIT by comment: You can use:
# convert to float so the fractional results of the divisions below fit the column
df1['a'] = df1.a.astype(float)
# divide column a by a constant on the month and week rows to get years
df1.loc[df1.b.str.contains('month'), 'a'] = df1.loc[df1.b.str.contains('month'), 'a'] / 12
df1.loc[df1.b.str.contains('week'), 'a'] = df1.loc[df1.b.str.contains('week'), 'a'] / 52.1429
print (df1)
a b
0 1.000000 year
1 1.000000 year
2 2.000000 years
3 0.057534 weeks
4 2.000000 years
5 0.083333 month
6 0.057534 weeks
7 0.057534 weeks
8 0.416667 months
9 1.000000 year
10 2.000000 years
11 2.000000 years
12 4.000000 years
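The same conversion to years can be sketched with a lookup table instead of one loc assignment per unit. The per_year dictionary, the years column and the sample rows are assumptions for illustration; the divisors are the ones used above.

```python
import pandas as pd

# Hypothetical subset of the AgeuponOutcome values.
df = pd.DataFrame({"AgeuponOutcome": ["1 year", "3 weeks", "5 months", "2 years"]})

parts = df["AgeuponOutcome"].str.extract(r"(\d+) (\w+)", expand=True)
parts.columns = ["a", "b"]

# Normalise plural units ('weeks' -> 'week'), then look up how many fit in a year.
per_year = {"year": 1.0, "month": 12.0, "week": 52.1429}
divisor = parts["b"].str.rstrip("s").map(per_year).astype(float)

parts["years"] = parts["a"].astype(float) / divisor
print(parts)
```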
Extract numbers from string column from Pandas DF
Just so I understand, you're trying to avoid capturing decimal parts of numbers, right? (The (?:\.\d+)? part.)
First off, you need to use pd.Series.str.extractall if you want all the matches; extract stops after the first.
Using your df, try this code:
# Get a multiindexed dataframe using extractall
expanded = df.Info.str.extractall(r"(\d+(?:\.\d+)?)")
# Pivot the index labels
df_2 = expanded.unstack()
# Drop the multiindex
df_2.columns = df_2.columns.droplevel()
# Add the columns to the original dataframe (inplace or make a new df)
df_combined = pd.concat([df, df_2], axis=1)
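A self-contained version of the steps above, on an invented Info column (the sample strings and the num_ column prefix are assumptions). Rows with fewer matches, or none at all, end up with NaN in the extra columns:

```python
import pandas as pd

# Hypothetical data: two numbers, one number, and no number.
df = pd.DataFrame({"Info": ["a 1 b 2.5", "c 30", "no numbers"]})

# One row per match, indexed by (original row, match number).
expanded = df["Info"].str.extractall(r"(\d+(?:\.\d+)?)")

# Move the match number into the columns, then flatten the column labels.
df_2 = expanded.unstack()
df_2.columns = df_2.columns.droplevel()
df_2.columns = [f"num_{i}" for i in df_2.columns]

# Align on the original index; rows without matches get NaN.
df_combined = pd.concat([df, df_2], axis=1)
print(df_combined)
```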