How to extract part of a string in Pandas column and make a new column
Use str.extract with a regex and str.replace to rename values:
dff['Version_short'] = dff['Name'].str.extract(r'_(V\d+)$', expand=False).fillna('')
dff['Version_long'] = dff['Version_short'].str.replace('V', 'Version ')
Output:
>>> dff
col1 col3 Name Date Version_short Version_long
0 1 1 2a df a1asd_V1 2021-06-13 V1 Version 1
1 2 22 xcd a2asd_V3 2021-06-13 V3 Version 3
2 3 33 23vg aabsd_V1 2021-06-13 V1 Version 1
3 4 44 dfgdf_aabsd_V0 2021-06-14 V0 Version 0
4 5 55 a3as d_V1 2021-06-15 V1 Version 1
5 60 60 aa bsd_V3 2021-06-15 V3 Version 3
6 0 1 aasd_V4 2021-06-13 V4 Version 4
7 0 5 aabsd_V4 2021-06-16 V4 Version 4
8 6 6 aa_adn sd_V15 2021-06-13 V15 Version 15
9 3 3 NaN 2021-06-13
10 2 2 aasd_V12 2021-06-13 V12 Version 12
11 4 4 aasd120Abs 2021-06-16
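For a self-contained run of the same idea, here is a minimal sketch. The sample values are a small illustrative subset of the Name column above, and the expand=False and regex=False arguments are additions made for this sketch rather than part of the original answer.

```python
import pandas as pd

# Small sample modelled on the Name column above (illustrative subset)
dff = pd.DataFrame({"Name": ["a1asd_V1", "aa_adn sd_V15", None, "aasd120Abs"]})

# expand=False returns a Series; rows without a trailing _V<digits> become ''
dff["Version_short"] = dff["Name"].str.extract(r"_(V\d+)$", expand=False).fillna("")

# regex=False treats 'V' as a literal character rather than a pattern
dff["Version_long"] = dff["Version_short"].str.replace("V", "Version ", regex=False)

print(dff)
```

Rows with a missing name or no version suffix end up with empty strings in both new columns.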
How to extract entire part of string after certain character in dataframe column?
Use str.split and take the last slice with -1 (this also gracefully handles names that do not contain the delimiter):
df = pd.DataFrame(columns=[
    'data.answers.1234567890.value.0987654321', 'blahblah.value.12345', 'foo'])
df.columns = df.columns.str.split('value.').str[-1]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')
Another alternative is splitting inside a listcomp:
df.columns = [x.split('value.')[-1] for x in df.columns]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')
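The reason the -1 index is safe can be seen in plain Python. This is a small sketch using the same three column names; the variable names are only illustrative.

```python
names = ["data.answers.1234567890.value.0987654321", "blahblah.value.12345", "foo"]

# str.split always returns at least one element, so [-1] never raises.
# When 'value.' is absent, the only element is the original string itself.
pieces = [name.split("value.") for name in names]
last = [p[-1] for p in pieces]

print(pieces[2])  # the name without the delimiter stays in one piece
print(last)
```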
Extract elements from data column (String) before and after character
I am not really sure if this is what you want, but it does the work:
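regions = []
for i in df['Region'].str.split('.').str[0]:
    # keep only the digits from the part before the dot
    regions.append(''.join([d for d in i if d.isdigit()]))
df['BGC Region'] = df['Strain'].str.split('_').str[2] + '_' + regions + '.region'
region_number = df['Region'].str.split('.').str[1]
for i, rn in enumerate(region_number):
    # left-pad the region number with zeros to three digits;
    # df.loc avoids chained assignment (assumes the default 0..n-1 index)
    if int(rn) < 10:
        df.loc[i, 'BGC Region'] += '00' + rn
    elif int(rn) < 100:
        df.loc[i, 'BGC Region'] += '0' + rn
    else:
        df.loc[i, 'BGC Region'] += rn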
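The same zero padding can probably be done without an explicit loop by using Series.str.zfill. The sketch below is only an assumption about the data: the Strain and Region values are hypothetical, chosen to fit the splits used above.

```python
import pandas as pd

# Hypothetical sample data; the real Strain/Region formats may differ
df = pd.DataFrame({
    "Strain": ["GCA_001_strainA", "GCA_002_strainB"],
    "Region": ["Region 1.5", "Region 12.34"],
})

# Split once on the dot: column 0 is the part before it, column 1 the region number
parts = df["Region"].str.split(".", n=1, expand=True)
prefix_digits = parts[0].str.replace(r"\D", "", regex=True)  # drop non-digits
region_number = parts[1].str.zfill(3)                        # left-pad to 3 digits

df["BGC Region"] = (
    df["Strain"].str.split("_").str[2] + "_" + prefix_digits + ".region" + region_number
)
print(df["BGC Region"].tolist())
```

zfill leaves values that already have three or more digits unchanged, so no separate branch is needed for them.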
Extracting Specific Text From column in dataframe
We can use regex to extract the necessary part of the string.
Here we check for at least one character in [A-C] followed by zero or more digits [0-9]:
data['extract'] = data.Description.str.extract(r'([A-C]+[0-9]*)')
or, if at least one digit must follow the letters:
data['extract'] = data.Description.str.extract(r'([A-C]+[0-9]+)')
Output
Description extract
0 ABC12345679 132465 ABC12345679
1 Test ABC12346548 ABC12346548
2 Test ABC1231321 4645 ABC1231321
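The difference between the two patterns only shows up when letters appear without digits. A minimal sketch, where the second description is a made-up value added to show that case:

```python
import pandas as pd

# Second row is a hypothetical value with a letter but no digits after it
data = pd.DataFrame({"Description": ["Test ABC12346548", "Test A only"]})

# [0-9]* accepts letters with no digits; [0-9]+ requires at least one digit
loose = data["Description"].str.extract(r"([A-C]+[0-9]*)", expand=False)
strict = data["Description"].str.extract(r"([A-C]+[0-9]+)", expand=False)

print(loose.tolist())
print(strict.tolist())
```

With the stricter pattern, the row without digits produces a missing value instead of a bare letter.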
To Extract Substring from Column of DataFrame
Try str.findall:
>>> df["NE Name"].str.findall(r"/([^/]{4})")
0 [01HJ]
1 [01HL, 02HL, 03HL, 10HL]
2 [01HL, 02HL, 03HL, 10HL]
3 [01HL, 02HL, 03HL, 10HL]
4 [01HL, 02HL, 03HL, 10HL]
Name: NE Name, dtype: object
Input DataFrame:
>>> df
NE Name Subrack ID pattern
0 10100000/01HJ 0 01HJ
1 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 1 01HJ
2 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 0 01HJ
3 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 2 01HJ
4 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 3 01HJ
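Since findall returns a list per row, a follow-up step is usually needed to get plain strings. The sketch below shows two options on a shortened version of the input; the new column names first_code and all_codes are hypothetical.

```python
import pandas as pd

# Shortened version of the NE Name values shown above
df = pd.DataFrame({"NE Name": ["10100000/01HJ", "10100000/01HL&10100000/02HL"]})

found = df["NE Name"].str.findall(r"/([^/]{4})")

# .str[0] picks the first match of each row; .str.join flattens each list
df["first_code"] = found.str[0]
df["all_codes"] = found.str.join(",")

print(df)
```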
Python pandas: remove everything after a delimiter in a string
You can use pandas.Series.str.split just like you would use split normally. Just split on the string '::', and index the list that's created from the split method:
>>> df = pd.DataFrame({'text': ["vendor a::ProductA", "vendor b::ProductA", "vendor a::Productb"]})
>>> df
text
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
>>> df['text_new'] = df['text'].str.split('::').str[0]
>>> df
text text_new
0 vendor a::ProductA vendor a
1 vendor b::ProductA vendor b
2 vendor a::Productb vendor a
Here's a non-pandas solution:
>>> df['text_new1'] = [x.split('::')[0] for x in df['text']]
>>> df
text text_new text_new1
0 vendor a::ProductA vendor a vendor a
1 vendor b::ProductA vendor b vendor b
2 vendor a::Productb vendor a vendor a
Edit: Here's a step-by-step explanation of what's happening in the pandas solution above:
# Select the pandas.Series object you want
>>> df['text']
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
Name: text, dtype: object
# using pandas.Series.str allows us to implement "normal" string methods
# (like split) on a Series
>>> df['text'].str
<pandas.core.strings.StringMethods object at 0x110af4e48>
# Now we can use the split method to split on our '::' string. You'll see that
# a Series of lists is returned (just like what you'd see outside of pandas)
>>> df['text'].str.split('::')
0 [vendor a, ProductA]
1 [vendor b, ProductA]
2 [vendor a, Productb]
Name: text, dtype: object
# using the pandas.Series.str method, again, we will be able to index through
# the lists returned in the previous step
>>> df['text'].str.split('::').str
<pandas.core.strings.StringMethods object at 0x110b254a8>
# now we can grab the first item in each list above for our desired output
>>> df['text'].str.split('::').str[0]
0 vendor a
1 vendor b
2 vendor a
Name: text, dtype: object
I would suggest checking out the pandas.Series.str docs, or, better yet, Working with Text Data in pandas.
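If both sides of the delimiter are wanted, one option is to let split expand into columns. This is a sketch on the same sample data; the column names vendor and product are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"text": ["vendor a::ProductA", "vendor b::ProductA", "vendor a::Productb"]})

# n=1 splits at the first '::' only; expand=True returns one column per piece
df[["vendor", "product"]] = df["text"].str.split("::", n=1, expand=True)

print(df)
```

Keeping only the first column of the expanded result gives the same values as .str[0] in the answer above.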
Extracting text after a phrase and in between spaces from Pandas Dataframe
You get the match Jacobs because the pattern (\w+(?=\s+FLEX\s)) matches 1+ word characters, asserting that what is directly to the right is whitespace followed by FLEX.
Instead, you can use a pattern with a capture group to match 2 words after FLEX:
\bFLEX\s+(\w+\s+\w+)
Or a broader match:
\bFLEX\s+(\S+\s+\S+)
- \bFLEX A word boundary, then match FLEX
- \s+ Match 1+ whitespace chars
- (\S+\s+\S+) Capture group 1: match 1+ non-whitespace chars, 1+ whitespace chars, and again 1+ non-whitespace chars
import pandas as pd
strings = ['QB Aaron Rodgers RB Josh Jacobs FLEX Davante Adams']
df = pd.DataFrame(strings, columns=["Lineup"])
df['Lineup'] = df["Lineup"].str.extract(r'\bFLEX\s+(\S+\s+\S+)')
print(df)
Output
Lineup
0 Davante Adams
If you want to match 2 or more words, you could use a repeating non capture group:
\bFLEX\s+(\w+(?:\s+\w+)+)
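A short sketch of the repeating-group pattern in use. The second lineup string is a hypothetical example added to show a name with more than two words, and the Flex column name is also made up.

```python
import pandas as pd

strings = [
    "QB Aaron Rodgers RB Josh Jacobs FLEX Davante Adams",
    "QB Jared Goff FLEX Amon Ra St Brown",  # hypothetical longer name
]
df = pd.DataFrame(strings, columns=["Lineup"])

# (?:\s+\w+)+ repeats "whitespace + word" one or more times after the first word
df["Flex"] = df["Lineup"].str.extract(r"\bFLEX\s+(\w+(?:\s+\w+)+)", expand=False)

print(df["Flex"].tolist())
```

The group is greedy, so it keeps consuming words until the end of the string; if other positions can follow FLEX, the two-word pattern is likely the safer choice.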
Extract a certain part of a string after a key phrase using pandas?
You can use the Series str.extract string method:
In [11]: df = pd.DataFrame([["(12:25) (No Huddle Shotgun) P.Manning pass short left to W.Welker pushed ob at DEN 34 for 10 yards (C.Graham)."]])
In [12]: df
Out[12]:
0
0 (12:25) (No Huddle Shotgun) P.Manning pass sho...
This will "extract" what's in the group (inside the parentheses):
In [13]: df[0].str.extract(r"for (\d+)", expand=False)
Out[13]:
0 10
Name: 0, dtype: object
In [14]: df[0].str.extract(r"for (\d+) yards", expand=False)
Out[14]:
0 10
Name: 0, dtype: object
You'll need to convert to int, e.g. using astype(int).
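One caveat with astype(int) is that rows where the pattern does not match come back as missing values, which cannot be cast to a plain int. A possible workaround, sketched here with a hypothetical second play that has no yardage, is pd.to_numeric followed by the nullable Int64 dtype:

```python
import pandas as pd

# Second play is a made-up example with no "for N yards" phrase
plays = pd.Series([
    "pushed ob at DEN 34 for 10 yards (C.Graham).",
    "pass incomplete short right.",
])

yards = plays.str.extract(r"for (\d+) yards", expand=False)

# to_numeric parses the digit strings; Int64 keeps integers alongside missing values
yards = pd.to_numeric(yards).astype("Int64")

print(yards)
```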
Pandas DataFrame - Extract string between two strings and include the first delimiter
You can accomplish this entirely within the regex, without any string slicing.
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
- FILE is what we begin the match on
- .* grabs any number of characters
- (?=) is a lookahead assertion that matches without consuming; the dot before txt is escaped so that it matches a literal period.
Handy regex tool https://pythex.org/
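A minimal sketch of the lookahead in action. The file paths are hypothetical sample values, and expand=False is used here only so that the result is a Series.

```python
import pandas as pd

# Hypothetical sample values; the second one has no FILE marker
df = pd.DataFrame({"string_value": ["/data/FILE_2021_report.txt", "/data/other.txt"]})

# The lookahead (?=\.txt) requires '.txt' to follow but leaves it out of the match
df["field"] = df["string_value"].str.extract(r"(FILE.*(?=\.txt))", expand=False)

print(df)
```

The first delimiter (FILE) is kept in the result while the trailing .txt is excluded.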
Extracting number from string only when string is present in a dataframe
Use Series.str.extract with the regex pattern r'(?:^|\s)(\d+)':
- (?:^|\s) matches the beginning of the string ('^') or ('|') any whitespace character ('\s') without capturing it ((?:...))
- (\d+) captures one or more digits (greedy)
df['Item Code'] = df['Item Code'].str.extract(r'(?:^|\s)(\d+)', expand=False)
Note that the values of 'Item Code' are still strings after the extraction. If you want to convert them to integers, use Series.astype.
df['Item Code'] = df['Item Code'].str.extract(r'(?:\s|^)(\d+)', expand=False).astype(int)
Output
>>> df
ID Price Item Code
0 1 3.60 80986
1 2 4.30 45772
2 3 0.60 9778
3 4 9.78 48989
4 5 3.44 545
5 6 3.44 509
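To see what the (?:^|\s) prefix buys over a plain (\d+), here is a small sketch. The input strings are hypothetical, since the original input column is not shown.

```python
import pandas as pd

# Hypothetical raw values; the last one has digits glued to letters
df = pd.DataFrame({"Item Code": ["Item 80986", "45772", "SKU12 9778"]})

# Anchored: digits must start the string or follow whitespace
anchored = df["Item Code"].str.extract(r"(?:^|\s)(\d+)", expand=False)

# Plain: takes the first run of digits anywhere, including inside 'SKU12'
plain = df["Item Code"].str.extract(r"(\d+)", expand=False)

print(anchored.astype(int).tolist())
print(plain.tolist())
```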