Extract Text After "/" in a Data Frame Column

How to extract part of a string in Pandas column and make a new column

Use str.extract with a regex and str.replace to rename values:

dff['Version_short'] = dff['Name'].str.extract(r'_(V\d+)$', expand=False).fillna('')
dff['Version_long'] = dff['Version_short'].str.replace('V', 'Version ')

Output:

>>> dff
col1 col3 Name Date Version_short Version_long
0 1 1 2a df a1asd_V1 2021-06-13 V1 Version 1
1 2 22 xcd a2asd_V3 2021-06-13 V3 Version 3
2 3 33 23vg aabsd_V1 2021-06-13 V1 Version 1
3 4 44 dfgdf_aabsd_V0 2021-06-14 V0 Version 0
4 5 55 a3as d_V1 2021-06-15 V1 Version 1
5 60 60 aa bsd_V3 2021-06-15 V3 Version 3
6 0 1 aasd_V4 2021-06-13 V4 Version 4
7 0 5 aabsd_V4 2021-06-16 V4 Version 4
8 6 6 aa_adn sd_V15 2021-06-13 V15 Version 15
9 3 3 NaN 2021-06-13
10 2 2 aasd_V12 2021-06-13 V12 Version 12
11 4 4 aasd120Abs 2021-06-16
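
As a self-contained sketch of the same two steps, the frame below uses a few made-up Name values (they are assumptions for illustration, not the asker's data). Passing expand=False makes str.extract return a Series, which can be assigned to a single column directly.

```python
import pandas as pd

# Hypothetical sample data; only the Name column matters here.
dff = pd.DataFrame({"Name": ["a1asd_V1", "aa_adn sd_V15", None, "aasd120Abs"]})

# Capture "V<digits>" only when it follows an underscore at the end of the string.
# Rows without a match (or with a missing Name) become NaN, which fillna turns into "".
dff["Version_short"] = dff["Name"].str.extract(r"_(V\d+)$", expand=False).fillna("")

# Literal (non-regex) replacement of "V" with "Version ".
dff["Version_long"] = dff["Version_short"].str.replace("V", "Version ", regex=False)

print(dff)
```

Rows whose Name has no trailing _V<digits> suffix end up with empty strings in both new columns.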

How to extract entire part of string after certain character in dataframe column?

Use str.split and take the last slice with -1 (names that do not contain the separator are returned unchanged):

df = pd.DataFrame(columns=[
'data.answers.1234567890.value.0987654321', 'blahblah.value.12345', 'foo'])

df.columns = df.columns.str.split('value.').str[-1]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')

Another alternative is splitting inside a list comprehension:

df.columns = [x.split('value.')[-1] for x in df.columns]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')
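
The same idea applies to the values of a column, for example to keep the text after a "/". The sketch below uses made-up strings (an assumption, not data from the question) and shows the difference between splitting at the last and at the first separator.

```python
import pandas as pd

# Hypothetical values; the last one contains no "/" at all.
s = pd.Series(["root/sub/file_a", "x/file_b", "no_slash"])

# Split once from the right and keep the final piece: text after the LAST "/".
after_last = s.str.rsplit("/", n=1).str[-1]

# Split once from the left and keep the final piece: text after the FIRST "/".
after_first = s.str.split("/", n=1).str[-1]

print(after_last.tolist())
print(after_first.tolist())
```

In both cases a value without the separator comes back unchanged, because the split then yields a one-element list.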

Extract elements from data column (String) before and after character

I am not really sure if this is what you want, but it does the work:

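# Keep only the digits from the part of Region before the dot
regions = []
for i in df['Region'].str.split('.').str[0]:
    regions.append(''.join([d for d in i if d.isdigit()]))

df['BGC Region'] = df['Strain'].str.split('_').str[2] + '_' + regions + '.region'

# Append the part after the dot, zero-padded to three digits
# (df.loc avoids chained assignment; assumes the default integer index)
region_number = df['Region'].str.split('.').str[1]
for i, rn in enumerate(region_number):
    if int(rn) < 10:
        df.loc[i, 'BGC Region'] += '00' + rn
    elif int(rn) < 100:
        df.loc[i, 'BGC Region'] += '0' + rn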
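
If the goal is simply to zero-pad the region number to three digits, a vectorized variant with str.zfill may be easier to read. This is only a sketch: the Strain and Region values below are invented to match the shape the loop expects, and three-digit padding is an assumption about the intent.

```python
import pandas as pd

# Hypothetical values shaped like the ones the loop expects.
df = pd.DataFrame({
    "Strain": ["NZ_CP_1234", "NZ_CP_5678"],
    "Region": ["Region 1.2", "Region 12.34"],
})

parts = df["Region"].str.split(".")

# Digits from the part before the dot, e.g. "Region 12" -> "12".
region_digits = parts.str[0].str.replace(r"\D", "", regex=True)

# Part after the dot, left-padded with zeros to three characters.
region_number = parts.str[1].str.zfill(3)

df["BGC Region"] = (
    df["Strain"].str.split("_").str[2] + "_" + region_digits + ".region" + region_number
)
print(df["BGC Region"].tolist())
```

Note that zfill also appends numbers of 100 or more unchanged, a case the loop does not handle.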

Extracting Specific Text From column in dataframe

We can use regex to extract the necessary part of the string.

Here we check for at least one character in [A-C] followed by zero or more digits [0-9]:

data['extract'] = data.Description.str.extract(r'([A-C]+[0-9]*)')

Or, if at least one digit must follow the letters:

data['extract'] = data.Description.str.extract(r'([A-C]+[0-9]+)')

Output

            Description      extract
0    ABC12345679 132465  ABC12345679
1      Test ABC12346548  ABC12346548
2  Test ABC1231321 4645   ABC1231321
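
The two patterns give the same result on the rows above; they only differ when a letter in A-C appears without digits after it. A small sketch, with one hypothetical extra row ("Apple B12") added to make the difference visible:

```python
import pandas as pd

data = pd.DataFrame({"Description": [
    "ABC12345679 132465",
    "Test ABC12346548",
    "Apple B12",  # hypothetical row, not from the original data
]})

# Digits optional: the first run of A-C letters matches, even with no digits after it.
data["optional_digits"] = data["Description"].str.extract(r"([A-C]+[0-9]*)", expand=False)

# Digits required: a run of A-C letters only matches if digits follow directly.
data["required_digits"] = data["Description"].str.extract(r"([A-C]+[0-9]+)", expand=False)

print(data)
```

For "Apple B12" the first pattern stops at the leading "A", while the second skips it and returns "B12".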

To Extract Substring from Column of DataFrame

Try with str.findall:

>>> df["NE Name"].str.findall(r"/([^/]{4})")
0 [01HJ]
1 [01HL, 02HL, 03HL, 10HL]
2 [01HL, 02HL, 03HL, 10HL]
3 [01HL, 02HL, 03HL, 10HL]
4 [01HL, 02HL, 03HL, 10HL]
Name: NE Name, dtype: object

Input DataFrame:

>>> df
NE Name Subrack ID pattern
0 10100000/01HJ 0 01HJ
1 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 1 01HJ
2 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 0 01HJ
3 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 2 01HJ
4 10100000/01HL&10100000/02HL&10100000/03HL&10100000/10HL 3 01HJ
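
Since str.findall returns a list per row, a follow-up step is usually needed to get plain strings. The sketch below rebuilds two rows of the NE Name column and shows two possible follow-ups; the new column names first_code and all_codes are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"NE Name": [
    "10100000/01HJ",
    "10100000/01HL&10100000/02HL",
]})

# Every run of four non-slash characters that directly follows a "/".
codes = df["NE Name"].str.findall(r"/([^/]{4})")

# Keep only the first match of each row ...
df["first_code"] = codes.str[0]

# ... or join all matches of a row into one comma-separated string.
df["all_codes"] = codes.str.join(",")

print(df)
```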

Python pandas: remove everything after a delimiter in a string

You can use pandas.Series.str.split just like you would use split normally. Just split on the string '::', and index the list that's created from the split method:

>>> df = pd.DataFrame({'text': ["vendor a::ProductA", "vendor b::ProductA", "vendor a::Productb"]})
>>> df
text
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
>>> df['text_new'] = df['text'].str.split('::').str[0]
>>> df
text text_new
0 vendor a::ProductA vendor a
1 vendor b::ProductA vendor b
2 vendor a::Productb vendor a

Here's a non-pandas solution:

>>> df['text_new1'] = [x.split('::')[0] for x in df['text']]
>>> df
text text_new text_new1
0 vendor a::ProductA vendor a vendor a
1 vendor b::ProductA vendor b vendor b
2 vendor a::Productb vendor a vendor a

Edit: Here's the step-by-step explanation of what's happening in pandas above:

# Select the pandas.Series object you want
>>> df['text']
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
Name: text, dtype: object

# using pandas.Series.str allows us to implement "normal" string methods
# (like split) on a Series
>>> df['text'].str
<pandas.core.strings.StringMethods object at 0x110af4e48>

# Now we can use the split method to split on our '::' string. You'll see that
# a Series of lists is returned (just like what you'd see outside of pandas)
>>> df['text'].str.split('::')
0 [vendor a, ProductA]
1 [vendor b, ProductA]
2 [vendor a, Productb]
Name: text, dtype: object

# using the pandas.Series.str method, again, we will be able to index through
# the lists returned in the previous step
>>> df['text'].str.split('::').str
<pandas.core.strings.StringMethods object at 0x110b254a8>

# now we can grab the first item in each list above for our desired output
>>> df['text'].str.split('::').str[0]
0 vendor a
1 vendor b
2 vendor a
Name: text, dtype: object

I would suggest checking out the pandas.Series.str docs, or, better yet, Working with Text Data in pandas.
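
If both sides of the delimiter are needed, str.split can also return them as separate columns. A minimal sketch, assuming a hypothetical third row without the delimiter and made-up column names vendor and product:

```python
import pandas as pd

df = pd.DataFrame({"text": ["vendor a::ProductA", "vendor b::ProductA", "no delimiter"]})

# n=1 splits at the first "::" only; expand=True returns a DataFrame of the parts.
parts = df["text"].str.split("::", n=1, expand=True)

df["vendor"] = parts[0]   # text before the delimiter
df["product"] = parts[1]  # text after the delimiter, missing if there is none

print(df)
```

A row without "::" keeps its full text in the first part and gets a missing value in the second.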

Extracting text after a phrase and in between spaces from Pandas Dataframe

You get the match Jacobs as the pattern (\w+(?=\s+FLEX\s)) matches 1+ word characters asserting what is directly to the right is whitespace chars followed by FLEX.

Instead, you can use a pattern with a capture group to match 2 words after FLEX:

\bFLEX\s+(\w+\s+\w+)

Or a broader match:

\bFLEX\s+(\S+\s+\S+)
  • \bFLEX A word boundary, match FLEX
  • \s+ Match 1+ whitespace chars
  • (\S+\s+\S+) Capture group 1 match 1+ non whitespace chars, 1+ whitespace chars and again 1+ non whitespace chars

import pandas as pd

strings = ['QB Aaron Rodgers RB Josh Jacobs FLEX Davante Adams']
df = pd.DataFrame(strings, columns=["Lineup"])
df['Lineup'] = df["Lineup"].str.extract(r'\bFLEX\s+(\S+\s+\S+)')
print(df)

Output

          Lineup
0 Davante Adams

If you want to match 2 or more words, you could use a repeating non-capturing group:

\bFLEX\s+(\w+(?:\s+\w+)+)
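
To compare the patterns side by side, here is a small sketch with Python's re module. The trailing "Jr" in the sample string is a hypothetical addition so that the two-word and the two-or-more-word patterns give different results.

```python
import re

lineup = "QB Aaron Rodgers RB Josh Jacobs FLEX Davante Adams Jr"

# Lookahead version: matches the word BEFORE " FLEX ".
lookahead = re.search(r"\w+(?=\s+FLEX\s)", lineup).group(0)

# Capture group with exactly two words after FLEX.
two_words = re.search(r"\bFLEX\s+(\w+\s+\w+)", lineup).group(1)

# Capture group with two or more words after FLEX (greedy).
two_or_more = re.search(r"\bFLEX\s+(\w+(?:\s+\w+)+)", lineup).group(1)

print(lookahead)
print(two_words)
print(two_or_more)
```

The same pattern strings can be passed unchanged to Series.str.extract.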

Extract a certain part of a string after a key phrase using pandas?

You can use the Series str.extract string method:

In [11]: df = pd.DataFrame([["(12:25) (No Huddle Shotgun) P.Manning pass short left to W.Welker pushed ob at DEN 34 for 10 yards (C.Graham)."]])

In [12]: df
Out[12]:
0
0 (12:25) (No Huddle Shotgun) P.Manning pass sho...

This will "extract" what's in the group (inside the parentheses):

In [13]: df[0].str.extract(r"for (\d+)", expand=False)
Out[13]:
0 10
Name: 0, dtype: object

In [14]: df[0].str.extract(r"for (\d+) yards", expand=False)
Out[14]:
0 10
Name: 0, dtype: object

You'll need to convert to int, e.g. using astype(int).
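
One caveat: astype(int) raises an error if any row has no match, because the result then contains missing values. A possible way around that, sketched with a hypothetical second row that has no yardage, is pd.to_numeric followed by the nullable Int64 dtype:

```python
import pandas as pd

plays = pd.Series([
    "P.Manning pass short left to W.Welker pushed ob at DEN 34 for 10 yards (C.Graham).",
    "P.Manning pass incomplete short right to D.Thomas.",  # hypothetical row, no match
])

# Extract the digits as text; rows without a match become missing.
yards_text = plays.str.extract(r"for (\d+) yards", expand=False)

# Convert to numbers and keep the missing value as <NA> instead of failing.
yards = pd.to_numeric(yards_text).astype("Int64")

print(yards)
```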

Pandas DataFrame - Extract string between two strings and include the first delimiter

You can accomplish this entirely within the regex, without any string slicing.

df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
  • FILE is what we begin the match on
  • .* grabs any number of characters
  • (?=...) is a lookahead assertion that matches without consuming.

Handy regex tool https://pythex.org/
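
In a regex an unescaped . matches any character, so writing the lookahead as (?=\.txt) is stricter than (?=.txt). The sketch below uses Python's re module and two made-up file names to show the difference.

```python
import re

escaped = re.compile(r"(FILE.*(?=\.txt))")   # lookahead requires a literal ".txt"
unescaped = re.compile(r"(FILE.*(?=.txt))")  # lookahead accepts any character before "txt"

name = "report FILE_2021_06.txt"
tricky = "report FILE_axtxt"  # hypothetical name without a real ".txt" suffix

# Both patterns stop right before ".txt" and keep the leading FILE.
match_escaped = escaped.search(name).group(1)
match_unescaped = unescaped.search(name).group(1)

# Only the unescaped version finds something in the tricky name.
tricky_escaped = escaped.search(tricky)
tricky_unescaped = unescaped.search(tricky).group(1)

print(match_escaped, match_unescaped, tricky_escaped, tricky_unescaped)
```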

Extracting number from string only when string is present in a dataframe

Use Series.str.extract with the regex pattern r'(?:^|\s)(\d+)':

  • (?:^|\s) matches the beginning of the string ('^') or ('|') any whitespace character ('\s') without capturing it ((?:...))
  • (\d+) captures one or more digits (greedy)

df['Item Code'] = df['Item Code'].str.extract(r'(?:^|\s)(\d+)', expand=False)

Note that the values of 'Item Code' are still strings after the extraction. If you want to convert them to integers, use Series.astype.

df['Item Code'] = df['Item Code'].str.extract(r'(?:\s|^)(\d+)', expand=False).astype(int)

Output

>>> df

ID Price Item Code
0 1 3.60 80986
1 2 4.30 45772
2 3 0.60 9778
3 4 9.78 48989
4 5 3.44 545
5 6 3.44 509
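
The raw input values are not shown above, so the sketch below uses invented ones to illustrate what the non-capturing prefix does: digits are only picked up at the start of the string or after whitespace, so digits glued to a letter are skipped.

```python
import pandas as pd

# Hypothetical raw values; the original input column is not shown.
df = pd.DataFrame({"Item Code": ["A 80986", "45772", "code 9778 x", "X509"]})

# Digits count only at the start of the string or right after whitespace.
codes = df["Item Code"].str.extract(r"(?:^|\s)(\d+)", expand=False)

# "X509" yields no match, so drop missing rows before converting to int.
as_int = codes.dropna().astype(int)

print(codes.tolist())
print(as_int.tolist())
```

Calling astype(int) directly on codes would fail here because of the missing value, which is why the unmatched row is dropped first in this sketch.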

