Pandas, Remove Everything After Last '_'

Python pandas: remove everything after a delimiter in a string

You can use pandas.Series.str.split just like you would use split normally. Just split on the string '::', and index the list that's created from the split method:

>>> df = pd.DataFrame({'text': ["vendor a::ProductA", "vendor b::ProductA", "vendor a::Productb"]})
>>> df
text
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
>>> df['text_new'] = df['text'].str.split('::').str[0]
>>> df
text text_new
0 vendor a::ProductA vendor a
1 vendor b::ProductA vendor b
2 vendor a::Productb vendor a

Here's a solution that uses plain Python string methods in a list comprehension instead of the .str accessor:

>>> df['text_new1'] = [x.split('::')[0] for x in df['text']]
>>> df
text text_new text_new1
0 vendor a::ProductA vendor a vendor a
1 vendor b::ProductA vendor b vendor b
2 vendor a::Productb vendor a vendor a

Edit: Here's the step-by-step explanation of what's happening in pandas above:

# Select the pandas.Series object you want
>>> df['text']
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
Name: text, dtype: object

# using pandas.Series.str allows us to implement "normal" string methods
# (like split) on a Series
>>> df['text'].str
<pandas.core.strings.StringMethods object at 0x110af4e48>

# Now we can use the split method to split on our '::' string. You'll see that
# a Series of lists is returned (just like what you'd see outside of pandas)
>>> df['text'].str.split('::')
0 [vendor a, ProductA]
1 [vendor b, ProductA]
2 [vendor a, Productb]
Name: text, dtype: object

# using the pandas.Series.str method, again, we will be able to index through
# the lists returned in the previous step
>>> df['text'].str.split('::').str
<pandas.core.strings.StringMethods object at 0x110b254a8>

# now we can grab the first item in each list above for our desired output
>>> df['text'].str.split('::').str[0]
0 vendor a
1 vendor b
2 vendor a
Name: text, dtype: object

I would suggest checking out the pandas.Series.str docs, or, better yet, Working with Text Data in pandas.
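One thing the walkthrough above does not show is what happens to rows that lack the delimiter or are missing altogether. The following is a minimal sketch on made-up data (the Series `s` and its values are assumptions for illustration, not from the question):

```python
import pandas as pd

# Hypothetical data: one row with the delimiter, one without, one missing
s = pd.Series(["vendor a::ProductA", "no delimiter here", None])

# Split on '::' and take the first piece of each resulting list
first = s.str.split("::").str[0]

print(first)
```

A row without '::' comes back unchanged, because splitting it yields a one-element list, and a missing value stays missing.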

Pandas, remove everything after last '_'

Use a combination of str.rsplit and str.get for your desired outcome. str.rsplit splits each string starting from the end, while str.get picks the element at a given position from each list in the resulting pd.Series.


Answer

df['SOURCE_NAME'] = df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

The n argument in rsplit limits the number of splits, so with n=1 only the last '_' is split on and you keep everything before it.
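To make the effect of n concrete, here is a small sketch comparing an unlimited rsplit with n=1. The variable names (`all_splits`, `one_split`, `trimmed`) are just illustrative:

```python
import pandas as pd

s = pd.Series(["Stack_Over_Flow_1234", "Stackoverflow"])

# Without n, every '_' is a split point
all_splits = s.str.rsplit("_")

# With n=1, only the last '_' is used
one_split = s.str.rsplit("_", n=1)

# Element 0 of each list is everything before the last '_'
trimmed = one_split.str.get(0)

print(all_splits.tolist())
print(one_split.tolist())
print(trimmed.tolist())
```

Strings without any '_' end up as a one-element list, so str.get(0) returns them unchanged.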

Even though a solution using pd.Series.apply takes about half the time in the timings below, I like this one because it is more expressive in its syntax. If you want to use the faster pd.Series.apply solution, check the timing part!

See the pandas documentation for str.rsplit and str.get.


Example

strs = ['Stackoverflow_1234',
'Stack_Over_Flow_1234',
'Stackoverflow',
'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})

This results in

print(df)
SOURCE_NAME
0 Stackoverflow_1234
1 Stack_Over_Flow_1234
2 Stackoverflow
3 Stack_Overflow_1234

Using the proposed solution:

df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

0 Stackoverflow
1 Stack_Over_Flow
2 Stackoverflow
3 Stack_Overflow
Name: SOURCE_NAME, dtype: object

Time

Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply:

import pandas as pd

df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# increasing the number of rows x 100
df = pd.concat([df] * 100)

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
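The %timeit lines above only work in IPython or Jupyter. If you want to repeat the comparison in a plain script, something like the following sketch with the standard timeit module should do. The helper names `with_apply` and `with_str` are made up here, and the numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame(data={"SOURCE_NAME": ["stackoverflow_1234_abcd"] * 1000})


def with_apply(frame):
    # Plain Python str.rsplit called once per value
    return frame["SOURCE_NAME"].apply(lambda x: x.rsplit("_", 1)[0])


def with_str(frame):
    # The same operation through the .str accessor
    return frame["SOURCE_NAME"].str.rsplit("_", n=1).str.get(0)


runs = 100
t_apply = timeit.timeit(lambda: with_apply(df), number=runs)
t_str = timeit.timeit(lambda: with_str(df), number=runs)

print(f"apply: {t_apply / runs * 1e6:.0f} µs per call")
print(f".str:  {t_str / runs * 1e6:.0f} µs per call")
```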

How to remove everything after the last occurrence of a character in a Dataframe?

You could do:

df_temp = df.apply(lambda x: x.str.split('.').str[:-1].str.join('.'))

output:

            EQ1              EQ2             EQ3
0         Apple  Oranage.eatable             NaN
1  Pear.eatable           Banana             NaN
2        Orange           Tomato  Potato.eatable
3          Kiwi             Pear         Cabbage

see the string method docs
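Note that split / [:-1] / join turns a value that contains no '.' at all into an empty string. If such values should be kept as they are, a variant based on rsplit may be a better fit. This is only a sketch on invented data (the frame below is an assumption, not the asker's data):

```python
import pandas as pd

# Hypothetical frame: some values have one or more dots, some have none
df = pd.DataFrame({
    "EQ1": ["Apple.red", "Pear.eatable.ripe", "Orange"],
    "EQ2": ["Banana.x", None, "Tomato"],
})

# Split once from the right and keep the left part, column by column
trimmed = df.apply(lambda col: col.str.rsplit(".", n=1).str.get(0))

# For comparison: split / [:-1] / join empties a value that has no dot
emptied = pd.Series(["Orange"]).str.split(".").str[:-1].str.join(".")

print(trimmed)
print(emptied.tolist())
```

With rsplit, values without a dot pass through untouched and missing values stay missing.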

How can I remove string after last underscore in python dataframe?

pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))

Explanation:

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})

Creates

          col
0       AA_XX
1   AAA_BB_XX
2   AA_BB_XYX
3  AA_A_B_YXX

Use apply to run a function on every value in the column you want to edit.

The lambda splits each string at '_' and then joins all parts except the last one back together with '_':

df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)

Results:

      col
0      AA
1  AAA_BB
2   AA_BB
3  AA_A_B

If your dataset contains values without an underscore (like AA), the lambda above turns them into empty strings. Change the lambda like this:

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
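As a possible simplification, rsplit with a maximum of one split covers both cases without the length check, since a string with no '_' comes back as a single piece. A minimal sketch (the `trimmed` column name is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col": ["AA_XX", "AAA_BB_XX", "AA_BB_XYX", "AA_A_B_YXX", "AA"]})

# Split once from the right and keep the part before the last '_'
df["trimmed"] = df["col"].apply(lambda r: r.rsplit("_", 1)[0])

print(df)
```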

Remove values before and after special character

To remove the values that come before the first '_' and after the second '_' (essentially keeping the middle part), you can use .str.extract() with a regex, as follows:

df['col1'] = df['col1'].str.extract(r'\w*?_([^_]*)(?:_)?')

Result:

print(df)

col1 col2
0 bu1 dd
1 lap d
2 lap d
3 bb dd
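The original input is not shown here, so the following sketch runs the same pattern on made-up strings (the values in `col1` are assumptions). It passes expand=False so that a Series comes back rather than a one-column DataFrame:

```python
import pandas as pd

# Hypothetical inputs: prefix_middle_suffix, or just prefix_middle
df = pd.DataFrame({"col1": ["aa_bu1_cc", "x_lap_y", "bb_dd"]})

# \w*?_    lazily skips everything up to the first '_'
# ([^_]*)  captures the run of non-underscore characters that follows
# (?:_)?   optionally consumes the next '_' without capturing it
middle = df["col1"].str.extract(r"\w*?_([^_]*)(?:_)?", expand=False)

print(middle.tolist())
```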

Edit

To extract also the digits at the end, you can do:

s_df = df['col1'].str.split('_', expand=True) 
s_df[2] = s_df[2].str.extract(r'(\d+)$').fillna('')
df['col1'] = s_df[1] + s_df[2]

Result:

print(df)

col1 col2
0 bu1 dd
1 lap1 d
2 lap2 d
3 bb1 dd
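Again on invented input (the three strings below are assumptions), the split-and-recombine steps can be sketched like this. It uses expand=False in str.extract so that the digits come back as a Series:

```python
import pandas as pd

# Hypothetical inputs with three '_'-separated parts each
df = pd.DataFrame({"col1": ["aa_bu_cc1", "x_lap_y2", "q_bb_zz"]})

# One column per part: 0 = prefix, 1 = middle, 2 = suffix
s_df = df["col1"].str.split("_", expand=True)

# Trailing digits of the suffix, or '' when there are none
digits = s_df[2].str.extract(r"(\d+)$", expand=False).fillna("")

# Middle part followed by the trailing digits
result = s_df[1] + digits

print(result.tolist())
```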

Remove ends of string entries in pandas DataFrame column

I think you can use str.replace with a regex that anchors '.txt' to the end of the string ($ matches the end of the string):

import pandas as pd

df = pd.DataFrame({'A': {0: 2, 1: 1},
'C': {0: 5, 1: 1},
'B': {0: 4, 1: 2},
'filename': {0: "txt.txt", 1: "x.txt"}},
columns=['filename','A','B', 'C'])

print(df)
filename A B C
0 txt.txt 2 4 5
1 x.txt 1 2 1

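# regex=True is needed in pandas 2.0+, where str.replace matches literally by default;
# the dot is escaped so that it only matches a literal '.'
df['filename'] = df['filename'].str.replace(r'\.txt$', '', regex=True)
print(df)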
filename A B C
0 txt 2 4 5
1 x 1 2 1

# Alternative, run on the original column: drop the last four characters
df['filename'] = df['filename'].map(lambda x: str(x)[:-4])
print(df)
filename A B C
0 txt 2 4 5
1 x 1 2 1

# Alternative, run on the original column: the same slice through the .str accessor
df['filename'] = df['filename'].str[:-4]
print(df)
filename A B C
0 txt 2 4 5
1 x 1 2 1

EDIT:

rstrip can remove more characters than intended: it strips any trailing characters that belong to the given set (in this case ., t, x), not the literal suffix '.txt':

Example:

print(df)
filename A B C
0 txt.txt 2 4 5
1 x.txt 1 2 1

df['filename'] = df['filename'].str.rstrip('.txt')

print(df)
filename A B C
0 2 4 5
1 1 2 1
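If your pandas is recent enough (1.4 or later, as far as I know), str.removesuffix removes a literal suffix and avoids this problem. A short sketch comparing the two, with an extra made-up filename `notes.md` added for contrast:

```python
import pandas as pd

names = pd.Series(["txt.txt", "x.txt", "notes.md"])

# rstrip treats '.txt' as a set of characters and strips all of them
stripped = names.str.rstrip(".txt")

# removesuffix only removes the exact trailing string '.txt'
cleaned = names.str.removesuffix(".txt")

print(stripped.tolist())
print(cleaned.tolist())
```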

Remove Last instance of a character and rest of a string

result = my_string.rsplit('_', 1)[0]

Which behaves like this:

>>> my_string = 'foo_bar_one_two_three'
>>> print(my_string.rsplit('_', 1)[0])
foo_bar_one_two

See the documentation entry for str.rsplit([sep[, maxsplit]]).
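It may be worth knowing how this behaves when the separator is absent. The sketch below wraps the one-liner in a helper (`before_last` is a made-up name) and contrasts it with str.rpartition, which looks similar but returns an empty head in that case:

```python
def before_last(text, sep="_"):
    # Split once from the right; a string without sep comes back whole
    return text.rsplit(sep, 1)[0]


with_sep = before_last("foo_bar_one_two_three")
without_sep = before_last("foo")

# rpartition returns (head, sep, tail); head is '' when sep is not found
rpartition_head = "foo".rpartition("_")[0]

print(with_sep)
print(without_sep)
print(repr(rpartition_head))
```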

Removing everything after a char in a dataframe

Try:

countries['info'] = countries['info'].str.split('-').str[0]

Output:

        country        info
0       england      london
1      scotland   edinburgh
2         china     beijing
3  unitedstates  washington
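If you also want to keep what comes after the '-', one option is to split once and expand into columns. This sketch uses invented values, since the original `info` strings are not shown:

```python
import pandas as pd

# Hypothetical values; the second one contains more than one '-'
info = pd.Series(["london-capital", "edinburgh-capital-city"])

# n=1 splits at the first '-' only; expand=True returns one column per piece
parts = info.str.split("-", n=1, expand=True)

print(parts)
```

Column 0 holds the text before the first '-', column 1 everything after it.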

