Pandas, Remove Everything After Last '_'

Python pandas: remove everything after a delimiter in a string

You can use pandas.Series.str.split just like you would use split normally. Just split on the string '::', and index the list that's created from the split method:

>>> df = pd.DataFrame({'text': ["vendor a::ProductA", "vendor b::ProductA", "vendor a::Productb"]})
>>> df
                 text
0  vendor a::ProductA
1  vendor b::ProductA
2  vendor a::Productb
>>> df['text_new'] = df['text'].str.split('::').str[0]
>>> df
                 text  text_new
0  vendor a::ProductA  vendor a
1  vendor b::ProductA  vendor b
2  vendor a::Productb  vendor a

Here's a non-pandas solution:

>>> df['text_new1'] = [x.split('::')[0] for x in df['text']]
>>> df
                 text  text_new text_new1
0  vendor a::ProductA  vendor a  vendor a
1  vendor b::ProductA  vendor b  vendor b
2  vendor a::Productb  vendor a  vendor a

Edit: Here's the step-by-step explanation of what's happening in pandas above:

# Select the pandas.Series object you want
>>> df['text']
0    vendor a::ProductA
1    vendor b::ProductA
2    vendor a::Productb
Name: text, dtype: object

# using pandas.Series.str allows us to implement "normal" string methods 
# (like split) on a Series
>>> df['text'].str
<pandas.core.strings.StringMethods object at 0x110af4e48>

# Now we can use the split method to split on our '::' string. You'll see that
# a Series of lists is returned (just like what you'd see outside of pandas)
>>> df['text'].str.split('::')
0    [vendor a, ProductA]
1    [vendor b, ProductA]
2    [vendor a, Productb]
Name: text, dtype: object

# using the pandas.Series.str method, again, we will be able to index through
# the lists returned in the previous step
>>> df['text'].str.split('::').str
<pandas.core.strings.StringMethods object at 0x110b254a8>

# now we can grab the first item in each list above for our desired output
>>> df['text'].str.split('::').str[0]
0    vendor a
1    vendor b
2    vendor a
Name: text, dtype: object

I would suggest checking out the pandas.Series.str docs, or, better yet, Working with Text Data in pandas.
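
If you also need the part after the delimiter, a variation on the split above may help. This is a minimal sketch that assumes the same df as above and uses the n and expand parameters of str.split; the column names vendor and product are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'text': ["vendor a::ProductA", "vendor b::ProductA", "vendor a::Productb"]})

# n=1 stops after the first '::'; expand=True returns a DataFrame
# with one column per piece instead of a Series of lists
parts = df['text'].str.split('::', n=1, expand=True)

# 'vendor' and 'product' are hypothetical column names
df['vendor'] = parts[0]
df['product'] = parts[1]
print(df)
```

With expand=True the pieces land in columns 0 and 1, so no second .str indexing step is needed.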

Pandas, remove everything after last '_'

Use a combination of str.rsplit and str.get for your desired outcome. str.rsplit splits each string starting from the right, while str.get retrieves the element at a given position from each list in the resulting pd.Series.

Answer

df['SOURCE_NAME'] = df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

The n argument in rsplit limits the number of splits. With n=1, each string is split only at the last '_', so element 0 is everything before it.
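
To make the intermediate step visible, here is a small sketch on two of the example strings. It only illustrates what rsplit and get return; the variable names are arbitrary.

```python
import pandas as pd

s = pd.Series(['Stack_Over_Flow_1234', 'Stackoverflow'])

# Each string is split at most once, starting from the right
pieces = s.str.rsplit('_', n=1)
print(pieces.tolist())
# [['Stack_Over_Flow', '1234'], ['Stackoverflow']]

# Element 0 is the text before the last '_'; a string with no '_'
# yields a one-element list, so it comes back unchanged
result = pieces.str.get(0)
print(result.tolist())
# ['Stack_Over_Flow', 'Stackoverflow']
```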

Even though a solution using pd.Series.apply runs in roughly half the time, I like this one because it is more expressive in its syntax. If you want to use the (faster) pd.Series.apply solution, check the timing section below.

See the pandas documentation for Series.str.rsplit and Series.str.get.

Example

strs = ['Stackoverflow_1234',
        'Stack_Over_Flow_1234',
        'Stackoverflow',
        'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})

This results in

print(df)
            SOURCE_NAME
0    Stackoverflow_1234
1  Stack_Over_Flow_1234
2         Stackoverflow
3   Stack_Overflow_1234

Using the proposed solution:

df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

0      Stackoverflow
1    Stack_Over_Flow
2      Stackoverflow
3     Stack_Overflow
Name: SOURCE_NAME, dtype: object

Time

Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply:

import pandas as pd

df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# increasing the number of rows x 100
df = pd.concat([df] * 100)

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
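
%timeit is an IPython magic, so the comparison above only runs in IPython or Jupyter. A rough sketch of the same comparison in a plain script, using the standard timeit module, could look like this. The helper names with_apply and with_str and the repeat count are arbitrary choices, and the absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

df = pd.DataFrame({'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

def with_apply():
    return df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])

def with_str():
    return df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

# Both approaches should agree before their speed is compared
assert with_apply().tolist() == with_str().tolist()

# number=100 is an arbitrary choice; timings depend on the machine
for func in (with_apply, with_str):
    seconds = timeit.timeit(func, number=100)
    print(f"{func.__name__}: {seconds / 100 * 1e6:.0f} µs per call")
```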

How to remove everything after the last occurrence of a character in a Dataframe?

You could do:

df_temp = df.apply(lambda x: x.str.split('.').str[:-1].str.join('.'))

output:

            EQ1              EQ2             EQ3
0         Apple  Oranage.eatable             NaN
1  Pear.eatable           Banana             NaN
2        Orange           Tomato  Potato.eatable
3          Kiwi             Pear         Cabbage

See the string method docs.
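
One caveat with the split, slice and join approach: a value that contains no '.' at all becomes an empty string, because dropping the last piece leaves nothing to join. A hedged alternative is to split once from the right instead. The sketch below uses invented sample values (only the column names follow the output above).

```python
import pandas as pd

# Invented sample data: one value has no '.', one is missing
df = pd.DataFrame({'EQ1': ['Apple.fruit', 'Kiwi'],
                   'EQ2': ['Pear.eatable.raw', None]})

# split / drop last piece / join: 'Kiwi' turns into ''
joined = df.apply(lambda col: col.str.split('.').str[:-1].str.join('.'))

# rsplit with n=1: 'Kiwi' is left as it is
rsplit = df.apply(lambda col: col.str.rsplit('.', n=1).str[0])

print(joined)
print(rsplit)
```

Missing values stay missing in both variants, so only the no-delimiter case behaves differently.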

How can I remove string after last underscore in python dataframe?

pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))

Explanation:

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})

Creates

    col
0   AA_XX
1   AAA_BB_XX
2   AA_BB_XYX
3   AA_A_B_YXX

Use apply to loop through the column you want to edit.

The lambda splits each string at '_' and then rejoins every part except the last with '_':

df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)

Results:

    col
0   AA
1   AAA_BB
2   AA_BB
3   AA_A_B

If your dataset contains values without an underscore (like AA), change the lambda so that those values are left unchanged:

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
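
As an alternative to the conditional lambda, a regex replacement handles both cases in one expression. This is a sketch using a different technique (str.replace with regex=True), not the approach from the answer above.

```python
import pandas as pd

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})

# '_[^_]*$' matches the last underscore plus everything after it;
# strings without an underscore do not match and stay unchanged
df['col'] = df['col'].str.replace(r'_[^_]*$', '', regex=True)
print(df['col'].tolist())
# ['AA', 'AAA_BB', 'AA_BB', 'AA_A_B', 'AA']
```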

Remove values before and after special character

To remove the values that come before and after the '_' delimiters, essentially keeping the middle part, you can use .str.extract() with a regex, as follows:

df['col1'] = df['col1'].str.extract(r'\w*?_([^_]*)(?:_)?')

Result:

print(df)

  col1 col2
0  bu1   dd
1  lap    d
2  lap    d
3   bb   dd
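
To see how the pattern works piece by piece, here is a small sketch. The input values are hypothetical, since the original data is not shown above, and expand=False is used so that the single capture group comes back as a Series.

```python
import pandas as pd

# Hypothetical input values; the original data is not shown above
s = pd.Series(['aa_bu1_zz', 'xx_lap'])

# \w*?      lazily skips the leading word characters
# _         consumes the first underscore
# ([^_]*)   captures everything up to the next underscore (or the end)
# (?:_)?    optionally consumes that next underscore without capturing it
middle = s.str.extract(r'\w*?_([^_]*)(?:_)?', expand=False)
print(middle.tolist())
# ['bu1', 'lap']
```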

Edit

To extract also the digits at the end, you can do:

s_df = df['col1'].str.split('_', expand=True) 
s_df[2] = s_df[2].str.extract(r'(\d+)$').fillna('') 
df['col1'] = s_df[1] + s_df[2]

Result:

print(df)

   col1 col2
0   bu1   dd
1  lap1    d
2  lap2    d
3   bb1   dd

Remove ends of string entries in pandas DataFrame column

I think you can use str.replace with the regex '\.txt$' ($ matches the end of the string, and the escaped dot matches a literal '.'):

import pandas as pd

df = pd.DataFrame({'A': {0: 2, 1: 1},
                   'C': {0: 5, 1: 1},
                   'B': {0: 4, 1: 2},
                   'filename': {0: "txt.txt", 1: "x.txt"}},
                columns=['filename','A','B', 'C'])

print(df)
  filename  A  B  C
0  txt.txt  2  4  5
1    x.txt  1  2  1

# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['filename'] = df['filename'].str.replace(r'\.txt$', '', regex=True)
print(df)
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

# Alternative, starting again from the original column: drop the last four characters
df['filename'] = df['filename'].map(lambda x: str(x)[:-4])
print(df)
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

# The same slice on the original column, using the vectorized .str accessor
df['filename'] = df['filename'].str[:-4]
print(df)
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

EDIT:

rstrip is not a safe alternative: it treats its argument as a set of characters and strips every trailing character in that set (in this case ., t, x), so it can remove more than the '.txt' suffix:

Example:

print(df)
  filename  A  B  C
0  txt.txt  2  4  5
1    x.txt  1  2  1

df['filename'] = df['filename'].str.rstrip('.txt')

print(df)
  filename  A  B  C
0           2  4  5
1           1  2  1
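
If the goal is to drop an exact suffix, Series.str.removesuffix (available from pandas 1.4) avoids the character-set behavior of rstrip. A minimal sketch follows; the value 'notes.csv' is invented to show a string that does not end in '.txt'.

```python
import pandas as pd

s = pd.Series(['txt.txt', 'x.txt', 'notes.csv'])

# rstrip strips any trailing run of the characters '.', 't', 'x'
print(s.str.rstrip('.txt').tolist())
# ['', '', 'notes.csv']

# removesuffix removes the exact suffix, at most once
print(s.str.removesuffix('.txt').tolist())
# ['txt', 'x', 'notes.csv']
```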

Remove Last instance of a character and rest of a string

result = my_string.rsplit('_', 1)[0]

Which behaves like this:

>>> my_string = 'foo_bar_one_two_three'
>>> print(my_string.rsplit('_', 1)[0])
foo_bar_one_two

See the documentation entry for str.rsplit([sep[, maxsplit]]).
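
One detail worth knowing is what happens when the string has no '_' at all. The sketch below compares rsplit with str.rpartition, a related built-in that is not part of the answer above; the sample strings are arbitrary.

```python
with_sep = 'foo_bar_one_two_three'
without_sep = 'foobar'

# rsplit returns the whole string as the only element when '_' is absent
print(with_sep.rsplit('_', 1)[0])     # foo_bar_one_two
print(without_sep.rsplit('_', 1)[0])  # foobar

# rpartition returns ('', '', string) when '_' is absent,
# so element 0 is an empty string in that case
print(with_sep.rpartition('_')[0])     # foo_bar_one_two
print(repr(without_sep.rpartition('_')[0]))  # ''
```

So rsplit is the safer choice when strings without the separator should pass through unchanged.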

Removing everything after a char in a dataframe

Try:

countries['info'] = countries['info'].str.split('-').str[0]

Output:

        country        info
0       england      london
1      scotland   edinburgh
2         china     beijing
3  unitedstates  washington