Remove Partial String from Dataframe With Pandas

(Bad Answer)

`Series.str.split` soup

# Keep the text before '(' and then take the last '_'-separated piece.
df['str'] = df['str'].str.split('(').str[0].str.split('_').str[-1]
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

(Less Bad answer)

`Series.str.extract`

df['str'] = df['str'].str.extract(r'_([^_]+)\(', expand=False)
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

Regex-based methods carry their fair share of overhead, and `str.extract` does little to reduce it (see the timings under Performance).

(Better Answer)

`re.search` with list comp

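import re

# The lookbehind and lookahead match the text between '_' and '(' without consuming either.
p = re.compile(r'(?<=_)[^_]+(?=\()')
# Assumes every value matches; p.search returns None otherwise and indexing it raises TypeError.
df['str'] = [p.search(x)[0] for x in df['str'].tolist()]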
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This should be faster than the above methods. In my experience, list comprehensions are often faster than pandas' vectorised string methods, even when they use regex. The pattern is compiled once up front to avoid some of the per-call overhead.
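The list comprehension above assumes every row matches the pattern. A minimal sketch of a guarded variant follows; the sample strings and the helper name `extract_or_none` are hypothetical and not part of the original answer.

```python
import re

import pandas as pd

# Hypothetical sample data; the last row has no "_...(" part.
df = pd.DataFrame({"id": [1, 2, 3], "str": ["a_d(x)", "b_e(y)", "no match"]})

p = re.compile(r"(?<=_)[^_]+(?=\()")


def extract_or_none(text):
    """Return the text between '_' and '(', or None if the pattern is absent."""
    m = p.search(text)
    return m[0] if m else None


extracted = [extract_or_none(x) for x in df["str"].tolist()]
df["str"] = extracted
print(df)
```

Rows that do not match end up as missing values instead of raising `TypeError`, at the cost of one extra function call per row.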

(Also a better answer)

`str.split` with list comp

df['str'] = [
    x.split('(', 1)[0].split('_')[1] for x in df['str'].tolist()
]
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This combines the low overhead of a list comprehension with the speed of pure-Python string splitting, so it should be the fastest of the four.
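Note that this version takes index `[1]` after splitting on `_`, while the first answer takes `[-1]`. The two agree only when there is a single underscore before the parenthesis. A small sketch with hypothetical inputs (the helper `piece_by_index` is illustrative only):

```python
def piece_by_index(text, index):
    """Drop everything from the first '(' onwards, then return one '_'-separated piece."""
    return text.split("(", 1)[0].split("_")[index]


values = ["a_d(x)", "a_b_c(y)"]  # hypothetical inputs

second = [piece_by_index(v, 1) for v in values]
last = [piece_by_index(v, -1) for v in values]

print(second)  # piece right after the first underscore
print(last)    # piece right before the parenthesis
```

Which index is appropriate depends on the real data, which the answers above do not show.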

Performance

import pandas as pd

# Repeat the sample frame 10,000 times to get a workload large enough to time.
df_test = pd.concat([df] * 10000, ignore_index=True)

%timeit df_test['str'].str.extract(r'_([^_]+)\(', expand=False)
%timeit df_test['str'].str.split('(').str[0].str.split('_').str[-1]
%timeit [p.search(x)[0] for x in df_test['str'].tolist()]
%timeit [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()]

70.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
99.6 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
31 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # fastest but not by much
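`%timeit` is an IPython magic, so the lines above only run in IPython or Jupyter. A rough sketch of the same comparison with the standard-library `timeit` module follows. The sample frame is hypothetical and smaller than the one timed above, so the absolute numbers will differ from those reported.

```python
import re
import timeit

import pandas as pd

# Hypothetical sample frame, repeated to get a measurable workload.
df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "str": ["a_d(1)", "b_d(2)", "c_e(3)", "d_b(4)"]}
)
df_test = pd.concat([df] * 1000, ignore_index=True)
p = re.compile(r"(?<=_)[^_]+(?=\()")

candidates = {
    "str.extract": lambda: df_test["str"].str.extract(r"_([^_]+)\(", expand=False),
    "str.split chain": lambda: df_test["str"].str.split("(").str[0].str.split("_").str[-1],
    "re.search list comp": lambda: [p.search(x)[0] for x in df_test["str"].tolist()],
    "str.split list comp": lambda: [
        x.split("(", 1)[0].split("_")[1] for x in df_test["str"].tolist()
    ],
}

# Check that all four approaches agree before comparing their speed.
results = {name: list(func()) for name, func in candidates.items()}

# Best of 3 repeats, 5 calls per repeat, reported per call.
timings = {
    name: min(timeit.repeat(func, number=5, repeat=3)) / 5
    for name, func in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per loop")
```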

Deleting part of a string pandas DataFrame

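Try using `pandas.DataFrame.applymap`, which applies a function to a DataFrame elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame. In pandas 2.1 and later, `applymap` is deprecated in favour of `DataFrame.map`, which has the same elementwise behaviour.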

# Keep everything from 'InSight' onwards.
new_df = df.filter(['tweet']).applymap(lambda x: x[x.find('InSight'):])
# Keep the text between the first '-' and 'InSight'.
dates_df = df.filter(['tweet']).applymap(lambda x: x[x.find('-') + 1:x.find('InSight')])
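One caveat with the lambdas above: `str.find` returns -1 when the marker is missing, so `x[x.find('InSight'):]` silently returns only the last character of such a row. A minimal sketch of a guarded version using `Series.map` follows; the tweet texts and the helper `from_marker` are hypothetical.

```python
import pandas as pd

# Hypothetical tweets; the second one does not contain the marker.
df = pd.DataFrame({"tweet": ["x-Jan 5 InSight sol 400", "no marker here"]})


def from_marker(text, marker="InSight"):
    """Return the text from the marker onwards, or the text unchanged if the marker is absent."""
    pos = text.find(marker)
    return text[pos:] if pos != -1 else text


new_col = df["tweet"].map(from_marker)
print(new_col.tolist())
```

Returning the text unchanged is only one possible policy; returning a missing value for rows without the marker would work equally well.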

How to delete a part of a string from pandas DataFrame in Python

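Use str.strip(<unnecessary string>) to remove the unnecessary string. Note that strip treats its argument as a set of characters to remove from both ends, not as a literal substring; that is safe here because the timestamps begin and end with digits: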

df.date = df.date.str.strip('test_')

OUTPUT:

   a                 date
0  2  2021-07-21 04:34:02
1  6  2022-17-21 04:54:22
2  8  2020-06-21 04:34:02
3  9  2023-12-01 11:54:52
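To illustrate the character-set behaviour of `str.strip`, here is a small sketch comparing it with an anchored regex replacement. The input values are hypothetical and assume `test_` appears as a prefix.

```python
import pandas as pd

# Hypothetical values; the second one starts and ends with characters from 'test_'.
s = pd.Series(["test_2021-07-21 04:34:02", "test_settings"])

# strip removes any of the characters 't', 'e', 's', '_' from both ends.
stripped = s.str.strip("test_")

# An anchored regex removes only the literal prefix 'test_'.
prefix_removed = s.str.replace(r"^test_", "", regex=True)

print(stripped.tolist())
print(prefix_removed.tolist())
```

On pandas 1.4 or later, `s.str.removeprefix("test_")` is a regex-free way to get the second result.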

Delete part of string in pandas column

IIUC, you can slice off the last five characters, remove `_R` from that tail only, and concatenate the two pieces back together:

df.Name.str[:-5] + df.Name.str[-5:].replace({'_R':''}, regex=True)

[out]

0        ARR
1        AR2
2     A3412d
3     Asfsvv
4    A_RUUYR
Name: Name, dtype: object
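If the `_R` to remove always sits at the very end of the string, an anchored regex is a possible alternative to slicing. The sketch below uses hypothetical input names chosen to reproduce the output shown above; the original input column is not given in the answer.

```python
import pandas as pd

# Hypothetical inputs that would produce the output shown above.
df = pd.DataFrame({"Name": ["ARR", "AR2_R", "A3412d", "Asfsvv_R", "A_RUUYR"]})

# Slicing approach from the answer: only the last five characters are searched.
sliced = df.Name.str[:-5] + df.Name.str[-5:].replace({"_R": ""}, regex=True)

# Alternative: remove '_R' only when it is the final two characters.
anchored = df.Name.str.replace(r"_R$", "", regex=True)

print(sliced.tolist())
print(anchored.tolist())
```

The two differ when `_R` appears within the last five characters but not at the end: slicing removes it, while the anchored pattern leaves it alone.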
