Remove partial string from dataframe with Pandas
(Bad Answer)
Series.str.split soup
df['str'] = df['str'].str.split('(').str[0].str.split('_').str[-1]
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
(Less Bad answer)
Series.str.extract
df['str'] = df['str'].str.extract(r'_([^_]+)\(', expand=False)
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
Regex methods come with their fair share of overhead, and str.extract
does not do much to make things better.
(Better Answer)
re.search with list comp
import re
p = re.compile(r'(?<=_)[^_]+(?=\()')
df['str'] = [p.search(x)[0] for x in df['str'].tolist()]
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
This should be faster than the methods above. I find list comprehensions are often much faster than pandas' vectorised string methods, even when a regex is involved. I pre-compile the pattern so it does not have to be looked up again on every call. Note that p.search returns None for a string that does not match, so p.search(x)[0] raises a TypeError on such rows.
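If some rows might not contain the `_...(` section, the list comprehension needs a guard. Here is a minimal sketch, assuming unmatched rows should become None. The helper name `extract_token` and the sample strings are hypothetical, not taken from the original question:

```python
import re

import pandas as pd

# Same pattern as above: the run of non-underscore characters between '_' and '('.
p = re.compile(r'(?<=_)[^_]+(?=\()')


def extract_token(text):
    """Return the matched token, or None when the pattern is absent (hypothetical helper)."""
    m = p.search(text)
    return m[0] if m else None


# Hypothetical sample data; the last row has no '_...(' section.
df = pd.DataFrame({'id': [1, 2, 3], 'str': ['row_d(12)', 'col_e(5)', 'no match']})
tokens = [extract_token(x) for x in df['str'].tolist()]
df['str'] = tokens
print(df)
```

Whether None, an empty string, or the original text is the right fallback depends on the data, so treat that choice as an assumption.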
(Also a better answer)
str.split with list comp
df['str'] = [
x.split('(', 1)[0].split('_')[1] for x in df['str'].tolist()
]
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
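This combines the best of both worlds: the low overhead of a list comprehension and the speed of pure-Python string splitting. It should be the fastest. Note that it takes the second underscore-separated field ([1]), so it agrees with the other methods only when there is exactly one underscore before the opening parenthesis.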
Performance
df_test = pd.concat([df] * 10000, ignore_index=True)
%timeit df_test['str'].str.extract(r'_([^_]+)\(', expand=False)
70.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df_test['str'].str.split('(').str[0].str.split('_').str[-1]
99.6 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [p.search(x)[0] for x in df_test['str'].tolist()]
31 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()]
30 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # fastest but not by much
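%timeit only works inside IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard timeit module. The sample strings below are hypothetical stand-ins for the question's data, and absolute timings will vary with the machine and the pandas version:

```python
import re
import timeit

import pandas as pd

# Hypothetical sample data shaped like 'prefix_token(suffix)'.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'str': ['row_d(12)', 'row_d(34)', 'col_e(5)', 'col_b(678)'],
})
df_test = pd.concat([df] * 10000, ignore_index=True)
p = re.compile(r'(?<=_)[^_]+(?=\()')

candidates = {
    'str.extract': lambda: df_test['str'].str.extract(r'_([^_]+)\(', expand=False),
    'str.split chain': lambda: df_test['str'].str.split('(').str[0].str.split('_').str[-1],
    're.search list comp': lambda: [p.search(x)[0] for x in df_test['str'].tolist()],
    'str.split list comp': lambda: [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()],
}

# Check that every approach yields the same tokens before comparing speed.
results = {name: list(func()) for name, func in candidates.items()}
expected = ['d', 'd', 'e', 'b'] * 10000
all_agree = all(values == expected for values in results.values())
print('all approaches agree:', all_agree)

# Best of 3 repeats, 3 calls each, reported as milliseconds per call.
for name, func in candidates.items():
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f'{name}: {best * 1000:.1f} ms per call')
```

Checking agreement first matters because a faster method that returns different tokens is not a fair comparison.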
Deleting part of a string pandas DataFrame
Try using pandas.DataFrame.applymap, which applies a function that accepts and returns a scalar to every element of a DataFrame. (Since pandas 2.1, applymap is deprecated in favour of DataFrame.map, which is called the same way.)
new_df = df.filter(['tweet']).applymap(lambda x: x[x.find('InSight'):])
dates_df = df.filter(['tweet']).applymap(lambda x: x[x.find('-') + 1:x.find('InSight')])
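One thing to watch with this approach is that str.find returns -1 when the marker is missing, which silently produces odd slices. Below is a sketch of the same idea with an explicit guard, applied with a list comprehension over the single column. The tweet text, the new column names, and the helper `split_tweet` are made up for illustration:

```python
import pandas as pd


def split_tweet(text, marker='InSight'):
    """Split a tweet into (date part, body starting at marker); hypothetical helper."""
    start = text.find(marker)
    if start == -1:
        # Marker absent: keep the whole text as the body and leave the date empty.
        return '', text
    dash = text.find('-')
    # If there is no dash, find returns -1, so dash + 1 == 0 and the slice starts at the beginning.
    date_part = text[dash + 1:start].strip()
    return date_part, text[start:]


# Hypothetical sample tweets.
df = pd.DataFrame({'tweet': [
    'NASA - Nov 26 InSight has landed on Mars',
    'No marker in this one',
]})

pairs = [split_tweet(t) for t in df['tweet'].tolist()]
df['date'] = [d for d, _ in pairs]
df['body'] = [b for _, b in pairs]
print(df[['date', 'body']])
```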
How to delete a part of a string from pandas DataFrame in Python
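Use str.replace to remove the unnecessary string (here assumed to be the literal prefix 'test_'). Avoid str.strip('test_') for this: strip treats its argument as a set of characters to trim from both ends, not as a literal prefix, so it only gives the right result when the remaining text neither starts nor ends with one of those characters.
df['date'] = df['date'].str.replace('test_', '', regex=False)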
OUTPUT:
a date
0 2 2021-07-21 04:34:02
1 6 2022-17-21 04:54:22
2 8 2020-06-21 04:34:02
3 9 2023-12-01 11:54:52
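The difference between str.strip and removing a literal prefix shows up as soon as the remaining text starts with one of the stripped characters. A small comparison on hypothetical values, using a regex anchored at the start of the string as one possible prefix removal:

```python
import pandas as pd

# Hypothetical values: after the prefix, the second one continues with characters from {t, e, s, _}.
s = pd.Series(['test_2021-07-21 04:34:02', 'test_set_1'])

# strip trims any of the characters t, e, s, _ from both ends.
stripped = s.str.strip('test_')
# The anchored pattern removes only a leading literal 'test_'.
anchored = s.str.replace(r'^test_', '', regex=True)

print(stripped.tolist())  # ['2021-07-21 04:34:02', '1']
print(anchored.tolist())  # ['2021-07-21 04:34:02', 'set_1']
```

The date string survives both calls only because it begins and ends with digits.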
Delete part of string in pandas column
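If I understand correctly, you can split each name into everything before the last five characters and the last five characters, remove '_R' from that tail only, and concatenate the two parts again: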
df.Name.str[:-5] + df.Name.str[-5:].replace({'_R':''}, regex=True)
[out]
0 ARR
1 AR2
2 A3412d
3 Asfsvv
4 A_RUUYR
Name: Name, dtype: object
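To make the slicing concrete, here is a sketch on hypothetical inputs chosen so that the result matches the output above. It uses .str.replace with regex=False on the tail rather than Series.replace, which should give the same result for a literal '_R':

```python
import pandas as pd

# Hypothetical inputs, picked to reproduce the output shown above.
df = pd.DataFrame({'Name': ['ARR_R', 'AR2_R', 'A3412d_R', 'Asfsvv_R', 'A_RUUYR_R']})

# Everything except the last five characters is left untouched.
head = df['Name'].str[:-5]
# '_R' is removed from the last five characters only.
tail = df['Name'].str[-5:].str.replace('_R', '', regex=False)

cleaned = head + tail
print(cleaned.tolist())  # ['ARR', 'AR2', 'A3412d', 'Asfsvv', 'A_RUUYR']
```

The last row shows why the head is kept separate: the '_R' near the start of 'A_RUUYR_R' is preserved.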