Python pandas: remove everything after a delimiter in a string
You can use pandas.Series.str.split just like you would use split normally. Just split on the string '::' and index the list that's created from the split method:
>>> df = pd.DataFrame({'text': ["vendor a::ProductA", "vendor b::ProductA", "vendor a::Productb"]})
>>> df
text
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
>>> df['text_new'] = df['text'].str.split('::').str[0]
>>> df
text text_new
0 vendor a::ProductA vendor a
1 vendor b::ProductA vendor b
2 vendor a::Productb vendor a
Here's a solution that uses a plain list comprehension instead of the pandas string methods:
>>> df['text_new1'] = [x.split('::')[0] for x in df['text']]
>>> df
text text_new text_new1
0 vendor a::ProductA vendor a vendor a
1 vendor b::ProductA vendor b vendor b
2 vendor a::Productb vendor a vendor a
Edit: Here's a step-by-step explanation of what's happening in the pandas solution above:
# Select the pandas.Series object you want
>>> df['text']
0 vendor a::ProductA
1 vendor b::ProductA
2 vendor a::Productb
Name: text, dtype: object
# using pandas.Series.str allows us to implement "normal" string methods
# (like split) on a Series
>>> df['text'].str
<pandas.core.strings.StringMethods object at 0x110af4e48>
# Now we can use the split method to split on our '::' string. You'll see that
# a Series of lists is returned (just like what you'd see outside of pandas)
>>> df['text'].str.split('::')
0 [vendor a, ProductA]
1 [vendor b, ProductA]
2 [vendor a, Productb]
Name: text, dtype: object
# using the pandas.Series.str method, again, we will be able to index through
# the lists returned in the previous step
>>> df['text'].str.split('::').str
<pandas.core.strings.StringMethods object at 0x110b254a8>
# now we can grab the first item in each list above for our desired output
>>> df['text'].str.split('::').str[0]
0 vendor a
1 vendor b
2 vendor a
Name: text, dtype: object
I would suggest checking out the pandas.Series.str docs, or, better yet, Working with Text Data in pandas.
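The answer above does not cover two edge cases: rows that lack the delimiter and rows with missing values. The following is a minimal sketch of both. The sample data is made up for illustration, and passing n=1 is an optional tweak (not part of the original answer) that stops splitting after the first '::'.

```python
import pandas as pd

# Hypothetical sample data: one row without '::', one missing value,
# and one row that contains the delimiter twice
s = pd.Series(["vendor a::ProductA", "vendor b", None, "vendor c::x::y"])

# n=1 splits at most once, so only the first '::' matters
before = s.str.split("::", n=1).str[0]

print(before.tolist())
```

Rows without the delimiter come back unchanged, and missing values stay missing rather than raising an error.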
Pandas, remove everything after last '_'
Use a combination of str.rsplit and str.get for your desired outcome. str.rsplit splits a string starting from the end, while str.get extracts the nth element from each item in a pd.Series object.
Answer
df['SOURCE_NAME'] = df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
The n argument in rsplit limits the number of splits, so with n=1 you keep everything before the last '_'.
Even though a solution using pd.Series.apply is about twice as fast, I like this one because it is more expressive in its syntax. If you want to use the (faster) pd.Series.apply solution, check the timing part! See also the pandas documentation.
Example
strs = ['Stackoverflow_1234',
'Stack_Over_Flow_1234',
'Stackoverflow',
'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})
This results in
print(df)
SOURCE_NAME
0 Stackoverflow_1234
1 Stack_Over_Flow_1234
2 Stackoverflow
3 Stack_Overflow_1234
Using the proposed solution:
df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
0 Stackoverflow
1 Stack_Over_Flow
2 Stackoverflow
3 Stack_Overflow
Name: SOURCE_NAME, dtype: object
Time
Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply:
import pandas as pd
df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})
%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# increasing the number of rows x 100
df = pd.concat([df] * 100)
%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
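The %timeit magic only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run in a plain Python script with the standard timeit module. The helper names with_apply and with_str are made up for this example, and the absolute numbers will vary with machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"SOURCE_NAME": ["stackoverflow_1234_abcd"] * 1000})


def with_apply():
    # Plain Python rsplit applied element by element
    return df["SOURCE_NAME"].apply(lambda x: x.rsplit("_", 1)[0])


def with_str():
    # Same operation through the pandas .str accessor
    return df["SOURCE_NAME"].str.rsplit("_", n=1).str.get(0)


# Both approaches should produce the same values
assert with_apply().tolist() == with_str().tolist()

# Average time per call over 100 runs
for func in (with_apply, with_str):
    seconds = timeit.timeit(func, number=100)
    print(f"{func.__name__}: {seconds / 100 * 1e6:.0f} µs per call")
```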
How to remove everything after the last occurrence of a character in a Dataframe?
You could do:
df_temp = df.apply(lambda x: x.str.split('.').str[:-1].str.join('.'))
output:
EQ1 EQ2 EQ3
0 Apple Oranage.eatable NaN
1 Pear.eatable Banana NaN
2 Orange Tomato Potato.eatable
3 Kiwi Pear Cabbage
see the string method docs
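One thing to watch with the split / slice / join approach: a value that contains no '.' ends up as an empty string, because slicing off the last piece leaves nothing to join. The sketch below uses made-up data (the original question's input is not shown here) to compare it with rsplit and n=1, which leaves such values unchanged.

```python
import pandas as pd

# Hypothetical data: some values contain '.', one does not, one is missing
df = pd.DataFrame({
    "EQ1": ["Apple.fruit", "Kiwi"],
    "EQ2": ["Orange.eatable.raw", None],
})

# split / slice / join: a value without '.' collapses to an empty string
joined = df.apply(lambda col: col.str.split(".").str[:-1].str.join("."))

# rsplit with n=1: a value without '.' is returned unchanged
trimmed = df.apply(lambda col: col.str.rsplit(".", n=1).str[0])

print(joined["EQ1"].tolist())   # ['Apple', '']
print(trimmed["EQ1"].tolist())  # ['Apple', 'Kiwi']
```

Which behaviour is preferable depends on whether every value is guaranteed to contain the delimiter.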
How can I remove string after last underscore in python dataframe?
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply to loop through the column you want to edit. The lambda splits each string at _ and then rejoins every part except the last with _:
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values without an underscore (like AA), the lambda above turns them into empty strings. Change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
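As an alternative sketch (not part of the original answer), rsplit with a maxsplit of 1 handles both cases without the conditional, since a value with no underscore comes back as a one-element list:

```python
import pandas as pd

df = pd.DataFrame({"col": ["AA_XX", "AAA_BB_XX", "AA_BB_XYX", "AA_A_B_YXX", "AA"]})

# rsplit with maxsplit=1 cuts only at the last underscore; a value with
# no underscore yields a one-element list, so [0] returns it unchanged
df["col"] = df["col"].apply(lambda r: r.rsplit("_", 1)[0])

print(df["col"].tolist())
# ['AA', 'AAA_BB', 'AA_BB', 'AA_A_B', 'AA']
```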
Remove values before and after special character
To remove the values that come before the first '_' and after the second '_', essentially keeping the middle, you can use .str.extract() with a regex, as follows:
df['col1'] = df['col1'].str.extract(r'\w*?_([^_]*)(?:_)?')
Result:
print(df)
col1 col2
0 bu1 dd
1 lap d
2 lap d
3 bb dd
Edit
To extract also the digits at the end, you can do:
s_df = df['col1'].str.split('_', expand=True)
s_df[2] = s_df[2].str.extract(r'(\d+)$').fillna('')
df['col1'] = s_df[1] + s_df[2]
Result:
print(df)
col1 col2
0 bu1 dd
1 lap1 d
2 lap2 d
3 bb1 dd
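Since the question's input data is not reproduced here, the following sketch runs both steps on a made-up col1 whose values are chosen to give results like those shown above. The variable names middle, parts, digits and combined are illustrative, and expand=False is passed so that str.extract returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical input; the original question's data is not shown here
df = pd.DataFrame({"col1": ["ab_bu1_cd", "ab_lap_cd1", "ab_lap_cd2", "ab_bb_cd1"]})

# Keep only the part between the first and second underscore
middle = df["col1"].str.extract(r"\w*?_([^_]*)(?:_)?", expand=False)
print(middle.tolist())    # ['bu1', 'lap', 'lap', 'bb']

# Split into three parts, then append any trailing digits of the last part
parts = df["col1"].str.split("_", expand=True)
digits = parts[2].str.extract(r"(\d+)$", expand=False).fillna("")
combined = parts[1] + digits
print(combined.tolist())  # ['bu1', 'lap1', 'lap2', 'bb1']
```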
Remove ends of string entries in pandas DataFrame column
I think you can use str.replace with a regex ending in .txt$ ($ matches the end of the string):
import pandas as pd
df = pd.DataFrame({'A': {0: 2, 1: 1},
'C': {0: 5, 1: 1},
'B': {0: 4, 1: 2},
'filename': {0: "txt.txt", 1: "x.txt"}},
columns=['filename','A','B', 'C'])
print(df)
filename A B C
0 txt.txt 2 4 5
1 x.txt 1 2 1
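# regex=True is required in pandas >= 2.0, where str.replace defaults to
# literal matching; the dot is escaped so that it matches a literal '.'
df['filename'] = df['filename'].str.replace(r'\.txt$', '', regex=True)
print(df)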
filename A B C
0 txt 2 4 5
1 x 1 2 1
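# Alternative: drop the last four characters of every value
df['filename'] = df['filename'].map(lambda x: str(x)[:-4])
print(df)
filename A B C
0 txt 2 4 5
1 x 1 2 1
# Alternative: the same slice through the .str accessor
df['filename'] = df['filename'].str[:-4]
print(df)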
filename A B C
0 txt 2 4 5
1 x 1 2 1
EDIT:
rstrip can remove more characters than intended, because its argument is treated as a set of characters rather than a suffix: it strips any trailing run of those characters (in this case ., t and x):
Example:
print(df)
filename A B C
0 txt.txt 2 4 5
1 x.txt 1 2 1
df['filename'] = df['filename'].str.rstrip('.txt')
print(df)
filename A B C
0 2 4 5
1 1 2 1
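If a newer pandas is available, Series.str.removesuffix avoids the rstrip pitfall by removing the exact suffix at most once. This is a sketch that assumes pandas 1.4 or later (not something the original answer relied on), and the extra 'notes.csv' value is made up to show a non-matching row.

```python
import pandas as pd

s = pd.Series(["txt.txt", "x.txt", "notes.csv"])

# rstrip removes any trailing run of the characters '.', 't' and 'x'
stripped = s.str.rstrip(".txt")
print(stripped.tolist())  # ['', '', 'notes.csv']

# removesuffix removes the exact suffix '.txt' once, or nothing at all
# (assumes pandas >= 1.4)
trimmed = s.str.removesuffix(".txt")
print(trimmed.tolist())   # ['txt', 'x', 'notes.csv']
```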
Remove Last instance of a character and rest of a string
result = my_string.rsplit('_', 1)[0]
Which behaves like this:
>>> my_string = 'foo_bar_one_two_three'
>>> print(my_string.rsplit('_', 1)[0])
foo_bar_one_two
See the documentation entry for str.rsplit([sep[, maxsplit]]).
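A short sketch of how this behaves when the delimiter is missing, with str.rpartition shown for contrast (rpartition is not part of the original answer):

```python
my_string = "foo_bar_one_two_three"

# maxsplit=1 performs a single split, starting from the right
pieces = my_string.rsplit("_", 1)
print(pieces)  # ['foo_bar_one_two', 'three']

# Without the delimiter, rsplit returns the whole string as the only item
no_delim = "foo".rsplit("_", 1)[0]
print(no_delim)  # foo

# rpartition instead returns an empty head when the delimiter is missing
head = "foo".rpartition("_")[0]
print(repr(head))  # ''
```

So rsplit(...)[0] keeps strings without the delimiter intact, while rpartition(...)[0] would blank them out.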
Removing everything after a char in a dataframe
Try:
countries['info'] = countries['info'].str.split('-').str[0]
Output:
country info
0 england london
1 scotland edinburgh
2 china beijing
3 unitedstates washington