Remove certain string from entire column in pandas dataframe
You can use string slicing and then convert to a numeric type via pd.to_numeric:
df['Grade'] = pd.to_numeric(df['Grade'].astype(str).str[:-1], errors='coerce')
Conversion to float is recommended, as a series of strings will be held in a generic and inefficient object dtype, while numeric types permit vectorised operations.
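As a minimal sketch of the slicing-plus-conversion step, assuming a hypothetical Grade column whose values carry a trailing "%" (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: grades stored as strings with a trailing "%".
df = pd.DataFrame({"Grade": ["85%", "90.5%", "n/a"]})

# Drop the last character of every value, then convert to a numeric dtype.
# errors="coerce" turns anything that still cannot be parsed into NaN.
df["Grade"] = pd.to_numeric(df["Grade"].astype(str).str[:-1], errors="coerce")

print(df["Grade"].dtype)  # float64
```

Note that the slice removes the last character unconditionally, so this only fits columns where every value really ends with the unwanted character.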
remove words starting with "@" in a column from a dataframe
Use str.replace to remove every string starting with @.
Sample Data
text
0 News via @livemint: @RBI bars banks from links
1 Newsfeed from @oayments_source: How Africa
2 is that bitcoin? not my thing
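# regex=True is required in pandas 2.0+, where str.replace matches literally by default
tweetscrypto['clean_text']=tweetscrypto['text'].str.replace(r'(\@\w+.*?)', "", regex=True)
Still, @ can be captured without escaping, as noted by @baxx:
tweetscrypto['clean_text']=tweetscrypto['text'].str.replace(r'(@\w+.*?)', "", regex=True)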
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
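A self-contained sketch of the same idea, which also collapses the double spaces left behind by the removed mentions. The trailing .*? in the pattern above is lazy and matches nothing, so the shorter @\w+ is used here; the whitespace clean-up is an addition, not part of the original answer:

```python
import pandas as pd

# Two of the sample rows from above.
tweets = pd.DataFrame({"text": [
    "News via @livemint: @RBI bars banks from links",
    "is that bitcoin? not my thing",
]})

tweets["clean_text"] = (
    tweets["text"]
    .str.replace(r"@\w+", "", regex=True)      # drop "@" plus the word characters after it
    .str.replace(r"\s{2,}", " ", regex=True)   # collapse runs of whitespace left behind
    .str.strip()
)

print(tweets["clean_text"].tolist())
```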
Best way to remove specific words from column in pandas dataframe?
This is an enhancement to @tdy's regex solution. The original regex Family|Drama will match the words "Family" and "Drama" anywhere in the string. If a book title contains one of the words in genres, that word will be removed from the title as well.
Supposing that the labels are separated by " | ", there are three match conditions we want to remove:
- Genre at the start of the string, e.g. Drama | ...
- Genre in the middle, e.g. ... | Drama | ...
- Genre at the end of the string, e.g. ... | Drama
Use the regex (^|\| )(?:Family|Drama)(?=( \||$)) to match any of the three conditions. Note that | Drama | Family contains two overlapping matches; the lookahead (?=( \||$)) checks for the trailing separator without consuming it, so that both are matched rather than only the first. See [Use regular expressions to replace overlapping subpatterns] for more details.
>>> genres = ["Family", "Drama"]
>>> df
# Book Labels
# 0 Drama | Drama 123 | Family
# 1 Drama 123 | Drama | Family
# 2 Drama | Family | Drama 123
# 3 123 Drama 123 | Family | Drama
# 4 Drama | Family | 123 Drama
>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))
>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)
# 0 | Drama 123
# 1 Drama 123
# 2 | Drama 123
# 3 123 Drama 123
# 4 | 123 Drama
>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")
# 0 Drama 123
# 1 Drama 123
# 2 Drama 123
# 3 123 Drama 123
# 4 123 Drama
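The two steps above can be put together as one runnable sketch. It assumes the " | " separator described earlier and adds re.escape, which is not in the original answer, in case a genre name contains regex metacharacters:

```python
import re
import pandas as pd

genres = ["Family", "Drama"]
df = pd.DataFrame({"Book Labels": [
    "Drama | Drama 123 | Family",
    "Drama 123 | Drama | Family",
    "123 Drama 123 | Family | Drama",
]})

# Escape each genre so it is matched literally, then join into one alternation.
alternatives = "|".join(re.escape(g) for g in genres)
# A genre only matches as a whole label: preceded by start-of-string or "| ",
# and followed (lookahead, not consumed) by " |" or end-of-string.
pattern = rf"(^|\| )(?:{alternatives})(?=( \||$))"

df["Book Labels"] = (
    df["Book Labels"]
    .str.replace(pattern, "", regex=True)
    .str.strip("| ")   # remove leftover separators and spaces at both ends
)

print(df["Book Labels"].tolist())
```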
How to remove text between two specific words in a dataframe column by python
You need to use Series.str.replace directly:
df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)
Here, (?s) is the inline modifier version of re.DOTALL / re.S, which saves you from having to import re. The .*? matches any zero or more chars, as few as possible.
If Original and Subject need to be passed as variables containing literal text, do not forget about re.escape:
import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
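To see what the (?s) modifier changes, here is a small sketch on made-up multi-line text (the sample strings are assumptions for illustration). Without the modifier, the dot does not match a newline, so a match spanning several lines is not found:

```python
import pandas as pd

# Hypothetical data: the text between the two markers spans several lines.
df = pd.DataFrame({"textcol": [
    "Hello\nOriginal message\nfrom A\nSubject: hi",
    "No markers here",
]})

# With (?s), "." also matches newlines, so the whole span is removed.
with_dotall = df["textcol"].str.replace(r"(?s)Original.*?Subject", "", regex=True)

# Without it, "." stops at the first newline and nothing matches.
without_dotall = df["textcol"].str.replace(r"Original.*?Subject", "", regex=True)

print(with_dotall.tolist())
print(without_dotall.tolist())
```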
Remove words in each row in a column of dataframe from another list of words in a column of another dataframe
If you want to remove just the word in the corresponding row of df2, you could do that as follows, but it will probably be slow for large data sets, because it can only partially use fast C implementations:
# define your helper function to remove the string
def remove_string(ser_row):
    return ser_row['cust_text'].replace(ser_row['remove'], '')
# create a temporary column with the string to remove in the first dataframe
df1['remove'] = df2['column1']
# apply row-wise; the result is a Series, so keep it separate from df1
result = df1.apply(remove_string, axis='columns')
# drop the temporary column afterwards
df1.drop(columns=['remove'], inplace=True)
result
The result looks like:
Out[145]:
0 hi fine i to go
1 i need lines hold
2 i have the 60 packs
3 can you teach
dtype: object
If, however, you want to remove all words in your df2 column from every row, you need to do it differently. Unfortunately, str.replace does not help here with regular strings, unless you want to call it for every row in your second dataframe.
So if your second dataframe is not too large, you can create a regular expression to make use of str.replace.
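import re
# escape each word so that regex metacharacters in it are matched literally
replace=re.compile(r'\b(' + ('|'.join(map(re.escape, df2['column1']))) + r')\b')
df1['cust_text'].str.replace(replace, '', regex=True)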
The output is:
Out[184]:
0 hi fine i to
1 i lines hold
2 i the 60 packs
3 can you teach
Name: cust_text, dtype: object
If you don't like the repeated spaces that remain, you can just perform something like:
df1['cust_text'].str.replace(replace, '', regex=True).str.replace(r'\s{2,}', ' ', regex=True)
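A self-contained sketch of the word-list approach, with made-up contents for df1 and df2 (the column names follow the answer; the sample rows are assumptions). The \b anchors keep the pattern from removing parts of longer words:

```python
import re
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame({"cust_text": ["hi fine i want to go", "can you teach me"]})
df2 = pd.DataFrame({"column1": ["want", "go", "me"]})

# One alternation of all words, each escaped and bounded by word boundaries.
pattern = r"\b(?:" + "|".join(map(re.escape, df2["column1"])) + r")\b"

cleaned = (
    df1["cust_text"]
    .str.replace(pattern, "", regex=True)      # remove every listed word
    .str.replace(r"\s{2,}", " ", regex=True)   # collapse the doubled spaces
    .str.strip()                               # trim spaces left at the ends
)

print(cleaned.tolist())
```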
Addition: what if not only the text without the words is relevant, but the removed words themselves as well? How can we get the words which were replaced? Here is one attempt, which works if one character can be identified which will not appear in the text. Let's assume this character is @; then you could do (on the original column value, without replacement):
# enclose each keyword in @
ser_matched= df1['cust_text'].replace({replace: r'@\1@'}, regex=True)
# now remove the rest of the line, which is unmatched;
# this is the part of the string after the last occurrence
# of a @
ser_matched= ser_matched.replace({r'^(.*)@.*$': r'\1', '^@': ''}, regex=True)
# and if you like your keywords to be in a list, rather than a string
# you can split the string at last
ser_matched.str.split(r'@+')
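As a different technique from the marker-character attempt above, Series.str.findall can collect the matched words directly, without needing a character that is guaranteed to be absent from the text. This is only a sketch; the words list and sample rows are hypothetical:

```python
import re
import pandas as pd

# Hypothetical sample data and word list.
df1 = pd.DataFrame({"cust_text": ["hi fine i want to go", "can you teach"]})
words = ["want", "go"]

# Non-capturing group, so findall returns the full matched words.
pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

# One list of matched words per row; rows without a match give an empty list.
matched = df1["cust_text"].str.findall(pattern)

print(matched.tolist())
```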
Removing words from strings within a column dataframe
Use Series.str.split with n=4 to split off the first four words, then take the last piece with .str[-1]:
df['test'].str.split(n=4).str[-1]
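A short sketch of what this does on made-up strings (the sample values are assumptions). With n=4 the string is split on whitespace at most four times, so the last element holds everything after the first four words:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"test": ["one two three four five six", "a b c d e", "a b"]})

# Split at most 4 times on whitespace, then keep the last piece of each row.
df["rest"] = df["test"].str.split(n=4).str[-1]

print(df["rest"].tolist())
```

Rows with four words or fewer never reach the fifth piece, so for them the last word is returned rather than an empty string.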
How to remove words in pandas data frame column which match with words in another column
Use the set difference of the split values per row with apply:
f=lambda x: ', '.join(set(x['Country'].split(', ')).difference(set(x['Exclude'].split(', '))))
df['Out'] = df.apply(f, axis=1)
Or a list comprehension with zip:
df['Out'] = ([', '.join(set(a.split(', ')).difference(set(b.split(', '))))
for a, b in zip(df['Country'], df['Exclude'])])
print (df)
Country Exclude \
0 Germany, France, Brazil, India, Russia France, Brazil
1 Russia, France, Jamaica, India, China India, Russia
2 Germany, Russia, Jamaica Jamaica
3 Italy, Jamaica Italy
Out
0 Germany, India, Russia
1 China, France, Jamaica
2 Germany, Russia
3 Jamaica
If order is important:
df['Out'] = [', '.join(x for x in a.split(', ') if x not in set(b.split(', ')))
for a, b in zip(df['Country'], df['Exclude'])]
print (df)
Country Exclude \
0 Germany, France, Brazil, India, Russia France, Brazil
1 Russia, France, Jamaica, India, China India, Russia
2 Germany, Russia, Jamaica Jamaica
3 Italy, Jamaica Italy
Out
0 Germany, India, Russia
1 France, Jamaica, China
2 Germany, Russia
3 Jamaica
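The order-preserving variant can also be written with a small helper, sketched here under the assumption that both columns use ", " as the separator. The helper name remove_excluded is made up for illustration; it builds the exclusion set once per row instead of once per item:

```python
import pandas as pd

def remove_excluded(country, exclude, sep=", "):
    """Return the items of `country` not listed in `exclude`, keeping their order."""
    excluded = set(exclude.split(sep))  # built once per row
    return sep.join(x for x in country.split(sep) if x not in excluded)

# Two of the sample rows from above.
df = pd.DataFrame({
    "Country": ["Russia, France, Jamaica, India, China", "Italy, Jamaica"],
    "Exclude": ["India, Russia", "Italy"],
})

df["Out"] = [remove_excluded(a, b) for a, b in zip(df["Country"], df["Exclude"])]

print(df["Out"].tolist())
```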
How to replace text in a string column of a Pandas dataframe?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
From the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced with value
So because the str values do not match exactly, no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
Here we get an exact match on the second row and the replacement occurs.
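The difference between the two methods can be sketched side by side. This example also shows Series.replace with regex=True, which is not used in the answer above but does perform substring replacement:

```python
import pandas as pd

s = pd.Series(["(2,30)", ","])

# Series.replace without regex: only values that equal "," exactly are replaced.
whole_cell = s.replace(",", "-")

# Series.replace with regex=True: the pattern is substituted inside each string.
substring = s.replace(",", "-", regex=True)

# Series.str.replace always works on substrings; regex=False treats "," literally.
vectorised = s.str.replace(",", "-", regex=False)

print(whole_cell.tolist())
print(substring.tolist())
print(vectorised.tolist())
```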