How to Remove Words in a Column in Pandas

Remove certain string from entire column in pandas dataframe

You can use string slicing and then convert to a numeric type via pd.to_numeric:

df['Grade'] = pd.to_numeric(df['Grade'].astype(str).str[:-1], errors='coerce')

Conversion to float is recommended as a series of strings will be held in a generic and inefficient object dtype, while numeric types permit vectorised operations.
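
As a minimal illustration (the data here is made up; the assumption is that each Grade value ends in a single non-numeric character to strip):

import pandas as pd

# hypothetical grades stored as strings with a trailing symbol
df = pd.DataFrame({'Grade': ['56%', '74%', None, '88%']})

df['Grade'] = pd.to_numeric(df['Grade'].astype(str).str[:-1], errors='coerce')
print(df['Grade'])
# 0    56.0
# 1    74.0
# 2     NaN
# 3    88.0
# Name: Grade, dtype: float64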

remove words starting with "@" in a column from a dataframe

Use str.replace with a pattern that matches words starting with @.

Sample Data

                                       text
0 News via @livemint: @RBI bars banks from links
1 Newsfeed from @oayments_source: How Africa
2 is that bitcoin? not my thing


tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(\@\w+.*?)', '', regex=True)

As noted by @baxx, the @ does not need to be escaped:

tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(@\w+.*?)', '', regex=True)

clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
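
For reference, a self-contained version of the above (the DataFrame is rebuilt here from the sample rows shown, and regex=True is passed explicitly, which recent pandas versions require):

import pandas as pd

tweetscrypto = pd.DataFrame({'text': [
    'News via @livemint: @RBI bars banks from links',
    'Newsfeed from @oayments_source: How Africa',
    'is that bitcoin? not my thing',
]})

# remove every token starting with @
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(@\w+.*?)', '', regex=True)
print(tweetscrypto['clean_text'].tolist())
# ['News via :  bars banks from links', 'Newsfeed from : How Africa', 'is that bitcoin? not my thing']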

Best way to remove specific words from column in pandas dataframe?

This is an enhancement to @tdy's regex solution. The original regex Family|Drama will match the words "Family" and "Drama" anywhere in the string, so if a book title itself contains one of the genre words, that part of the title gets removed as well.

Assuming the labels are separated by " | ", there are three match positions we want to handle.

  1. Genre at the start of the string, e.g. Drama | ...
  2. Genre in the middle, e.g. ... | Drama | ...
  3. Genre at the end of the string, e.g. ... | Drama

Use the regex (^|\| )(?:Family|Drama)(?=( \||$)) to match any of the three conditions. Note that | Drama | Family contains two overlapping matches; the lookahead (?=( \||$)) leaves the trailing delimiter unconsumed so that both can be removed instead of only the first. See this question [Use regular expressions to replace overlapping subpatterns] for more details.

>>> genres = ["Family", "Drama"]

>>> df

# Book Labels
# 0 Drama | Drama 123 | Family
# 1 Drama 123 | Drama | Family
# 2 Drama | Family | Drama 123
# 3 123 Drama 123 | Family | Drama
# 4 Drama | Family | 123 Drama

>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))

>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)

# 0 | Drama 123
# 1 Drama 123
# 2 | Drama 123
# 3 123 Drama 123
# 4 | 123 Drama

>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")

# 0 Drama 123
# 1 Drama 123
# 2 Drama 123
# 3 123 Drama 123
# 4 123 Drama
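
Putting it together, a self-contained sketch (the DataFrame construction is assumed from the repr above, since the original only shows the printed result):

import pandas as pd

df = pd.DataFrame({'Book Labels': [
    'Drama | Drama 123 | Family',
    'Drama 123 | Drama | Family',
    'Drama | Family | Drama 123',
    '123 Drama 123 | Family | Drama',
    'Drama | Family | 123 Drama',
]})
genres = ["Family", "Drama"]

# remove whole genre labels only, then tidy up leftover delimiters
re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))
df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True).str.strip("| ")
print(df['Book Labels'].tolist())
# ['Drama 123', 'Drama 123', 'Drama 123', '123 Drama 123', '123 Drama']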

How to remove text between two specific words in a dataframe column by python

You need to use Series.str.replace directly:

df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)

Here, (?s) is the inline version of the re.DOTALL / re.S flag, so there is no need to import re just to pass it; it makes . also match newlines. The .*? matches any zero or more chars, as few as possible.

If Original and Subject need to be passed as variables containing literal text, do not forget about re.escape:

import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
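
For example, with a hypothetical column (illustrative data, not from the original question):

import pandas as pd
import re

df = pd.DataFrame({'textcol': [
    'Keep this Original message\nquoted lines here Subject and keep this too'
]})

start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(
    fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
print(df['textcol'][0])
# Keep this  and keep this too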

Remove words in each row in a column of dataframe from another list of words in a column of another dataframe

If you want to remove just the word in the corresponding line of df2, you could do that as follows, but it will probably be slow for large data sets, because it can only partially use fast C implementations:

# define your helper function to remove the string
def remove_string(ser_row):
    return ser_row['cust_text'].replace(ser_row['remove'], '')

# create a temporary column with the string to remove in the first dataframe
df1['remove'] = df2['column1']
# apply the helper row-wise; store the output in a new Series so df1 keeps its columns
result = df1.apply(remove_string, axis='columns')
# drop the temporary column afterwards
df1.drop(columns=['remove'], inplace=True)

The resulting Series looks like:

Out[145]: 
0 hi fine i to go
1 i need lines hold
2 i have the 60 packs
3 can you teach
dtype: object

If, however, you want to remove all words in your df2 column from every row, you need to do it differently. Unfortunately str.replace does not help here with plain strings, unless you want to call it once per row of your second dataframe.
So if your second dataframe is not too large, you can build a single regular expression and make use of str.replace.

import re
replace = re.compile(r'\b(' + '|'.join(df2['column1']) + r')\b')
df1['cust_text'].str.replace(replace, '', regex=True)

The output is:

Out[184]: 
0 hi fine i to
1 i lines hold
2 i the 60 packs
3 can you teach
Name: cust_text, dtype: object

If you don't like the repeated spaces that remain, you can collapse them afterwards:

df1['cust_text'].str.replace(replace, '', regex=True).str.replace(r'\s{2,}', ' ', regex=True)

Addition: what if not only the text without the words is relevant, but also the words themselves, i.e. how can we get the words that were replaced? Here is one attempt, which works if there is a character that is guaranteed not to appear in the text. Let's assume this character is @; then you could do the following (on the original column values, before replacement):

# enclose each keyword in @
ser_matched = df1['cust_text'].replace({replace: r'@\1@'}, regex=True)
# now remove the rest of the line, which is unmatched
# this is the part of the string after the last occurrence
# of a @
ser_matched = ser_matched.replace({r'^(.*)@.*$': r'\1', '^@': ''}, regex=True)
# and if you like your keywords to be in a list, rather than a string
# you can split the string at last
ser_matched.str.split(r'@+')
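
A compact end-to-end sketch of the regex approach with made-up data (the question's actual frames are not shown in the answer):

import pandas as pd
import re

# hypothetical frames: remove every word in df2['column1'] from df1['cust_text']
df1 = pd.DataFrame({'cust_text': ['please call me back', 'thanks for the call']})
df2 = pd.DataFrame({'column1': ['call', 'back']})

replace = re.compile(r'\b(' + '|'.join(df2['column1']) + r')\b')
cleaned = df1['cust_text'].str.replace(replace, '', regex=True)
# collapse repeated spaces and trim the edges as well
cleaned = cleaned.str.replace(r'\s{2,}', ' ', regex=True).str.strip()
print(cleaned.tolist())
# ['please me', 'thanks for the']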

Removing words from strings within a column dataframe

Use Series.str.split with the n parameter to limit the number of splits, then take the last element with the .str accessor:

df['test'].str.split(n=4).str[-1]
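
For instance, with a made-up column, this keeps everything after the first four words (i.e. the first four words are removed):

import pandas as pd

df = pd.DataFrame({'test': ['the quick brown fox jumps over the lazy dog']})

print(df['test'].str.split(n=4).str[-1])
# 0    jumps over the lazy dog
# Name: test, dtype: object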

How to remove words in pandas data frame column which match with words in another column

Use the set difference of the split values per row with apply:

f=lambda x: ', '.join(set(x['Country'].split(', ')).difference(set(x['Exclude'].split(', '))))
df['Out'] = df.apply(f, axis=1)

Or list comprehension with zip:

df['Out'] = ([', '.join(set(a.split(', ')).difference(set(b.split(', ')))) 
for a, b in zip(df['Country'], df['Exclude'])])

print (df)
                                  Country         Exclude  \
0  Germany, France, Brazil, India, Russia  France, Brazil
1   Russia, France, Jamaica, India, China   India, Russia
2                Germany, Russia, Jamaica         Jamaica
3                           Italy, Jamaica           Italy

                      Out
0  Germany, India, Russia
1  China, France, Jamaica
2         Germany, Russia
3                 Jamaica

If order is important:

df['Out'] = [', '.join(x for x in a.split(', ') if x not in set(b.split(', '))) 
for a, b in zip(df['Country'], df['Exclude'])]
print (df)
                                  Country         Exclude  \
0  Germany, France, Brazil, India, Russia  France, Brazil
1   Russia, France, Jamaica, India, China   India, Russia
2                Germany, Russia, Jamaica         Jamaica
3                           Italy, Jamaica           Italy

                      Out
0  Germany, India, Russia
1  France, Jamaica, China
2         Germany, Russia
3                 Jamaica
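
For completeness, the df used above can be rebuilt from the printed values like this:

import pandas as pd

df = pd.DataFrame({
    'Country': ['Germany, France, Brazil, India, Russia',
                'Russia, France, Jamaica, India, China',
                'Germany, Russia, Jamaica',
                'Italy, Jamaica'],
    'Exclude': ['France, Brazil', 'India, Russia', 'Jamaica', 'Italy'],
})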

How to replace text in a string column of a Pandas dataframe?

Use the vectorised str method replace:

df['range'] = df['range'].str.replace(',','-')

df
range
0 (2-30)
1 (50-290)

EDIT: let's look at what you tried and why it didn't work:

df['range'].replace(',','-',inplace=True)

from the docs we see this description:

str or regex: str: string exactly matching to_replace will be replaced
with value

Because the string values do not match exactly, no replacement occurs; compare with the following:

df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)

df['range']

0 (2,30)
1 -
Name: range, dtype: object

here we get an exact match on the second row and the replacement occurs.


