How to Drop Rows from Pandas Data Frame That Contains a Particular String in a Particular Column

How to drop rows from pandas data frame that contains a particular string in a particular column?

pandas has vectorized string operations, so you can just filter out the rows that contain the string you don't want:

In [91]: df = pd.DataFrame(dict(A=[5,3,5,6], C=["foo","bar","fooXYZbar", "bat"]))

In [92]: df
Out[92]:
   A          C
0  5        foo
1  3        bar
2  5  fooXYZbar
3  6        bat

In [93]: df[~df.C.str.contains("XYZ")]
Out[93]:
   A    C
0  5  foo
1  3  bar
3  6  bat
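If the column may hold missing values, or the search text contains regex metacharacters, two optional arguments of str.contains are worth knowing. The following is a minimal sketch; the extra row with a missing value is added here for illustration and is not part of the example above.

```python
import pandas as pd

# Same data as above, plus one hypothetical row whose C value is missing.
df = pd.DataFrame({
    "A": [5, 3, 5, 6, 7],
    "C": ["foo", "bar", "fooXYZbar", "bat", None],
})

# regex=False treats "XYZ" as a literal substring rather than a pattern.
# na=False makes missing values count as "does not contain", so the
# mask is purely boolean and can be negated with ~.
mask = df["C"].str.contains("XYZ", regex=False, na=False)
result = df[~mask]
print(result)
```

Without na=False, a missing value produces a missing entry in the mask, which can make the negation and the indexing step fail.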

How to delete ANY row containing specific string in pandas?

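You can use isin with any. Note that isin compares whole cell values, so this drops rows in which some cell is exactly equal to 'refused', not rows in which a cell merely contains it as a substring.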

df = df[~df.isin(['refused']).any(axis=1)]
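To make the difference between an exact cell match and a substring match concrete, here is a small sketch. The column names q1 and q2 and the sample answers are hypothetical, and every column is assumed to hold strings.

```python
import pandas as pd

# Hypothetical survey answers.
df = pd.DataFrame({
    "q1": ["yes", "refused", "no", "refused to say"],
    "q2": ["no", "yes", "refused", "yes"],
})

# Exact match: drops rows where any cell equals "refused".
exact = df[~df.isin(["refused"]).any(axis=1)]

# Substring match: drops rows where any cell contains "refused".
contains = df.apply(
    lambda col: col.str.contains("refused", regex=False, na=False)
).any(axis=1)
substring = df[~contains]

print(exact)
print(substring)
```

Row 3 ("refused to say") survives the isin filter but not the substring filter.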

Drop rows in dataframe if the column matches particular string

Essentially, you are forgetting to pass the boolean Series (True/False) into brackets [...] or, better, into .loc[...]. Instead, you are reassigning the values of those chunk columns to the results of your conditions rather than using the conditions to filter the data frame.

Therefore, consider calling .loc[] with the intersection of both conditions:

# ASSIGN BOOLEAN SERIES
fname_jr = ~chunk[0].str.contains("jr", na=False)
lname_jr = ~chunk[1].str.contains("jr", na=False)

# PASS INTO .loc
chunk_sub = chunk.loc[fname_jr & lname_jr]
chunk_sub

# 0 1 ... 9 10
# 0 jane doe ... kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI= cigna_TOKEN_ENCRYPTION_KEY
# 2 jane sr ... kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI= cigna_TOKEN_ENCRYPTION_KEY

To match several strings at once, use str.join to combine the list of items into a single pipe-delimited regex pattern:

# ASSIGN BOOLEAN SERIES
fname_jr_sr = ~chunk[0].str.contains("|".join(["sr", "jr"]), na=False)
lname_jr_sr = ~chunk[1].str.contains("|".join(["sr", "jr"]), na=False)

# PASS INTO .loc
chunk_sub = chunk.loc[fname_jr_sr & lname_jr_sr]
chunk_sub
# 0 1 ... 9 10
# 0 jane doe ... kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI= cigna_TOKEN_ENCRYPTION_KEY
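The difference between building a boolean Series and actually filtering with it can be seen on a tiny frame. This is only a sketch: the four rows below are a hypothetical stand-in for one chunk read without a header, so the columns are labeled 0 and 1.

```python
import pandas as pd

# Hypothetical stand-in for one chunk (integer column labels).
chunk = pd.DataFrame({
    0: ["jane", "john jr", "jane", "bob"],
    1: ["doe", "smith", "sr", "lee"],
})

pattern = "|".join(["sr", "jr"])  # "sr|jr"

# Each condition is a boolean Series aligned with the rows of chunk.
keep = (~chunk[0].str.contains(pattern, na=False)
        & ~chunk[1].str.contains(pattern, na=False))
print(keep.tolist())

# Passing the Series to .loc selects rows; the columns are left untouched.
chunk_sub = chunk.loc[keep]
print(chunk_sub)
```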

Relatedly, your np.where call is unnecessary because .loc accepts a boolean Series directly. Also be sure to escape a literal | with backslashes (\\|), since the pipe symbol is the regex alternation operator. Altogether:

chunk = chunk.loc[(chunk[0].astype('str').str.len()>1) &
                  (chunk[1].astype('str').str.len()>1) &
                  (chunk[4].astype('str').str.len()>4) &
                  (chunk[4].astype('str').str.len()<8) &
                  ~chunk[0].str.contains("|".join(["sr", "jr", "\\|", "\\|\\|"]), na=False) &
                  ~chunk[1].str.contains("|".join(["sr", "jr", "\\|", "\\|\\|"]), na=False)]

chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
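As an alternative to escaping by hand, the standard library's re.escape can build the pattern. The sketch below assumes a hypothetical three-column chunk; the data, the helper contains_unwanted, and the use of Series.between for the length check are illustrative choices, not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical chunk with integer column labels, as when header=None.
chunk = pd.DataFrame({
    0: ["jane", "a|b", "john jr", "x"],
    1: ["doe", "roe", "smith", "yy"],
    4: ["12345", "12345", "123456", "12345"],
})

# re.escape turns the literal pipe into \| so it is not read as alternation.
unwanted = ["sr", "jr", "|"]
pattern = "|".join(re.escape(s) for s in unwanted)

def contains_unwanted(col):
    """Boolean Series: True where the cell contains any unwanted string."""
    return col.astype(str).str.contains(pattern, na=False)

keep = (
    (chunk[0].astype(str).str.len() > 1)
    & (chunk[1].astype(str).str.len() > 1)
    & chunk[4].astype(str).str.len().between(5, 7)  # same as >4 and <8
    & ~contains_unwanted(chunk[0])
    & ~contains_unwanted(chunk[1])
)
filtered = chunk.loc[keep]
print(filtered)
```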

Python Pandas Dataframe dropping rows based on a column containing a character

IIUC

df[~df.dates.astype(str).str.contains('/')]

For example

df = pd.DataFrame()
df['dates'] = ['2011-01-20', '2011-01-20', '2011/01/20', '2011-01-20']

        dates
0  2011-01-20
1  2011-01-20
2  2011/01/20
3  2011-01-20

Then

df[~df.dates.str.contains('/')]

        dates
0  2011-01-20
1  2011-01-20
3  2011-01-20

You can also use map (as you tried), but return bool values rather than int so that the result works as a boolean mask:

df[df['dates'].map(lambda x: False if '/' in x else True )]

        dates
0  2011-01-20
1  2011-01-20
3  2011-01-20

However, notice that False if '/' in x else True is redundant. It is the same as not '/' in x, which is more idiomatically written as '/' not in x:

df[df['dates'].map(lambda x: '/' not in x)]

        dates
0  2011-01-20
1  2011-01-20
3  2011-01-20
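The astype(str) call in the first snippet matters when the column is not purely strings. A minimal sketch, assuming a hypothetical column in which one date was read as an integer:

```python
import pandas as pd

# Hypothetical mixed column: two strings and one integer.
df = pd.DataFrame({"dates": ["2011-01-20", "2011/01/20", 20110120]})

# Converting to str first lets every value be tested for "/".
# regex=False treats "/" as a plain substring.
has_slash = df["dates"].astype(str).str.contains("/", regex=False)
clean = df[~has_slash]
print(clean)
```

The map/lambda variants would raise a TypeError on the integer value, because '/' in 20110120 is not a valid membership test.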

Dropping rows that contain any of a list of certain strings in Pandas

Series.str.contains treats its pattern as a regular expression by default, so you can join the values with | to match any of them:

>>> df
            col1
0     24/05/2020
1  May Year 2020
2         Monday
3       May 2020
>>> drop_values = ['Monday','Year', '/']
>>> df[~df['col1'].str.contains('|'.join(drop_values))]
       col1
3  May 2020
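Because the joined string is interpreted as a regex, values containing characters such as ( or . need escaping. One way to handle this, sketched below, is re.escape from the standard library; the extra row and the "(est.)" entry are hypothetical additions to the example above.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "col1": ["24/05/2020", "May Year 2020", "Monday",
             "May 2020", "May (est.) 2020"],  # last row is hypothetical
})

# "(est.)" would be read as a regex group if left unescaped.
drop_values = ["Monday", "Year", "/", "(est.)"]
pattern = "|".join(re.escape(v) for v in drop_values)

result = df[~df["col1"].str.contains(pattern)]
print(result)
```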

Deleting/dropping rows in pandas DataFrame with particular string in ANY column

You can select only the object columns (which are typically the string columns) with select_dtypes:

df = energy.select_dtypes(object)
# regex=False added to improve performance, as suggested by @jpp (thank you)
mask = ~df.apply(lambda series: series.str.contains('Economy 7', regex=False)).any(axis=1)
no_eco = energy[mask]

Sample:

energy = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('adabbb')
})

print (energy)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  d
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b

df = energy.select_dtypes(object)
mask = ~df.apply(lambda series: series.str.contains('d')).any(axis=1)
no_eco = energy[mask]
print (no_eco)

   A  B  C  D  E  F
0  a  4  7  1  5  a
2  c  4  9  5  6  a
4  e  5  2  1  2  b
5  f  4  3  0  4  b
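The same idea can be wrapped in a small reusable function. This is a sketch only: the name drop_rows_containing is hypothetical, it picks text columns with the pandas.api.types helpers instead of select_dtypes, and it assumes that object columns hold strings.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def drop_rows_containing(frame, text):
    """Return frame without rows where any text column contains `text`."""
    # Assumption: object-dtype columns hold strings (the usual case).
    text_cols = [
        c for c in frame.columns
        if is_object_dtype(frame[c]) or is_string_dtype(frame[c])
    ]
    hits = frame[text_cols].apply(
        lambda s: s.str.contains(text, regex=False, na=False)
    )
    return frame[~hits.any(axis=1)]

energy = pd.DataFrame({
    "A": list("abcdef"),
    "B": [4, 5, 4, 5, 5, 4],
    "C": [7, 8, 9, 4, 2, 3],
    "D": [1, 3, 5, 7, 1, 0],
    "E": [5, 3, 6, 9, 2, 4],
    "F": list("adabbb"),
})

no_d = drop_rows_containing(energy, "d")
print(no_d)
```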

Pandas Drop Rows when a String is Matched to a Longer String in a Column in an Exact Match

You can create a set from drop_list and use set.isdisjoint on the split words in each row to evaluate if the exact match appears.

drop_set = set(drop_list)
msk = df['keyword'].apply(lambda x: drop_set.isdisjoint(x.split()))
df = df[msk]

Output:

        keyword
0  adidas socks
2  adidas shoes
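Since the snippet above relies on a df and drop_list defined in the question, here is a self-contained sketch with hypothetical data that shows why whole-word matching differs from a substring test:

```python
import pandas as pd

# Hypothetical data; the original question's frame is not shown here.
df = pd.DataFrame({
    "keyword": ["adidas socks", "nike socks", "adidas shoes", "adidas nikes"],
})
drop_list = ["nike"]

drop_set = set(drop_list)
# isdisjoint is True when none of the row's words is in drop_set,
# so True means "keep this row".
msk = df["keyword"].apply(lambda x: drop_set.isdisjoint(x.split()))
kept = df[msk]
print(kept)
```

The row "adidas nikes" is kept because "nikes" is a different word from "nike"; a str.contains("nike") filter would have dropped it.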

