How to Remove a Pandas Dataframe from Another Dataframe

Remove one dataframe from another with Pandas

Use merge with an outer join and indicator=True, filter out the rows found in both frames with query, and finally remove the helper _merge column with drop:

# Parentheses let the method chain span several lines
df = (pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
        .query("_merge != 'both'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))
print(df)
     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k
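
Note that an outer join also keeps rows that exist only in df2. If only the rows of df1 that are absent from df2 are wanted, a one-sided variant may be closer to the goal. The following is a minimal sketch under assumptions: the sample frames and the helper name anti_join are made up for illustration and do not come from the original answer.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original question.
df1 = pd.DataFrame({"A": ["qwe", "asd", "rty"], "B": [5, 7, 9], "C": ["a", "b", "f"]})
df2 = pd.DataFrame({"A": ["asd", "zxc"], "B": [7, 3]})


def anti_join(left, right, on):
    """Return the rows of `left` whose `on` values do not occur in `right`."""
    # Keep only the key columns of `right` and de-duplicate them so the
    # merge cannot add columns or multiply rows of `left`.
    keys = right[on].drop_duplicates()
    merged = left.merge(keys, on=on, how="left", indicator=True)
    return (
        merged[merged["_merge"] == "left_only"]
        .drop(columns="_merge")
        .reset_index(drop=True)
    )


result = anti_join(df1, df2, on=["A", "B"])
print(result)
```

With this data the row ("zxc", 3), which exists only in df2, never enters the result, whereas the outer-join version would keep it with NaN in column C.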

How to remove rows of a DataFrame based off of data from another DataFrame?

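Use isin on each column and combine the two masks with &. Each column is tested independently, so a row is dropped whenever its Product_Num and its Price both occur somewhere in df2, not necessarily in the same row: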

df.loc[~((df.Product_Num.isin(df2['Product_Num']))&(df.Price.isin(df2['Price']))),:]
Out[246]:
Product_Num Date Description Price
0 10 1-1-18 FruitSnacks 2.99
1 10 1-2-18 FruitSnacks 2.99
4 10 1-10-18 FruitSnacks 2.99
5 45 1-1-18 Apples 2.99
6 45 1-3-18 Apples 2.99
7 45 1-5-18 Apples 2.99
11 45 1-15-18 Apples 2.99

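Update: to match on the shared columns together rather than one at a time, left-merge with df2 plus a marker column and drop the rows that matched. This assumes df has a default integer index, df2 has no duplicate rows, and df contains no NaN values of its own: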

df.loc[~df.index.isin(df.merge(df2.assign(a='key'),how='left').dropna().index)]
Out[260]:
Product_Num Date Description Price
0 10 1-1-18 FruitSnacks 2.99
1 10 1-2-18 FruitSnacks 2.99
4 10 1-10-18 FruitSnacks 2.99
5 45 1-1-18 Apples 2.99
6 45 1-3-18 Apples 2.99
7 45 1-5-18 Apples 2.99
11 45 1-15-18 Apples 2.99
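
The difference between testing each column separately and testing whole rows can be seen on a small example. This is a sketch with invented data, chosen so that the two approaches disagree; the row-wise variant uses pd.MultiIndex.from_frame to compare (Product_Num, Price) pairs and is one possible technique, not the one from the answer above.

```python
import pandas as pd

# Hypothetical data, chosen so the two approaches give different results.
df = pd.DataFrame({"Product_Num": [10, 10, 45, 45], "Price": [2.99, 3.99, 2.99, 1.99]})
df2 = pd.DataFrame({"Product_Num": [10, 45], "Price": [3.99, 2.99]})
keys = ["Product_Num", "Price"]

# Per-column test: each column is checked against df2 on its own.
per_column = df[~(df["Product_Num"].isin(df2["Product_Num"]) & df["Price"].isin(df2["Price"]))]

# Row-wise test: compare (Product_Num, Price) pairs as tuples.
pairs = pd.MultiIndex.from_frame(df[keys])
row_wise = df[~pairs.isin(pd.MultiIndex.from_frame(df2[keys]))]

print(per_column.index.tolist())
print(row_wise.index.tolist())
```

Here the per-column test also drops row 0, the pair (10, 2.99), although that pair never occurs in df2. The row-wise test keeps it and, unlike a merge, preserves the original index of df.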

Pandas delete rows in a dataframe that are not in another dataframe

Please try this:

import numpy as np
import pandas as pd

df = pd.merge(df1, df2, how='left', indicator='Exist')
df['Exist'] = np.where(df.Exist == 'both', True, False)
df = df[df['Exist']].drop(['Exist','z'], axis=1)  # 'z' is a column from the asker's data
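
The same filtering can also be sketched without an indicator column: an inner join against the de-duplicated key columns of the second frame keeps only the rows that also appear there. The frames and the key column id below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frames; `id` is an assumed key column.
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"id": [2, 4, 4, 9]})

# De-duplicate the keys first; otherwise id 4 would appear twice in the result.
keys = df2[["id"]].drop_duplicates()
kept = df1.merge(keys, on="id", how="inner")
print(kept)
```

Selecting only the key columns of df2 before merging also means no extra column has to be dropped afterwards.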

In Pandas, how to delete rows from a Data Frame based on another Data Frame?

You can use boolean indexing with a condition built from isin; invert the boolean Series with ~:

import pandas as pd

USERS = pd.DataFrame({'email':['a@g.com','b@g.com','b@g.com','c@g.com','d@g.com']})
print (USERS)
email
0 a@g.com
1 b@g.com
2 b@g.com
3 c@g.com
4 d@g.com

EXCLUDE = pd.DataFrame({'email':['a@g.com','d@g.com']})
print (EXCLUDE)
email
0 a@g.com
1 d@g.com
print (USERS.email.isin(EXCLUDE.email))
0 True
1 False
2 False
3 False
4 True
Name: email, dtype: bool

print (~USERS.email.isin(EXCLUDE.email))
0 False
1 True
2 True
3 True
4 False
Name: email, dtype: bool

print (USERS[~USERS.email.isin(EXCLUDE.email)])
email
1 b@g.com
2 b@g.com
3 c@g.com

Another solution with merge:

df = pd.merge(USERS, EXCLUDE, how='outer', indicator=True)
print (df)
email _merge
0 a@g.com both
1 b@g.com left_only
2 b@g.com left_only
3 c@g.com left_only
4 d@g.com both

print (df.loc[df._merge == 'left_only', ['email']])
email
1 b@g.com
2 b@g.com
3 c@g.com

Remove rows that are in another dataframe

Try merge with indicator=True and keep only the rows marked left_only:

out = df1.merge(df2,how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
Out[128]:
A B C D E F G _merge
0 1 2 3 4 5 6 7 left_only
1 8 9 0 1 2 3 4 left_only

DataFrame remove rows existing in another DataFrame

Using pyspark:

You can create a list containing the customerId from DF2 with collect():

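from pyspark.sql.functions import col  # col is used in the filter below
# Collect the distinct customerId values of df2 to the driver as a Python list
id_df2 = [row[0] for row in df2.select('customerId').distinct().collect()]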

And then filter your DF1 customerId using isin with negation ~:

diff = df1.where(~col('customerId').isin(id_df2))

Remove duplicate rows dataframe from another dataframe

Drop columns in df1 which are also found in df2

df1.drop(columns=df2.columns, errors='ignore', inplace=True)

or

df1 =  df1.drop(columns=df2.columns, errors='ignore')
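
As a small illustration of what this drop does, here is a sketch with invented frames. Note that every column name the two frames share is removed from df1, including any shared key column, and that errors='ignore' only affects labels df1 does not have.

```python
import pandas as pd

# Hypothetical frames sharing the columns `id` and `date`.
df1 = pd.DataFrame({"id": [1, 2], "date": ["1-1-18", "1-2-18"], "price": [2.99, 3.49]})
df2 = pd.DataFrame({"id": [1], "date": ["1-1-18"], "stock": [5]})

# errors='ignore' skips labels in df2.columns that df1 lacks (here `stock`).
trimmed = df1.drop(columns=df2.columns, errors="ignore")
print(trimmed.columns.tolist())
```

Without errors='ignore', the missing label stock would raise a KeyError.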

Drop rows of df1 whose value in a specific column, say date, also appears in df2

Following your edit, if it is a single column like date, please try:

df1[~df1['date'].isin(df2['date'])]

A check on multiple columns is also possible, but we would need more information: what should happen if column1 has the same value in df1 and df2 while column2 in the same row differs?

How to remove rows from Pandas dataframe if the same row exists in another dataframe but end up with all columns from both df

You can use a left join to keep only the ids that are in the first data frame but not in the second, while still carrying all of the second data frame's columns.

import pandas as pd

df1 = pd.DataFrame(
data={"id": [1, 2, 3, 4], "col1": [9, 8, 7, 6], "col2": [5, 4, 3, 2]},
columns=["id", "col1", "col2"],
)
df2 = pd.DataFrame(
data={"id": [3, 4, 7], "col3": [11, 12, 13], "col4": [15, 16, 17]},
columns=["id", "col3", "col4"],
)

df_1_2 = df1.merge(df2, on="id", how="left", indicator=True)

df_1_not_2 = df_1_2[df_1_2["_merge"] == "left_only"].drop(columns=["_merge"])

which returns

   id  col1  col2  col3  col4
0   1     9     5   NaN   NaN
1   2     8     4   NaN   NaN

