Delete Rows That Exist in Another Data Frame

Delete rows that exist in another data frame?

You need the %in% operator. So,

df1[!(df1$name %in% df2$name),]

should give you what you want.

  • df1$name %in% df2$name tests whether the values in df1$name are in df2$name
  • The ! operator reverses the result.
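
For readers working in pandas rather than R, the same idea can be sketched with Series.isin and ~. The frames and values below are made-up illustration data; only the column name "name" mirrors the R snippet.

```python
import pandas as pd

# Hypothetical data; only the column name "name" mirrors the R snippet.
df1 = pd.DataFrame({"name": ["ann", "bob", "cat", "dan"], "score": [1, 2, 3, 4]})
df2 = pd.DataFrame({"name": ["bob", "dan"]})

# isin plays the role of %in%; ~ plays the role of !
kept = df1[~df1["name"].isin(df2["name"])]
print(kept)
```

Because the filter is applied to df1 itself, the surviving rows keep their original index labels.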

How to remove rows from Pandas dataframe if the same row exists in another dataframe but end up with all columns from both df

You can use a left join with indicator=True to keep only the ids that appear in the first data frame but not in the second, while still carrying the second data frame's columns in the result.

import pandas as pd

df1 = pd.DataFrame(
data={"id": [1, 2, 3, 4], "col1": [9, 8, 7, 6], "col2": [5, 4, 3, 2]},
columns=["id", "col1", "col2"],
)
df2 = pd.DataFrame(
data={"id": [3, 4, 7], "col3": [11, 12, 13], "col4": [15, 16, 17]},
columns=["id", "col3", "col4"],
)

df_1_2 = df1.merge(df2, on="id", how="left", indicator=True)

df_1_not_2 = df_1_2[df_1_2["_merge"] == "left_only"].drop(columns=["_merge"])

which returns

   id  col1  col2  col3  col4
0   1     9     5   NaN   NaN
1   2     8     4   NaN   NaN
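
The left-join-plus-indicator pattern can also be wrapped in a small helper. The name anti_join is hypothetical (pandas has no built-in function of that name). Note that col3 and col4 come back as all-NaN floats, because none of the surviving rows has a match in df2.

```python
import pandas as pd

def anti_join(left, right, on):
    """Rows of `left` whose `on` key has no match in `right` (hypothetical helper)."""
    merged = left.merge(right, on=on, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns=["_merge"])

df1 = pd.DataFrame({"id": [1, 2, 3, 4], "col1": [9, 8, 7, 6], "col2": [5, 4, 3, 2]})
df2 = pd.DataFrame({"id": [3, 4, 7], "col3": [11, 12, 13], "col4": [15, 16, 17]})

result = anti_join(df1, df2, on="id")
print(result)
```

If the extra columns are not wanted, they can be dropped afterwards or the filter can be done with isin on the id column instead.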

DataFrame remove rows existing in another DataFrame

Using pyspark:

You can create a list containing the customerId from DF2 with collect():

from pyspark.sql.functions import col
id_df2 = [row[0] for row in df2.select('customerId').distinct().collect()]

And then filter your DF1 customerId using isin with negation ~:

diff = df1.where(~col('customerId').isin(id_df2))

How to remove rows in a Pandas dataframe if the same row exists in another dataframe?

You can use merge with the indicator parameter and an outer join, query for filtering, and then drop to remove the helper column.

The DataFrames are joined on all columns, so the on parameter can be omitted:

print (pd.merge(a,b, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
   0   1
0  1  10
2  3  30
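
The snippet above does not show how a and b were built. A minimal self-contained sketch, with inputs assumed here (the values and the column names x and y are invented), might look like this:

```python
import pandas as pd

# Assumed inputs: the original answer does not show them.
a = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
b = pd.DataFrame({"x": [2], "y": [20]})

# No `on` argument: merge joins on every column the two frames share (x and y).
only_in_a = (
    pd.merge(a, b, indicator=True, how="outer")
    .query('_merge == "left_only"')
    .drop("_merge", axis=1)
)
print(only_in_a)
```

Only the rows of a that have no identical row in b remain.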

In Pandas, how to delete rows from a Data Frame based on another Data Frame?

You can use boolean indexing with an isin condition; the boolean Series is inverted with ~:

import pandas as pd

USERS = pd.DataFrame({'email':['a@g.com','b@g.com','b@g.com','c@g.com','d@g.com']})
print (USERS)
     email
0  a@g.com
1  b@g.com
2  b@g.com
3  c@g.com
4  d@g.com

EXCLUDE = pd.DataFrame({'email':['a@g.com','d@g.com']})
print (EXCLUDE)
     email
0  a@g.com
1  d@g.com

print (USERS.email.isin(EXCLUDE.email))
0     True
1    False
2    False
3    False
4     True
Name: email, dtype: bool

print (~USERS.email.isin(EXCLUDE.email))
0    False
1     True
2     True
3     True
4    False
Name: email, dtype: bool

print (USERS[~USERS.email.isin(EXCLUDE.email)])
     email
1  b@g.com
2  b@g.com
3  c@g.com

Another solution with merge:

df = pd.merge(USERS, EXCLUDE, how='outer', indicator=True)
print (df)
     email     _merge
0  a@g.com       both
1  b@g.com  left_only
2  b@g.com  left_only
3  c@g.com  left_only
4  d@g.com       both

print (df.loc[df._merge == 'left_only', ['email']])
     email
1  b@g.com
2  b@g.com
3  c@g.com
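
One practical difference between the two approaches, sketched here with invented data: isin filters the original frame and so keeps its index labels, whereas merge builds a new frame with its own index. Both select the same emails.

```python
import pandas as pd

# Invented data with a non-default index to make the difference visible.
users = pd.DataFrame({"email": ["d@g.com", "b@g.com", "a@g.com"]}, index=[10, 20, 30])
exclude = pd.DataFrame({"email": ["a@g.com"]})

via_isin = users[~users["email"].isin(exclude["email"])]

merged = pd.merge(users, exclude, how="outer", indicator=True)
via_merge = merged.loc[merged["_merge"] == "left_only", ["email"]]

print(via_isin)   # index labels 10 and 20 come from `users`
print(via_merge)  # index labels are assigned by merge, not taken from `users`
```

If later code relies on the original index, the isin version is probably the safer choice.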

Delete rows from dataframe if column value does not exist in another dataframe

Your question doesn't contain enough information, so I'll guess and show a toy example.
If you're using pandas, the solution would be:

>>> df1 = pd.DataFrame([x for x in pd.date_range('1/1/2020', '3/1/2020')], columns=['date'])
>>> df2 = pd.DataFrame([x for x in pd.date_range('2/20/2020', '3/1/2020')], columns=['date'])

>>> df1.shape
out: (61, 1)

>>> df2.shape
out: (11, 1)

>>> df1.head()
out:
        date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05

>>> df2.head()
out:
        date
0 2020-02-20
1 2020-02-21
2 2020-02-22
3 2020-02-23
4 2020-02-24

>>> new_df = df1[df1['date'].isin(df2['date'])]
>>> new_df
out:
         date
50 2020-02-20
51 2020-02-21
52 2020-02-22
53 2020-02-23
54 2020-02-24
55 2020-02-25
56 2020-02-26
57 2020-02-27
58 2020-02-28
59 2020-02-29
60 2020-03-01

>>> new_df.shape
out: (11, 1)

Now new_df contains only those dates that appear in both dataframes.
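
If the goal is the opposite (dropping the rows whose dates do appear in df2), the same mask can be negated with ~. A short sketch reusing the toy date ranges from this answer:

```python
import pandas as pd

df1 = pd.DataFrame({"date": pd.date_range("1/1/2020", "3/1/2020")})
df2 = pd.DataFrame({"date": pd.date_range("2/20/2020", "3/1/2020")})

# Keep only the rows of df1 whose date is NOT present in df2.
remaining = df1[~df1["date"].isin(df2["date"])]
print(remaining.shape)  # (50, 1)
```

The 61 rows of df1 minus the 11 shared dates leave 50 rows, ending at 2020-02-19.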

How to remove rows of a DataFrame based off of data from another DataFrame?

isin with &

df.loc[~((df.Product_Num.isin(df2['Product_Num']))&(df.Price.isin(df2['Price']))),:]
Out[246]:
    Product_Num     Date  Description  Price
0            10   1-1-18  FruitSnacks   2.99
1            10   1-2-18  FruitSnacks   2.99
4            10  1-10-18  FruitSnacks   2.99
5            45   1-1-18       Apples   2.99
6            45   1-3-18       Apples   2.99
7            45   1-5-18       Apples   2.99
11           45  1-15-18       Apples   2.99

Update

df.loc[~df.index.isin(df.merge(df2.assign(a='key'),how='left').dropna().index)]
Out[260]:
    Product_Num     Date  Description  Price
0            10   1-1-18  FruitSnacks   2.99
1            10   1-2-18  FruitSnacks   2.99
4            10  1-10-18  FruitSnacks   2.99
5            45   1-1-18       Apples   2.99
6            45   1-3-18       Apples   2.99
7            45   1-5-18       Apples   2.99
11           45  1-15-18       Apples   2.99
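
The first version checks each column independently, so it can also drop a row whose Product_Num and Price each occur somewhere in df2 but never together in one row. A sketch with invented data (not the frames from the question) that compares it with a row-wise match via merge and indicator:

```python
import pandas as pd

# Invented data, not the frames from the question.
df = pd.DataFrame({"Product_Num": [10, 10, 45], "Price": [2.99, 1.99, 1.99]})
df2 = pd.DataFrame({"Product_Num": [10, 45], "Price": [2.99, 1.99]})

# Column-by-column test: row 1 is dropped although the pair (10, 1.99) is not in df2.
independent = df.loc[
    ~(df["Product_Num"].isin(df2["Product_Num"]) & df["Price"].isin(df2["Price"]))
]

# Row-wise test: a left merge on both columns marks rows without a matching pair.
keys = ["Product_Num", "Price"]
merged = df.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)
row_wise = df[(merged["_merge"] == "left_only").to_numpy()]

print(len(independent), len(row_wise))
```

Dropping duplicate key pairs from df2 before merging keeps the merged frame the same length as df, which is what lets the mask be applied positionally.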

Pandas delete rows in a dataframe that are not in another dataframe

Please try this:

import numpy as np
import pandas as pd

df = pd.merge(df1, df2, how='left', indicator='Exist')
df['Exist'] = np.where(df.Exist == 'both', True, False)
df = df[df['Exist']==True].drop(['Exist','z'], axis=1)
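
The same filtering can be sketched without the intermediate boolean column by comparing the indicator directly. The frames below are invented, and the extra column z from the answer is left out because its origin is not shown.

```python
import pandas as pd

# Invented frames sharing the columns x and y.
df1 = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pd.DataFrame({"x": [2, 3, 4], "y": ["b", "c", "d"]})

merged = pd.merge(df1, df2, how="left", indicator="Exist")
# Keep only rows of df1 that also appear in df2, then drop the helper column.
kept = merged[merged["Exist"] == "both"].drop(columns="Exist")
print(kept)
```

Passing a string to indicator just names the helper column, so it can be filtered and dropped under that name.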

