Remove one dataframe from another with Pandas
Use merge with an outer join, filter with query, and finally remove the helper column with drop:
df = (pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
        .query("_merge != 'both'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))
print(df)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
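Note that an outer join filtered with `_merge != 'both'` keeps rows found in only one of the two frames, so rows that exist only in df2 also survive. If the goal is strictly "df1 minus df2", filtering on `left_only` may be closer to what you want. A minimal sketch with made-up data (the values below are illustrative, not from the question):

```python
import pandas as pd

# Illustrative data (hypothetical; not taken from the original question).
df1 = pd.DataFrame({"A": ["qwe", "asd"], "B": [5, 7], "C": ["a", "b"]})
df2 = pd.DataFrame({"A": ["asd", "zxc"], "B": [7, 3]})

merged = pd.merge(df1, df2, on=["A", "B"], how="outer", indicator=True)

# Rows present in exactly one frame (includes the df2-only row "zxc").
symmetric = merged.query("_merge != 'both'").drop(columns="_merge")

# Rows present in df1 only, i.e. df1 with the df2 matches removed.
only_df1 = merged.query("_merge == 'left_only'").drop(columns="_merge")

print(symmetric)
print(only_df1)
```

With `how='left'` instead of `how='outer'` the two filters give the same rows, because a left join never produces `right_only` rows.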
How to remove rows of a DataFrame based off of data from another DataFrame?
Use isin with &:
df.loc[~((df.Product_Num.isin(df2['Product_Num']))&(df.Price.isin(df2['Price']))),:]
Out[246]:
Product_Num Date Description Price
0 10 1-1-18 FruitSnacks 2.99
1 10 1-2-18 FruitSnacks 2.99
4 10 1-10-18 FruitSnacks 2.99
5 45 1-1-18 Apples 2.99
6 45 1-3-18 Apples 2.99
7 45 1-5-18 Apples 2.99
11 45 1-15-18 Apples 2.99
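Update: the version above tests each column independently. To match on whole rows instead, merge and filter by index (this relies on df having a default 0..n-1 index):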
df.loc[~df.index.isin(df.merge(df2.assign(a='key'),how='left').dropna().index)]
Out[260]:
Product_Num Date Description Price
0 10 1-1-18 FruitSnacks 2.99
1 10 1-2-18 FruitSnacks 2.99
4 10 1-10-18 FruitSnacks 2.99
5 45 1-1-18 Apples 2.99
6 45 1-3-18 Apples 2.99
7 45 1-5-18 Apples 2.99
11 45 1-15-18 Apples 2.99
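The difference between the two versions shows up when a row's values each appear in df2, but never together in the same row. Here is a small sketch, assuming only the two columns matter; the data is invented for illustration, and `pd.MultiIndex.from_frame` is used as one possible way to compare (Product_Num, Price) pairs:

```python
import pandas as pd

# Illustrative data (hypothetical); column names follow the example above.
df = pd.DataFrame({"Product_Num": [10, 10, 45], "Price": [2.99, 3.99, 2.99]})
df2 = pd.DataFrame({"Product_Num": [10, 45], "Price": [3.99, 2.99]})

cols = ["Product_Num", "Price"]

# Column-wise test: each column is checked on its own, so a row is dropped
# whenever both of its values appear somewhere in df2, even in different rows.
column_wise = df.loc[
    ~(df["Product_Num"].isin(df2["Product_Num"]) & df["Price"].isin(df2["Price"]))
]

# Row-wise test: compare (Product_Num, Price) pairs as tuples.
pairs = pd.MultiIndex.from_frame(df[cols])
pairs_to_remove = pd.MultiIndex.from_frame(df2[cols])
row_wise = df.loc[~pairs.isin(pairs_to_remove)]

print(column_wise)  # the pair (10, 2.99) is dropped although it is not in df2
print(row_wise)     # the pair (10, 2.99) is kept
```

The row-wise version also keeps the original index of df, so it does not depend on the index being 0..n-1.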
Pandas delete rows in a dataframe that are not in another dataframe
Please try this:
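import numpy as np
# Left merge; the 'Exist' indicator records where each row was found
df = pd.merge(df1, df2, how='left', indicator='Exist')
# True where the row is also present in df2
df['Exist'] = np.where(df.Exist == 'both', True, False)
df = df[df['Exist']==True].drop(['Exist','z'], axis=1)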
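One thing to watch for with this approach: if df2 contains the same row more than once, a left merge repeats the matching df1 row. A possible workaround, sketched with invented data and assuming the frames share all their column names, is to de-duplicate df2 before merging:

```python
import pandas as pd

# Illustrative data (hypothetical; not from the original question).
df1 = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pd.DataFrame({"x": [1, 1, 3], "y": ["a", "a", "c"]})  # (1, "a") appears twice

# De-duplicate df2 first so each df1 row can match at most once.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator="Exist")

# Keep only the rows that were found in both frames, then drop the helper column.
kept = merged.loc[merged["Exist"] == "both"].drop(columns="Exist")
print(kept)
```

Without the `drop_duplicates()` call, the row (1, "a") would appear twice in the merged result.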
In Pandas, how to delete rows from a Data Frame based on another Data Frame?
You can use boolean indexing with an isin condition; invert the boolean Series with ~:
import pandas as pd
USERS = pd.DataFrame({'email':['a@g.com','b@g.com','b@g.com','c@g.com','d@g.com']})
print (USERS)
email
0 a@g.com
1 b@g.com
2 b@g.com
3 c@g.com
4 d@g.com
EXCLUDE = pd.DataFrame({'email':['a@g.com','d@g.com']})
print (EXCLUDE)
email
0 a@g.com
1 d@g.com
print (USERS.email.isin(EXCLUDE.email))
0 True
1 False
2 False
3 False
4 True
Name: email, dtype: bool
print (~USERS.email.isin(EXCLUDE.email))
0 False
1 True
2 True
3 True
4 False
Name: email, dtype: bool
print (USERS[~USERS.email.isin(EXCLUDE.email)])
email
1 b@g.com
2 b@g.com
3 c@g.com
Another solution with merge:
df = pd.merge(USERS, EXCLUDE, how='outer', indicator=True)
print (df)
email _merge
0 a@g.com both
1 b@g.com left_only
2 b@g.com left_only
3 c@g.com left_only
4 d@g.com both
print (df.loc[df._merge == 'left_only', ['email']])
email
1 b@g.com
2 b@g.com
3 c@g.com
Remove rows that are in another dataframe
Try merge with indicator=True, then keep only the left_only rows:
out = df1.merge(df2,how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
Out[128]:
A B C D E F G _merge
0 1 2 3 4 5 6 7 left_only
1 8 9 0 1 2 3 4 left_only
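The result above still carries the `_merge` helper column, and merge also resets the index. If you need this often, it could be wrapped in a small helper. The function below is a hypothetical sketch (`anti_join` is not a pandas function); it assumes `left` has no column named `_merge` and matches on the shared columns unless `on` is given:

```python
import pandas as pd

def anti_join(left, right, on=None):
    """Return the rows of `left` that have no match in `right` (hypothetical helper)."""
    if on is None:
        # Default to the columns the two frames have in common.
        keys = [c for c in left.columns if c in right.columns]
    elif isinstance(on, str):
        keys = [on]
    else:
        keys = list(on)
    # Only the key columns of `right` matter; de-duplicating them guarantees
    # the left merge returns exactly one row per row of `left`, in order.
    right_keys = right[keys].drop_duplicates()
    merged = left.merge(right_keys, on=keys, how="left", indicator=True)
    mask = (merged["_merge"] == "left_only").to_numpy()
    # Index `left` itself so its original index and columns are preserved.
    return left[mask]

# Illustrative data (hypothetical).
df1 = pd.DataFrame({"A": [1, 8, 5], "B": [2, 9, 5]}, index=[10, 11, 12])
df2 = pd.DataFrame({"A": [5], "B": [5]})

out = anti_join(df1, df2)
print(out)
```

Because the mask is applied to `left` directly, no helper column needs to be dropped afterwards.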
DataFrame remove rows existing in another DataFrame
Using pyspark: you can create a list containing the customerId values from df2 with collect():
from pyspark.sql.functions import col
id_df2 = [row[0] for row in df2.select('customerId').distinct().collect()]
And then filter df1 on customerId using isin with negation ~:
diff = df1.where(~col('customerId').isin(id_df2))
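If df2 is large, collecting every id to the driver can get expensive. A possible alternative, assuming both inputs are Spark DataFrames with a customerId column, is a left anti join, which keeps only the rows of df1 that have no match in df2. The helper name `remove_matching_customers` below is hypothetical:

```python
def remove_matching_customers(df1, df2, key="customerId"):
    """Return the rows of df1 whose key does not appear in df2.

    Both arguments are expected to be Spark DataFrames (an assumption of
    this sketch); the function name and the `key` parameter are illustrative.
    """
    # A left anti join keeps only the left rows without a match on the right,
    # so the ids never have to be collected to the driver.
    return df1.join(df2.select(key).distinct(), on=key, how="left_anti")

# Usage, given existing Spark DataFrames df1 and df2:
# diff = remove_matching_customers(df1, df2)
```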
Remove duplicate rows dataframe from another dataframe
Drop columns in df1 which are also found in df2:
df1.drop(columns=df2.columns, errors='ignore', inplace=True)
or
df1 = df1.drop(columns=df2.columns, errors='ignore')
Drop rows in df1 whose value in a specific column, say date, also appears in df2.
Following your edit, if it is a single column like date, please try:
df1[~df1['date'].isin(df2['date'])]
A check on multiple columns can also be done, but we will need more info: what should happen if column1 has the same value in df1 and df2 but, in the same row, column2 has different values?
How to remove rows from Pandas dataframe if the same row exists in another dataframe but end up with all columns from both df
You can use a left join to get only the ids that are in the first data frame and not in the second, while also keeping all of the second data frame's columns.
import pandas as pd
df1 = pd.DataFrame(
data={"id": [1, 2, 3, 4], "col1": [9, 8, 7, 6], "col2": [5, 4, 3, 2]},
columns=["id", "col1", "col2"],
)
df2 = pd.DataFrame(
data={"id": [3, 4, 7], "col3": [11, 12, 13], "col4": [15, 16, 17]},
columns=["id", "col3", "col4"],
)
df_1_2 = df1.merge(df2, on="id", how="left", indicator=True)
df_1_not_2 = df_1_2[df_1_2["_merge"] == "left_only"].drop(columns=["_merge"])
which returns
id col1 col2 col3 col4
0 1 9 5 NaN NaN
1 2 8 4 NaN NaN