Compare Two Data.Frames to Find the Rows in Data.Frame 1 That Are Not Present in Data.Frame 2

This doesn't answer your question directly, but it will give you the elements the two data frames have in common. This can be done with Paul Murrell's compare package:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1, a2, allowAll = TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c

The compare function gives you a lot of flexibility in the kinds of comparisons that are allowed (e.g. changing the order of elements of each vector, changing the order and names of variables, shortening variables, changing the case of strings). From this, you should be able to figure out what was missing from one data frame or the other. For example (this is not very elegant):

difference <- data.frame(lapply(1:ncol(a1),
                                function(i) setdiff(a1[, i], comparison$tM[, i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e
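The same column-wise setdiff idea can be sketched in plain Python; the dicts below are hypothetical stand-ins for the a1/a2 data.frames above:

```python
# Column-wise set difference, mirroring the R setdiff() approach above.
a1 = {"a": [1, 2, 3, 4, 5], "b": ["a", "b", "c", "d", "e"]}
a2 = {"a": [1, 2, 3], "b": ["a", "b", "c"]}

# For each column, keep the values of a1 that never appear in a2.
difference = {col: [v for v in a1[col] if v not in set(a2[col])]
              for col in a1}
print(difference)  # {'a': [4, 5], 'b': ['d', 'e']}
```

Note that, like the R version, this compares each column independently, so it does not preserve row pairings.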

How to find common rows between two dataframes in R and remove them

We may use dplyr's anti_join, which returns the rows of df1 that have no matching key in df2:

library(dplyr)
anti_join(df1, df2, by = c("name1"))

data

df1 <- structure(list(name1 = c("a", "b", "c"), name2 = c("a1", "b1", 
"c1"), name3 = c("a2", "b2", "c2")), class = "data.frame", row.names = c(NA,
-3L))

df2 <- structure(list(name1 = c("a", "b"), name2 = c("a3", "b3")), class = "data.frame", row.names = c(NA,
-2L))
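The anti-join logic itself can be sketched in plain Python: keep the rows of df1 whose key does not appear in df2. Lists of dicts stand in for the data frames above:

```python
# Anti-join sketch: rows of df1 with no match on "name1" in df2.
df1 = [{"name1": "a", "name2": "a1"},
       {"name1": "b", "name2": "b1"},
       {"name1": "c", "name2": "c1"}]
df2 = [{"name1": "a", "name2": "a3"},
       {"name1": "b", "name2": "b3"}]

keys_in_df2 = {row["name1"] for row in df2}
anti = [row for row in df1 if row["name1"] not in keys_in_df2]
print(anti)  # [{'name1': 'c', 'name2': 'c1'}]
```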

pandas get rows which are NOT in other dataframe

One method would be to store the result of an inner merge of both dfs; we can then simply select the rows whose column values are not in this common set:

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14
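The intended row-level logic can be sketched in plain Python: the "inner merge" is the intersection of (col1, col2) pairs, and we keep the df1 rows outside it. The tuples mirror the df1/df2 values above:

```python
# Merge-then-filter sketch: find rows common to both, then drop them
# from df1.
df1 = [(1, 10), (2, 11), (3, 12), (4, 13), (5, 14)]
df2 = [(1, 10), (2, 11), (3, 12)]

common = set(df1) & set(df2)
only_in_df1 = [row for row in df1 if row not in common]
print(only_in_df1)  # [(4, 13), (5, 14)]
```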

EDIT

Another method, as you've found, is to use isin, which will produce NaN rows that you can then drop:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

However, if df2's rows do not line up with df1's (isin with a DataFrame argument aligns on index labels and column names), this won't work:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
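A value-based comparison, rather than a label-aligned one, avoids this pitfall. A plain-Python sketch using the shifted df2 from above:

```python
# Comparing whole rows by value is unaffected by row alignment.
df1 = [(1, 10), (2, 11), (3, 12), (4, 13), (5, 14)]
df2 = [(2, 11), (3, 12), (4, 13)]   # shifted rows, as in the edit above

missing = [row for row in df1 if row not in set(df2)]
print(missing)  # [(1, 10), (5, 14)]
```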

Compare 2 DataFrames and drop rows that do not contain corresponding ID variables

You could use a boolean mask:

msk1 = df1['ID'].isin(df2['ID'])
msk2 = df2['ID'].isin(df1['ID'])
df1 = df1[msk1]
df2 = df2[msk2]

or use set.intersection and boolean indexing:

common = set(df1['ID']).intersection(df2['ID'])
df1 = df1[df1['ID'].isin(common)]
df2 = df2[df2['ID'].isin(common)]
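The intersection approach can be sketched in plain Python; the ID lists are hypothetical, chosen to match the output below:

```python
# Keep only the IDs present in both lists.
ids1 = [1, 2, 3, 4]
ids2 = [1, 2, 4]

common = set(ids1) & set(ids2)
kept1 = [i for i in ids1 if i in common]
kept2 = [i for i in ids2 if i in common]
print(kept1, kept2)  # [1, 2, 4] [1, 2, 4]
```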

Output:

df1:

   ID    Food  Amount
0   1     Ham       5
1   2  Cheese       2
3   4   Bacon       4

df2:

   ID    Food  Amount
0   1     Ham       6
1   2  Cheese       7
2   4   Bacon      15

Compare two dataframes' column values. Find which values are in one df and not the other

You could just use normal sets to get unique customer ids for each year and then subtract them appropriately:

set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)

Out: 83

For your original approach to work, you would need to drop the duplicates from the DataFrames to make sure each customer appears only once:

Customer_2018 = df.loc[df.OrderYear == 2018, "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[df.OrderYear == 2017, "Customer ID"].drop_duplicates()

Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()

#Out:
#True     552
#False     83
#Name: Customer ID, dtype: int64

Compare and identify the missing rows

One way would be to create a unique key combining the two columns and use %in% to find if the key is present in another dataset.

A base R option:

df2$compare <- do.call(paste, df2) %in% do.call(paste, df1)
df2

# A tibble: 7 x 3
#  serial   day compare
#   <dbl> <dbl> <lgl>
#1      1     1 TRUE
#2      2     0 TRUE
#3      3     1 TRUE
#4      4     0 TRUE
#5      5     0 TRUE
#6      5     1 FALSE
#7      7     1 FALSE

If there are more columns in your data apart from serial and day, use the code below.

cols <- c('serial', 'day')
df2$compare <- do.call(paste, df2[cols]) %in% do.call(paste, df1[cols])
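The paste-key idea can be sketched in plain Python: join the key columns into one string per row (like do.call(paste, df)) and test membership. The (serial, day) tuples mirror the data implied by the output above:

```python
# Build a combined string key per row, then check membership in df1.
df1 = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 0)]
df2 = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 0), (5, 1), (7, 1)]

keys1 = {" ".join(map(str, row)) for row in df1}
compare_col = [" ".join(map(str, row)) in keys1 for row in df2]
print(compare_col)  # [True, True, True, True, True, False, False]
```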

Pandas better method to compare two dataframes and find entries that only exist in one

It looks like using 'outer' as the how argument was the solution:

z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records present in both

The output looks like this (showing only the columns I care about):

      Name Site      _merge
   Charlie    A   left_only
      Doug    B  right_only
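The indicator logic can be sketched in plain Python: tag each row by the side it appears on and keep everything not present in both. The (Name, Site) rows are hypothetical, chosen to match the output above:

```python
# Tag one-sided rows, mimicking pandas' _merge indicator column.
original = [("Alice", "A"), ("Charlie", "A")]
new = [("Alice", "A"), ("Doug", "B")]

set_o, set_n = set(original), set(new)
tagged = ([(row, "left_only") for row in original if row not in set_n]
          + [(row, "right_only") for row in new if row not in set_o])
print(tagged)  # [(('Charlie', 'A'), 'left_only'), (('Doug', 'B'), 'right_only')]
```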

