Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2
This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's compare package:
library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c
The compare function gives you a lot of flexibility in terms of what kinds of comparisons are allowed (e.g., changing the order of elements of each vector, changing the order and names of variables, shortening variables, changing the case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):
difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e
How to find common rows between two dataframes in R and remove them
We may use anti_join from dplyr:
library(dplyr)
anti_join(df1, df2, by = c("name1"))
data
df1 <- structure(list(name1 = c("a", "b", "c"), name2 = c("a1", "b1",
"c1"), name3 = c("a2", "b2", "c2")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(name1 = c("a", "b"), name2 = c("a3", "b3")), class = "data.frame", row.names = c(NA,
-2L))
pandas get rows which are NOT in other dataframe
One method would be to store the result of an inner merge from both dfs; then we can simply select the rows where a column's values are not in this common set:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
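The answer above assumes df1 and df2 are already defined. A self-contained sketch, with the two frames reconstructed from the outputs shown (the first three rows are common to both; this data is inferred, not from the original answer):

```python
import pandas as pd

# Frames reconstructed from the printed outputs above
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# Inner merge keeps only the rows present in both frames
common = df1.merge(df2, on=['col1', 'col2'])

# Keep df1 rows whose values do not appear in the common set
only_in_df1 = df1[(~df1.col1.isin(common.col1)) & (~df1.col2.isin(common.col2))]
print(only_in_df1)
```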
EDIT
Another method, as you've found, is to use isin, which will produce NaN rows that you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
However, if df2's rows are not aligned with df1's (DataFrame.isin compares element-wise by position), then this won't work:
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
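A merge with indicator=True (the technique used in the last answer on this page) avoids the alignment pitfall. A sketch using the misaligned df2 above, with df1 inferred from the printed outputs:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
# The misaligned df2 from above, where element-wise isin fails
df2 = pd.DataFrame({'col1': [2, 3, 4], 'col2': [11, 12, 13]})

# A left merge with indicator=True tags each row 'both' or 'left_only';
# keeping 'left_only' rows gives df1 rows absent from df2, regardless of order
merged = df1.merge(df2, on=['col1', 'col2'], how='left', indicator=True)
only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df1)
```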
Compare 2 DataFrames and drop rows that do not contain corresponding ID variables
You could use a boolean mask:
msk1 = df1['ID'].isin(df2['ID'])
msk2 = df2['ID'].isin(df1['ID'])
df1 = df1[msk1]
df2 = df2[msk2]
or use set.intersection and boolean indexing:
common = set(df1['ID']).intersection(df2['ID'])
df1 = df1[df1['ID'].isin(common)]
df2 = df2[df2['ID'].isin(common)]
Output:
df1:
ID Food Amount
0 1 Ham 5
1 2 Cheese 2
3 4 Bacon 4
df2:
ID Food Amount
0 1 Ham 6
1 2 Cheese 7
2 4 Bacon 15
Compare two dataframes' column values. Find which values are in one df and not the other
You could just use normal sets to get the unique customer IDs for each year and then subtract them appropriately:
set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)
Out: 83
For your original approach to work, you would need to drop duplicates from the DataFrames to make sure each customer appears only once:
Customer_2018 = df.loc[(df.OrderYear == 2018), "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[(df.OrderYear == 2017), "Customer ID"].drop_duplicates()
Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()
#Out:
True 552
False 83
Name: Customer ID, dtype: int64
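A minimal, self-contained sketch of the set-difference idea; the toy customer IDs below are made up for illustration (the original answer runs against the asker's real data):

```python
import pandas as pd

# Hypothetical toy data: one row per order, with the order year
df = pd.DataFrame({
    'OrderYear': [2017, 2017, 2017, 2018, 2018],
    'Customer ID': ['A', 'B', 'C', 'A', 'C'],
})

# Customers seen in 2017 but not in 2018 (set difference handles
# duplicate orders automatically, since sets keep unique values)
lost = (set(df.loc[df.OrderYear == 2017, 'Customer ID'])
        - set(df.loc[df.OrderYear == 2018, 'Customer ID']))
print(lost)
```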
Compare and identify the missing rows
One way would be to create a unique key combining the two columns and use %in% to find whether the key is present in the other dataset.
A base R option -
df2$compare <- do.call(paste, df2) %in% do.call(paste, df1)
df2
# A tibble: 7 x 3
# serial day compare
# <dbl> <dbl> <lgl>
#1 1 1 TRUE
#2 2 0 TRUE
#3 3 1 TRUE
#4 4 0 TRUE
#5 5 0 TRUE
#6 5 1 FALSE
#7 7 1 FALSE
If there are more columns in your data apart from serial and day, use the code below.
cols <- c('serial', 'day')
df2$compare <- do.call(paste, df2[cols]) %in% do.call(paste, df1[cols])
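An analogous composite-key approach can be sketched in pandas (the frames below mirror the R example's serial/day data; this translation is mine, not from the original answer):

```python
import pandas as pd

# Frames mirroring the R example's columns
df1 = pd.DataFrame({'serial': [1, 2, 3, 4, 5], 'day': [1, 0, 1, 0, 0]})
df2 = pd.DataFrame({'serial': [1, 2, 3, 4, 5, 5, 7],
                    'day':    [1, 0, 1, 0, 0, 1, 1]})

cols = ['serial', 'day']
# Build a composite key per row, like do.call(paste, df[cols]) in R,
# then test membership with isin (the %in% analogue)
key1 = df1[cols].astype(str).agg(' '.join, axis=1)
key2 = df2[cols].astype(str).agg(' '.join, axis=1)
df2['compare'] = key2.isin(key1)
print(df2)
```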
Pandas better method to compare two dataframes and find entries that only exist in one
Looks like using 'outer' as the how argument was the solution:
z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both
Output looks like this (after showing only the columns I care about):
Name Site _merge
Charlie A left_only
Doug B right_only
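A self-contained version of the same outer-merge trick; the ORIGINAL and NEW data here are invented to reproduce the output shown above (the original post does not include them):

```python
import pandas as pd

# Hypothetical data chosen to reproduce the displayed output
ORIGINAL = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                         'Site': ['A', 'B', 'A']})
NEW = pd.DataFrame({'Name': ['Alice', 'Bob', 'Doug'],
                    'Site': ['A', 'B', 'B']})
cols = ['Name', 'Site']

# Outer merge keeps rows from both sides; indicator=True adds a _merge
# column recording which side each row came from
z = pd.merge(ORIGINAL, NEW, on=cols, how='outer', indicator=True)
z = z[z._merge != 'both']  # keep only rows unique to one side
print(z[['Name', 'Site', '_merge']])
```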