Find complement of a data frame (anti-join)
Try anti_join from dplyr:
library(dplyr)
anti_join(df, df1, by='heads')
How to find the complement of a DataFrame with respect to another DataFrame?
Use a left_anti join.
df1
df1 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
], ['id', 'col'])
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
df2
df2 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
], ['id', 'col'])
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
left_anti join
df1.join(df2, on=['id'], how='left_anti').show()
+---+---+
| id|col|
+---+---+
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
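For readers without a Spark session at hand, the semantics of the left_anti join above can be sketched in plain Python. This only illustrates what the join computes, not how Spark executes it, and the helper name left_anti is made up for this example.

```python
# Plain-Python sketch of left_anti semantics.
# `left_anti` is a hypothetical helper for illustration, not a Spark API.
df1 = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
df2 = [(1, 'a'), (1, 'b'), (1, 'c')]


def left_anti(left, right, key):
    """Keep the rows of `left` whose key has no match in `right`."""
    right_keys = {key(row) for row in right}
    return [row for row in left if key(row) not in right_keys]


# Match on the first field only, mirroring on=['id'] above.
result = left_anti(df1, df2, key=lambda row: row[0])
print(result)  # [(2, 'd'), (2, 'e'), (3, 'f')]
```

Because only id is compared, every row with id 1 is dropped, whatever its col value.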
Using Anti Join in R
There's a dplyr function to do this called anti_join:
library(dplyr)
anti_join(df1, df2, by = c('Check'))
To sort it in descending order of Count (thanks to Ben Bolker for pointing out that part of the question) you can use arrange.
library(dplyr)
df1 %>%
  anti_join(df2, by = c('Check')) %>%
  arrange(desc(Count))
How to get relative complement of one data.frame in another?
Try this
library(dplyr)
output <- anti_join(foo,bleh)
output[order(output$start),]
Another option is setdiff from the dplyr package (thanks to @Frank for the correction):
setdiff(foo,bleh)
#  start stop
#1     5    7
#2     9   11
#3    13   15
#4    17   19
List of elements in join table without match
Once a left_join is done, you cannot find out what wasn't matched. As Petr suggested in that answer, you can subsequently use anti_join to find what doesn't match.
Another technique (that only requires one merge operation) is to do a full join and filter on elements unique to the left and to the right to see what is missing.
Using the datasets from the examples of full_join:
full_join(band_members, band_instruments)
# Joining, by = "name"
# # A tibble: 4 x 3
#   name  band    plays
#   <chr> <chr>   <chr>
# 1 Mick  Stones  <NA>
# 2 John  Beatles guitar
# 3 Paul  Beatles bass
# 4 Keith <NA>    guitar
In this example, one can approximate the left join with filter(!is.na(band)) and the right join with filter(!is.na(plays)). The unmatched rows then fall out directly: filter(is.na(plays)) gives the rows of the first frame that have no match in the second, and filter(is.na(band)) gives the reverse.
In this example, it's "clear" since there were no NA values before the merge. If there is no column that is known to never be NA (in either or both frames), then you can add one at low cost. For instance, mutate(band_members, orig=TRUE) (and the same for band_instruments, under a different column name so that the marker is not picked up as a join key) will give you solid "known" columns.
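The marker-column idea can be sketched in plain Python as well. This is a minimal illustration, assuming names are unique within each frame; the function full_join_with_markers and the in_left / in_right fields are hypothetical names chosen for the example, not part of dplyr.

```python
# Sketch of "full join once, then filter on marker columns".
# Assumes each name appears at most once per frame.
band_members = [("Mick", "Stones"), ("John", "Beatles"), ("Paul", "Beatles")]
band_instruments = [("John", "guitar"), ("Paul", "bass"), ("Keith", "guitar")]


def full_join_with_markers(left, right):
    """Full outer join on name, recording which side each row came from."""
    left_by_name = dict(left)
    right_by_name = dict(right)
    # Left names first, then names that only occur on the right.
    names = list(left_by_name) + [n for n in right_by_name if n not in left_by_name]
    return [
        {
            "name": n,
            "band": left_by_name.get(n),    # None plays the role of NA
            "plays": right_by_name.get(n),
            "in_left": n in left_by_name,   # marker that is never missing
            "in_right": n in right_by_name,
        }
        for n in names
    ]


joined = full_join_with_markers(band_members, band_instruments)
left_only = [row["name"] for row in joined if not row["in_right"]]
right_only = [row["name"] for row in joined if not row["in_left"]]
print(left_only, right_only)  # ['Mick'] ['Keith']
```

Filtering on the markers rather than on band or plays keeps the result correct even if those columns contained missing values before the join.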
How can I perform a setdiff merge using data.table?
In this case, it's equivalent to an anti join:
tab1[!tab2, on=c("let", "num")]
But setdiff() would return only the first row for every let,num. This is marked for v1.9.8, FR #547.
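The difference between the two only shows up when the left table has duplicate rows. A small plain-Python sketch, with made-up data in place of the question's tab1 and tab2:

```python
# Anti join vs. set difference on a table with duplicate rows.
# The data below is invented for illustration.
tab1 = [("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3)]
tab2 = [("b", 2)]

tab2_rows = set(tab2)

# Anti join: every row of tab1 without a match in tab2, duplicates kept.
anti = [row for row in tab1 if row not in tab2_rows]

# Set difference: the same rows, but each distinct row only once
# (dict.fromkeys keeps the first occurrence and preserves order).
set_diff = list(dict.fromkeys(anti))

print(anti)      # [('a', 1), ('a', 1), ('c', 3), ('c', 3)]
print(set_diff)  # [('a', 1), ('c', 3)]
```

When tab1 has no duplicates, the two results coincide, which is why the anti join is equivalent in that case.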
How to compare 2 datasets based on one column?
dplyr
You can use dplyr::anti_join.
anti_join(df1, df2, by="var4")
# A tibble: 1 x 4
   var1 var2  var3  var4
  <dbl> <chr> <chr> <chr>
1     2 peach blue  2021-12-24
base R
df1[!df1$var4 %in% df2$var4, ]
data.table
setDT(df1)[!df2, on = "var4"]