Find Complement of a Data Frame (Anti - Join)

Try anti_join from dplyr

library(dplyr)
anti_join(df, df1, by='heads')

How to find the complement of a DataFrame with respect to another DataFrame?

Use a left_anti join.

df1
df1 = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
(2, 'd'),
(2, 'e'),
(3, 'f'),
], ['id', 'col'])

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
df2
df2 = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
], ['id', 'col'])

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
left_anti join
df1.join(df2, on=['id'], how='left_anti').show()

+---+---+
| id|col|
+---+---+
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
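
To make the semantics of the left_anti join concrete, here is a minimal plain-Python sketch (not Spark code) that reproduces the result above on the same rows. The helper name left_anti and its key argument are made up for illustration; the idea is only that a row of the left table survives when its key is absent from the right table.

```python
# Plain-Python sketch of left-anti join semantics (not Spark code).
# Rows are (id, col) tuples mirroring df1 and df2 above.
df1_rows = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (2, "e"), (3, "f")]
df2_rows = [(1, "a"), (1, "b"), (1, "c")]


def left_anti(left, right, key):
    """Keep the rows of `left` whose key does not appear in `right`.

    `key` is a hypothetical helper argument: a function that extracts
    the join key from a row.
    """
    right_keys = {key(row) for row in right}
    return [row for row in left if key(row) not in right_keys]


# Join on the first field only, like on=['id'] in the Spark call.
result = left_anti(df1_rows, df2_rows, key=lambda row: row[0])
print(result)  # [(2, 'd'), (2, 'e'), (3, 'f')]
```

Note that only the id column is compared, so every row with id 1 is dropped regardless of its col value.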

Using Anti Join in R

There's a dplyr function to do this called anti_join:

library(dplyr)
anti_join(df1, df2, by = c('Check'))

To sort the result in descending order of Count (thanks to Ben Bolker for pointing out that part of the question), use arrange:

library(dplyr)
df1 %>%
  anti_join(df2, by = c('Check')) %>%
  arrange(desc(Count))

How to get relative complement of one data.frame in another?

Try this

library(dplyr)
output <- anti_join(foo,bleh)
output[order(output$start),]

Another option is setdiff from the dplyr package (thanks to @Frank for the correction):

setdiff(foo,bleh)
#  start stop
#1     5    7
#2     9   11
#3    13   15
#4    17   19

List of elements in join table without match

Once a left_join is done, you cannot find out what wasn't matched. As Petr suggested in that answer, you can subsequently use anti_join to find what doesn't match.

Another technique (that only requires one merge operation) is to do a full join and filter on elements unique to the left and to the right to see what is missing.

Using datasets used in the examples of full_join:

full_join(band_members, band_instruments)
# Joining, by = "name"
# # A tibble: 4 x 3
#   name  band    plays
#   <chr> <chr>   <chr>
# 1 Mick  Stones  <NA>
# 2 John  Beatles guitar
# 3 Paul  Beatles bass
# 4 Keith <NA>    guitar

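In this example, one can approximate the left join with filter(!is.na(band)) and the right join with filter(!is.na(plays)). The rows of the first frame that found no match in the second come from filter(is.na(plays)), and the rows of the second frame that found no match in the first come from filter(is.na(band)).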

In this example, it's "clear" since there were no NA values before the merge. If no column is known to never be NA (in either or both frames), you can add one at low cost. For instance, mutate(band_members, orig=TRUE) (and the same for band_instruments, under a different column name) gives you solid "known" columns.
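
The marker-column idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the data mirrors the band_members / band_instruments example, names are assumed to be unique within each table, and the function full_join_with_flags and the in_left / in_right flags are hypothetical stand-ins for the mutate(..., orig=TRUE) columns.

```python
# Data mirrors dplyr's band_members / band_instruments example.
# Assumption: each name appears at most once per table.
band_members = {"Mick": "Stones", "John": "Beatles", "Paul": "Beatles"}
band_instruments = {"John": "guitar", "Paul": "bass", "Keith": "guitar"}


def full_join_with_flags(left, right):
    """Full outer join on the dict key, recording the origin of each row."""
    names = list(left) + [name for name in right if name not in left]
    return [
        {
            "name": name,
            "band": left.get(name),    # None plays the role of NA
            "plays": right.get(name),
            "in_left": name in left,   # hypothetical marker columns
            "in_right": name in right,
        }
        for name in names
    ]


joined = full_join_with_flags(band_members, band_instruments)

# One join, then two filters on the marker columns.
left_only = [row["name"] for row in joined if not row["in_right"]]
right_only = [row["name"] for row in joined if not row["in_left"]]
print(left_only)   # ['Mick']
print(right_only)  # ['Keith']
```

Because the flags are set explicitly, the filters stay correct even if band or plays contained missing values before the join.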

How can I perform a setdiff merge using data.table?

In this case, it's equivalent to an anti join:

tab1[!tab2, on=c("let", "num")]

But setdiff() would only return the first row for every let,num. This is marked for v1.9.8, FR #547.
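
The difference only shows up when the left table has duplicate rows. A small plain-Python sketch, using made-up (let, num) rows rather than the tables from the question, and two hypothetical helpers named anti_join and set_difference:

```python
# Hypothetical (let, num) rows; ("a", 1) appears twice on purpose.
tab1 = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]
tab2 = [("b", 2)]


def anti_join(left, right):
    """Drop rows of `left` found in `right`; duplicates in `left` are kept."""
    right_rows = set(right)
    return [row for row in left if row not in right_rows]


def set_difference(left, right):
    """Like anti_join, but keep only the first occurrence of each row."""
    right_rows = set(right)
    seen = set()
    out = []
    for row in left:
        if row not in right_rows and row not in seen:
            seen.add(row)
            out.append(row)
    return out


anti = anti_join(tab1, tab2)
diff = set_difference(tab1, tab2)
print(anti)  # [('a', 1), ('a', 1), ('c', 3)]
print(diff)  # [('a', 1), ('c', 3)]
```

When the left table has no duplicate rows, the two results coincide, which is why the anti join is equivalent in this case.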

How to compare 2 datasets based on one column?

dplyr

You can use dplyr::anti_join.

anti_join(df1, df2, by="var4")
# A tibble: 1 x 4
#    var1 var2  var3  var4
#   <dbl> <chr> <chr> <chr>
# 1     2 peach blue  2021-12-24

base R

df1[!df1$var4 %in% df2$var4, ]

data.table

setDT(df1)[!df2, on = "var4"]

