# Find Complement of a Data Frame (Anti-Join)

## Find complement of a data frame (anti-join)

Try `anti_join` from `dplyr`:

```r
library(dplyr)
anti_join(df, df1, by = 'heads')
```

## How to find the complement of a DataFrame with respect to another DataFrame?

Use a `left_anti` join:

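##### `df1`
```python
from pyspark.sql import SparkSession

# Reuse the active session or create one
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
], ['id', 'col'])
df1.show()
# +---+---+
# | id|col|
# +---+---+
# |  1|  a|
# |  1|  b|
# |  1|  c|
# |  2|  d|
# |  2|  e|
# |  3|  f|
# +---+---+
```
##### `df2`
```python
df2 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
], ['id', 'col'])
df2.show()
# +---+---+
# | id|col|
# +---+---+
# |  1|  a|
# |  1|  b|
# |  1|  c|
# +---+---+
```
##### `left_anti` join
```python
# Keep only the rows of df1 whose id has no match in df2
df1.join(df2, on=['id'], how='left_anti').show()
# +---+---+
# | id|col|
# +---+---+
# |  2|  d|
# |  2|  e|
# |  3|  f|
# +---+---+
```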
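
To make the semantics of `left_anti` concrete, here is a minimal plain-Python sketch that mimics the join above on lists of tuples, with no Spark required. The helper name `left_anti_join` and its `key_index` parameter are hypothetical and not part of PySpark; the sketch only illustrates the rule "keep each left row whose key never appears on the right".

```python
def left_anti_join(left, right, key_index=0):
    """Hypothetical helper: rows of `left` whose key is absent from `right`."""
    # Collect the join keys present on the right-hand side
    right_keys = {row[key_index] for row in right}
    # Keep left rows (duplicates included) whose key is not among them
    return [row for row in left if row[key_index] not in right_keys]


# Same rows as df1 and df2 above, with `id` in position 0
df1_rows = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
df2_rows = [(1, 'a'), (1, 'b'), (1, 'c')]

result = left_anti_join(df1_rows, df2_rows)
print(result)  # [(2, 'd'), (2, 'e'), (3, 'f')]
```

Because the match is on `id` only, every `df1` row with `id` 1 is dropped regardless of its `col` value.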

## Using Anti Join in R

There's a dplyr function to do this called `anti_join`:

```r
library(dplyr)
anti_join(df1, df2, by = c('Check'))
```

To sort it in descending order of Count (thanks to Ben Bolker for pointing out that part of the question) you can use `arrange`.

```r
library(dplyr)
df1 %>%
  anti_join(df2, by = c('Check')) %>%
  arrange(desc(Count))
```

## How to get relative complement of one data.frame in another?

Try this:

```r
library(dplyr)
output <- anti_join(foo, bleh)
output[order(output$start), ]
```

Another option is `setdiff` from the `dplyr` package (thanks to @Frank for the correction):

```r
setdiff(foo, bleh)
#  start stop
#1     5    7
#2     9   11
#3    13   15
#4    17   19
```

## List of elements in join table without match

Once a `left_join` is done, you cannot find out what wasn't matched. As Petr suggested in that answer, you can subsequently use `anti_join` to find what doesn't match.

Another technique (that only requires one merge operation) is to do a full join and filter on elements unique to the left and to the right to see what is missing.

Using the datasets from the `full_join` examples:

```r
full_join(band_members, band_instruments)
# Joining, by = "name"
# # A tibble: 4 x 3
#   name  band    plays
#   <chr> <chr>   <chr>
# 1 Mick  Stones  <NA>
# 2 John  Beatles guitar
# 3 Paul  Beatles bass
# 4 Keith <NA>    guitar
```

In this example, one can approximate the left join with `filter(!is.na(band))` and the right join with `filter(!is.na(plays))`. The rows of the first frame that have no match in the second come from `filter(is.na(plays))`, and the unmatched rows of the second frame come from `filter(is.na(band))`.

This is unambiguous here only because neither frame contained `NA` values before the merge. If no column is guaranteed to be free of `NA` (in either or both frames), you can add one at low cost. For instance, `mutate(band_members, orig = TRUE)` (and likewise for `band_instruments`, using a distinct column name or an explicit `by = "name"` so the two markers stay separate) gives you reliably known columns.
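
The full-join-then-filter idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the rows are the `band_members` and `band_instruments` values visible in the output above, the helper `full_join_with_flags` and the `in_left`/`in_right` marker fields are hypothetical names, and each name is assumed to occur at most once per frame.

```python
band_members = [("Mick", "Stones"), ("John", "Beatles"), ("Paul", "Beatles")]
band_instruments = [("John", "guitar"), ("Paul", "bass"), ("Keith", "guitar")]


def full_join_with_flags(left, right):
    """Hypothetical helper: full outer join on name, with origin markers."""
    left_by_name = dict(left)      # assumes unique names on the left
    right_by_name = dict(right)    # assumes unique names on the right
    # Left names first, then names that only occur on the right
    names = list(left_by_name) + [n for n in right_by_name if n not in left_by_name]
    return [
        {
            "name": n,
            "band": left_by_name.get(n),     # None plays the role of NA
            "plays": right_by_name.get(n),
            "in_left": n in left_by_name,    # marker that is never missing
            "in_right": n in right_by_name,  # marker that is never missing
        }
        for n in names
    ]


joined = full_join_with_flags(band_members, band_instruments)

# Rows unique to each side, read off the markers rather than the data columns
left_only = [r["name"] for r in joined if r["in_left"] and not r["in_right"]]
right_only = [r["name"] for r in joined if r["in_right"] and not r["in_left"]]

print(left_only)   # ['Mick']
print(right_only)  # ['Keith']
```

Filtering on dedicated markers instead of on `band` or `plays` keeps the result correct even when the original columns already contain missing values.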

## How can I perform a setdiff merge using data.table?

In this case, it's equivalent to an anti join:

```r
tab1[!tab2, on = c("let", "num")]
```

But `setdiff()` would return only the first row for every `let,num` combination. This is marked for v1.9.8, FR #547.
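
The distinction between an anti-join and a set difference only shows up when the left table has duplicate rows. The plain-Python sketch below uses made-up `(let, num)` rows, and the function names `anti_join` and `set_difference` are hypothetical stand-ins for the two behaviours, not data.table functions.

```python
def anti_join(left, right):
    """Keep every left row that has no match on the right, duplicates included."""
    right_rows = set(right)
    return [row for row in left if row not in right_rows]


def set_difference(left, right):
    """Like anti_join, but return each distinct row only once."""
    right_rows = set(right)
    seen = set()
    out = []
    for row in left:
        if row not in right_rows and row not in seen:
            seen.add(row)
            out.append(row)
    return out


# Hypothetical data: ("a", 1) appears twice in tab1
tab1 = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]
tab2 = [("b", 2)]

print(anti_join(tab1, tab2))       # [('a', 1), ('a', 1), ('c', 3)]
print(set_difference(tab1, tab2))  # [('a', 1), ('c', 3)]
```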

## How to compare 2 datasets based on one column?

### dplyr

You can use `dplyr::anti_join`.

```r
anti_join(df1, df2, by = "var4")
# A tibble: 1 x 4
#    var1 var2  var3  var4
#   <dbl> <chr> <chr> <chr>
# 1     2 peach blue  2021-12-24
```

### base R

```r
# Keep the rows of df1 whose var4 value does not occur in df2
df1[!df1$var4 %in% df2$var4, ]
```

### data.table

```r
setDT(df1)[!df2, on = "var4"]
```