Using anti_join() from dplyr on two tables from two different databases

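Try this, using a named vector in by to match key columns that have different names in the two tables:

mismatch_records <- anti_join(table_c, table_d, by = c("CLIENT_ID" = "ClientId"))

If the tables are lazy tbls from different sources rather than local data frames, collect() them first or pass copy = TRUE.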

Creating a new data frame from two existing data frames based on values from two columns

We can use a data.table anti-join:

library(data.table)
setDT(df2)[!df1, on = .(ColA, ColB)]
# ColA ColB ColE
#1: a 31 41
#2: b 11 13

data

df1 <- structure(list(ColA = c("a", "a", "b", "c"), ColB = c(1L, 3L, 
5L, 9L), ColC = c(2L, 4L, 6L, 10L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ColA = c("a", "a", "a", "b", "b", "c"), ColB = c(1L,
31L, 3L, 5L, 11L, 9L), ColE = c(22L, 41L, 63L, 6L, 13L, 20L)), class = "data.frame", row.names = c(NA,
-6L))

Exclude common rows in tibbles

Use the setdiff() function from the dplyr package:

library(dplyr)

A <- tibble(A = c("a", "b", "c", "d", "e"))
B <- tibble(A = c("a", "b", "c"))
C <- setdiff(A, B)

Just to add: setdiff(A, B) returns the rows present in A but not in B.

How can I compare two data.frames and remove the same entries in the specific two columns in R?

Simple one-line solution in dplyr:

dplyr::anti_join(a1,a2,by=c("ID","chr"))

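Simple one-line solution in base R (pasting the two columns together so that rows are matched on the ID/chr pair, not on each column independently):

a1[!(paste(a1$ID, a1$chr) %in% paste(a2$ID, a2$chr)), ]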

Output

 ID chr loc var
1 6 1 2 1
2 5 11 6 7
3 0 12 10 4

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by passing a vector to by, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

In dplyr, what are the intrinsic differences between setdiff and anti_join?

Both subset the first parameter, but setdiff requires the columns to be the same:

library(dplyr)

setdiff(mtcars, mtcars[1:30, ])
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

setdiff(mtcars, mtcars[1:30, 1:6])
#> Error in setdiff_data_frame(x, y): not compatible: Cols in x but not y: `carb`, `gear`, `am`, `vs`, `qsec`.

whereas anti_join is a join, so it doesn't:

anti_join(mtcars, mtcars[1:30, 1:3])
#> Joining, by = c("mpg", "cyl", "disp")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

Anti-Join Pandas

Consider the following dataframes:

import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()


This is one way to do what you want

Method 1

# Identify the keys that are in TableB but not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
pd.concat([TableA, TableB[where_diff]], ignore_index=True)


Method 2

rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T
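
For comparison, pandas can also express the anti-join step directly through merge with indicator=True, which avoids both the explicit set and the loop. This is only a sketch and not part of the original answer: it uses small fixed frames (made-up values) in place of the random ones, and the names flagged, only_in_b and combined are illustrative.

```python
import pandas as pd

# Small fixed frames standing in for TableA/TableB (illustrative data, not the original random values).
TableA = pd.DataFrame({"Key": list("abcd"), "A": [1.0, 2.0, 3.0, 4.0]})
TableB = pd.DataFrame({"Key": list("aecf"), "A": [10.0, 20.0, 30.0, 40.0]})

# Left-merge TableB against TableA's keys; the _merge column records where each row was found.
flagged = TableB.merge(
    TableA[["Key"]].drop_duplicates(), on="Key", how="left", indicator=True
)

# Anti-join: keep the TableB rows whose Key has no match in TableA.
only_in_b = flagged.loc[flagged["_merge"] == "left_only"].drop(columns="_merge")

# Append those rows to TableA, as Method 1 does.
combined = pd.concat([TableA, only_in_b], ignore_index=True)
print(combined)
```

Merging against only the Key column keeps TableB's own columns from being suffixed, and drop_duplicates guards against row multiplication if TableA ever has repeated keys.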

Timing

4 rows with 2 overlap

Method 1 is much quicker


10,000 rows 5,000 overlap

Loops are bad: the row-by-row iteration in Method 2 is far slower.
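
The original timing screenshots are not reproduced here, so the following is a rough sketch of how the comparison could be rerun. The helper make_tables, the function names method_1 and method_2, the string keys, and the reduced size (2,000 rows with 1,000 overlapping, to keep the loop quick) are all assumptions for illustration. Method 1 is written with a negated isin mask, which selects the same rows as the set difference. No timings are quoted because they depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_tables(n, overlap, seed=0):
    """Build two n-row frames that share `overlap` keys (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    keys_a = [f"k{i}" for i in range(n)]
    keys_b = [f"k{i}" for i in range(n - overlap, 2 * n - overlap)]
    a = pd.DataFrame(rng.random((n, 3)), columns=list("ABC"))
    a.insert(0, "Key", keys_a)
    b = pd.DataFrame(rng.random((n, 3)), columns=list("ABC"))
    b.insert(0, "Key", keys_b)
    return a, b


def method_1(a, b):
    # Vectorised: mask the rows of b whose Key is absent from a, then concatenate.
    return pd.concat([a, b[~b.Key.isin(a.Key)]], ignore_index=True)


def method_2(a, b):
    # Row-by-row loop, kept only for comparison.
    a_keys = a.Key.values
    rows = [row for _, row in b.iterrows() if row.Key not in a_keys]
    return pd.concat([a.T] + rows, axis=1).T


a, b = make_tables(n=2_000, overlap=1_000)
t1 = timeit.timeit(lambda: method_1(a, b), number=3)
t2 = timeit.timeit(lambda: method_2(a, b), number=3)
print(f"method 1: {t1:.4f}s, method 2: {t2:.4f}s")
```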


Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e

How to make join operations in dplyr silent?

If you want to be heavy-handed, you can do

aa = suppressMessages(inner_join(a, b))

The better choice, as Jazzurro suggests, is to specify the by argument. dplyr only prints a message to let you know which columns it has guessed to join by. If you don't make it guess, it doesn't confirm things with you. This is also the safer choice from a defensive coding standpoint.

If this is in a knitr document, you can set the chunk option message=FALSE.


