Using anti_join() from the dplyr on two tables from two different databases
Try this:
mismatch_records <- anti_join(table_c, table_d, by = c("CLIENT_ID" = "ClientId"))
Creating a new data frame from two existing data frame based on values from two columns
We can use
library(data.table)
setDT(df2)[!df1, on = .(ColA, ColB)]
# ColA ColB ColE
#1: a 31 41
#2: b 11 13
data
df1 <- structure(list(ColA = c("a", "a", "b", "c"), ColB = c(1L, 3L,
5L, 9L), ColC = c(2L, 4L, 6L, 10L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ColA = c("a", "a", "a", "b", "b", "c"), ColB = c(1L,
31L, 3L, 5L, 11L, 9L), ColE = c(22L, 41L, 63L, 6L, 13L, 20L)), class = "data.frame", row.names = c(NA,
-6L))
Exclude common rows in tibbles
Use setdiff() function from dplyr library
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
C <- setdiff(A,B)
Just to add.
Setdiff(A,B) gives out those elements present in A but not in B.
How can I compare two data.frames and remove the same entries in the specific two columns in R?
Simple one-line solution in dplyr
:
dplyr::anti_join(a1,a2,by=c("ID","chr"))
Simple one-line solution in base R:
a1[!(a1$ID %in% a2$ID & a1$chr %in% a2$chr),]
Output
ID chr loc var
1 6 1 2 1
2 5 11 6 7
3 0 12 10 4
How to join (merge) data frames (inner, outer, left, right)
By using the merge
function and its optional parameters:
Inner join: merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use the by.x
and by.y
parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
You can merge on multiple columns by giving by
a vector, e.g., by = c("CustomerId", "OrderId")
.
If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
where CustomerId_in_df1
is the name of the column in the first data frame and CustomerId_in_df2
is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
In dplyr, what are the intrinsic differences between setdiff and anti_join?
Both subset the first parameter, but setdiff
requires the columns to be the same:
library(dplyr)
setdiff(mtcars, mtcars[1:30, ])
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
setdiff(mtcars, mtcars[1:30, 1:6])
#> Error in setdiff_data_frame(x, y): not compatible: Cols in x but not y: `carb`, `gear`, `am`, `vs`, `qsec`.
whereas anti_join
is a join, so doesn't:
anti_join(mtcars, mtcars[1:30, 1:3])
#> Joining, by = c("mpg", "cyl", "disp")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
Anti-Join Pandas
Consider the following dataframes
TableA = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('abcd'), name='Key'),
['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('aecf'), name='Key'),
['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want
Method 1
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)
Method 2
rows = []
for i, row in TableB.iterrows():
if row.Key not in TableA.Key.values:
rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T
Timing
4 rows with 2 overlap
Method 1 is much quicker
10,000 rows 5,000 overlap
loops are bad
Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2
This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare
:
library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c
The function compare
gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):
difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e
How to make join operations in dplyr silent?
If you want to be heavy-handed, you can do
aa = suppressMessages(inner_join(a, b))
The better choice, as Jazzurro suggests, is to specify the by
argument. dplyr
only prints a message to let you know what its guess is for which columns to join by. If you don't make it guess, it doesn't confirm things with you. This is a safer choice as well, from defensive coding standpoint.
If this is in a knitr
document, you can set the chunk option message=FALSE
.
Related Topics
Ggplot Legend - Scale_Colour_Manual Not Working
Draw Bloxplots in R Given 25,50,75 Percentiles and Min and Max Values
How to Specify the Size/Layout of a Single Plot to Match a Certain Grid in R
Predict Out of Sample on Fixed Effects Model
R Xts: .001 Millisecond in Index
Likert Plot Showing Percentage Values
Upload and View a PDF in R Shiny
Group Rows in Data Frame Based on Time Difference Between Consecutive Rows
R Predict Function Returning Too Many Values
How to Insert Missing Observations on a Data Frame
Using Anti_Join() from the Dplyr on Two Tables from Two Different Databases
How to Adapt a Latex Beamer Theme to Apply It in an Rmarkdown::Beamer_Presentation
R: Strptime() and Is.Na () Unexpected Results
Remove Unused Categorical Values Boxplot - R