Compare If Two Dataframe Objects in R Are Equal

Compare if two dataframe objects in R are equal?

It is not clear what it means to test if two data frames are "value equal" but to test if the values are the same, here is an example of two non-identical dataframes with equal values:

a <- data.frame(x = 1:10)
b <- data.frame(y = 1:10)

To test if all values are equal:

all(a == b) # TRUE

To test if objects are identical (they are not, they have different column names):

identical(a,b) # FALSE: class, colnames, rownames must all match.

How to check if two data frames are equal

Look up all.equal. It has some riders but it might work for you.

all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE

Compare 2 dataframes for equality in R

dplyr's setdiff works on data frames, I would suggest

library(dplyr)
nrow(setdiff(a, c)) == 0 & nrow(setdiff(c, a)) == 0
# [1] TRUE

Note that this will not account for number of duplicate rows. (i.e., if a has multiple copies of a row, and c has only one copy of that row, it will still return TRUE). Not sure how you want duplicate rows handled...

If you do care about having the same number of duplicates, then I would suggest two possibilities: (a) adding an ID column to differentiate the duplicates and using the approach above, or (b) sorting, resetting the row names (annoyingly), and using identical.

(a) adding an ID column

library(dplyr)
a_id = group_by_all(a) %>% mutate(id = row_number())
c_id = group_by_all(c) %>% mutate(id = row_number())
nrow(setdiff(a_id, c_id)) == 0 & nrow(setdiff(c_id, a_id)) == 0
# [1] TRUE

(b) sorting

a_sort = a[do.call(order, a), ]
row.names(a_sort) = NULL
c_sort = c[do.call(order, c), ]
row.names(c_sort) = NULL
identical(a_sort, c_sort)
# [1] TRUE

Compare two dataframes in R

One approach could be to convert data to long format, perform an inner_join subtract values, check if all the values are in range and get the data back in wide format.

library(dplyr)
library(tidyr)

df1 %>% pivot_longer(cols = -Letter) %>%
inner_join(df2 %>% pivot_longer(cols = -Letter), by = c("Letter", "name")) %>%
mutate(value = value.x - value.y) %>%
group_by(Letter) %>%
mutate(check = all(between(value, 0, 2))) %>%
select(-value.x, -value.y) %>%
pivot_wider()

# Letter check `2011` `2012` `2013`
# <chr> <lgl> <int> <int> <int>
#1 A TRUE 1 2 1
#2 C FALSE 0 -1 3

data

df1 <- structure(list(Letter = c("A", "B", "C"), `2011` = c(2L, 6L,5L),
`2012` = c(3L, 6L, 4L), `2013` = c(5L, 6L, 8L)), row.names = c(NA, -3L),
class = "data.frame")

df2 <- structure(list(Letter = c("A", "C"), `2011` = c(1L, 5L), `2012` = c(1L,
5L), `2013` = 4:5), row.names = c(NA, -2L), class = "data.frame")

Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e

How to compare every row of dataframe to dataframe in R?

You can do this: in short, with each row of the dataframe, duplicate it to create a new dataframe with all values changed to that row, and compare that dataframe with the original (whether the values are the same). rowSums of each of that comparison will give you the vectors you want.

# Create the desired output in list 
lst <-
lapply(1:nrow(df), function(nr) {
rowSums(replicate(nrow(df), df[nr, ], simplify = FALSE) %>%
do.call("rbind", .) == df)})

# To create the desired dataframe
df %>% tibble(desired_column = I(lst))

In tibble call in the last row, I() is used to put in list output as a column.



Related Topics



Leave a reply



Submit