How to Merge Two Data Frames in R by a Common Column with Mismatched Date/Time Values


After first converting your datetime character strings to POSIXt classes, some combination of rounding and truncating those times should get you something you can use as the basis of a merge.

First read in your data, and create corresponding POSIXt datetimes:

dts1 <- structure(list(datetime = structure(1:6,
.Label = c("30/03/2011 02:32", "30/03/2011 02:42",
"30/03/2011 02:52", "30/03/2011 03:02", "30/03/2011 03:12",
"30/03/2011 03:22"), class = "factor"), count = c(27L, 3L,
0L, 1L, 15L, 0L), period = c(561L, 600L, 574L, 550L, 600L,
597L)), .Names = c("datetime", "count", "period"),
class = "data.frame", row.names = c(NA, -6L))
dts2 <- structure(list(datetime = structure(1:7,
.Label = c("30/03/2011 01:59", "30/03/2011 02:58",
"30/03/2011 03:55", "30/03/2011 04:53", "30/03/2011 05:52",
"30/03/2011 06:48", "30/03/2011 07:48"), class = "factor"),
dist = c(23.9, 14.7, 10.4, 35.4, 56.1, 12.3, 10.7), car =
c(1L, 1L, 2L, 1L, 1L, 1L, 1L), satd = c(3L, 7L, 9L, 3L, 7L,
4L, 5L), alt = c(1.76, 6.36, -0.34, 3.55, -0.91, 6.58,
4.18)), .Names = c("datetime", "dist", "car", "satd",
"alt"), class = "data.frame", row.names = c(NA, -7L))

# create corresponding POSIXlt vectors
# (you could update the 'datetime' columns in-place if you prefer)
datetime1 <- strptime(dts1$datetime, format = "%d/%m/%Y %H:%M")
datetime2 <- strptime(dts2$datetime, format = "%d/%m/%Y %H:%M")

The following code produces a merged table based on the nearest hour in all cases. Inside the merge it simply prepends a column of rounded times to each of your data frames, merges on that column (i.e., column number 1), then uses the -1 index to drop it at the end:

# merge based on nearest hour
merge(
  cbind(round(datetime1, "hours"), dts1),
  cbind(round(datetime2, "hours"), dts2),
  by = 1, suffixes = c("_dts1", "_dts2")
)[-1]

datetime_dts1 count period datetime_dts2 dist car satd alt
1 30/03/2011 02:32 27 561 30/03/2011 02:58 14.7 1 7 6.36
2 30/03/2011 02:42 3 600 30/03/2011 02:58 14.7 1 7 6.36
3 30/03/2011 02:52 0 574 30/03/2011 02:58 14.7 1 7 6.36
4 30/03/2011 03:02 1 550 30/03/2011 02:58 14.7 1 7 6.36
5 30/03/2011 03:12 15 600 30/03/2011 02:58 14.7 1 7 6.36
6 30/03/2011 03:22 0 597 30/03/2011 02:58 14.7 1 7 6.36

As above, but this time truncating to the hour:

merge(
  cbind(trunc(datetime1, "hours"), dts1),
  cbind(trunc(datetime2, "hours"), dts2),
  by = 1, suffixes = c("_dts1", "_dts2")
)[-1]

datetime_dts1 count period datetime_dts2 dist car satd alt
1 30/03/2011 02:32 27 561 30/03/2011 02:58 14.7 1 7 6.36
2 30/03/2011 02:42 3 600 30/03/2011 02:58 14.7 1 7 6.36
3 30/03/2011 02:52 0 574 30/03/2011 02:58 14.7 1 7 6.36
4 30/03/2011 03:02 1 550 30/03/2011 03:55 10.4 2 9 -0.34
5 30/03/2011 03:12 15 600 30/03/2011 03:55 10.4 2 9 -0.34
6 30/03/2011 03:22 0 597 30/03/2011 03:55 10.4 2 9 -0.34

As above, but for dts1 treat a record as belonging to the previous hour until 10 minutes past the hour, by subtracting 10*60 seconds before truncating. This produces the same output you specified, though without more information I'm not sure it's the exact rule you want.

merge(
  cbind(trunc(datetime1 - 10*60, "hours"), dts1),
  cbind(trunc(datetime2, "hours"), dts2),
  by = 1, suffixes = c("_dts1", "_dts2")
)[-1]

datetime_dts1 count period datetime_dts2 dist car satd alt
1 30/03/2011 02:32 27 561 30/03/2011 02:58 14.7 1 7 6.36
2 30/03/2011 02:42 3 600 30/03/2011 02:58 14.7 1 7 6.36
3 30/03/2011 02:52 0 574 30/03/2011 02:58 14.7 1 7 6.36
4 30/03/2011 03:02 1 550 30/03/2011 02:58 14.7 1 7 6.36
5 30/03/2011 03:12 15 600 30/03/2011 03:55 10.4 2 9 -0.34
6 30/03/2011 03:22 0 597 30/03/2011 03:55 10.4 2 9 -0.34

You can tweak which times you round, which you truncate, and whether you first subtract or add an offset, depending on your specific rule.
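If you're unsure which rule fits, it may help to see how round() and trunc() behave on a single time; a minimal sketch with standalone datetimes:

```r
# round() snaps to the nearest hour; trunc() always drops the minutes
t1 <- strptime("30/03/2011 02:52", format = "%d/%m/%Y %H:%M")
t2 <- strptime("30/03/2011 02:58", format = "%d/%m/%Y %H:%M")

format(round(t1, "hours"), "%H:%M")  # "03:00" (past the half hour, rounds up)
format(trunc(t2, "hours"), "%H:%M")  # "02:00" (minutes simply discarded)
```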

Edit:

Not the most elegant, but here is a different approach that accommodates the more complicated conditional rule you described in the comments. It leans heavily on na.locf from the zoo package to determine which dts2 times come immediately before and after each dts1 record. With those in hand, it's just a matter of applying the rule to pick the desired dts2 time, matching back to the original dts1 table, and merging.

library(zoo)

# create ordered list of all datetimes, using names to keep
# track of which ones come from each data frame
alldts <- sort(c(
  setNames(datetime1, rep("dts1", length(datetime1))),
  setNames(datetime2, rep("dts2", length(datetime2)))))
is.dts1 <- names(alldts) == "dts1"

# for each dts1 record, get previous closest dts2 time
dts2.prev <- alldts
dts2.prev[is.dts1] <- NA
dts2.prev <- na.locf(dts2.prev, na.rm=FALSE)[is.dts1]

# for each dts1 record, get next closest dts2 time
dts2.next <- alldts
dts2.next[is.dts1] <- NA
dts2.next <- na.locf(dts2.next, na.rm=FALSE, fromLast=TRUE)[is.dts1]

# for each dts1 record, apply rule to choose dts2 time
# (units stated explicitly so the comparison doesn't depend on
# difftime's automatic choice of units)
use.prev <- !is.na(dts2.prev) &
  (difftime(alldts[is.dts1], dts2.prev, units = "mins") < 5)
dts2.to.use <- ifelse(use.prev, as.character(dts2.prev),
                      as.character(dts2.next))

# merge based on chosen dts2 times, prepended as character vector
# for the purpose of merging
merge(
  cbind(.dt = dts2.to.use[match(datetime1, alldts[is.dts1])], dts1),
  cbind(.dt = as.character(datetime2), dts2),
  by = ".dt", all.x = TRUE, suffixes = c("_dts1", "_dts2")
)[-1]

Merging dataframes of different length by matching dates

Use the dplyr package and try left_join(). This returns all rows from df1 and all columns from both df1 and df2; any rows of df1 with no match in df2 get NA in the new columns.

library(dplyr)
left_join(df1, df2, by = "date_time")

Check out the other types of join you can have with ?join.
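For instance, a small sketch contrasting a few of those joins, with made-up df1/df2 (hypothetical data, sharing a date_time key):

```r
library(dplyr)

df1 <- data.frame(date_time = c("01:00", "02:00", "03:00"), count = c(5, 7, 2))
df2 <- data.frame(date_time = c("02:00", "03:00", "04:00"), dist = c(1.2, 3.4, 5.6))

inner_join(df1, df2, by = "date_time")  # only keys present in both: 2 rows
full_join(df1, df2, by = "date_time")   # all keys from either side: 4 rows
anti_join(df1, df2, by = "date_time")   # rows of df1 with no match: 1 row
```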

merge dataframes based on common columns but keeping all rows from x

I think you have data.tables rather than plain data frames, and merge dispatches to a different method for each. You could force the data-frame method with NewDataframe <- merge.data.frame(x, y, all.x = TRUE), which by default merges on all shared column names.
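A minimal sketch of that behavior, with made-up x and y sharing an id column:

```r
x <- data.frame(id = c("a", "b", "c"), v1 = 1:3)
y <- data.frame(id = c("a", "c"), v2 = c(10, 30))

# all.x = TRUE keeps every row of x; unmatched rows get NA in y's columns
merge.data.frame(x, y, all.x = TRUE)
#   id v1 v2
# 1  a  1 10
# 2  b  2 NA
# 3  c  3 30
```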

Merge 2 data frames using common date, plus 2 rows before and n-1 rows after

Here is one approach using the tidyverse and fuzzyjoin packages.

First, number the events in your first data.frame. Then add two columns for the start and end dates (the start date is 2 days before the date, and the end date is length - 1 days after it).

Then you can use fuzzy_inner_join to pull the matching rows from the second data.frame: you want the rows whose datetime falls on or after the start date and on or before the end date from the first data.frame.

library(tidyverse)
library(fuzzyjoin)

df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2
df1$end_date <- df1$datetime + df1$length - 1

fuzzy_inner_join(
  df1,
  df2,
  by = c("start_date" = "datetime", "end_date" = "datetime"),
  match_fun = c(`<=`, `>=`)
) %>%
  select(datetime.y, length, q, event)

I tried this out with some made-up data:

R> df1
datetime length
1 2003-06-03 1
2 2003-06-12 1
3 2003-06-21 1
4 2003-06-30 3
5 2003-07-09 5
6 2003-07-18 1
7 2003-07-27 1
8 2003-08-05 2
9 2003-08-14 1
10 2003-08-23 1
11 2003-09-01 3

R> df2
datetime q
1 2003-06-03 44
2 2003-06-04 52
3 2003-06-05 34
4 2003-06-06 20
5 2003-06-07 57
6 2003-06-08 67
7 2003-06-09 63
8 2003-06-10 51
9 2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16 6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03 9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70

This will give the following output:

   datetime.y length  q event
1 2003-06-03 1 44 1
2 2003-06-10 1 51 2
3 2003-06-11 1 56 2
4 2003-06-12 1 37 2
5 2003-06-19 1 61 3
6 2003-06-20 1 42 3
7 2003-06-21 1 28 3
8 2003-06-28 3 38 4
9 2003-06-29 3 27 4
10 2003-06-30 3 49 4
11 2003-07-01 3 10 4
12 2003-07-02 3 89 4
13 2003-07-07 5 31 5
14 2003-07-08 5 29 5
15 2003-07-09 5 84 5
16 2003-07-10 5 60 5
17 2003-07-11 5 19 5
18 2003-07-12 5 97 5
19 2003-07-13 5 35 5

If the desired output differs from the above, please let me know what should change so that I can correct it.


Data

df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233, 
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"),
length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11,
start_date = structure(c(12204, 12213, 12222, 12231, 12240,
12249, 12258, 12267, 12276, 12285, 12294), class = "Date"),
end_date = structure(c(12206, 12215, 12224, 12235, 12246,
12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")

df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209,
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218,
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227,
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236,
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245,
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L,
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L,
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L,
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))

Merge 2 dataframes by matching dates

Instead of using merge with data.table, you can also simply join as follows:

setDT(df1)
setDT(df2)

df2[df1, on = c('id','dates')]

this gives:

> df2[df1, on = c('id','dates')]
id dates field1 field2
1: MUM-1 2015-07-10 1 0
2: MUM-1 2015-07-11 NA NA
3: MUM-1 2015-07-12 2 1
4: MUM-2 2014-01-14 4 3
5: MUM-2 2014-01-15 NA NA
6: MUM-2 2014-01-16 NA NA
7: MUM-2 2014-01-17 0 1

Doing this with dplyr:

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:

dt1 <- data.table(id = gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates = c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by = "days"),
                            seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by = "days")))
dt2 <- data.table(id = gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates = c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by = "days"),
                            seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by = "days")),
                  field1 = sample(c(0, 1, 2, 3, 4), size = 730, replace = TRUE),
                  field2 = sample(c(0, 1, 2, 3, 4), size = 730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]

As can be seen, @Arun's approach is slightly faster:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
  jaap = dt2[dt1, on = c('id', 'dates')],
  pavo = merge(dt1, dt2, by = "id", allow.cartesian = TRUE),
  dplr = left_join(dt1, dt2, by = c("id", "dates")),
  arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on = c("id", "dates")])

test elapsed relative
4 arun 0.015 1.000
1 jaap 0.016 1.067
3 dplr 0.037 2.467
2 pavo 1.033 68.867

For a comparison on a large dataset, see the answer of @Arun.

How do I combine two data-frames based on two columns?

See the documentation on ?merge, which states:

By default the data frames are merged on the columns with names they both have, 
but separate specifications of the columns can be given by by.x and by.y.

This makes clear that merge will match on more than one column. From the final example given in the documentation:

x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match

This example was meant to demonstrate the use of incomparables, but it illustrates merging on multiple columns as well. You can also specify differently named key columns in x and y using by.x and by.y.
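A short sketch of by.x/by.y, with made-up data frames whose key columns are named differently:

```r
a <- data.frame(key1 = c(1, 2), key2 = c("p", "q"), val_a = c(10, 20))
b <- data.frame(k1 = c(1, 2), k2 = c("p", "q"), val_b = c(100, 200))

# match a$key1 to b$k1 and a$key2 to b$k2; the result keeps x's key names
merge(a, b, by.x = c("key1", "key2"), by.y = c("k1", "k2"))
```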

Matching data from one dataframe to a time block that it fits within in a second dataframe

Something like this?

# for each df2 time, find the first df1 row whose datetime comes after it
findRow <- function(dt, df) min(which(df$datetime > dt))
rows <- sapply(df2$datetime, findRow, df = df1)
res <- cbind(df2, df1[rows, ])

datetime a n dpm datetime count
2 10/11/2012 16:26 2.03 27 3473 10/11/2012 16:35 55
7 10/11/2012 17:24 1.35 28 3636 10/11/2012 17:25 455
13 10/11/2012 18:21 7.63 29 3516 10/11/2012 18:25 583

PS1: I think the count of your expected result is wrong on row #1

PS2: It would have been easier if you had provided the datasets in a directly usable form. I had to do:

d1 <- 
'datetime count
10/11/2012 16:25 231
...
'
d2 <-
'datetime a n dpm
10/11/2012 16:26 2.03 27 3473
10/11/2012 17:24 1.35 28 3636
10/11/2012 18:21 7.63 29 3516
'

.parse <- function(s) {
  cs <- gsub('\\s\\s+', '\t', s)
  read.table(text = cs, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
}

df1 <- .parse(d1)
df2 <- .parse(d2)

Find closest timestamps between two dataframes and merge different columns when time difference is 60s

This can be directly expressed in sql:

library(sqldf)
sqldf("select a.*, b.Data df1_Data
       from df2 a
       left join df1 b on abs(a.Timestamp - b.Timestamp) < 60")

giving:

            Timestamp Data df1_Data
1 2019-12-31 19:00:10 10 1
2 2019-12-31 19:02:30 11 2
3 2019-12-31 19:12:45 12 7
4 2019-12-31 19:20:15 13 NA

