Join Two Data Frames in R Based on Closest Timestamp

Join two data frames in R based on closest timestamp

Using the rolling-join feature of data.table with roll = "nearest":

require(data.table) # v1.9.6+
setDT(table1)[, val2 := setDT(table2)[table1, val2, on = "date", roll = "nearest"]]

Here, the val2 column is created by joining on the date column with the roll = "nearest" option. For each row of table1$date, the closest matching row in table2$date is found, and the corresponding val2 value is extracted.
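For intuition, the nearest-match lookup can be sketched in base R (a toy illustration with numeric stand-ins, not data.table's actual implementation):

```r
# Toy stand-ins for table1$date, table2$date and table2$val2
t1   <- c(5, 12, 20)
t2   <- c(1, 10, 21)
val2 <- c("a", "b", "c")

# for each value in t1, index of the closest value in t2
nearest <- sapply(t1, function(x) which.min(abs(t2 - x)))
val2[nearest]
# [1] "a" "b" "c"
```

data.table does this with a binary search rather than a full scan of the other table, so it scales far better than this sketch.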

Left join two dataframes based on nearest timestamp in R

Here is a roundabout way using dplyr:

library(dplyr)
df1 %>%
  left_join(df2 %>%
              left_join(df1) %>%
              mutate(date_diff = abs(date2 - date1)) %>%
              group_by(ID, date2) %>%
              filter(date_diff == min(date_diff)) %>%
              ungroup() %>%
              select(-date2, -date_diff)) %>%
  mutate(Flag = case_when(is.na(Flag) ~ "No",
                          TRUE ~ Flag))
Joining, by = "ID"
Joining, by = c("ID", "date1")
# A tibble: 7 x 3
     ID date1               Flag 
  <dbl> <dttm>              <chr>
1     1 2020-07-11 19:14:23 Yes  
2     1 2020-07-21 13:11:10 Yes  
3     1 2020-07-21 18:07:25 No   
4     1 2020-07-28 18:18:11 No   
5     2 2020-07-13 16:47:26 Yes  
6     2 2020-07-18 17:11:37 Yes  
7     3 2020-07-23 10:39:19 No   

Join two dataframes with several columns in R based on closest timestamp

Enter the world of data.table's rolling joins.

Sample data:

library(data.table)
# or use setDT(df1); setDT(df2) to convert existing
# data.frames df1 and df2 to data.table in place
df1 <- data.table::fread("Timestamp Var1 Var2
01-01-20T10:47 7 8
01-01-20T11:50 6 4")

df2 <- data.table::fread("Timestamp Var851 Var852
01-01-20T10:55 4 1
01-01-20T12:08 3 4")

# timestamps/dates must be of POSIXct or Date class to be able
# to roll-join on them
df1[, Timestamp := as.POSIXct(Timestamp, format = "%d-%m-%yT%H:%M")]
df2[, Timestamp := as.POSIXct(Timestamp, format = "%d-%m-%yT%H:%M")]

Code:

df2[df1, roll = "nearest", on = .(Timestamp)]

# Timestamp Var851 Var852 Var1 Var2
# 1: 2020-01-01 10:47:00 4 1 7 8
# 2: 2020-01-01 11:50:00 3 4 6 4

Find closest timestamps between two dataframes and merge different columns when time difference is 60s

This can be expressed directly in SQL:

library(sqldf)
sqldf("select a.*, b.Data df1_Data
from df2 a
left join df1 b on abs(a.Timestamp - b.Timestamp) < 60")

giving:

            Timestamp Data df1_Data
1 2019-12-31 19:00:10   10        1
2 2019-12-31 19:02:30   11        2
3 2019-12-31 19:12:45   12        7
4 2019-12-31 19:20:15   13       NA
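This works because POSIXct timestamps are stored as numeric seconds since the epoch, so SQLite compares plain numbers and abs(a.Timestamp - b.Timestamp) is a difference in seconds. A quick base-R check of that assumption:

```r
# POSIXct is numeric seconds under the hood, so a 60-second window
# is just an absolute difference below 60
t_a <- as.POSIXct("2020-01-01 00:00:10", tz = "UTC")
t_b <- as.POSIXct("2020-01-01 00:00:55", tz = "UTC")
abs(as.numeric(t_a) - as.numeric(t_b))        # 45 (seconds apart)
abs(as.numeric(t_a) - as.numeric(t_b)) < 60   # TRUE
```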

Matching two dataframes based on a common ID and the closest timestamp (less than 5 min) in R

Using the data frames shown reproducibly in the Note at the end, perform a left join using the indicated condition, grouping by dat1 rows and taking the minimum seconds difference over the matched rows. Remove the seconds-difference column (column 4) at the end. Note that 5 minutes is 60 * 5 seconds.

library(sqldf)

out <- sqldf("select a.ID,
                     a.timestamp,
                     b.timestamp [timestamp.y],
                     min(abs(a.timestamp - b.timestamp)) seconds,
                     b.x1,
                     b.x2
              from dat1 a
              left join dat2 b on a.ID = b.ID and
                abs(a.timestamp - b.timestamp) < 60 * 5
              group by a.rowid")[-4]
out$timestamp.y <- as.POSIXct(out$timestamp.y, origin = "1970-01-01")

# check
all.equal(out, target)
## [1] TRUE

Note

Assume the inputs and target shown reproducibly below. Note that the timestamp columns have class POSIXct.

dat1 <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643,
1603712596, 1603719412), class = c("POSIXct", "POSIXt"), tzone = "")),
row.names = c(NA, -3L), class = "data.frame")

dat2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
timestamp = structure(c(1603713323, 1603708763, 1603715319,
1603724172, 1603751992, 1603719472, 1603721777,
1603799685, 1603801544), class = c("POSIXct", "POSIXt"), tzone = ""),
x1 = c("a", "b", "c", "d", "e", "a", "b", "c", "d"), x2 = c("c",
"b", "e", "a", "e", "a", "f", "d", "a")), row.names = c(NA,
-9L), class = "data.frame")

target <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643,
1603712596, 1603719412), class = c("POSIXct", "POSIXt")),
timestamp.y = structure(c(1603708763,
NA, 1603719472), class = c("POSIXct", "POSIXt")),
x1 = c("b", NA, "a"), x2 = c("b", NA, "a")), row.names = c(NA,
-3L), class = "data.frame")

Join two data frames in R based on closest timestamp within groups

We can join the data frames by id and then calculate the time difference and keep the observation with the minimal time difference for each individual:

library(tidyverse)
df2 %>%
  left_join(df1, by = "id") %>%
  mutate(time_dif = abs(time.x - time.y)) %>%
  group_by(id) %>%
  filter(time_dif == min(time_dif))

# A tibble: 5 x 5
# Groups: id [5]
     id time.x              score time.y              time_dif
  <dbl> <dttm>              <dbl> <dttm>              <drtn>  
1     1 2017-01-25 00:00:00   4   2017-01-11 00:00:00 14 days 
2     2 2017-02-02 00:00:00  26   2017-02-01 00:00:00  1 days 
3     3 2017-06-15 00:00:00  38   2017-03-23 00:00:00 84 days 
4     4 2017-06-20 00:00:00  45.2 2017-03-31 00:00:00 81 days 
5     5 2017-01-31 00:00:00  61.4 2017-04-12 00:00:00 71 days 

Data

df1 <- structure(list(id = c(1, 2, 3, 4, 5), time = structure(c(1484092800, 
1485907200, 1490227200, 1490918400, 1491955200), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))



df2 <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5), time = structure(c(1466553600,
1465948800, 1453420800, 1485302400, 1433030400, 1421712000, 1453852800,
1485302400, 1485993600, 1517529600, 1400544000, 1434067200, 1466985600,
1497484800, 1390003200, 1516060800, 1464825600, 1497916800, 1527638400,
1454025600, 1390608000, 1421712000, 1466467200, 1453852800, 1485820800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), score = c(3,
2, 5, 4, 5, 24.2, 24.8, 25.4, 26, 26.6, 36.2, 36.8, 37.4, 38,
38.6, 44, 44.6, 45.2, 45.8, 46.4, 59, 59.6, 60.2, 60.8, 61.4)), row.names = c(NA,
-25L), class = c("tbl_df", "tbl", "data.frame"))
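The same group-wise logic (join by id, then keep the row with the smallest gap) can be sketched in base R; g1 and g2 below are hypothetical toy frames, not the df1/df2 above:

```r
# g1 has one target time per id; g2 has several candidate rows per id
g1 <- data.frame(id = c(1, 2), time = c(100, 200))
g2 <- data.frame(id = c(1, 1, 2, 2), time = c(90, 130, 195, 260),
                 score = c(3, 2, 24, 25))

m <- merge(g2, g1, by = "id", suffixes = c(".x", ".y"))
m$time_dif <- abs(m$time.x - m$time.y)

# per id, keep the candidate row with the minimal time difference
out <- do.call(rbind, lapply(split(m, m$id), function(g) g[which.min(g$time_dif), ]))
out$score
# [1]  3 24
```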

Merge two dataframes by nearest date in R

I suggest two approaches. The first uses a distance matrix and performs a left join of df1 to df2. Namely, the distance matrix is given by:

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
abs()

Next, for each row of df1, the row of df2 with closest distance is given by:

closest.df1 <- apply(dateDist, 1, which.min)

Finally, the merge is performed manually:

cbind(rename_with(df1, ~paste0("df1.", .x)),
      rename_with(df2[closest.df1,], ~paste0("df2.", .x)))

##>+ df1.date df1.value df2.date df2.value
##>1 2021-11-23 20:56:06 500 2021-11-23 20:55:47 Ship Emma
##>1.1 2021-11-23 20:56:07 900 2021-11-23 20:55:47 Ship Emma
##>1.2 2021-11-23 20:56:08 1000 2021-11-23 20:55:47 Ship Emma
##>1.3 2021-11-23 20:56:09 200 2021-11-23 20:55:47 Ship Emma
##>1.4 2021-11-23 20:56:10 300 2021-11-23 20:55:47 Ship Emma
##>1.5 2021-11-23 20:56:11 10 2021-11-23 20:55:47 Ship Emma
##>5 2021-11-23 22:13:56 1000 2021-11-23 22:16:01 Ship Amy
##>5.1 2021-11-23 22:13:57 450 2021-11-23 22:16:01 Ship Amy
##>5.2 2021-11-23 22:13:58 950 2021-11-23 22:16:01 Ship Amy
##>5.3 2021-11-23 22:13:59 600 2021-11-23 22:16:01 Ship Amy
##>12 2021-11-24 03:23:21 100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22 750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23 150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24 200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25 300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34 400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35 900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36 1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37 800 2021-11-24 03:23:37 Ship Sally

The second approach first computes the Cartesian product of all rows of df1 and df2, and then keeps only the rows with the minimum distance. The trick here is to use inner_join(..., by = character()) to get all combinations of the two data frames:

mutate(df1, id = row_number()) %>%
  inner_join(mutate(df2, id = row_number()), by = character()) %>%
  mutate(dist = abs(date.x - date.y)) %>%
  group_by(id.x) %>%
  filter(dist == min(dist)) %>%
  select(-id.x, -id.y, -dist)

##>+ # A tibble: 19 × 7
##># Groups: id.x [19]
##> date.x value.x id.x date.y value.y id.y dist
##> <dttm> <dbl> <int> <dttm> <chr> <int> <drtn>
##> 1 2021-11-23 20:56:06 500 1 2021-11-23 20:55:47 Ship Emma 1 19 s…
##> 2 2021-11-23 20:56:07 900 2 2021-11-23 20:55:47 Ship Emma 1 20 s…
##> 3 2021-11-23 20:56:08 1000 3 2021-11-23 20:55:47 Ship Emma 1 21 s…
##> 4 2021-11-23 20:56:09 200 4 2021-11-23 20:55:47 Ship Emma 1 22 s…
##> 5 2021-11-23 20:56:10 300 5 2021-11-23 20:55:47 Ship Emma 1 23 s…
##> 6 2021-11-23 20:56:11 10 6 2021-11-23 20:55:47 Ship Emma 1 24 s…
##> 7 2021-11-23 22:13:56 1000 7 2021-11-23 22:16:01 Ship Amy 5 125 s…
##> 8 2021-11-23 22:13:57 450 8 2021-11-23 22:16:01 Ship Amy 5 124 s…
##> 9 2021-11-23 22:13:58 950 9 2021-11-23 22:16:01 Ship Amy 5 123 s…
##>10 2021-11-23 22:13:59 600 10 2021-11-23 22:16:01 Ship Amy 5 122 s…
##>11 2021-11-24 03:23:21 100 11 2021-11-24 03:23:37 Ship Sally 12 16 s…
##>12 2021-11-24 03:23:22 750 12 2021-11-24 03:23:37 Ship Sally 12 15 s…
##>13 2021-11-24 03:23:23 150 13 2021-11-24 03:23:37 Ship Sally 12 14 s…
##>14 2021-11-24 03:23:24 200 14 2021-11-24 03:23:37 Ship Sally 12 13 s…
##>15 2021-11-24 03:23:25 300 15 2021-11-24 03:23:37 Ship Sally 12 12 s…
##>16 2021-11-24 03:24:34 400 16 2021-11-24 03:23:37 Ship Sally 12 57 s…
##>17 2021-11-24 03:24:35 900 17 2021-11-24 03:23:37 Ship Sally 12 58 s…
##>18 2021-11-24 03:24:36 1020 18 2021-11-24 03:23:37 Ship Sally 12 59 s…
##>19 2021-11-24 03:24:37 800 19 2021-11-24 03:23:37 Ship Sally 12 60 s…
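The cross-join trick has a base-R analogue: merge(x, y, by = NULL) also returns the Cartesian product, and the same filter-by-minimum step applies (toy numeric data for illustration):

```r
a <- data.frame(date.x = c(1, 10))
b <- data.frame(date.y = c(2, 12))

cp <- merge(a, b, by = NULL)          # Cartesian product: 2 x 2 = 4 rows
cp$dist <- abs(cp$date.x - cp$date.y)

# for each row of `a`, keep the closest row of `b`
nearest <- do.call(rbind, lapply(split(cp, cp$date.x), function(g) g[which.min(g$dist), ]))
nearest$date.y
# [1]  2 12
```

As with the dplyr version, the intermediate table grows as nrow(df1) * nrow(df2), so this is only practical for modest data sizes.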

How to join two dataframes by nearest time-date?

data.table should work for this (can you explain the error you're running into?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually, to keep data.table happy). Also make sure you set the key column before using roll.

(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):

new <- data.table(date = as.POSIXct(c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00")),
                  data.new = c("t", "u", "v"))
head(new, 2)

date data.new
1: 2016-03-02 12:20:00 t
2: 2016-03-07 12:20:00 u

old <- data.table(date = as.POSIXct(c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00")),
                  data.old = c("a", "b", "c", "d"))
head(old, 2)


date data.old
1: 2016-03-02 12:20:00 a
2: 2016-03-07 12:20:00 b

setkey(new, date)
setkey(old, date)

combined <- new[old, roll = "nearest"]
combined

date data.new data.old
1: 2015-03-02 12:20:00 t d
2: 2016-03-02 12:20:00 t a
3: 2016-03-07 12:20:00 u b
4: 2016-04-02 12:20:00 v c

I've intentionally given the two tables different numbers of rows, to show how the rolling join deals with multiple matches. You can switch the direction of the join with:

combined <- old[new, roll = "nearest"]
combined

date data.old data.new
1: 2016-03-02 12:20:00 a t
2: 2016-03-07 12:20:00 b u
3: 2016-04-02 12:20:00 c v
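On current data.table versions the same joins can also be written without setkey() by passing the join column via on= (a small sketch assuming data.table is installed):

```r
library(data.table)

new <- data.table(date = as.POSIXct(c("2016-03-02 12:20:00", "2016-03-07 12:20:00")),
                  data.new = c("t", "u"))
old <- data.table(date = as.POSIXct(c("2016-03-02 12:20:00", "2015-03-02 12:20:00")),
                  data.old = c("a", "d"))

# on= supplies the join column inline, so no setkey() call is needed;
# both new dates roll to the nearest old date (2016-03-02)
combined <- old[new, roll = "nearest", on = "date"]
```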

