How to merge two data frames in R by a common column with mismatched date/time values
After first converting your datetime character strings to POSIXt classes, some combination of rounding and truncating those times (with round and trunc) should get you something you can use as the basis of a merge.
First read in your data, and create corresponding POSIXt datetimes:
dts1 <- structure(list(datetime = structure(1:6,
.Label = c("30/03/2011 02:32", "30/03/2011 02:42",
"30/03/2011 02:52", "30/03/2011 03:02", "30/03/2011 03:12",
"30/03/2011 03:22"), class = "factor"), count = c(27L, 3L,
0L, 1L, 15L, 0L), period = c(561L, 600L, 574L, 550L, 600L,
597L)), .Names = c("datetime", "count", "period"),
class = "data.frame", row.names = c(NA, -6L))
dts2 <- structure(list(datetime = structure(1:7,
.Label = c("30/03/2011 01:59", "30/03/2011 02:58",
"30/03/2011 03:55", "30/03/2011 04:53", "30/03/2011 05:52",
"30/03/2011 06:48", "30/03/2011 07:48"), class = "factor"),
dist = c(23.9, 14.7, 10.4, 35.4, 56.1, 12.3, 10.7), car =
c(1L, 1L, 2L, 1L, 1L, 1L, 1L), satd = c(3L, 7L, 9L, 3L, 7L,
4L, 5L), alt = c(1.76, 6.36, -0.34, 3.55, -0.91, 6.58,
4.18)), .Names = c("datetime", "dist", "car", "satd",
"alt"), class = "data.frame", row.names = c(NA, -7L))
# create corresponding POSIXlt vector
# (you could update the 'datetime' columns in-place if you prefer)
datetime1 <- strptime(dts1$datetime, format="%d/%m/%Y %H:%M")
datetime2 <- strptime(dts2$datetime, format="%d/%m/%Y %H:%M")
The following code produces a merged table based on the nearest hour in all cases. Inside the merge it's just prepending a column with the rounded times to each of your data frames, merging based on that (i.e., column number 1), then using the -1 index to remove that column at the end:
# merge based on nearest hour
merge(
cbind(round(datetime1, "hours"), dts1),
cbind(round(datetime2, "hours"), dts2),
by=1, suffixes=c("_dts1", "_dts2")
)[-1]
datetime_dts1 count period datetime_dts2 dist car satd alt
1 30/03/2011 02:32 27 561 30/03/2011 02:58 14.7 1 7 6.36
2 30/03/2011 02:42 3 600 30/03/2011 02:58 14.7 1 7 6.36
3 30/03/2011 02:52 0 574 30/03/2011 02:58 14.7 1 7 6.36
4 30/03/2011 03:02 1 550 30/03/2011 02:58 14.7 1 7 6.36
5 30/03/2011 03:12 15 600 30/03/2011 02:58 14.7 1 7 6.36
6 30/03/2011 03:22 0 597 30/03/2011 02:58 14.7 1 7 6.36
As above, but this time just truncating to the hour:
merge(
cbind(trunc(datetime1, "hours"), dts1),
cbind(trunc(datetime2, "hours"), dts2),
by=1, suffixes=c("_dts1", "_dts2")
)[-1]
datetime_dts1 count period datetime_dts2 dist car satd alt
1 30/03/2011 02:32 27 561 30/03/2011 02:58 14.7 1 7 6.36
2 30/03/2011 02:42 3 600 30/03/2011 02:58 14.7 1 7 6.36
3 30/03/2011 02:52 0 574 30/03/2011 02:58 14.7 1 7 6.36
4 30/03/2011 03:02 1 550 30/03/2011 03:55 10.4 2 9 -0.34
5 30/03/2011 03:12 15 600 30/03/2011 03:55 10.4 2 9 -0.34
6 30/03/2011 03:22 0 597 30/03/2011 03:55 10.4 2 9 -0.34
As above, but for dts1 treat the record as belonging to previous hour until 10 minutes past the hour, by subtracting 10*60 seconds before truncating. This one produces the same output you specified, though without more information I'm not sure that it's the exact rule you want.
merge(
cbind(trunc(datetime1 - 10*60, "hours"), dts1),
cbind(trunc(datetime2, "hours"), dts2),
by=1, suffixes=c("_dts1", "_dts2")
)[-1]
datetime_dts1 count period datetime_dts2 dist car satd alt
1 30/03/2011 02:32 27 561 30/03/2011 02:58 14.7 1 7 6.36
2 30/03/2011 02:42 3 600 30/03/2011 02:58 14.7 1 7 6.36
3 30/03/2011 02:52 0 574 30/03/2011 02:58 14.7 1 7 6.36
4 30/03/2011 03:02 1 550 30/03/2011 02:58 14.7 1 7 6.36
5 30/03/2011 03:12 15 600 30/03/2011 03:55 10.4 2 9 -0.34
6 30/03/2011 03:22 0 597 30/03/2011 03:55 10.4 2 9 -0.34
You could tweak the details of which ones you round, which ones you truncate, and whether you first subtract/add some time depending on your specific rule.
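For instance, a hypothetical variant of the 10-minute rule above would treat a record as belonging to the following hour once it is 30 minutes past, by adding 30*60 seconds before truncating:

```r
# Hypothetical variant of the rule: a record counts toward the *next*
# hour once it is 30 minutes past, implemented by shifting forward
# 30*60 seconds before truncating to the hour
t1 <- strptime("30/03/2011 02:42", format = "%d/%m/%Y %H:%M")
t2 <- strptime("30/03/2011 02:12", format = "%d/%m/%Y %H:%M")

shifted1 <- trunc(t1 + 30*60, "hours")  # 02:42 -> 03:00
shifted2 <- trunc(t2 + 30*60, "hours")  # 02:12 -> 02:00
```

The same shift could be applied inside the cbind() calls above in place of the - 10*60 adjustment.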
Edit:
Not the most elegant, but here is a different approach that accommodates the more complicated conditional rule you described in your comments. This leans heavily on na.locf from the zoo package to first determine which dts2 times come before and after each dts1 record. With those in hand, it's just a matter of applying the rule to select the desired dts2 time, matching back to the original dts1 table, then merging.
library(zoo)
# create ordered list of all datetimes, using names to keep
# track of which ones come from each data frame
alldts <- sort(c(
setNames(datetime1, rep("dts1", length(datetime1))),
setNames(datetime2, rep("dts2", length(datetime2)))))
is.dts1 <- names(alldts)=="dts1"
# for each dts1 record, get previous closest dts2 time
dts2.prev <- alldts
dts2.prev[is.dts1] <- NA
dts2.prev <- na.locf(dts2.prev, na.rm=FALSE)[is.dts1]
# for each dts1 record, get next closest dts2 time
dts2.next <- alldts
dts2.next[is.dts1] <- NA
dts2.next <- na.locf(dts2.next, na.rm=FALSE, fromLast=TRUE)[is.dts1]
# for each dts1 record, apply rule to choose dts2 time
use.prev <- !is.na(dts2.prev) & (alldts[is.dts1] - dts2.prev < 5)
dts2.to.use <- ifelse(use.prev, as.character(dts2.prev),
as.character(dts2.next))
# merge based on chosen dts2 times, prepended as character vector
# for the purpose of merging
merge(
cbind(.dt=dts2.to.use[match(datetime1, alldts[is.dts1])], dts1),
cbind(.dt=as.character(datetime2), dts2),
by=".dt", all.x=TRUE, suffixes=c("_dts1", "_dts2")
)[-1]
Merging dataframes of different length by matching dates
Use the dplyr package and try left_join(). This returns all rows from df1 and all columns from both df1 and df2. Any rows in df1 with no match will receive NA.
library(dplyr)
left_join(df1, df2, by = "date_time")
Check out the other types of join you can have with ?join.
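As a minimal sketch with made-up df1 and df2 (the real column names come from your own data):

```r
library(dplyr)

# Made-up data: df1 has one date_time value with no match in df2
df1 <- data.frame(date_time = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
                  x = 1:3)
df2 <- data.frame(date_time = as.Date(c("2020-01-01", "2020-01-03")),
                  y = c(10, 30))

# All rows of df1 are kept; the unmatched 2020-01-02 row gets NA for y
joined <- left_join(df1, df2, by = "date_time")
joined
```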
merge dataframes based on common columns but keeping all rows from x
I think you have data.tables rather than plain data frames, and merge works slightly differently between the two. You could try forcing it to use the data frame method with NewDataframe <- merge.data.frame(x, y, all.x=TRUE), which should by default merge on all shared column names.
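A small sketch of that call, with hypothetical x and y sharing an id column:

```r
# Hypothetical inputs sharing the column 'id'
x <- data.frame(id = 1:4, a = letters[1:4])
y <- data.frame(id = c(2L, 4L), b = c("B", "D"))

# all.x = TRUE keeps every row of x; unmatched rows get NA for y's columns
NewDataframe <- merge.data.frame(x, y, all.x = TRUE)
NewDataframe
```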
Merge 2 data frames using common date, plus 2 rows before and n-1 rows after
This might be one approach using tidyverse and fuzzyjoin.
First, indicate event numbers in your first data.frame. Add two columns to indicate the start and end dates (the start date is 2 days before the date, and the end date is length days - 1 after the date).
Then, you can use fuzzy_inner_join to get the selected rows from the second data.frame. Here, you will want to include rows where the datetime in the second data.frame falls after the start date and before the end date of the first data.frame.
library(tidyverse)
library(fuzzyjoin)
df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2
df1$end_date <- df1$datetime + df1$length - 1
fuzzy_inner_join(
df1,
df2,
by = c("start_date" = "datetime", "end_date" = "datetime"),
match_fun = c(`<=`, `>=`)
) %>%
select(datetime.y, length, q, event)
I tried this out with some made up data:
R> df1
datetime length
1 2003-06-03 1
2 2003-06-12 1
3 2003-06-21 1
4 2003-06-30 3
5 2003-07-09 5
6 2003-07-18 1
7 2003-07-27 1
8 2003-08-05 2
9 2003-08-14 1
10 2003-08-23 1
11 2003-09-01 3
R> df2
datetime q
1 2003-06-03 44
2 2003-06-04 52
3 2003-06-05 34
4 2003-06-06 20
5 2003-06-07 57
6 2003-06-08 67
7 2003-06-09 63
8 2003-06-10 51
9 2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16 6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03 9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70
This will give the following output:
datetime.y length q event
1 2003-06-03 1 44 1
2 2003-06-10 1 51 2
3 2003-06-11 1 56 2
4 2003-06-12 1 37 2
5 2003-06-19 1 61 3
6 2003-06-20 1 42 3
7 2003-06-21 1 28 3
8 2003-06-28 3 38 4
9 2003-06-29 3 27 4
10 2003-06-30 3 49 4
11 2003-07-01 3 10 4
12 2003-07-02 3 89 4
13 2003-07-07 5 31 5
14 2003-07-08 5 29 5
15 2003-07-09 5 84 5
16 2003-07-10 5 60 5
17 2003-07-11 5 19 5
18 2003-07-12 5 97 5
19 2003-07-13 5 35 5
If the output desired is different than above, please let me know what should be different so that I can correct it.
Data
df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233,
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"),
length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11,
start_date = structure(c(12204, 12213, 12222, 12231, 12240,
12249, 12258, 12267, 12276, 12285, 12294), class = "Date"),
end_date = structure(c(12206, 12215, 12224, 12235, 12246,
12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")
df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209,
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218,
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227,
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236,
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245,
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L,
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L,
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L,
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))
Merge 2 dataframes by matching dates
Instead of using merge with data.table, you can also simply join as follows:
setDT(df1)
setDT(df2)
df2[df1, on = c('id','dates')]
this gives:
> df2[df1, on = c('id','dates')]
id dates field1 field2
1: MUM-1 2015-07-10 1 0
2: MUM-1 2015-07-11 NA NA
3: MUM-1 2015-07-12 2 1
4: MUM-2 2014-01-14 4 3
5: MUM-2 2014-01-15 NA NA
6: MUM-2 2014-01-16 NA NA
7: MUM-2 2014-01-17 0 1
Doing this with dplyr:
library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))
As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:
dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]
As can be seen, @Arun's approach is slightly faster:
library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
jaap = dt2[dt1, on = c('id','dates')],
pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
dplr = left_join(dt1, dt2, by=c("id","dates")),
arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on=c("id", "dates")])
test elapsed relative
4 arun 0.015 1.000
1 jaap 0.016 1.067
3 dplr 0.037 2.467
2 pavo 1.033 68.867
For a comparison on a large dataset, see the answer of @Arun.
How do I combine two data-frames based on two columns?
See the documentation on ?merge, which states:
By default the data frames are merged on the columns with names they both have,
but separate specifications of the columns can be given by by.x and by.y.
This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:
x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match
This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
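A short sketch of the by.x/by.y form, with hypothetical frames whose key columns are named differently:

```r
# Hypothetical frames whose two key columns have different names
x <- data.frame(k1 = 1:3, k2 = c("a", "b", "c"), val_x = 10:12)
y <- data.frame(key1 = 2:3, key2 = c("b", "c"), val_y = c(200, 300))

# Match on both key columns at once
merged <- merge(x, y, by.x = c("k1", "k2"), by.y = c("key1", "key2"))
merged
```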
Matching data from one dataframe to a time block that it fits within in a second dataframe
Something like this?
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
res <- cbind(df2, df1[rows,])
datetime a n dpm datetime count
2 10/11/2012 16:26 2.03 27 3473 10/11/2012 16:35 55
7 10/11/2012 17:24 1.35 28 3636 10/11/2012 17:25 455
13 10/11/2012 18:21 7.63 29 3516 10/11/2012 18:25 583
PS1: I think the count of your expected result is wrong on row #1
PS2: It would have been easier if you had provided the datasets in a directly usable form.
I had to do:
d1 <-
'datetime count
10/11/2012 16:25 231
...
'
d2 <-
'datetime a n dpm
10/11/2012 16:26 2.03 27 3473
10/11/2012 17:24 1.35 28 3636
10/11/2012 18:21 7.63 29 3516
'
.parse <- function(s) {
cs <- gsub('\\s\\s+', '\t', s)
read.table(text=cs, sep="\t", header=TRUE, stringsAsFactors=FALSE)
}
df1 <- .parse(d1)
df2 <- .parse(d2)
Find closest timestamps between two dataframes and merge different columns when time difference is 60s
This can be expressed directly in SQL:
library(sqldf)
sqldf("select a.*, b.Data df1_Data
from df2 a
left join df1 b on abs(a.Timestamp - b.Timestamp) < 60")
giving:
Timestamp Data df1_Data
1 2019-12-31 19:00:10 10 1
2 2019-12-31 19:02:30 11 2
3 2019-12-31 19:12:45 12 7
4 2019-12-31 19:20:15 13 NA
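A self-contained version of the same query, with made-up df1 and df2 (sqldf passes POSIXct values to SQLite as numeric seconds, so abs(...) < 60 means "within one minute"):

```r
library(sqldf)

origin <- as.POSIXct("2019-12-31 19:00:00")

# Made-up data: only the first two df2 rows have a df1 timestamp
# within 60 seconds
df1 <- data.frame(Timestamp = origin + c(5, 155, 760), Data = 1:3)
df2 <- data.frame(Timestamp = origin + c(10, 150, 1215), Data = 11:13)

res <- sqldf("select a.*, b.Data df1_Data
              from df2 a
              left join df1 b on abs(a.Timestamp - b.Timestamp) < 60")
res
```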