Matching timestamped data to closest time in another dataset. Properly vectorized? Faster way?
You can try data.table's rolling join using the "nearest" option:
library(data.table) # v1.9.6+
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 5 7 7 8
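The snippet above assumes existing reference and data objects; a minimal self-contained sketch (with made-up tables) looks like:

```r
library(data.table)

# hypothetical lookup table and query table
reference <- data.table(datetime = as.POSIXct(c("2020-01-01 00:00:00",
                                                "2020-01-01 01:00:00",
                                                "2020-01-01 02:00:00"), tz = "UTC"),
                        refvalue = c(10, 20, 30))
data <- data.table(datetime = as.POSIXct(c("2020-01-01 00:10:00",
                                           "2020-01-01 01:40:00"), tz = "UTC"))

# for each row of `data`, pull refvalue from the nearest reference timestamp
reference[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 10 30
```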
Find the closest date between dataset1 and dataset2
An option using outer:
satDat[apply(abs(outer(satDat, stationDat, difftime, units = 'days')), 2, which.min)]
#> [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"
How it works: outer applies difftime to each pair of elements in the two vectors, returning a matrix of time differences. apply then iterates over the columns of that matrix (MARGIN = 2), calling which.min on each column to get the row index of the smallest difference, and those indices are used to subset satDat.
Note that outer allocates a matrix with dimensions length(satDat) by length(stationDat), which can require a lot of memory if your data is large.
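With hypothetical toy date vectors, the whole pipeline can be checked end to end:

```r
# made-up example dates
satDat     <- as.Date(c("2015-04-16", "2015-04-21"))
stationDat <- as.Date(c("2015-04-15", "2015-04-20", "2015-04-22"))

# one row per satellite date, one column per station date
d <- abs(outer(satDat, stationDat, difftime, units = "days"))
satDat[apply(d, 2, which.min)]
# [1] "2015-04-16" "2015-04-21" "2015-04-21"
```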
Matching data in 2 data frame in R?
A simple solution using base R merge:
merge(data2, data1, all.x = TRUE)
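For illustration (with made-up frames sharing an id column), all.x = TRUE keeps every row of data2 and fills non-matches with NA:

```r
data1 <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"))
data2 <- data.frame(id = c(2, 3, 4))

# left join on the common column "id"
merge(data2, data1, all.x = TRUE)
#   id value
# 1  2     b
# 2  3     c
# 3  4  <NA>
```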
faster method for ordered column names dataframe from numeric dataframe in R
If the number of columns in your df is just three, here is a faster solution using max.col. It is roughly 8x faster than the fastest solution proposed in the other answer when nrow(df)=100.
The case in which nrow(df)=100
library(microbenchmark)
set.seed(123)
size <- 100
df <- data.frame(x = abs(rnorm(size)), y = abs(rnorm(size)), z = abs(rnorm(size)))
f1 <- function(df){
vec <- unlist(t(df))
sq <- seq(0,(nrow(df)-1)*3,3)
m1 <- max.col(df)
# -----------------------
vec[sq+m1] <- -Inf
m2 <- max.col(matrix(vec, ncol=3, byrow=T))
vec[sq+m2] <- -Inf
# -----------------------
m3 <- max.col(matrix(vec, ncol=3, byrow=T))
nm <- names(df)
cbind(nm[m1], nm[m2], nm[m3])
}
all(f1(df)==get_name_df_with_for(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply_names(df))
# [1] TRUE
all(f1(df)==get_name_df_double_t(df))
# [1] TRUE
microbenchmark(f1(df), "f2"=get_name_df_with_for(df), "f3"=get_name_df_with_apply(df),
"f4"=get_name_df_with_apply_names(df), "f5"=get_name_df_double_t(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 395.643 458.0905 470.8278 472.633 492.7355 701.464 100
# f2 59262.146 61773.0865 63098.5840 62963.223 64309.4780 74246.953 100
# f3 5491.521 5637.1605 6754.3912 5801.619 5956.4545 90457.611 100
# f4 3392.689 3463.9055 3603.1546 3569.125 3707.2795 4237.012 100
# f5 5513.335 5636.3045 5954.9277 5781.089 5971.2115 8622.017 100
Significantly faster when nrow(df)=1000
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 693.765 769.8995 878.3698 815.6655 846.4615 3559.929 100
# f2 627876.429 646057.8155 671925.4799 657768.6270 694047.9940 797900.142 100
# f3 49570.397 52038.3515 54334.0501 53838.8465 56181.0515 62517.965 100
# f4 28892.611 30046.8180 31961.4085 31262.4040 33057.5525 48694.850 100
# f5 49866.379 51491.7235 54413.8287 53705.3970 55962.0575 75287.600 100
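The trick in f1 is easiest to see on a tiny input: max.col finds each row's largest column, that cell is overwritten with -Inf in a row-major copy of the data, and max.col is run again for second and third place.

```r
df  <- data.frame(x = c(3, 1), y = c(2, 5), z = c(9, 4))
vec <- unlist(t(df))                  # row-major copy: 3 2 9 1 5 4
sq  <- seq(0, (nrow(df) - 1) * 3, 3)  # offset of each row inside vec
m1  <- max.col(df)                    # column index of each row maximum
vec[sq + m1] <- -Inf                  # knock out the winners
m2  <- max.col(matrix(vec, ncol = 3, byrow = TRUE))  # second place
vec[sq + m2] <- -Inf
m3  <- max.col(matrix(vec, ncol = 3, byrow = TRUE))  # third place
cbind(names(df)[m1], names(df)[m2], names(df)[m3])
#      [,1] [,2] [,3]
# [1,] "z"  "x"  "y"
# [2,] "y"  "z"  "x"
```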
merge by nearest neighbour in group - R
As mentioned by Henrik, this can be solved by updating in a rolling join to the nearest match, which is available in the data.table package. Additionally, the OP has requested taking the most recent date if matches are equally distant.
library(data.table)
setDT(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
`:=`(year.2 = i.year, affpol = i.affpol)]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 2008 2.47
4: AT 2010 3.91 NA NA
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
This operation has updated dat1 by reference, i.e., it has added the two extra columns without copying the whole data object.
Now, the OP has requested to go for the most recent date if matches are equally distant but the join has picked the older date. Apparently, there is no parameter to control this in a rolling join to the nearest.
The workaround is to create a helper variable nyear which holds the negative year, and to join on this:
setDT(dat1)[, nyear := -year][setDT(dat2)[, nyear := -year],
roll = "nearest", on = c("iso2code", "nyear"),
`:=`(year.2 = i.year, affpol = i.affpol)][
, nyear := NULL]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 NA NA
4: AT 2010 3.91 2008 2.47
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
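The effect of the negation can be checked on a minimal made-up pair of tables where one dat2 year sits exactly between two dat1 years (per the tie behaviour described above; the tie direction is what the answer observed, not a documented guarantee):

```r
library(data.table)
dat1 <- data.table(iso2code = "AT", year = c(2008L, 2012L))
dat2 <- data.table(iso2code = "AT", year = 2010L, affpol = 2.2)

# plain nearest join: the equidistant match lands on the older year, 2008
d1 <- copy(dat1)
d1[dat2, roll = "nearest", on = c("iso2code", "year"),
   `:=`(year.2 = i.year, affpol = i.affpol)]

# negated helper key: the same match now lands on the more recent year, 2012
d2 <- copy(dat1)[, nyear := -year]
d2[copy(dat2)[, nyear := -year], roll = "nearest", on = c("iso2code", "nyear"),
   `:=`(year.2 = i.year, affpol = i.affpol)][, nyear := NULL]

d1$year[!is.na(d1$affpol)]  # 2008
d2$year[!is.na(d2$affpol)]  # 2012
```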
joining data based on a moving time window in R
You could use the findInterval function to find the nearest value:
# example data:
x <- rnorm(120000)
y <- rnorm(71000)
y <- sort(y) # second vector must be sorted
id <- findInterval(x, y, all.inside=TRUE) # finds position of last y smaller then x
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # to find nearest
In your case some as.numeric might be needed.
# assumed that SortWeath is sorted, if not then SortWeath <- SortWeath[order(SortWeath$DateTime),]
x <- as.numeric(SortLoc$DateTime)
y <- as.numeric(SortWeath$DateTime)
id <- findInterval(x, y, all.inside=TRUE)
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
SortLoc$WndSp <- SortWeath$WndSp[id_min]
SortLoc$WndDir <- SortWeath$WndDir[id_min]
SortLoc$Hgt <- SortWeath$Hgt[id_min]
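The nearest-match logic is easy to verify on small numeric vectors before running it on the timestamps:

```r
y <- c(1, 5, 10)         # sorted reference values
x <- c(0.5, 6, 9.9, 12)  # queries

id     <- findInterval(x, y, all.inside = TRUE)  # last y at or below x, clamped
id_min <- ifelse(abs(x - y[id]) < abs(x - y[id + 1]), id, id + 1)
y[id_min]
# [1]  1  5 10 10
```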
One addition: you should never, ABSOLUTELY NEVER, add values to a data.frame in a for-loop. Check this comparison:
N=1000
x <- numeric(N)
X <- data.frame(x=x)
require(rbenchmark)
benchmark(
vector = {for (i in 1:N) x[i]<-1},
data.frame = {for (i in 1:N) X$x[i]<-1}
)
# test replications elapsed relative
# 2 data.frame 100 4.32 22.74
# 1 vector 100 0.19 1.00
The data.frame version is over 20 times slower, and the more rows it contains, the bigger the difference.
So if you change your script to first initialize result vectors:
tmp_WndSp <- tmp_WndDir <- tmp_Hgt <- rep(NA, nrow(SortLoc))
then update values in loop
tmp_WndSp[i] <- SortWeath$WndSp[weathrow+1]
# and so on...
and at the end (outside the loop) update proper columns:
SortLoc$WndSp <- tmp_WndSp
SortLoc$WndDir <- tmp_WndDir
SortLoc$Hgt <- tmp_Hgt
It should run much faster.
Finding the nearest value and return the index of array in Python
This is similar to using bisect_left, but it allows you to pass in an array of targets:
import numpy as np

def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A) - 1)
    left = A[idx - 1]
    right = A[idx]
    idx -= target - left < right - target
    return idx
Some explanation:
First the general case: idx = A.searchsorted(target) returns an index for each target such that target is between A[idx - 1] and A[idx]. I call these left and right, so we know that left < target <= right. target - left < right - target is True (or 1) when target is closer to left and False (or 0) when target is closer to right.
Now the special case: when target is less than all the elements of A, searchsorted returns idx = 0. np.clip(idx, 1, len(A) - 1) replaces all values of idx < 1 with 1, so idx = 1. In this case left = A[0], right = A[1], and we know that target <= left <= right. Therefore target - left <= 0 and right - target >= 0, so target - left < right - target is True unless target == left == right, and the subtraction leaves idx = 0.
There is another special case if target is greater than all the elements of A. In that case A.searchsorted(target) returns len(A), and np.clip(idx, 1, len(A) - 1) replaces len(A) with len(A) - 1, so idx = len(A) - 1. Then target - left < right - target ends up False, so idx stays at len(A) - 1. I'll let you work through the logic on your own.
For example:
In [163]: A = np.arange(0, 20.)
In [164]: target = np.array([-2, 100., 2., 2.4, 2.5, 2.6])
In [165]: find_closest(A, target)
Out[165]: array([ 0, 19, 2, 2, 3, 3])
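As a sanity check, the result can be compared against a brute-force distance matrix. Note that exact ties can resolve to the other side (argmin picks the first minimum, find_closest picks the right neighbour), so the comparison below uses tie-free targets:

```python
import numpy as np

def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A) - 1)
    left = A[idx - 1]
    right = A[idx]
    idx -= target - left < right - target
    return idx

def find_closest_bruteforce(A, target):
    # O(len(A) * len(target)) reference: full |A - target| distance matrix
    return np.abs(np.asarray(A)[None, :] - np.asarray(target)[:, None]).argmin(axis=1)

A = np.arange(0, 20.)
target = np.array([-2, 100., 2., 2.4, 2.6])  # tie-free targets
print(find_closest(A, target))               # [ 0 19  2  2  3]
print(find_closest_bruteforce(A, target))    # [ 0 19  2  2  3]
```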