Matching Timestamped Data to Closest Time in Another Dataset. Properly Vectorized? Faster Way

You can try data.table's rolling join with the roll = "nearest" option:

library(data.table) # v1.9.6+
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 5 7 7 8

Find the closest date between dataset1 and dataset2

An option using outer:

satDat[apply(abs(outer(satDat, stationDat, difftime, units = 'days')), 2, which.min)]

#> [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

How it works:

  • outer applies difftime to each pair of elements in the two vectors, returning a matrix,
  • over which apply iterates over the columns (MARGIN = 2), calling which.min on each, which returns the index of the smallest,
  • which is used to subset satDat.

Note that outer allocates a matrix with dimensions of length(satDat) by length(stationDat), which can require a lot of memory if your data is already large.
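
For readers more at home in Python, the three steps above can be sketched with the standard library alone. This is only an illustration: closest_dates is a hypothetical helper, and the dates are made up rather than taken from the question. It builds one column of gaps at a time, so it never holds the full matrix, but it still performs len(candidates) * len(targets) comparisons.

```python
from datetime import date

def closest_dates(candidates, targets):
    """For each target date, return the candidate date with the smallest absolute gap."""
    result = []
    for t in targets:
        # One "column" of the outer() matrix: gap in days from t to every candidate.
        gaps = [abs((c - t).days) for c in candidates]
        # Equivalent of which.min: index of the first smallest gap.
        best = gaps.index(min(gaps))
        result.append(candidates[best])
    return result

# Made-up dates for illustration only; not the data from the question.
sat = [date(2015, 4, 16), date(2015, 4, 21), date(2015, 5, 1)]
station = [date(2015, 4, 1), date(2015, 4, 18), date(2015, 4, 19), date(2015, 4, 30)]

print(closest_dates(sat, station))
```

Here the four station dates map to 2015-04-16, 2015-04-16, 2015-04-21 and 2015-05-01 respectively.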

Matching data in 2 data frame in R?

Simple solution using base R merge:

merge(data2, data1, all.x = TRUE)

faster method for ordered column names dataframe from numeric dataframe in R

If your df has exactly three columns, here is a faster solution using max.col. In the benchmark below it is roughly 7.5x faster (by median) than the fastest solution proposed in the other answer when nrow(df)=100.

The case in which nrow(df)=100

library(microbenchmark)
set.seed(123)
size <- 100
df <- data.frame(x = abs(rnorm(size)), y = abs(rnorm(size)), z = abs(rnorm(size)))

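f1 <- function(df){
  # flatten row by row: elements 3*(i-1)+1..3 belong to row i
  vec <- unlist(t(df))
  sq <- seq(0, (nrow(df)-1)*3, 3)   # offset of each row in vec
  m1 <- max.col(df)                 # column of the largest value per row
  # -----------------------
  vec[sq+m1] <- -Inf                # mask the largest value
  m2 <- max.col(matrix(vec, ncol=3, byrow=TRUE))
  vec[sq+m2] <- -Inf                # mask the second largest value
  # -----------------------
  m3 <- max.col(matrix(vec, ncol=3, byrow=TRUE))
  nm <- names(df)
  cbind(nm[m1], nm[m2], nm[m3])
}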

all(f1(df)==get_name_df_with_for(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply_names(df))
# [1] TRUE
all(f1(df)==get_name_df_double_t(df))
# [1] TRUE
microbenchmark(f1(df), "f2"=get_name_df_with_for(df), "f3"=get_name_df_with_apply(df),
"f4"=get_name_df_with_apply_names(df), "f5"=get_name_df_double_t(df))

# Unit: microseconds
#    expr       min         lq       mean    median         uq       max neval
#  f1(df)   395.643   458.0905   470.8278   472.633   492.7355   701.464   100
#      f2 59262.146 61773.0865 63098.5840 62963.223 64309.4780 74246.953   100
#      f3  5491.521  5637.1605  6754.3912  5801.619  5956.4545 90457.611   100
#      f4  3392.689  3463.9055  3603.1546  3569.125  3707.2795  4237.012   100
#      f5  5513.335  5636.3045  5954.9277  5781.089  5971.2115  8622.017   100

The gap widens when nrow(df)=1000 (about 38x faster than f4 by median):

# Unit: microseconds
#    expr        min          lq        mean      median          uq        max neval
#  f1(df)    693.765    769.8995    878.3698    815.6655    846.4615   3559.929   100
#      f2 627876.429 646057.8155 671925.4799 657768.6270 694047.9940 797900.142   100
#      f3  49570.397  52038.3515  54334.0501  53838.8465  56181.0515  62517.965   100
#      f4  28892.611  30046.8180  31961.4085  31262.4040  33057.5525  48694.850   100
#      f5  49866.379  51491.7235  54413.8287  53705.3970  55962.0575  75287.600   100

merge by nearest neighbour in group - R

As mentioned by Henrik, this can be solved with an update in a rolling join to the nearest, which is available in the data.table package. Additionally, the OP has requested to go for the most recent date if matches are equally distant.

library(data.table)
setDT(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
            `:=`(year.2 = i.year, affpol = i.affpol)]

dat1
   iso2code year elect_polar_lrecon year.2 affpol
1:       AT 1999               2.48     NA     NA
2:       AT 2002               4.18     NA     NA
3:       AT 2006               3.66   2008   2.47
4:       AT 2010               3.91     NA     NA
5:       AT 2014               4.01   2013   2.49
6:       AT 2019               3.55     NA     NA

This operation has updated dat1 by reference, i.e., it added two columns without copying the whole data object.

Now, the OP has requested to go for the most recent date if matches are equally distant but the join has picked the older date. Apparently, there is no parameter to control this in a rolling join to the nearest.

The workaround is to create a helper variable nyear which holds the negative year and to join on this:

setDT(dat1)[, nyear := -year][setDT(dat2)[, nyear := -year],
            roll = "nearest", on = c("iso2code", "nyear"),
            `:=`(year.2 = i.year, affpol = i.affpol)][
            , nyear := NULL]
dat1
   iso2code year elect_polar_lrecon year.2 affpol
1:       AT 1999               2.48     NA     NA
2:       AT 2002               4.18     NA     NA
3:       AT 2006               3.66     NA     NA
4:       AT 2010               3.91   2008   2.47
5:       AT 2014               4.01   2013   2.49
6:       AT 2019               3.55     NA     NA
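
Why negating the key flips the tie can be seen in a small Python sketch. It does not use data.table; nearest_prefer_smaller is a hypothetical stand-in that assumes ties are resolved toward the smaller key, which is what the first result above suggests.

```python
def nearest_prefer_smaller(keys, target):
    """Return the key closest to target; on an exact tie, the smaller key wins."""
    # Comparing (distance, key) tuples makes the smaller key win a tie (assumed behaviour).
    return min(keys, key=lambda k: (abs(k - target), k))

years = [1999, 2002, 2006, 2010, 2014, 2019]  # the years in dat1

# Plain lookup: 2006 and 2010 are both 2 years from 2008, so the older one wins.
older = nearest_prefer_smaller(years, 2008)

# Negate both sides, look up, negate back: now the more recent year wins.
newer = -nearest_prefer_smaller([-y for y in years], -2008)

print(older, newer)  # 2006 2010
```

Negation reverses the ordering of the keys while leaving every distance unchanged, so the same tie rule now favours what was originally the larger year.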

joining data based on a moving time window in R

You could use the findInterval function to find the nearest value:

# example data:
x <- rnorm(120000)
y <- rnorm(71000)
y <- sort(y) # second vector must be sorted
id <- findInterval(x, y, all.inside=TRUE) # finds position of the last y not greater than x
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # to find nearest

In your case some as.numeric might be needed.

# assumed that SortWeath is sorted, if not then SortWeath <- SortWeath[order(SortWeath$DateTime),]
x <- as.numeric(SortLoc$DateTime)
y <- as.numeric(SortWeath$DateTime)
id <- findInterval(x, y, all.inside=TRUE)
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
SortLoc$WndSp <- SortWeath$WndSp[id_min]
SortLoc$WndDir <- SortWeath$WndDir[id_min]
SortLoc$Hgt <- SortWeath$Hgt[id_min]

One addition: you should never, ABSOLUTELY NEVER, add values to a data.frame inside a for loop. Check this comparison:

N=1000
x <- numeric(N)
X <- data.frame(x=x)
require(rbenchmark)
benchmark(
vector = {for (i in 1:N) x[i]<-1},
data.frame = {for (i in 1:N) X$x[i]<-1}
)
#         test replications elapsed relative
# 2 data.frame          100    4.32    22.74
# 1     vector          100    0.19     1.00

The data.frame version is over 20 times slower, and the more rows it contains, the bigger the difference.

So if you change your script and first initialize the result vectors:

tmp_WndSp <- tmp_WndDir <- tmp_Hgt <- rep(NA, nrow(SortLoc))

then update values in loop

tmp_WndSp[i] <- SortWeath$WndSp[weathrow+1]
# and so on...

and at the end (outside the loop) update proper columns:

SortLoc$WndSp <- tmp_WndSp
SortLoc$WndDir <- tmp_WndDir
SortLoc$Hgt <- tmp_Hgt

It should run much faster.

Finding the nearest value and return the index of array in Python

This is similar to using bisect_left, but it allows you to pass in an array of targets:

import numpy as np

def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A)-1)
    left = A[idx-1]
    right = A[idx]
    idx -= target - left < right - target
    return idx

Some explanation:

First the general case: idx = A.searchsorted(target) returns an index for each target such that target is between A[index - 1] and A[index]. I call these left and right so we know that left < target <= right. target - left < right - target is True (or 1) when target is closer to left and False (or 0) when target is closer to right.

Now the special case: when target is less than all the elements of A, idx = 0. idx = np.clip(idx, 1, len(A)-1) replaces all values of idx < 1 with 1, so idx = 1. In this case left = A[0], right = A[1] and we know that target <= left <= right. Therefore we know that target - left <= 0 and right - target >= 0, so target - left < right - target is True unless target == left == right, and idx - True = 0.

There is another special case if target is greater than all the elements of A. In that case idx = A.searchsorted(target) returns len(A), and np.clip(idx, 1, len(A)-1) replaces len(A) with len(A) - 1, so idx = len(A) - 1. target - left < right - target ends up False, so the function returns len(A) - 1. I'll let you work through the logic on your own.

For example:

In [163]: A = np.arange(0, 20.)

In [164]: target = np.array([-2, 100., 2., 2.4, 2.5, 2.6])

In [165]: find_closest(A, target)
Out[165]: array([ 0, 19, 2, 2, 3, 3])
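
For a single target and a plain Python list, the same logic can be written with bisect_left from the standard library. This is a minimal sketch, not part of the answer above: find_closest_scalar is a hypothetical name, and it assumes the list is sorted and has at least two elements.

```python
from bisect import bisect_left

def find_closest_scalar(a, target):
    """Index of the element of sorted list a nearest to target (exact ties go right)."""
    # bisect_left plays the role of searchsorted for a single target.
    idx = bisect_left(a, target)
    # Clamp so that both a[idx - 1] and a[idx] exist.
    idx = min(max(idx, 1), len(a) - 1)
    left, right = a[idx - 1], a[idx]
    # Step back one position only when target is strictly closer to left.
    if target - left < right - target:
        idx -= 1
    return idx

a = [float(v) for v in range(20)]
print([find_closest_scalar(a, t) for t in [-2, 100.0, 2.0, 2.4, 2.5, 2.6]])
# [0, 19, 2, 2, 3, 3]
```

The result matches the vectorized example, including the two out-of-range targets.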

