Matching timestamped data to closest time in another dataset. Properly vectorized? Faster way?
You can try data.table's rolling join using the "nearest" option:
library(data.table) # v1.9.6+
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 5 7 7 8
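The snippet above assumes existing reference and data objects; a minimal self-contained sketch (with made-up tables) looks like:

```r
library(data.table)

# hypothetical lookup table and query table
reference <- data.table(datetime = as.POSIXct(c("2020-01-01 00:00:00",
                                                "2020-01-01 01:00:00",
                                                "2020-01-01 02:00:00"), tz = "UTC"),
                        refvalue = c(10, 20, 30))
data <- data.table(datetime = as.POSIXct(c("2020-01-01 00:10:00",
                                           "2020-01-01 01:40:00"), tz = "UTC"))

# for each row of `data`, pull refvalue from the nearest reference timestamp
reference[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 10 30
```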
Find the closest date between dataset1 and dataset2
An option using outer:
satDat[apply(abs(outer(satDat, stationDat, difftime, units = 'days')), 2, which.min)]
#> [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"
How it works: outer applies difftime to each pair of elements in the two vectors, returning a matrix of time differences. apply then iterates over the columns of that matrix (MARGIN = 2), calling which.min on each column to get the row index of the smallest difference, and those indices are used to subset satDat.
Note that outer allocates a matrix with dimensions length(satDat) by length(stationDat), which can require a lot of memory if your data is large.
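With hypothetical toy date vectors, the whole pipeline can be checked end to end:

```r
# made-up example dates
satDat     <- as.Date(c("2015-04-16", "2015-04-21"))
stationDat <- as.Date(c("2015-04-15", "2015-04-20", "2015-04-22"))

# one row per satellite date, one column per station date
d <- abs(outer(satDat, stationDat, difftime, units = "days"))
satDat[apply(d, 2, which.min)]
# [1] "2015-04-16" "2015-04-21" "2015-04-21"
```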
Matching data in 2 data frame in R?
A simple solution using base R merge:
merge(data2, data1, all.x = TRUE)
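For illustration (with made-up frames sharing an id column), all.x = TRUE keeps every row of data2 and fills non-matches with NA:

```r
data1 <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"))
data2 <- data.frame(id = c(2, 3, 4))

# left join on the common column "id"
merge(data2, data1, all.x = TRUE)
#   id value
# 1  2     b
# 2  3     c
# 3  4  <NA>
```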
faster method for ordered column names dataframe from numeric dataframe in R
If the number of columns in your df is just three, here is a faster solution using max.col. It is roughly 8x faster than the fastest solution proposed in the other answer when nrow(df)=100.
The case in which nrow(df)=100
library(microbenchmark)
set.seed(123)
size <- 100
df <- data.frame(x = abs(rnorm(size)), y = abs(rnorm(size)), z = abs(rnorm(size)))
f1 <- function(df){
vec <- unlist(t(df))
sq <- seq(0,(nrow(df)-1)*3,3)
m1 <- max.col(df)
# -----------------------
vec[sq+m1] <- -Inf
m2 <- max.col(matrix(vec, ncol=3, byrow=T))
vec[sq+m2] <- -Inf
# -----------------------
m3 <- max.col(matrix(vec, ncol=3, byrow=T))
nm <- names(df)
cbind(nm[m1], nm[m2], nm[m3])
}
all(f1(df)==get_name_df_with_for(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply_names(df))
# [1] TRUE
all(f1(df)==get_name_df_double_t(df))
# [1] TRUE
microbenchmark(f1(df), "f2"=get_name_df_with_for(df), "f3"=get_name_df_with_apply(df),
"f4"=get_name_df_with_apply_names(df), "f5"=get_name_df_double_t(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 395.643 458.0905 470.8278 472.633 492.7355 701.464 100
# f2 59262.146 61773.0865 63098.5840 62963.223 64309.4780 74246.953 100
# f3 5491.521 5637.1605 6754.3912 5801.619 5956.4545 90457.611 100
# f4 3392.689 3463.9055 3603.1546 3569.125 3707.2795 4237.012 100
# f5 5513.335 5636.3045 5954.9277 5781.089 5971.2115 8622.017 100
Significantly faster when nrow(df)=1000
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 693.765 769.8995 878.3698 815.6655 846.4615 3559.929 100
# f2 627876.429 646057.8155 671925.4799 657768.6270 694047.9940 797900.142 100
# f3 49570.397 52038.3515 54334.0501 53838.8465 56181.0515 62517.965 100
# f4 28892.611 30046.8180 31961.4085 31262.4040 33057.5525 48694.850 100
# f5 49866.379 51491.7235 54413.8287 53705.3970 55962.0575 75287.600 100
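The trick in f1 is easiest to see on a tiny input: max.col finds each row's largest column, that cell is overwritten with -Inf in a row-major copy of the data, and max.col is run again for second and third place.

```r
df  <- data.frame(x = c(3, 1), y = c(2, 5), z = c(9, 4))
vec <- unlist(t(df))                  # row-major copy: 3 2 9 1 5 4
sq  <- seq(0, (nrow(df) - 1) * 3, 3)  # offset of each row inside vec
m1  <- max.col(df)                    # column index of each row maximum
vec[sq + m1] <- -Inf                  # knock out the winners
m2  <- max.col(matrix(vec, ncol = 3, byrow = TRUE))  # second place
vec[sq + m2] <- -Inf
m3  <- max.col(matrix(vec, ncol = 3, byrow = TRUE))  # third place
cbind(names(df)[m1], names(df)[m2], names(df)[m3])
#      [,1] [,2] [,3]
# [1,] "z"  "x"  "y"
# [2,] "y"  "z"  "x"
```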
merge by nearest neighbour in group - R
As mentioned by Henrik, this can be solved by updating in a rolling join to the nearest match, which is available in the data.table package. Additionally, the OP has requested taking the most recent date if matches are equally distant.
library(data.table)
setDT(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
`:=`(year.2 = i.year, affpol = i.affpol)]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 2008 2.47
4: AT 2010 3.91 NA NA
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
This operation has updated dat1 by reference, i.e., it has added the two extra columns without copying the whole data object.
Now, the OP has requested to go for the most recent date if matches are equally distant but the join has picked the older date. Apparently, there is no parameter to control this in a rolling join to the nearest.
The workaround is to create a helper variable nyear which holds the negative year, and to join on this:
setDT(dat1)[, nyear := -year][setDT(dat2)[, nyear := -year],
roll = "nearest", on = c("iso2code", "nyear"),
`:=`(year.2 = i.year, affpol = i.affpol)][
, nyear := NULL]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 NA NA
4: AT 2010 3.91 2008 2.47
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
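The effect of the negation can be checked on a minimal made-up pair of tables where one dat2 year sits exactly between two dat1 years (per the tie behaviour described above; the tie direction is what the answer observed, not a documented guarantee):

```r
library(data.table)
dat1 <- data.table(iso2code = "AT", year = c(2008L, 2012L))
dat2 <- data.table(iso2code = "AT", year = 2010L, affpol = 2.2)

# plain nearest join: the equidistant match lands on the older year, 2008
d1 <- copy(dat1)
d1[dat2, roll = "nearest", on = c("iso2code", "year"),
   `:=`(year.2 = i.year, affpol = i.affpol)]

# negated helper key: the same match now lands on the more recent year, 2012
d2 <- copy(dat1)[, nyear := -year]
d2[copy(dat2)[, nyear := -year], roll = "nearest", on = c("iso2code", "nyear"),
   `:=`(year.2 = i.year, affpol = i.affpol)][, nyear := NULL]

d1$year[!is.na(d1$affpol)]  # 2008
d2$year[!is.na(d2$affpol)]  # 2012
```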
joining data based on a moving time window in R
You could use the findInterval function to find the nearest value:
# example data:
x <- rnorm(120000)
y <- rnorm(71000)
y <- sort(y) # second vector must be sorted
id <- findInterval(x, y, all.inside=TRUE) # finds position of last y smaller then x
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # to find nearest
In your case some as.numeric might be needed.
# assumed that SortWeath is sorted, if not then SortWeath <- SortWeath[order(SortWeath$DateTime),]
x <- as.numeric(SortLoc$DateTime)
y <- as.numeric(SortWeath$DateTime)
id <- findInterval(x, y, all.inside=TRUE)
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
SortLoc$WndSp <- SortWeath$WndSp[id_min]
SortLoc$WndDir <- SortWeath$WndDir[id_min]
SortLoc$Hgt <- SortWeath$Hgt[id_min]
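The nearest-match logic is easy to verify on small numeric vectors before running it on the timestamps:

```r
y <- c(1, 5, 10)         # sorted reference values
x <- c(0.5, 6, 9.9, 12)  # queries

id     <- findInterval(x, y, all.inside = TRUE)  # last y at or below x, clamped
id_min <- ifelse(abs(x - y[id]) < abs(x - y[id + 1]), id, id + 1)
y[id_min]
# [1]  1  5 10 10
```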
One addition: you should never, ABSOLUTELY NEVER, add values to a data.frame in a for-loop. Check this comparison:
N=1000
x <- numeric(N)
X <- data.frame(x=x)
require(rbenchmark)
benchmark(
vector = {for (i in 1:N) x[i]<-1},
data.frame = {for (i in 1:N) X$x[i]<-1}
)
# test replications elapsed relative
# 2 data.frame 100 4.32 22.74
# 1 vector 100 0.19 1.00
The data.frame version is over 20 times slower, and the more rows it contains, the bigger the difference.
So if you change your script to first initialize result vectors:
tmp_WndSp <- tmp_WndDir <- tmp_Hgt <- rep(NA, nrow(SortLoc))
then update values in loop
tmp_WndSp[i] <- SortWeath$WndSp[weathrow+1]
# and so on...
and at the end (outside the loop) update proper columns:
SortLoc$WndSp <- tmp_WndSp
SortLoc$WndDir <- tmp_WndDir
SortLoc$Hgt <- tmp_Hgt
It should run much faster.
Finding the nearest value and return the index of array in Python
This is similar to using bisect_left, but it allows you to pass in an array of targets:
import numpy as np

def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A) - 1)
    left = A[idx - 1]
    right = A[idx]
    idx -= target - left < right - target
    return idx
Some explanation:
First the general case: idx = A.searchsorted(target) returns an index for each target such that target is between A[idx - 1] and A[idx]. I call these left and right, so we know that left < target <= right. target - left < right - target is True (or 1) when target is closer to left and False (or 0) when target is closer to right.
Now the special case: when target is less than all the elements of A, searchsorted returns idx = 0. np.clip(idx, 1, len(A) - 1) replaces all values of idx < 1 with 1, so idx = 1. In this case left = A[0], right = A[1], and we know that target <= left <= right. Therefore target - left <= 0 and right - target >= 0, so target - left < right - target is True unless target == left == right, and the subtraction leaves idx = 0.
There is another special case if target is greater than all the elements of A. In that case A.searchsorted(target) returns len(A), and np.clip(idx, 1, len(A) - 1) replaces len(A) with len(A) - 1, so idx = len(A) - 1. Then target - left < right - target ends up False, so idx stays at len(A) - 1. I'll let you work through the logic on your own.
For example:
In [163]: A = np.arange(0, 20.)
In [164]: target = np.array([-2, 100., 2., 2.4, 2.5, 2.6])
In [165]: find_closest(A, target)
Out[165]: array([ 0, 19, 2, 2, 3, 3])
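As a sanity check, the result can be compared against a brute-force distance matrix. Note that exact ties can resolve to the other side (argmin picks the first minimum, find_closest picks the right neighbour), so the comparison below uses tie-free targets:

```python
import numpy as np

def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A) - 1)
    left = A[idx - 1]
    right = A[idx]
    idx -= target - left < right - target
    return idx

def find_closest_bruteforce(A, target):
    # O(len(A) * len(target)) reference: full |A - target| distance matrix
    return np.abs(np.asarray(A)[None, :] - np.asarray(target)[:, None]).argmin(axis=1)

A = np.arange(0, 20.)
target = np.array([-2, 100., 2., 2.4, 2.6])  # tie-free targets
print(find_closest(A, target))               # [ 0 19  2  2  3]
print(find_closest_bruteforce(A, target))    # [ 0 19  2  2  3]
```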