Efficient Way to Filter One Data Frame by Ranges in Another

Here's a function you can use inside dplyr's filter to find dates within a given range, using the between function (from dplyr). For each value of Day, mapply runs between on each pair of Start and End dates, and rowSums returns TRUE if Day is between at least one of them. I'm not sure it's the most efficient approach, but it gives nearly a fourfold speedup.

test.overlap = function(vals) {
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>%
  filter(test.overlap(Day))
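As a self-contained sketch with made-up toy data (main_data and spans_to_filter are only defined later, in the update), this is what the approach looks like end to end:

```r
# Toy data, invented for illustration only
library(dplyr)

main_data <- data.frame(Day = 1:20)
spans_to_filter <- data.frame(Start = c(3, 12), End = c(5, 15))

test.overlap <- function(vals) {
  # one logical column per span; a row sum > 0 means Day fell in at least one span
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>% filter(test.overlap(Day))
# keeps Day = 3, 4, 5, 12, 13, 14, 15
```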

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

# unlist (rather than as.vector) flattens correctly even when spans have different lengths
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>%
  filter(Day %in% filt.vals)

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>%
    filter(Day %in% filt.vals)
)

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day = 1:100000)

spans_to_filter =
  data.frame(Span_number = 1:9,
             Start = c(2, 7, 1, 15, 12, 23, 90, 9000, 50000),
             End = c(5, 10, 4, 18, 15, 26, 100, 9100, 50100))

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>%
      filter(Day %in% filt.vals)
  },
  times = 10
)

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10  b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10 a
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10 a

Filter between multiple date ranges

With some inspiration from the question Efficient way to filter one data frame by ranges in another, I came up with the following solutions.

One, which takes the data provided above and uses rowwise(), is very slow with very large datasets:

filtered3 <- df %>%
  rowwise() %>%
  filter(any(datetime >= start & datetime <= end))

As I mentioned, with more than 3 million rows in my data, this was very slow.

Another option, also from the answer linked above, is to use the data.table package, which has an inrange function (and the %inrange% operator). This one works much faster.

library(data.table)
range <- data.table(start = start, end = end)
filtered4 <- setDT(df)[datetime %inrange% range]
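Since start and end aren't defined in this snippet, here's a minimal self-contained sketch of %inrange% on invented numeric data (real use would have POSIXct datetimes):

```r
# Toy data for illustration only
library(data.table)

df <- data.table(datetime = 1:24)
range <- data.table(start = c(5L, 20L), end = c(8L, 22L))

filtered4 <- df[datetime %inrange% range]
filtered4$datetime
# 5 6 7 8 20 21 22 (bounds are inclusive by default)
```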

Efficiently filter data frame according to conditions given as rows of another data frame

One way of getting this to work is with the glue, eval, and parse functions.

I created a function (my_conditions) so it can be used more easily. There is still some manual work involved when column names or condition tables change, but not much, and this could probably be automated as well. The function relies on the glue package.

my_conditions <- function(column_name, condition_table){
  # create one condition per row of the condition table
  conditions <- glue::glue("{column_name} > {condition_table$xmin} & {column_name} < {condition_table$xmax}")
  # collapse into one statement, using " | " as the OR separator
  conditions <- paste0(conditions, collapse = " | ")
  return(conditions)
}

The result of calling my_conditions("PC1", f1) is one long string containing all the conditions from table f1:

[1] "PC1 > -3.59811981997059 & PC1 < -3.34997362548985 | PC1 > -3.10182743100913 & PC1 < -2.8536812365284 | PC1 > -2.8536812365284 & PC1 < -2.60553504204766 | PC1 > 2.8536812365284 & PC1 < 3.10182743100912 | PC1 > 3.59811981997058 & PC1 < 3.84626601445132"

Use eval and parse to parse and evaluate those conditions in code.

Using dplyr:

df %>%
  filter(eval(parse(text = my_conditions("PC1", f1))))
# A tibble: 1 x 3
    PC1   PC2   PC3
  <dbl> <dbl> <dbl>
1  3.09 0.856 -2.02

Filtering in base R: just add the table name in front of the column:

df[eval(parse(text = my_conditions("df$PC1", f1))), ]

# A tibble: 1 x 3
    PC1   PC2   PC3
  <dbl> <dbl> <dbl>
1  3.09 0.856 -2.02
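Here is a minimal runnable sketch of the whole pattern, with an invented condition table f1 (the real f1 and df come from the question):

```r
# Toy data: f1 allows values strictly inside (0, 2) or (10, 12)
library(dplyr)

f1 <- data.frame(xmin = c(0, 10), xmax = c(2, 12))
df <- data.frame(PC1 = c(-5, 1, 5, 11))

my_conditions <- function(column_name, condition_table){
  conditions <- glue::glue("{column_name} > {condition_table$xmin} & {column_name} < {condition_table$xmax}")
  paste0(conditions, collapse = " | ")
}

df %>% filter(eval(parse(text = my_conditions("PC1", f1))))
# keeps PC1 = 1 and PC1 = 11
```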

Filter dataframe between values in two vectors and add results to list in R

Tidyverse Solution 1 (using purrr's map2):

library(tidyverse)
map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))

Tidyverse Solution 2 (this time with map)

map(1:length(v), ~ mydata[mydata$x >= v[.] & mydata$x <= v1[.],])

For Loop Solution

result <- list()
for (i in 1:length(v)) {
  result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}
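All three solutions return the same list; a quick check on invented stand-ins for the question's mydata, v (lower bounds), and v1 (upper bounds):

```r
# Toy stand-ins, invented for illustration
library(dplyr)
library(purrr)

mydata <- data.frame(x = 1:10)
v  <- c(2, 6)
v1 <- c(4, 9)

res_map2 <- map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))

result <- list()
for (i in seq_along(v)) {
  result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}

identical(res_map2, result)
# TRUE
```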

How to filter rows by column value ranges in R?

Here is a data.table approach

library(data.table)
# keep Gene that are not joined in the non-equi join on df1 below
df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
# Gene Gene.Start Gene.End Chromosome
# 1: Gene2 950 990 1
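A runnable sketch with invented toy data (one gene inside the range, one outside) shows the non-equi anti-join in action:

```r
# Toy data for illustration only
library(data.table)

df1 <- data.table(Chromosome = 1L, Min = 100L, Max = 500L)
df2 <- data.table(Gene       = c("Gene1", "Gene2"),
                  Gene.Start = c(150L, 950L),
                  Gene.End   = c(300L, 990L),
                  Chromosome = 1L)

# Gene1 falls inside [100, 500] and is matched (hence dropped); Gene2 survives
df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
```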

Filtering data frame by multiple columns from another data frame

Is this what you need?

Perhaps not the most elegant solution, but you can paste together the combinations of years and ID in both data.frames and then use one to filter the other. Probably not the best way if you have a large data.frame though.

df %>%
  dplyr::filter(paste0(lubridate::year(date), "_", ID) %in% paste0(df2$year, "_", df2$ID))

         date        x        y ID
1  2010-12-26 74119.46 839347.8  1
2  2010-12-27 72401.02 891788.1  2
3  2010-12-31 66940.94 810089.6  1
4  2012-01-02 68214.97 881200.1  3
5  2012-01-07 70595.92 863277.7  3
6  2012-01-12 79799.85 857738.5  3
7  2012-01-17 61102.50 848880.6  3
8  2012-01-22 71798.29 883455.7  3
9  2012-01-27 61550.93 889447.7  3
10 2012-02-01 69863.50 838101.4  3
11 2012-02-06 71202.38 873705.6  3
12 2012-02-11 60124.56 828661.6  3
13 2012-02-16 65963.74 824347.5  3
14 2012-02-21 79347.69 818929.1  3
15 2012-02-26 68082.87 879863.1  3
16 2012-03-02 68661.00 891477.0  3
17 2012-03-07 71369.69 849595.6  3
18 2012-03-12 73265.85 834035.4  3
19 2012-03-17 70777.06 833344.5  3
20 2012-03-22 72104.04 881329.5  3
21 2012-03-27 75471.59 848650.2  3
22 2012-04-01 77590.13 867834.6  3
23 2012-04-06 75664.27 828857.6  3
24 2012-04-11 65789.62 814059.0  3
25 2012-04-16 72841.91 893683.3  3
26 2012-04-21 61047.06 805820.7  3
27 2012-04-26 77232.51 896022.5  3
28 2012-05-01 77553.05 817557.6  3
29 2012-05-06 75597.76 899616.4  3

Perhaps a more efficient way would be to use a join:

df$year = lubridate::year(df$date)
dplyr::left_join(df2, df, by = c("ID", "year")) %>% na.omit()
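A dplyr semi_join expresses the same keep-matching-pairs filter without string-pasting or dropping extra columns; a sketch on invented stand-ins for df and df2:

```r
# Toy stand-ins: df holds dated observations, df2 the year/ID pairs to keep
library(dplyr)
library(lubridate)

df  <- data.frame(date = as.Date(c("2010-12-26", "2011-03-01")), ID = c(1, 1))
df2 <- data.frame(year = 2010, ID = 1)

df %>%
  mutate(year = year(date)) %>%
  semi_join(df2, by = c("year", "ID")) %>%
  select(-year)
# keeps only the 2010 row
```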

How to subset windows in a dataframe using start- and end-values from another dataframe in R?

Maybe try this approach with purrr::map2:

# dataframe of data to subset
df1 <- tibble(my_values = rnorm(100, mean = 45, sd = 30) %>% abs())

# dataframe of windows (i.e. row number IDs) to extract from data
df2 <- tibble::tribble(
  ~window_start, ~window_end,
  3L, 10L,
  21L, 25L,
  52L, 63L,
  78L, 90L
)

subset_thats_in <- function(mini, maxi){
  df1 %>%
    filter(between(my_values, mini, maxi))
}

purrr::map2(df2$window_start,
            df2$window_end,
            subset_thats_in)
[[1]]
# A tibble: 4 × 1
my_values
<dbl>
1 6.47
2 8.69
3 7.73
4 7.35

[[2]]
# A tibble: 12 × 1
my_values
<dbl>
1 24.2
2 22.9
3 22.4
4 24.4
5 22.6
6 21.7
7 23.2
8 21.3
9 23.3
10 21.1
11 23.5
12 22.6

[[3]]
# A tibble: 10 × 1
my_values
<dbl>
1 54.0
2 61.4
3 62.5
4 60.8
5 60.5
6 55.5
7 61.4
8 59.0
9 57.9
10 53.3

[[4]]
# A tibble: 6 × 1
my_values
<dbl>
1 87.8
2 79.1
3 80.5
4 82.7
5 85.2
6 80.6
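Note that subset_thats_in filters on the values themselves; if the windows are really meant as row positions (as the "row number IDs" comment suggests), slice would be the analogous call. A hypothetical variant on toy data:

```r
# Hypothetical variant: treat windows as row positions, not value ranges
library(dplyr)
library(purrr)

df1 <- tibble::tibble(my_values = 101:200)

df2 <- tibble::tribble(
  ~window_start, ~window_end,
  3L, 10L,
  21L, 25L
)

map2(df2$window_start, df2$window_end, ~ slice(df1, .x:.y))
# first element holds rows 3-10, i.e. my_values 103:110
```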

How do I filter a range of numbers in R?

You can use %in% or, as has been mentioned, dplyr's between():

library(dplyr)

new_frame <- Mydata %>% filter(x %in% 3:7)
new_frame
#   x  y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

While %in% works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%, use dplyr's between():

new_frame2 <- Mydata %>% filter(between(x, 3, 7))
new_frame2
#   x  y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

To further clarify, note that %in% checks for the presence in a set of values:

3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE

The above return TRUE because 3:7 is shorthand for seq(3, 7) which produces:

3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7

As such, if you use %in% to check for values not produced by :, it returns FALSE:

4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE

Whereas between checks against the end points and all values in between:

between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE

Filtering and summarising a dataframe based another search dataframe

I reproduced the same results using data.table, but it actually performs worse than the OP's solution. Leaving it here in case it helps other people answer:

library(data.table)
setDT(df)
setDT(search)

df[search,
   on = .(dt > min_dt, dt < max_dt, x = category),
   .(min_dt, max_dt, dt, x, y, category)][, list(.N, mean_val = mean(y)),
                                          by = list(min_dt, max_dt, category)]

Benchmark:

dt_summ = function(df, search){
  setDT(df)
  setDT(search)

  setkeyv(df, c("dt", "y"))

  df[search,
     on = .(dt > min_dt, dt < max_dt, x = category),
     .(min_dt, max_dt, dt, x, y, category)][,
       list(.N, mean_val = mean(y)),
       by = list(min_dt, max_dt, category)]
}

dplyr_summ = function(df, search){
  bind_cols(search, purrr::pmap_dfr(search, filter_summarise))
}

library(microbenchmark)
microbenchmark(
  dplyr = dplyr_summ(df, search),
  dt = dt_summ(df, search)
)

#Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 4.0562 4.4588 5.580925 4.70385 5.0531 65.5202 100
# dt 6.7754 7.5449 8.246862 7.97395 8.6485 15.8260 100

