Efficient Way to Filter One Data Frame by Ranges in Another

Here's a function you can use inside dplyr's filter to find dates within a given range, using the between function (from dplyr). For each value of Day, mapply runs between on each pair of Start and End dates, and rowSums returns TRUE if Day is between at least one of them. I'm not sure it's the most efficient approach, but it gives nearly a fourfold speedup.

test.overlap = function(vals) {
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>%
  filter(test.overlap(Day))
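As a self-contained sketch with made-up toy data (main_data and spans_to_filter are only defined later, in the update), this is what the approach looks like end to end:

```r
# Toy data, invented for illustration only
library(dplyr)

main_data <- data.frame(Day = 1:20)
spans_to_filter <- data.frame(Start = c(3, 12), End = c(5, 15))

test.overlap <- function(vals) {
  # one logical column per span; a row sum > 0 means Day fell in at least one span
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>% filter(test.overlap(Day))
# keeps Day = 3, 4, 5, 12, 13, 14, 15
```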

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

# unlist (rather than as.vector) flattens correctly even when spans have different lengths
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>%
  filter(Day %in% filt.vals)

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>%
    filter(Day %in% filt.vals)
)

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day = 1:100000)

spans_to_filter =
  data.frame(Span_number = 1:9,
             Start = c(2, 7, 1, 15, 12, 23, 90, 9000, 50000),
             End = c(5, 10, 4, 18, 15, 26, 100, 9100, 50100))

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>%
      filter(Day %in% filt.vals)
  },
  times = 10
)

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10  b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10 a
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10 a

Filter between multiple date ranges

With some inspiration from the question Efficient way to filter one data frame by ranges in another, I came up with the following solutions.

One, which takes the data provided above and uses rowwise(), is very slow with very large datasets:

filtered3 <- df %>%
  rowwise() %>%
  filter(any(datetime >= start & datetime <= end))

As I mentioned, with more than 3 million rows in my data, this was very slow.

Another option, also from the answer linked above, is to use the data.table package, which has an inrange function (and the %inrange% operator). This one works much faster.

library(data.table)
range <- data.table(start = start, end = end)
filtered4 <- setDT(df)[datetime %inrange% range]
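Since start and end aren't defined in this snippet, here's a minimal self-contained sketch of %inrange% on invented numeric data (real use would have POSIXct datetimes):

```r
# Toy data for illustration only
library(data.table)

df <- data.table(datetime = 1:24)
range <- data.table(start = c(5L, 20L), end = c(8L, 22L))

filtered4 <- df[datetime %inrange% range]
filtered4$datetime
# 5 6 7 8 20 21 22 (bounds are inclusive by default)
```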

Efficiently filter data frame according to conditions given as rows of another data frame

One way of getting this to work is with the glue, eval, and parse functions.

I created a function (my_conditions) so it can be used more easily. There is still some manual work involved when column names or condition tables change, but not much, and this could probably be automated as well. The function relies on the glue package.

my_conditions <- function(column_name, condition_table){
  # create one condition per row of the condition table
  conditions <- glue::glue("{column_name} > {condition_table$xmin} & {column_name} < {condition_table$xmax}")
  # collapse into one statement, using " | " as the OR separator
  conditions <- paste0(conditions, collapse = " | ")
  return(conditions)
}

The result of calling my_conditions("PC1", f1) is one long string containing all the conditions from table f1:

[1] "PC1 > -3.59811981997059 & PC1 < -3.34997362548985 | PC1 > -3.10182743100913 & PC1 < -2.8536812365284 | PC1 > -2.8536812365284 & PC1 < -2.60553504204766 | PC1 > 2.8536812365284 & PC1 < 3.10182743100912 | PC1 > 3.59811981997058 & PC1 < 3.84626601445132"

Use eval and parse to parse and evaluate those conditions in code.

Using dplyr:

df %>%
  filter(eval(parse(text = my_conditions("PC1", f1))))
# A tibble: 1 x 3
    PC1   PC2   PC3
  <dbl> <dbl> <dbl>
1  3.09 0.856 -2.02

Filtering in base R: just add the table name in front of the column:

df[eval(parse(text = my_conditions("df$PC1", f1))), ]

# A tibble: 1 x 3
    PC1   PC2   PC3
  <dbl> <dbl> <dbl>
1  3.09 0.856 -2.02
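Here is a minimal runnable sketch of the whole pattern, with an invented condition table f1 (the real f1 and df come from the question):

```r
# Toy data: f1 allows values strictly inside (0, 2) or (10, 12)
library(dplyr)

f1 <- data.frame(xmin = c(0, 10), xmax = c(2, 12))
df <- data.frame(PC1 = c(-5, 1, 5, 11))

my_conditions <- function(column_name, condition_table){
  conditions <- glue::glue("{column_name} > {condition_table$xmin} & {column_name} < {condition_table$xmax}")
  paste0(conditions, collapse = " | ")
}

df %>% filter(eval(parse(text = my_conditions("PC1", f1))))
# keeps PC1 = 1 and PC1 = 11
```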

Filter dataframe between values in two vectors and add results to list in R

Tidyverse Solution 1 (using purrr's map2):

library(tidyverse)
map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))

Tidyverse Solution 2 (this time with map)

map(1:length(v), ~ mydata[mydata$x >= v[.] & mydata$x <= v1[.],])

For Loop Solution

result <- list()
for (i in 1:length(v)) {
  result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}
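All three solutions return the same list; a quick check on invented stand-ins for the question's mydata, v (lower bounds), and v1 (upper bounds):

```r
# Toy stand-ins, invented for illustration
library(dplyr)
library(purrr)

mydata <- data.frame(x = 1:10)
v  <- c(2, 6)
v1 <- c(4, 9)

res_map2 <- map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))

result <- list()
for (i in seq_along(v)) {
  result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}

identical(res_map2, result)
# TRUE
```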

How to filter rows by column value ranges in R?

Here is a data.table approach

library(data.table)
# keep Gene that are not joined in the non-equi join on df1 below
df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
# Gene Gene.Start Gene.End Chromosome
# 1: Gene2 950 990 1
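A runnable sketch with invented toy data (one gene inside the range, one outside) shows the non-equi anti-join in action:

```r
# Toy data for illustration only
library(data.table)

df1 <- data.table(Chromosome = 1L, Min = 100L, Max = 500L)
df2 <- data.table(Gene       = c("Gene1", "Gene2"),
                  Gene.Start = c(150L, 950L),
                  Gene.End   = c(300L, 990L),
                  Chromosome = 1L)

# Gene1 falls inside [100, 500] and is matched (hence dropped); Gene2 survives
df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
```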

Filtering data frame by multiple columns from another data frame

Is this what you need?

Perhaps not the most elegant solution, but you can paste together the combinations of years and ID in both data.frames and then use one to filter the other. Probably not the best way if you have a large data.frame though.

df %>%
  dplyr::filter(paste0(lubridate::year(date), "_", ID) %in% paste0(df2$year, "_", df2$ID))

         date        x        y ID
1  2010-12-26 74119.46 839347.8  1
2  2010-12-27 72401.02 891788.1  2
3  2010-12-31 66940.94 810089.6  1
4  2012-01-02 68214.97 881200.1  3
5  2012-01-07 70595.92 863277.7  3
6  2012-01-12 79799.85 857738.5  3
7  2012-01-17 61102.50 848880.6  3
8  2012-01-22 71798.29 883455.7  3
9  2012-01-27 61550.93 889447.7  3
10 2012-02-01 69863.50 838101.4  3
11 2012-02-06 71202.38 873705.6  3
12 2012-02-11 60124.56 828661.6  3
13 2012-02-16 65963.74 824347.5  3
14 2012-02-21 79347.69 818929.1  3
15 2012-02-26 68082.87 879863.1  3
16 2012-03-02 68661.00 891477.0  3
17 2012-03-07 71369.69 849595.6  3
18 2012-03-12 73265.85 834035.4  3
19 2012-03-17 70777.06 833344.5  3
20 2012-03-22 72104.04 881329.5  3
21 2012-03-27 75471.59 848650.2  3
22 2012-04-01 77590.13 867834.6  3
23 2012-04-06 75664.27 828857.6  3
24 2012-04-11 65789.62 814059.0  3
25 2012-04-16 72841.91 893683.3  3
26 2012-04-21 61047.06 805820.7  3
27 2012-04-26 77232.51 896022.5  3
28 2012-05-01 77553.05 817557.6  3
29 2012-05-06 75597.76 899616.4  3

Perhaps a more efficient way would be to use a join:

df$year = lubridate::year(df$date)
dplyr::left_join(df2, df, by = c("ID", "year")) %>% na.omit()
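A dplyr semi_join expresses the same keep-matching-pairs filter without string-pasting or dropping extra columns; a sketch on invented stand-ins for df and df2:

```r
# Toy stand-ins: df holds dated observations, df2 the year/ID pairs to keep
library(dplyr)
library(lubridate)

df  <- data.frame(date = as.Date(c("2010-12-26", "2011-03-01")), ID = c(1, 1))
df2 <- data.frame(year = 2010, ID = 1)

df %>%
  mutate(year = year(date)) %>%
  semi_join(df2, by = c("year", "ID")) %>%
  select(-year)
# keeps only the 2010 row
```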

How to subset windows in a dataframe using start- and end-values from another dataframe in R?

Maybe try this approach with purrr::map2:

# dataframe of data to subset
df1 <- tibble(my_values = rnorm(100, mean = 45, sd = 30) %>% abs())

# dataframe of windows (i.e. row number IDs) to extract from data
df2 <- tibble::tribble(
  ~window_start, ~window_end,
  3L, 10L,
  21L, 25L,
  52L, 63L,
  78L, 90L
)

subset_thats_in <- function(mini, maxi){
  df1 %>%
    filter(between(my_values, mini, maxi))
}

purrr::map2(df2$window_start,
            df2$window_end,
            subset_thats_in)
[[1]]
# A tibble: 4 × 1
my_values
<dbl>
1 6.47
2 8.69
3 7.73
4 7.35

[[2]]
# A tibble: 12 × 1
my_values
<dbl>
1 24.2
2 22.9
3 22.4
4 24.4
5 22.6
6 21.7
7 23.2
8 21.3
9 23.3
10 21.1
11 23.5
12 22.6

[[3]]
# A tibble: 10 × 1
my_values
<dbl>
1 54.0
2 61.4
3 62.5
4 60.8
5 60.5
6 55.5
7 61.4
8 59.0
9 57.9
10 53.3

[[4]]
# A tibble: 6 × 1
my_values
<dbl>
1 87.8
2 79.1
3 80.5
4 82.7
5 85.2
6 80.6
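Note that subset_thats_in filters on the values themselves; if the windows are really meant as row positions (as the "row number IDs" comment suggests), slice would be the analogous call. A hypothetical variant on toy data:

```r
# Hypothetical variant: treat windows as row positions, not value ranges
library(dplyr)
library(purrr)

df1 <- tibble::tibble(my_values = 101:200)

df2 <- tibble::tribble(
  ~window_start, ~window_end,
  3L, 10L,
  21L, 25L
)

map2(df2$window_start, df2$window_end, ~ slice(df1, .x:.y))
# first element holds rows 3-10, i.e. my_values 103:110
```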

How do I filter a range of numbers in R?

You can use %in% or, as has been mentioned, dplyr's between():

library(dplyr)

new_frame <- Mydata %>% filter(x %in% 3:7)
new_frame
#   x  y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

While %in% works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%, use dplyr's between():

new_frame2 <- Mydata %>% filter(between(x, 3, 7))
new_frame2
#   x  y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

To further clarify, note that %in% checks for the presence in a set of values:

3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE

The above return TRUE because 3:7 is shorthand for seq(3, 7) which produces:

3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7

As such, if you use %in% to check for values not produced by :, it returns FALSE:

4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE

Whereas between checks against the end points and all values in between:

between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE

Filtering and summarising a dataframe based another search dataframe

I reproduced the same results using data.table, but it actually performs worse than the OP's solution. Leaving it here in case it helps other people answer:

library(data.table)
setDT(df)
setDT(search)

df[search,
   on = .(dt > min_dt, dt < max_dt, x = category),
   .(min_dt, max_dt, dt, x, y, category)][, list(.N, mean_val = mean(y)),
                                          by = list(min_dt, max_dt, category)]

Benchmark:

dt_summ = function(df, search){
  setDT(df)
  setDT(search)

  setkeyv(df, c("dt", "y"))

  df[search,
     on = .(dt > min_dt, dt < max_dt, x = category),
     .(min_dt, max_dt, dt, x, y, category)][,
       list(.N, mean_val = mean(y)),
       by = list(min_dt, max_dt, category)]
}

dplyr_summ = function(df, search){
  bind_cols(search, purrr::pmap_dfr(search, filter_summarise))
}

library(microbenchmark)
microbenchmark(
  dplyr = dplyr_summ(df, search),
  dt = dt_summ(df, search)
)

#Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 4.0562 4.4588 5.580925 4.70385 5.0531 65.5202 100
# dt 6.7754 7.5449 8.246862 7.97395 8.6485 15.8260 100

