Efficient way to filter one data frame by ranges in another
Here's a function that you can run in dplyr to find dates within a given range, using the between function (from dplyr). For each value of Day, mapply runs between on each of the pairs of Start and End dates, and the function uses rowSums to return TRUE if Day is between at least one of them. I'm not sure if it's the most efficient approach, but it results in nearly a factor-of-four improvement in speed.
test.overlap = function(vals) {
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>%
  filter(test.overlap(Day))
If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):
# unlist flattens the result even when the spans have different lengths
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>%
  filter(Day %in% filt.vals)
Now compare execution speeds. I shortened your code to require only the filtering operation:
library(microbenchmark)
microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>%
    filter(Day %in% filt.vals)
)
Unit: microseconds
      expr      min       lq      mean    median       uq      max neval cld
        OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
    eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b
  eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a
UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).
main_data = data.frame(Day = c(1:100000))

spans_to_filter =
  data.frame(Span_number = c(1:9),
             Start = c(2, 7, 1, 15, 12, 23, 90, 9000, 50000),
             End = c(5, 10, 4, 18, 15, 26, 100, 9100, 50100))
microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>%
      filter(Day %in% filt.vals)
  },
  times = 10
)
Unit: milliseconds
      expr         min          lq        mean      median          uq         max neval cld
        OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
    eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10  a
  eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10  a
Filter between multiple date ranges
With some inspiration from the question above on how to efficiently filter one data frame by ranges in another, I came up with the following solutions.
One is very slow with very large datasets: it takes my data provided above and uses rowwise().
filtered3 <- df %>%
  rowwise() %>%
  filter(any(datetime >= start & datetime <= end))
As I mentioned, with more than 3 million rows in my data, this was very slow.
Another option, also from the answer linked above, uses the data.table package, which has an %inrange% operator. This one runs much faster.
library(data.table)
# start and end are vectors of interval endpoints
range <- data.table(start = start, end = end)
filtered4 <- setDT(df)[datetime %inrange% range]
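As a minimal self-contained sketch of the %inrange% pattern (the df, start, and end objects above aren't shown, so this example uses made-up hourly timestamps):

```r
library(data.table)

# hypothetical data: ten hourly timestamps
df <- data.table(datetime = as.POSIXct("2020-01-01", tz = "UTC") + 3600 * (0:9))

# two windows to keep: hours 1-3 and hours 6-8 (bounds inclusive by default)
range <- data.table(
  start = as.POSIXct("2020-01-01", tz = "UTC") + 3600 * c(1, 6),
  end   = as.POSIXct("2020-01-01", tz = "UTC") + 3600 * c(3, 8)
)

filtered <- df[datetime %inrange% range]
nrow(filtered)  # three rows from each window
```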
Efficiently filter data frame according to conditions given as rows of another data frame
One way of getting this to work is using the glue, eval, and parse functions.
I created a function (my_conditions) so it can be used more easily. There is still some manual work involved in changing column names / condition tables, but not as much, and this could probably be automated as well. The function calls on the glue package.
my_conditions <- function(column_name, condition_table){
  # create one condition per row of the condition table
  conditions <- glue::glue("{column_name} > {condition_table$xmin} & {column_name} < {condition_table$xmax}")
  # collapse into a single statement, joined with " | " (OR)
  conditions <- paste0(conditions, collapse = " | ")
  return(conditions)
}
The result of calling my_conditions("PC1", f1) is a long string that contains all the conditions from table f1:
[1] "PC1 > -3.59811981997059 & PC1 < -3.34997362548985 | PC1 > -3.10182743100913 & PC1 < -2.8536812365284 | PC1 > -2.8536812365284 & PC1 < -2.60553504204766 | PC1 > 2.8536812365284 & PC1 < 3.10182743100912 | PC1 > 3.59811981997058 & PC1 < 3.84626601445132"
Use eval and parse to parse and evaluate the conditions in the code.
Using dplyr:
df %>%
  filter(eval(parse(text = my_conditions("PC1", f1))))
# A tibble: 1 x 3
PC1 PC2 PC3
<dbl> <dbl> <dbl>
1 3.09 0.856 -2.02
Filtering in base R: just add the data frame name in front of the column:
df[eval(parse(text = my_conditions("df$PC1", f1))), ]
# A tibble: 1 x 3
PC1 PC2 PC3
<dbl> <dbl> <dbl>
1 3.09 0.856 -2.02
Filter dataframe between values in two vectors and add results to list in R
Tidyverse Solution 1 (using purrr's map2):

library(tidyverse)
map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))

Tidyverse Solution 2 (this time with map):

map(seq_along(v), ~ mydata[mydata$x >= v[.] & mydata$x <= v1[.], ])
For Loop Solution
result <- list()
for (i in seq_along(v)) {
  result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}
How to filter rows by column value ranges in R?
Here is a data.table approach:
library(data.table)
# keep Genes that are not matched in the non-equi join with df1 below
df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
# Gene Gene.Start Gene.End Chromosome
# 1: Gene2 950 990 1
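To see what the non-equi join is doing, here is a minimal sketch with made-up data matching the column names above (Gene1 sits inside the df1 interval and is dropped; Gene2 does not and is kept):

```r
library(data.table)

# hypothetical range table: one interval per chromosome
df1 <- data.table(Chromosome = 1L, Min = 100L, Max = 500L)

# hypothetical gene table
df2 <- data.table(Gene = c("Gene1", "Gene2"),
                  Gene.Start = c(150L, 950L),
                  Gene.End = c(300L, 990L),
                  Chromosome = 1L)

# genes matched by the non-equi join fall inside an interval...
matched <- df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene
# ...so keeping the unmatched genes filters them out
df2[!Gene %in% matched, ]
```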
Filtering data frame by multiple columns from another data frame
Is this what you need?
It's perhaps not the most elegant solution, but you can paste together the combinations of year and ID in both data.frames and then use one to filter the other. It's probably not the best way if you have a large data.frame, though.
df %>%
  dplyr::filter(paste0(lubridate::year(date), "_", ID) %in% paste0(df2$year, "_", df2$ID))
date x y ID
1 2010-12-26 74119.46 839347.8 1
2 2010-12-27 72401.02 891788.1 2
3 2010-12-31 66940.94 810089.6 1
4 2012-01-02 68214.97 881200.1 3
5 2012-01-07 70595.92 863277.7 3
6 2012-01-12 79799.85 857738.5 3
7 2012-01-17 61102.50 848880.6 3
8 2012-01-22 71798.29 883455.7 3
9 2012-01-27 61550.93 889447.7 3
10 2012-02-01 69863.50 838101.4 3
11 2012-02-06 71202.38 873705.6 3
12 2012-02-11 60124.56 828661.6 3
13 2012-02-16 65963.74 824347.5 3
14 2012-02-21 79347.69 818929.1 3
15 2012-02-26 68082.87 879863.1 3
16 2012-03-02 68661.00 891477.0 3
17 2012-03-07 71369.69 849595.6 3
18 2012-03-12 73265.85 834035.4 3
19 2012-03-17 70777.06 833344.5 3
20 2012-03-22 72104.04 881329.5 3
21 2012-03-27 75471.59 848650.2 3
22 2012-04-01 77590.13 867834.6 3
23 2012-04-06 75664.27 828857.6 3
24 2012-04-11 65789.62 814059.0 3
25 2012-04-16 72841.91 893683.3 3
26 2012-04-21 61047.06 805820.7 3
27 2012-04-26 77232.51 896022.5 3
28 2012-05-01 77553.05 817557.6 3
29 2012-05-06 75597.76 899616.4 3
Perhaps a more efficient way would be to use a join:
df$year = lubridate::year(df$date)
dplyr::left_join(df2, df, by=c("ID", "year")) %>% na.omit()
How to subset windows in a dataframe using start- and end-values from another dataframe in R?
Maybe try this approach with purrr::map2:
# dataframe of data to subset
df1 <- tibble(my_values = rnorm(100, mean = 45, sd = 30) %>% abs())
# dataframe of windows (i.e. row number IDs) to extract from data
df2 <- tibble::tribble(
  ~window_start, ~window_end,
             3L,         10L,
            21L,         25L,
            52L,         63L,
            78L,         90L
)
subset_thats_in <- function(mini, maxi){
  df1 %>%
    filter(between(my_values, mini, maxi))
}
purrr::map2(df2$window_start,
            df2$window_end,
            subset_thats_in)
[[1]]
# A tibble: 4 × 1
my_values
<dbl>
1 6.47
2 8.69
3 7.73
4 7.35
[[2]]
# A tibble: 12 × 1
my_values
<dbl>
1 24.2
2 22.9
3 22.4
4 24.4
5 22.6
6 21.7
7 23.2
8 21.3
9 23.3
10 21.1
11 23.5
12 22.6
[[3]]
# A tibble: 10 × 1
my_values
<dbl>
1 54.0
2 61.4
3 62.5
4 60.8
5 60.5
6 55.5
7 61.4
8 59.0
9 57.9
10 53.3
[[4]]
# A tibble: 6 × 1
my_values
<dbl>
1 87.8
2 79.1
3 80.5
4 82.7
5 85.2
6 80.6
How do I filter a range of numbers in R?
You can use %in%, or, as has been mentioned, alternatively dplyr's between():
library(dplyr)
new_frame <- Mydata %>% filter(x %in% 3:7)
new_frame
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97
While %in% works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%, use dplyr's between():
new_frame2 <- Mydata %>% filter(between(x, 3, 7))
new_frame2
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97
To further clarify, note that %in% checks for presence in a set of values:
3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE
The above return TRUE because 3:7 is shorthand for seq(3, 7), which produces:
3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7
As such, if you were to use %in% to check for values not produced by :, it will return FALSE:
4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE
Whereas between checks against the end points and all values in between:
between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE
Filtering and summarising a dataframe based another search dataframe
I reproduced the same results using data.table, but it actually performs worse than the OP's solution. Leaving it here in case it helps other people answer:
library(data.table)
setDT(df)
setDT(search)
df[search,
   on = .(dt > min_dt, dt < max_dt, x = category),
   .(min_dt, max_dt, dt, x, y, category)][, list(.N, mean_val = mean(y)),
                                          by = list(min_dt, max_dt, category)]
Benchmark:
dt_summ = function(df, search){
  setDT(df)
  setDT(search)
  setkeyv(df, c("dt", "y"))
  df[search,
     on = .(dt > min_dt, dt < max_dt, x = category),
     .(min_dt, max_dt, dt, x, y, category)][,
       list(.N, mean_val = mean(y)),
       by = list(min_dt, max_dt, category)]
}

dplyr_summ = function(df, search){
  # filter_summarise is the OP's helper function (not reproduced in this excerpt)
  bind_cols(search, purrr::pmap_dfr(search, filter_summarise))
}
library(microbenchmark)
microbenchmark(
  dplyr = dplyr_summ(df, search),
  dt = dt_summ(df, search)
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 4.0562 4.4588 5.580925 4.70385 5.0531 65.5202 100
# dt 6.7754 7.5449 8.246862 7.97395 8.6485 15.8260 100
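For completeness, dplyr_summ above depends on the OP's filter_summarise helper, which isn't reproduced in this excerpt. Here is a hedged sketch of what such a helper might look like, using made-up data with the column names from the join (dt, x, y in df; min_dt, max_dt, category in search):

```r
library(dplyr)

# made-up example data matching the column names used above
df <- tibble::tibble(dt = 1:5,
                     x = c("a", "a", "b", "a", "b"),
                     y = c(10, 20, 30, 40, 50))
search <- tibble::tibble(min_dt = c(0, 2),
                         max_dt = c(4, 6),
                         category = c("a", "b"))

# hypothetical reconstruction: for one search row, filter df to the
# open interval (min_dt, max_dt) and the category, then summarise
filter_summarise <- function(min_dt, max_dt, category) {
  df %>%
    filter(dt > min_dt, dt < max_dt, x == category) %>%
    summarise(N = n(), mean_val = mean(y))
}

bind_cols(search, purrr::pmap_dfr(search, filter_summarise))
```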