Filter by ranges supplied by two vectors, without a join operation
Maybe you could borrow the inrange
function from data.table
, which
checks whether each value in x is in between any of the
intervals provided in lower,upper.
Usage:
inrange(x, lower, upper, incbounds=TRUE)
library(dplyr); library(data.table)
tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
# a
#1 2
#2 4
#3 5
How can I fast outer-join and filter two vectors (or lists), preferably in base R?
The main reason your solution is so slow is because of the line cp <- c( cp, list(c(d1, d2)))
. This is a very inefficient way to grow an object because it results in the object being copied with each c()
call.
If you instead insert into a list you will see substantially better performance. We can make a couple of other small optimisations:
- Checking whether
FUN
is missing outside of the main loop so that we just need to do it once. - Allocating a vector of the correct length up-front if
FUN
is missing, since we know the length of the output for this case.
outer_join <- function(x, y = x, FUN) {
fmissing <- missing(FUN)
if (fmissing) {
cp <- vector("list", length(x) * length(y))
} else {
cp <- list()
}
i <- 1L
for (d1 in x) {
for (d2 in y) {
if (fmissing || FUN(d1, d2)) {
cp[[i]] <- c(d1, d2)
i <- i + 1L
}
}
}
cp
}
microbenchmark::microbenchmark(
`Ex. 1` = outer_join(seq(2^8)),
`Ex. 2` = outer_join(seq(2^8), FUN = `==`),
`Ex. 3` = outer_join(seq(2^8), FUN = function(a, b) (a - b) %% 7L == 0),
times = 10,
unit = "s"
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> Ex. 1 0.02300627 0.02473937 0.02787098 0.02566033 0.03057122 0.03753821 10
#> Ex. 2 0.01391696 0.01527710 0.01785506 0.01735052 0.01916601 0.02490142 10
#> Ex. 3 0.05839193 0.06460381 0.07189763 0.07218238 0.08215803 0.08275439 10
Also see this chapter from Hadley Wickham's Advanced R for a discussion of issues with growing objects, as well as Chapter 2 of R Inferno. My experience is that R is not as slow at loops as it is reputed to be, as long as you avoid growing objects inefficiently.
Efficient way to filter one data frame by ranges in another
Here's a function that you can run in dplyr
to find dates within a given range using the between
function (from dplyr
). For each value of Day
, mapply
runs between
on each of the pairs of Start
and End
dates and the function uses rowSums
to return TRUE
if Day
is between at least one of them. I'm not sure if it's the most efficient approach, but it results in nearly a factor of four improvement in speed.
test.overlap = function(vals) {
rowSums(mapply(function(a,b) between(vals, a, b),
spans_to_filter$Start, spans_to_filter$End)) > 0
}
main_data %>%
filter(test.overlap(Day))
If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):
filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
main_data %>%
filter(Day %in% filt.vals)
Now compare execution speeds. I shortened your code to require only the filtering operation:
library(microbenchmark)
microbenchmark(
OP=main_data %>%
rowwise() %>%
filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
eipi10 = main_data %>%
filter(test.overlap(Day)),
eipi10_2 = main_data %>%
filter(Day %in% filt.vals)
)
Unit: microseconds
expr min lq mean median uq max neval cld
OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481 100 c
eipi10 658.941 686.933 782.8840 714.4440 770.679 2474.941 100 b
eipi10_2 579.338 601.355 655.1451 619.2595 672.535 1032.145 100 a
UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between
method, and far greater still for the second method).
main_data = data.frame(Day=c(1:100000))
spans_to_filter =
data.frame(Span_number = c(1:9),
Start = c(2,7,1,15,12,23,90,9000,50000),
End = c(5,10,4,18,15,26,100,9100,50100))
microbenchmark(
OP=main_data %>%
rowwise() %>%
filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
eipi10 = main_data %>%
filter(test.overlap(Day)),
eipi10_2 = {
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
main_data %>%
filter(Day %in% filt.vals)},
times=10
)
Unit: milliseconds
expr min lq mean median uq max neval cld
OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648 10 b
eipi10 24.209111 25.434856 29.526571 26.455813 32.051920 48.277326 10 a
eipi10_2 2.505509 2.618668 4.037414 2.892234 6.222845 8.266612 10 a
Filter dataframe between values in two vectors and add results to list in R
Tidyverse Solution 1 (using purrr's map2
):
library(tidyverse)
map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))
Tidyverse Solution 2 (this time with map
)
map(1:length(v), ~ mydata[mydata$x >= v[.] & mydata$x <= v1[.],])
For Loop Solution
result <- list()
for (i in 1:length(v)) {
result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}
Efficient way to filter one data frame by ranges in another
Here's a function that you can run in dplyr
to find dates within a given range using the between
function (from dplyr
). For each value of Day
, mapply
runs between
on each of the pairs of Start
and End
dates and the function uses rowSums
to return TRUE
if Day
is between at least one of them. I'm not sure if it's the most efficient approach, but it results in nearly a factor of four improvement in speed.
test.overlap = function(vals) {
rowSums(mapply(function(a,b) between(vals, a, b),
spans_to_filter$Start, spans_to_filter$End)) > 0
}
main_data %>%
filter(test.overlap(Day))
If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):
filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
main_data %>%
filter(Day %in% filt.vals)
Now compare execution speeds. I shortened your code to require only the filtering operation:
library(microbenchmark)
microbenchmark(
OP=main_data %>%
rowwise() %>%
filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
eipi10 = main_data %>%
filter(test.overlap(Day)),
eipi10_2 = main_data %>%
filter(Day %in% filt.vals)
)
Unit: microseconds
expr min lq mean median uq max neval cld
OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481 100 c
eipi10 658.941 686.933 782.8840 714.4440 770.679 2474.941 100 b
eipi10_2 579.338 601.355 655.1451 619.2595 672.535 1032.145 100 a
UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between
method, and far greater still for the second method).
main_data = data.frame(Day=c(1:100000))
spans_to_filter =
data.frame(Span_number = c(1:9),
Start = c(2,7,1,15,12,23,90,9000,50000),
End = c(5,10,4,18,15,26,100,9100,50100))
microbenchmark(
OP=main_data %>%
rowwise() %>%
filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
eipi10 = main_data %>%
filter(test.overlap(Day)),
eipi10_2 = {
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
main_data %>%
filter(Day %in% filt.vals)},
times=10
)
Unit: milliseconds
expr min lq mean median uq max neval cld
OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648 10 b
eipi10 24.209111 25.434856 29.526571 26.455813 32.051920 48.277326 10 a
eipi10_2 2.505509 2.618668 4.037414 2.892234 6.222845 8.266612 10 a
Subset a dataframe in two dataframes by values in two columns of another dataframe
We can full_join
these two data frames and then filter for the rows within the start
and end
column. The Flag
column in the example is just for the join. Finally, we can use distinct
to remove duplicated rows.
library(dplyr)
df3 <- df1 %>%
mutate(Flag = 1) %>%
full_join(df2 %>% mutate(Flag = 1), by = "Flag") %>%
filter(position >= start, position <= end) %>%
distinct(position)
df3
# position
# 1 3
# 2 6
# 3 12
# 4 25
DATA
df1 <- read.table(text = "position
3
6
12
18
25
31", header = TRUE)
df2 <- read.table(text = "start end
2 17
24 29",
header = TRUE)
Filtering Data Table Row Vector That Lies Between 2 Numeric Vectors
Your data:
df <- read.table(header=TRUE, text='
ID AltID Crit1 Crit2 Crit3
1 1 1 5 10
1 2 3 7 15
1 3 2 6 11')
minCutoff = c(0, 5, 10)
maxCutoff = c(4, 7, 12)
TL;DR:
df[rowSums(mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11
Having a variable number of Crit
columns is easily handled with a function to apply to each in turn, and then aggregate the results. If you are already using the dplyr
package, then you already have dplyr::between
, but if not then here is an acceptable replacement:
between <- function(x, low, hi) low <= x & x <= hi
I'll walk you through the work:
isbetween <- mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)
isbetween
# Crit1 Crit2 Crit3
# [1,] TRUE TRUE TRUE
# [2,] TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE
df[grepl("Crit", colnames(df)) ]
is one way (of several) for looking at just the columns that are of interest to you;mapply
applies a function (between
, in this case) with the first value of each of the other lists/vectors. It is effectively the same as:between(df[3], minCutoff[1], maxCutoff[1])
between(df[4], minCutoff[2], maxCutoff[2])
...
Now that we have a logical matrix of individual values within their respective cutoffs, we an look at each row to check if they meet your filter requirements of 3 or more. Unfortunately, your listed expected output is not compatible with your rules, so I'll offer some alternatives:
"where any 3 columns fall outside the range", meaning if 3 or more columns are
FALSE
, then the row should be removedrowSums(!isbetween) >= 3
# [1] FALSE FALSE FALSE"where at least 3 columns fall inside the range", which is what your expected output suggests:
rowSums(isbetween) >= 3
# [1] TRUE FALSE TRUE
Regardless of which you choose, take this logical vector and subset the rows, such as
df[rowSums(isbetween) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11
(The biggest difference between Rui's answer and this is that that answer uses apply
on a data.frame
for row-wise operations, implicitly converting the involved columns into a matrix. My answer works column-wise (natural operation with frames), so no conversion is done. Other than this conversion, if the frame is not huge then the performance of row-wise versus column-wise should be roughly the same. If it is largely assymmetric (e.g., many many more rows than columns), then it might be a little faster to work column-wise. Vectorized work in R is almost always much faster than iterative.)
Take dates from one dataframe and filter data in another dataframe
library(dplyr)
df<-left_join(df,df1,by="user")
df <- df %>% filter(date>=start_date & date<=end_date)
R: How to filter/subset a sequence of dates
you could use subset
Generating your sample data:
temp<-
read.table(text="date sessions
2014-12-01 1932
2014-12-02 1828
2014-12-03 2349
2014-12-04 8192
2014-12-05 3188
2014-12-06 3277", header=T)
Making sure it's in date format:
temp$date <- as.Date(temp$date, format= "%Y-%m-%d")
temp
# date sessions
# 1 2014-12-01 1932
# 2 2014-12-02 1828
# 3 2014-12-03 2349
# 4 2014-12-04 8192
# 5 2014-12-05 3188
# 6 2014-12-06 3277
Using subset
:
subset(temp, date> "2014-12-03" & date < "2014-12-05")
which gives:
# date sessions
# 4 2014-12-04 8192
you could also use []
:
temp[(temp$date> "2014-12-03" & temp$date < "2014-12-05"),]
Related Topics
Rstudio Viewer Pane Not Working
Plot Weighted Frequency Matrix
Use Different Font Sizes for Different Portions of Text in Ggplot2 Title
How to Set R to Default Options
Linear Regression with Constraints on The Coefficients
Getting Stargazer Column Labels to Print on Two or Three Lines
Aggregating Rows for Multiple Columns in R
Applying Function (Ks.Test) Between Two Data Frames Column-Wise in R
Summing Multiple Columns in an R Data-Frame Quickly
How to Position Annotate Text in The Blank Area of Facet Ggplot
Find If Each Row of a Logical Matrix Has at Least One True
Convert 12Hour Time to 24Hour Time
Tiff Plot Generation and Compression: R VS. Gimp VS. Irfanview VS. Photoshop File Sizes
Segfault in R Using Reshape2 Package and Dcast
Using Inst/Extdata with Vignette During Package Checking R 2.14.0