Filter by Ranges Supplied by Two Vectors, Without a Join Operation

Filter by ranges supplied by two vectors, without a join operation

Maybe you could borrow the inrange function from data.table, which checks whether each value in x is in between any of the intervals provided in lower, upper.

Usage:

inrange(x, lower, upper, incbounds=TRUE)

library(dplyr); library(data.table)

tmp_df <- data.frame(a = 1:5)  # example data assumed from the output below
tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
# a
#1 2
#2 4
#3 5
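
Recent versions of data.table also document an operator form, %inrange% (alongside %between%). Assuming the same tmp_df and that the operator accepts a two-element list of lower/upper bounds as its table argument, the call above can be written as:

tmp_df %>% filter(a %inrange% list(c(2,4), c(2,5)))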

How can I fast outer-join and filter two vectors (or lists), preferably in base R?

The main reason your solution is so slow is the line cp <- c(cp, list(c(d1, d2))). This is a very inefficient way to grow an object because it results in the whole object being copied with each c() call.

If you instead assign into a list by index you will see substantially better performance. We can make a couple of other small optimisations:

  • Checking whether FUN is missing outside of the main loop so that we just need to do it once.
  • Allocating a list of the correct length up-front if FUN is missing, since we know the length of the output for this case.
outer_join <- function(x, y = x, FUN) {
  fmissing <- missing(FUN)
  if (fmissing) {
    cp <- vector("list", length(x) * length(y))
  } else {
    cp <- list()
  }
  i <- 1L
  for (d1 in x) {
    for (d2 in y) {
      if (fmissing || FUN(d1, d2)) {
        cp[[i]] <- c(d1, d2)
        i <- i + 1L
      }
    }
  }
  cp
}
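
As a quick usage sketch (not part of the benchmark below), a filtered call returns a list of the matching pairs; for example:

outer_join(1:3, FUN = `<`)
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 1 3
#>
#> [[3]]
#> [1] 2 3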

microbenchmark::microbenchmark(
  `Ex. 1` = outer_join(seq(2^8)),
  `Ex. 2` = outer_join(seq(2^8), FUN = `==`),
  `Ex. 3` = outer_join(seq(2^8), FUN = function(a, b) (a - b) %% 7L == 0),
  times = 10,
  unit = "s"
)
#> Unit: seconds
#>   expr        min         lq       mean     median         uq        max neval
#>  Ex. 1 0.02300627 0.02473937 0.02787098 0.02566033 0.03057122 0.03753821    10
#>  Ex. 2 0.01391696 0.01527710 0.01785506 0.01735052 0.01916601 0.02490142    10
#>  Ex. 3 0.05839193 0.06460381 0.07189763 0.07218238 0.08215803 0.08275439    10

Also see the discussion of growing objects in Hadley Wickham's Advanced R, as well as Circle 2 of The R Inferno. My experience is that R is not as slow at loops as it is reputed to be, as long as you avoid growing objects inefficiently.
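
As a minimal illustration of the difference (grow() and prealloc() below are not from the answer above; they are a small sketch for comparison):

grow <- function(n) {
  out <- list()
  for (i in seq_len(n)) out <- c(out, list(i))  # copies 'out' on every iteration
  out
}

prealloc <- function(n) {
  out <- vector("list", n)             # allocate the full-length list once
  for (i in seq_len(n)) out[[i]] <- i  # fill in place by index
  out
}

identical(grow(100), prealloc(100))
#> [1] TRUE
# microbenchmark::microbenchmark(grow(5e3), prealloc(5e3), times = 10)
# prealloc() is typically far faster, and the gap widens as n grows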

Efficient way to filter one data frame by ranges in another

Here's a function you can use in a dplyr pipeline to find dates within a given range, using dplyr's between function. For each value of Day, mapply runs between on each pair of Start and End dates, and the function uses rowSums to return TRUE if Day falls within at least one of the pairs. I'm not sure it's the most efficient approach, but it results in nearly a factor-of-four improvement in speed.

test.overlap = function(vals) {
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>%
  filter(test.overlap(Day))

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>%
  filter(Day %in% filt.vals)

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>%
    filter(Day %in% filt.vals)
)

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day=c(1:100000))

spans_to_filter =
  data.frame(Span_number = c(1:9),
             Start = c(2, 7, 1, 15, 12, 23, 90, 9000, 50000),
             End = c(5, 10, 4, 18, 15, 26, 100, 9100, 50100))

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>%
      filter(Day %in% filt.vals)},
  times = 10
)

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10  b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10 a
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10 a

Filter dataframe between values in two vectors and add results to list in R

Tidyverse Solution 1 (using purrr's map2):

library(tidyverse)
map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))

Tidyverse Solution 2 (this time with map):

map(1:length(v), ~ mydata[mydata$x >= v[.] & mydata$x <= v1[.],])

For Loop Solution

result <- list()
for (i in 1:length(v)) {
  result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}
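
The objects mydata, v, and v1 come from the question and are not shown here; as a self-contained sketch with assumed example inputs:

library(tidyverse)

mydata <- data.frame(x = 1:10)  # assumed example data
v  <- c(1, 4, 7)                # assumed lower bounds
v1 <- c(3, 6, 9)                # assumed upper bounds

map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))
# returns a list of three data frames: rows with x in 1..3, 4..6, and 7..9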

Subset a dataframe in two dataframes by values in two columns of another dataframe

We can full_join these two data frames and then filter for the rows whose position falls within the start and end columns. The Flag column in the example exists only to allow the join. Finally, we can use distinct to remove duplicated rows.

library(dplyr)

df3 <- df1 %>%
  mutate(Flag = 1) %>%
  full_join(df2 %>% mutate(Flag = 1), by = "Flag") %>%
  filter(position >= start, position <= end) %>%
  distinct(position)
df3
#   position
# 1        3
# 2        6
# 3       12
# 4       25
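
As a side note, if your dplyr version provides cross_join() (added in dplyr 1.1.0), the helper Flag column is unnecessary; a sketch of the equivalent pipeline:

df3 <- df1 %>%
  cross_join(df2) %>%
  filter(position >= start, position <= end) %>%
  distinct(position)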

DATA

df1 <- read.table(text = "position
3
6
12
18
25
31", header = TRUE)

df2 <- read.table(text = "start end
2 17
24 29",
header = TRUE)

Filtering Data Table Row Vector That Lies Between 2 Numeric Vectors

Your data:

df <- read.table(header=TRUE, text='
ID AltID Crit1 Crit2 Crit3
1 1 1 5 10
1 2 3 7 15
1 3 2 6 11')
minCutoff = c(0, 5, 10)
maxCutoff = c(4, 7, 12)

TL;DR:

df[rowSums(mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11

Having a variable number of Crit columns is easily handled with a function applied to each column in turn, then aggregating the results. If you are already using the dplyr package, then you already have dplyr::between; if not, here is an acceptable replacement:

between <- function(x, low, hi) low <= x & x <= hi

I'll walk you through the work:

isbetween <- mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)
isbetween
# Crit1 Crit2 Crit3
# [1,] TRUE TRUE TRUE
# [2,] TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE
  • df[grepl("Crit", colnames(df)) ] is one way (of several) for looking at just the columns that are of interest to you;

  • mapply applies a function (between, in this case) to the first element of each of the other lists/vectors, then to the second elements, and so on. It is effectively the same as:

    between(df[3], minCutoff[1], maxCutoff[1])
    between(df[4], minCutoff[2], maxCutoff[2])
    ...

Now that we have a logical matrix of individual values within their respective cutoffs, we can look at each row to check whether it meets your filter requirement of 3 or more. Unfortunately, your listed expected output is not compatible with your rules, so I'll offer some alternatives:

  • "where any 3 columns fall outside the range", meaning if 3 or more columns are FALSE, then the row should be removed

    rowSums(!isbetween) >= 3
    # [1] FALSE FALSE FALSE
  • "where at least 3 columns fall inside the range", which is what your expected output suggests:

    rowSums(isbetween) >= 3
    # [1] TRUE FALSE TRUE

Regardless of which you choose, take this logical vector and subset the rows, such as

df[rowSums(isbetween) >= 3,]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11

(The biggest difference between Rui's answer and this one is that that answer uses apply on a data.frame for row-wise operations, implicitly converting the involved columns into a matrix. My answer works column-wise (the natural operation with frames), so no conversion is done. Other than this conversion, if the frame is not huge then the performance of row-wise versus column-wise should be roughly the same. If it is highly asymmetric (e.g., many more rows than columns), then it might be a little faster to work column-wise. Vectorized work in R is almost always much faster than iterative.)
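
For comparison, a minimal sketch of what a row-wise apply version might look like (an assumed reconstruction for illustration, not Rui's actual code):

crit <- grep("Crit", colnames(df))
keep <- apply(df[crit], 1, function(row) sum(row >= minCutoff & row <= maxCutoff) >= 3)
df[keep, ]
# ID AltID Crit1 Crit2 Crit3
# 1 1 1 1 5 10
# 3 1 3 2 6 11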

Take dates from one dataframe and filter data in another dataframe

library(dplyr)
df <- left_join(df, df1, by = "user")
df <- df %>% filter(date >= start_date & date <= end_date)
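
The question's df (dated records) and df1 (per-user date windows) are not shown; a self-contained sketch with assumed example data:

df <- data.frame(user = c("a", "a", "b"),
                 date = as.Date(c("2021-01-02", "2021-02-01", "2021-01-05")))
df1 <- data.frame(user = c("a", "b"),
                  start_date = as.Date(c("2021-01-01", "2021-01-01")),
                  end_date = as.Date(c("2021-01-31", "2021-01-10")))

df <- left_join(df, df1, by = "user")
df %>% filter(date >= start_date & date <= end_date)
#   user       date start_date   end_date
# 1    a 2021-01-02 2021-01-01 2021-01-31
# 2    b 2021-01-05 2021-01-01 2021-01-10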

R: How to filter/subset a sequence of dates

You could use subset.

Generating your sample data:

temp <-
  read.table(text="date sessions
2014-12-01 1932
2014-12-02 1828
2014-12-03 2349
2014-12-04 8192
2014-12-05 3188
2014-12-06 3277", header=TRUE)

Making sure it's in date format:

temp$date <- as.Date(temp$date, format= "%Y-%m-%d")

temp

# date sessions
# 1 2014-12-01 1932
# 2 2014-12-02 1828
# 3 2014-12-03 2349
# 4 2014-12-04 8192
# 5 2014-12-05 3188
# 6 2014-12-06 3277

Using subset:

subset(temp, date> "2014-12-03" & date < "2014-12-05")

which gives:

#         date sessions
# 4 2014-12-04     8192

You could also use []:

temp[(temp$date> "2014-12-03" & temp$date < "2014-12-05"),]
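
For consistency with the dplyr-based answers elsewhere on this page, the same filter can also be written with dplyr::filter (assuming the temp data above):

library(dplyr)
temp %>% filter(date > as.Date("2014-12-03") & date < as.Date("2014-12-05"))
# returns the single row for 2014-12-04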

