Filter by Ranges Supplied by Two Vectors, Without a Join Operation

Filter by ranges supplied by two vectors, without a join operation

Maybe you could borrow the inrange function from data.table, which

checks whether each value in x is in between any of the
intervals provided in lower,upper.

Usage:

inrange(x, lower, upper, incbounds=TRUE)

library(dplyr); library(data.table)

tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
#  a
#1 2
#2 4
#3 5

How can I fast outer-join and filter two vectors (or lists), preferably in base R?

The main reason your solution is so slow is because of the line cp <- c( cp, list(c(d1, d2))). This is a very inefficient way to grow an object because it results in the object being copied with each c() call.

If you instead insert into a list you will see substantially better performance. We can make a couple of other small optimisations:

Checking whether FUN is missing outside of the main loop so that we just need to do it once.
Allocating a vector of the correct length up-front if FUN is missing, since we know the length of the output for this case.

outer_join <- function(x, y = x, FUN) {
  fmissing <- missing(FUN)
  if (fmissing) {
    cp <- vector("list", length(x) * length(y))
  } else {
    cp <- list()
  }
  i <- 1L
  for (d1 in x) {
    for (d2 in y) {
      if (fmissing || FUN(d1, d2)) {
        cp[[i]] <- c(d1, d2)
        i <- i + 1L
      }
    }
  }
  cp
}

microbenchmark::microbenchmark(
  `Ex. 1` = outer_join(seq(2^8)),
  `Ex. 2` = outer_join(seq(2^8), FUN = `==`),
  `Ex. 3` = outer_join(seq(2^8), FUN = function(a, b) (a - b) %% 7L == 0),
  times = 10,
  unit = "s"
)
#> Unit: seconds
#>   expr        min         lq       mean     median         uq        max neval
#>  Ex. 1 0.02300627 0.02473937 0.02787098 0.02566033 0.03057122 0.03753821    10
#>  Ex. 2 0.01391696 0.01527710 0.01785506 0.01735052 0.01916601 0.02490142    10
#>  Ex. 3 0.05839193 0.06460381 0.07189763 0.07218238 0.08215803 0.08275439    10

Also see this chapter from Hadley Wickham's Advanced R for a discussion of issues with growing objects, as well as Chapter 2 of R Inferno. My experience is that R is not as slow at loops as it is reputed to be, as long as you avoid growing objects inefficiently.

Efficient way to filter one data frame by ranges in another

Here's a function that you can run in dplyr to find dates within a given range using the between function (from dplyr). For each value of Day, mapply runs between on each of the pairs of Start and End dates and the function uses rowSums to return TRUE if Day is between at least one of them. I'm not sure if it's the most efficient approach, but it results in nearly a factor of four improvement in speed.

test.overlap = function(vals) {
  rowSums(mapply(function(a,b) between(vals, a, b), 
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>% 
  filter(test.overlap(Day))

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>% 
  filter(Day %in% filt.vals)

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>% 
    filter(Day %in% filt.vals)
  )

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b 
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day=c(1:100000))

spans_to_filter = 
  data.frame(Span_number = c(1:9),
             Start = c(2,7,1,15,12,23,90,9000,50000),
             End = c(5,10,4,18,15,26,100,9100,50100))

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>% 
      filter(Day %in% filt.vals)}, 
  times=10
  )

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10  a 
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10  a

Filter dataframe between values in two vectors and add results to list in R

Tidyverse Solution 1 (using purrr's `map2`):

library(tidyverse)
map2(v, v1, ~ filter(mydata, x >= .x & x <= .y))

Tidyverse Solution 2 (this time with `map`)

map(1:length(v), ~ mydata[mydata$x >= v[.] & mydata$x <= v1[.],])

For Loop Solution

result <- list()
for (i in 1:length(v)) {
  result[[i]] <- filter(mydata, x >= v[i] & x <= v1[i])
}

Efficient way to filter one data frame by ranges in another

test.overlap = function(vals) {
  rowSums(mapply(function(a,b) between(vals, a, b), 
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>% 
  filter(test.overlap(Day))

filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>% 
  filter(Day %in% filt.vals)

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>% 
    filter(Day %in% filt.vals)
  )

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b 
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a

main_data = data.frame(Day=c(1:100000))

spans_to_filter = 
  data.frame(Span_number = c(1:9),
             Start = c(2,7,1,15,12,23,90,9000,50000),
             End = c(5,10,4,18,15,26,100,9100,50100))

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>% 
      filter(Day %in% filt.vals)}, 
  times=10
  )

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10  a 
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10  a

Subset a dataframe in two dataframes by values in two columns of another dataframe

We can full_join these two data frames and then filter for the rows within the start and end column. The Flag column in the example is just for the join. Finally, we can use distinct to remove duplicated rows.

library(dplyr)

df3 <- df1 %>%
  mutate(Flag = 1) %>%
  full_join(df2 %>% mutate(Flag = 1), by = "Flag") %>%
  filter(position >= start, position <= end) %>%
  distinct(position)
df3
#   position
# 1        3
# 2        6
# 3       12
# 4       25

DATA

df1 <- read.table(text = "position
3
                  6
                  12
                  18
                  25
                  31", header = TRUE)

df2 <- read.table(text = "start   end
2       17
24      29",
                  header = TRUE)

Filtering Data Table Row Vector That Lies Between 2 Numeric Vectors

Your data:

df <- read.table(header=TRUE, text='
ID AltID   Crit1   Crit2   Crit3
1  1       1       5       10
1  2       3       7       15
1  3       2       6       11')
minCutoff = c(0, 5, 10)
maxCutoff = c(4, 7, 12)

TL;DR:

df[rowSums(mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)) >= 3,]
#   ID AltID Crit1 Crit2 Crit3
# 1  1     1     1     5    10
# 3  1     3     2     6    11

Having a variable number of Crit columns is easily handled with a function to apply to each in turn, and then aggregate the results. If you are already using the dplyr package, then you already have dplyr::between, but if not then here is an acceptable replacement:

between <- function(x, low, hi) low <= x & x <= hi

I'll walk you through the work:

isbetween <- mapply(between, df[ grep("Crit", colnames(df)) ], minCutoff, maxCutoff)
isbetween
#      Crit1 Crit2 Crit3
# [1,]  TRUE  TRUE  TRUE
# [2,]  TRUE  TRUE FALSE
# [3,]  TRUE  TRUE  TRUE

df[grepl("Crit", colnames(df)) ] is one way (of several) for looking at just the columns that are of interest to you;
mapply applies a function (between, in this case) with the first value of each of the other lists/vectors. It is effectively the same as:
```
between(df[3], minCutoff[1], maxCutoff[1])
between(df[4], minCutoff[2], maxCutoff[2])
...
```

Now that we have a logical matrix of individual values within their respective cutoffs, we an look at each row to check if they meet your filter requirements of 3 or more. Unfortunately, your listed expected output is not compatible with your rules, so I'll offer some alternatives:

"where any 3 columns fall outside the range", meaning if 3 or more columns are FALSE, then the row should be removed
```
rowSums(!isbetween) >= 3
# [1] FALSE FALSE FALSE
```
"where at least 3 columns fall inside the range", which is what your expected output suggests:
```
rowSums(isbetween) >= 3
# [1]  TRUE FALSE  TRUE
```

Regardless of which you choose, take this logical vector and subset the rows, such as

df[rowSums(isbetween) >= 3,]
#   ID AltID Crit1 Crit2 Crit3
# 1  1     1     1     5    10
# 3  1     3     2     6    11

(The biggest difference between Rui's answer and this is that that answer uses apply on a data.frame for row-wise operations, implicitly converting the involved columns into a matrix. My answer works column-wise (natural operation with frames), so no conversion is done. Other than this conversion, if the frame is not huge then the performance of row-wise versus column-wise should be roughly the same. If it is largely assymmetric (e.g., many many more rows than columns), then it might be a little faster to work column-wise. Vectorized work in R is almost always much faster than iterative.)

Take dates from one dataframe and filter data in another dataframe

library(dplyr)
df<-left_join(df,df1,by="user")
df <- df %>% filter(date>=start_date & date<=end_date)

R: How to filter/subset a sequence of dates

you could use subset

Generating your sample data:

temp<-
read.table(text="date     sessions
2014-12-01  1932
2014-12-02  1828
2014-12-03  2349
2014-12-04  8192
2014-12-05  3188
2014-12-06  3277", header=T)

Making sure it's in date format:

temp$date <- as.Date(temp$date, format= "%Y-%m-%d")

temp

 #        date sessions
 # 1 2014-12-01     1932
 # 2 2014-12-02     1828
 # 3 2014-12-03     2349
 # 4 2014-12-04     8192
 # 5 2014-12-05     3188
 # 6 2014-12-06     3277

Using subset :

subset(temp, date> "2014-12-03" & date < "2014-12-05")

which gives:

  #        date sessions
  # 4 2014-12-04     8192

you could also use []:

temp[(temp$date> "2014-12-03" & temp$date < "2014-12-05"),]

Filter by Ranges Supplied by Two Vectors, Without a Join Operation