How to Filter a Range of Numbers in R

How do I filter a range of numbers in R?

You can use %in% or, as has been mentioned, dplyr's between():

 library(dplyr)
 
 new_frame <- Mydata %>% filter(x %in% 3:7)
 new_frame
 #   x  y
 # 1 3 45
 # 2 4 54
 # 3 5 65
 # 4 6 78
 # 5 7 97

%in% works well for integers (or other equally spaced sequences). If you need to filter on floats, or on any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%, use dplyr's between():

 new_frame2 <- Mydata %>% filter(between(x, 3, 7))
 new_frame2
 #   x  y
 # 1 3 45
 # 2 4 54
 # 3 5 65
 # 4 6 78
 # 5 7 97

To clarify further, %in% checks whether a value is present in a set of values:

3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE

These all return TRUE because 3:7 is shorthand for seq(3, 7), which produces:

3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7

As a result, if you use %in% to check for a value that the : operator does not produce, it returns FALSE:

4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE

By contrast, between() checks against the end points and every value in between:

between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE
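
To see the difference on a data frame, here is a minimal sketch in base R. The data frame Mydata_demo is hypothetical (it is not the Mydata from the question), and the inclusive comparison x >= 3 & x <= 7 is used as a stand-in for the test that between(x, 3, 7) performs:

```r
# Hypothetical data: x holds both whole and fractional values
Mydata_demo <- data.frame(x = c(2, 3, 4.5, 7, 7.2),
                          y = c(10, 20, 30, 40, 50))

# Set membership: keeps only rows whose x appears exactly in 3:7
in_set <- Mydata_demo[Mydata_demo$x %in% 3:7, ]

# Inclusive range test: keeps every x from 3 to 7, fractional or not
in_range <- Mydata_demo[Mydata_demo$x >= 3 & Mydata_demo$x <= 7, ]

in_set$x    # 3 and 7 only; 4.5 is dropped
in_range$x  # 3, 4.5 and 7
```

The row with x = 4.5 is the one that separates the two approaches, so it is worth checking which behaviour you want before choosing.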

Filter rows in a specific range values containing character and number in R

An approach using stringr's str_extract():

library(stringr)

val <- as.numeric(str_extract(df$Residue, "[[:digit:]]+"))

df[val > 300 & val < 500,]
  Residue Energy Model
2 R-A-350  -1.89 DELTA
3 R-B-468  -0.25 DELTA
4 R-C-490  -2.67 DELTA

Data

df <- structure(list(Residue = c("R-A-40", "R-A-350", "R-B-468", "R-C-490", 
"R-A-610"), Energy = c(-3.45, -1.89, -0.25, -2.67, -1.98), Model = c("DELTA", 
"DELTA", "DELTA", "DELTA", "DELTA")), class = "data.frame", row.names = c(NA, 
-5L))
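
If you would rather not load stringr, the same idea can be sketched in base R with regexpr() and regmatches(). This is an alternative to the answer above, not part of it, and it assumes every Residue label contains at least one run of digits (regmatches() silently drops elements with no match, which would misalign the result):

```r
# Same sample data as above, rebuilt so the block runs on its own
df <- data.frame(
  Residue = c("R-A-40", "R-A-350", "R-B-468", "R-C-490", "R-A-610"),
  Energy  = c(-3.45, -1.89, -0.25, -2.67, -1.98),
  Model   = "DELTA",
  stringsAsFactors = FALSE
)

# Extract the first run of digits from each label and convert to numeric
val <- as.numeric(regmatches(df$Residue, regexpr("[0-9]+", df$Residue)))

# Strict bounds, as in the answer: 300 < val < 500
res <- df[val > 300 & val < 500, ]
res
```

Use >= and <= instead if the end points themselves should be kept.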

Display number of rows containing a range of numbers (between 0-9) in R

filter() returns a data frame, and calling length() on a data frame returns the number of columns, not rows. Also, by putting ! in front of the regex match, you are selecting the rows that do not contain a number.

You can use sum() with grepl():

result <- sum(grepl('[0-9]', data$MESSAGE))
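
A small illustration of the points above. The data frame here is made up for the example; only the MESSAGE column name comes from the question:

```r
# Hypothetical stand-in for the asker's data
data <- data.frame(
  MESSAGE = c("error 404", "all good", "retry in 5s", "done"),
  stringsAsFactors = FALSE
)

# TRUE for each row whose MESSAGE contains at least one digit
has_digit <- grepl("[0-9]", data$MESSAGE)

n_sum  <- sum(has_digit)                         # counts the TRUE values
n_rows <- nrow(data[has_digit, , drop = FALSE])  # same count via nrow()
n_cols <- length(data)                           # number of columns, not rows
```

Here n_sum and n_rows agree, while length(data) reports the single column, which is the mistake described above.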

R- filter rows depending on value range across several columns

First test whether the values in each column are greater than or equal to 5 and less than or equal to 10, then keep the rows in which 3 or more values meet the condition.

dat[ rowSums( dat >= 5 & dat <= 10 ) >= 3, ]
  column1 column2 column3 column4 column5
1       7       4      10       9       2

Data

dat <- structure(list(column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L, 
2L), column4 = c(9L, 6L), column5 = c(2, 2)), class = "data.frame", row.names = c(NA, 
-2L))
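
The one-liner can be unpacked into its intermediate steps, which makes it easier to adjust the bounds or the required count. The names in_range, hits and kept are introduced here only for illustration:

```r
# Same sample data as above, rebuilt so the block runs on its own
dat <- data.frame(column1 = c(7L, 4L), column2 = c(4L, 8L),
                  column3 = c(10L, 2L), column4 = c(9L, 6L),
                  column5 = c(2, 2))

# Step 1: logical matrix, TRUE where a cell lies in [5, 10]
in_range <- dat >= 5 & dat <= 10

# Step 2: count the TRUE cells in each row
hits <- rowSums(in_range)

# Step 3: keep the rows with at least 3 cells in range
kept <- dat[hits >= 3, ]
```

Row 1 has three values in range (7, 10, 9) and row 2 has two (8, 6), so only row 1 is kept.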

Efficient way to filter one data frame by ranges in another

Here's a function that you can run in dplyr to find dates within a given range using the between function (from dplyr). For each pair of Start and End dates, mapply runs between on the whole Day vector, and the function then uses rowSums to return TRUE for each Day that falls within at least one of the ranges. I'm not sure if it's the most efficient approach, but it results in nearly a factor of four improvement in speed over the rowwise version in the first benchmark below.

test.overlap = function(vals) {
  rowSums(mapply(function(a,b) between(vals, a, b), 
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>% 
  filter(test.overlap(Day))
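
To make the intermediate matrix visible, here is a minimal sketch of the same idea in base R. The two small data frames are invented for the example, and the explicit >= and <= comparisons stand in for between():

```r
# Hypothetical toy data: ten days and two spans
main_data <- data.frame(Day = 1:10)
spans_to_filter <- data.frame(Start = c(2, 7), End = c(3, 8))

# One logical column per span: TRUE where Day lies inside that span
inside <- mapply(function(a, b) main_data$Day >= a & main_data$Day <= b,
                 spans_to_filter$Start, spans_to_filter$End)

# A row is kept if it falls inside at least one span
keep <- rowSums(inside) > 0
main_data[keep, , drop = FALSE]
```

The matrix has one row per Day and one column per span, so its size grows with the product of the two, which is worth keeping in mind for many spans.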

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>% 
  filter(Day %in% filt.vals)
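
A variation on the membership approach, sketched with Map() and seq() on invented toy data. Map() always returns a list, so unlist() flattens it to a plain vector whether or not the spans have the same length (the spans below deliberately differ in length):

```r
# Hypothetical toy data: spans of different lengths
main_data <- data.frame(Day = 1:10)
spans_to_filter <- data.frame(Start = c(2, 7), End = c(5, 8))

# Expand every span into its individual values, then flatten
filt.vals <- unlist(Map(seq, spans_to_filter$Start, spans_to_filter$End))

# Keep the rows whose Day is one of the expanded values
matched <- main_data[main_data$Day %in% filt.vals, , drop = FALSE]
```

This only makes sense when the values are discrete (whole days, integers); for continuous values the range comparison is the safer choice.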

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>% 
    filter(Day %in% filt.vals)
  )

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b 
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day=c(1:100000))

spans_to_filter = 
  data.frame(Span_number = c(1:9),
             Start = c(2,7,1,15,12,23,90,9000,50000),
             End = c(5,10,4,18,15,26,100,9100,50100))

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>% 
      filter(Day %in% filt.vals)}, 
  times=10
  )

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10  a 
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10  a

Filter variable based on NA 20% in a range—R

library(dplyr)

df2 %>%
  mutate(missing_perc = rowMeans(is.na(select(., mssi1_1: mssi1_4))) * 100)

Output is:

  uci       ID Class   age   sex bhsMean tbMean pbMean acssMean mssi1_1 mssi1_2 mssi1_3 mssi1_4 missing_perc
1 10001h  1.00  1.00  14.0     0  0.470    2.56   2.00     2.29   NA      NA         NA      NA        100  
2 10476h  5.00  1.00  17.0     0  0.300    3.89   3.67     1.86   NA      NA          0       0         50.0
3 10484h  6.00  1.00  14.0     0  0.160    2.67   4.00     1.14    0       0          0       0          0  
4 10580h 13.0   1.00  14.0     0  0.150    2.33   4.50     2.00    1.00    1.00       0       0          0  
5 14280h 20.0   1.00  15.0     0  0.350    4.89   2.17     1.14    1.00    1.00       0       0          0  
6 2313n  28.0   1.00  14.0     0  0.0600   1.44   1.00    NA       0       0          0       0          0
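
Once the percentage is available, the filtering step itself is a single comparison. The sketch below uses base R and a cut-down, hypothetical copy of the sample data; keeping rows with at most 20% missing values is an assumption based on the question title, so flip the comparison if you need the opposite:

```r
# Hypothetical subset of the sample data: an id plus the four mssi1 items
df2_demo <- data.frame(
  uci     = c("10001h", "10476h", "10484h"),
  mssi1_1 = c(NA, NA, 0),
  mssi1_2 = c(NA, NA, 0),
  mssi1_3 = c(NA, 0, 0),
  mssi1_4 = c(NA, 0, 0),
  stringsAsFactors = FALSE
)

item_cols <- c("mssi1_1", "mssi1_2", "mssi1_3", "mssi1_4")

# Share of missing values per row across the chosen columns, in percent
missing_perc <- rowMeans(is.na(df2_demo[item_cols])) * 100

# Assumed rule: keep rows with at most 20% missing
kept <- df2_demo[missing_perc <= 20, ]
```

With four columns, a single NA already amounts to 25%, so under this rule only rows with no missing items survive.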

Sample data:

df2 <- structure(list(uci = c("10001h", "10476h", "10484h", "10580h", 
"14280h", "2313n"), ID = c(1, 5, 6, 13, 20, 28), Class = c(1, 
1, 1, 1, 1, 1), age = c(14, 17, 14, 14, 15, 14), sex = c(0, 0, 
0, 0, 0, 0), bhsMean = c(0.47, 0.3, 0.16, 0.15, 0.35, 0.06), 
    tbMean = c(2.56, 3.89, 2.67, 2.33, 4.89, 1.44), pbMean = c(2, 
    3.67, 4, 4.5, 2.17, 1), acssMean = c(2.29, 1.86, 1.14, 2, 
    1.14, NA), mssi1_1 = c(NA, NA, 0, 1, 1, 0), mssi1_2 = c(NA, 
    NA, 0, 1, 1, 0), mssi1_3 = c(NA, 0, 0, 0, 0, 0), mssi1_4 = c(NA, 
    0, 0, 0, 0, 0)), .Names = c("uci", "ID", "Class", "age", 
"sex", "bhsMean", "tbMean", "pbMean", "acssMean", "mssi1_1", 
"mssi1_2", "mssi1_3", "mssi1_4"), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))
