How to Filter a Range of Numbers in R

How do I filter a range of numbers in R?

You can use %in% or, as has been mentioned, dplyr's between():

library(dplyr)

new_frame <- Mydata %>% filter(x %in% (3:7) )
new_frame
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

%in% works well for integers (or any other fixed set of values), but it only matches values that are literally in the set. To filter on floats, or on any value between and including two end points, or simply to state the intent more explicitly, use dplyr's between():

new_frame2 <- Mydata %>% filter(between(x, 3, 7))
new_frame2
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

To further clarify, note that %in% checks for the presence in a set of values:

3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE

All three return TRUE because 3:7 is shorthand for seq(3, 7), which produces the integers 3 through 7, and 5.0 compares equal to the integer 5:

3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7

As such, if you use %in% to check for a value that : does not produce, it returns FALSE:

4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE

By contrast, between() checks against the end points and all values in between:

between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE
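
To see the difference on a data frame rather than on single values, here is a minimal sketch in base R. The Mydata values are made up for illustration and are not the data from the question. dplyr documents between(x, left, right) as a shortcut for x >= left & x <= right, so the explicit comparison below stands in for it without needing the package:

```r
# Hypothetical data with non-integer values in x (not from the original question)
Mydata <- data.frame(x = c(2.5, 3, 4.5, 7, 7.2), y = c(10, 45, 60, 97, 99))

# %in% keeps only exact matches against the set 3, 4, 5, 6, 7
in_set <- Mydata[Mydata$x %in% 3:7, ]

# An inclusive range test keeps every value from 3 to 7, including 4.5
in_range <- Mydata[Mydata$x >= 3 & Mydata$x <= 7, ]

in_set$x    # 3 7
in_range$x  # 3.0 4.5 7.0
```

The row with x = 4.5 is silently dropped by %in%, which is the usual surprise when a column that looks like integers turns out to hold floats.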

Filter rows in a specific range values containing character and number in R

An approach using stringr's str_extract():

library(stringr)

val <- as.numeric(str_extract(df$Residue, "[[:digit:]]+"))

df[val > 300 & val < 500,]
Residue Energy Model
2 R-A-350 -1.89 DELTA
3 R-B-468 -0.25 DELTA
4 R-C-490 -2.67 DELTA

Data

df <- structure(list(Residue = c("R-A-40", "R-A-350", "R-B-468", "R-C-490", 
"R-A-610"), Energy = c(-3.45, -1.89, -0.25, -2.67, -1.98), Model = c("DELTA",
"DELTA", "DELTA", "DELTA", "DELTA")), class = "data.frame", row.names = c(NA,
-5L))
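
If you would rather not depend on stringr, the same idea can probably be sketched in base R with regexpr() and regmatches(). This is an alternative to the answer above, not part of it, and it assumes every Residue value contains at least one run of digits:

```r
# Same example data as above, written out as a plain data frame
df <- data.frame(
  Residue = c("R-A-40", "R-A-350", "R-B-468", "R-C-490", "R-A-610"),
  Energy  = c(-3.45, -1.89, -0.25, -2.67, -1.98),
  Model   = "DELTA",
  stringsAsFactors = FALSE
)

# regexpr() locates the first run of digits; regmatches() extracts it.
# Assumption: every Residue has a match, because regmatches() drops
# non-matching elements and would then misalign val with the rows of df.
val <- as.numeric(regmatches(df$Residue, regexpr("[[:digit:]]+", df$Residue)))

# Strictly between 300 and 500 (both end points excluded)
res <- df[val > 300 & val < 500, ]
res$Residue  # "R-A-350" "R-B-468" "R-C-490"
```

Use >= and <= instead if the end points should be included.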

Display number of rows containing a range of numbers (between 0-9) in R

filter() returns a data frame, and calling length() on a data frame gives the number of columns, not rows. Also, the ! in front of your regex test negates it, so you are selecting the rows that do not contain a number.

You can use sum() with grepl():

result <- sum(grepl('[0-9]', data$MESSAGE))
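
A small sketch of why this works, using made-up data (only the MESSAGE column name comes from the answer above): grepl() returns one TRUE/FALSE per row, and sum() counts the TRUE values.

```r
# Hypothetical data; the MESSAGE column name follows the answer above
data <- data.frame(
  ID = 1:4,
  MESSAGE = c("error 42", "all good", "retry in 5s", "done"),
  stringsAsFactors = FALSE
)

length(data)                    # 2: the number of columns, not rows

has_digit <- grepl("[0-9]", data$MESSAGE)
result <- sum(has_digit)        # TRUE counts as 1, so this counts matching rows
result                          # 2

nrow(data[has_digit, ])         # 2: the same count via nrow() on the subset
```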

R- filter rows depending on value range across several columns

First test whether each value is greater than or equal to 5 and less than or equal to 10, then keep the rows where 3 or more values meet the condition.

dat[ rowSums( dat >= 5 & dat <= 10 ) >= 3, ]
column1 column2 column3 column4 column5
1 7 4 10 9 2

Data

dat <- structure(list(column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L, 
2L), column4 = c(9L, 6L), column5 = c(2, 2)), class = "data.frame", row.names = c(NA,
-2L))
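
Breaking the one-liner into its intermediate steps may make it easier to follow. This sketch uses the same dat as above; the names in_range and hits are introduced here only for illustration:

```r
dat <- data.frame(
  column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L, 2L),
  column4 = c(9L, 6L), column5 = c(2, 2)
)

# Logical matrix with one cell per value: TRUE where 5 <= value <= 10
in_range <- dat >= 5 & dat <= 10

# TRUE counts as 1, so rowSums() gives the number of in-range values per row
hits <- rowSums(in_range)
hits                # 3 for row 1, 2 for row 2

# Keep rows with at least 3 in-range values
dat[hits >= 3, ]    # only row 1
```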

Efficient way to filter one data frame by ranges in another

Here's a function that you can use inside a dplyr pipeline to find dates within a given set of ranges using the between function (from dplyr). For each pair of Start and End dates, mapply runs between on the whole Day vector, which gives one logical column per range; the function then uses rowSums to return TRUE if Day falls within at least one of them. I'm not sure if it's the most efficient approach, but in the benchmark below it is nearly four times faster than the original rowwise() code.

test.overlap = function(vals) {
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>%
  filter(test.overlap(Day))
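
To make the shape of the intermediate result concrete, here is a tiny worked sketch in base R. The inputs are invented for illustration, and in_span is a hypothetical stand-in for dplyr's between() so the example runs without the package:

```r
# Hypothetical small inputs, mirroring the column names used above
main_data <- data.frame(Day = 1:10)
spans_to_filter <- data.frame(Start = c(2, 7), End = c(4, 8))

# Base R stand-in for dplyr::between(): inclusive on both ends
in_span <- function(x, lo, hi) x >= lo & x <= hi

# mapply() returns a logical matrix: one row per Day, one column per span
m <- mapply(function(a, b) in_span(main_data$Day, a, b),
            spans_to_filter$Start, spans_to_filter$End)
dim(m)  # 10 2

# A Day is kept if it falls inside at least one span
keep <- rowSums(m) > 0
main_data$Day[keep]  # 2 3 4 7 8
```

Because each comparison is vectorised over all of Day, the work is done in a handful of vector operations (one per span) rather than one R-level call per row, which is presumably where the speed-up over rowwise() comes from.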

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>%
  filter(Day %in% filt.vals)

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>%
    filter(Day %in% filt.vals)
)

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100   b
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100   a

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day=c(1:100000))

spans_to_filter =
  data.frame(Span_number = c(1:9),
             Start = c(2,7,1,15,12,23,90,9000,50000),
             End = c(5,10,4,18,15,26,100,9100,50100))

microbenchmark(
  OP = main_data %>%
    rowwise() %>%
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>%
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>%
      filter(Day %in% filt.vals)
  },
  times = 10
)

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10   a
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10   a

Filter variable based on NA 20% in a range—R

library(dplyr)

df2 %>%
  mutate(missing_perc = rowMeans(is.na(select(., mssi1_1:mssi1_4))) * 100)

Output is:

  uci      ID Class  age sex bhsMean tbMean pbMean acssMean mssi1_1 mssi1_2 mssi1_3 mssi1_4 missing_perc
1 10001h 1.00  1.00 14.0   0   0.470   2.56   2.00     2.29      NA      NA      NA      NA          100
2 10476h 5.00  1.00 17.0   0   0.300   3.89   3.67     1.86      NA      NA       0       0         50.0
3 10484h 6.00  1.00 14.0   0   0.160   2.67   4.00     1.14       0       0       0       0            0
4 10580h 13.0  1.00 14.0   0   0.150   2.33   4.50     2.00    1.00    1.00       0       0            0
5 14280h 20.0  1.00 15.0   0   0.350   4.89   2.17     1.14    1.00    1.00       0       0            0
6 2313n  28.0  1.00 14.0   0  0.0600   1.44   1.00       NA       0       0       0       0            0

Sample data:

df2 <- structure(list(uci = c("10001h", "10476h", "10484h", "10580h", 
"14280h", "2313n"), ID = c(1, 5, 6, 13, 20, 28), Class = c(1,
1, 1, 1, 1, 1), age = c(14, 17, 14, 14, 15, 14), sex = c(0, 0,
0, 0, 0, 0), bhsMean = c(0.47, 0.3, 0.16, 0.15, 0.35, 0.06),
tbMean = c(2.56, 3.89, 2.67, 2.33, 4.89, 1.44), pbMean = c(2,
3.67, 4, 4.5, 2.17, 1), acssMean = c(2.29, 1.86, 1.14, 2,
1.14, NA), mssi1_1 = c(NA, NA, 0, 1, 1, 0), mssi1_2 = c(NA,
NA, 0, 1, 1, 0), mssi1_3 = c(NA, 0, 0, 0, 0, 0), mssi1_4 = c(NA,
0, 0, 0, 0, 0)), .Names = c("uci", "ID", "Class", "age",
"sex", "bhsMean", "tbMean", "pbMean", "acssMean", "mssi1_1",
"mssi1_2", "mssi1_3", "mssi1_4"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
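
The answer above computes missing_perc but stops short of filtering on it. A possible next step, sketched in base R on a cut-down, hypothetical copy of the item columns, is to keep only the rows at or below the threshold. The 20% cut-off and its direction (keep rows with at most 20% missing) are assumptions taken from the question title:

```r
# Hypothetical miniature of df2 holding only the four item columns
items <- data.frame(
  mssi1_1 = c(NA, NA, 0, 1),
  mssi1_2 = c(NA, NA, 0, 1),
  mssi1_3 = c(NA, 0, 0, 0),
  mssi1_4 = c(NA, 0, 0, 0)
)

# Percentage of missing items per row
missing_perc <- rowMeans(is.na(items)) * 100
missing_perc  # 100 50 0 0

# Keep rows with at most 20% missing (assumed threshold and direction)
kept <- items[missing_perc <= 20, ]
nrow(kept)    # 2
```

In the dplyr pipeline above, the equivalent would be to append filter(missing_perc <= 20) after the mutate() call.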

