How Does Dplyr's Between Work

How does dplyr’s between work?

between is nothing special — any other function in R would have led to the same problem. Your confusion stems from the fact that dplyr has a lot of functions that allow you to work on data.frame column names as if they were normal variables; for instance:

filter(flights, month > 9)

However, between is not one of these functions. As mentioned, it’s simply a normal function. So if you want to use it, you need to provide arguments in the conventional way; for instance:

between(flights$month, 7, 9)

This will return a logical vector, and you can now use it to index your data.frame:

flights[between(flights$month, 7, 9), ]

Or, more dplyr-like:

flights %>% filter(between(month, 7, 9))

Note that here we now use non-standard evaluation. But the evaluation is performed by filter, not by between. between is called (by filter) using standard evaluation.
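To make the "normal function" point concrete, here is a minimal sketch that calls between() on a plain vector with no data frame involved. The vector name month and its values are made up for illustration; the by-hand comparison shows the inclusive test that between() performs.

```r
library(dplyr)

# A plain numeric vector (hypothetical values); no data frame is involved.
month <- c(1, 7, 8, 9, 12)

# between() evaluates its arguments normally and is inclusive at both ends.
in_range <- between(month, 7, 9)

# The same test written with ordinary comparison operators.
by_hand <- month >= 7 & month <= 9

print(in_range)
print(identical(in_range, by_hand))
```

Inside filter(), the only difference is that filter looks up month in the data frame before between is called.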

How do I filter a range of numbers in R?

You can use %in% or, as has been mentioned, dplyr's between():

library(dplyr)

new_frame <- Mydata %>% filter(x %in% 3:7)
new_frame
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

While %in% works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%, use dplyr's between():

new_frame2 <- Mydata %>% filter(between(x, 3, 7))
new_frame2
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97

To further clarify, note that %in% checks for the presence in a set of values:

3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE

The above return TRUE because 3:7 is shorthand for seq(3, 7) which produces:

3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7

So if you use %in% to check a value that is not in the sequence produced by :, it returns FALSE:

4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE

Whereas between checks against the end points and all values in between:

between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE
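The same difference shows up when filtering a data frame. The sketch below uses a made-up data frame, Mydata2, whose x column contains non-integer values; it is not the Mydata from the question.

```r
library(dplyr)

# Hypothetical data with both integer and non-integer values in x.
Mydata2 <- data.frame(
  x = c(2, 3, 4.5, 6.25, 7, 8),
  y = c(10, 20, 30, 40, 50, 60)
)

# %in% keeps only rows whose x is exactly one of 3, 4, 5, 6, 7.
by_in <- Mydata2 %>% filter(x %in% 3:7)

# between() keeps every row with 3 <= x <= 7.
by_between <- Mydata2 %>% filter(between(x, 3, 7))

print(by_in$x)       # 3 and 7 only
print(by_between$x)  # 3, 4.5, 6.25 and 7
```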

How to use select() inside between() inside filter() to subset data dplyr r

Combine the per-site conditions with & and join the alternatives with |:

library(dplyr)

data %>%
  filter(SiteID == "A" & between(Seconds, 2, 8) |
         SiteID == "B" & between(Seconds, 3, 6) |
         SiteID == "C" & between(Seconds, 8, 10) |
         SiteID == "D" & between(Seconds, 1, 6) |
         SiteID == "E" & between(Seconds, 2, 9))
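If the list of sites grows, one alternative is to keep the ranges in a lookup table and join it to the data. This is only a sketch: the sample rows, the ranges table, and the column names lo and hi are assumptions made for illustration. It uses plain comparison operators rather than between() so that it does not depend on whether the installed dplyr version accepts vector bounds.

```r
library(dplyr)

# Hypothetical sample of the data being filtered.
data <- tibble(
  SiteID  = c("A", "A", "B", "B", "C"),
  Seconds = c(1, 5, 4, 7, 9)
)

# One row per site with its allowed range (lo and hi are made-up names).
ranges <- tibble(
  SiteID = c("A", "B", "C", "D", "E"),
  lo     = c(2, 3, 8, 1, 2),
  hi     = c(8, 6, 10, 6, 9)
)

result <- data %>%
  inner_join(ranges, by = "SiteID") %>%   # attach each row's own bounds
  filter(Seconds >= lo & Seconds <= hi) %>%
  select(-lo, -hi)                        # drop the helper columns

print(result)
```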

conditional matching between variables in dplyr

Try this:

parties %>%
  group_by(name) %>%
  filter("K" %in% class,
         "R" %in% class,
         "L" %in% class) %>%
  summarise()

# A tibble: 2 x 1
name
<chr>
1 Party2
2 Party4

EDIT: If you want to check for more than 3 classes, you can put them in a vector and use all():

mask <- c("K", "R", "L")
parties %>%
  group_by(name) %>%
  filter(all(mask %in% class)) %>%
  summarise()
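To see what all(mask %in% class) computes per group, the following sketch returns the flag itself instead of filtering on it. The parties data here is invented for illustration and is not the data from the question.

```r
library(dplyr)

# Hypothetical data: only Party2 has all three classes.
parties <- tibble(
  name  = c("Party1", "Party1", "Party2", "Party2", "Party2", "Party3"),
  class = c("K", "R", "K", "R", "L", "L")
)

mask <- c("K", "R", "L")

# One row per party, with TRUE when every class in mask occurs in the group.
flags <- parties %>%
  group_by(name) %>%
  summarise(has_all = all(mask %in% class))

print(flags)
print(flags$name[flags$has_all])
```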

Filter between multiple date ranges

Drawing on the question "Efficient way to filter one data frame by ranges in another", I came up with the following solutions.

The first takes the data provided above and uses rowwise(). It is very slow on large datasets:

filtered3 <- df %>%
  rowwise() %>%
  filter(any(datetime >= start & datetime <= end))

As I mentioned, with more than 3 million rows in my data, this was very slow.

Another option, also from the answer linked above, is the data.table package, which provides an %inrange% operator. This one is much faster:

library(data.table)
range <- data.table(start = start, end = end)
filtered4 <- setDT(df)[datetime %inrange% range]
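For a self-contained picture of what %inrange% does, here is a small sketch with invented timestamps. The table is named ranges rather than range only to avoid masking base R's range() function; all values are assumptions for illustration.

```r
library(data.table)

# Hypothetical observations.
df <- data.table(
  datetime = as.POSIXct(
    c("2021-01-01 00:30:00", "2021-01-01 02:30:00",
      "2021-01-01 05:00:00", "2021-01-01 09:15:00"),
    tz = "UTC"
  )
)

# Hypothetical ranges: one row per interval to keep.
ranges <- data.table(
  start = as.POSIXct(c("2021-01-01 00:00:00", "2021-01-01 04:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2021-01-01 01:00:00", "2021-01-01 06:00:00"), tz = "UTC")
)

# Keep rows whose datetime falls inside any of the intervals (bounds included).
kept <- df[datetime %inrange% ranges]

print(kept)
```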

Filtering dates in dplyr

If Date is properly formatted as a date, your first try works:

p2p_dt_SKILL_A <- read.table(text = "Patch,Date,Prod_DL
P1,9/4/2015,3.43
P11,9/11/2015,3.49
P12,9/18/2015,3.45
P13,12/6/2015,3.57
P14,12/13/2015,3.43
P15,12/20/2015,3.47
", sep = ",", stringsAsFactors = FALSE, header = TRUE)

p2p_dt_SKILL_A$Date <- as.Date(p2p_dt_SKILL_A$Date, "%m/%d/%Y")

p2p_dt_SKILL_A %>%
  select(Patch, Date, Prod_DL) %>%
  filter(Date > "2015-09-04" & Date < "2015-09-18")
Patch Date Prod_DL
1 P11 2015-09-11 3.49



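This still works if the data is a tbl_df (tibble). Note that tbl_df() is deprecated in current dplyr, where as_tibble() is the replacement:

p2p_dt_SKILL_A <- tbl_df(p2p_dt_SKILL_A)

p2p_dt_SKILL_A %>%
  select(Patch, Date, Prod_DL) %>%
  filter(Date > "2015-09-04" & Date < "2015-09-18")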
Source: local data frame [1 x 3]

Patch Date Prod_DL
(chr) (date) (dbl)
1 P11 2015-09-11 3.49
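Since this page is about between(), it may help to compare it with the > and < filter above. The sketch below uses a cut-down, made-up copy of the data (named dates) and explicit as.Date() bounds. The point to notice is that between() includes both end points, so it is not a drop-in replacement for strict comparisons.

```r
library(dplyr)

# Hypothetical subset of the data, with Date already parsed.
dates <- data.frame(
  Patch = c("P1", "P11", "P12", "P13"),
  Date  = as.Date(c("2015-09-04", "2015-09-11", "2015-09-18", "2015-12-06"))
)

lower <- as.Date("2015-09-04")
upper <- as.Date("2015-09-18")

# Strict comparison excludes the end points.
strict <- dates %>% filter(Date > lower & Date < upper)

# between() includes them.
inclusive <- dates %>% filter(between(Date, lower, upper))

print(strict$Patch)
print(inclusive$Patch)
```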

combining loops and some dplyr functions

We can use map to loop over the keywords. For each one, filter the rows where word equals that keyword and frequency is greater than 0, group by TI, tally, and take the number of rows:

library(purrr)
library(dplyr)
map(keywords, ~ df %>%
  filter(word == .x, frequency > 0) %>%
  group_by(TI) %>%
  tally() %>%
  nrow())
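As a variation, map_int() returns a named integer vector instead of a list, which is often easier to read. The sketch below is one possible rewrite under assumed data: df and keywords are invented here, and distinct(TI) is swapped in for group_by() plus tally() because only the number of distinct TI values is needed.

```r
library(purrr)
library(dplyr)

# Hypothetical data in the same shape as the question's df.
df <- tibble(
  TI        = c("t1", "t1", "t2", "t3", "t3"),
  word      = c("cat", "dog", "cat", "cat", "dog"),
  frequency = c(2, 0, 1, 0, 3)
)
keywords <- c("cat", "dog")

# For each keyword, count the distinct TI values where it occurs at least once.
counts <- map_int(set_names(keywords), ~ df %>%
  filter(word == .x, frequency > 0) %>%
  distinct(TI) %>%
  nrow())

print(counts)
```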

Error in `dplyr::between()`: 'left' must be length 1

between() needs a single value for each bound, but within each group Tmin is the whole vector of values. To fix this, reduce the vector to one value. Because every value in the group is the same, many functions work, e.g. min or first:

TimeTempReprod %>%
  group_by(Date, Station) %>%
  mutate(y = between(Temperature, min(Tmin), min(Tmin) + 2))

This gives:

# A tibble: 96 × 8
# Groups: Date, Station [2]
Station Date Time Temperature Tmin Tmed Tmax y
<chr> <date> <time> <dbl> <dbl> <dbl> <dbl> <lgl>
1 F 2021-10-15 00:11:46 16.8 15.2 17.1 20.4 TRUE
2 F 2021-10-15 00:41:46 16.5 15.2 17.1 20.4 TRUE
3 F 2021-10-15 01:11:46 16.2 15.2 17.1 20.4 TRUE
4 F 2021-10-15 01:41:46 15.6 15.2 17.1 20.4 TRUE
5 F 2021-10-15 02:11:46 15.9 15.2 17.1 20.4 TRUE
6 F 2021-10-15 02:41:46 16.1 15.2 17.1 20.4 TRUE
7 F 2021-10-15 03:11:46 16.4 15.2 17.1 20.4 TRUE
8 F 2021-10-15 03:41:46 16.2 15.2 17.1 20.4 TRUE
9 F 2021-10-15 04:11:46 16 15.2 17.1 20.4 TRUE
10 F 2021-10-15 04:41:46 16 15.2 17.1 20.4 TRUE
# … with 86 more rows
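Another way around the length-1 restriction is to write the comparison with >= and <=, which are vectorized row by row. The sketch below uses a small invented tibble, temps, and compares that approach with the first() variant. As far as I know, newer dplyr releases (1.1.0 and later) also accept vector bounds in between(), but treat that as something to check against your installed version.

```r
library(dplyr)

# Hypothetical data: Tmin is constant within each station.
temps <- tibble(
  Station     = c("F", "F", "F", "G", "G"),
  Temperature = c(16.8, 15.2, 17.5, 12.0, 15.0),
  Tmin        = c(15.2, 15.2, 15.2, 12.0, 12.0)
)

flagged <- temps %>%
  group_by(Station) %>%
  mutate(
    # One bound per group, taken from the first value of Tmin.
    y_first = between(Temperature, first(Tmin), first(Tmin) + 2),
    # Row-by-row comparison; works with a full-length Tmin column.
    y_ops   = Temperature >= Tmin & Temperature <= Tmin + 2
  ) %>%
  ungroup()

print(flagged)
```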

