How to Do Range Grouping on a Column Using Dplyr

We can use cut() to do the grouping: create the 'gr' column inside group_by(), use summarise() to count the elements in each group with n(), and sort the output by 'gr' with arrange().

library(dplyr)
DT %>%
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05))) %>%
  summarise(n = n()) %>%
  arrange(as.numeric(gr))
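
For a version that runs on its own, here is a minimal sketch of the same idea. The data frame and its values are made up for illustration, and count() is used as a shorthand for group_by() followed by summarise(n = n()).

```r
library(dplyr)

# Hypothetical data: B holds values between 0 and 1
DT <- data.frame(B = c(0.01, 0.04, 0.07, 0.12, 0.12, 0.72))

# Bin B into 0.05-wide intervals and count the rows per interval
res <- DT %>%
  count(gr = cut(B, breaks = seq(0, 1, by = 0.05)))

res
```

Intervals with no rows are dropped by default; passing .drop = FALSE to count() should keep them with n = 0.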

Since the initial object is a data.table, this can also be done with data.table methods (including @Frank's suggestion to use keyby):

library(data.table)
DT[,.N , keyby = .(gr=cut(B, breaks=seq(0, 1, by=0.05)))]

EDIT:

Based on the update in the OP's post, we could subtract a small number from the seq:

lvls <- levels(cut(DT$B, seq(0, 1, by = 0.05)))
DT %>%
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05) - .Machine$double.eps,
                    right = FALSE, labels = lvls)) %>%
  summarise(n = n()) %>%
  arrange(as.numeric(gr))
# gr n
#1 (0,0.05] 2
#2 (0.05,0.1] 2
#3 (0.1,0.15] 3
#4 (0.15,0.2] 2
#5 (0.7,0.75] 1
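
Since the code above relies on the right argument of cut(), a small base R sketch of what that argument does may help. The values and breaks are chosen only for illustration.

```r
# Values that sit exactly on a break
x <- c(0.05, 0.10)
breaks <- c(0, 0.05, 0.10, 0.15)

# Default right = TRUE: intervals are (a, b], so 0.05 lands in (0,0.05]
right_closed <- cut(x, breaks = breaks)

# right = FALSE: intervals are [a, b), so 0.05 lands in [0.05,0.1)
left_closed <- cut(x, breaks = breaks, right = FALSE)

as.character(right_closed)
as.character(left_closed)
```

Which side is closed only matters for values that fall exactly on a break, which is where floating-point results from seq() can behave unexpectedly.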

R and dplyr: group by value ranges

You can use cut() to create a grouping variable with which to summarise count.

library(dplyr)

df %>%
  group_by(grp = cut(value, c(-Inf, 2, 4, Inf))) %>%
  summarise(count = sum(count))

# A tibble: 3 x 2
grp count
<fct> <int>
1 (-Inf,2] 30
2 (2,4] 70
3 (4, Inf] 110
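
If the interval labels are hard to read, cut() also accepts a labels argument. The sketch below uses a made-up data frame, and the names "low", "mid" and "high" are only illustrative.

```r
library(dplyr)

# Hypothetical data: one value and one count per row
df <- data.frame(value = 1:6, count = c(10, 20, 30, 40, 50, 60))

# Same breaks as above, but with readable labels instead of "(2,4]" etc.
res <- df %>%
  group_by(grp = cut(value, breaks = c(-Inf, 2, 4, Inf),
                     labels = c("low", "mid", "high"))) %>%
  summarise(count = sum(count))

res
```

The number of labels must be one less than the number of breaks.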

Group value in range r

Here is a full solution, including your sample data:

df <- data.frame(name=c("r", "h", "s", "l", "e", "m"), value=c(35,20,16,40,23,40))
# get categories
df$groups <- cut(df$value, breaks=c(0,21,30,Inf))

# calculate group counts:
table(cut(df$value, breaks=c(0,21,30,Inf)))

If Inf is a little too extreme, you can use max(df$value) instead.
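
A short sketch of that variant, using the same sample data. It also shows one thing to watch for if min(df$value) is used as the lowest break (an extension not covered in the answer above): the minimum is excluded unless include.lowest = TRUE is set.

```r
df <- data.frame(name = c("r", "h", "s", "l", "e", "m"),
                 value = c(35, 20, 16, 40, 23, 40))

# max(df$value) as the top break: intervals are right-closed, so the
# maximum itself still falls inside the last interval
upper <- cut(df$value, breaks = c(0, 21, 30, max(df$value)))
counts <- table(upper)

# min(df$value) as the bottom break: the minimum would become NA
# unless include.lowest = TRUE is set
lower <- cut(df$value, breaks = c(min(df$value), 21, 30, max(df$value)),
             include.lowest = TRUE)

counts
table(lower)
```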

Apply a summarise condition to a range of columns when using dplyr group_by?

dplyr 1.0.0 and later provide the across() function, which does what you are asking for.

Basic usage

across() has two primary arguments:

  • The first argument, .cols, selects the columns you want to operate on.
    It uses tidy selection (like select()) so you can pick variables by
    position, name, and type.
  • The second argument, .fns, is a function or list of functions to apply to
    each column. This can also be a purrr style formula (or list of formulas)
    like ~ .x / 2. (This argument is optional, and you can omit it if you just want
    to get the underlying data; you'll see that technique used in
    vignette("rowwise").)

# across() requires dplyr >= 1.0.0, which is available on CRAN
# install.packages("dplyr")
library(dplyr, warn.conflicts = FALSE)
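
As a minimal sketch of the two arguments described above, here is across() with a plain function and with a purrr-style formula. The data frame and its column names are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: one grouping column and two numeric columns
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(4, 5, 6))

# .cols = c(x, y), .fns = mean: one mean per column and group
means <- df %>%
  group_by(g) %>%
  summarise(across(c(x, y), mean))

# .cols chosen by type, .fns given as a formula
halved <- df %>%
  mutate(across(where(is.numeric), ~ .x / 2))

means
halved
```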

Control how the names are created with the .names argument which takes a glue spec:

iris %>%
  group_by(Species) %>%
  summarise(
    across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
    across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
  )
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9

Using multiple functions

my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max = ~ max(., na.rm = TRUE)
)

iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5

Created on 2020-03-06 by the reprex package (v0.3.0)

Using dplyr to select a range based on a grouping variable in a separate data.frame

Here is an option with Map

res1 <- do.call(rbind, Map(function(x, y, z)
  data.frame(foo[x:y, ], ID = as.character(z), stringsAsFactors = FALSE),
  findInterval(bar$xMin, foo$x),
  findInterval(bar$xMax, foo$x), bar$ID))
all.equal(res1, res)
#[1] TRUE

Or using data.table

library(data.table)
setDT(foo)[bar, on = .(x >= xMin, x <= xMax)]

Or using tidyverse

library(dplyr)
library(purrr)
library(tidyr)
bar %>%
  transmute(ID, col1 = map2(findInterval(xMin, foo$x),
                            findInterval(xMax, foo$x),
                            ~ foo %>% slice(.x:.y))) %>%
  unnest(c(col1))
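
Another possibility, not part of the original answer, is a range join with join_by(), which assumes dplyr 1.1.0 or later. The foo and bar objects below are made-up stand-ins for the OP's data, so treat this as a sketch rather than a drop-in replacement.

```r
library(dplyr)

# Hypothetical stand-ins for the OP's data
foo <- data.frame(x = seq(1, 10, by = 1), val = letters[1:10])
bar <- data.frame(ID = c("A", "B"), xMin = c(2, 7), xMax = c(4, 8))

# Keep the rows of foo whose x lies within [xMin, xMax] of some row in bar
res <- inner_join(foo, bar, by = join_by(between(x, xMin, xMax)))

res
```

Unlike the findInterval() approach, this matches on the values of x directly, so foo does not need to be sorted by x.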

Create a column in R to compare values within a group and flag as greater than (1), less than (0) or equal (2)


df %>%
  group_by(Round) %>%
  mutate(Flag1 = replace(rank(Score) - 1, length(unique(Score)) == 1, 2))

Round Team Score Flag Flag1
<int> <chr> <int> <int> <dbl>
1 1 Team1 4 0 0
2 1 Team2 8 1 1
3 2 Team1 9 1 1
4 2 Team2 2 0 0
5 3 Team1 6 2 2
6 3 Team2 6 2 2
7 4 Team1 14 1 1
8 4 Team2 9 0 0
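
The rank() trick is compact but not obvious at first glance. A more explicit sketch with case_when() is shown below; the data frame is rebuilt from the output above, and the logic assumes exactly two teams per round.

```r
library(dplyr)

# Data rebuilt from the printed output; two teams per round is assumed
df <- data.frame(
  Round = rep(1:4, each = 2),
  Team  = rep(c("Team1", "Team2"), times = 4),
  Score = c(4, 8, 9, 2, 6, 6, 14, 9)
)

res <- df %>%
  group_by(Round) %>%
  mutate(Flag1 = case_when(
    Score == min(Score) & Score == max(Score) ~ 2,  # all scores in the round are equal
    Score == max(Score) ~ 1,                        # higher score in the round
    TRUE ~ 0                                        # lower score in the round
  )) %>%
  ungroup()

res
```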

R create new column based on data range at a certain time point

Instead of nested if_else() calls, we can use case_when(), which takes multiple conditions. Then group by 'Patient' and fill() the NA elements of 'Value_status' with the previous non-NA value.

library(dplyr)
library(tidyr)
tb %>%
  mutate(Value_status = case_when(Time == 1 & Value < 50 ~ "low",
                                  Time == 1 & Value >= 50 ~ "high")) %>%
  group_by(Patient) %>%
  fill(Value_status) %>%
  ungroup()

-output

# A tibble: 15 x 5
RowID Patient Time Value Value_status
<chr> <chr> <dbl> <dbl> <chr>
1 A1 001 1 NA <NA>
2 A2 001 2 10 <NA>
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 high
6 A6 035 1 10 low
7 A7 035 2 15 low
8 A8 035 3 NA low
9 A9 035 4 60 low
10 A10 035 5 56.7 low
11 A11 100 1 30 low
12 A12 100 2 51 low
13 A13 105 1 3 low
14 A14 105 2 13 low
15 A15 105 3 77 low
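
To try the same steps without the OP's data, here is a small self-contained sketch. The tb object below is a made-up, shortened table with the same column names.

```r
library(dplyr)
library(tidyr)

# Hypothetical, shortened version of the data
tb <- data.frame(
  Patient = c("001", "001", "001", "002", "002"),
  Time    = c(1, 2, 3, 1, 2),
  Value   = c(100, 30, 20, 10, 60)
)

res <- tb %>%
  # Classify only the Time == 1 rows; every other row gets NA
  mutate(Value_status = case_when(Time == 1 & Value < 50 ~ "low",
                                  Time == 1 & Value >= 50 ~ "high")) %>%
  group_by(Patient) %>%
  # Carry the Time == 1 status down to the later rows of each patient
  fill(Value_status) %>%
  ungroup()

res
```

This relies on the Time == 1 row coming first within each patient, because fill() fills downward by default.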

How to find the range of dates for each group in a dataframe

I would suggest an approach using the first() and last() functions from the dplyr package:

library(dplyr)
#Data
data <- data.frame(group = rep(letters[1:3], c(4,5,4)),
Date = as.Date(c("2010-08-09", "2010-09-11", "2010-09-12", "2010-09-18",
"2014-03-15","2014-03-16","2014-03-20","2014-03-21","2014-03-25",
"2016-05-02","2016-08-02","2016-08-03","2016-09-21")))
#Code
data %>% group_by(group) %>% mutate(FirsDate=first(Date),LastDate=last(Date))

Output:

# A tibble: 13 x 4
# Groups: group [3]
group Date FirsDate LastDate
<fct> <date> <date> <date>
1 a 2010-08-09 2010-08-09 2010-09-18
2 a 2010-09-11 2010-08-09 2010-09-18
3 a 2010-09-12 2010-08-09 2010-09-18
4 a 2010-09-18 2010-08-09 2010-09-18
5 b 2014-03-15 2014-03-15 2014-03-25
6 b 2014-03-16 2014-03-15 2014-03-25
7 b 2014-03-20 2014-03-15 2014-03-25
8 b 2014-03-21 2014-03-15 2014-03-25
9 b 2014-03-25 2014-03-15 2014-03-25
10 c 2016-05-02 2016-05-02 2016-09-21
11 c 2016-08-02 2016-05-02 2016-09-21
12 c 2016-08-03 2016-05-02 2016-09-21
13 c 2016-09-21 2016-05-02 2016-09-21

If you just want the variables by each group you can use summarise():

#Code2
data %>% group_by(group) %>% summarise(FirsDate=first(Date),LastDate=last(Date))

Output:

# A tibble: 3 x 3
group FirsDate LastDate
<fct> <date> <date>
1 a 2010-08-09 2010-09-18
2 b 2014-03-15 2014-03-25
3 c 2016-05-02 2016-09-21

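Update: min() and max() return the earliest and latest date regardless of row order, whereas first() and last() depend on the rows being sorted by date.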

#Code
data2 %>% group_by(group) %>% summarise(FirsDate=min(Date),LastDate=max(Date))

Output:

# A tibble: 3 x 3
group FirsDate LastDate
<fct> <date> <date>
1 a 2010-08-09 2010-09-18
2 b 2014-03-15 2014-03-25
3 c 2016-05-02 2016-09-21
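
The difference between the two approaches only shows up when the dates are not sorted. A minimal sketch with made-up, deliberately unsorted data:

```r
library(dplyr)

# Hypothetical data with dates out of order within each group
unsorted <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  Date  = as.Date(c("2010-09-12", "2010-08-09", "2010-09-18",
                    "2014-03-25", "2014-03-15"))
)

res <- unsorted %>%
  group_by(group) %>%
  summarise(
    FirstRow  = first(Date),  # depends on row order
    FirstDate = min(Date),    # earliest date regardless of order
    LastDate  = max(Date)     # latest date regardless of order
  )

res
```

Sorting with arrange(group, Date) before grouping would make first() and last() agree with min() and max().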

R output BOTH maximum and minimum value by group in dataframe

You can use range() to get the min and max values; used inside summarise(), it returns two rows for each Name.

library(dplyr)

df %>%
  group_by(Name) %>%
  summarise(Value = range(Value), .groups = "drop")

# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
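
In dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). The sketch below assumes that version and uses made-up data; it also shows a wide alternative with one row per Name.

```r
library(dplyr)

# Hypothetical data
df <- data.frame(Name  = c("A", "A", "A", "B", "B"),
                 Value = c(27L, 57L, 40L, 20L, 89L))

# Long form: two rows per Name (min first, then max)
long <- df %>%
  group_by(Name) %>%
  reframe(Value = range(Value))

# Wide form: one row per Name with separate min and max columns
wide <- df %>%
  group_by(Name) %>%
  summarise(Min = min(Value), Max = max(Value), .groups = "drop")

long
wide
```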

If you have a large dataset, data.table might be faster.

library(data.table)
setDT(df)[, .(Value = range(Value)), Name]

