How to do range grouping on a column using dplyr?
We can use cut to do the grouping. We create the 'gr' column within the group_by, use summarise to count the number of elements in each group (n()), and order the output (arrange) based on 'gr'.
library(dplyr)
DT %>%
group_by(gr=cut(B, breaks= seq(0, 1, by = 0.05)) ) %>%
summarise(n= n()) %>%
arrange(as.numeric(gr))
As the initial object is a data.table, this can be done using data.table methods (including @Frank's suggestion to use keyby):
library(data.table)
DT[,.N , keyby = .(gr=cut(B, breaks=seq(0, 1, by=0.05)))]
EDIT:
Based on the update in the OP's post, we could subtract a small number from the seq
lvls <- levels(cut(DT$B, seq(0, 1, by =0.05)))
DT %>%
group_by(gr=cut(B, breaks= seq(0, 1, by = 0.05) -
.Machine$double.eps, right=FALSE, labels=lvls)) %>%
summarise(n=n()) %>%
arrange(as.numeric(gr))
# gr n
#1 (0,0.05] 2
#2 (0.05,0.1] 2
#3 (0.1,0.15] 3
#4 (0.15,0.2] 2
#5 (0.7,0.75] 1
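To see what right = FALSE changes on its own, here is a minimal sketch on made-up values (the vector x and the breaks are illustrative, not the OP's data). By default cut() builds intervals that are closed on the right; with right = FALSE they are closed on the left, so a value sitting exactly on a break moves into the next interval.

```r
# Hypothetical values; 0.5 sits exactly on a break
x <- c(0.25, 0.5, 0.75)
brks <- c(0, 0.5, 1)

right_closed <- cut(x, breaks = brks)                 # intervals of the form (a, b]
left_closed  <- cut(x, breaks = brks, right = FALSE)  # intervals of the form [a, b)

as.character(right_closed)
# "(0,0.5]" "(0,0.5]" "(0.5,1]"
as.character(left_closed)
# "[0,0.5)" "[0.5,1)" "[0.5,1)"
```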
R and dplyr: group by value ranges
You can use cut() to create a grouping variable with which to summarise count.
library(dplyr)
df %>%
group_by(grp = cut(value, c(-Inf, 2, 4, Inf))) %>%
summarise(count = sum(count))
# A tibble: 3 x 2
grp count
<fct> <int>
1 (-Inf,2] 30
2 (2,4] 70
3 (4, Inf] 110
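The answer does not show its input, so here is a self-contained sketch with a hypothetical df (the value and count columns below are assumptions chosen so that the totals are easy to check by hand).

```r
library(dplyr)

# Hypothetical input; the original df is not shown in the answer
df <- data.frame(value = 1:6,
                 count = c(10L, 20L, 30L, 40L, 50L, 60L))

res <- df %>%
  group_by(grp = cut(value, c(-Inf, 2, 4, Inf))) %>%  # bins: up to 2, 2 to 4, above 4
  summarise(count = sum(count))                        # total count per bin

res
```

With this input, values 1 and 2 fall in the first bin, 3 and 4 in the second, and 5 and 6 in the third, giving totals of 30, 70 and 110.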
Group value in range r
Here is a full solution, including your sample data:
df <- data.frame(name=c("r", "h", "s", "l", "e", "m"), value=c(35,20,16,40,23,40))
# get categories
df$groups <- cut(df$value, breaks=c(0,21,30,Inf))
# calculate group counts:
table(cut(df$value, breaks=c(0,21,30,Inf)))
If Inf is a little too extreme, you can use max(df$value) instead.
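As a sketch of that variant on the same sample data: because cut() closes intervals on the right by default, the largest value still lands in the last bin when max(df$value) is the upper break. (If you also replaced the lower break with min(df$value), you would need include.lowest = TRUE to keep the smallest value.)

```r
df <- data.frame(name = c("r", "h", "s", "l", "e", "m"),
                 value = c(35, 20, 16, 40, 23, 40))

# Use the observed maximum as the upper break instead of Inf
counts <- table(cut(df$value, breaks = c(0, 21, 30, max(df$value))))
counts
# (0,21] holds 20 and 16; (21,30] holds 23; (30,40] holds 35, 40 and 40
```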
Apply a summarise condition to a range of columns when using dplyr group_by?
Version 1.0.0 of dplyr (upcoming at the time of writing) adds an across() function that does what you wish for.
Basic usage
across() has two primary arguments:
- The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
- The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
### Install development version on GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names argument, which takes a glue spec:
iris %>%
group_by(Species) %>%
summarise(
across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{.col}")
)
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9
Using multiple functions
my_func <- list(
mean = ~ mean(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE)
)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), my_func, .names = "{.fn}.{.col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
Using dplyr to select a range based on a grouping variable in a separate data.frame
Here is an option with Map
res1 <- do.call(rbind, Map(function(x, y, z)
data.frame(foo[x:y,], ID = as.character(z), stringsAsFactors = FALSE),
findInterval(bar$xMin, foo$x),
findInterval(bar$xMax, foo$x), bar$ID))
all.equal(res1, res)
#[1] TRUE
Or using data.table
library(data.table)
setDT(foo)[bar, on = .(x >= xMin, x <= xMax)]
Or using tidyverse
library(dplyr)
library(purrr)
library(tidyr)
bar %>%
transmute(ID, col1 = map2(findInterval(xMin, foo$x),
findInterval(xMax, foo$x), ~
foo %>% slice(.x:.y))) %>%
unnest(c(col1))
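The Map and tidyverse options both lean on findInterval(), so a small sketch of what it returns may help. The foo and bar below are hypothetical stand-ins for the question's data, which is not shown here. For each value, findInterval() gives the index of the last element of a sorted vector that is less than or equal to it, which is used as a row position in foo.

```r
# Hypothetical stand-ins for the question's foo and bar
foo <- data.frame(x = 1:10, y = letters[1:10])
bar <- data.frame(ID = c("a", "b"), xMin = c(2, 7), xMax = c(4, 9))

# Row positions in foo (foo$x must be sorted in increasing order)
start <- findInterval(bar$xMin, foo$x)  # 2 7
end   <- findInterval(bar$xMax, foo$x)  # 4 9

# Slice foo once per row of bar and tag each slice with its ID
res <- do.call(rbind, Map(function(s, e, id) data.frame(foo[s:e, ], ID = id),
                          start, end, bar$ID))
res
```

One caveat under these assumptions: if an xMin value falls strictly between two foo$x values, findInterval() points at the row just below it, so the slice would start one row early.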
Create a column in R to compare values within a group and flag as greater than (1), less than (0) or equal (2)
df %>%
group_by(Round) %>%
mutate( Flag1 = replace(rank(Score) - 1, length(unique(Score)) == 1, 2))
Round Team Score Flag Flag1
<int> <chr> <int> <int> <dbl>
1 1 Team1 4 0 0
2 1 Team2 8 1 1
3 2 Team1 9 1 1
4 2 Team2 2 0 0
5 3 Team1 6 2 2
6 3 Team2 6 2 2
7 4 Team1 14 1 1
8 4 Team2 9 0 0
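The one-liner above is compact, so here is a sketch of the per-group logic pulled out into a helper (flag_scores is a hypothetical name, and the sketch assumes exactly two teams per round, as in the data shown). rank() gives 1 and 2 for the lower and higher score, so subtracting 1 yields 0 and 1; when both scores are equal, the condition is TRUE and replace() overwrites every element with 2.

```r
# Hypothetical helper wrapping the expression used inside mutate()
flag_scores <- function(score) {
  all_equal <- length(unique(score)) == 1   # TRUE when every score in the group ties
  replace(rank(score) - 1, all_equal, 2)    # a scalar TRUE index replaces all elements
}

flag_scores(c(4, 8))  # 0 1: lower score flagged 0, higher flagged 1
flag_scores(c(6, 6))  # 2 2: tie flagged 2
```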
R create new column based on data range at a certain time point
Instead of nested if_else calls, we could use case_when, which handles multiple conditions, then group_by 'Patient' and fill the NA elements of 'Value_status' with the previous non-NA values
library(dplyr)
library(tidyr)
tb %>%
mutate(Value_status = case_when(Time == 1 & Value < 50 ~ "low",
Time == 1 & Value >= 50 ~ "high"
)) %>%
group_by(Patient) %>%
fill(Value_status) %>%
ungroup
-output
# A tibble: 15 x 5
RowID Patient Time Value Value_status
<chr> <chr> <dbl> <dbl> <chr>
1 A1 001 1 NA <NA>
2 A2 001 2 10 <NA>
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 high
6 A6 035 1 10 low
7 A7 035 2 15 low
8 A8 035 3 NA low
9 A9 035 4 60 low
10 A10 035 5 56.7 low
11 A11 100 1 30 low
12 A12 100 2 51 low
13 A13 105 1 3 low
14 A14 105 2 13 low
15 A15 105 3 77 low
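For a version you can run without the original tb, here is a minimal sketch on a hypothetical subset (two patients, made-up to mirror rows 4 to 8 above). Rows where Time is not 1 match no case_when condition and get NA, and fill() then carries the Time == 1 status down within each patient.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the data; not the OP's full tb
tb <- tibble(Patient = c("002", "002", "035", "035", "035"),
             Time    = c(1, 2, 1, 2, 3),
             Value   = c(100, 30, 10, 15, NA))

res <- tb %>%
  mutate(Value_status = case_when(Time == 1 & Value < 50  ~ "low",
                                  Time == 1 & Value >= 50 ~ "high")) %>%
  group_by(Patient) %>%
  fill(Value_status) %>%   # default direction is "down"
  ungroup()

res$Value_status
# "high" "high" "low" "low" "low"
```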
How to find the range of dates for each group in a dataframe
I would suggest an approach using the first() and last() functions from the dplyr package:
library(dplyr)
#Data
data <- data.frame(group = rep(letters[1:3], c(4,5,4)),
Date = as.Date(c("2010-08-09", "2010-09-11", "2010-09-12", "2010-09-18",
"2014-03-15","2014-03-16","2014-03-20","2014-03-21","2014-03-25",
"2016-05-02","2016-08-02","2016-08-03","2016-09-21")))
#Code
data %>% group_by(group) %>% mutate(FirsDate=first(Date),LastDate=last(Date))
Output:
# A tibble: 13 x 4
# Groups: group [3]
group Date FirsDate LastDate
<fct> <date> <date> <date>
1 a 2010-08-09 2010-08-09 2010-09-18
2 a 2010-09-11 2010-08-09 2010-09-18
3 a 2010-09-12 2010-08-09 2010-09-18
4 a 2010-09-18 2010-08-09 2010-09-18
5 b 2014-03-15 2014-03-15 2014-03-25
6 b 2014-03-16 2014-03-15 2014-03-25
7 b 2014-03-20 2014-03-15 2014-03-25
8 b 2014-03-21 2014-03-15 2014-03-25
9 b 2014-03-25 2014-03-15 2014-03-25
10 c 2016-05-02 2016-05-02 2016-09-21
11 c 2016-08-02 2016-05-02 2016-09-21
12 c 2016-08-03 2016-05-02 2016-09-21
13 c 2016-09-21 2016-05-02 2016-09-21
If you just want one row per group, you can use summarise():
#Code2
data %>% group_by(group) %>% summarise(FirsDate=first(Date),LastDate=last(Date))
Output:
# A tibble: 3 x 3
group FirsDate LastDate
<fct> <date> <date>
1 a 2010-08-09 2010-09-18
2 b 2014-03-15 2014-03-25
3 c 2016-05-02 2016-09-21
Update:
#Code
data2 %>% group_by(group) %>% summarise(FirsDate=min(Date),LastDate=max(Date))
Output:
# A tibble: 3 x 3
group FirsDate LastDate
<fct> <date> <date>
1 a 2010-08-09 2010-09-18
2 b 2014-03-15 2014-03-25
3 c 2016-05-02 2016-09-21
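One practical difference between the two approaches: first() and last() are positional, while min() and max() look at the values. The sketch below uses made-up, deliberately unsorted dates (the unsorted data frame and its column names are assumptions, not data2) to show where the results diverge.

```r
library(dplyr)

# Hypothetical group whose dates are not in chronological order
unsorted <- data.frame(group = c("a", "a", "a"),
                       Date = as.Date(c("2010-09-12", "2010-08-09", "2010-09-18")))

res <- unsorted %>%
  group_by(group) %>%
  summarise(first_row = first(Date),  # first row of the group: 2010-09-12
            last_row  = last(Date),   # last row of the group:  2010-09-18
            min_date  = min(Date),    # earliest date:          2010-08-09
            max_date  = max(Date))    # latest date:            2010-09-18

res
```

If the rows are already sorted by date within each group, both approaches give the same range.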
R output BOTH maximum and minimum value by group in dataframe
You can use range to get the min and max values, and use it in summarise to get them as separate rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
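A note for newer dplyr versions: from dplyr 1.1.0, returning more than one row per group from summarise() is deprecated in favour of reframe(). The sketch below assumes dplyr >= 1.1.0 and uses a hypothetical df (the values are made up); it also shows a one-row-per-group alternative with separate min and max columns, which works in any version.

```r
library(dplyr)

# Hypothetical data; the OP's df is not shown
df <- data.frame(Name = c("A", "A", "A", "B", "B"),
                 Value = c(57L, 27L, 40L, 89L, 20L))

# Two rows per Name (min first, then max); requires dplyr >= 1.1.0
long <- df %>%
  group_by(Name) %>%
  reframe(Value = range(Value))

# One row per Name, with min and max in separate columns
wide <- df %>%
  group_by(Name) %>%
  summarise(min_value = min(Value), max_value = max(Value))

long
wide
```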