How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame
The method I use is one of these, or Hmisc::cut2(value, g=4):
temp$quartile <- with(temp, cut(value,
breaks=quantile(value, probs=seq(0,1, by=0.25), na.rm=TRUE),
include.lowest=TRUE))
An alternative might be:
temp$quartile <- with(temp, factor(
# na.rm belongs to quantile(); findInterval() has no such argument
findInterval( value, c(-Inf,
quantile(value, probs=c(0.25, .5, .75), na.rm=TRUE), Inf) ),
labels=c("Q1","Q2","Q3","Q4")
))
The first one has the side effect of labeling the quartiles with the interval boundaries, which I consider a "good thing". If that is not "good for you", or if the valid problems raised in the comments are a concern, you could go with version 2. You can also use labels= in cut, or you could add this line to your code:
# labels= renames the existing levels; levels= here would turn every value into NA
temp$quartile <- factor(temp$quartile, labels=c("1","2","3","4") )
Or, even quicker but slightly more obscure in how it works, convert the factor to its integer codes. The result is no longer a factor but a numeric vector:
temp$quartile <- as.numeric(temp$quartile)
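For readers who want to see what the findInterval approach computes, here is a minimal sketch of the same idea in Python (the other language used on this page), using only the standard library. The helper name quartile_labels is hypothetical, and the choice of method="inclusive" is an assumption made so that the cut points line up with R's default quantile() interpolation.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_labels(values):
    """Assign each value a label Q1..Q4 based on quartile cut points."""
    # method="inclusive" interpolates between the sample min and max,
    # which is assumed here to correspond to R's default quantile type.
    cuts = quantiles(values, n=4, method="inclusive")
    # bisect_right counts how many cut points are <= v (0..3), which is
    # what findInterval() reports once -Inf is prepended (1..4).
    return [f"Q{bisect_right(cuts, v) + 1}" for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8]
print(quartile_labels(values))
# ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3', 'Q4', 'Q4']
```

A value that sits exactly on a cut point goes into the upper group here, matching findInterval's default left-closed intervals.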
R: splitting dataset into quartiles/deciles. What is the right method?
Another way would be ntile() in dplyr.
library(tidyverse)
foo <- data.frame(a = 1:100,
b = runif(100, 50, 200),
stringsAsFactors = FALSE)
foo %>%
mutate(quantile = ntile(b, 10))
# a b quantile
#1 1 93.94754 2
#2 2 172.51323 8
#3 3 99.79261 3
#4 4 81.55288 2
#5 5 116.59942 5
#6 6 128.75947 6
How to set groups by the percentiles of whole sample?
For the first part, subtract 1, integer-divide by 10, and add 1 so that the groups start from 1:
import pandas as pd
df = pd.DataFrame({'a':range(1,101)})
df['b'] = 'group ' + (df.a.sub(1) // 10 + 1).astype(str)
print(df)
a b
0 1 group 1
1 2 group 1
2 3 group 1
3 4 group 1
4 5 group 1
.. ... ...
95 96 group 10
96 97 group 10
97 98 group 10
98 99 group 10
99 100 group 10
EDIT: For deciles, use qcut (with labels=False it returns integer codes from 0 to 9):
df['b'] = pd.qcut(df.a, 10, labels=False)
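The arithmetic behind the integer-division step can be checked in plain Python without pandas. This is only an illustrative sketch, and fixed_width_group is a hypothetical helper name. Note that it groups by value, so it lines up with deciles only because a runs evenly from 1 to 100.

```python
def fixed_width_group(a, width=10):
    """Map 1..width to group 1, width+1..2*width to group 2, and so on."""
    # Subtracting 1 first keeps the top of each block (10, 20, ...) in the
    # same group as the values below it; adding 1 makes groups start at 1.
    return (a - 1) // width + 1


for a in (1, 10, 11, 100):
    print(a, "-> group", fixed_width_group(a))
# 1 -> group 1
# 10 -> group 1
# 11 -> group 2
# 100 -> group 10
```

Without the subtraction, 10 // 10 + 1 would put the value 10 into group 2.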
findInterval by group with dplyr
You can do this in a group_by + mutate step:
library(dplyr)
df %>%
group_by(gr) %>%
mutate(breakpoints = findInterval(val,
c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
ungroup
# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows
findInterval is applied to each gr separately.
How can I create a function that computes the median and quartiles for each column of data, for each factor of data?
You can write the function like this:
library(dplyr)
apply_fun <- function(data) {
data %>%
group_by(Type) %>%
summarise(across(starts_with('x'), list(med = median,
first_quartile = ~quantile(., 0.25),
second_quartile = ~quantile(., 0.5),
third_quartile = ~quantile(., 0.75))))
}
result <- apply_fun(data1)
You can add/remove functions in the list as per requirement.
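The group-then-summarise pattern can also be sketched in standard-library Python. The function name summarise_by_group and the sample rows are made up for illustration, and method="inclusive" is an assumption chosen so the quartiles interpolate between the group's min and max.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarise_by_group(rows, group_key, value_key):
    """Return {group: {"med", "first_quartile", "third_quartile"}}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    summary = {}
    for name, values in groups.items():
        # quantiles() may raise StatisticsError for very small groups.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[name] = {
            "med": median(values),
            "first_quartile": q1,
            "third_quartile": q3,
        }
    return summary


# Hypothetical sample data with one grouping column and one value column.
rows = [
    {"Type": "A", "x1": 1}, {"Type": "A", "x1": 2}, {"Type": "A", "x1": 3},
    {"Type": "A", "x1": 4}, {"Type": "A", "x1": 5},
    {"Type": "B", "x1": 10}, {"Type": "B", "x1": 20}, {"Type": "B", "x1": 30},
]
result = summarise_by_group(rows, "Type", "x1")
print(result["A"])
print(result["B"])
```

Adding another statistic is a matter of adding one more key to the inner dictionary, much like adding a function to the list in the R version.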
How to compute quantiles on groups
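Using group_by you can just do:
library(dplyr)
library(lubridate)  # only needed for lubridate::date()
temp.all = temp.all %>%
# lubridate::date(date) might be necessary if you have datetimes
group_by(date) %>%
# breaks = 4 alone would give equal-width bins, so pass quantile breaks
mutate(quartile = cut(value,
breaks = quantile(value, probs = seq(0, 1, 0.25), na.rm = TRUE),
include.lowest = TRUE, labels = paste0("Q", 1:4))) %>%
ungroup()
dplyr also has a function ntile, which assigns rows to equal-sized buckets by rank. It usually gives similar groups, although results can differ when values are tied.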
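One thing to keep in mind when choosing between these: when cut() is given a single number of breaks, it divides the range of the values into equal-width intervals, whereas rank-based bucketing in the style of ntile aims for equal-sized groups. The following Python sketch shows how far the two can diverge on skewed data. Both helper names are hypothetical, and neither is a reimplementation of the R functions (for example, interval closure and tie handling are simplified).

```python
def equal_width_bins(values, n):
    """Bin by splitting the value range into n equal-width intervals."""
    # Assumes the values are not all identical (width would be zero).
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    # min(..., n) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width) + 1, n) for v in values]


def equal_count_bins(values, n):
    """Bin by rank so each bucket holds roughly len(values) / n items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n // len(values) + 1
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 100]
print(equal_width_bins(values, 4))  # [1, 1, 1, 1, 1, 1, 1, 4]
print(equal_count_bins(values, 4))  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With one outlier, the equal-width version puts almost everything into the first bin, which is rarely what is meant by "quartiles".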