How to Quickly Form Groups (Quartiles, Deciles, etc.) by Ordering Column(s) in a Data Frame

I use one of the following methods, or Hmisc::cut2(value, g=4):

temp$quartile <- with(temp, cut(value,
    breaks = quantile(value, probs = seq(0, 1, by = 0.25), na.rm = TRUE),
    include.lowest = TRUE))

An alternative might be:

temp$quartile <- with(temp, factor(
    findInterval(value, c(-Inf,
        quantile(value, probs = c(0.25, 0.5, 0.75), na.rm = TRUE), Inf)),
    labels = c("Q1", "Q2", "Q3", "Q4")))

The first version has the side effect of labeling the quartiles with their value ranges, which I consider a "good thing". If that does not suit you, or the valid problems raised in the comments are a concern, use the second version. Alternatively, pass labels= to cut, or relabel the factor afterwards:

temp$quartile <- factor(temp$quartile, labels = c("1", "2", "3", "4"))

Quicker still, though less obvious, is to convert the factor to its integer codes; the result is a numeric vector rather than a factor:

temp$quartile <- as.numeric(temp$quartile)

R: splitting dataset into quartiles/deciles. What is the right method?

Another way would be ntile() in dplyr.

library(tidyverse)

foo <- data.frame(a = 1:100,
                  b = runif(100, 50, 200),
                  stringsAsFactors = FALSE)

foo %>%
    mutate(quantile = ntile(b, 10))

#   a         b quantile
# 1 1  93.94754        2
# 2 2 172.51323        8
# 3 3  99.79261        3
# 4 4  81.55288        2
# 5 5 116.59942        5
# 6 6 128.75947        6

How to set groups by the percentiles of whole sample?

For the first part, subtract 1, integer-divide by 10, and add 1 so that the group numbers start at 1:

import pandas as pd

df = pd.DataFrame({'a':range(1,101)})

df['b'] = 'group ' + (df.a.sub(1) // 10 + 1).astype(str)
print(df)
      a         b
0     1   group 1
1     2   group 1
2     3   group 1
3     4   group 1
4     5   group 1
..  ...       ...
95   96  group 10
96   97  group 10
97   98  group 10
98   99  group 10
99  100  group 10
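
The arithmetic can be checked without pandas. The following is a minimal sketch in plain Python; the helper name group_number and its size parameter are made up for illustration, and it assumes positive integers starting at 1, as in column a above.

```python
def group_number(a, size=10):
    """Map a 1-based integer to a 1-based group of `size` consecutive values."""
    # Subtracting 1 makes the value 0-based, so 1..10 all floor-divide to 0;
    # adding 1 afterwards shifts the group numbers to start at 1.
    return (a - 1) // size + 1


labels = [f"group {group_number(a)}" for a in range(1, 101)]
print(labels[0], labels[9], labels[10], labels[99])
# group 1 group 1 group 2 group 10
```

Without the subtraction, a // 10 + 1 would put 10 into group 2 and leave only nine values in the first group.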

EDIT: For deciles, use qcut. With labels=False it returns integer bin codes (0 to 9 here) rather than labels:

df['b'] = pd.qcut(df.a, 10, labels=False)
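
To see roughly what quantile-based binning does, here is a sketch in plain Python using only the standard library. It is not pandas's implementation: the helper quantile_codes is hypothetical, and it assumes linearly interpolated cut points and right-closed bins, which is enough to illustrate the 0-based codes on evenly spaced data like column a.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_codes(values, q):
    """Return a 0-based quantile bin code for each value (illustrative only)."""
    # q - 1 interior cut points; method="inclusive" treats the data as the
    # full population and interpolates linearly between neighbouring values.
    cuts = quantiles(values, n=q, method="inclusive")
    # bisect_left counts the cut points strictly below v, so a value equal
    # to a cut point stays in the lower (right-closed) bin.
    return [bisect_left(cuts, v) for v in values]


a = list(range(1, 101))
codes = quantile_codes(a, 10)
print(codes[0], codes[9], codes[10], codes[99])
# 0 0 1 9
```

For this data, adding 1 to each code gives the same 1 to 10 numbering as the integer-division approach.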

findInterval by group with dplyr

You can do this in a group_by + mutate step:

library(dplyr)

df %>%
    group_by(gr) %>%
    mutate(breakpoints = findInterval(val,
        c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
    ungroup()

# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows

findInterval is applied for each gr separately.

How can I create a function that computes the median and quartiles for each column of data, for each factor of data?

You can write the function like this:

library(dplyr)

apply_fun <- function(data) {
    data %>%
        group_by(Type) %>%
        summarise(across(starts_with('x'),
                         list(med = median,
                              first_quartile = ~quantile(., 0.25),
                              second_quartile = ~quantile(., 0.5),
                              third_quartile = ~quantile(., 0.75))))
}
result <- apply_fun(data1)

You can add/remove functions in the list as per requirement.

How to compute quantiles on groups

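Using group_by you can just do:

library(dplyr)
library(lubridate)

temp.all = temp.all %>%
    # lubridate::date(date) might be necessary if you have datetimes
    group_by(date) %>%
    mutate(quartile = cut(value,
                          breaks = quantile(value, probs = seq(0, 1, by = 0.25), na.rm = TRUE),
                          include.lowest = TRUE,
                          labels = paste0("Q", 1:4)))

Note that cut(value, breaks = 4) would split the range of value into four equal-width intervals, which are not quartiles, so the breaks are taken from quantile here. dplyr also has a function ntile; ntile(value, 4) assigns buckets of (nearly) equal size by rank, which usually agrees with the quantile-based cut but can differ when there are ties.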
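
Equal-width bins (what cut produces when breaks is a single number) and equal-count buckets (what ntile aims for) can differ sharply on skewed data. The sketch below illustrates this in plain Python; both helper functions are hypothetical simplifications and do not reproduce R's exact boundary, tie, or remainder handling.

```python
def equal_width_bins(values, k):
    """Assign 1-based bins by splitting [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes the values are not all identical
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), k - 1) + 1 for v in values]


def equal_count_buckets(values, k):
    """Assign 1-based buckets by rank so each holds about len(values) / k items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * k // len(values) + 1
    return buckets


skewed = [1, 2, 3, 4, 5, 6, 7, 100]
print(equal_width_bins(skewed, 4))     # [1, 1, 1, 1, 1, 1, 1, 4]
print(equal_count_buckets(skewed, 4))  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With a single outlier, the equal-width version puts seven of the eight values into bin 1, while the rank-based version places two values in every bucket.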


