Split a Vector by Percentile

split a vector by percentile

Problem statement

Break a sorted vector x every 10% into 10 chunks.

Note there are two interpretation for this:

Cutting by vector index:

split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))

Cutting by vector values (say, quantiles):

split(x, cut(x, quantile(x, prob = 0:10 / 10, names = FALSE), include = TRUE))

In the following, I will make demonstration using data:

set.seed(0); x <- sort(round(rnorm(23),1))

Particularly, our example data are Normally distributed rather than uniformly distributed, so cutting by index and cutting by value are substantially different.

Result

cutting by index

#$`0`
#[1] -1.5 -1.2 -1.1
#
#$`1`
#[1] -0.9 -0.9
#
#$`2`
#[1] -0.8 -0.4
#
#$`3`
#[1] -0.3 -0.3 -0.3
#
#$`4`
#[1] -0.3 -0.2
#
#$`5`
#[1] 0.0 0.1
#
#$`6`
#[1] 0.3 0.4 0.4
#
#$`7`
#[1] 0.4 0.8
#
#$`8`
#[1] 1.3 1.3
#
#$`9`
#[1] 1.3 2.4

cutting by quantile

#$`[-1.5,-1.06]`
#[1] -1.5 -1.2 -1.1
#
#$`(-1.06,-0.86]`
#[1] -0.9 -0.9
#
#$`(-0.86,-0.34]`
#[1] -0.8 -0.4
#
#$`(-0.34,-0.3]`
#[1] -0.3 -0.3 -0.3 -0.3
#
#$`(-0.3,-0.2]`
#[1] -0.2
#
#$`(-0.2,0.14]`
#[1] 0.0 0.1
#
#$`(0.14,0.4]`
#[1] 0.3 0.4 0.4 0.4
#
#$`(0.4,0.64]`
#numeric(0)
#
#$`(0.64,1.3]`
#[1] 0.8 1.3 1.3 1.3
#
#$`(1.3,2.4]`
#[1] 2.4

Calculating percentile of dataset column

If you order a vector x, and find the values that is half way through the vector, you just found a median, or 50th percentile. Same logic applies for any percentage. Here are two examples.

x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartile
quantile(x, probs = seq(0, 1, by= 0.1)) # decile

R: splitting dataset into quartiles/deciles. What is the right method?

Another way would be ntile() in dplyr.

library(tidyverse)

foo <- data.frame(a = 1:100,
                  b = runif(100, 50, 200),
                  stringsAsFactors = FALSE)

foo %>%
    mutate(quantile = ntile(b, 10))

#  a         b quantile
#1 1  93.94754        2
#2 2 172.51323        8
#3 3  99.79261        3
#4 4  81.55288        2
#5 5 116.59942        5
#6 6 128.75947        6

How to automate to split a vector in single element named scalars

Using %=% from collapse

library(collapse)
paste0('n', seq_along(x)) %=% x

-output

> n1
[1] 19
> n2
[1] 8
> n3
[1] 9
> n4
[1] 18

Groupby given percentiles of the values of the chosen DataFrame column

I don't have a computer to test it right now, but I think you can do it by: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). Will update after 150mins.

Some explanations:

In [42]:
#use np.percentile to get the bin edges of any percentile you want 
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
#Need to use include_lowest=True
print df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean()
                       col0     col1      col2
col0                                          
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
#Or the smallest values will be skiped
print df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean()
                       col0     col1      col2
col0                                          
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685

How to split a character vector based on length of a list

You could try the following:

split(x=a, f=rep(seq_along(Length), Length))

f has to be of the same length as x (if it is of length one or a divider of x it would be recycled).

Split a Vector by Percentile