Split a vector by percentile
Problem statement
Break a sorted vector x every 10% into 10 chunks.
Note that there are two interpretations of this:
Cutting by vector index:
split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))
Cutting by vector values (say, quantiles):
split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE), include.lowest = TRUE))
In the following, I will demonstrate both using this data:
set.seed(0); x <- sort(round(rnorm(23),1))
In particular, our example data are normally distributed rather than uniformly distributed, so cutting by index and cutting by value give substantially different results.
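To make the difference concrete, here is a minimal sketch that wraps the two one-liners above in helper functions and compares the chunk sizes. The helper names split_by_index and split_by_value are hypothetical, introduced only for illustration.

```r
# Hypothetical helpers wrapping the two one-liners above.
split_by_index <- function(x, n = 10) {
  # Position-based: the element at 0-based index i goes to chunk floor(n * i / length(x))
  split(x, floor(n * (seq_along(x) - 1) / length(x)))
}

split_by_value <- function(x, n = 10) {
  # Value-based: bin edges are the quantiles of x
  breaks <- quantile(x, probs = (0:n) / n, names = FALSE)
  # Note: cut() stops with an error if heavy ties make the breaks non-unique
  split(x, cut(x, breaks, include.lowest = TRUE))
}

set.seed(0); x <- sort(round(rnorm(23), 1))
by_index <- split_by_index(x)
by_value <- split_by_value(x)

lengths(by_index)  # sizes differ by at most one for this data
lengths(by_value)  # sizes depend on ties and gaps in the values
```

Both splits keep every element exactly once; only the grouping differs.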
Result
cutting by index
#$`0`
#[1] -1.5 -1.2 -1.1
#
#$`1`
#[1] -0.9 -0.9
#
#$`2`
#[1] -0.8 -0.4
#
#$`3`
#[1] -0.3 -0.3 -0.3
#
#$`4`
#[1] -0.3 -0.2
#
#$`5`
#[1] 0.0 0.1
#
#$`6`
#[1] 0.3 0.4 0.4
#
#$`7`
#[1] 0.4 0.8
#
#$`8`
#[1] 1.3 1.3
#
#$`9`
#[1] 1.3 2.4
cutting by quantile
#$`[-1.5,-1.06]`
#[1] -1.5 -1.2 -1.1
#
#$`(-1.06,-0.86]`
#[1] -0.9 -0.9
#
#$`(-0.86,-0.34]`
#[1] -0.8 -0.4
#
#$`(-0.34,-0.3]`
#[1] -0.3 -0.3 -0.3 -0.3
#
#$`(-0.3,-0.2]`
#[1] -0.2
#
#$`(-0.2,0.14]`
#[1] 0.0 0.1
#
#$`(0.14,0.4]`
#[1] 0.3 0.4 0.4 0.4
#
#$`(0.4,0.64]`
#numeric(0)
#
#$`(0.64,1.3]`
#[1] 0.8 1.3 1.3 1.3
#
#$`(1.3,2.4]`
#[1] 2.4
Calculating percentile of dataset column
If you order a vector x and find the value that is halfway through the vector, you have just found the median, or 50th percentile. The same logic applies to any percentage. Here are two examples.
x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartiles
quantile(x, probs = seq(0, 1, by = 0.1)) # deciles
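As a quick check of the "halfway through the ordered vector" idea, the sketch below uses a vector of odd length so that the middle element is unambiguous. The length 101 and the seed are arbitrary choices for illustration.

```r
set.seed(1)                 # arbitrary seed, only for reproducibility
x <- rnorm(101)             # odd length, so there is a single middle element

middle <- sort(x)[51]       # element halfway through the ordered vector
q50 <- quantile(x, probs = 0.5, names = FALSE)

all.equal(middle, median(x))  # TRUE
all.equal(middle, q50)        # TRUE
```

For even lengths there is no single middle element, and both median() and the default quantile() method interpolate between the two middle values.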
R: splitting dataset into quartiles/deciles. What is the right method?
Another way would be ntile() in dplyr.
library(tidyverse)
foo <- data.frame(a = 1:100,
                  b = runif(100, 50, 200),
                  stringsAsFactors = FALSE)
foo %>%
  mutate(quantile = ntile(b, 10))
#  a         b quantile
#1 1  93.94754        2
#2 2 172.51323        8
#3 3  99.79261        3
#4 4  81.55288        2
#5 5 116.59942        5
#6 6 128.75947        6
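A small sketch of what ntile() does with the bucket sizes, assuming dplyr is installed. ntile() assigns buckets by rank, so with 100 rows and 10 buckets each bucket receives 10 rows regardless of how the values are spread. The seed and the column name decile are assumptions added for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only for reproducibility
foo <- data.frame(a = 1:100, b = runif(100, 50, 200))

foo <- foo %>%
  mutate(decile = ntile(b, 10))

# Number of rows per bucket: rank-based, so the buckets are equally sized here
table(foo$decile)

# Range of b covered by each bucket
foo %>%
  group_by(decile) %>%
  summarise(min_b = min(b), max_b = max(b))
```

This corresponds to the "cutting by index" interpretation rather than cutting by value.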
How to automate to split a vector in single element named scalars
Using %=% from collapse:
library(collapse)
paste0('n', seq_along(x)) %=% x
Output:
> n1
[1] 19
> n2
[1] 8
> n3
[1] 9
> n4
[1] 18
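If adding a dependency is not desirable, a base R alternative is list2env(). This is a swapped-in technique, not the collapse approach above, and the vector x below is an assumption reconstructed from the four values printed in the output.

```r
# Assumed input: the four values shown in the output above
x <- c(19, 8, 9, 18)

# Name the elements n1, n2, ... and turn the vector into a named list
vals <- as.list(setNames(x, paste0("n", seq_along(x))))

# Create one variable per list element in the current environment
invisible(list2env(vals, envir = environment()))

n1  # 19
n4  # 18
```

Keeping the values in a named vector or list is often easier to work with than creating many separate variables.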
Groupby given percentiles of the values of the chosen DataFrame column
I don't have a computer to test it right now, but I think you can do it with: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). Will update after 150 mins.
Some explanations:
In [42]:
#use np.percentile to get the bin edges of any percentile you want
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
0.907609,
3.7436589999999996,
13.089311200000001,
19.319745999999999]
In [43]:
#Need to use include_lowest=True
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
#Otherwise the smallest value will be skipped
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
How to split a character vector based on length of a list
You could try the following:
split(x = a, f = rep(seq_along(Length), Length))
f has to be of the same length as x (if its length is one or another divisor of the length of x, it is recycled).
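A minimal sketch of this with made-up inputs: a and Length below are hypothetical stand-ins for the question's objects, where Length holds the desired size of each piece.

```r
a <- c("a", "b", "c", "d", "e", "f")  # hypothetical character vector
Length <- c(2, 1, 3)                  # hypothetical piece sizes; must sum to length(a)

# rep() repeats each group number as many times as that group's size
f <- rep(seq_along(Length), Length)   # 1 1 2 3 3 3

res <- split(x = a, f = f)
res$`1`  # "a" "b"
res$`2`  # "c"
res$`3`  # "d" "e" "f"
```

Because the sizes sum to length(a), f has exactly one group label per element and no recycling takes place.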