Calculating Percentile of Dataset Column

pandas: find percentile stats of a given column

You can use the quantile() method, which is available on both pandas.DataFrame and pandas.Series, as shown below.

import pandas as pd
import random

A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]

df = pd.DataFrame({ 'field_A': A, 'field_B': B })
df
# field_A field_B
# 0 90 72
# 1 63 84
# 2 11 74
# 3 61 66
# 4 78 80
# 5 67 75
# 6 89 47
# 7 12 22
# 8 43 5
# 9 30 64

df.field_A.mean() # Same as df['field_A'].mean()
# 54.399999999999999

df.field_A.median()
# 62.0

# You can call `quantile(q)` to get the q-th quantile,
# where `q` is a fraction between 0 and 1.

df.field_A.quantile(0.1) # 10th percentile
# 11.9

df.field_A.quantile(0.5) # same as median
# 62.0

df.field_A.quantile(0.9) # 90th percentile
# 89.10000000000001
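
To request several percentiles in one call, quantile() also accepts a list of fractions. The following is a minimal sketch that hard-codes the field_A values from the sample output above so the result is reproducible; the names `field_a` and `percentiles` are illustrative only.

```python
import pandas as pd

# Fixed copy of the field_A values printed above (the original data is random).
field_a = pd.Series([90, 63, 11, 61, 78, 67, 89, 12, 43, 30], name="field_A")

# Passing a list returns a Series indexed by the requested fractions.
percentiles = field_a.quantile([0.1, 0.5, 0.9])
print(percentiles)
```

The result is itself a Series, so individual percentiles can be picked out by position or by the fraction used as the index label.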

Calculating percentile of dataset column

If you sort a vector x and find the value halfway through it, you have found the median, or 50th percentile. The same logic applies to any percentage. Here are two examples.

x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartiles
quantile(x, probs = seq(0, 1, by = 0.1)) # deciles
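
The "sort, then walk to a position" logic described above can be sketched in Python as follows. This is only an illustration, assuming linear interpolation between the two neighbouring sorted values (the default in both R's quantile() and numpy.quantile); the helper name `quantile_by_sorting` is made up for this example.

```python
import numpy as np

def quantile_by_sorting(values, p):
    """Return the p-quantile (0 <= p <= 1) using linear interpolation."""
    ordered = np.sort(np.asarray(values, dtype=float))
    position = (len(ordered) - 1) * p   # fractional index into the sorted data
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    weight = position - lower           # how far we are between the two neighbours
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Quartiles computed by hand; np.quantile(x, [...]) should agree.
quartiles = [quantile_by_sorting(x, p) for p in (0, 0.25, 0.5, 0.75, 1)]
print(quartiles)
```

With p = 0.5 the position lands halfway through the sorted vector, which is exactly the median described above.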

Calculate percentile of value in column

Sort the column and compare the value with the entry at the 20% position (or whichever percentile you choose).

For example:

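def in_percentile(my_series, val, perc=0.2):
    # Sort the values and compare val with the entry at the perc position.
    my_list = sorted(my_series.values.tolist())
    n = len(my_list)
    # True when val lies above that cutoff; perc must be below 1 to stay in range.
    return val > my_list[int(n * perc)]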

Or, if you want the actual percentile, use searchsorted on the sorted values (searchsorted assumes its input is already sorted):

my_series.sort_values().values.searchsorted(val) / len(my_series) * 100
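
As a small worked sketch of the searchsorted approach, the example below uses made-up data and shows how the `side` argument changes whether values equal to `val` are counted. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

my_series = pd.Series([40, 10, 30, 20, 50])    # deliberately unsorted
sorted_values = np.sort(my_series.to_numpy())  # searchsorted needs sorted input

val = 30
n = len(sorted_values)

# side="left": share of values strictly below val
pct_below = sorted_values.searchsorted(val, side="left") / n * 100
# side="right": share of values less than or equal to val
pct_at_or_below = sorted_values.searchsorted(val, side="right") / n * 100

print(pct_below, pct_at_or_below)
```

Which of the two definitions counts as "the" percentile of a value is a convention, so it is worth choosing one explicitly.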

Calculate percentile for every column in a data frame in R

stack(lapply(df[3:5], quantile, probs = 0.9, names = FALSE))
#   values    ind
# 1   17.0 price1
# 2    8.4 price2
# 3   10.1 price3
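
For readers working in pandas rather than R, a rough counterpart is DataFrame.quantile(), which computes the requested quantile for every numeric column at once. The sketch below uses invented price data, not the data behind the R output above.

```python
import pandas as pd

# Hypothetical data for illustration only.
prices = pd.DataFrame({
    "price1": [10, 12, 15, 17, 20],
    "price2": [5, 6, 7, 8, 9],
    "price3": [1.0, 2.0, 4.0, 8.0, 16.0],
})

# One 90th-percentile value per column, returned as a Series.
p90 = prices.quantile(0.9)
print(p90)

# Restrict to a subset of columns by selecting them first.
p90_subset = prices[["price1", "price2"]].quantile(0.9)
```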

How to calculate percentile values for each value in R

Let's first generate some data:

library(tidyverse)
set.seed(1)
df <- tibble(
  name = letters,
  value1 = rnorm(length(letters)),
  value2 = -rnorm(length(letters)),
  value3 = abs(rnorm(length(letters)))
)

Function for calculating percentile ranks (source: https://stats.stackexchange.com/a/11928)

perc.rank <- function(x) trunc(rank(x))/length(x)

df %>% mutate(
  percentile1 = perc.rank(value1),
  percentile2 = perc.rank(value2),
  percentile3 = perc.rank(value3)
) -> df

> df

  name  value1 value2 value3 percentile1 percentile2 percentile3
  <chr>  <dbl>  <dbl>  <dbl>       <dbl>       <dbl>       <dbl>
1 a     -0.626  0.156  0.341       0.192       0.615       0.308
2 b      0.184  1.47   1.13        0.462       1           0.731
3 c     -0.836  0.478  1.43        0.115       0.808       0.808
4 d      1.60  -0.418  1.98        1           0.308       0.923
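
The same percentile-rank idea can be sketched in pandas. The example below is an approximation of the R helper above, assuming that Series.rank() with its default tie handling (average rank) plays the role of R's rank(); the data and the name `perc_rank` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": list("abcde"),
    "value1": [0.3, -1.2, 0.3, 2.0, 0.7],   # note the tie at 0.3
})

def perc_rank(series):
    # rank() gives tied values their average rank (here 2.5);
    # trunc() drops the fractional part, as in the R function.
    return np.trunc(series.rank()) / len(series)

df["percentile1"] = perc_rank(df["value1"])
print(df)
```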

Calculate a percentile of dataframe column efficiently

You can use dplyr::percent_rank() to rank each value as a percentile. Note that this differs from dplyr::cume_dist(), which is based on the cumulative distribution function and returns the proportion of all values less than or equal to the current value.

Reproducible example:

set.seed(1)
df <- data.frame(val = rnorm(n = 1000000, mean = 50, sd = 20))

Show that percent_rank() differs from cume_dist() and that cume_dist() is the same as ecdf(x)(x):

library(tidyverse)

head(df) %>%
  mutate(pr = percent_rank(val),
         cd = ecdf(val)(val),
         cd2 = cume_dist(val))

       val  pr        cd       cd2
1 37.47092 0.4 0.5000000 0.5000000
2 53.67287 0.6 0.6666667 0.6666667
3 33.28743 0.0 0.1666667 0.1666667
4 81.90562 1.0 1.0000000 1.0000000
5 56.59016 0.8 0.8333333 0.8333333
6 33.59063 0.2 0.3333333 0.3333333
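
To make the difference between the two definitions concrete, here is a pandas sketch that reproduces the pr and cd columns from the six values shown above. It assumes percent_rank is (minimum rank - 1) / (n - 1) and cume_dist is (maximum rank) / n, which is how dplyr documents them; the variable names are illustrative.

```python
import pandas as pd

# The six values from the head(df) output above.
val = pd.Series([37.47092, 53.67287, 33.28743, 81.90562, 56.59016, 33.59063])
n = len(val)

# percent_rank: rescales ranks to [0, 1], so the smallest value maps to 0.
pr = (val.rank(method="min") - 1) / (n - 1)

# cume_dist: share of values <= the current value, so the smallest maps to 1/n.
cd = val.rank(method="max") / n

print(pd.DataFrame({"val": val, "pr": pr, "cd": cd}))
```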

The speed of each approach on this example dataset is roughly similar; the differences do not exceed a factor of 2:

library(microbenchmark)
mbm <- microbenchmark(
  pr_dplyr = mutate(df, pr = percent_rank(val)),
  cd_dplyr = mutate(df, cd = cume_dist(val)),
  cd_base  = mutate(df, cd = ecdf(val)(val)),
  times = 20
)

autoplot(mbm)


Calculate percentile for every value in a column of dataframe

It seems like you want Series.rank():

x['pcta'] = x['a'].rank(pct=True) # fraction between 0 and 1

Performance:

import scipy.stats as scs

%timeit [scs.percentileofscore(x["a"].values, i) for i in x["a"].values]
1000 loops, best of 3: 877 µs per loop

%timeit x.rank(pct=True)
10000 loops, best of 3: 107 µs per loop
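
One detail worth knowing when using rank(pct=True) is how tied values are handled, which is controlled by the `method` argument. The sketch below uses a tiny made-up frame with a column named `a`, as in the timing code above; the new column names are illustrative.

```python
import pandas as pd

x = pd.DataFrame({"a": [1, 2, 2, 3]})

# Default method="average": tied values share the mean of their ranks.
x["pct_average"] = x["a"].rank(pct=True)

# method="max": tied values all take the highest rank in the tie,
# which gives the fraction of values less than or equal to each entry.
x["pct_max"] = x["a"].rank(pct=True, method="max")

print(x)
```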

Return the 90th percentile values in R

You can first group by Julian_date, then call the quantile function inside summarise with probs set to 0.9.

library(tidyverse)

df %>%
  group_by(Julian_date) %>%
  summarise("value (the 90th percentile)" = quantile(temperature, probs = 0.9, na.rm = TRUE))

Output

  Julian_date `value (the 90th percentile)`
        <int>                         <dbl>
1           1                           2.1
2           2                           2.2
3         365                           2.5

Data

df <- structure(list(Year = c(1991L, 1991L, 1991L, 1992L, 1992L, 2020L
), Julian_date = c(1L, 2L, 365L, 1L, 365L, 365L), temperature = c(2.1,
2.2, 2.3, 2.1, 2.5, 2.5)), class = "data.frame", row.names = c(NA,
-6L))
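
For comparison, the same grouped calculation can be sketched in pandas. The frame below re-creates the data above by hand; `p90_by_day` is an illustrative name, and missing values are skipped by pandas' quantile by default.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1991, 1991, 1991, 1992, 1992, 2020],
    "Julian_date": [1, 2, 365, 1, 365, 365],
    "temperature": [2.1, 2.2, 2.3, 2.1, 2.5, 2.5],
})

# Group by day of year, then take the 90th percentile of temperature per group.
p90_by_day = df.groupby("Julian_date")["temperature"].quantile(0.9)
print(p90_by_day)
```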

