Calculating Percentile of Dataset Column

pandas: find percentile stats of a given column

You can use the quantile() method, which is available on both pandas.Series and pandas.DataFrame, as shown below.

import pandas as pd
import random

A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]

df = pd.DataFrame({ 'field_A': A, 'field_B': B })
df
#    field_A  field_B
# 0       90       72
# 1       63       84
# 2       11       74
# 3       61       66
# 4       78       80
# 5       67       75
# 6       89       47
# 7       12       22
# 8       43        5
# 9       30       64

df.field_A.mean()   # Same as df['field_A'].mean()
# 54.399999999999999

df.field_A.median() 
# 62.0

# Call `quantile(q)` to get the q-th quantile, where `q` is a
# fraction between 0 and 1 (for example, 0.1 for the 10th percentile).

df.field_A.quantile(0.1) # 10th percentile
# 11.9

df.field_A.quantile(0.5) # same as median
# 62.0

df.field_A.quantile(0.9) # 90th percentile
# 89.10000000000001
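
To see where values such as 11.9 come from, here is a minimal pure-Python sketch of linear interpolation between neighbouring sorted values, which is the default interpolation in pandas' quantile(). The helper name quantile_linear is hypothetical and not part of pandas.

```python
import math

def quantile_linear(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted data
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)    # stay inside the list when q == 1
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

field_a = [90, 63, 11, 61, 78, 67, 89, 12, 43, 30]   # field_A from the example above
p10 = quantile_linear(field_a, 0.1)   # about 11.9
p50 = quantile_linear(field_a, 0.5)   # 62.0, the median
p90 = quantile_linear(field_a, 0.9)   # about 89.1
```

With 10 values, q = 0.1 lands at index 0.9, nine tenths of the way from 11 to 12, hence 11.9.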

Calculating percentile of dataset column

If you sort a vector x and find the value halfway through it, you have found the median, or 50th percentile. The same logic applies to any other percentage. Here are two examples.

x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartiles
quantile(x, probs = seq(0, 1, by = 0.1))      # deciles
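
For readers working in Python, a rough equivalent can be sketched with the standard-library statistics module. This assumes method="inclusive", which interpolates on position (n - 1) * p and so should correspond to R's default quantile type. Note that statistics.quantiles() returns only the interior cut points, so the minimum and maximum are added by hand.

```python
import random
import statistics

random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]   # stand-in for rnorm(100)

# Interior cut points only: 3 for quartiles, 9 for deciles
quartiles = statistics.quantiles(x, n=4, method="inclusive")
deciles = statistics.quantiles(x, n=10, method="inclusive")

# Add the extremes to mirror probs = c(0, 0.25, 0.5, 0.75, 1)
five_numbers = [min(x), *quartiles, max(x)]
```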

Calculate percentile of value in column

Sort the column and compare the value against the entry at the 20% position (or whichever percentile you need).

For example:

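def in_percentile(my_series, val, perc=0.2):
    # Returns True when val lies above the cutoff found at fraction `perc` of the sorted values
    my_list = sorted(my_series.values.tolist())
    n = len(my_list)
    idx = min(int(n * perc), n - 1)  # clamp so perc=1.0 does not index past the end
    return val > my_list[idx]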

Or, if you want the actual percentile, use searchsorted on the sorted values (searchsorted assumes its input is already sorted):

my_series.sort_values().values.searchsorted(val) / len(my_series) * 100
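
The idea of locating a value within the sorted data can also be sketched without pandas, using the standard-library bisect module. The function name percentile_of_value and its inclusive flag are hypothetical. The flag only makes explicit whether values equal to val are counted.

```python
from bisect import bisect_left, bisect_right

def percentile_of_value(values, val, inclusive=False):
    """Percentage of entries below val (or at or below it when inclusive=True)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)   # bisect, like searchsorted, needs sorted input
    count = bisect_right(ordered, val) if inclusive else bisect_left(ordered, val)
    return 100.0 * count / len(ordered)

field_a = [90, 63, 11, 61, 78, 67, 89, 12, 43, 30]
below = percentile_of_value(field_a, 63)                        # 50.0: five values are < 63
at_or_below = percentile_of_value(field_a, 63, inclusive=True)  # 60.0: six values are <= 63
```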

Calculate percentile for every column in a data frame in R

stack(lapply(df[3:5], quantile, probs = 0.9, names = FALSE))
#  values    ind
#1   17.0 price1
#2    8.4 price2
#3   10.1 price3

How to: calculate percentile values for each value in R

Let's first generate some data:

library(tidyverse)
set.seed(1)
df <- tibble(
  name   = letters,
  value1 = rnorm(length(letters)),
  value2 = -rnorm(length(letters)),
  value3 = abs(rnorm(length(letters)))
)

Function for calculating percentile ranks (source: https://stats.stackexchange.com/a/11928)

perc.rank <- function(x) trunc(rank(x))/length(x)

df %>% mutate(
percentile1 = perc.rank(value1),
percentile2 = perc.rank(value2),
percentile3 = perc.rank(value3)
) -> df

> df

   name  value1  value2 value3 percentile1 percentile2 percentile3
   <chr>  <dbl>   <dbl>  <dbl>       <dbl>       <dbl>       <dbl>
 1 a     -0.626  0.156  0.341        0.192      0.615        0.308
 2 b      0.184  1.47   1.13         0.462      1            0.731
 3 c     -0.836  0.478  1.43         0.115      0.808        0.808
 4 d      1.60  -0.418  1.98         1          0.308        0.923
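
For comparison, here is a hedged Python sketch of the same percentile-rank calculation. It assumes R's default tie handling in rank() (tied values share their average rank) and then truncates, as perc.rank does. The names average_ranks and perc_rank are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks where tied values share the average of their positions."""
    first, count = {}, {}
    for position, v in enumerate(sorted(values), start=1):
        first.setdefault(v, position)        # first position this value occupies
        count[v] = count.get(v, 0) + 1       # how many times it occurs
    return [first[v] + (count[v] - 1) / 2 for v in values]

def perc_rank(values):
    """Truncated rank divided by the number of values."""
    n = len(values)
    return [math.trunc(r) / n for r in average_ranks(values)]

scores = [10, 20, 20, 40]
ranks = average_ranks(scores)   # [1.0, 2.5, 2.5, 4.0]
ranked = perc_rank(scores)      # [0.25, 0.5, 0.5, 1.0]
```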

Calculate a percentile of dataframe column efficiently

You can use dplyr::percent_rank() to rank each value as a percentile. Note that this differs from dplyr::cume_dist(), which ranks by the cumulative distribution function (the proportion of all values less than or equal to the current value).

Reproducible example:

set.seed(1)
df <- data.frame(val = rnorm(n = 1000000, mean = 50, sd = 20))

Show that percent_rank() differs from cume_dist() and that cume_dist() is the same as ecdf(x)(x):

library(tidyverse)

head(df) %>% 
  mutate(pr  = percent_rank(val), 
         cd  = ecdf(val)(val), 
         cd2 = cume_dist(val))

       val  pr        cd       cd2
1 37.47092 0.4 0.5000000 0.5000000
2 53.67287 0.6 0.6666667 0.6666667
3 33.28743 0.0 0.1666667 0.1666667
4 81.90562 1.0 1.0000000 1.0000000
5 56.59016 0.8 0.8333333 0.8333333
6 33.59063 0.2 0.3333333 0.3333333
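
The difference between the two definitions is easier to see written out. Below is a minimal Python sketch, assuming percent_rank is (number of strictly smaller values) / (n - 1) and cume_dist is (number of values less than or equal) / n. The functions are illustrative re-implementations, not dplyr's code.

```python
from bisect import bisect_left, bisect_right

def percent_rank(values):
    """(count of strictly smaller values) / (n - 1), so results span 0 to 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")   # choice made for this sketch
    ordered = sorted(values)
    return [bisect_left(ordered, v) / (n - 1) for v in values]

def cume_dist(values):
    """(count of values less than or equal) / n, i.e. the empirical CDF."""
    n = len(values)
    ordered = sorted(values)
    return [bisect_right(ordered, v) / n for v in values]

val = [37.47092, 53.67287, 33.28743, 81.90562, 56.59016, 33.59063]   # head(df) above
pr = percent_rank(val)   # [0.4, 0.6, 0.0, 1.0, 0.8, 0.2]
cd = cume_dist(val)      # [3/6, 4/6, 1/6, 6/6, 5/6, 2/6]
```

The smallest value gets 0 under percent_rank but 1/n under cume_dist, which is why the two columns never agree except at the maximum.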

Speed of each approach for this example dataset is roughly similar, not exceeding a factor of 2:

library(microbenchmark)
mbm <- microbenchmark(
    pr_dplyr = mutate(df, pr = percent_rank(val)),
    cd_dplyr = mutate(df, cd = cume_dist(val)),
    cd_base  = mutate(df, cd = ecdf(val)(val)),
    times = 20
)

autoplot(mbm)

Calculate percentile for every value in a column of dataframe

It seems like you want Series.rank():

x['pcta'] = x['a'].rank(pct=True)  # fraction between 0 and 1, not 0-100

Performance:

import scipy.stats as scs

%timeit [scs.percentileofscore(x["a"].values, i) for i in x["a"].values]
1000 loops, best of 3: 877 µs per loop

%timeit x.rank(pct=True)
10000 loops, best of 3: 107 µs per loop

Return the 90th percentile values in R

You can first group by Julian_date, then call quantile() inside summarise() with probs = 0.9.

library(tidyverse)

df %>% 
  group_by(Julian_date) %>% 
  summarise("value (the 90th percentile)" = quantile(temperature, probs=0.9, na.rm=TRUE))

Output

  Julian_date `value (the 90th percentile)`
        <int>                         <dbl>
1           1                           2.1
2           2                           2.2
3         365                           2.5

Data

df <- structure(list(Year = c(1991L, 1991L, 1991L, 1992L, 1992L, 2020L
), Julian_date = c(1L, 2L, 365L, 1L, 365L, 365L), temperature = c(2.1, 
2.2, 2.3, 2.1, 2.5, 2.5)), class = "data.frame", row.names = c(NA, 
-6L))
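
The same group-then-quantile pattern can be sketched in plain Python on the data above. This is a minimal illustration assuming linear interpolation between sorted values. The helper quantile_linear is hypothetical, and missing values (the na.rm = TRUE case) are not handled.

```python
import math
from collections import defaultdict

def quantile_linear(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

# (Year, Julian_date, temperature) rows copied from the data frame above
rows = [(1991, 1, 2.1), (1991, 2, 2.2), (1991, 365, 2.3),
        (1992, 1, 2.1), (1992, 365, 2.5), (2020, 365, 2.5)]

# Collect temperatures per Julian date across all years
by_day = defaultdict(list)
for _year, julian_date, temperature in rows:
    by_day[julian_date].append(temperature)

# 90th percentile per Julian date
p90 = {day: quantile_linear(temps, 0.9) for day, temps in sorted(by_day.items())}
```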