pandas: find percentile stats of a given column
You can use the quantile() method, which is available on both pandas.Series and pandas.DataFrame, as shown below.
import pandas as pd
import random
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
df = pd.DataFrame({ 'field_A': A, 'field_B': B })
df
# field_A field_B
# 0 90 72
# 1 63 84
# 2 11 74
# 3 61 66
# 4 78 80
# 5 67 75
# 6 89 47
# 7 12 22
# 8 43 5
# 9 30 64
df.field_A.mean() # Same as df['field_A'].mean()
# 54.399999999999999
df.field_A.median()
# 62.0
# You can call `quantile(q)` to get the q-th quantile,
# where `q` is a fraction between 0 and 1 (e.g. 0.9 for the 90th percentile).
df.field_A.quantile(0.1) # 10th percentile
# 11.9
df.field_A.quantile(0.5) # same as median
# 62.0
df.field_A.quantile(0.9) # 90th percentile
# 89.10000000000001
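To get several percentiles for every numeric column in one call, DataFrame.quantile() also accepts a list of fractions. The following is a minimal sketch that assumes the ten rows printed above as fixed data (a fresh random draw would give different numbers); the variable name `percentiles` is only illustrative.

```python
import pandas as pd

# Fixed copy of the sample values printed above, so the result is reproducible.
df = pd.DataFrame({
    "field_A": [90, 63, 11, 61, 78, 67, 89, 12, 43, 30],
    "field_B": [72, 84, 74, 66, 80, 75, 47, 22, 5, 64],
})

# One row per requested quantile, one column per field.
# pandas interpolates linearly between neighbouring values by default.
percentiles = df.quantile([0.1, 0.5, 0.9])
print(percentiles)
```

The row for 0.5 should agree with df.median(), and the field_A column should agree with the single-column calls shown earlier.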
Calculating percentile of dataset column
If you sort a vector x and take the value halfway through it, you have found the median, or 50th percentile. The same logic applies to any percentage. Here are two examples.
x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartiles
quantile(x, probs = seq(0, 1, by = 0.1)) # deciles
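The "sort, then look up a position" idea can be spelled out by hand. The sketch below is written in Python rather than R and uses a hypothetical helper, `percentile_by_sorting`, that interpolates linearly between the two neighbouring values; this is the same rule that numpy.percentile applies by default and that R's quantile() uses with its default type 7. The sample values are made up for illustration.

```python
import numpy as np

def percentile_by_sorting(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the percentile within the ordered vector.
    pos = (len(ordered) - 1) * p / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    # Move `frac` of the way from the lower neighbour to the upper one.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

x = [90, 63, 11, 61, 78, 67, 89, 12, 43, 30]
median = percentile_by_sorting(x, 50)
quartiles = [percentile_by_sorting(x, p) for p in (0, 25, 50, 75, 100)]

# Cross-check against numpy's built-in implementation.
assert np.allclose(quartiles, np.percentile(x, [0, 25, 50, 75, 100]))
```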
Calculate percentile of value in column
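Sort the column and check whether the value falls within the first 20% (or whichever percentile you choose).
For example:
def in_percentile(my_series, val, perc=0.2):
    # Returns True when val lies above the perc cutoff of the sorted values.
    my_list = sorted(my_series.values.tolist())
    n = len(my_list)
    # Clamp the index so that perc=1.0 does not run past the end of the list.
    return val > my_list[min(int(n * perc), n - 1)]
Or, if you want the actual percentile, simply use searchsorted on the sorted values:
my_series.sort_values().values.searchsorted(val) / len(my_series) * 100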
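Note that searchsorted assumes its input is already sorted and, with the default side="left", counts the values strictly below val. A minimal sketch of the same calculation without any sorting, using boolean comparisons; the sample series and the names `pct_below` and `pct_at_or_below` are assumptions made for illustration.

```python
import pandas as pd

my_series = pd.Series([90, 63, 11, 61, 78, 67, 89, 12, 43, 30])
val = 63

# Share of values strictly below val (searchsorted with side="left" on sorted data).
pct_below = (my_series < val).mean() * 100
# Share of values less than or equal to val (searchsorted with side="right").
pct_at_or_below = (my_series <= val).mean() * 100

# Cross-check against searchsorted on an explicitly sorted copy.
sorted_vals = my_series.sort_values().to_numpy()
pct_left = sorted_vals.searchsorted(val, side="left") / len(my_series) * 100
pct_right = sorted_vals.searchsorted(val, side="right") / len(my_series) * 100
```

Which of the two definitions is appropriate depends on whether a value should count as being "below" itself; the two differ whenever val occurs in the data.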
Calculate percentile for every column in a data frame in R
stack(lapply(df[3:5], quantile, probs = 0.9, names = FALSE))
# values ind
#1 17.0 price1
#2 8.4 price2
#3 10.1 price3
How to: calculate percentile values for each value in R
Let's first generate some data:
library(tidyverse)
set.seed(1)
df <- tibble(
  name = letters,
  value1 = rnorm(length(letters)),
  value2 = -rnorm(length(letters)),
  value3 = abs(rnorm(length(letters)))
)
Function for calculating percentile ranks (source: https://stats.stackexchange.com/a/11928):
perc.rank <- function(x) trunc(rank(x))/length(x)
df %>% mutate(
percentile1 = perc.rank(value1),
percentile2 = perc.rank(value2),
percentile3 = perc.rank(value3)
) -> df
> df
name value1 value2 value3 percentile1 percentile2 percentile3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a -0.626 0.156 0.341 0.192 0.615 0.308
2 b 0.184 1.47 1.13 0.462 1 0.731
3 c -0.836 0.478 1.43 0.115 0.808 0.808
4 d 1.60 -0.418 1.98 1 0.308 0.923
Calculate a percentile of dataframe column efficiently
You can use dplyr::percent_rank() to rank each value by percentile. Note that this differs from dplyr::cume_dist(), which ranks by the cumulative distribution function (the proportion of all values less than or equal to the current value).
Reproducible example:
set.seed(1)
df <- data.frame(val = rnorm(n = 1000000, mean = 50, sd = 20))
Show that percent_rank() differs from cume_dist(), and that cume_dist() is the same as ecdf(x)(x):
library(tidyverse)
head(df) %>%
mutate(pr = percent_rank(val),
cd = ecdf(val)(val),
cd2 = cume_dist(val))
val pr cd cd2
1 37.47092 0.4 0.5000000 0.5000000
2 53.67287 0.6 0.6666667 0.6666667
3 33.28743 0.0 0.1666667 0.1666667
4 81.90562 1.0 1.0000000 1.0000000
5 56.59016 0.8 0.8333333 0.8333333
6 33.59063 0.2 0.3333333 0.3333333
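For readers working in pandas, both rankings can be reproduced from Series.rank(). This is a sketch rather than an official equivalent: it assumes the six values printed above and relies on the documented dplyr definitions, percent_rank as (minimum rank - 1) / (n - 1) and cume_dist as the share of values less than or equal to the current one.

```python
import pandas as pd

# The six values from the output above.
val = pd.Series([37.47092, 53.67287, 33.28743, 81.90562, 56.59016, 33.59063])
n = len(val)

# percent_rank analogue: the smallest value maps to 0, the largest to 1.
pr = (val.rank(method="min") - 1) / (n - 1)
# cume_dist analogue: share of values <= the current value, so the minimum maps to 1/n.
cd = val.rank(method="max", pct=True)

print(pd.DataFrame({"val": val, "pr": pr, "cd": cd}))
```

The two columns differ because percent_rank rescales ranks onto the range 0 to 1, while cume_dist never returns 0.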
The approaches run at roughly similar speed on this example dataset, differing by no more than a factor of 2:
library(microbenchmark)
mbm <- microbenchmark(
  pr_dplyr = mutate(df, pr = percent_rank(val)),
  cd_dplyr = mutate(df, cd = cume_dist(val)),
  cd_base = mutate(df, cd = ecdf(val)(val)),
  times = 20
)
autoplot(mbm)
Calculate percentile for every value in a column of dataframe
It seems like you want Series.rank():
x['pcta'] = x['a'].rank(pct=True)  # fraction between 0 and 1
Performance:
import scipy.stats as scs
%timeit [scs.percentileofscore(x["a"].values, i) for i in x["a"].values]
1000 loops, best of 3: 877 µs per loop
%timeit x.rank(pct=True)
10000 loops, best of 3: 107 µs per loop
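One detail worth knowing is how rank(pct=True) treats ties. The sketch below uses a small made-up frame with a repeated value; the column names `pcta` and `pcta_percent` are only illustrative.

```python
import pandas as pd

x = pd.DataFrame({"a": [10, 20, 20, 40]})

# With the default method="average", tied values share the mean of their ranks.
x["pcta"] = x["a"].rank(pct=True)
# Multiply by 100 if percentages are preferred over fractions.
x["pcta_percent"] = x["pcta"] * 100

print(x)
```

Passing method="min" or method="max" instead changes only how the tied rows are scored.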
Return the 90th percentile values in R
You can first group by Julian_date, then use the quantile function to set the probability inside summarise.
library(tidyverse)
df %>%
group_by(Julian_date) %>%
summarise("value (the 90th percentile)" = quantile(temperature, probs=0.9, na.rm=TRUE))
Output
Julian_date `value (the 90th percentile)`
<int> <dbl>
1 1 2.1
2 2 2.2
3 365 2.5
Data
df <- structure(list(Year = c(1991L, 1991L, 1991L, 1992L, 1992L, 2020L
), Julian_date = c(1L, 2L, 365L, 1L, 365L, 365L), temperature = c(2.1,
2.2, 2.3, 2.1, 2.5, 2.5)), class = "data.frame", row.names = c(NA,
-6L))
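The same grouped calculation can be sketched in pandas, assuming the six rows from the Data block above have been loaded into a DataFrame with the same column names; the name `p90` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1991, 1991, 1991, 1992, 1992, 2020],
    "Julian_date": [1, 2, 365, 1, 365, 365],
    "temperature": [2.1, 2.2, 2.3, 2.1, 2.5, 2.5],
})

# 90th percentile of temperature for each Julian_date.
p90 = df.groupby("Julian_date")["temperature"].quantile(0.9)
print(p90)
```

The result is a Series indexed by Julian_date; calling p90.reset_index() turns it back into a two-column frame similar to the tibble above.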