Definitions of Quantiles in R

General discussion:

There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.

As a result, the wide variety of statistical packages between them use many different definitions.

The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions of the quantile function, and notes which of a number of common packages use which definitions. Its introduction says:

the sample quantiles that are used in statistical packages are all based on one or two order statistics, and can be written as

\hat{Q}_i(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j+1)}\,,

where \frac{j-m}{n} \leq p < \frac{j-m+1}{n} \quad (1)

for some m \in \mathbb{R} and 0 \leq \gamma \leq 1.

Which is to say, in general, the sample quantiles can be written as some kind of weighted average of two adjacent order statistics (though it may be that there's only weight on one of them).

In R:

In particular, R offers all nine definitions mentioned in Hyndman & Fan, with type 7 as the default. From Hyndman & Fan we see:

Definition 7. Gumbel (1939) also considered the modal position
$p_k = \text{mode}\,F(X_{(k)}) = (k-1)/(n-1)$. One nice property is that the vertices of $Q_7(p)$ divide the range into $n-1$ intervals, and exactly $100p\%$ of the intervals lie to the left of $Q_7(p)$ and $100(1-p)\%$ of the intervals lie to the right of $Q_7(p)$.

What does this mean? Consider n=9. Then for (k-1)/(n-1) = 0.25, you need k = 1+(9-1)/4 = 3. That is, the lower quartile is the 3rd observation of 9.

We can see that in R:

quantile(1:9)
  0%  25%  50%  75% 100% 
   1    3    5    7    9 

For its behavior when n is not of the form 4k+1, the easiest thing to do is try it:

> quantile(1:10)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
> quantile(1:11)
  0%  25%  50%  75% 100% 
 1.0  3.5  6.0  8.5 11.0 
> quantile(1:12)
   0%   25%   50%   75%  100% 
 1.00  3.75  6.50  9.25 12.00 

When k isn't an integer, R takes a weighted average of the two adjacent order statistics, with weights given by the fractional position between them (that is, it does linear interpolation).
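To make the interpolation concrete, here is a small sketch (variable names are my own) that reproduces the type-7 value for quantile(1:10, 0.25) by hand:

```r
x <- 1:10
p <- 0.25

# Type 7 places the quantile at fractional index h = 1 + p*(n - 1)
h <- 1 + p * (length(x) - 1)   # 3.25
j <- floor(h)                  # 3: the lower of the two adjacent order statistics
g <- h - j                     # 0.25: how far the index lies between them

# Weighted average of the 3rd and 4th order statistics
q_manual <- (1 - g) * sort(x)[j] + g * sort(x)[j + 1]
q_manual                       # 3.25, matching quantile(1:10)["25%"]
stopifnot(q_manual == quantile(x, p, type = 7))
```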

The nice thing is that you get 3 times as many observations above the first quartile as below it. For 9 observations, for example, you get 6 above and 2 below the third observation, which divides them in the ratio 3:1.

What's happening with your sample data

You have d=c(1,2,3,3,4,9), so n is 6. You need (k-1)/(n-1) to be 0.25, so k = 1 + 5/4 = 2.25. That is, it takes 25% of the way between the second and third observation (which coincidentally are themselves 2 and 3), so the lower quartile is 2+0.25*(3-2) = 2.25.
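Spelled out in R (a sketch, using d from the question):

```r
d <- c(1, 2, 3, 3, 4, 9)
n <- length(d)                 # 6
k <- 1 + 0.25 * (n - 1)        # 2.25: a quarter of the way from the 2nd to the 3rd order statistic
j <- floor(k)                  # 2
g <- k - j                     # 0.25
x <- sort(d)

q1 <- (1 - g) * x[j] + g * x[j + 1]   # 2 + 0.25*(3 - 2)
q1                             # 2.25
stopifnot(q1 == quantile(d, 0.25))
```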

Under the hood: some R details:

When you call summary on a data frame, this results in summary.data.frame being applied to the data frame (i.e. the relevant summary for the class you called it on). Its existence is mentioned in the help on summary.

The summary.data.frame function (ultimately, via summary.default applied to each column) calls quantile to compute quartiles. You won't see this in the help, unfortunately, since ?summary.data.frame simply takes you to the summary help, which gives no details on what happens when summary is applied to a numeric vector; this is one of those really bad spots in the help.

So ?quantile (or help(quantile)) describes what R does.

Here are two things it says (based directly off Hyndman & Fan). First, it gives general information:

All sample quantiles are defined as weighted averages of consecutive order
statistics. Sample quantiles of type i are defined by:

Q[i](p) = (1 - γ) x[j] + γ x[j+1],

where 1 ≤ i ≤ 9, (j-m)/n ≤ p < (j-m+1)/n, x[j] is the jth order statistic,
n is the sample size, the value of γ is a function of j = floor(np + m) and
g = np + m - j, and m is a constant determined by the sample quantile type.

Second, there's specific information about method 7:

Type 7
m = 1-p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.

Hopefully the explanation I gave earlier helps to make more sense of what this is saying. The help on quantile pretty much just quotes Hyndman & Fan as far as definitions go, and its behavior is pretty simple.
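As a sanity check, here is a sketch applying the help-page recipe for type 7 literally (m = 1 - p, j = floor(np + m), γ = g = np + m - j); it reproduces the 7.75 shown for quantile(1:10) earlier:

```r
x <- 1:10
p <- 0.75
n <- length(x)

m <- 1 - p                 # type-7 constant
j <- floor(n * p + m)      # 7
g <- n * p + m - j         # 0.75

q <- (1 - g) * x[j] + g * x[j + 1]
q                          # 7.75
stopifnot(q == quantile(x, p, type = 7))
```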


Reference:

[1]: Rob J. Hyndman and Yanan Fan (1996), "Sample Quantiles in Statistical Packages," The American Statistician, Vol. 50, No. 4 (Nov.), pp. 361-365.

Explain the quantile() function in R

You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J. and Fan, Y. (1996), "Sample Quantiles in Statistical Packages," The American Statistician 50(4): 361-365, doi:10.2307/2684934) to get an understanding. Let's start with the first problem.

where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.

The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)

Also, that last sentence is just wrong. It should read

Here gamma depends on the fractional part of np+m, g = np+m-j

As for m, that's straightforward: m depends on which of the 9 algorithms was chosen. So just as Q[i] is the quantile function for type i, m should really be thought of as m[i]. For algorithms 1 and 2, m is 0; for 3, m is -1/2; for the others, see the next part.

For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):

p(k) = (k - α) / (n - α - β + 1), where α and β are constants determined by the type. Further, m = α + p(1 - α - β), and γ = g.

This is really confusing. What the documentation calls p(k) is not the same as the p from before; p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Note also that in the expression for m, the p is the original p, so m = α + p * (1 - α - β). Conceptually, for algorithms 4 through 9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). The types differ only in their formula for pk.
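To illustrate, here is a sketch (my own construction) computing a type-6 quantile via its plotting positions pk = k/(n + 1), i.e. α = β = 0, and interpolating the points (pk, x[k]) with approx():

```r
x <- c(2, 4, 7, 11, 16)          # already sorted
n <- length(x)
pk <- (1:n) / (n + 1)            # type-6 plotting positions (alpha = beta = 0)

p <- 0.3
q6 <- approx(pk, x, xout = p)$y  # linear interpolation through (pk, x[k]) at p
q6                               # 3.6
stopifnot(abs(q6 - quantile(x, p, type = 6)) < 1e-9)
```

(One caveat: approx() returns NA for p outside [pk[1], pk[n]], whereas quantile() clamps to the extreme order statistics there.)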

As for the last bit, R is just stating what S uses.

The original paper gives a list of six "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one of them. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles, and is what's described in Wikipedia.

BTW, in response to dreeves' answer: Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's scheme is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's MathWorld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)

understanding the fundamentals of quantile() and quantiles

To expand on @BenBolker: consider the type argument of the quantile() function. You are working with a continuous distribution, so the continuous types 4 through 9 are the relevant ones. For example:

 b[b <  quantile(b, probs = c(.05), type = 9)]

Types 4 and 6 will give what you were probably expecting

 [1] -1.893092 -3.263889

while 5, 7, 8, and 9 will give

 [1] -1.893092 -1.538927 -3.263889

The help file gives much detail about why, but in the end it comes down to the fact that there is no agreed upon method to estimate sample quantiles (including the median).
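Since b isn't shown in the question, here is a reproducible stand-in (set.seed and rnorm are my assumptions) that makes the type dependence easy to see:

```r
set.seed(42)
b <- rnorm(20)

# The estimated 5th percentile differs across the continuous types:
sapply(4:9, function(t) quantile(b, probs = 0.05, type = t))

# The type-4 fractional index np is always below the type-7 index 1 + p*(n-1),
# so the type-4 estimate can never exceed the type-7 one:
stopifnot(quantile(b, 0.05, type = 4) <= quantile(b, 0.05, type = 7))
```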

Quantiles in R Programming Language

Try this:

x <- rnorm(100) # sample data
quantile(x) # quartiles
quantile(x, c(0.1, 0.9)) # 10th and 90th centiles

There are a number of ways to find out how to do things in R, for example, introductory books (many of them online, e.g., https://cran.r-project.org/other-docs.html) and simple functions such as apropos and RSiteSearch:

apropos("foo") # gives you a list of functions that have "foo" in their name

How to calculate the numbers of the observations in quantiles?

We may use cut with table:

table(cut(a, quantile(a, 0:10 / 10)))

# (0.00202,0.22] (0.22,0.307] (0.307,0.382] (0.382,0.457] (0.457,0.535] (0.535,0.622]
# 99999 100000 100000 100000 100000 100000
# (0.622,0.724] (0.724,0.856] (0.856,1.07] (1.07,3.81]
# 100000 100000 100000 100000

but given what quantiles are, that may not be very interesting. Perhaps you may want to try the theoretical quantiles as well:

table(cut(a, qgamma(0:10 / 10, 3, 5)))

# (0,0.22] (0.22,0.307] (0.307,0.383] (0.383,0.457] (0.457,0.535] (0.535,0.621] (0.621,0.723]
# 99978 100114 100545 99843 99273 99644 100104
# (0.723,0.856] (0.856,1.06] (1.06,Inf]
# 100208 99883 100408

This is not much more interesting, since if your data really do follow a gamma distribution and you have a lot of observations, you can be quite certain that close to x% of the data lie between the q-th and (q+x)-th theoretical quantiles. In smaller samples the second approach can be interesting.


Edit: Given your updated question, it's clear that by 10%, 20%, etc. you don't mean quantiles. Assuming the minimal value is 0 and the maximal is 2, so that by 10% you mean 0.2, you want

table(cut(a, seq(min(a), max(a), length = 10 + 1)))

# (0.00418,0.428] (0.428,0.853] (0.853,1.28] (1.28,1.7] (1.7,2.13] (2.13,2.55]
# 361734 436176 155332 37489 7651 1335
# (2.55,2.97] (2.97,3.4] (3.4,3.82] (3.82,4.25]
# 231 38 11 2

Quantile Function for a Vector of Dates

If x is a vector of Dates and probs is a vector of probabilities:

# test input
x <- as.Date("2018-03-21") + 0:10
probs <- 1:9/10

as.Date(quantile(unclass(x), probs), origin = "1970-01-01")

giving:

         10%          20%          30%          40%          50%          60% 
"2018-03-22" "2018-03-23" "2018-03-24" "2018-03-25" "2018-03-26" "2018-03-27" 
         70%          80%          90% 
"2018-03-28" "2018-03-29" "2018-03-30" 
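If interpolated (fractional) dates would be a problem, one option (a sketch) is type = 1, which always returns one of the observed order statistics:

```r
x <- as.Date("2018-03-21") + 0:10
probs <- 1:9 / 10

# type = 1 is the inverse of the empirical CDF: no interpolation, so every
# result is an actual element of x
q <- as.Date(quantile(unclass(x), probs, type = 1), origin = "1970-01-01")
q
stopifnot(all(q %in% x))
```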

R: what is the vector of quantiles in density function dvmnorm

Just taking a bivariate normal (X1, X2) as an example: by passing in x = (0, 0), you get the density evaluated at the single point (0, 0), which is a single value. Why do you expect to get a vector?

If you want a vector, you need to pass in a matrix. For example, x = cbind(c(0, 1), c(0, 1)) gives

f(0, 0)
f(1, 1)

i.e. the density at (0, 0) and at (1, 1). In this situation, each row of the matrix is evaluated separately, giving one density value per row.
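A minimal illustration, assuming the question is about dmvnorm() from the mvtnorm package:

```r
library(mvtnorm)  # assumed: provides dmvnorm()

# A single point gives a single density value; for the standard bivariate
# normal, the density at the origin is 1/(2*pi)
dmvnorm(c(0, 0))

# A matrix gives one density value per row: here f(0, 0) and f(1, 1)
dmvnorm(rbind(c(0, 0), c(1, 1)))
```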

R calculates quantile wrong or different?

If you look at the help page for quantile:

?quantile

you will see that quantiles can be calculated in different ways, which can be specified using the type = argument, an integer from 1 to 9.

Type 6 gives the result that you were expecting:

quantile(c(10, 20, 30, 40), type = 6)

  0%  25%  50%  75% 100% 
10.0 12.5 25.0 37.5 40.0 
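To see where the 12.5 comes from, here is a sketch applying the type-6 recipe (m = p) by hand:

```r
x <- c(10, 20, 30, 40)
p <- 0.25
n <- length(x)

m <- p                       # type-6 constant
h <- n * p + m               # 1.25: fractional index
j <- floor(h)                # 1
g <- h - j                   # 0.25

q <- (1 - g) * x[j] + g * x[j + 1]   # 0.75*10 + 0.25*20
q                                    # 12.5
stopifnot(q == quantile(x, p, type = 6))
```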

