How to Identify the Distribution of the Given Data Using R

How do I know the distribution of a dataset in R?

I had a similar question some days ago and got helpful insights in the following post:
How to identify the distribution of the given data using r

Adapted to your dataset

library(pder)
data("DemocracyIncome", package = "pder")

demo <- na.omit(DemocracyIncome)

library(fitdistrplus)
descdist(demo$income, discrete = FALSE)

normal_dist <- fitdist(demo$income, "norm")

plot(normal_dist)

The first plot help you to identify distribution, second plot(s) are to check normal distribution! Hope this helps.

How do I know what distribution of data follows in R?

The fitdistrplus package can help with this kind of thing, but you need to know what candidate distributions you want to check. Let's try normal, uniform, and exponential:

library(fitdistrplus)
fit.in1 <- fitdist(dat$IN, "norm")
fit.in2 <- fitdist(dat$IN, "unif")
fit.in3 <- fitdist(dat$IN, "exp")

Then you can plot some diagnostics:

par(mfrow=c(2,2)
denscomp(list(fit.in1,fit.in2,fit.in3),legendtext=c("Normal","Uniform","Exponential"))
qqcomp(list(fit.in1,fit.in2,fit.in3),legendtext=c("Normal","Uniform","Exponential"))
cdfcomp(list(fit.in1,fit.in2,fit.in3),legendtext=c("Normal","Uniform","Exponential"))
ppcomp(list(fit.in1,fit.in2,fit.in3),legendtext=c("Normal","Uniform","Exponential"))

Sample Image

Is it normal? Maybe:

> shapiro.test(dat$IN)

Shapiro-Wilk normality test

data: dat$IN
W = 0.96548, p-value = 0.8352

Is it uniform over [0,14]? Maybe

> ks.test(dat$IN,"punif",0,14)

One-sample Kolmogorov-Smirnov test

data: dat$IN
D = 0.16758, p-value = 0.8024
alternative hypothesis: two-sided

The null hypotheses for these tests are that the distribution is what you think it is. The alternative is that the distribution is NOT what you are testing against. So the tinier p-values mean that a particular distribution is not a good candidate for fit.

How to find the statistical mode?

One more solution, which works for both numeric & character/factor data:

Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}

On my dinky little machine, that can generate & find the mode of a 10M-integer vector in about half a second.

If your data set might have multiple modes, the above solution takes the same approach as which.max, and returns the first-appearing value of the set of modes. To return all modes, use this variant (from @digEmAll in the comments):

Modes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}

How to identify the distribution of the given data in Python?

Fitter in python provides similar functionality. The code looks like:

from fitter import Fitter
f = Fitter(data)
f.fit()

For more information, please take a look at https://pypi.python.org/pypi/fitter

Specify parameters of a gamma distribution given known median and mode

Use the closed-form solution for the mode to get one of the parameters in terms of the other, then solve for the CDF at 0.5 (median) with one of the parameters substituted out.

f <- function(x) pgamma(7, x, (x - 1)/4) - 0.5
(shape <- uniroot(f, c(1, 10))$root)
# 1.905356
(rate <- (shape - 1)/4)
# 0.226339


Related Topics



Leave a reply



Submit