Given a Set of Random Numbers Drawn from a Continuous Univariate Distribution, Find the Distribution

Given a set of random numbers drawn from a continuous univariate distribution, find the distribution

My first approach would be to generate qq plots of the given data against the possible distributions.

x <- c(15.771062,14.741310,9.081269,11.276436,11.534672,17.980860,13.550017,13.853336,11.262280,11.049087,14.752701,4.481159,11.680758,11.451909,10.001488,11.106817,7.999088,10.591574,8.141551,12.401899,11.215275,13.358770,8.388508,11.875838,3.137448,8.675275,17.381322,12.362328,10.987731,7.600881,14.360674,5.443649,16.024247,11.247233,9.549301,9.709091,13.642511,10.892652,11.760685,11.717966,11.373979,10.543105,10.230631,9.918293,10.565087,8.891209,10.021141,9.152660,10.384917,8.739189,5.554605,8.575793,12.016232,10.862214,4.938752,14.046626,5.279255,11.907347,8.621476,7.933702,10.799049,8.567466,9.914821,7.483575,11.098477,8.033768,10.954300,8.031797,14.288100,9.813787,5.883826,7.829455,9.462013,9.176897,10.153627,4.922607,6.818439,9.480758,8.166601,12.017158,13.279630,14.464876,13.319124,12.331335,3.194438,9.866487,11.337083,8.958164,8.241395,4.289313,5.508243,4.737891,7.577698,9.626720,16.558392,10.309173,11.740863,8.761573,7.099866,10.032640)
> qqnorm(x)

For more info see link

Another possibility is based on the fitdistr function in the MASS package. Here is the different distributions ordered by their log-likelihood

> library(MASS)
> fitdistr(x, 't')$loglik
[1] -252.2659
Warning message:
In log(s) : NaNs produced
> fitdistr(x, 'normal')$loglik
[1] -252.2968
> fitdistr(x, 'logistic')$loglik
[1] -252.2996
> fitdistr(x, 'weibull')$loglik
[1] -252.3507
> fitdistr(x, 'gamma')$loglik
[1] -255.9099
> fitdistr(x, 'lognormal')$loglik
[1] -260.6328
> fitdistr(x, 'exponential')$loglik
[1] -331.8191
Warning messages:
1: In dgamma(x, shape, scale, log) : NaNs produced
2: In dgamma(x, shape, scale, log) : NaNs produced

Fill NA values with random numbers generated from a specific distribution in R

Try this

df$id <- ifelse(is.na(df$id) , runif(1,4,8) , df$id)

DataExplorer, customize univariate distribution

You have not specified which variable the B, C or D density graph should apply to.
If there is only one, e.g. B then do it like this:

library(tidyverse)
library(ggpubr)

A <- c(rep(c(1,2,3,4,5), 200))
A<- factor(A)
B <- c(x=rnorm(1000))
C <- c(x= rnorm(1000, mean = 100, sd=2))
D <- c(x= rnorm(1000, 2, 2))
df<- data.frame(A, B, C, D)

df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(B, fill=A)) +
geom_density(alpha=0.2)

Sample Image

You can also do it separately for each of the variables on one plot.

pB = df %>% mutate(A = A %>% fct_inorder()) %>% 
ggplot(aes(B, fill=A)) +
geom_density(alpha=0.2)
pC = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(C, fill=A)) +
geom_density(alpha=0.2)

pD = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(D, fill=A)) +
geom_density(alpha=0.2)

ggarrange(pB, pC, pD,
labels = c("B", "C", "D"))

Sample Image

And if you don't like the fillings, you can do it like this

df %>% mutate(A = A %>% fct_inorder()) %>% 
ggplot(aes(B, color=A)) +
geom_density()

Sample Image

Update 1

It is possible to create charts for any number of columns. I will show it to you in the example below.
First, we'll do it in a very simple, even trivial way.

library(tidyverse)
df = tibble(
A = rep(c(1,2,3,4,5), 200) %>% factor(),
B = rnorm(1000),
C = rnorm(1000, mean = 100, sd=2),
D = rnorm(1000, 2, 2)
)

fPlot = function(x, group) tibble(x=x, group=group) %>%
ggplot(aes(x, color=group)) +
geom_density()

df %>% select_at(vars(B:D)) %>%
map(~fPlot(., df$A))

As you can see, we created three plots for variables B, C and D.

The second way is a bit more difficult to understand. But it will give you some extra bonuses.

fPlot2 = function(df, group) df$data[[1]] %>% 
ggplot(aes(val, color=A)) +
geom_density() +
ggtitle(group)

df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>%
group_by(var) %>%
nest() %>%
group_map(fPlot2)

Note that your tibble after df %>% pivot_longer(B:D, names_to = "var", values_to = "val") looks like this.

# A tibble: 3,000 x 3
A var val
<fct> <chr> <dbl>
1 1 B 1.06
2 1 C 100.
3 1 D 3.54
4 2 B -0.652
5 2 C 100.
6 2 D 1.12
7 3 B 0.652
8 3 C 97.3
9 3 D 3.57
10 4 B -0.0972
# ... with 2,990 more rows

After doing df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% group_by(var) %>% nest() looks like this:

# A tibble: 3 x 2
# Groups: var [3]
var data
<chr> <list>
1 B <tibble [1,000 x 2]>
2 C <tibble [1,000 x 2]>
3 D <tibble [1,000 x 2]>

As you can see the data has been collapsed into three internal tibble in the variable data.
This approach will allow you to easily calculate all statistics for each column separately. Look at this.

fStat = function(df) df$data[[1]] %>% 
group_by(A) %>%
summarise(
n = n(),
min = min(val),
mean = mean(val),
max = max(val),
median = median(val),
sd = sd(val),
sw.stat = stats::shapiro.test(val)$statistic,
sw.p = stats::shapiro.test(val)$p.value,
)

df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>%
group_by(var) %>%
nest() %>%
group_modify(~fStat(.x))

output

# A tibble: 15 x 10
# Groups: var [3]
var A n min mean max median sd sw.stat sw.p
<chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 B 1 200 -2.14 0.139 3.16 0.153 0.960 0.994 0.561
2 B 2 200 -2.00 0.0185 2.61 0.0162 0.923 0.992 0.373
3 B 3 200 -3.15 0.0245 2.42 0.0718 1.02 0.992 0.347
4 B 4 200 -2.75 0.00112 2.73 -0.00691 1.02 0.993 0.496
5 B 5 200 -3.32 -0.00758 3.23 -0.000105 0.993 0.991 0.250
6 C 1 200 94.6 99.8 104. 99.8 1.97 0.992 0.365
7 C 2 200 94.8 100. 104. 100. 1.85 0.991 0.290
8 C 3 200 94.5 100. 106. 100. 1.94 0.996 0.877
9 C 4 200 94.4 99.9 107. 99.9 1.97 0.993 0.463
10 C 5 200 94.3 99.8 106. 99.8 2.07 0.996 0.887
11 D 1 200 -4.89 1.81 8.11 1.90 2.09 0.995 0.750
12 D 2 200 -5.42 2.15 7.57 2.18 2.14 0.995 0.726
13 D 3 200 -4.38 2.09 7.10 2.02 1.97 0.989 0.111
14 D 4 200 -4.73 2.13 8.98 1.93 1.99 0.989 0.138
15 D 5 200 -2.19 2.24 7.79 2.25 1.87 0.996 0.867

Czy to nie fajne?

Draw random numbers from restricted Pareto distribution

The standard approach to sampling from a truncated distribution has three steps. Here's an example with the normal distribution so you can get the idea.

n <- 1000
lower_bound <- -1
upper_bound <- 1

Apply the CDF to your lower and upper bounds to find the quantiles of the edges of your distribution.

(quantiles <- pnorm(c(lower_bound, upper_bound)))
# [1] 0.1586553 0.8413447

Sample from a uniform distribution between those quantiles.

uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])

Apply the inverse CDF.

truncated_normal_random_numbers <- qnorm(uniform_random_numbers)

truncated normal random numbers


The CDF for the pareto distribution is

ppareto <- function(x, scale, shape)
{
ifelse(x > scale, 1 - (scale / x) ^ shape, 0)
}

And the inverse is

qpareto <- function(y, scale, shape)
{
ifelse(
y >= 0 & y <= 1,
scale * ((1 - y) ^ (-1 / shape)),
NaN
)
}

We can rework the above example to use these Pareto functions.

n <- 1000
scale <- 1
shape <- 1
lower_bound <- 2
upper_bound <- 10

(quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape))
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
truncated_pareto_random_numbers <- qpareto(uniform_random_numbers, scale, shape)

truncated Pareto distrbution


To make it easier to reuse this code, we can wrap it into a function. The lower and upper bounds have default values that match the range of the distribution, so if you don't pass values in, then you'll get a non-truncated Pareto distribution.

rpareto <- function(n, scale, shape, lower_bound = scale, upper_bound = Inf)
{
quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape)
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
qpareto(uniform_random_numbers, scale, shape)
}


Related Topics



Leave a reply



Submit