Given a set of random numbers drawn from a continuous univariate distribution, find the distribution
My first approach would be to generate qq plots of the given data against the possible distributions.
x <- c(15.771062,14.741310,9.081269,11.276436,11.534672,17.980860,13.550017,13.853336,11.262280,11.049087,14.752701,4.481159,11.680758,11.451909,10.001488,11.106817,7.999088,10.591574,8.141551,12.401899,11.215275,13.358770,8.388508,11.875838,3.137448,8.675275,17.381322,12.362328,10.987731,7.600881,14.360674,5.443649,16.024247,11.247233,9.549301,9.709091,13.642511,10.892652,11.760685,11.717966,11.373979,10.543105,10.230631,9.918293,10.565087,8.891209,10.021141,9.152660,10.384917,8.739189,5.554605,8.575793,12.016232,10.862214,4.938752,14.046626,5.279255,11.907347,8.621476,7.933702,10.799049,8.567466,9.914821,7.483575,11.098477,8.033768,10.954300,8.031797,14.288100,9.813787,5.883826,7.829455,9.462013,9.176897,10.153627,4.922607,6.818439,9.480758,8.166601,12.017158,13.279630,14.464876,13.319124,12.331335,3.194438,9.866487,11.337083,8.958164,8.241395,4.289313,5.508243,4.737891,7.577698,9.626720,16.558392,10.309173,11.740863,8.761573,7.099866,10.032640)
> qqnorm(x)
For more info see link
Another possibility is based on the fitdistr function in the MASS package. Here is the different distributions ordered by their log-likelihood
> library(MASS)
> fitdistr(x, 't')$loglik
[1] -252.2659
Warning message:
In log(s) : NaNs produced
> fitdistr(x, 'normal')$loglik
[1] -252.2968
> fitdistr(x, 'logistic')$loglik
[1] -252.2996
> fitdistr(x, 'weibull')$loglik
[1] -252.3507
> fitdistr(x, 'gamma')$loglik
[1] -255.9099
> fitdistr(x, 'lognormal')$loglik
[1] -260.6328
> fitdistr(x, 'exponential')$loglik
[1] -331.8191
Warning messages:
1: In dgamma(x, shape, scale, log) : NaNs produced
2: In dgamma(x, shape, scale, log) : NaNs produced
Fill NA values with random numbers generated from a specific distribution in R
Try this
df$id <- ifelse(is.na(df$id) , runif(1,4,8) , df$id)
DataExplorer, customize univariate distribution
You have not specified which variable the B
, C
or D
density graph should apply to.
If there is only one, e.g. B
then do it like this:
library(tidyverse)
library(ggpubr)
A <- c(rep(c(1,2,3,4,5), 200))
A<- factor(A)
B <- c(x=rnorm(1000))
C <- c(x= rnorm(1000, mean = 100, sd=2))
D <- c(x= rnorm(1000, 2, 2))
df<- data.frame(A, B, C, D)
df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(B, fill=A)) +
geom_density(alpha=0.2)
You can also do it separately for each of the variables on one plot.
pB = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(B, fill=A)) +
geom_density(alpha=0.2)
pC = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(C, fill=A)) +
geom_density(alpha=0.2)
pD = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(D, fill=A)) +
geom_density(alpha=0.2)
ggarrange(pB, pC, pD,
labels = c("B", "C", "D"))
And if you don't like the fillings, you can do it like this
df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(B, color=A)) +
geom_density()
Update 1
It is possible to create charts for any number of columns. I will show it to you in the example below.
First, we'll do it in a very simple, even trivial way.
library(tidyverse)
df = tibble(
A = rep(c(1,2,3,4,5), 200) %>% factor(),
B = rnorm(1000),
C = rnorm(1000, mean = 100, sd=2),
D = rnorm(1000, 2, 2)
)
fPlot = function(x, group) tibble(x=x, group=group) %>%
ggplot(aes(x, color=group)) +
geom_density()
df %>% select_at(vars(B:D)) %>%
map(~fPlot(., df$A))
As you can see, we created three plots for variables B
, C
and D
.
The second way is a bit more difficult to understand. But it will give you some extra bonuses.
fPlot2 = function(df, group) df$data[[1]] %>%
ggplot(aes(val, color=A)) +
geom_density() +
ggtitle(group)
df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>%
group_by(var) %>%
nest() %>%
group_map(fPlot2)
Note that your tibble
after df %>% pivot_longer(B:D, names_to = "var", values_to = "val")
looks like this.
# A tibble: 3,000 x 3
A var val
<fct> <chr> <dbl>
1 1 B 1.06
2 1 C 100.
3 1 D 3.54
4 2 B -0.652
5 2 C 100.
6 2 D 1.12
7 3 B 0.652
8 3 C 97.3
9 3 D 3.57
10 4 B -0.0972
# ... with 2,990 more rows
After doing df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% group_by(var) %>% nest()
looks like this:
# A tibble: 3 x 2
# Groups: var [3]
var data
<chr> <list>
1 B <tibble [1,000 x 2]>
2 C <tibble [1,000 x 2]>
3 D <tibble [1,000 x 2]>
As you can see the data has been collapsed into three internal tibble
in the variable data
.
This approach will allow you to easily calculate all statistics for each column separately. Look at this.
fStat = function(df) df$data[[1]] %>%
group_by(A) %>%
summarise(
n = n(),
min = min(val),
mean = mean(val),
max = max(val),
median = median(val),
sd = sd(val),
sw.stat = stats::shapiro.test(val)$statistic,
sw.p = stats::shapiro.test(val)$p.value,
)
df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>%
group_by(var) %>%
nest() %>%
group_modify(~fStat(.x))
output
# A tibble: 15 x 10
# Groups: var [3]
var A n min mean max median sd sw.stat sw.p
<chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 B 1 200 -2.14 0.139 3.16 0.153 0.960 0.994 0.561
2 B 2 200 -2.00 0.0185 2.61 0.0162 0.923 0.992 0.373
3 B 3 200 -3.15 0.0245 2.42 0.0718 1.02 0.992 0.347
4 B 4 200 -2.75 0.00112 2.73 -0.00691 1.02 0.993 0.496
5 B 5 200 -3.32 -0.00758 3.23 -0.000105 0.993 0.991 0.250
6 C 1 200 94.6 99.8 104. 99.8 1.97 0.992 0.365
7 C 2 200 94.8 100. 104. 100. 1.85 0.991 0.290
8 C 3 200 94.5 100. 106. 100. 1.94 0.996 0.877
9 C 4 200 94.4 99.9 107. 99.9 1.97 0.993 0.463
10 C 5 200 94.3 99.8 106. 99.8 2.07 0.996 0.887
11 D 1 200 -4.89 1.81 8.11 1.90 2.09 0.995 0.750
12 D 2 200 -5.42 2.15 7.57 2.18 2.14 0.995 0.726
13 D 3 200 -4.38 2.09 7.10 2.02 1.97 0.989 0.111
14 D 4 200 -4.73 2.13 8.98 1.93 1.99 0.989 0.138
15 D 5 200 -2.19 2.24 7.79 2.25 1.87 0.996 0.867
Czy to nie fajne?
Draw random numbers from restricted Pareto distribution
The standard approach to sampling from a truncated distribution has three steps. Here's an example with the normal distribution so you can get the idea.
n <- 1000
lower_bound <- -1
upper_bound <- 1
Apply the CDF to your lower and upper bounds to find the quantiles of the edges of your distribution.
(quantiles <- pnorm(c(lower_bound, upper_bound)))
# [1] 0.1586553 0.8413447
Sample from a uniform distribution between those quantiles.
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
Apply the inverse CDF.
truncated_normal_random_numbers <- qnorm(uniform_random_numbers)
The CDF for the pareto distribution is
ppareto <- function(x, scale, shape)
{
ifelse(x > scale, 1 - (scale / x) ^ shape, 0)
}
And the inverse is
qpareto <- function(y, scale, shape)
{
ifelse(
y >= 0 & y <= 1,
scale * ((1 - y) ^ (-1 / shape)),
NaN
)
}
We can rework the above example to use these Pareto functions.
n <- 1000
scale <- 1
shape <- 1
lower_bound <- 2
upper_bound <- 10
(quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape))
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
truncated_pareto_random_numbers <- qpareto(uniform_random_numbers, scale, shape)
To make it easier to reuse this code, we can wrap it into a function. The lower and upper bounds have default values that match the range of the distribution, so if you don't pass values in, then you'll get a non-truncated Pareto distribution.
rpareto <- function(n, scale, shape, lower_bound = scale, upper_bound = Inf)
{
quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape)
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
qpareto(uniform_random_numbers, scale, shape)
}
Related Topics
How to Make a Matrix from a List of Vectors in R
How to Add Boxplots to Scatterplot with Jitter
Adjusting Width of Tables Made with Kable() in Rmarkdown Documents
Visualizing R Function Dependencies
Collapsing/Hiding Figures in R Markdown
Multiple Functions in a Single Tapply or Aggregate Statement
Merge Data Frames and Overwrite Values
R Error in Unique.Default(X) Unique() Applies Only to Vectors
Ggplot Year by Year Comparison
How to Change Fontface (Bold/Italics) for a Cell in a Kable Table in Rmarkdown
Compare Two Character Vectors in R
How to Combine Multiple Ggplot2 Elements into the Return of a Function
Showing Different Axis Labels Using Ggplot2 with Facet_Wrap
Get Map with Specified Boundary Coordinates
Faster Way to Subset on Rows of a Data Frame in R
Summarise_At Using Different Functions for Different Variables