Calculating length of 95%-CI using dplyr

You could do it manually, using a few extra functions in summarise() followed by a mutate(). The 95% CI is mean ± qt(1 - 0.05/2, n - 1) * se, where se = sd / sqrt(n):

library(dplyr)

mtcars %>%
  group_by(vs) %>%
  summarise(mean.mpg = mean(mpg, na.rm = TRUE),
            sd.mpg = sd(mpg, na.rm = TRUE),
            n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)

#> Source: local data frame [2 x 7]
#>
#>      vs mean.mpg   sd.mpg n.mpg    se.mpg lower.ci.mpg upper.ci.mpg
#>   (dbl)    (dbl)    (dbl) (int)     (dbl)        (dbl)        (dbl)
#> 1     0 16.61667 3.860699    18 0.9099756     14.69679     18.53655
#> 2     1 24.55714 5.378978    14 1.4375924     21.45141     27.66287
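
As a quick sanity check (my addition, not part of the original answer), base R's one-sample t.test() builds its interval with the same mean ± qt(0.975, n - 1) * se construction, so its result for one group should match the corresponding row above:

## compare against t.test() for the vs == 0 group; the reported conf.int
## should match lower.ci.mpg and upper.ci.mpg in the first row of the table
t.test(mtcars$mpg[mtcars$vs == 0])$conf.int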

Calculating upper and lower confidence intervals by group in dplyr summarise()
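
This answer uses mean_ci() and n_perc() from the qwraps2 package. The question's dat object isn't shown, so here is a hypothetical stand-in (made-up values) with the same type and num columns, just so the snippets below can run end to end:

library(qwraps2)

## hypothetical stand-in for the question's `dat`
set.seed(42)
dat <- data.frame(type = factor(rep(c("A", "B"), each = 8)),
                  num  = rnorm(16, mean = 2.5, sd = 2.5))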

The output of mean_ci() is a vector of length 3. This may be unexpected, because the qwraps2 package adds a print method so that in the console the result looks like a single character value rather than a numeric vector of length > 1. But you can see the underlying data structure with str():

mean_ci(dat$num) %>% str
# 'qwraps2_mean_ci' Named num [1:3] 2.44 1.05 3.82
# - attr(*, "names")= chr [1:3] "mean" "lcl" "ucl"
# - attr(*, "alpha")= num 0.05

In summarise(), each element of each column of the output needs to be length 1, so handing summarise() a length-3 object to put in a single "cell" (column element) results in an error. A workaround is to wrap the length-3 vector in a list, so that it is now a length-1 list. Then you can use unnest_wider() to separate it into 3 columns (thereby making the table "wider"):

library(tidyverse)

dat %>%
  group_by(type) %>%
  summarise(N = n(),
            mean.ci = list(mean_ci(num)),
            Percent = n_perc(num > 0)) %>%
  unnest_wider(mean.ci)
# # A tibble: 2 x 6
#   type      N  mean   lcl   ucl Percent
#   <fct> <int> <dbl> <dbl> <dbl> <chr>
# 1 A         8  2.25 0.523  3.98 "6 (75.00\\%)"
# 2 B         8  2.62 0.344  4.91 "4 (50.00\\%)"

Calculating difference of two means and its confidence interval

This can be done with base R, because t.test also reports a confidence interval for the mean difference:

> res <- t.test(iris$Sepal.Length[iris$Species=="setosa"],
+               iris$Sepal.Length[iris$Species=="virginica"])
> res$conf.int
[1] -1.78676 -1.37724
attr(,"conf.level")
[1] 0.95

The mean values are stored in the entry estimate, and you can compute the difference of the means with

> res$estimate[1] - res$estimate[2]
mean of x
-1.582

or equivalently

> sum(res$estimate * c(1,-1))
[1] -1.582
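
Equivalently (a sketch of my own using broom, not part of the original answer), broom::tidy() turns the htest object into a one-row data frame in which estimate is the difference of the means:

library(broom)

## same Welch test via the formula interface; droplevels() is needed because
## subsetting iris keeps the unused third Species level
dat2 <- droplevels(subset(iris, Species %in% c("setosa", "virginica")))
tidy(t.test(Sepal.Length ~ Species, data = dat2))[
  , c("estimate", "conf.low", "conf.high")]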

Bootstrap Confidence Intervals for more than one statistic through the boot.ci function

The boot package is (IMO) a little clunky for regular use. The short answer is that you need to specify index (default value 1) to boot.ci, e.g. boot.ci(boot.out, index = 2). The long answer is that it would certainly be convenient to get the bootstrap CIs for all of the bootstrap statistics at once!
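
The question's boot.out object isn't reproduced here; as a self-contained stand-in, one hypothetical way to build a bootstrap object holding two statistics (the data and statistic function are assumptions):

library(boot)

## hypothetical setup: bootstrap the mean (index 1) and the sd (index 2)
set.seed(101)
boot.out <- boot(data = rnorm(50),
                 statistic = function(d, i) c(mean(d[i]), sd(d[i])),
                 R = 2000)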

Get all CIs for a specified result slot:

getCI <- function(x, w) {
  b1 <- boot.ci(x, index = w)
  ## extract info for all CI types
  tab <- t(sapply(b1[-(1:3)], function(x) tail(c(x), 2)))
  ## combine with metadata: CI method, index
  tab <- cbind(w, rownames(tab), as.data.frame(tab))
  colnames(tab) <- c("index", "method", "lwr", "upr")
  tab
}
## do it for both parameters
do.call(rbind, lapply(1:2, getCI, x = boot.out))

Results (maybe not what you want, but easy to reshape):

         index  method        lwr        upr
normal       1  normal -1.2533079 -0.3479490
basic        1   basic -1.1547310 -0.4789996
percent      1 percent -0.4841726  0.1915588
bca          1     bca -0.4841726 -0.4628899
normal1      2  normal  0.6288945  1.6086459
basic1       2   basic  0.5727462  1.4789105
percent1     2 percent  0.3589388  1.2651031
bca1         2     bca  0.6819394  1.2651031
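
For instance, one possible reshape (assuming you want one row per index with a lwr/upr column pair per CI method) is a single tidyr::pivot_wider() call:

library(tidyr)

allCI <- do.call(rbind, lapply(1:2, getCI, x = boot.out))
pivot_wider(allCI, names_from = method, values_from = c(lwr, upr))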

Alternatively, if you can live with getting one bootstrap method at a time, my version of the broom package on GitHub has this capability (I've submitted a pull request).

## devtools::install_github("bbolker/broom")
library(broom)
tidy(boot.out,conf.int=TRUE,conf.method="perc")

ddply summarise with Confidence interval?

updated answer

I'm not sure, but I guess what you actually want to do is filter out all values that fall outside mean(x) +/- 2*sd(x), and this within each group. The following approach does that. In the case of ggplot2's diamonds dataset it keeps about 97% of all values and just removes the extremes.

library(tidyverse)

diamonds %>%
  group_by(cut, color) %>%
  mutate(across(c(x, y, z),
                list(low  = ~ mean(.x, na.rm = TRUE) - 2 * sd(.x, na.rm = TRUE),
                     high = ~ mean(.x, na.rm = TRUE) + 2 * sd(.x, na.rm = TRUE)))) %>%
  filter(x >= x_low & x <= x_high,
         y >= y_low & y <= y_high,
         z >= z_low & z <= z_high)
#> # A tibble: 52,299 x 16
#> # Groups:   cut, color [35]
#>    carat cut   color clarity depth table price     x     y     z x_low x_high
#>    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43  3.51   6.92
#>  2 0.21  Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31  3.52   7.65
#>  3 0.290 Prem~ I     VS2      62.4    58   334  4.2   4.23  2.63  3.86   9.12
#>  4 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75  4.14   8.62
#>  5 0.24  Very~ I     VVS1     62.3    57   336  3.95  3.98  2.47  3.92   8.62
#>  6 0.26  Very~ H     SI1      61.9    55   337  4.07  4.11  2.53  3.66   8.30
#>  7 0.23  Very~ H     VS1      59.4    61   338  4     4.05  2.39  3.66   8.30
#>  8 0.3   Good  J     SI1      64      55   339  4.25  4.28  2.73  4.14   8.62
#>  9 0.23  Ideal J     VS1      62.8    56   340  3.93  3.9   2.46  3.88   8.76
#> 10 0.31  Ideal J     SI2      62.2    54   344  4.35  4.37  2.71  3.88   8.76
#> # ... with 52,289 more rows, and 4 more variables: y_low <dbl>, y_high <dbl>,
#> #   z_low <dbl>, z_high <dbl>

Created on 2020-06-23 by the reprex package (v0.3.0)

old answer

With better example data we can take a more programmatic approach. As an example I use ggplot2's diamonds dataset. See my comments in the code below.

library(tidyverse)

diamonds %>%
  # set up your groups
  nest_by(cut, color) %>%
  # define in `across()` the variables for which means and confidence
  # intervals should be calculated
  mutate(ttest = list(summarise(data, across(c(x, y, z),
                                             ~ broom::tidy(t.test(.x))))),
         ttest = list(unpack(ttest, c(x, y, z), names_sep = "_") %>%
                        # keep only the estimates and confidence intervals
                        select(contains("estimate"), contains("conf")))) %>%
  unnest(ttest)
#> # A tibble: 35 x 12
#> # Groups:   cut, color [35]
#>    cut   color      data x_estimate y_estimate z_estimate x_conf.low x_conf.high
#>    <ord> <ord> <list<tb>      <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
#>  1 Fair  D     [163 × 8]       6.02       5.96       3.84       5.89        6.15
#>  2 Fair  E     [224 × 8]       5.91       5.86       3.72       5.80        6.02
#>  3 Fair  F     [312 × 8]       5.99       5.93       3.79       5.89        6.09
#>  4 Fair  G     [314 × 8]       6.17       6.11       3.96       6.06        6.28
#>  5 Fair  H     [303 × 8]       6.58       6.50       4.22       6.47        6.69
#>  6 Fair  I     [175 × 8]       6.56       6.49       4.19       6.43        6.70
#>  7 Fair  J     [119 × 8]       6.75       6.68       4.32       6.55        6.95
#>  8 Good  D     [662 × 8]       5.62       5.63       3.50       5.55        5.69
#>  9 Good  E     [933 × 8]       5.62       5.63       3.50       5.56        5.68
#> 10 Good  F     [909 × 8]       5.69       5.71       3.54       5.63        5.76
#> # … with 25 more rows, and 4 more variables: y_conf.low <dbl>,
#> #   y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>

Created on 2020-06-19 by the reprex package (v0.3.0)

If you want to filter observations based on the confidence intervals of the means, you can adjust my approach above as follows. Note that this is not the same as filtering the top and bottom 2.5% of each variable; you will lose a lot of data.

library(tidyverse)

diamonds %>%
  nest_by(cut, color) %>%
  mutate(ttest = summarise(data, across(c(x, y, z),
                                        ~ broom::tidy(t.test(.x)))) %>%
           unpack(c(x, y, z), names_sep = "_")) %>%
  unpack(ttest) %>%
  select(cut, color, data, contains("estimate"), contains("conf")) %>%
  rowwise(cut, color) %>%
  mutate(data = list(filter(data,
                            x >= x_conf.low & x <= x_conf.high,
                            y >= y_conf.low & y <= y_conf.high,
                            z >= z_conf.low & z <= z_conf.high))) %>%
  unnest(data)
#> # A tibble: 322 x 19
#> # Groups:   cut, color [30]
#>    cut   color carat clarity depth table price     x     y     z x_estimate
#>    <ord> <ord> <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <dbl>
#>  1 Fair  D      0.91 SI2      62.5    66  3079  6.08  6.01  3.78       6.02
#>  2 Fair  D      0.9  SI2      65.7    60  3205  5.98  5.93  3.91       6.02
#>  3 Fair  D      0.9  SI2      64.7    59  3205  6.09  5.99  3.91       6.02
#>  4 Fair  D      0.95 SI2      64.4    60  3384  6.06  6.02  3.89       6.02
#>  5 Fair  D      0.9  SI2      64.9    57  3473  6.03  5.98  3.9        6.02
#>  6 Fair  D      0.9  SI2      64.5    61  3473  6.1   6     3.9        6.02
#>  7 Fair  D      0.9  SI1      64.5    61  3689  6.05  6.01  3.89       6.02
#>  8 Fair  D      0.91 SI1      64.7    61  3730  6.06  5.99  3.9        6.02
#>  9 Fair  D      0.9  SI2      64.6    59  3847  6.04  6.01  3.89       6.02
#> 10 Fair  D      0.91 SI1      64.4    60  3855  6.08  6.04  3.9        6.02
#> # ... with 312 more rows, and 8 more variables: y_estimate <dbl>,
#> #   z_estimate <dbl>, x_conf.low <dbl>, x_conf.high <dbl>, y_conf.low <dbl>,
#> #   y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>

Created on 2020-06-22 by the reprex package (v0.3.0)


