Calculating Length of 95%-CI Using dplyr

Calculating length of 95%-CI using dplyr

You could do it manually, using a few extra functions in summarise followed by mutate:

library(dplyr)
mtcars %>%
  group_by(vs) %>%
  summarise(mean.mpg = mean(mpg, na.rm = TRUE),
            sd.mpg = sd(mpg, na.rm = TRUE),
            n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)

#> Source: local data frame [2 x 7]
#> 
#>      vs mean.mpg   sd.mpg n.mpg    se.mpg lower.ci.mpg upper.ci.mpg
#>   (dbl)    (dbl)    (dbl) (int)     (dbl)        (dbl)        (dbl)
#> 1     0 16.61667 3.860699    18 0.9099756     14.69679     18.53655
#> 2     1 24.55714 5.378978    14 1.4375924     21.45141     27.66287
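The table above gives the two bounds but not the length asked for in the title. A minimal sketch of one way to get it, assuming "length" means upper bound minus lower bound (the column names t.crit and ci.length.mpg are made up for this illustration):

```r
library(dplyr)

ci_width <- mtcars %>%
  group_by(vs) %>%
  summarise(mean.mpg = mean(mpg, na.rm = TRUE),
            sd.mpg = sd(mpg, na.rm = TRUE),
            n.mpg = sum(!is.na(mpg))) %>%   # count only non-missing values
  mutate(t.crit = qt(1 - (0.05 / 2), df = n.mpg - 1),
         # upper - lower = 2 * t quantile * standard error
         ci.length.mpg = 2 * t.crit * sd.mpg / sqrt(n.mpg))

ci_width
```

Because the interval is symmetric around the mean, the length is simply twice the margin of error, so the two bounds do not have to be computed first.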

Calculating upper and lower confidence intervals by group in dplyr summarise()

The output of mean_ci is a vector of length 3. This may be unexpected, because the package adds a print method that makes the result look like a single character value in the console rather than a numeric vector of length 3. You can see the underlying data structure with str:

mean_ci(dat$num) %>% str
 # 'qwraps2_mean_ci' Named num [1:3] 2.44 1.05 3.82
 # - attr(*, "names")= chr [1:3] "mean" "lcl" "ucl"
 # - attr(*, "alpha")= num 0.05

In summarize, each element of each column of the output needs to be length 1, so supplying a length-3 object for a single "cell" (column element) results in an error. A workaround is to wrap the length-3 vector in a list, which is itself a length-1 object. You can then use unnest_wider to separate it into 3 columns (making the table "wider"):

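library(tidyverse)
library(qwraps2)  # provides mean_ci() and n_perc()

dat %>%
  group_by(type) %>%
  summarise(N = n(),
            mean.ci = list(mean_ci(num)),   # wrap the length-3 result in a list
            Percent = n_perc(num > 0)) %>%
  unnest_wider(mean.ci)                     # spread mean, lcl, ucl into columns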
# # A tibble: 2 x 6
#   type      N  mean   lcl   ucl Percent       
#   <fct> <int> <dbl> <dbl> <dbl> <chr>         
# 1 A         8  2.25 0.523  3.98 "6 (75.00\\%)"
# 2 B         8  2.62 0.344  4.91 "4 (50.00\\%)"
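The same list-then-unnest_wider pattern works with any function that returns a named vector. Here is a rough sketch that runs without the question's dat, using mtcars and a hypothetical helper mean_ci_vec. The helper is t-based and is not meant to reproduce the exact numbers of qwraps2's mean_ci:

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: returns a named numeric vector of length 3
mean_ci_vec <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  n <- length(x)
  half <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lcl = mean(x) - half, ucl = mean(x) + half)
}

wide <- mtcars %>%
  group_by(vs) %>%
  summarise(N = n(),
            ci = list(mean_ci_vec(mpg))) %>%  # one list element per group
  unnest_wider(ci)                            # names become column names

wide
```

The names of the vector (mean, lcl, ucl) become the new column names, which is why returning a named vector from the helper is convenient.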

Calculating difference of two means and its confidence interval

This can be done with base R, because t.test also reports a confidence interval for the mean difference:

> res <- t.test(iris$Sepal.Length[iris$Species=="setosa"],
+               iris$Sepal.Length[iris$Species=="virginica"])
> res$conf.int
[1] -1.78676 -1.37724
attr(,"conf.level")
[1] 0.95

The mean values are stored in the entry estimate, and you can compute the difference of the means with

> res$estimate[1] - res$estimate[2]
mean of x 
   -1.582

or equivalently

> sum(res$estimate * c(1,-1))
[1] -1.582
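To see where that interval comes from, here is a sketch that rebuilds it by hand. It assumes the default settings of t.test (unequal variances, i.e. the Welch approximation, and a 95% level); the variable names are chosen for this illustration only:

```r
x <- iris$Sepal.Length[iris$Species == "setosa"]
y <- iris$Sepal.Length[iris$Species == "virginica"]

# squared standard errors of the two group means
vx <- var(x) / length(x)
vy <- var(y) / length(y)

se <- sqrt(vx + vy)                       # standard error of the difference
# Welch-Satterthwaite approximation of the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

diff_means <- mean(x) - mean(y)
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se
ci
```

The result should agree with res$conf.int from the t.test call above.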

Bootstrap Confidence Intervals for more than one statistics through boot.ci function

The boot package is (IMO) a little clunky for regular use. The short answer is that you need to specify index (default value is 1) to boot.ci, e.g. boot.ci(boot.out,index=2). The long answer is that it would certainly be convenient to get the bootstrap CIs for all of the bootstrap statistics at once!
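As a rough, self-contained illustration of the index argument: boot_demo and stat_fun below are made-up stand-ins, not the boot.out from the question, and the statistic returns the mean and standard deviation of mtcars$mpg.

```r
library(boot)

set.seed(1)

# statistic returning two values: index 1 = mean, index 2 = sd
stat_fun <- function(d, i) c(mean(d[i]), sd(d[i]))
boot_demo <- boot(mtcars$mpg, statistic = stat_fun, R = 999)

ci_mean <- boot.ci(boot_demo, index = 1, type = "perc")  # CI for the mean
ci_sd   <- boot.ci(boot_demo, index = 2, type = "perc")  # CI for the sd

# the last two entries of the row hold the lower and upper limits
ci_sd$percent[4:5]
```

Each call to boot.ci handles a single statistic, so a statistic function that returns k values needs k calls.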

Get all CI for a specified result slot:

library(boot)

## x: a "boot" object, w: index of the statistic of interest
getCI <- function(x,w) {
   b1 <- boot.ci(x,index=w)
   ## extract info for all CI types
   tab <- t(sapply(b1[-(1:3)],function(x) tail(c(x),2)))
   ## combine with metadata: CI method, index
   tab <- cbind(w,rownames(tab),as.data.frame(tab))
   colnames(tab) <- c("index","method","lwr","upr")
   tab
}
## do it for both parameters
do.call(rbind,lapply(1:2,getCI,x=boot.out))

Results (maybe not what you want, but easy to reshape):

         index  method        lwr        upr
normal       1  normal -1.2533079 -0.3479490
basic        1   basic -1.1547310 -0.4789996
percent      1 percent -0.4841726  0.1915588
bca          1     bca -0.4841726 -0.4628899
normal1      2  normal  0.6288945  1.6086459
basic1       2   basic  0.5727462  1.4789105
percent1     2 percent  0.3589388  1.2651031
bca1         2     bca  0.6819394  1.2651031

Alternatively, if you can live with getting one bootstrap method at a time, my version of the broom package on GitHub has this capability (I've submitted a pull request):

## devtools::install_github("bbolker/broom")
library(broom)
tidy(boot.out,conf.int=TRUE,conf.method="perc")

ddply summarise with Confidence interval?

updated answer

I'm not sure, but I guess what you actually want is to drop, within each group, all values that lie outside mean(x) -/+ 2*sd(x). The following approach does that. For ggplot2's diamonds data set it keeps about 97% of all rows and removes only the extremes.

library(tidyverse)

diamonds %>% 
  group_by(cut, color) %>% 
  mutate(across(c(x,y,z),
                list(low = ~ mean(.x, na.rm = TRUE) - 2 * sd(.x, na.rm = TRUE),
                     high = ~ mean(.x, na.rm = TRUE) + 2 * sd(.x, na.rm = TRUE))
                )
         ) %>% 
  filter(x >= x_low & x <= x_high,
         y >= y_low & y <= y_high,
         z >= z_low & z <= z_high)
#> # A tibble: 52,299 x 16
#> # Groups:   cut, color [35]
#>    carat cut   color clarity depth table price     x     y     z x_low x_high
#>    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43  3.51   6.92
#>  2 0.21  Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31  3.52   7.65
#>  3 0.290 Prem~ I     VS2      62.4    58   334  4.2   4.23  2.63  3.86   9.12
#>  4 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75  4.14   8.62
#>  5 0.24  Very~ I     VVS1     62.3    57   336  3.95  3.98  2.47  3.92   8.62
#>  6 0.26  Very~ H     SI1      61.9    55   337  4.07  4.11  2.53  3.66   8.30
#>  7 0.23  Very~ H     VS1      59.4    61   338  4     4.05  2.39  3.66   8.30
#>  8 0.3   Good  J     SI1      64      55   339  4.25  4.28  2.73  4.14   8.62
#>  9 0.23  Ideal J     VS1      62.8    56   340  3.93  3.9   2.46  3.88   8.76
#> 10 0.31  Ideal J     SI2      62.2    54   344  4.35  4.37  2.71  3.88   8.76
#> # ... with 52,289 more rows, and 4 more variables: y_low <dbl>, y_high <dbl>,
#> #   z_low <dbl>, z_high <dbl>

Created on 2020-06-23 by the reprex package (v0.3.0)
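If the helper columns are not needed afterwards, the same idea can be sketched more compactly with if_all inside a grouped filter. This assumes dplyr 1.0.4 or later, and kept is just an illustrative name:

```r
library(dplyr)

kept <- ggplot2::diamonds %>%
  group_by(cut, color) %>%
  # keep rows where x, y and z all lie within 2 sd of their group mean
  filter(if_all(c(x, y, z),
                ~ abs(.x - mean(.x, na.rm = TRUE)) <= 2 * sd(.x, na.rm = TRUE))) %>%
  ungroup()

nrow(kept) / nrow(ggplot2::diamonds)  # share of rows that survive the filter
```

Because the condition is written once and applied to every selected column, there is no chance of comparing a variable against another variable's bounds.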

old answer

With better example data we could achieve a more programmatic approach. As an example I use ggplot2's diamonds dataset. See my comments in the code below.

library(tidyverse)

diamonds %>% 
  # set up your groups
  nest_by(cut, color) %>%  
  # define in `across` for which variables you want means and conf int to be calculated 
  mutate(ttest = list(summarise(data, across(c(x,y,z), 
                                          ~ broom::tidy(t.test(.x))))),
         ttest = list(unpack(ttest, c(x, y, z), names_sep = "_") %>% 
   # select only the estimates and conf intervals
                        select(contains("estimate"), contains("conf")))) %>% 
  unnest(ttest)
#> # A tibble: 35 x 12
#> # Groups:   cut, color [35]
#>    cut   color      data x_estimate y_estimate z_estimate x_conf.low x_conf.high
#>    <ord> <ord> <list<tb>      <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
#>  1 Fair  D     [163 × 8]       6.02       5.96       3.84       5.89        6.15
#>  2 Fair  E     [224 × 8]       5.91       5.86       3.72       5.80        6.02
#>  3 Fair  F     [312 × 8]       5.99       5.93       3.79       5.89        6.09
#>  4 Fair  G     [314 × 8]       6.17       6.11       3.96       6.06        6.28
#>  5 Fair  H     [303 × 8]       6.58       6.50       4.22       6.47        6.69
#>  6 Fair  I     [175 × 8]       6.56       6.49       4.19       6.43        6.70
#>  7 Fair  J     [119 × 8]       6.75       6.68       4.32       6.55        6.95
#>  8 Good  D     [662 × 8]       5.62       5.63       3.50       5.55        5.69
#>  9 Good  E     [933 × 8]       5.62       5.63       3.50       5.56        5.68
#> 10 Good  F     [909 × 8]       5.69       5.71       3.54       5.63        5.76
#> # … with 25 more rows, and 4 more variables: y_conf.low <dbl>,
#> #   y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>

Created on 2020-06-19 by the reprex package (v0.3.0)
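For a single variable, a lighter-weight sketch is to pull the interval straight out of t.test inside a grouped summarise. This is shown on mtcars so that it runs quickly, and the column names simply mirror broom's naming:

```r
library(dplyr)

ci_by_group <- mtcars %>%
  group_by(vs) %>%
  summarise(estimate  = mean(mpg),
            conf.low  = t.test(mpg)$conf.int[1],   # lower 95% bound
            conf.high = t.test(mpg)$conf.int[2])   # upper 95% bound

ci_by_group
```

This calls t.test twice per group, which is fine for a handful of groups. The nest_by version above scales better to many variables.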

If you want to filter observations based on the confidence intervals of the means, you can adjust my approach above as follows. Note that this is not the same as filtering out the top and bottom 2.5% of each variable; you will lose a lot of data.

library(tidyverse)

diamonds %>% 
  nest_by(cut, color) %>% 
  mutate(ttest = summarise(data, across(c(x,y,z), 
                                             ~ broom::tidy(t.test(.x)))) %>% 
         unpack(c(x,y,z), names_sep = "_")) %>% 
  unpack(ttest) %>% 
  select(cut, color, data, contains("estimate"), contains("conf")) %>% 
  rowwise(cut, color) %>% 
  mutate(data = list(filter(data,
                       x >= x_conf.low & x <= x_conf.high,
                       y >= y_conf.low & y <= y_conf.high,
                       z >= z_conf.low & z <= z_conf.high))) %>% 
  unnest(data)
#> # A tibble: 322 x 19
#> # Groups:   cut, color [30]
#>    cut   color carat clarity depth table price     x     y     z x_estimate
#>    <ord> <ord> <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <dbl>
#>  1 Fair  D      0.91 SI2      62.5    66  3079  6.08  6.01  3.78       6.02
#>  2 Fair  D      0.9  SI2      65.7    60  3205  5.98  5.93  3.91       6.02
#>  3 Fair  D      0.9  SI2      64.7    59  3205  6.09  5.99  3.91       6.02
#>  4 Fair  D      0.95 SI2      64.4    60  3384  6.06  6.02  3.89       6.02
#>  5 Fair  D      0.9  SI2      64.9    57  3473  6.03  5.98  3.9        6.02
#>  6 Fair  D      0.9  SI2      64.5    61  3473  6.1   6     3.9        6.02
#>  7 Fair  D      0.9  SI1      64.5    61  3689  6.05  6.01  3.89       6.02
#>  8 Fair  D      0.91 SI1      64.7    61  3730  6.06  5.99  3.9        6.02
#>  9 Fair  D      0.9  SI2      64.6    59  3847  6.04  6.01  3.89       6.02
#> 10 Fair  D      0.91 SI1      64.4    60  3855  6.08  6.04  3.9        6.02
#> # ... with 312 more rows, and 8 more variables: y_estimate <dbl>,
#> #   z_estimate <dbl>, x_conf.low <dbl>, x_conf.high <dbl>, y_conf.low <dbl>,
#> #   y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>

Created on 2020-06-22 by the reprex package (v0.3.0)
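Why so few rows survive: a confidence interval for the mean narrows with the square root of the sample size, whereas the central 95% of the observations does not. A small sketch on simulated data (not the diamonds data) makes the difference visible:

```r
set.seed(42)
x <- rnorm(1000, mean = 6, sd = 1)   # simulated stand-in for one variable

ci <- t.test(x)$conf.int             # 95% CI for the mean
q  <- quantile(x, c(0.025, 0.975))   # central 95% of the observations

share_in_ci <- mean(x >= ci[1] & x <= ci[2])  # fraction inside the CI
share_in_q  <- mean(x >= q[1] & x <= q[2])    # fraction inside the quantiles

c(in_ci = share_in_ci, in_quantiles = share_in_q)
```

Only a small fraction of the observations falls inside the interval for the mean, while roughly 95% fall between the two quantiles.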
