Calculating length of 95%-CI using dplyr
You could do it manually using mutate
a few extra functions in summarise
library(dplyr)
mtcars %>%
group_by(vs) %>%
summarise(mean.mpg = mean(mpg, na.rm = TRUE),
sd.mpg = sd(mpg, na.rm = TRUE),
n.mpg = n()) %>%
mutate(se.mpg = sd.mpg / sqrt(n.mpg),
lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)
#> Source: local data frame [2 x 7]
#>
#> vs mean.mpg sd.mpg n.mpg se.mpg lower.ci.mpg upper.ci.mpg
#> (dbl) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
#> 1 0 16.61667 3.860699 18 0.9099756 14.69679 18.53655
#> 2 1 24.55714 5.378978 14 1.4375924 21.45141 27.66287
Calculating upper and lower confidence intervals by group in dplyr summarise()
The output of mean_ci
is a vector of length 3. This is maybe unexpected because the package has added a print method so that when you see this in the console it looks like a single character value and not a numeric length > 1 vector. But, you can see the underlying data structure by looking at str
.
mean_ci(dat$num) %>% str
# 'qwraps2_mean_ci' Named num [1:3] 2.44 1.05 3.82
# - attr(*, "names")= chr [1:3] "mean" "lcl" "ucl"
# - attr(*, "alpha")= num 0.05
In summarize, each element of each column of the output needs to be length 1, so providing a length 3 object for summarize to put in a single "cell" (column element) results in an error. A workaround is to put the length 3 vector in a list, so that it is now a length 1 list. Then you can use unnest_wider
to separate it into 3 columns (and therefore making the table "wider")
library(tidyverse)
dat %>%
group_by(type) %>%
summarise( N=n(),
mean.ci = list(mean_ci(num)),
"Percent"= n_perc(num > 0)) %>%
unnest_wider(mean.ci)
# # A tibble: 2 x 6
# type N mean lcl ucl Percent
# <fct> <int> <dbl> <dbl> <dbl> <chr>
# 1 A 8 2.25 0.523 3.98 "6 (75.00\\%)"
# 2 B 8 2.62 0.344 4.91 "4 (50.00\\%)"
Calculating difference of two means and its confidence interval
This can be done with base R, because t.test
also reports a confidence interval for the mean difference:
> res <- t.test(iris$Sepal.Length[iris$Species=="setosa"],
+ iris$Sepal.Length[iris$Species=="virginica"])
> res$conf.int
[1] -1.78676 -1.37724
attr(,"conf.level")
[1] 0.95
The mean values are stored in the entry estimate, and you can compute the difference of the means with
> res$estimate[1] - res$estimate[2]
mean of x
-1.582
or equivalently
> sum(res$estimate * c(1,-1))
[1] -1.582
Bootstrap Confidence Intervals for more than one statistics through boot.ci function
The boot
package is (IMO) a little clunky for regular use. The short answer is that you need to specify index
(default value is 1) to boot.ci
, e.g. boot.ci(boot.out,index=2)
. The long answer is that it would certainly be convenient to get the bootstrap CIs for all of the bootstrap statistics at once!
Get all CI for a specified result slot:
getCI <- function(x,w) {
b1 <- boot.ci(x,index=w)
## extract info for all CI types
tab <- t(sapply(b1[-(1:3)],function(x) tail(c(x),2)))
## combine with metadata: CI method, index
tab <- cbind(w,rownames(tab),as.data.frame(tab))
colnames(tab) <- c("index","method","lwr","upr")
tab
}
## do it for both parameters
do.call(rbind,lapply(1:2,getCI,x=boot.out))
Results (maybe not what you want, but easy to reshape):
index method lwr upr
normal 1 normal -1.2533079 -0.3479490
basic 1 basic -1.1547310 -0.4789996
percent 1 percent -0.4841726 0.1915588
bca 1 bca -0.4841726 -0.4628899
normal1 2 normal 0.6288945 1.6086459
basic1 2 basic 0.5727462 1.4789105
percent1 2 percent 0.3589388 1.2651031
bca1 2 bca 0.6819394 1.2651031
Alternatively, if you can live with getting one bootstrap method at a time, my version of the broom
package on Github has this capability (I've submitted a pull request)
## devtools::install_github("bbolker/broom")
library(broom)
tidy(boot.out,conf.int=TRUE,conf.method="perc")
ddply summarise with Confidence interval?
updated answer
I'm not sure, but I guess what you actually want to do is filter all values that are larger / smaller than mean(x) -/+ 2*sd(x)
and this by each group. The following approach would do that. In the case of ggplot2
s Diamond data set it keeps about 97% of all values and just removes the extremes.
library(tidyverse)
diamonds %>%
group_by(cut, color) %>%
mutate(across(c(x,y,z),
list(low = ~ mean(.x, na.rm = TRUE) - 2 * sd(.x, na.rm = TRUE),
high = ~ mean(.x, na.rm = TRUE) + 2 * sd(.x, na.rm = TRUE))
)
) %>%
filter(x >= x_low & x <= x_high,
y >= x_low & y <= y_high,
z >= z_low & z <= z_high)
#> # A tibble: 52,299 x 16
#> # Groups: cut, color [35]
#> carat cut color clarity depth table price x y z x_low x_high
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3.51 6.92
#> 2 0.21 Prem~ E SI1 59.8 61 326 3.89 3.84 2.31 3.52 7.65
#> 3 0.290 Prem~ I VS2 62.4 58 334 4.2 4.23 2.63 3.86 9.12
#> 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 4.14 8.62
#> 5 0.24 Very~ I VVS1 62.3 57 336 3.95 3.98 2.47 3.92 8.62
#> 6 0.26 Very~ H SI1 61.9 55 337 4.07 4.11 2.53 3.66 8.30
#> 7 0.23 Very~ H VS1 59.4 61 338 4 4.05 2.39 3.66 8.30
#> 8 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 4.14 8.62
#> 9 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 3.88 8.76
#> 10 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 3.88 8.76
#> # ... with 52,289 more rows, and 4 more variables: y_low <dbl>, y_high <dbl>,
#> # z_low <dbl>, z_high <dbl>
Created on 2020-06-23 by the reprex package (v0.3.0)
old answer
With better example data we could achieve a more programmatic approach. As example I use ggplot2
s diamonds
dataset. See my comments in the code below.
library(tidyverse)
diamonds %>%
# set up your groups
nest_by(cut, color) %>%
# define in `across` for which variables you want means and conf int to be calculated
mutate(ttest = list(summarise(data, across(c(x,y,z),
~ broom::tidy(t.test(.x))))),
ttest = list(unpack(ttest, c(x, y, z), names_sep = "_") %>%
# select only the estimates and conf intervalls
select(contains("estimate"), contains("conf")))) %>%
unnest(ttest)
#> # A tibble: 35 x 12
#> # Groups: cut, color [35]
#> cut color data x_estimate y_estimate z_estimate x_conf.low x_conf.high
#> <ord> <ord> <list<tb> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair D [163 × 8] 6.02 5.96 3.84 5.89 6.15
#> 2 Fair E [224 × 8] 5.91 5.86 3.72 5.80 6.02
#> 3 Fair F [312 × 8] 5.99 5.93 3.79 5.89 6.09
#> 4 Fair G [314 × 8] 6.17 6.11 3.96 6.06 6.28
#> 5 Fair H [303 × 8] 6.58 6.50 4.22 6.47 6.69
#> 6 Fair I [175 × 8] 6.56 6.49 4.19 6.43 6.70
#> 7 Fair J [119 × 8] 6.75 6.68 4.32 6.55 6.95
#> 8 Good D [662 × 8] 5.62 5.63 3.50 5.55 5.69
#> 9 Good E [933 × 8] 5.62 5.63 3.50 5.56 5.68
#> 10 Good F [909 × 8] 5.69 5.71 3.54 5.63 5.76
#> # … with 25 more rows, and 4 more variables: y_conf.low <dbl>,
#> # y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
Created on 2020-06-19 by the reprex package (v0.3.0)
If you want to filter observations based on the confidence iIntervalls of the means you can adjust my approach above as follows. Note that this is not the same as filtering the top and bottom 2.5 % of each variable, you will loose a lot of data.
library(tidyverse)
diamonds %>%
nest_by(cut, color) %>%
mutate(ttest = summarise(data, across(c(x,y,z),
~ broom::tidy(t.test(.x)))) %>%
unpack(c(x,y,z), names_sep = "_")) %>%
unpack(ttest) %>%
select(cut, color, data, contains("estimate"), contains("conf")) %>%
rowwise(cut, color) %>%
mutate(data = list(filter(data,
x >= x_conf.low & x <= x_conf.high,
y >= x_conf.low & y <= y_conf.high,
z >= z_conf.low & z <= z_conf.high))) %>%
unnest(data)
#> # A tibble: 322 x 19
#> # Groups: cut, color [30]
#> cut color carat clarity depth table price x y z x_estimate
#> <ord> <ord> <dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair D 0.91 SI2 62.5 66 3079 6.08 6.01 3.78 6.02
#> 2 Fair D 0.9 SI2 65.7 60 3205 5.98 5.93 3.91 6.02
#> 3 Fair D 0.9 SI2 64.7 59 3205 6.09 5.99 3.91 6.02
#> 4 Fair D 0.95 SI2 64.4 60 3384 6.06 6.02 3.89 6.02
#> 5 Fair D 0.9 SI2 64.9 57 3473 6.03 5.98 3.9 6.02
#> 6 Fair D 0.9 SI2 64.5 61 3473 6.1 6 3.9 6.02
#> 7 Fair D 0.9 SI1 64.5 61 3689 6.05 6.01 3.89 6.02
#> 8 Fair D 0.91 SI1 64.7 61 3730 6.06 5.99 3.9 6.02
#> 9 Fair D 0.9 SI2 64.6 59 3847 6.04 6.01 3.89 6.02
#> 10 Fair D 0.91 SI1 64.4 60 3855 6.08 6.04 3.9 6.02
#> # ... with 312 more rows, and 8 more variables: y_estimate <dbl>,
#> # z_estimate <dbl>, x_conf.low <dbl>, x_conf.high <dbl>, y_conf.low <dbl>,
#> # y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
Created on 2020-06-22 by the reprex package (v0.3.0)
Related Topics
How to Create a Continuous Density Heatmap of 2D Scatter Data in R
R - Common Title and Legend for Combined Plots
Difference Between As.Data.Frame(X) and Data.Frame(X)
Label Minimum and Maximum of Scale Fill Gradient Legend with Text: Ggplot2
Read Lines by Number from a Large File
How to Prevent Exposure of My Password When Using Rgoogledocs
Street Address to Geolocation Lat/Long
R Change All Columns of Type Factor to Numeric
Difference Between Paste() and Paste0()
How to Add Rmse, Slope, Intercept, R^2 to R Plot
How to Do Selective Labeling with Ggplot Geom_Point()
How to Jitter Text to Avoid Overlap in a Ggplot2 Scatterplot
Plotting the Average Values for Each Level in Ggplot2
Updating Column in One Dataframe with Value from Another Dataframe Based on Matching Values
Ggally::Ggpairs Plot Without Gridlines When Plotting Correlation Coefficient
How to Write Dplyr Groups to Separate Files
How to Interpret Lm() Coefficient Estimates When Using Bs() Function for Splines