Mean by factor by level
Just for fun, here is the data.table solution, although you should probably do what @lukeA suggested:
library(data.table)
A <- setDT(df)[factor == "a", mean(value)]
## [1] 1.5
R: mean of all cases with a certain factor level
Since there are missing values (NA) in the dataset, you need to specify the argument na.rm = TRUE within the mean function. Otherwise, if at least one value is NA, the mean function (as well as other functions like sum, min, max, ...) will return NA.
mean(flights$air_time[flights$carrier == "UA"], na.rm = TRUE)
# [1] 211.7914
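As a small illustration of how NA propagates, here is a sketch on a hypothetical toy data frame (toy, with made-up values; it is not the flights data):

```r
# Hypothetical toy data standing in for flights (values are made up)
toy <- data.frame(carrier  = c("UA", "UA", "UA", "AA"),
                  air_time = c(100, NA, 300, 50))

# Air times for one factor level only
ua_time <- toy$air_time[toy$carrier == "UA"]

mean(ua_time)                 # NA, because one UA value is missing
mean(ua_time, na.rm = TRUE)   # 200, the mean of the two observed values
max(ua_time, na.rm = TRUE)    # 300, the same argument works for sum, min, max
```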
Mean by levels of factor in R, append as new column
You could use ave from base R:
test$meanbyname <- with(test, ave(value, name))
Alternatively, mutate from dplyr or := in data.table gives the same result, i.e.
library(dplyr)
group_by(test, name) %>%
mutate(meanbyname=mean(value))
Or
library(data.table)
setDT(test)[, meanbyname:= mean(value), by=name]
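To make the ave approach concrete, here is a minimal base-R sketch. The contents of test are made up for illustration, since the original data are not shown:

```r
# Hypothetical data; the real `test` from the question is not shown
test <- data.frame(name  = c("a", "a", "b", "b", "b"),
                   value = c(1, 2, 3, 4, 8))

# ave() returns one value per row: the mean of that row's group
test$meanbyname <- with(test, ave(value, name))
test$meanbyname   # 1.5 1.5 5.0 5.0 5.0

# Cross-check: per-group means from tapply(), looked up by each row's name
lookup <- tapply(test$value, test$name, mean)
all.equal(test$meanbyname, unname(as.vector(lookup[test$name])))   # TRUE
```

ave keeps the original row order and length, which is why it can be assigned straight back as a new column.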
R column mean by factor
Here's another way
library(data.table)
cols <- paste0("v", 2:5) # set the columns you want to operate on
setDT(data)[, Sums := rowSums(.SD), .SDcols = cols]
data[, list(Means = sum(Sums)/(.N*length(cols))), by = name]
## name Means
## 1: a 3.75
## 2: b 6.50
## 3: c 5.00
Edit
Per @Arun's suggestion, this would probably be much better:
setDT(data)[, mean(c(v2,v3,v4,v5)), by=name]
## name V1
## 1: a 3.75
## 2: b 6.50
## 3: c 5.00
Or, per @Ananda's suggestion:
library(reshape2)
melt(setDT(data), id.vars = "name", measure.vars = cols)[, mean(value), by = name]
## name V1
## 1: a 3.75
## 2: b 6.50
## 3: c 5.00
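For comparison, the same pooled mean (all selected columns together, per group) can be sketched in base R. The data frame toy and its values are hypothetical; only the column-naming pattern follows the answer above:

```r
# Hypothetical data with two value columns (the original `data` is not shown)
toy <- data.frame(name = c("a", "a", "b"),
                  v2   = c(1, 3, 5),
                  v3   = c(2, 4, 6))
cols <- c("v2", "v3")   # columns to pool

# Split the selected columns by group, then average all their cells together
pooled <- sapply(split(toy[cols], toy$name), function(d) mean(unlist(d)))
pooled
#   a   b
# 2.5 5.5
```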
Mean function in R (Dealing with factors)
It looks like x is a factor. There is a gotcha when converting factors to numbers. You need to use:
mean(as.numeric(as.character(x)), na.rm=TRUE)
If you don't convert to character first, you will get the underlying factor codes.
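A small sketch of the gotcha, using a made-up factor (the values are hypothetical, chosen so that the codes and the labels clearly differ):

```r
# Hypothetical factor whose labels look like numbers
x <- factor(c("10", "20", "20", "5"))

levels(x)                     # "10" "20" "5"  (sorted as strings)
as.numeric(x)                 # 1 2 2 3        (internal codes, not the values)
as.numeric(as.character(x))   # 10 20 20 5     (the values you actually want)

mean(as.numeric(x))                 # 2      (wrong: mean of the codes)
mean(as.numeric(as.character(x)))   # 13.75  (right: mean of the values)
```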
Comparing all factor levels to the grand mean: can I tweak contrasts in linear model fitting to show all levels?
This answer shows you how to obtain the following coefficient table:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
#A -0.19238543 0.6619845 -0.29061922 0.7902750
#B 0.40884591 0.6619845 0.61760645 0.5805485
#C -0.21646049 0.6619845 -0.32698723 0.7651640
Amazing, isn't it? It mimics what you see from summary(fit), i.e.,
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
#x1 -0.19238543 0.6619845 -0.29061922 0.7902750
#x2 0.40884591 0.6619845 0.61760645 0.5805485
But now we have all factor levels displayed.
Why does the lm summary not display all factor levels?
In 2016, I answered this Stack Overflow question: `lm` summary not display all factor levels and since then, it has become the target for marking duplicated questions on similar topics.
To recap, the basic idea is that in order to have a full-rank design matrix for least squares fitting, we must apply contrasts to a factor variable. Let's say that the factor has N levels, then no matter what type of contrasts we choose (see ?contrasts for a list), it reduces the raw N dummy variables to a new set of N - 1 variables. Therefore, only N - 1 coefficients are associated with an N-level factor.
However, we can transform the N - 1 coefficients back to the original N coefficients using the contrasts matrix. The transformation enables us to obtain a coefficient table for all factor levels. I will now demonstrate how to do this, based on OP's reproducible example:
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))
In this example, the sum-to-zero contrast is applied to factor x. To learn more about how to control contrasts for model fitting, see my answer at How to set contrasts for my variable in regression analysis with R?.
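The full-rank argument can be checked numerically on this factor. The following is only a sketch: the object names X_full and X_sum are my own, and the check uses the rank reported by qr.

```r
x <- factor(rep(LETTERS[1:3], each = 2))

# Intercept plus all N = 3 raw dummy columns: 4 columns, but only rank 3,
# because the three dummies add up to the intercept column
X_full <- cbind(1, model.matrix(~ x - 1))
ncol(X_full)       # 4
qr(X_full)$rank    # 3

# Intercept plus N - 1 = 2 sum-to-zero contrast columns: full column rank
X_sum <- model.matrix(~ x, contrasts.arg = list(x = contr.sum))
ncol(X_sum)        # 3
qr(X_sum)$rank     # 3
```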
R code walk-through
For a factor variable of N levels subject to sum-to-zero contrasts, we can use the following function to get the N x (N - 1) transformation matrix that maps the (N - 1) coefficients estimated by lm back to the N coefficients for all levels.
ContrSumMat <- function (fctr, sparse = FALSE) {
if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
N <- nlevels(fctr)
Cmat <- contr.sum(N, sparse = sparse)
dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
Cmat
}
For the example 3-level factor x, this matrix is:
Cmat <- ContrSumMat(x)
# 1 2
#A 1 0
#B 0 1
#C -1 -1
The fitted model fit reports 3 - 1 = 2 coefficients for this factor. We can extract them as:
## coefficients After Contrasts
coef_ac <- coef(fit)[2:3]
# x1 x2
#-0.1923854 0.4088459
Therefore, the level-specific coefficients are:
## coefficients Before Contrasts
coef_bc <- (Cmat %*% coef_ac)[, 1]
# A B C
#-0.1923854 0.4088459 -0.2164605
## note that they sum to zero as expected
sum(coef_bc)
#[1] 0
Similarly, we can get the covariance matrix after contrasts:
var_ac <- vcov(fit)[2:3, 2:3]
# x1 x2
#x1 0.4382235 -0.2191118
#x2 -0.2191118 0.4382235
and transform it to the one before contrasts:
var_bc <- Cmat %*% var_ac %*% t(Cmat)
# A B C
#A 0.4382235 -0.2191118 -0.2191118
#B -0.2191118 0.4382235 -0.2191118
#C -0.2191118 -0.2191118 0.4382235
Interpretation:
The model intercept coef(fit)[1] is the grand mean.
The computed coef_bc is the deviation of each level's mean from the grand mean.
The diagonal entries of var_bc give the estimated variances of these deviations.
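This interpretation can be verified directly against the raw level means. A sketch follows; the names level_means and grand_mean are mine, and "grand mean" here means the unweighted mean of the level means, which coincides with mean(y) only because this example is balanced.

```r
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))

# Raw mean of y within each level, and the mean of those level means
level_means <- tapply(y, x, mean)
grand_mean  <- mean(level_means)

# The intercept equals the grand mean
all.equal(unname(coef(fit)[1]), grand_mean)                      # TRUE

# Level coefficients equal the deviations from the grand mean
Cmat    <- contr.sum(nlevels(x))
coef_bc <- (Cmat %*% coef(fit)[2:3])[, 1]
all.equal(as.numeric(coef_bc), as.numeric(level_means - grand_mean))   # TRUE
```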
We can then proceed to compute t-statistics and p-values for these coefficients, as follows.
## standard error of point estimate `coef_bc`
std.error_bc <- sqrt(diag(var_bc))
# A B C
#0.6619845 0.6619845 0.6619845
## t-statistics (Null Hypothesis: coef_bc = 0)
t.stats_bc <- coef_bc / std.error_bc
# A B C
#-0.2906192 0.6176065 -0.3269872
## p-values of the t-statistics
p.value_bc <- 2 * pt(abs(t.stats_bc), df = fit$df.residual, lower.tail = FALSE)
# A B C
#0.7902750 0.5805485 0.7651640
## construct a coefficient table that mimics `coef(summary(fit))`
stats.tab_bc <- cbind("Estimate" = coef_bc,
"Std. Error" = std.error_bc,
"t value" = t.stats_bc,
"Pr(>|t|)" = p.value_bc)
# Estimate Std. Error t value Pr(>|t|)
#A -0.1923854 0.6619845 -0.2906192 0.7902750
#B 0.4088459 0.6619845 0.6176065 0.5805485
#C -0.2164605 0.6619845 -0.3269872 0.7651640
We can also augment it by including the result for the grand mean (i.e., the model intercept).
## extract statistics of the intercept
intercept.stats <- coef(summary(fit))[1, , drop = FALSE]
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
## augment the coefficient table
stats.tab <- rbind(intercept.stats, stats.tab_bc)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
#A -0.19238543 0.6619845 -0.29061922 0.7902750
#B 0.40884591 0.6619845 0.61760645 0.5805485
#C -0.21646049 0.6619845 -0.32698723 0.7651640
We can also print this table with significance stars.
printCoefmat(stats.tab)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02902 0.46809 -0.0620 0.9545
#A -0.19239 0.66199 -0.2906 0.7903
#B 0.40885 0.66199 0.6176 0.5805
#C -0.21646 0.66199 -0.3270 0.7652
Emm? Why are there no stars? Well, in this example all p-values are very large. The stars will show up if p-values are small. Here is a convincing demo:
fake.tab <- stats.tab
fake.tab[, 4] <- fake.tab[, 4] / 100
printCoefmat(fake.tab)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02902 0.46809 -0.0620 0.009545 **
#A -0.19239 0.66199 -0.2906 0.007903 **
#B 0.40885 0.66199 0.6176 0.005805 **
#C -0.21646 0.66199 -0.3270 0.007652 **
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Oh, this is so beautiful. For the meaning of these stars, see my answer at: Interpeting R significance codes for ANOVA table?
Closing Remarks
It should be possible to write a function (or even an R package) to perform this table transformation. However, it might take great effort to make such a function flexible enough to handle:
all types of contrasts (this is easy to do);
complicated model terms, like interactions between a factor and other numeric/factor variables (this seems really involved!!).
So, I will stop here for the moment.
Miscellaneous Replies
Are the model scores that I get from the lm's summary still accurate, even though it isn't displaying all levels of the factor?
Yes, they are. lm conducts accurate least squares fitting.
In addition, the transformation of the coefficient table does not affect R-squared, degrees of freedom, residuals, fitted values, F-statistics, the ANOVA table, etc.
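One way to convince yourself that these model-level quantities do not depend on how the factor coefficients are parameterised is to refit the example under a different contrast coding and compare. This is only a sketch: fit_trt (treatment contrasts) is my own addition for comparison and is not part of the original example.

```r
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))

fit_sum <- lm(y ~ x, contrasts = list(x = contr.sum))        # as in the example
fit_trt <- lm(y ~ x, contrasts = list(x = contr.treatment))  # comparison fit

# The coefficients differ between the two codings ...
coef(fit_sum)
coef(fit_trt)

# ... but the fit itself does not
all.equal(fitted(fit_sum), fitted(fit_trt))                              # TRUE
all.equal(summary(fit_sum)$r.squared, summary(fit_trt)$r.squared)        # TRUE
all.equal(summary(fit_sum)$fstatistic, summary(fit_trt)$fstatistic)      # TRUE
fit_sum$df.residual == fit_trt$df.residual                               # TRUE
```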
R: do calculation for each factor level separately, then calculate min/mean/max over levels
library(tidyverse)
df %>%
group_by(run) %>%
mutate(scarcityfactor = 1 - discharge / lag(inflow,6)) %>%
group_by(time) %>%
summarise(Mean = mean(scarcityfactor),
Max = max(scarcityfactor),
Min = min(scarcityfactor))
# # A tibble: 24 x 4
# time Mean Max Min
# <dttm> <dbl> <dbl> <dbl>
# 1 2012-01-01 00:00:00 NA NA NA
# 2 2012-01-01 01:00:00 NA NA NA
# 3 2012-01-01 02:00:00 NA NA NA
# 4 2012-01-01 03:00:00 NA NA NA
# 5 2012-01-01 04:00:00 NA NA NA
# 6 2012-01-01 05:00:00 NA NA NA
# 7 2012-01-01 06:00:00 -46.7 -46.7 -46.7
# 8 2012-01-01 07:00:00 -2.96 -2.96 -2.96
# 9 2012-01-01 08:00:00 -1.34 -1.34 -1.34
#10 2012-01-01 09:00:00 -0.776 -0.776 -0.776
# # ... with 14 more rows