Blend of na.omit and na.pass using aggregate?
Pass both na.action=na.pass and na.rm=TRUE to aggregate. The former tells aggregate not to delete rows where NAs exist, and the latter tells mean to ignore them.
aggregate(cbind(var1, var2, var3) ~ name, test, mean,
na.action=na.pass, na.rm=TRUE)
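To see what the two arguments do together, here is a minimal sketch on a small made-up data frame (the values in test below are assumptions for illustration, not the asker's data). It compares the default formula behaviour with the na.pass plus na.rm = TRUE call.

```r
# Hypothetical data: one NA in each value column
test <- data.frame(
  name = c("A", "A", "B", "B"),
  var1 = c(1, NA, 3, 5),
  var2 = c(2, 4, NA, 8)
)

# Default (na.action = na.omit): any row containing an NA is dropped first
dropped <- aggregate(cbind(var1, var2) ~ name, test, mean)

# Keep all rows, and let mean() skip the NAs column by column
kept <- aggregate(cbind(var1, var2) ~ name, test, mean,
                  na.action = na.pass, na.rm = TRUE)

print(dropped)
print(kept)
```

In kept, the var2 mean for A uses both 2 and 4, whereas in dropped the row holding 4 was discarded because its var1 is NA.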
Using aggregate in a dataframe with NA without dropping rows
Using dplyr:
df %>%
  group_by(cy) %>%
  summarize_all(mean, na.rm = TRUE)
# cy bt cl pf ne YH YI
# 1 H 1.785714 0.7209302 53.41463 51.75952 21.92857 29.40476
# 2 K 1.333333 0.8333333 33.33333 47.83333 20.66667 27.33333
# 3 M 1.777778 0.4444444 63.75000 58.68889 24.88889 44.22222
# 4 O 2.062500 0.8750000 31.66667 53.05333 18.06667 30.78571
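For comparison, the same per-group means can be sketched in base R with the formula method. The data frame below is a made-up stand-in (only the column names cy, bt and cl are borrowed from the output above), so treat it as an illustration rather than the original data.

```r
# Hypothetical stand-in for df, with NAs scattered over the value columns
df <- data.frame(
  cy = c("H", "H", "K", "K"),
  bt = c(1, NA, 2, 4),
  cl = c(NA, 0.5, 1, NA)
)

# na.pass keeps every row; na.rm = TRUE is forwarded to mean()
res <- aggregate(. ~ cy, data = df, FUN = mean,
                 na.action = na.pass, na.rm = TRUE)
print(res)
```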
Aggregate - na.omit and na.pass in R with factor (group by factor)?
If we are getting the mean of 'TotalPay' grouped by 'JobTitle', the formula method would be:
aggregate(TotalPay~JobTitle, salaries, mean, na.rm=TRUE, na.action=na.pass)
Or use
aggregate(salaries$TotalPay, list(salaries$JobTitle), FUN=mean, na.rm=TRUE)
data
set.seed(24)
salaries <- data.frame(JobTitle = sample(LETTERS[1:5], 20,
replace=TRUE), TotalPay= sample(c(1:20, NA), 20))
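The two calls above return the same means but different column names (the list form yields Group.1 and x). As a small sketch, assuming a hand-made salaries data frame rather than the sampled one, naming the list element and passing a one-column data frame gives matching names:

```r
# Hypothetical data; group "C" has no non-missing pay at all
salaries <- data.frame(
  JobTitle = c("A", "A", "B", "B", "C"),
  TotalPay = c(10, NA, 4, 6, NA)
)

# Formula method: keep NA rows, let mean() drop them
by_formula <- aggregate(TotalPay ~ JobTitle, salaries, mean,
                        na.rm = TRUE, na.action = na.pass)

# Data frame method: named grouping list and a one-column data frame
by_list <- aggregate(salaries["TotalPay"],
                     by = list(JobTitle = salaries$JobTitle),
                     FUN = mean, na.rm = TRUE)

print(by_formula)
print(by_list)
```

Group C comes out as NaN in both results, because mean() of an empty set is NaN once the NAs are removed.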
aggregate function in R, sum of NAs are 0
Create a lambda function with a condition to return NaN when all elements are NA:
aggregate(. ~ name, test, FUN = function(x) if(all(is.na(x))) NaN
else sum(x, na.rm = TRUE), na.action=na.pass)
Output:
name var1 var2 var3
1 A 6 10 NaN
2 B 6 26 10
3 C 6 42 26
This is expected behavior with sum and na.rm = TRUE. According to ?sum, the sum of an empty set is zero, by definition.
> sum(c(NA, NA), na.rm = TRUE)
[1] 0
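The same idea can be written with a named helper, which keeps the aggregate call short. This is only a sketch: sum_or_nan is a hypothetical name, and the data frame is made up for illustration.

```r
# Hypothetical helper: NaN for an all-NA group, otherwise the NA-free sum
sum_or_nan <- function(x) {
  if (all(is.na(x))) NaN else sum(x, na.rm = TRUE)
}

# Made-up data: var2 is entirely NA within group "A"
test <- data.frame(
  name = c("A", "A", "B", "B"),
  var1 = c(1, 5, NA, 2),
  var2 = c(NA, NA, 3, 4)
)

res <- aggregate(. ~ name, test, FUN = sum_or_nan, na.action = na.pass)
print(res)

# The plain sum of an empty set really is zero
empty_sum <- sum(numeric(0))
```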
How to omit na in aggregate to calculate SD in R
Try this:
aggregate(age_onset~cohort+status, data = dat, sd, na.rm = TRUE)
# cohort status age_onset
# 1 ADC8_AA -9 NA
# 2 ADC8_AA 2 7.661191
You can use the ... argument of aggregate to pass na.rm = TRUE through to sd.
You will still get NA for any groups that only have a single non-missing value. This is because standard deviation isn't defined for a single value.
subset(dat, status == -9)
# cohort status age_onset
# 23 ADC8_AA -9 NA
# 46 ADC8_AA -9 NA
# 49 ADC8_AA -9 82
sd(82)
# [1] NA
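To tell such single-value groups apart from genuinely missing ones, it can help to report the number of non-missing observations next to the standard deviation. The sketch below assumes a small made-up dat that reuses the column names from the question.

```r
# Hypothetical data: status -9 has one usable value, status 2 has two
dat <- data.frame(
  cohort = "ADC8_AA",
  status = c(-9, -9, -9, 2, 2, 2),
  age_onset = c(NA, NA, 82, 60, 70, NA)
)

# Standard deviation per group; na.rm is forwarded to sd() via ...
sds <- aggregate(age_onset ~ cohort + status, data = dat,
                 FUN = sd, na.rm = TRUE)

# Count of values that actually went into each group
n_obs <- aggregate(age_onset ~ cohort + status, data = dat,
                   FUN = function(x) sum(!is.na(x)))

print(sds)
print(n_obs)
```

A group with a count of 1 explains an NA in the sd column.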
Use aggregate and keep NA rows
A work-around is simply not to use NA for the value groups. First, initialising your data as above:
x <- data.frame(idx=1:30, group=rep(letters[1:10],3), val=runif(30))
x$val[sample.int(nrow(x), 5)] <- NA; x
spl <- with(x, split(x, group))
lpp <- lapply(spl, function(x) {
  with(x, data.frame(x,
                     val_g = cut(val, seq(0, 1, 0.1), labels = FALSE),
                     val_g_lab = cut(val, seq(0, 1, 0.1))))
})
rd <- do.call(rbind, lpp)
ord <- rd[order(rd$idx, decreasing = FALSE), ]
Simply convert to character and convert NAs to some arbitrary string literal:
# Convert to character
ord$val_g_lab <- as.character(ord$val_g_lab)
# Convert NAs
ord$val_g_lab[is.na(ord$val_g_lab)] <- "Unknown"
aggregate(val ~ group + val_g_lab, ord,
FUN=function(x) c(mean(x, na.rm = FALSE), sum(!is.na(x))),
na.action=na.pass)
# group val_g_lab val.1 val.2
#1 e (0,0.1] 0.02292533 1.00000000
#2 g (0.1,0.2] 0.16078353 1.00000000
#3 g (0.2,0.3] 0.20550228 1.00000000
#4 i (0.2,0.3] 0.26986665 1.00000000
#5 j (0.2,0.3] 0.23176149 1.00000000
#6 d (0.3,0.4] 0.39196441 1.00000000
#7 e (0.3,0.4] 0.39303518 1.00000000
#8 g (0.3,0.4] 0.35646994 1.00000000
#9 i (0.3,0.4] 0.35724889 1.00000000
#10 a (0.4,0.5] 0.48809261 1.00000000
#11 b (0.4,0.5] 0.40993166 1.00000000
#12 d (0.4,0.5] 0.42394859 1.00000000
# ...
#20 b (0.9,1] 0.99562918 1.00000000
#21 c (0.9,1] 0.92018049 1.00000000
#22 f (0.9,1] 0.91379088 1.00000000
#23 h (0.9,1] 0.93445802 1.00000000
#24 j (0.9,1] 0.93325098 1.00000000
#25 b Unknown NA 0.00000000
#26 c Unknown NA 0.00000000
#27 d Unknown NA 0.00000000
#28 i Unknown NA 0.00000000
#29 j Unknown NA 0.00000000
Does this do what you want?
Edit:
To answer your question in the comments: note that NaN and NA are not quite the same. Note also that these two are very different from "NaN" and "NA", which are string literals (i.e. just text).
But anyway, NAs are special 'atomic' elements which are nearly always handled exceptionally by functions, so you have to check the documentation to see how a particular function handles NAs. In this case, the na.action argument applies to the values that you aggregate over, not the 'classes' in your formula. The drop=FALSE argument could also be used, but then you get all combinations of the (in this case) two classifications. Redefining the NA to a string literal works because the new name is treated like any other class.
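The point about grouping variables can be checked on a tiny example. This is a sketch with made-up data (the label column and its values are assumptions): rows whose grouping value is NA vanish from the result even with na.action = na.pass, and reappear once the NA is recoded as a string.

```r
# Hypothetical data: two rows have an NA in the grouping column 'label'
x <- data.frame(
  group = c("a", "a", "b", "b"),
  label = c("low", NA, "high", NA),
  val = c(0.1, NA, 0.9, NA)
)

count_non_na <- function(v) sum(!is.na(v))

# NA in a grouping variable: those rows are omitted from the result
dropped <- aggregate(val ~ group + label, x, FUN = count_non_na,
                     na.action = na.pass)

# Recode the NA as an ordinary string and the groups are kept
x$label[is.na(x$label)] <- "Unknown"
kept <- aggregate(val ~ group + label, x, FUN = count_non_na,
                  na.action = na.pass)

print(dropped)
print(kept)
```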
aggregate function - NA is still outputted as na.action is set to omit
According to the help on aggregate, na.action = na.omit is the default in the method for formula objects, but not in the method for data frames. Which method is used is determined by the class of the first argument in your function call.
I don't have your data, so I show you what this means using the data set mtcars, which is included in R, with a modification (which is needed, because mtcars contains no NA):
mtcars[5, "disp"] <- NA
Now, I aggregate the columns disp and mpg by cyl. First, I use the data frame method:
aggregate(mtcars[, c("mpg", "disp")], list(cyl = mtcars$cyl), mean)
## cyl mpg disp
## 1 4 26.66364 105.1364
## 2 6 19.74286 183.3143
## 3 8 15.10000 NA
Clearly, the NA values are not omitted. However, mean() comes with an argument na.rm, which I can set to TRUE as follows:
aggregate(mtcars[, c("mpg","disp")], list(cyl = mtcars$cyl), mean, na.rm = TRUE)
## cyl mpg disp
## 1 4 26.66364 105.1364
## 2 6 19.74286 183.3143
## 3 8 15.10000 352.5692
(The reason that this works can also be found in the documentation of aggregate(). The function has an argument ... (as many R functions do), which will match all the expressions that you pass to the function that do not match one of its arguments. These expressions are then passed on to the function that you use for aggregation. Since aggregate() has no argument called na.rm, this argument will be sent on to mean().)
Now back to what caused your confusion: you can also use aggregate by giving a formula as the first argument (which I find more readable and thus preferable). The call then reads as follows:
aggregate(cbind(mpg, disp) ~ cyl, data = mtcars, mean)
## cyl mpg disp
## 1 4 26.66364 105.1364
## 2 6 19.74286 183.3143
## 3 8 14.82308 352.5692
As you can see, in this form the NA values are indeed omitted by default.
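If you prefer the formula interface but want the data frame method's result, one option is to combine na.action = na.pass with na.rm = TRUE. The sketch below works on a copy of mtcars (named cars here, an arbitrary choice) so the built-in data set is left untouched.

```r
# Work on a copy so mtcars itself is not modified
cars <- mtcars
cars[5, "disp"] <- NA

# Data frame method with na.rm forwarded to mean()
by_df <- aggregate(cars[, c("mpg", "disp")], list(cyl = cars$cyl),
                   mean, na.rm = TRUE)

# Formula method, default na.action = na.omit: the whole row 5 is dropped
by_formula_default <- aggregate(cbind(mpg, disp) ~ cyl, data = cars, mean)

# Formula method keeping all rows, NAs skipped per column
by_formula_pass <- aggregate(cbind(mpg, disp) ~ cyl, data = cars, mean,
                             na.action = na.pass, na.rm = TRUE)

print(by_formula_pass)
```

With na.pass, the mpg mean for 8 cylinders again includes row 5, so it matches the data frame method rather than the default formula result.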
Inconsistency of na.action between xtabs and aggregate in R
It's difficult to give a canonical answer without describing how xtabs works. If we step through the main points of its source code, we'll see clearly what's going on.
After some basic type checking, the call to xtabs works internally by first creating a data frame of all the variables contained in your formula using stats::model.frame, and it is to this that the na.action parameter is passed.
The way it does this is quite clever. xtabs first copies the call you made to it via match.call, like this:
m <- match.call(expand.dots = FALSE)
Then it strips out the parameters that don't need to be passed to stats::model.frame, like this:
m$... <- m$exclude <- m$drop.unused.levels <- m$sparse <- m$addNA <- NULL
As promised in the help file, if addNA is TRUE and na.action is missing, it will now default to na.pass:
if (addNA && missing(na.action))
m$na.action <- quote(na.pass)
Then it changes the function to be called from xtabs to stats::model.frame, like this:
m[[1L]] <- quote(stats::model.frame)
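The call-rewriting idiom described above is not specific to xtabs. As a minimal sketch, the function below (frame_for and its label argument are hypothetical names made up for this illustration) captures its own call, drops the argument model.frame does not understand, swaps the function being called, and evaluates the result in the caller's environment.

```r
# Hypothetical wrapper illustrating the match.call() idiom
frame_for <- function(formula, data, na.action, label = NULL) {
  m <- match.call(expand.dots = FALSE)   # copy of the call as the user wrote it
  m$label <- NULL                        # strip what model.frame doesn't need
  m[[1L]] <- quote(stats::model.frame)   # change the function being called
  eval(m, parent.frame())                # evaluate where the user's data lives
}

d <- data.frame(y = c(1, NA, 3), g = c("a", "b", "a"))

mf_pass <- frame_for(y ~ g, d, na.action = na.pass, label = "ignored")
mf_omit <- frame_for(y ~ g, d, na.action = na.omit)

print(nrow(mf_pass))
print(nrow(mf_omit))
```

Whatever na.action the user supplied is forwarded untouched, which is why it ends up controlling the rows of the model frame.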
So the object m is a call (and is also a standalone reprex), which in your case looks like this:
stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L,
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = NULL)
Note that your na.action = NULL has been passed to this call. This has the effect of keeping all NA values in the frame. When the above call is evaluated, it gives this data frame:
eval(m)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>
Note that this is the same result you would get if you passed na.action = na.pass:
stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L,
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = na.pass)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>
However, if you passed na.action = na.omit, you would only be left with a single row, since only row 2 has no NA values.
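The row counts are easy to verify directly. The sketch below rebuilds the same A, B and C columns as an ordinary data frame (the name d is an arbitrary choice) and compares the three na.action settings.

```r
# Same values as in the call above, as a plain data frame
d <- data.frame(
  A = factor(c("Y", "Y", "Z", NA)),
  B = c(NA, TRUE, FALSE, TRUE),
  C = c(TRUE, TRUE, NA, FALSE)
)

mf_null <- stats::model.frame(cbind(B, C) ~ A, data = d, na.action = NULL)
mf_pass <- stats::model.frame(cbind(B, C) ~ A, data = d, na.action = na.pass)
mf_omit <- stats::model.frame(cbind(B, C) ~ A, data = d, na.action = na.omit)

# NULL and na.pass keep all four rows; na.omit keeps only the complete row
print(c(null = nrow(mf_null), pass = nrow(mf_pass), omit = nrow(mf_omit)))
```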
In any case, the "model frame" result is stored in the variable mf. This is then split into the independent variable(s) (in your case, column A) and the response variable (in your case, cbind(B, C)).
The response is stored in y and the variable in by:
i <- attr(attr(mf, "terms"), "response")
by <- mf[-i]
y <- mf[[i]]
Now, by is processed to ensure each independent variable is a factor, and that any NA values are converted into factor levels if you have specified addNA = TRUE:
by <- lapply(by, function(u) {
if (!is.factor(u))
u <- factor(u, exclude = exclude)
else if (has.exclude)
u <- factor(as.character(u), levels = setdiff(levels(u),
exclude), exclude = NULL)
if (addNA)
u <- addNA(u, ifany = TRUE)
u[, drop = drop.unused.levels]
})
Now we come to the crux. The na.action is used again to determine how the NA values in the response variable will be counted. In your case, since you passed na.action = NULL, you will see that naAct will get the value stored in getOption("na.action"), which, if you have never changed it, should be set to na.omit. This in turn will cause the value of the variable na.rm to be TRUE:
naAct <- if (!is.null(m$na.action)) {
m$na.action
}else {getOption("na.action", default = quote(na.omit))}
na.rm <- identical(naAct, quote(na.omit)) || identical(naAct,
na.omit) || identical(naAct, "na.omit")
Note that if you had passed na.action = na.pass, then na.rm would be FALSE, as you can confirm by tracing this piece of code.
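That trace can be reproduced outside xtabs. The helper below, resolve_na_rm, is a hypothetical name wrapping the same two expressions quoted above; the session option is set explicitly first so the result does not depend on local settings.

```r
# Make the session default explicit, remembering the old value
old <- options(na.action = "na.omit")

# Hypothetical helper mirroring the naAct / na.rm logic quoted above
resolve_na_rm <- function(na_action) {
  naAct <- if (!is.null(na_action)) {
    na_action
  } else {
    getOption("na.action", default = quote(na.omit))
  }
  identical(naAct, quote(na.omit)) || identical(naAct, na.omit) ||
    identical(naAct, "na.omit")
}

rm_null <- resolve_na_rm(NULL)            # falls back to the option
rm_pass <- resolve_na_rm(quote(na.pass))  # not na.omit in any form
rm_omit <- resolve_na_rm(quote(na.omit))

options(old)  # restore the previous setting

print(c(null = rm_null, pass = rm_pass, omit = rm_omit))
```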
Finally, we come to the section where your xtabs table is built using sum inside a tapply, which is itself inside an lapply.
lapply(as.data.frame(y), tapply, by, sum, na.rm = na.rm, default = 0L)
You can see that the na.rm variable is used to determine whether to remove NAs from the columns before attempting to sum them. The result of this lapply is then coerced into the final cross tab.
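The effect of that na.rm flag can be seen with tapply alone. This is a sketch on made-up vectors, not a reproduction of xtabs itself: y stands in for one response column and by for a grouping factor with an unused level W.

```r
# Hypothetical response column and grouping factor (level "W" is unused)
y <- c(NA, 1, 0, 1)
by <- factor(c("Y", "Y", "Z", "Z"), levels = c("W", "Y", "Z"))

# na.rm = TRUE: the NA is dropped before summing
with_rm <- tapply(y, by, sum, na.rm = TRUE, default = 0L)

# na.rm = FALSE: any group containing an NA sums to NA
without_rm <- tapply(y, by, sum, na.rm = FALSE, default = 0L)

print(with_rm)
print(without_rm)
```

The default = 0L argument fills the cell for the empty level W, while the Y cell flips between a count and NA depending on na.rm.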
So how does this answer your question?
The documentation is right that if you don't pass an na.action, it will default to na.pass. However, the na.action is used in two places: once in the call to model.frame and once to determine the value of na.rm. It is very clear from the source code that if na.action is na.pass, then na.rm will be FALSE, so you will miss out on the counts of any response groups containing NA values. This is the opposite of what is written in the help file.
The only way round this is to pass na.action = NULL, since this will allow model.frame to keep NA values, but will also cause the sum function to be called with na.rm = TRUE.
TL;DR The documentation for xtabs is wrong on this point.