Aggregate Methods Treat Missing Values (NA) Differently

Good question, but in my opinion, this shouldn't have caused a major debugging headache because it is documented quite clearly in multiple places in the manual page for aggregate.

First, in the usage section:

## S3 method for class 'formula'
aggregate(formula, data, FUN, ...,
          subset, na.action = na.omit)

Later, in the description of the na.action argument:

na.action: a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
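
To see that default in action, here is a minimal sketch with a hypothetical two-row data frame M (the data are invented for illustration):

```r
# Hypothetical data: row 2 has an NA in Col1 only
M <- data.frame(Name = c("name", "name"), Col1 = c(1, NA), Col2 = c(2, 3))

# Default na.action = na.omit: row 2 is dropped entirely (it has an NA
# in one of the variables), so Col2's value 3 never reaches FUN.
aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)
#   Name Col1 Col2
# 1 name    1    2

# na.action = NULL keeps row 2; na.rm = TRUE then skips NAs per column.
aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE, na.action = NULL)
#   Name Col1 Col2
# 1 name    1    5
```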


I can't answer why the formula mode was written differently---that's something the function authors would have to answer---but using the above information, you can probably use the following:

aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE, na.action = NULL)
#   Name Col1 Col2
# 1 name    1    2

Why does aggregate NOT ignore NA values as per documentation?

You are using the wrong S3 method. The default method does not have an na.action parameter. Use the formula method, which has it:

aggregate(grade ~ user, v, sum)
#   user grade
# 1  joe   170
# 2  pat   100
# 3  tom    70

The S3 methods and their parameters are documented on the help page. The formula method is the only one with this parameter and to my knowledge it is not called internally by other methods.
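
As a sketch of the difference, take a small hypothetical data frame v2 (not the question's actual data) with one NA grade; the default method lets the NA through, while the formula method removes the row first:

```r
v2 <- data.frame(user = c("joe", "joe", "pat"), grade = c(100, NA, 50))

# Default method: no na.action parameter, so the NA reaches FUN
# and sum() propagates it.
aggregate(v2$grade, by = list(user = v2$user), FUN = sum)
#   user  x
# 1  joe NA
# 2  pat 50

# Formula method: na.action = na.omit drops the NA row before aggregating.
aggregate(grade ~ user, v2, sum)
#   user grade
# 1  joe   100
# 2  pat    50
```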

Aggregate with na.action=na.pass gives unexpected answer

aggregate makes use of tapply, which in turn makes use of factor on its grouping variable.

But, look at what happens with NA values in factor:

factor(c(1, 2, NA))
# [1] 1 2 <NA>
# Levels: 1 2

Note the levels. You can make use of addNA to keep the NA:

addNA(factor(c(1, 2, NA)))
# [1] 1 2 <NA>
# Levels: 1 2 <NA>

Thus, you would probably need to do something like:

aggregate(y ~ addNA(x), d, sum)
#   addNA(x) y
# 1        1 2
# 2     <NA> 3

Or something like:

d$x <- addNA(factor(d$x))
str(d)
# 'data.frame': 2 obs. of  2 variables:
#  $ x: Factor w/ 2 levels "1",NA: 1 2
#  $ y: num  2 3
aggregate(y ~ x, d, sum)
#      x y
# 1    1 2
# 2 <NA> 3

(Alternatively, upgrade to something like "data.table", which is not only faster than aggregate but also gives you more consistent behavior with NA values, so you don't need to worry about whether or not you're using the formula method of aggregate.)

library(data.table)
as.data.table(d)[, sum(y), by = x]
#     x V1
# 1:  1  2
# 2: NA  3

Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate]

Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is 5, not 10.

library(dplyr)

Using aggregate

Data %>%
  aggregate(. ~ Year + Farms, ., FUN = mean, na.rm = TRUE, na.action = NULL) %>%
  arrange(Farms, desc(Year)) %>%
  as.data.frame() %>%
  mutate_at(names(.), ~ replace(., is.nan(.), NA))

Using summarize

Data %>%
  group_by(Year, Farms) %>%
  summarize(MeanCow = mean(Cow, na.rm = TRUE),
            MeanDuck = mean(Duck, na.rm = TRUE),
            MeanChicken = mean(Chicken, na.rm = TRUE),
            MeanSheep = mean(Sheep, na.rm = TRUE),
            MeanHorse = mean(Horse, na.rm = TRUE)) %>%
  arrange(Farms, desc(Year)) %>%
  as.data.frame() %>%
  mutate_at(names(.), ~ replace(., is.nan(.), NA))

Solution for both

  Year  Farms  Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0     110  25.0  22.5
2 2019 Farm 1 14.0  6.0      65  10.0  13.5
3 2018 Farm 1  8.0   NA      10  14.5  12.0
4 2020 Farm 2 31.0 20.5      29  15.0  14.0
5 2019 Farm 2 11.5 40.5      43  18.5  42.5
6 2018 Farm 2 36.5 26.5      28  30.0  11.0
7 2020 Farm 3 38.5  9.0      37  30.0  42.0
8 2019 Farm 3   NA 10.5      NA  20.0  11.5
9 2018 Farm 3   NA  7.0      24  38.0  42.0

aggregate with list and data frame, how does function know the aggregation level?

It is just a matter of order. Let's first compute the result with your data above:

aggregate(list,
          by = list(country = df$country),
          FUN = mean)
#   country X1.5
# 1  Canada  2.0
# 2      US  4.5

Now let's reverse the order of the countries:

aggregate(list,
          by = list(country = rev(df$country)),
          FUN = mean)
#   country X1.5
# 1  Canada  4.0
# 2      US  1.5

As you can see, the result is different; it's what you would have expected with this data.frame:

data.frame(country = c("US", "US", "Canada", "Canada", "Canada"),
           state = c("state1", "state2", "state3", "state4", "state5"),
           randomnumb = 1:5)

So it depends on the order. As Duck said, try to use the formula notation to be clear:

aggregate(randomnumb ~ country, data = df, mean)
#   country randomnumb
# 1  Canada        2.0
# 2      US        4.5

aggregate toString ignoring NA values / Concatenate rows including NAs

df %>%
  group_by(id, year) %>%
  summarise(across(everything(), ~ toString(na.omit(.x))))

# A tibble: 3 x 4
# Groups:   id [3]
     id  year cat_1               cat_2
  <int> <int> <chr>               <chr>
1     1  2021 Too high, YOY error "YOY error"
2     2  2021 Too high            "Too low, YOY error"
3     3  2021 Too high, YOY error ""

Base R:

aggregate(.~id + year, df, \(x)toString(na.omit(x)), na.action = identity)

  id year               cat_1              cat_2
1  1 2021 Too high, YOY error          YOY error
2  2 2021            Too high Too low, YOY error
3  3 2021 Too high, YOY error

Use aggregate and keep NA rows

A work-around is simply not to use NA for the value groups. First, initialising your data as above:

x <- data.frame(idx = 1:30, group = rep(letters[1:10], 3), val = runif(30))
x$val[sample.int(nrow(x), 5)] <- NA

spl <- with(x, split(x, group))

lpp <- lapply(spl, function(x) {
  with(x, data.frame(x,
                     val_g = cut(val, seq(0, 1, 0.1), labels = FALSE),
                     val_g_lab = cut(val, seq(0, 1, 0.1))))
})

rd <- do.call(rbind, lpp)
ord <- rd[order(rd$idx, decreasing = FALSE), ]

Simply convert the column to character and convert the NAs to some arbitrary string literal:

# Convert to character
ord$val_g_lab <- as.character(ord$val_g_lab)
# Convert NAs
ord$val_g_lab[is.na(ord$val_g_lab)] <- "Unknown"

aggregate(val ~ group + val_g_lab, ord,
          FUN = function(x) c(mean(x, na.rm = FALSE), sum(!is.na(x))),
          na.action = na.pass)
# group val_g_lab val.1 val.2
#1 e (0,0.1] 0.02292533 1.00000000
#2 g (0.1,0.2] 0.16078353 1.00000000
#3 g (0.2,0.3] 0.20550228 1.00000000
#4 i (0.2,0.3] 0.26986665 1.00000000
#5 j (0.2,0.3] 0.23176149 1.00000000
#6 d (0.3,0.4] 0.39196441 1.00000000
#7 e (0.3,0.4] 0.39303518 1.00000000
#8 g (0.3,0.4] 0.35646994 1.00000000
#9 i (0.3,0.4] 0.35724889 1.00000000
#10 a (0.4,0.5] 0.48809261 1.00000000
#11 b (0.4,0.5] 0.40993166 1.00000000
#12 d (0.4,0.5] 0.42394859 1.00000000
# ...
#20 b (0.9,1] 0.99562918 1.00000000
#21 c (0.9,1] 0.92018049 1.00000000
#22 f (0.9,1] 0.91379088 1.00000000
#23 h (0.9,1] 0.93445802 1.00000000
#24 j (0.9,1] 0.93325098 1.00000000
#25 b Unknown NA 0.00000000
#26 c Unknown NA 0.00000000
#27 d Unknown NA 0.00000000
#28 i Unknown NA 0.00000000
#29 j Unknown NA 0.00000000

Does this do what you want?

Edit:

To answer your question in the comments: note that NaN and NA are not quite the same, and both are very different from "NaN" and "NA", which are string literals (i.e. just text).

In any case, NAs are special 'atomic' elements which are nearly always handled exceptionally by functions, so you have to check the documentation for how a particular function handles NAs. In this case, the na.action argument applies to the values that you aggregate over, not to the 'classes' in your formula. The drop = FALSE argument could also be used, but then you get all combinations of the (in this case) two classifications. Redefining the NA as a string literal works because the new name is treated like any other class.
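
The distinction is easy to check at the console:

```r
# is.na() treats both NA and NaN as missing...
is.na(c(NA, NaN))       # TRUE TRUE
# ...but is.nan() is TRUE only for NaN.
is.nan(c(NA, NaN))      # FALSE TRUE
# The string literals "NA" and "NaN" are just text, not missing values.
is.na(c("NA", "NaN"))   # FALSE FALSE
```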

Aggregate and missing values

Here's a solution with ddply from the plyr package:

library(plyr)
ddply(d, .(Stock, Soil, Nitrogen), summarise,
Respiration = mean(as.numeric(as.character(Respiration))))

#   Stock    Soil Nitrogen Respiration
# 1     A   Blank     <NA>       112.5
# 2     A    Clay       20       138.0
# 3     A Control        0       125.0
# 4     B   Blank     <NA>       110.0
# 5     B    Clay       20       135.0
# 6     B Control        0       123.0

Please note that cbind is not a good way to create a data frame; use data.frame(Stock, Soil, Nitrogen, Respiration) instead. Because of your approach, all columns of d are factors, which is why I used as.numeric(as.character(Respiration)) to recover the numeric values of that column.
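
A quick sketch of why cbind causes the problem, using two illustrative values rather than the question's full data:

```r
Stock <- c("A", "B")
Respiration <- c(112.5, 110)

# cbind() on mixed types first builds a character matrix, so every
# column loses its numeric type (and became a factor before R 4.0,
# when stringsAsFactors defaulted to TRUE).
d1 <- data.frame(cbind(Stock, Respiration))
is.numeric(d1$Respiration)   # FALSE

# data.frame() keeps each column's original type.
d2 <- data.frame(Stock, Respiration)
is.numeric(d2$Respiration)   # TRUE
```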


