Behavior of summing !is.na() results

dplyr expression summing sum(!is.na(Field1) + !is.na(Field2)...) giving wrong number

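I have a sneaking suspicion this has to do with the precedence of the ! and + operators and has little to nothing to do with dplyr itself. In R, unary ! binds more loosely than +, so !is.na(Field1) + !is.na(Field2) is parsed as !(is.na(Field1) + !is.na(Field2)). See this previous post: Behavior of summing is.na results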

I can thus make it work using summarise by adding some extra parentheses:

# %.% was dplyr's original chaining operator; %>% replaces it in current versions
df %>%
  group_by(id, date) %>%
  summarise(new =
    (!is.na(Field1)) + (!is.na(Field2)) + (!is.na(Field3)) +
    (!is.na(Field4)) + (!is.na(Field5))
  ) %>%
  arrange(id, date)

#Source: local data frame [9 x 3]
#Groups: id
#
# id date new
#1 1 2005 0
#2 1 2006 3
#3 1 2007 2
#4 2 2005 4
#5 2 2006 2
#6 2 2007 3
#7 3 2005 1
#8 3 2006 3
#9 3 2007 3
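
The precedence effect can be reproduced outside dplyr with two small vectors. This is only a minimal sketch: x and y are made-up stand-ins for Field1 and Field2, not data from the question.

```r
# Hypothetical stand-ins for two of the Field columns
x <- c(1, NA, 3)
y <- c(NA, NA, 2)

# Without parentheses this is parsed as !(is.na(x) + !is.na(y)),
# so the result is a logical vector, not a count
unparenthesised <- !is.na(x) + !is.na(y)
print(unparenthesised)
# TRUE FALSE FALSE

# With parentheses each test is negated first, then the logicals are added
parenthesised <- (!is.na(x)) + (!is.na(y))
print(parenthesised)
# 1 0 2
```

Only the second form counts the non-missing fields per row, which is what the summarise call above relies on.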

Custom sum function in dplyr returns inconsistent results

The issue seems to be with dplyr determining the column type in reference to the first returned result. If you force the NA value, which is by default a logical value, to be an NA_real_ or NA_integer_, then you will be sorted:

##Just to show what NA normally does first:
class(NA)
#[1] "logical"

sum0 <- function(x, ...) {
  # remove NAs unless all are NA
  if (is.na(mean(x, na.rm = TRUE))) return(NA_real_)
  else sum(x, ..., na.rm = TRUE)
}

dta %>%
  group_by(year) %>%
  summarize(rrconf = sum0(rrconf), enrolled = sum0(enrolled))

#Source: local data frame [7 x 3]
#
# year rrconf enrolled
#1 2007 79 NA
#2 2008 NA NA
#3 2009 474 458
#4 2010 2792 1222
#5 2011 1686 1155
#6 2012 3313 1906
#7 2013 3456 2184
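
To see why the typed NA matters, it may help to check what sum0 returns in each branch. The sketch below uses base R only and made-up input vectors. How strictly dplyr checks result types has varied between versions, so treat this as an illustration of the typed NA constants rather than of dplyr's exact rules.

```r
sum0 <- function(x, ...) {
  # return a double NA when every value is missing, otherwise sum the rest
  if (is.na(mean(x, na.rm = TRUE))) return(NA_real_)
  sum(x, ..., na.rm = TRUE)
}

# The three NA constants differ only in their type
na_types <- c(typeof(NA), typeof(NA_integer_), typeof(NA_real_))
print(na_types)
# "logical" "integer" "double"

# Hypothetical inputs: one group that is entirely missing, one that is not
all_missing <- sum0(c(NA_real_, NA_real_))
partly_missing <- sum0(c(1.5, NA, 2))

# Both branches now return a double, so the results line up in one column
print(c(typeof(all_missing), typeof(partly_missing)))
# "double" "double"
```

If the column being summed is an integer, NA_integer_ may be the closer match, since sum() of integers returns an integer.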

Sum two dataframes with NA values and factors

Base R Version:

library(dplyr) # only for pipe operator
rbind(data1, data2) %>%
  split(.$NAMES) %>%
  lapply(function(x) {
    data.frame(NAMES = unique(x$NAMES), as.list(colSums(x[, -1])))
  }) %>%
  do.call(rbind, .)

# NAMES X1 X2
# name1 name1 5 NA
# name2 name2 NA 22
# name3 name3 9 24

Notice that NAMES now also appears as rownames. This is because split outputs a named list. You can either keep the rownames and remove NAMES = unique(x$NAMES), or add an unname() pipe after split:

rbind(data1, data2) %>%
  split(.$NAMES) %>%
  lapply(function(x) {
    data.frame(as.list(colSums(x[, -1])))
  }) %>%
  do.call(rbind, .)

# X1 X2
# name1 5 NA
# name2 NA 22
# name3 9 24

rbind(data1, data2) %>%
  split(.$NAMES) %>%
  unname() %>%
  lapply(function(x) {
    data.frame(NAMES = unique(x$NAMES), as.list(colSums(x[, -1])))
  }) %>%
  do.call(rbind, .)

# NAMES X1 X2
# 1 name1 5 NA
# 2 name2 NA 22
# 3 name3 9 24

To treat NA's as zeros, just add na.rm = TRUE to colSums:

rbind(data1, data2) %>%
  split(.$NAMES) %>%
  unname() %>%
  lapply(function(x) {
    data.frame(NAMES = unique(x$NAMES), as.list(colSums(x[, -1], na.rm = TRUE)))
  }) %>%
  do.call(rbind, .)

# NAMES X1 X2
# 1 name1 5 10
# 2 name2 0 22
# 3 name3 9 24

dplyr + purrr Version:

library(purrr)
library(dplyr)

list(data1, data2) %>%
reduce(function(x, y) cbind(NAMES = x$NAMES, x[,-1] + y[-1]))

Result:

  NAMES X1 X2
1 name1 5 NA
2 name2 NA 22
3 name3 9 24

To treat NA's as zero:

list(data1, data2) %>%
  map(function(x) {
    modify_if(x, is.numeric, function(y) ifelse(is.na(y), 0, y))
  }) %>%
  reduce(function(x, y) cbind(NAMES = x$NAMES, x[, -1] + y[, -1]))

Result:

  NAMES X1 X2
1 name1 5 10
2 name2 0 22
3 name3 9 24

Important Note:

Replacing NA's with zeros is often a bad idea since they mean different things. NA could mean that the data is missing, not necessarily zero, so replacing all NA's with zeros could bias your results. Please only do it if you are sure that NA's mean zero in the context of your data.

Additional Notes:

  1. Both map and modify_if are from the purrr package. map applies a function to each element of a list and always returns a list. modify does the same except that it returns the same type as the input.
  2. modify_if only "maps" the elements that satisfy a condition.
  3. In the first pipe, I used map to "map" each element of list(data1, data2) with the modify_if function, while modify_if replaces NA's with zeros for each numeric column only. This way I can use the + operator in the next pipe without worrying about NA's.
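  4. reduce applies the function pairwise across the list: it adds the numeric columns of data1 and data2 element by element (which assumes both data frames list the same NAMES in the same row order), then cbinds the result with the NAMES column from data1.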
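
For readers without purrr, the same map-then-reduce idea can be sketched in base R with lapply and Reduce. The two data frames below are hypothetical, since the question's data1 and data2 are not shown here, and zero_fill and add_frames are helper names invented for this sketch.

```r
# Hypothetical inputs with the same NAMES in the same row order
data1 <- data.frame(NAMES = c("name1", "name2", "name3"),
                    X1 = c(1, NA, 4), X2 = c(10, 11, 12))
data2 <- data.frame(NAMES = c("name1", "name2", "name3"),
                    X1 = c(4, NA, 5), X2 = c(NA, 11, 12))

# Replace NA with 0 in numeric columns only (the modify_if step)
zero_fill <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(col) ifelse(is.na(col), 0, col))
  df
}

# Element-wise addition of the numeric columns (the reduce step)
add_frames <- function(x, y) cbind(NAMES = x$NAMES, x[, -1] + y[, -1])

# NA propagates through + when nothing is replaced
strict <- Reduce(add_frames, list(data1, data2))

# NA is treated as zero when zero_fill is applied first
lenient <- Reduce(add_frames, lapply(list(data1, data2), zero_fill))
```

Because the addition is positional, this only gives sensible totals when every data frame has its rows in the same order.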

Sum NA cases in dplyr's summarise

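The reason for this behavior is that we assigned Endemic as the name of a new summarised variable. summarise evaluates its expressions in order, so once Endemic has been redefined as the group sum, a later is.na(Endemic) sees that sum rather than the original column. Instead, give the sum a new column name and rename it afterwards: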

mydata %>%
  group_by(Group, Scenario, year, random) %>%
  summarise(All = n(),
            EndemicS = sum(Endemic, na.rm = TRUE),
            noEndemic = sum(is.na(Endemic))) %>%
  rename(Endemic = EndemicS)
# A tibble: 3 x 7
# Groups: Group, Scenario, year [3]
# Group Scenario year random All Endemic noEndemic
# <fctr> <fctr> <dbl> <chr> <int> <dbl> <int>
#1 Amphibians Present 1940 obs 6 3 3
#2 Amphibians RCP 4.5 1940 obs 6 3 3
#3 Amphibians RCP 8.5 1940 obs 6 3 3
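
The effect of reusing the column name can be shown on a toy data frame. This is a sketch under assumptions: the data frame d and its values are made up, and only the Endemic column name is borrowed from the question.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical data: one group, one missing Endemic value
d <- data.frame(g = c("a", "a", "a"), Endemic = c(1, NA, 2))

# Reusing the name: is.na() is applied to the freshly computed sum
overwritten <- d %>%
  group_by(g) %>%
  summarise(Endemic = sum(Endemic, na.rm = TRUE),
            noEndemic = sum(is.na(Endemic)))

# A new name keeps the original column visible to later expressions
renamed <- d %>%
  group_by(g) %>%
  summarise(EndemicS = sum(Endemic, na.rm = TRUE),
            noEndemic = sum(is.na(Endemic))) %>%
  rename(Endemic = EndemicS)

print(overwritten$noEndemic)
# 0
print(renamed$noEndemic)
# 1
```

Computing noEndemic before Endemic inside the same summarise call is another way to avoid the clash, since the original column is still intact at that point.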

Using `:=` in data.table to sum the values of two columns in R, ignoring NAs

It's not a lack of understanding of data.table but rather one regarding vectorized functions in R. You can define a dyadic operator that will behave differently than the "+" operator with regard to missing values:

`%+na%` <- function(x, y) ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))

mat[ , col3:= col1 %+na% col2]
#-------------------------------
col1 col2 col3
1: NA 0.003745 0.003745
2: 0.000000 0.007463 0.007463
3: -0.015038 -0.007407 -0.022445
4: 0.003817 -0.003731 0.000086
5: -0.011407 -0.007491 -0.018898

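You can use mrdwad's comment to do it with sum(... , na.rm=TRUE). Because sum collapses all of its arguments into a single number, it has to be run once per row, which is what the by argument does here: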

mat[ , col4 := sum(col1, col2, na.rm=TRUE), by=1:NROW(mat)]
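
The difference between the vectorised operator and sum can be checked on plain vectors, without data.table. This is a minimal sketch with made-up values. rowSums is a base R alternative that is not part of the original answer.

```r
`%+na%` <- function(x, y) ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))

# Hypothetical columns, including a row where both values are missing
col1 <- c(NA, 0, 1, NA)
col2 <- c(0.5, NA, 2, NA)

# Element-wise: one result per row, NA only when both inputs are NA
op_result <- col1 %+na% col2
print(op_result)
# 0.5 0.0 3.0 NA

# sum() is not element-wise: it returns one grand total
grand_total <- sum(col1, col2, na.rm = TRUE)
print(grand_total)
# 3.5

# rowSums() is row-wise, but reports 0 when both inputs are NA
row_totals <- rowSums(cbind(col1, col2), na.rm = TRUE)
print(row_totals)
# 0.5 0.0 3.0 0.0
```

The last row is the one to watch: the operator keeps it as NA, while the na.rm = TRUE approaches turn it into 0.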

R: How to aggregate with NA values

Since you already load dplyr before creating your data, you can use dplyr::summarise. With group_by, NA values in a grouping column are kept as a group of their own rather than being dropped.

library(dplyr)
df %>%
  group_by(Country, Year, Category, Population) %>%
  summarise(Money = sum(Money))

# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100

Note: The OP's sample data doesn't have multiple rows for the same group. Hence, the number of summarised rows is the same as the number of rows in the original data.
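
For contrast, base R's aggregate() with a formula drops rows whose grouping variables contain NA, which is presumably what prompted the question. The sketch below uses a made-up data frame. PopKey and the "missing" label are invented for illustration as one possible base R workaround.

```r
# Hypothetical data: Population is unknown for country A
df <- data.frame(Country = c("A", "A", "B", "B"),
                 Population = c(NA, NA, 0.5, 0.5),
                 Money = c(100, 1, 50, 50))

# Rows with NA in a grouping variable are silently dropped
dropped <- aggregate(Money ~ Country + Population, data = df, FUN = sum)
print(nrow(dropped))
# 1

# Workaround: turn NA into an explicit label before grouping
df$PopKey <- ifelse(is.na(df$Population), "missing", as.character(df$Population))
kept <- aggregate(Money ~ Country + PopKey, data = df, FUN = sum)
print(nrow(kept))
# 2
```

The group_by and summarise version above needs no such relabelling, because NA is already treated as a valid group.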

Simple proportion error in using the sum() function in Tidyverse in R

Your data is still grouped when mutate runs in the last line, so sum(n) is computed within each group rather than over the whole table.

One way is to ungroup after count

library(dplyr)

gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>%
  ungroup() %>%
  mutate(prop = n / sum(n))

Or a simpler method is to not group at all and use variables in count.

gss_cat %>% 
filter(!is.na(age)) %>%
count(age, marital) %>%
mutate(prop = n/sum(n))

# A tibble: 351 x 4
# age marital n prop
# <int> <fct> <int> <dbl>
# 1 18 Never married 89 0.00416
# 2 18 Married 2 0.0000934
# 3 19 Never married 234 0.0109
# 4 19 Divorced 3 0.000140
# 5 19 Widowed 1 0.0000467
# 6 19 Married 11 0.000514
# 7 20 Never married 227 0.0106
# 8 20 Separated 1 0.0000467
# 9 20 Divorced 2 0.0000934
#10 20 Married 21 0.000981
# … with 341 more rows
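
What the leftover grouping does to the proportion can be sketched in base R, with ave() playing the role of a grouped sum. The small counts table is illustrative only and is not the full gss_cat result.

```r
# Illustrative counts, one row per (age, marital) combination
counts <- data.frame(age = c(18, 18, 19),
                     marital = c("Never married", "Married", "Never married"),
                     n = c(89, 2, 234))

# Still grouped by age and marital: each row is its own group,
# so every count is divided by itself
grouped_prop <- counts$n / ave(counts$n, counts$age, counts$marital, FUN = sum)
print(grouped_prop)
# 1 1 1

# Ungrouped: each count is divided by the overall total
ungrouped_prop <- counts$n / sum(counts$n)
print(sum(ungrouped_prop))
# 1
```

A prop column that is 1 in every row is the usual sign that a grouping is still attached.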

Inconsistency of na.action between xtabs and aggregate in R

It's difficult to give a canonical answer without describing how xtabs works. If we step through the main points of its source code, we'll see clearly what's going on.

After some basic type checking, the call to xtabs works internally by first creating a data frame of all the variables contained in your formula using stats::model.frame, and it is to this that the na.action parameter is passed.

The way it does this is quite clever. xtabs first copies the call you made to it via match.call, like this:

m <- match.call(expand.dots = FALSE)

Then it strips out the parameters that don't need to be passed to stats::model.frame, like this:

m$... <- m$exclude <- m$drop.unused.levels <- m$sparse <- m$addNA <- NULL

As promised in the help file, if addNA is TRUE and na.action is missing, it will now default to na.pass:

if (addNA && missing(na.action))
    m$na.action <- quote(na.pass)

Then it changes the function to be called from xtabs to stats::model.frame like this:

m[[1L]] <- quote(stats::model.frame)

So the object m is a call (and is also a standalone reprex), which in your case looks like this:

stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L, 
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = NULL)

Note that your na.action = NULL has been passed to this call. This has the effect of keeping all NA values in the frame. When the above call is evaluated, it gives this data frame:

eval(m)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>

Note that this is the same result you would get if you passed na.action = na.pass:

stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L, 
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = na.pass)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>

However, if you passed na.action = na.omit, you would only be left with a single row, since only row 2 has no NA values.
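
The call-rewriting pattern described above can be tried in isolation. The following is a simplified sketch, not the xtabs source: my_frame, its extra argument and the data frame d are all hypothetical names invented for the example.

```r
# Toy function that forwards its own call to stats::model.frame
my_frame <- function(formula, data, na.action, extra = NULL) {
  m <- match.call(expand.dots = FALSE)   # copy the call as the user wrote it
  m$extra <- NULL                        # strip what model.frame doesn't need
  m[[1L]] <- quote(stats::model.frame)   # swap the function being called
  eval(m, parent.frame())                # evaluate in the caller's environment
}

# Hypothetical data with one missing predictor value
d <- data.frame(x = c(1, NA, 3), y = c(1, 2, 3))

rows_pass <- nrow(my_frame(y ~ x, d, na.action = na.pass))
rows_null <- nrow(my_frame(y ~ x, d, na.action = NULL))
rows_omit <- nrow(my_frame(y ~ x, d, na.action = na.omit))

print(c(pass = rows_pass, null = rows_null, omit = rows_omit))
# pass null omit
#    3    3    2
```

As in the xtabs case, na.pass and NULL both keep the incomplete row at this stage, and only na.omit removes it.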

In any case, the "model frame" result is stored in the variable mf. This is then split into the independent variable(s) (in your case, column A) and the response variable (in your case, cbind(B, C)).

The response is stored in y and the independent variable(s) in by:

i <- attr(attr(mf, "terms"), "response")
by <- mf[-i]
y <- mf[[i]]

Now, by is processed to ensure each independent variable is a factor, and that any NA values are converted into factor levels if you have specified addNA = TRUE:

by <- lapply(by, function(u) {
    if (!is.factor(u))
        u <- factor(u, exclude = exclude)
    else if (has.exclude)
        u <- factor(as.character(u), levels = setdiff(levels(u),
            exclude), exclude = NULL)
    if (addNA)
        u <- addNA(u, ifany = TRUE)
    u[, drop = drop.unused.levels]
})

Now we come to the crux. The na.action is used again to determine how the NA values in the response variable will be counted. In your case, since you passed na.action = NULL, naAct gets the value stored in getOption("na.action"), which should be "na.omit" unless you have changed it. This in turn causes the variable na.rm to be TRUE:

naAct <- if (!is.null(m$na.action)) {
    m$na.action
} else {
    getOption("na.action", default = quote(na.omit))
}
na.rm <- identical(naAct, quote(na.omit)) || identical(naAct, na.omit) ||
    identical(naAct, "na.omit")

Note that if you had passed na.action = na.pass, then na.rm would be FALSE if you trace this piece of code.
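
The three cases can be traced by wrapping the quoted logic in a small function. This is a sketch: resolve_na_rm is a hypothetical helper rather than part of xtabs, and its argument stands in for m$na.action, which holds the unevaluated expression from the call.

```r
# Mirrors the naAct / na.rm logic quoted above
resolve_na_rm <- function(na_action_expr) {
  naAct <- if (!is.null(na_action_expr)) {
    na_action_expr
  } else {
    getOption("na.action", default = quote(na.omit))
  }
  identical(naAct, quote(na.omit)) || identical(naAct, na.omit) ||
    identical(naAct, "na.omit")
}

# Pin the global option to R's default so the result is reproducible
old <- options(na.action = "na.omit")

null_case <- resolve_na_rm(NULL)            # falls back to the option
pass_case <- resolve_na_rm(quote(na.pass))  # explicit na.pass
omit_case <- resolve_na_rm(quote(na.omit))  # explicit na.omit

options(old)

print(c(null = null_case, pass = pass_case, omit = omit_case))
#  null  pass  omit
#  TRUE FALSE  TRUE
```

The NULL case only gives TRUE because the global option is "na.omit". If that option has been changed, the result changes with it.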

Finally, we come to the section where your xtabs table is built using sum inside a tapply, which is itself inside an lapply.

lapply(as.data.frame(y), tapply, by, sum, na.rm = na.rm, default = 0L)

You can see that the na.rm variable is used to determine whether to remove NAs from the columns before attempting to sum them. The result of this lapply is then coerced into the final cross tab.
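
The effect of na.rm on the counts can be reproduced with tapply alone. The sketch below reuses the A and B values from the model frame shown earlier. Calling tapply directly on one response column is a simplification of what xtabs does internally.

```r
# A with NA promoted to a factor level, as addNA = TRUE would do
A <- addNA(factor(c("Y", "Y", "Z", NA)))
B <- c(NA, TRUE, FALSE, TRUE)

# na.rm = TRUE: the NA in B is dropped before summing, so group Y counts 1
with_rm <- tapply(B, A, sum, na.rm = TRUE)
print(as.vector(with_rm))
# 1 0 1

# na.rm = FALSE: a single NA in B makes the whole count for group Y NA
without_rm <- tapply(B, A, sum, na.rm = FALSE)
print(as.vector(without_rm))
# NA 0 1
```

The groups are Y, Z and NA, in that order. Only the Y group differs between the two calls, because it is the only one whose B values include an NA.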


So how does this answer your question?

The documentation is right that if you set addNA = TRUE and don't pass an na.action, it will default to na.pass. However, the na.action is used in two places: once in the call to model.frame and once to determine the value of na.rm. It is very clear from the source code that if na.action is na.pass, then na.rm will be FALSE, so you will miss out on the counts of any response groups containing NA values. This is the opposite of what is written in the help file.

The only way round this is to pass na.action = NULL, since this will allow model.frame to keep NA values, but will also cause sum to be called with na.rm = TRUE (provided the global na.action option is still at its default).


TL;DR The documentation for xtabs is wrong on this point.


