dplyr expression summing sum(!is.na(Field1) + !is.na(Field2)...) giving wrong number
I have a sneaking suspicion this has to do with the precedence of the ! and + operators and has little to nothing to do with dplyr itself. See this previous post: Behavior of summing is.na results
I can thus make it work using summarise by adding some extra parentheses:
# %.% was dplyr's original chaining operator; current versions use %>%
df %>%
group_by(id,date) %>%
summarise(new=
(!is.na(Field1)) + (!is.na(Field2)) + (!is.na(Field3)) +
(!is.na(Field4)) + (!is.na(Field5))
) %>%
arrange(id,date)
#Source: local data frame [9 x 3]
#Groups: id
#
# id date new
#1 1 2005 0
#2 1 2006 3
#3 1 2007 2
#4 2 2005 4
#5 2 2006 2
#6 2 2007 3
#7 3 2005 1
#8 3 2006 3
#9 3 2007 3
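To see the precedence issue in isolation, here is a small base R sketch. The vectors a and b are made-up illustrations, not the asker's data. Because ! binds more loosely than +, the unparenthesised form is parsed as !(is.na(a) + !is.na(b)), which yields a logical vector rather than a count.

```r
# Hypothetical toy vectors (not from the original question)
a <- c(1, NA, 3)
b <- c(4, 5, NA)

# Without parentheses, R parses this as !(is.na(a) + !is.na(b))
unparenthesised <- !is.na(a) + !is.na(b)

# With parentheses, each !is.na() is evaluated first, then the results are added
parenthesised <- (!is.na(a)) + (!is.na(b))

print(unparenthesised)  # a logical vector, not a count
print(parenthesised)    # number of non-missing values at each position
```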
Custom sum function in dplyr returns inconsistent results
The issue seems to be with dplyr determining the column type from the first returned result. If you force the NA value, which is by default a logical value, to be an NA_real_ or NA_integer_, then you will be sorted:
##Just to show what NA normally does first:
class(NA)
#[1] "logical"
sum0 <- function(x, ...){
# remove NAs unless all are NA
if(is.na(mean(x, na.rm=TRUE))) return(NA_real_)
else(sum(x, ..., na.rm=TRUE))
}
dta %>%
group_by(year) %>%
summarize(rrconf=sum0(rrconf), enrolled=sum0(enrolled))
#Source: local data frame [7 x 3]
#
# year rrconf enrolled
#1 2007 79 NA
#2 2008 NA NA
#3 2009 474 458
#4 2010 2792 1222
#5 2011 1686 1155
#6 2012 3313 1906
#7 2013 3456 2184
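The type difference can be checked without dplyr. The following is a minimal sketch that re-declares sum0 and calls it on made-up vectors (the inputs are assumptions for illustration only):

```r
sum0 <- function(x, ...) {
  # Return a numeric NA when every value is missing, otherwise sum the rest
  if (is.na(mean(x, na.rm = TRUE))) return(NA_real_)
  sum(x, ..., na.rm = TRUE)
}

all_missing  <- sum0(c(NA, NA))      # every value is missing
some_missing <- sum0(c(10, NA, 5))   # missing values are dropped before summing

print(class(NA))           # a bare NA is logical
print(class(all_missing))  # the forced NA_real_ is numeric, like the other sums
print(some_missing)
```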
Sum two dataframes with NA values and factors
Base R version:
library(dplyr) # only for pipe operator
rbind(data1, data2) %>%
split(.$NAMES) %>%
lapply(function(x){
data.frame(NAMES = unique(x$NAMES),as.list(colSums(x[,-1])))
}) %>%
do.call(rbind, .)
# NAMES X1 X2
# name1 name1 5 NA
# name2 name2 NA 22
# name3 name3 9 24
Notice that NAMES now also appears as rownames. This is because split outputs a named list. You can either keep the rownames and remove NAMES = unique(x$NAMES), or add an unname() pipe after split:
rbind(data1, data2) %>%
split(.$NAMES) %>%
lapply(function(x){
data.frame(as.list(colSums(x[,-1])))
}) %>%
do.call(rbind, .)
# X1 X2
# name1 5 NA
# name2 NA 22
# name3 9 24
rbind(data1, data2) %>%
split(.$NAMES) %>%
unname() %>%
lapply(function(x){
data.frame(NAMES = unique(x$NAMES),as.list(colSums(x[,-1])))
}) %>%
do.call(rbind, .)
# NAMES X1 X2
# 1 name1 5 NA
# 2 name2 NA 22
# 3 name3 9 24
To treat NA's as zeros, just add na.rm = TRUE to colSums:
rbind(data1, data2) %>%
split(.$NAMES) %>%
unname() %>%
lapply(function(x){
data.frame(NAMES = unique(x$NAMES),as.list(colSums(x[,-1], na.rm = TRUE)))
}) %>%
do.call(rbind, .)
# NAMES X1 X2
# 1 name1 5 10
# 2 name2 0 22
# 3 name3 9 24
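Since data1 and data2 are not shown in this answer, here is a self-contained sketch of the same split / colSums idea without the pipe. The two data frames below are hypothetical inputs invented for illustration, and the result is kept as a matrix with the names as rownames.

```r
# Hypothetical inputs, chosen only for illustration
data1 <- data.frame(NAMES = c("name1", "name2", "name3"),
                    X1 = c(1, NA, 4), X2 = c(NA, 10, 12))
data2 <- data.frame(NAMES = c("name1", "name2", "name3"),
                    X1 = c(4, NA, 5), X2 = c(10, 12, 12))

combined <- rbind(data1, data2)

# One small data frame of numeric columns per name
by_name <- split(combined[, -1], combined$NAMES)

# Column sums per name: NA propagates unless na.rm = TRUE is passed on to colSums
strict  <- do.call(rbind, lapply(by_name, colSums))
lenient <- do.call(rbind, lapply(by_name, colSums, na.rm = TRUE))

print(strict)
print(lenient)
```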
dplyr + purrr version:
library(purrr)
library(dplyr)
list(data1, data2) %>%
reduce(function(x, y) cbind(NAMES = x$NAMES, x[,-1] + y[-1]))
Result:
NAMES X1 X2
1 name1 5 NA
2 name2 NA 22
3 name3 9 24
To treat NA's as zero:
list(data1, data2) %>%
map(function(x){
modify_if(x, is.numeric, function(y) ifelse(is.na(y), 0, y))
}) %>%
reduce(function(x, y) cbind(NAMES = x$NAMES, x[,-1] + y[-1]))
Result:
NAMES X1 X2
1 name1 5 10
2 name2 0 22
3 name3 9 24
Important Note:
Replacing NA's with zeros is often a bad idea since they mean different things. NA could mean that the data is missing, not necessarily zero, so replacing all NA's with zeros could bias your results. Please only do it if you are sure that NA's mean zero in the context of your data.
Additional Notes:
- Both map and modify_if are from the purrr package. map applies a function to each element of a list and always returns a list. modify does the same except that it returns the same type as the input. modify_if only "maps" the elements that satisfy a condition.
- In the first pipe, I used map to "map" each element of list(data1, data2) with the modify_if function, while modify_if replaces NA's with zeros for each numeric column only. This way I can use the + operator in the next pipe without worrying about NA's.
- reduce does matrix addition on data1 and data2, then cbinds it with the NAMES column from data1.
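For readers without purrr, the same zero-fill-then-add idea can be sketched in base R, with lapply standing in for map / modify_if and Reduce standing in for reduce. This is a swapped-in base R variant, not the answer's own code; the data frames and the helper zero_na are hypothetical.

```r
# Hypothetical inputs, chosen only for illustration
data1 <- data.frame(NAMES = c("name1", "name2", "name3"),
                    X1 = c(1, NA, 4), X2 = c(NA, 10, 12))
data2 <- data.frame(NAMES = c("name1", "name2", "name3"),
                    X1 = c(4, NA, 5), X2 = c(10, 12, 12))

# Hypothetical helper: replace NA with 0 in numeric columns only
zero_na <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(col) ifelse(is.na(col), 0, col))
  df
}

cleaned <- lapply(list(data1, data2), zero_na)

# Add the numeric columns element-wise and keep NAMES from the first data frame
total <- Reduce(function(x, y) data.frame(NAMES = x$NAMES, x[-1] + y[-1]), cleaned)
print(total)
```

Like the purrr version, this assumes both data frames list the same names in the same row order.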
Sum NA cases in dplyr's summarise
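The reason for this behavior is that we assigned Endemic as the name of a new summarised variable. Later expressions in the same summarise call then see that new, already-summed Endemic rather than the original column, so sum(is.na(Endemic)) no longer counts the missing values in the data. Instead, give the sum a different column name and rename it afterwards: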
mydata %>%
group_by(Group, Scenario, year, random) %>%
summarise(All = n(),
EndemicS = sum(Endemic, na.rm = TRUE),
noEndemic = sum(is.na(Endemic))) %>%
rename(Endemic = EndemicS)
# A tibble: 3 x 7
# Groups: Group, Scenario, year [3]
# Group Scenario year random All Endemic noEndemic
# <fctr> <fctr> <dbl> <chr> <int> <dbl> <int>
#1 Amphibians Present 1940 obs 6 3 3
#2 Amphibians RCP 4.5 1940 obs 6 3 3
#3 Amphibians RCP 8.5 1940 obs 6 3 3
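The effect is easy to reproduce on a tiny data frame. The sketch below uses a made-up toy table (not the asker's mydata) and assumes dplyr is installed. It also shows a variation that is not part of the answer above: computing the NA count before the column name is reused.

```r
library(dplyr)

# Hypothetical toy data: group "a" has one NA, group "b" has one NA
toy <- data.frame(g = c("a", "a", "b"), Endemic = c(1, NA, NA))

# Reusing the name first: is.na() then sees the summed value, not the raw column
shadowed <- toy %>%
  group_by(g) %>%
  summarise(Endemic = sum(Endemic, na.rm = TRUE),
            noEndemic = sum(is.na(Endemic)))

# Counting the NAs before the name is overwritten gives the intended result
reordered <- toy %>%
  group_by(g) %>%
  summarise(noEndemic = sum(is.na(Endemic)),
            Endemic = sum(Endemic, na.rm = TRUE))

print(shadowed)
print(reordered)
```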
Using `:=` in data.table to sum the values of two columns in R, ignoring NAs
It's not a lack of understanding of data.table but rather one regarding vectorized functions in R. You can define a dyadic operator that will behave differently than the "+" operator with regard to missing values:
`%+na%` <- function(x,y) {ifelse( is.na(x), y, ifelse( is.na(y), x, x+y) )}
mat[ , col3:= col1 %+na% col2]
#-------------------------------
col1 col2 col3
1: NA 0.003745 0.003745
2: 0.000000 0.007463 0.007463
3: -0.015038 -0.007407 -0.022445
4: 0.003817 -0.003731 0.000086
5: -0.011407 -0.007491 -0.018898
You can also follow mrdwad's comment and do it with sum(..., na.rm=TRUE), evaluated row by row:
mat[ , col4 := sum(col1, col2, na.rm=TRUE), by=1:NROW(mat)]
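The two approaches differ when both values in a row are missing. Here is a small base R sketch on made-up vectors (not the asker's mat) that compares the custom operator with a vectorised na.rm sum; rowSums is used here as an assumed stand-in for the per-row sum call above.

```r
`%+na%` <- function(x, y) {
  # If one side is NA, return the other side; otherwise add them
  ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))
}

# Hypothetical toy vectors; the last position is missing on both sides
x <- c(NA, 1, 2, NA)
y <- c(3, NA, 4, NA)

op_result  <- x %+na% y                         # stays NA when both are NA
sum_result <- rowSums(cbind(x, y), na.rm = TRUE) # becomes 0 when both are NA

print(op_result)
print(sum_result)
```

Which behaviour is preferable depends on whether a row with no data at all should count as zero or remain missing.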
R: How to aggregate with NA values
Since you have already loaded dplyr before creating your data, you can use the dplyr::summarise function. Grouping with group_by followed by summarise keeps NA as a valid group value.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for the same group. Hence, the number of summarised rows is the same as the number of original rows.
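To make the handling of NA groups concrete, here is a sketch on a made-up three-row table (not the OP's data) that compares the formula interface of aggregate with its default na.action against group_by plus summarise. It assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical toy data: Population is missing for country A
toy <- data.frame(Country = c("A", "A", "B"),
                  Population = c(NA, NA, 0.5),
                  Money = c(1, 2, 3))

# The formula method drops rows with NA before aggregating (default na.action)
base_res <- aggregate(Money ~ Country + Population, data = toy, FUN = sum)

# group_by keeps NA as its own group value
dplyr_res <- toy %>%
  group_by(Country, Population) %>%
  summarise(Money = sum(Money), .groups = "drop")

print(base_res)
print(dplyr_res)
```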
Simple proportion error in using the sum() function in Tidyverse in R
Your data is still grouped when you use mutate in the last line.
One way is to ungroup after count:
library(dplyr)
gss_cat %>%
filter(!is.na(age)) %>%
group_by(age, marital) %>%
count() %>%
ungroup %>%
mutate(prop = n/sum(n))
Or a simpler method is to not group at all and pass the variables to count:
gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
mutate(prop = n/sum(n))
# A tibble: 351 x 4
# age marital n prop
# <int> <fct> <int> <dbl>
# 1 18 Never married 89 0.00416
# 2 18 Married 2 0.0000934
# 3 19 Never married 234 0.0109
# 4 19 Divorced 3 0.000140
# 5 19 Widowed 1 0.0000467
# 6 19 Married 11 0.000514
# 7 20 Never married 227 0.0106
# 8 20 Separated 1 0.0000467
# 9 20 Divorced 2 0.0000934
#10 20 Married 21 0.000981
# … with 341 more rows
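The difference between the grouped and ungrouped denominators can be seen on a four-row table. This sketch uses made-up data rather than gss_cat and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with three distinct (age, marital) combinations
toy <- data.frame(age = c(18, 18, 18, 19),
                  marital = c("a", "a", "b", "a"))

# Still grouped by age and marital: sum(n) is taken within each single-row group
grouped <- toy %>%
  group_by(age, marital) %>%
  count() %>%
  mutate(prop = n / sum(n))

# Not grouped: sum(n) is the overall total
ungrouped <- toy %>%
  count(age, marital) %>%
  mutate(prop = n / sum(n))

print(grouped)
print(ungrouped)
```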
Inconsistency of na.action between xtabs and aggregate in R
It's difficult to give a canonical answer without describing how xtabs works. If we step through the main points of its source code, we'll see clearly what's going on.
After some basic type checking, the call to xtabs works internally by first creating a data frame of all the variables contained in your formula using stats::model.frame, and it is to this that the na.action parameter is passed.
The way it does this is quite clever. xtabs first copies the call you made to it via match.call, like this:
m <- match.call(expand.dots = FALSE)
Then it strips out the parameters that don't need to be passed to stats::model.frame, like this:
m$... <- m$exclude <- m$drop.unused.levels <- m$sparse <- m$addNA <- NULL
As promised in the help file, if addNA is TRUE and na.action is missing, it will now default to na.pass:
if (addNA && missing(na.action))
m$na.action <- quote(na.pass)
Then it changes the function to be called from xtabs to stats::model.frame, like this:
m[[1L]] <- quote(stats::model.frame)
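The call-rewriting pattern described in these steps can be sketched in a few lines. The function rewrite_call and its sparse argument are hypothetical stand-ins, not part of xtabs; the point is only to show match.call, argument removal, and swapping the function name.

```r
# Hypothetical wrapper that mimics the call-rewriting steps described above
rewrite_call <- function(formula, data, na.action, sparse = FALSE) {
  m <- match.call(expand.dots = FALSE)   # capture the call exactly as written
  m$sparse <- NULL                       # drop an argument model.frame does not take
  m[[1L]] <- quote(stats::model.frame)   # swap the function being called
  m                                      # return the unevaluated call
}

# Nothing is evaluated here, so y, x and d do not need to exist
m <- rewrite_call(y ~ x, data = d, na.action = NULL, sparse = TRUE)
print(m)
```

Note that an explicit na.action = NULL survives in the rewritten call, which is what matters later in this answer.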
So the object m is a call (and is also a standalone reprex), which in your case looks like this:
stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L,
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = NULL)
Note that your na.action = NULL has been passed to this call. This has the effect of keeping all NA values in the frame. When the above call is evaluated, it gives this data frame:
eval(m)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>
Note that this is the same result you would get if you passed na.action = na.pass:
stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L,
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = na.pass)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>
However, if you passed na.action = na.omit, you would only be left with a single row, since only row 2 has no NA values.
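The row counts can be checked directly. This sketch rebuilds the data shown in the call above as a data frame named dat (the name is an assumption) and compares the two na.action settings:

```r
# The same values as in the call above, stored in a data frame named dat
dat <- data.frame(A = factor(c("Y", "Y", "Z", NA)),
                  B = c(NA, TRUE, FALSE, TRUE),
                  C = c(TRUE, TRUE, NA, FALSE))

# na.pass keeps every row, including those with NA
mf_pass <- stats::model.frame(cbind(B, C) ~ A, data = dat, na.action = na.pass)

# na.omit drops any row with an NA in A, B or C
mf_omit <- stats::model.frame(cbind(B, C) ~ A, data = dat, na.action = na.omit)

print(nrow(mf_pass))
print(nrow(mf_omit))
```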
In any case, the "model frame" result is stored in the variable mf. This is then split into the independent variable(s) (in your case, column A) and the response variable (in your case, cbind(B, C)).
The response is stored in y and the independent variable(s) in by:
i <- attr(attr(mf, "terms"), "response")
by <- mf[-i]
y <- mf[[i]]
Now, by is processed to ensure each independent variable is a factor, and that any NA values are converted into factor levels if you have specified addNA = TRUE:
by <- lapply(by, function(u) {
if (!is.factor(u))
u <- factor(u, exclude = exclude)
else if (has.exclude)
u <- factor(as.character(u), levels = setdiff(levels(u),
exclude), exclude = NULL)
if (addNA)
u <- addNA(u, ifany = TRUE)
u[, drop = drop.unused.levels]
})
Now we come to the crux. The na.action is used again to determine how the NA values in the response variable will be counted. In your case, since you passed na.action = NULL, you will see that naAct gets the value stored in getOption("na.action"), which, if you have never changed it, should be set to na.omit. This in turn causes the variable na.rm to be TRUE:
naAct <- if (!is.null(m$na.action)) {
m$na.action
}else {getOption("na.action", default = quote(na.omit))}
na.rm <- identical(naAct, quote(na.omit)) || identical(naAct,
na.omit) || identical(naAct, "na.omit")
Note that if you had passed na.action = na.pass, then na.rm would be FALSE if you trace this piece of code.
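The tracing can be done with a small helper that wraps the two statements quoted above. The function na_rm_for is hypothetical and only mirrors that logic; its argument plays the role of m$na.action, which holds the unevaluated expression from the call.

```r
# Hypothetical helper mirroring the naAct / na.rm logic quoted above
na_rm_for <- function(na_action_expr) {
  naAct <- if (!is.null(na_action_expr)) na_action_expr else getOption("na.action", default = quote(na.omit))
  identical(naAct, quote(na.omit)) || identical(naAct, na.omit) ||
    identical(naAct, "na.omit")
}

rm_for_pass <- na_rm_for(quote(na.pass))  # explicit na.pass
rm_for_omit <- na_rm_for(quote(na.omit))  # explicit na.omit
rm_for_null <- na_rm_for(NULL)            # falls back to getOption("na.action")

print(c(pass = rm_for_pass, omit = rm_for_omit, null = rm_for_null))
```

The NULL case depends on the session's na.action option, so its result is only TRUE when that option is still na.omit.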
Finally, we come to the section where your xtabs table is built using sum inside a tapply, which is itself inside an lapply.
lapply(as.data.frame(y), tapply, by, sum, na.rm = na.rm, default = 0L)
You can see that the na.rm variable is used to determine whether to remove NAs from the columns before attempting to sum them. The result of this lapply is then coerced into the final cross tab.
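The effect of na.rm on a single response column can be seen with tapply alone. The vectors below are made up for illustration and are simpler than the asker's data:

```r
# Hypothetical response column and grouping factor
y  <- c(NA, TRUE, FALSE, TRUE)
by <- factor(c("Y", "Y", "Z", "Z"))

# With na.rm = TRUE the NA in group Y is skipped
with_rm <- tapply(y, by, sum, na.rm = TRUE, default = 0L)

# With na.rm = FALSE the NA makes the whole sum for group Y missing
without_rm <- tapply(y, by, sum, na.rm = FALSE, default = 0L)

print(with_rm)
print(without_rm)
```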
So how does this answer your question?
The documentation is right that if you don't pass an na.action (and addNA is TRUE), it will default to na.pass. However, the na.action is used in two places: once in the call to model.frame and once to determine the value of na.rm. It is very clear from the source code that if na.action is na.pass, then na.rm will be FALSE, so you will miss out on the counts of any response groups containing NA values. This is the opposite of what is written in the help file.
The only way round this is to pass na.action = NULL, since this allows model.frame to keep NA values, but also causes sum to be called with na.rm = TRUE.
TL;DR The documentation for xtabs is wrong on this point.