Ignore NA in dplyr Row Sum

ignore NA in dplyr row sum

You could use this:

library(dplyr)

data %>%
  # rowwise() makes sure the sum operation occurs on each row
  rowwise() %>%
  # then a simple sum(..., na.rm = TRUE) is enough to get what you need
  mutate(sum = sum(a, b, c, na.rm = TRUE))

Output:

Source: local data frame [4 x 4]
Groups: <by row>

      a     b     c   sum
  (dbl) (dbl) (dbl) (dbl)
1     1     4     7    12
2     2    NA     8    10
3     3     5     9    17
4     4     6    NA    10
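
As a possible alternative to rowwise(), the same result can be sketched with rowSums(), which is already vectorised over rows. This assumes dplyr 1.0 or later (for across()); the data frame below is reconstructed from the output shown above and is not taken from the original question.

```r
library(dplyr)

# Data reconstructed from the printed output above (an assumption)
data <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(4, NA, 5, 6),
  c = c(7, 8, 9, NA)
)

# across() selects the columns; rowSums() sums them row by row,
# dropping NAs because na.rm = TRUE
result <- data %>%
  mutate(sum = rowSums(across(c(a, b, c)), na.rm = TRUE))

result$sum
```

On large data frames this tends to be faster than rowwise(), since the summation happens in one vectorised call.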

Calculating Sum Column and Ignoring NA

Add na.rm = TRUE to the call, i.e.

rowSums(x, na.rm = TRUE)

where x is the data frame or matrix whose rows you are summing.
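
A minimal sketch of the difference na.rm makes, using a made-up data frame (df and its columns are hypothetical, not from the question):

```r
# Hypothetical data; the last row contains only NAs
df <- data.frame(a = c(1, 2, NA), b = c(NA, 5, NA))

# Default behaviour: any NA in a row makes that row's sum NA
strict <- rowSums(df)

# With na.rm = TRUE the NAs are dropped before summing;
# note that a row made up only of NAs then sums to 0
totals <- rowSums(df, na.rm = TRUE)
```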

Sum values from rows ignoring certain values in R

One way to do it in base:

rowSums(dta[, 2:4] * (dta[, 2:4] < 7))

# [1] 0 4 2 2 NA 9

Adding an explanation, following @tjebo's comment:

  • dta[, 2:4] < 7 produces a logical matrix in which each cell is TRUE if the original value is less than 7 and FALSE otherwise. This takes a single expression because the comparison is vectorized;
  • Then you multiply the original values by that logical matrix. Under the hood, R converts logical values to numeric ones, so every FALSE becomes 0 and every TRUE becomes 1. In other words, each original value is multiplied by 1 if it is less than 7 and by 0 otherwise;
  • Since NA < 7 produces NA, and multiplying by NA produces NA as well, the original NAs are preserved;
  • The last step is to call rowSums() on the result, which sums the values in each row. Since the values that are 7 or greater have been turned into 0s, they drop out of the sum;
  • If you want a sum for every row in which at least one value is not NA, pass na.rm = TRUE to rowSums(). In that case, however, rows that contain only NAs get 0.
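
The steps above can be traced on a small example. The data frame here is invented for illustration (the original dta is not shown), so the numbers differ from the output quoted earlier.

```r
# Hypothetical data: an id column followed by three numeric columns
dta <- data.frame(
  id = 1:4,
  x  = c(1, 8, NA, NA),
  y  = c(2, 3, 9, NA),
  z  = c(7, 4, 5, NA)
)

vals <- dta[, 2:4]

# Step 1: TRUE where the value is below 7, NA where the value is NA
mask <- vals < 7

# Step 2: TRUE/FALSE are coerced to 1/0, so values >= 7 become 0
masked <- vals * mask

# Step 3: sum per row; a row with any NA stays NA
strict <- rowSums(masked)

# Optional: drop NAs instead; an all-NA row then sums to 0
lenient <- rowSums(masked, na.rm = TRUE)
```

With this data, strict is NA for the last two rows, while lenient returns 5 for the third row and 0 for the all-NA fourth row.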

Ignoring NA when summing multiple columns with dplyr

The problem with your rowSums is the reference to DF (which is undefined). This works:

mutate(iris, sum2 = rowSums(cbind(Sepal.Length, Petal.Length), na.rm = TRUE))

For a difference, you could of course negate one column: rowSums(cbind(Sepal.Length, -Petal.Length), na.rm = TRUE)

The general solution is to use ifelse or similar to set the missing values to 0 (or whatever else is appropriate):

mutate(iris, sum2 = Sepal.Length + ifelse(is.na(Petal.Length), 0, Petal.Length))

More efficient than ifelse would be an implementation of coalesce, see examples here. This uses @krlmlr's answer from the previous link (see bottom for the code or use the kimisc package).

mutate(iris, sum2 = Sepal.Length + coalesce.na(Petal.Length, 0))

To replace missing values data-set wide, there is replace_na in the tidyr package.
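
Current versions of dplyr also ship their own coalesce(), which may be used in the same way without any helper function. The sketch below assumes dplyr is installed; since iris has no missing values, iris_na is a hypothetical copy with a few NAs injected for illustration.

```r
library(dplyr)

# Hypothetical copy of the first rows of iris with two values set to NA
iris_na <- head(iris, 4)
iris_na$Petal.Length[c(2, 4)] <- NA

# coalesce() replaces each NA in Petal.Length with 0 before adding
res <- mutate(iris_na, sum2 = Sepal.Length + coalesce(Petal.Length, 0))

res$sum2
```

Rows 2 and 4 fall back to Sepal.Length alone, because their missing Petal.Length is treated as 0.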


@krlmlr's coalesce.na, as found here

coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}

Sum 2 columns, ignore NA, except when both are NA

I used the following. It gives sums even when there are NAs, but returns NA when all summed elements are NA.

rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0)
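
The one-liner relies on the fact that in R NA ^ 0 is 1 while NA ^ 1 is NA. A step-by-step sketch with a made-up df (the data is hypothetical):

```r
# Hypothetical data; the last row has no non-missing values
df <- data.frame(x = c(1, NA, NA), y = c(2, 3, NA))

# TRUE only for rows in which every value is NA
all_missing <- rowSums(!is.na(df)) == 0

# NA ^ FALSE is NA ^ 0, which is 1; NA ^ TRUE is NA ^ 1, which is NA
multiplier <- NA ^ all_missing

# Multiplying by 1 keeps the sum; multiplying by NA turns it into NA
result <- rowSums(df, na.rm = TRUE) * multiplier
```

Only the all-NA row ends up as NA; the partially missing second row still gets the sum of its available values.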

Ignore NA in vector sum

You can try rowSums with na.rm = TRUE (as @akrun said in the comment) like below

data$cat <- rowSums(data[-1] * c(0.05, 0.05, 0.05)[col(data[-1])], na.rm = TRUE)

which gives

> data
   id v1 v2  v3   cat
1   1  1 78 101  9.00
2   1  2 85  NA  4.35
3   1  5 56 452 25.65
4   1  4 47  NA  2.55
5   1 58 12  NA  3.50
6   1  6  3  45  2.70
7   1  4 65   7  3.80
8   1  9 98  56  8.15
9   2  1 78 101  9.00
10  2  2 85  NA  4.35
11  2  5 56 452 25.65
12  2  4 47  NA  2.55
13  2 58 12  NA  3.50
14  2  6  3  45  2.70
15  2  4 65   7  3.80
16  2  9 98  56  8.15
17  3  1 78 101  9.00
18  3  2 85  NA  4.35
19  3  5 56 452 25.65
20  3  4 47  NA  2.55
21  3 58 12  NA  3.50
22  3  6  3  45  2.70
23  3  4 65   7  3.80
24  3  9 98  56  8.15
25  4  1 78 101  9.00
26  4  2 85  NA  4.35
27  4  5 56 452 25.65
28  4  4 47  NA  2.55
29  4 58 12  NA  3.50
30  4  6  3  45  2.70
31  4  4 65   7  3.80
32  4  9 98  56  8.15
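
The indexing with col() is what lines each weight up with its column. A small sketch of that step on its own, using only the first three rows of the data above (the names w, m and cat_sum are introduced here for illustration):

```r
# First three rows of the data shown above
data <- data.frame(
  id = c(1, 1, 1),
  v1 = c(1, 2, 5),
  v2 = c(78, 85, 56),
  v3 = c(101, NA, 452)
)

w <- c(0.05, 0.05, 0.05)   # one weight per value column
m <- as.matrix(data[-1])   # drop id, keep v1..v3

# col(m) holds the column index of every cell, so w[col(m)]
# repeats each weight down its own column
w_cells <- w[col(m)]

# Weighted row sums, skipping missing cells
cat_sum <- rowSums(m * w_cells, na.rm = TRUE)
```

With equal weights the result is the same as multiplying the row sum by 0.05; the col() construction matters once the weights differ per column.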

How to exclude NA values from being counted in dplyr summarize()?

Comparing with NA using == returns NA, and subsetting a vector with an NA index returns NA, so the NA elements end up being counted by length().

Check this example :

x <- c(1:3, NA, 2:3, NA)
length(x)
#[1] 7

x == 3
#[1] FALSE FALSE TRUE NA FALSE TRUE NA
x[x == 3]
#[1] 3 NA 3 NA
length(x[x == 3])
#[1] 4

Here you expected the output to be 2, but you get 4 because of the NA values. You could use:

length(na.omit(x[x == 3])) 
#[1] 2

but that is convoluted. Use sum on the logical values instead:

sum(x == 3, na.rm = TRUE)
#[1] 2

So try :

library(dplyr)

t1 %>%
  group_by(year) %>%
  mutate(YES = sum(characteristic == "1", na.rm = TRUE),
         NO = sum(characteristic == "0", na.rm = TRUE))
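
If one row per year is wanted rather than the counts repeated on every row, the same expressions can be placed in summarise(). This is only a sketch: the t1 below is a hypothetical stand-in, since the original data is not shown.

```r
library(dplyr)

# Hypothetical stand-in for t1
t1 <- data.frame(
  year = c(2001, 2001, 2001, 2002, 2002),
  characteristic = c("1", "0", NA, "1", "1")
)

# sum() over a logical vector counts the TRUEs;
# na.rm = TRUE keeps the NA comparisons out of the count
counts <- t1 %>%
  group_by(year) %>%
  summarise(
    YES = sum(characteristic == "1", na.rm = TRUE),
    NO  = sum(characteristic == "0", na.rm = TRUE)
  )
```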

rowSums but keeping NA values

If you have a variable number of columns you could try this approach:

mm <- merge(dd1, dd2)
mm$m <- rowSums(mm, na.rm = TRUE) * ifelse(rowSums(is.na(mm)) == ncol(mm), NA, 1)
# or, as @JoshuaUlrich commented:
#mm$m <- ifelse(apply(is.na(mm), 1, all), NA, rowSums(mm, na.rm = TRUE))
tail(mm, 10)
#                 dd1        dd2        m
#2013-08-02        NA         NA       NA
#2013-08-03        NA         NA       NA
#2013-08-04        NA         NA       NA
#2013-08-05 1.2542692 -1.2542692 0.000000
#2013-08-06        NA  1.3325804 1.332580
#2013-08-07        NA  0.7726740 0.772674
#2013-08-08 0.8158402 -0.8158402 0.000000
#2013-08-09        NA  1.2292919 1.229292
#2013-08-10        NA         NA       NA
#2013-08-11        NA  0.9334900 0.933490

Sum of two Columns of Data Frame with NA Values

dat$e <- rowSums(dat[,c("b", "c")], na.rm=TRUE)
dat
#   a  b c d e
# 1 1  2 3 4 5
# 2 5 NA 7 8 7

