Using `:=` in data.table to sum the values of two columns in R, ignoring NAs

The issue isn't really about data.table; it's about how vectorized functions behave in R: `+` returns NA whenever either operand is NA. You can define a binary operator that treats missing values differently from `+`:

# NA-aware addition: the result is NA only when both x and y are NA
`%+na%` <- function(x, y) ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))

mat[ , col3:= col1 %+na% col2]
#-------------------------------
col1 col2 col3
1: NA 0.003745 0.003745
2: 0.000000 0.007463 0.007463
3: -0.015038 -0.007407 -0.022445
4: 0.003817 -0.003731 0.000086
5: -0.011407 -0.007491 -0.018898

Alternatively, following mrdwad's comment, you can use sum(..., na.rm=TRUE) row by row (note that this returns 0, not NA, when both values are NA):

mat[ , col4 := sum(col1, col2, na.rm=TRUE), by=1:NROW(mat)]
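For anyone who wants to run the two approaches side by side, here is a minimal sketch. The values in `mat` are made up for illustration (they are not the original poster's data), and the last row is NA in both columns to show where the two results differ.

```r
library(data.table)

# NA-aware addition: the result is NA only when both operands are NA
`%+na%` <- function(x, y) ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))

# Hypothetical data, invented for this example
mat <- data.table(col1 = c(NA, 0, -0.015, NA),
                  col2 = c(0.004, 0.007, NA, NA))

# Vectorized: one call over the whole columns
mat[, col3 := col1 %+na% col2]

# Row by row: sum() is called once per row via by=
mat[, col4 := sum(col1, col2, na.rm = TRUE), by = 1:NROW(mat)]

print(mat)
```

On the all-NA row, col3 stays NA while col4 becomes 0, so pick whichever behaviour you actually need.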

Sum 2 columns, ignore NA, except when both are NA

I used the following. It gives sums even when there are NAs, but returns NA when all summed elements are NA.

rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0)
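A small sketch of why that one-liner works, using a made-up `df` (an assumption for illustration): NA^0 is 1 and NA^1 is NA in R, so the multiplier leaves ordinary rows alone and turns all-NA rows into NA.

```r
# Hypothetical data frame; the last row is entirely NA
df <- data.frame(x = c(1, NA, NA), y = c(2, 3, NA))

# TRUE only for rows with no non-NA value
all_na <- rowSums(!is.na(df)) == 0

# NA^FALSE is NA^0 = 1, NA^TRUE is NA^1 = NA
res <- rowSums(df, na.rm = TRUE) * NA^all_na
print(res)
```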

Sum of two Columns of Data Frame with NA Values

dat$e <- rowSums(dat[,c("b", "c")], na.rm=TRUE)
dat
# a b c d e
# 1 1 2 3 4 5
# 2 5 NA 7 8 7

Skip NAs when using Reduce() in data.table

Consider this example:

library(data.table)
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
dt

# a b c
#1: 1 3 NA
#2: 2 NA 1
#3: 3 1 NA
#4: 4 2 3
#5: 5 4 4

If you want a cumulative sum in which NA entries carry the previous running total forward, you can treat NAs as 0:

dt[, names(dt) := lapply(.SD, function(x) cumsum(replace(x, is.na(x), 0))),
   .SDcols = names(dt)]
dt

# a b c
#1: 1 3 0
#2: 3 3 1
#3: 6 4 1
#4: 10 6 4
#5: 15 10 8

If you want to keep NA as NA (starting again from the original dt):

dt[, names(dt) := lapply(.SD, function(x) {
  x1 <- cumsum(replace(x, is.na(x), 0))
  x1[is.na(x)] <- NA
  x1
}), .SDcols = names(dt)]

dt

# a b c
#1: 1 3 NA
#2: 3 NA 1
#3: 6 4 NA
#4: 10 6 4
#5: 15 10 8
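Since the question title mentions Reduce(), here is a rough sketch of the same NA-skipping running total written with base R's Reduce(). The helper name `skip_na_add` is made up for this example; the point is only that it agrees with the cumsum(replace(...)) version above on a single vector.

```r
# Column c from the example above
x <- c(NA, 1, NA, 3, 4)

# Hypothetical helper: keep the accumulator unchanged when the next value is NA
skip_na_add <- function(acc, v) if (is.na(v)) acc else acc + v

# init = 0 guards against a leading NA; [-1] drops the init value from the output
via_reduce <- Reduce(skip_na_add, x, init = 0, accumulate = TRUE)[-1]

# The vectorized version used in the answer
via_cumsum <- cumsum(replace(x, is.na(x), 0))

print(via_reduce)
print(via_cumsum)
```

The cumsum() version is vectorized and will generally be the better choice on long columns; the Reduce() form is mainly useful when the combining step is not a plain sum.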

Summing across rows of a data.table for specific columns with NA

There are several options here: either compute rowSums first and then replace the result with NA for rows where every value is NA, or build an index in i so that the sum is computed only for rows with at least one non-NA value.

library(data.table)
TEST[, SumAbundance := replace(rowSums(.SD, na.rm = TRUE),
                               Reduce(`&`, lapply(.SD, is.na)), NA),
     .SDcols = 4:6]

Or a slightly more compact option:

TEST[, SumAbundance := (NA^!rowSums(!is.na(.SD))) *
       rowSums(.SD, na.rm = TRUE), .SDcols = 4:6]

Or define a function and reuse it:

rowSums_new <- function(dat) {
  fifelse(rowSums(is.na(dat)) != ncol(dat), rowSums(dat, na.rm = TRUE), NA_real_)
}
TEST[, SumAbundance := rowSums_new(.SD), .SDcols = 4:6]
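To check that the three variants agree, here is a minimal sketch on a made-up `TEST` table. The column names and values are assumptions for illustration; only the layout (value columns in positions 4 to 6) follows the answer.

```r
library(data.table)

# Hypothetical table: three id columns followed by three abundance columns
TEST <- data.table(site = c("A", "B", "C"), year = 2020L, plot = 1:3,
                   sp1 = c(1, NA, NA),
                   sp2 = c(2, 5, NA),
                   sp3 = c(NA, NA, NA_real_))

rowSums_new <- function(dat) {
  fifelse(rowSums(is.na(dat)) != ncol(dat), rowSums(dat, na.rm = TRUE), NA_real_)
}

# Variant 1: sum first, then blank out the all-NA rows
TEST[, s1 := replace(rowSums(.SD, na.rm = TRUE),
                     Reduce(`&`, lapply(.SD, is.na)), NA), .SDcols = 4:6]
# Variant 2: multiply by NA^TRUE (NA) or NA^FALSE (1)
TEST[, s2 := (NA^!rowSums(!is.na(.SD))) * rowSums(.SD, na.rm = TRUE), .SDcols = 4:6]
# Variant 3: the reusable helper
TEST[, s3 := rowSums_new(.SD), .SDcols = 4:6]

print(TEST)
```

All three columns should match: the partially missing rows are summed over their non-NA values and the all-NA row stays NA.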

Summing many columns with data.table in R, remove NA

First, create the object variables for the names in use:

colsToSum <- names(dt1)  # or whatever you need
summedNms <- paste0( "y", seq_along(colsToSum) )

If you'd like to copy the result to a new data.table:

dt2 <- dt1[, lapply(.SD, sum, na.rm=TRUE), .SDcols=colsToSum]
setnames(dt2, summedNms)

If, alternatively, you'd like to append the columns to the original:

dt1[, c(summedNms) := lapply(.SD, sum, na.rm=TRUE), .SDcols=colsToSum]

As for a general na.rm mechanism, there isn't one specific to data.table, but have a look at ?na.omit and ?na.exclude.
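A runnable sketch of the steps above, assuming a small made-up `dt1` (the column names x1 and x2 are invented for this example). It also shows what na.omit() does for comparison.

```r
library(data.table)

# Hypothetical input; any data.table with numeric columns would do
dt1 <- data.table(x1 = c(1, NA, 3), x2 = c(NA, 2, 4))

colsToSum <- names(dt1)
summedNms <- paste0("y", seq_along(colsToSum))

# Column totals with NAs dropped, returned as a new one-row data.table
dt2 <- dt1[, lapply(.SD, sum, na.rm = TRUE), .SDcols = colsToSum]
setnames(dt2, summedNms)
print(dt2)

# na.omit() instead drops every row that contains an NA
complete <- na.omit(dt1)
print(complete)
```

Note the difference: sum(..., na.rm = TRUE) ignores NAs column by column, while na.omit() discards whole rows, so the two can give different totals.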

Sum values from rows ignoring certain values in R

One way to do it in base:

rowSums(dta[, 2:4] * (dta[, 2:4] < 7))

# [1] 0 4 2 2 NA 9

Adding an explanation, as suggested in @tjebo's comment:

  • dta[, 2:4] < 7 produces a logical matrix in which each entry is TRUE if the corresponding value is less than 7 and FALSE otherwise. This takes a single line because the comparison is vectorized;
  • Then you multiply the original values by that logical matrix. Under the hood, R converts logical values to numeric, so every FALSE becomes 0 and every TRUE becomes 1. In other words, each original value is multiplied by 1 if it is less than 7 and by 0 otherwise;
  • Since NA < 7 produces NA, and multiplying by NA produces NA as well, the original NAs are preserved;
  • The last step is to call rowSums() on the resulting data frame, which sums the values in each row. Since the values that are 7 or greater have been turned into 0s, they are excluded from the sum;
  • If you want a sum for the rows where at least one value is not NA, you can pass na.rm = TRUE to rowSums(). In that case, however, rows that contain only NAs will give 0.
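The steps above can be traced on a tiny made-up data frame (the `dta` below is an assumption for illustration, with an id column followed by three value columns):

```r
# Hypothetical data: an id column followed by three value columns
dta <- data.frame(id = 1:3,
                  v1 = c(2, 8, NA),
                  v2 = c(9, 1, 3),
                  v3 = c(4, 7, 5))

vals <- dta[, 2:4]
mask <- vals < 7        # logical matrix; NA wherever vals is NA
masked <- vals * mask   # values of 7 or more become 0, NAs stay NA

strict <- rowSums(masked)                 # NA if the row has any NA
lenient <- rowSums(masked, na.rm = TRUE)  # NAs are skipped

print(strict)
print(lenient)
```

Row 3 has a missing value, so it is NA in the first result and the sum of its remaining values in the second.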

How to sum values from two adjacent columns in a data.frame in R but keep 0s as such?

Here's a pretty simple way. We do a cumulative sum by row, and multiply by the original data frame -- multiplying by 0 zeros out the 0 entries, and multiplying by 1 keeps the summed entries as-is. Since you have quotes around your numbers making them character class, we start by converting all your columns to numeric:

df[] = lapply(df, as.numeric)
result = t(apply(df, 1, cumsum)) * df
result
# Year1 Year2 Year3 Year4 Year5 Year6
# 1 1 2 3 0 0 0
# 2 0 1 2 3 0 0
# 3 0 1 2 3 4 0
# 4 0 0 1 2 3 0
# 5 0 0 1 2 3 0
# 6 0 0 0 1 2 0

