Efficiently counting non-NA elements in data.table

Yes, the third option seems to be the best one. I've added another, which is valid only if you are willing to change the key of your data.table from id to var, but option 3 is still the fastest on your data.

library(microbenchmark)
library(data.table)

dt <- data.table(id = (1:100)[sample(10, size = 1e6, replace = TRUE)],
                 var = c(1, 0, NA)[sample(3, size = 1e6, replace = TRUE)],
                 key = "var")

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)

microbenchmark(times = 10L,
               dt1[!is.na(var), .N, by = id][, max(N, na.rm = TRUE), by = id],
               dt2[, length(var[!is.na(var)]), by = id],
               dt3[, sum(!is.na(var)), by = id],
               dt4[.(c(1, 0)), .N, id, nomatch = 0L])
# Unit: milliseconds
#                                                            expr      min       lq      mean    median        uq       max neval
# dt1[!is.na(var), .N, by = id][, max(N, na.rm = TRUE), by = id] 95.14981 95.79291 105.18515 100.16742 112.02088 131.87403    10
#                       dt2[, length(var[!is.na(var)]), by = id] 83.17203 85.91365  88.54663  86.93693  89.56223 100.57788    10
#                                dt3[, sum(!is.na(var)), by = id] 45.99405 47.81774  50.65637  49.60966  51.77160  61.92701    10
#                          dt4[.(c(1, 0)), .N, id, nomatch = 0L] 78.50544 80.95087  89.09415  89.47084  96.22914 100.55434    10
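As a sanity check (a minimal sketch, assuming the dt2/dt3 copies created above), the grouped counts should agree once both results are ordered by id:

r2 <- dt2[, length(var[!is.na(var)]), by = id][order(id)]
r3 <- dt3[, sum(!is.na(var)), by = id][order(id)]
all.equal(r2, r3)  # expect TRUE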

Counting non-NAs in a data frame; getting the answer as a vector

Try this:

# define "demo" dataset
ZZZ <- data.frame(n=c(1,2,NA),m=c(6,NA,NA),o=c(7,8,8))
# apply the counting function per columns
apply(ZZZ, 2, function(x) length(which(!is.na(x))))

Running it gives:

> apply(ZZZ, 2, function(x) length(which(!is.na(x))))
n m o
2 1 3

If you really insist on returning a vector, you might use as.vector, e.g. by defining this function:

nonNAs <- function(x) {
  as.vector(apply(x, 2, function(x) length(which(!is.na(x)))))
}

You could simply run nonNAs(ZZZ):

> nonNAs(ZZZ)
[1] 2 1 3
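A shorter equivalent, assuming the same ZZZ as above: colSums() treats TRUE as 1, so it can count the non-NAs directly, and unname() strips the names if you want a bare vector.

> colSums(!is.na(ZZZ))
n m o
2 1 3
> unname(colSums(!is.na(ZZZ)))
[1] 2 1 3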

Count non-NA values by group

You can use this:

mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))

# A tibble: 2 x 2
   col_1 non_na_count
  <fctr>        <int>
1      A            1
2      B            2
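For comparison, a data.table version of the same grouped count might look like this (a sketch assuming mydf has columns col_1 and col_2 as above):

library(data.table)
setDT(mydf)[, .(non_na_count = sum(!is.na(col_2))), by = col_1]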

Get value of last non-NA row per column in data.table

If the dataset is a data.table, loop through the Subset of Data.table (.SD), subset the non-NA elements (x[!is.na(x)]), and extract the last one with tail.

df1[, lapply(.SD, function(x) tail(x[!is.na(x)], 1))]
#     a  b c
# 1: 63 57 4
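To make this reproducible, here is a small df1 that yields the output above (the column values are made up for illustration):

library(data.table)
df1 <- data.table(a = c(10, 63, NA),
                  b = c(57, NA, NA),
                  c = c(NA, 2, 4))
df1[, lapply(.SD, function(x) tail(x[!is.na(x)], 1))]
#     a  b c
# 1: 63 57 4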

Count number of rows that are not NA

After grouping by the columns of interest, get the sum of a logical vector as the count: is.na(valor) returns a logical vector with TRUE where there is an NA and FALSE otherwise; negate it (!) to reverse the values, and take the sum, so that each TRUE (counted as 1) represents one non-NA element.

library(dplyr)
df1 %>%
  group_by(id_station, id_parameter, year, day, month) %>%
  summarise(Count = sum(!is.na(valor)))
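A base R sketch of the same count, assuming the same column names: note that aggregate()'s formula interface drops rows with NA in any model variable by default, so na.action = NULL is needed to keep those rows around for counting.

aggregate(valor ~ id_station + id_parameter + year + day + month,
          data = df1, FUN = function(x) sum(!is.na(x)), na.action = NULL)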

Counting the NA's in a part of a row in data.table

Using data.table, you could do this:

df[, NonNA := sum(!is.na(questionA), !is.na(questionB), !is.na(questionC)), by = .(nr)]

A base solution:

df$nonNA <- rowSums(!is.na(df[,c("questionA", "questionB", "questionC")]))
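If there are many question columns, spelling each one out gets tedious. A sketch that combines both ideas, assuming df is a data.table and the columns share the "question" prefix:

cols <- grep("^question", names(df), value = TRUE)
df[, NonNA := rowSums(!is.na(.SD)), .SDcols = cols]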

How to 'count' number of non-empty values in a single row across multiple columns in a dataframe

Missing values in R are represented by the capitalized NA, not na; otherwise R treats the value as a string, which is not empty.

Also, I have artificially included a Name column in your df so that each row represents one Name, and an artificial Comp5 that contains some NAs but is not included in the calculation.

rowSums(), as its name suggests, calculates the sum of each row.

is.na(df[, 2:4]) restricts the count to the NAs in columns 2 through 4 of df.

df <- read.table(header = TRUE, text = "
Name Comp1 Comp2 Comp3 Comp4 Comp5
A    0.5   0.4   NA    0.6   NA
B    0.6   NA    NA    0.7   1
C    NA    0.4   NA    1.1   NA")

df$Count_NA <- rowSums(is.na(df[, 2:4]))

Output

  Name Comp1 Comp2 Comp3 Comp4 Comp5 Count_NA
1    A   0.5   0.4    NA   0.6    NA        1
2    B   0.6    NA    NA   0.7     1        2
3    C    NA   0.4    NA   1.1    NA        2
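If you want the number of non-empty values instead, negate the logical matrix (same df as above):

df$Count_nonNA <- rowSums(!is.na(df[, 2:4]))
df$Count_nonNA
# [1] 2 1 1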

Count number of non-NA values greater than 0 by group

We can use

colSums(df[c("L2", "L3", "L4")] > 0, na.rm = TRUE)

Or you may want a sum per person:

m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)

#      L2 L3 L4
# Carl  1  1  2
# Joe   1  2  1

There is something fun here. df[c("L2", "L3", "L4")] > 0 is a logical matrix (with NAs):

  • Although colSums can work with it without trouble, rowsum cannot. So a fix is to add a 0 to this matrix to cast it to a 0-1 numeric matrix;
  • when adding this 0, we must write (df[c("L2", "L3", "L4")] > 0) + 0, not df[c("L2", "L3", "L4")] > 0 + 0. Operator precedence in R means + binds tighter than >. Try this toy example:

    5 > 4 + 2   ## FALSE, since this parses as 5 > (4 + 2)
    (5 > 4) + 2 ## 3, since (5 > 4) is TRUE, i.e. 1

    So we need parentheses to evaluate > first, then +.

If you want the result to be a data frame, just cast the resulting matrix into a data frame by:

data.frame(m)

Follow-up

People stopped responding because your specific question about writing a function is less interesting than producing the summary dataset.

Well, if you still want to take my approach, I would define such a function as:

extract <- function(person) {
  m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
  rowSums(m)[[person]]
}

Then you can call

extract("Joe")
# 4
extract("Carl")
# 4

Note, this is obviously not the most efficient way to write such a function, because if you only want the sum for one person, there is no need to process all the data. We can do:

extract2 <- function(person) {
  ## subset data
  sub <- subset(df, df$Name == person, select = c("L2", "L3", "L4"))
  ## get sum
  sum(sub > 0, na.rm = TRUE)
}

Then you can call

extract2("Joe")
# 4
extract2("Carl")
# 4

Subsetting a data.table using != some non-NA value excludes NA too

To provide a solution to your question:

You should use %in%. It gives you back a logical vector.

a %in% ""
# [1] FALSE TRUE FALSE

x[!a %in% ""]
# a
# 1: 1
# 2: NA
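To see the difference directly (a reconstruction; the question's column a presumably holds something like c(1, "", NA)):

a <- c("1", "", NA)
a != ""    # TRUE FALSE NA    -- comparison against NA yields NA
a %in% ""  # FALSE TRUE FALSE -- %in% never returns NA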

To find out why this is happening in data.table:

(as opposed to data.frame)

If you look at the data.table source code, in the file data.table.R under the function "[.data.table", there's a set of if-statements that check the i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is evaluated using "eval", producing a logical vector

    # Part (3)
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i, NA)) i = NA_integer_
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE
    }
    ....
}

To explain the discrepancy, I've pasted the important piece of code here and marked it into three parts.

First, why doesn't dt[a != ""] work as expected (by the OP)?

Part 1 evaluates to an object of class call. The if statement in part 2 returns FALSE, since there is no ! prefix. The call is then evaluated to give c(TRUE, FALSE, NA), and part 3 is executed: NA is replaced with FALSE (the last line of the logical branch).
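In effect, part 3 performs this replacement on the evaluated logical vector:

i <- c(TRUE, FALSE, NA)  # the evaluated a != ""
i[is.na(i)] <- FALSE     # what part 3 does
i
# [1]  TRUE FALSE FALSE  -- the NA row is silently dropped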

why does x[!(a== "")] work as expected (by the OP)?

Part 1 returns a call once again, but this time part 2 evaluates to TRUE and therefore sets:

1) `notjoin = TRUE`
2) `isub <- isub[[2L]]` # which is equal to (a == "") without the ! (exclamation)

That is where the magic happens: the negation has been stripped off for now. Remember, this is still an object of class call, so it gets evaluated (using eval) to a logical vector again; (a == "") evaluates to c(FALSE, TRUE, NA).

Now this is checked by is.logical in part 3, so NA gets replaced with FALSE, giving c(FALSE, TRUE, FALSE). Later, which(c(FALSE, TRUE, FALSE)) is executed, resulting in 2. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1, 3) is returned. So x[!(a == "")] basically returns x[c(1, 3)], which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows != 0L]
    # WHERE THE MAGIC HAPPENS (returns c(1, 3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}
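A minimal sketch of what that not-join path computes for this example:

i <- c(FALSE, TRUE, FALSE)   # (a == "") after NA -> FALSE
irows <- which(i)            # 2
seq_len(3)[-irows]           # c(1, 3): the rows x[!(a == "")] returns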

Given that, I think there are some inconsistencies in the syntax. If I manage to find time to formulate the problem, I'll write a post soon.


