Select NA in a data.table in R


Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)

benchmark(DT["a",],
DT[is.na(x),],
replications=20)
# test replications elapsed relative user.self sys.self user.child
# 1 DT["a", ] 20 9.18 1.000 7.31 1.83 NA
# 2 DT[is.na(x), ] 20 10.55 1.149 8.69 1.85 NA

===

Addition from Matthew (won't fit in a comment):

The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).

benchmark(DT["a",],  # repeat select of large subset on my netbook
DT[is.na(x),],
replications=3)
# test replications elapsed relative user.self sys.self
# DT["a", ] 3 2.406 1.000 2.357 0.044
# DT[is.na(x), ] 3 3.876 1.611 3.812 0.056

benchmark(DT["a",which=TRUE], # isolate search time
DT[is.na(x),which=TRUE],
replications=3)
# test replications elapsed relative user.self sys.self
# DT["a", which = TRUE] 3 0.492 1.000 0.492 0.000
# DT[is.na(x), which = TRUE] 3 2.941 5.978 2.932 0.004

As the size of the returned subset decreases (e.g. with more groups), the difference becomes apparent. A vector scan on a single column isn't too bad, but on 2 or more columns it degrades quickly.
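To illustrate the two-column case (an added sketch, not from the original answer; it reuses the DT defined above and re-keys it on both columns): the keyed lookup is still a single binary search, while the equivalent filter must compare both columns in full.

setkey(DT, x, y)
DT[.("a", 1), which=TRUE]           # one binary search over both key columns
DT[x == "a" & y == 1, which=TRUE]   # two full-length vector scans plus an AND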

Maybe it should be possible to join to NAs. I seem to remember a gotcha with that, though. There's some history linked from FR#1043, "Allow or disallow NA in keys?". It mentions that NA_integer_ is internally a negative integer, which trips up radix/counting sort (IIRC), making setkey slower. But it's on the list to revisit.

R data.table, select columns with no NA

Try:

example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
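A quick sketch of what this does; the name example comes from the original answer, but the data here are made up:

library(data.table)
example <- data.table(a = 1:3, b = c(1, NA, 3), c = c("x", "y", "z"))

# In j, list elements that are NULL are dropped, so every column
# containing at least one NA (here: b) vanishes from the result
example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
#    a c
# 1: 1 x
# 2: 2 y
# 3: 3 z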

Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable

Here are some data.table-based solutions.

setDT(df_id_year_and_type)

method 1

na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups.
And by joining them (using the last match: mult="last"), we obtain the desired output.

na.omit(df_id_year_and_type, cols="type")[
  unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
  on=c('id', 'year'),
  mult="last"]

# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>

method 2

# .I holds row numbers; cumsum(!is.na(type)) peaks at the last non-NA row
# of each (id, year) group (or at the first row if the group is all NA)
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]

method 3

(likely slower because of the per-group `[` overhead)

df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]

Selecting rows with at least one missing value (NA) in a data.table

Use complete.cases and negate it.

myDT <- DT[!complete.cases(V1,V2), ]
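For instance, with hypothetical data (DT and the columns V1/V2 are assumed from the question):

library(data.table)
DT <- data.table(V1 = c(1, NA, 3), V2 = c(4, NA, NA), V3 = 7:9)

# complete.cases(V1, V2) is TRUE where both columns are non-NA;
# negating it keeps the rows with at least one NA in V1 or V2
myDT <- DT[!complete.cases(V1, V2), ]
myDT
#    V1 V2 V3
# 1: NA NA  8
# 2:  3 NA  9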

Data.table replace sequence of values with NA

Try:

dt[2:5, (specific_column) := NA]
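A self-contained illustration with made-up data (specific_column is a placeholder for your own column name). The parentheses around specific_column make := treat it as a variable holding the column name, rather than as a column literally called "specific_column":

library(data.table)
dt <- data.table(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

specific_column <- "value"            # hypothetical column name
dt[2:5, (specific_column) := NA]      # rows 2 to 5 become NA

dt$value
# [1] 10 NA NA NA NA 60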

Select only rows whose value in a particular column is NA in R

You can also do it without subset(). To select NA values, use the function is.na().

data[is.na(data$ColWtCL_6),]

Or with subset()

subset(data,is.na(ColWtCL_6))

How to use i in data.tables to select rows of all columns based on a conditional

Literal implementation:

toy[rowSums(sapply(toy, is.na)) == ncol(toy), ]
# A1 B1 C1 D1 E1
# 1: NA NA NA NA NA

toy[rowSums(toy == 1) == ncol(toy),]
# A1 B1 C1 D1 E1
# 1: 1 1 1 1 1

A slight variation removes the call to ncol(toy), though expecting a speed improvement from this is probably wishful thinking:

toy[rowSums(sapply(toy, Negate(is.na))) == 0, ]
toy[rowSums(toy != 1) == 0,]
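For reference, a hypothetical toy consistent with the outputs above could look like this (only the all-1 and all-NA rows are determined by the outputs; the third row is filler):

library(data.table)
toy <- data.table(A1 = c(1, NA, 2), B1 = c(1, NA, 3), C1 = c(1, NA, NA),
                  D1 = c(1, NA, 4), E1 = c(1, NA, 5))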

Select set of columns so that each row has at least one non-NA entry

Using a while loop, this greedy approach should work to get a small set of variables with at least one non-NA per row.

best <- function(df){
  # start with the column that has the most non-NA values
  best <- which.max(colSums(!is.na(df)))
  # keep adding columns until every row has at least one non-NA entry
  while(any(rowSums(!is.na(df[best])) == 0)){
    # among the rows still uncovered, pick the column with the most non-NA values
    rest <- df[rowSums(!is.na(df[best])) == 0, , drop = FALSE]
    best <- c(best, which.max(sapply(rest, \(x) sum(!is.na(x)))))
  }
  best
}
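For reference, a hypothetical df consistent with the output below (the output only pins down columns c and d; a and b are assumed filler columns with more NAs):

df <- data.frame(a = c(NA, NA, NA, 1, NA),
                 b = c(1, NA, NA, NA, NA),
                 c = c(1, NA, NA, NA, 1),
                 d = c(1, 1, 1, 1, NA))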

testing

best(df)
#d c
#4 3

df[best(df)]
# d c
#1 1 1
#2 1 NA
#3 1 NA
#4 1 NA
#5 NA 1

First, select the column with the fewest NAs (stored in best). Then repeatedly add the column with the most non-NA values among the rows not yet covered (the rows where the best columns are still all NA), until every row has at least one non-NA entry.


