Select NA in a data.table in R
Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice this may not matter much:
library(data.table)
library(rbenchmark)
DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)
benchmark(DT["a",],
DT[is.na(x),],
replications=20)
# test replications elapsed relative user.self sys.self user.child
# 1 DT["a", ] 20 9.18 1.000 7.31 1.83 NA
# 2 DT[is.na(x), ] 20 10.55 1.149 8.69 1.85 NA
===
Addition from Matthew (won't fit in a comment):
The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).
benchmark(DT["a",], # repeat select of large subset on my netbook
DT[is.na(x),],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", ] 3 2.406 1.000 2.357 0.044
DT[is.na(x), ] 3 3.876 1.611 3.812 0.056
benchmark(DT["a",which=TRUE], # isolate search time
DT[is.na(x),which=TRUE],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", which = TRUE] 3 0.492 1.000 0.492 0.000
DT[is.na(x), which = TRUE] 3 2.941 5.978 2.932 0.004
As the size of the subset returned decreases (e.g. by adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns they quickly degrade.
Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. Here's some history linked from FR#1043 "Allow or disallow NA in keys?". It mentions there that NA_integer_ is internally a negative integer, which trips up radix/counting sort (IIRC), resulting in setkey going slower. But it's on the list to revisit.
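For what it's worth, NA has since been allowed in keys, so the two forms can be compared directly. A minimal sketch, assuming a reasonably recent data.table where NA_character_ is a valid key value:

```r
library(data.table)

DT <- data.table(x = c("a", "b", NA, NA), v = 1:4)
setkey(DT, x)                  # NAs sort first in the key

scan <- DT[is.na(x)]           # vector scan
join <- DT[.(NA_character_)]   # keyed join to NA (binary search)
# both return the same two rows
```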
R data.table, select columns with no NA
Try:
example[,lapply(.SD, function(x) {if(anyNA(x)) {NULL} else {x}} )]
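A reproducible sketch with a hypothetical table (a NULL returned from the lapply drops that column from the result):

```r
library(data.table)

# hypothetical data: columns a and c contain NAs, b does not
example <- data.table(a = c(1, NA, 3), b = 1:3, c = c("x", "y", NA))

res <- example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
# only column b remains
```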
Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable
Here are some data.table-based solutions.
setDT(df_id_year_and_type)
method 1
na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type. unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups. By joining them (using the last match: mult="last"), we obtain the desired output.
na.omit(df_id_year_and_type, cols="type"
)[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
on=c('id', 'year'),
mult="last"]
# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>
method 2
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
method 3
(likely slower because of [ overhead)
df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
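The grouping trick behind methods 2 and 3 can be checked on a small hypothetical table (names and values invented for illustration): within each group, which.max(cumsum(!is.na(type))) lands on the last non-NA row, or on the first row when the whole group is NA.

```r
library(data.table)

# hypothetical duplicated (id, year) groups, some with NA type
df <- data.table(
  id   = c(1, 1, 2, 3, 3),
  year = c(2002, 2002, 2008, 2013, 2013),
  type = c(NA, "A", "B", NA, NA)
)

# per group: last non-NA row, or first row if the group is all NA
res <- df[df[, .I[which.max(cumsum(!is.na(type)))], by = .(id, year)]$V1]
# keeps (1, 2002, "A"), (2, 2008, "B"), (3, 2013, NA)
```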
Selecting rows with at least one missing value (NA) in a data.table
Use complete.cases and just take its opposite.
myDT <- DT[!complete.cases(V1,V2), ]
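A minimal sketch with invented values (V1/V2 follow the answer's naming); complete.cases takes the columns directly inside [:

```r
library(data.table)

DT <- data.table(V1 = c(1, NA, 3), V2 = c("a", "b", NA))

# rows with at least one NA across V1 and V2
myDT <- DT[!complete.cases(V1, V2), ]
# keeps rows 2 and 3
```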
Data.table replace sequence of values with NA
Try:
dt[2:5, (specific_column) := NA]
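A sketch with a hypothetical table and column name, showing why the parentheses matter: without them, := would create a column literally named specific_column instead of updating the intended one.

```r
library(data.table)

dt <- data.table(a = 1:6, b = letters[1:6])
specific_column <- "a"   # column name held in a variable (assumption)

# () forces evaluation of the variable as a column name
dt[2:5, (specific_column) := NA]
# dt$a is now 1 NA NA NA NA 6
```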
Select only rows if its value in a particular column is 'NA' in R
You can also do it without subset(). To select NA values you should use the function is.na().
data[is.na(data$ColWtCL_6),]
Or with subset()
subset(data,is.na(ColWtCL_6))
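Both forms on a hypothetical data frame (ColWtCL_6 is the question's column name; the values are invented):

```r
# hypothetical data frame
data <- data.frame(ColWtCL_6 = c(1.2, NA, 3.4), id = 1:3)

r1 <- data[is.na(data$ColWtCL_6), ]    # base indexing
r2 <- subset(data, is.na(ColWtCL_6))   # same result via subset()
```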
How to use i in data.tables to select rows of all columns based on a conditional
Literal implementation:
toy[rowSums(sapply(toy, is.na)) == ncol(toy), ]
# A1 B1 C1 D1 E1
# 1: NA NA NA NA NA
toy[rowSums(toy == 1) == ncol(toy),]
# A1 B1 C1 D1 E1
# 1: 1 1 1 1 1
A slight improvement, removing a call to ncol(toy), though expecting a speed-up from this is probably wishful thinking:
toy[rowSums(sapply(toy, Negate(is.na))) == 0, ]
toy[rowSums(toy != 1) == 0,]
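A hypothetical toy consistent with the outputs above (row 1 all NA, row 2 all 1). Note that in rowSums(toy == 1), rows containing an NA yield NA, which data.table's logical i treats as FALSE, so they drop out:

```r
library(data.table)

toy <- data.table(A1 = c(NA, 1, 2), B1 = c(NA, 1, NA),
                  C1 = c(NA, 1, 3), D1 = c(NA, 1, 4), E1 = c(NA, 1, 5))

all_na  <- toy[rowSums(sapply(toy, is.na)) == ncol(toy), ]  # the all-NA row
all_one <- toy[rowSums(toy == 1) == ncol(toy), ]            # the all-1 row
```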
Select set of columns so that each row has at least one non-NA entry
Using a while loop, this should work to get a (greedy, near-minimal) set of variables with at least one non-NA per row.
best <- function(df){
best <- which.max(colSums(sapply(df, complete.cases)))
while(any(rowSums(sapply(df[best], complete.cases)) == 0)){
best <- c(best, which.max(sapply(df[rowSums(sapply(df[best], complete.cases)) == 0, ], \(x) sum(complete.cases(x)))))
}
best
}
testing
best(df)
#d c
#4 3
df[best(df)]
# d c
#1 1 1
#2 1 NA
#3 1 NA
#4 1 NA
#5 NA 1
First, select the column with the fewest NAs (stored in best). Then repeatedly add the column with the most non-NA values among the rows still uncovered (rows where every column in best is NA), until every row has at least one complete case.
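A self-contained sketch, with hypothetical data chosen to match the output shown above (column d has the fewest NAs; column c covers the one remaining row). The uncovered-rows test is written with rowSums so it also works once best holds more than one column:

```r
# hypothetical data: d is complete except row 5, c covers row 5
df <- data.frame(
  a = c(1, NA, NA, NA, NA),
  b = c(NA, 1, NA, NA, NA),
  c = c(1, NA, NA, NA, 1),
  d = c(1, 1, 1, 1, NA)
)

best <- function(df) {
  # start with the column having the most non-NA entries
  best <- which.max(colSums(sapply(df, complete.cases)))
  # add columns until every row has at least one non-NA entry
  while (any(rowSums(sapply(df[best], complete.cases)) == 0)) {
    uncovered <- rowSums(sapply(df[best], complete.cases)) == 0
    best <- c(best, which.max(sapply(df[uncovered, ], \(x) sum(complete.cases(x)))))
  }
  best
}

out <- best(df)
out
# d c
# 4 3
```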