Subset Data.Table by Logical Column

Subset data.table by logical column

From ?data.table

Advanced: When i is a single variable name, it is not considered an
expression of column names and is instead evaluated in calling scope.

So dt[x] will try to evaluate x in the calling scope (in this case the global environment)

You can get around this by wrapping the name in (, { or force, so that i is a call rather than a bare name:

dt[(x)]
dt[{x}]
dt[force(x)]
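To make the scoping rule concrete, here is a minimal sketch. The table dt, its columns and the calling-scope variable x are made up for illustration; only the dt[(x)], dt[{x}] and dt[force(x)] forms come from the answer above.

```r
library(data.table)

# Hypothetical data: a logical column named x
dt <- data.table(id = 1:4, x = c(TRUE, FALSE, TRUE, FALSE))

# A variable with the same name in the calling scope
x <- 2L

dt[x]         # i is a bare name, so the calling-scope x is used: row 2
dt[(x)]       # i is a call, so the column x is used: rows 1 and 3
dt[{x}]       # same rows as dt[(x)]
dt[force(x)]  # same rows as dt[(x)]
```

If no variable called x existed in the calling scope, the bare form dt[x] would have nothing to evaluate, which is why the wrapped forms are the safer habit for logical columns.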

data.table subsetting rows using a logical column: why do I have to explicitly compare with TRUE?

Use this instead:

DT[(bmask), .(out=number)]
#    out
# 1:   2
# 2:   4

The role of the parentheses is to put the symbol bmask inside of a function call, from whose evaluation environment the columns of the DT will be visible¹. Any other function call that simply returns bmask's value (e.g. c(bmask), I(bmask), or bmask==TRUE) or the indices of its true elements (e.g. which(bmask)) will work just as well but may take slightly longer to compute.

If bmask is not located inside a function call, it will be searched for in calling scope (here the global environment), which can also be handy at times. Here's the relevant explanation from ?data.table:

Advanced: When 'i' is a single variable name, it is not
considered an expression of column names and is instead
evaluated in calling scope.

¹To see that () is itself a function call, type is(`(`).
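The snippet above does not include its input, so here is a small reconstruction. The contents of DT are an assumption chosen to reproduce the printed output; the point is only to show that the function-call forms mentioned in the answer select the same rows.

```r
library(data.table)

# Assumed input: rows 2 and 4 are flagged by the logical column bmask
DT <- data.table(number = 1:5, bmask = c(FALSE, TRUE, FALSE, TRUE, FALSE))

r_paren <- DT[(bmask), .(out = number)]        # parentheses
r_which <- DT[which(bmask), .(out = number)]   # indices of the TRUE elements
r_equal <- DT[bmask == TRUE, .(out = number)]  # explicit comparison

r_paren
```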

Efficient way to subset data.table based on value in any of selected columns

One option is to specify the columns of interest in .SDcols, loop over the Subset of Data.table (.SD) to generate a list of logical vectors (one per column), Reduce that list to a single logical vector with |, and use the result to subset the rows.

i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
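Because dt, cols and test1 come from the question and are not shown here, the following sketch fills them in with made-up values. It assumes test1 was built by writing out the comparison for each column by hand.

```r
library(data.table)

# Hypothetical data and column selection
dt <- data.table(a = c(10, 1, 2), b = c(3, 10, 4), c = c(10, 10, 10))
cols <- c("a", "b")

# Baseline: write out the condition for each column
test1 <- dt[a == 10 | b == 10]

# One logical vector per column in cols, combined element-wise with |
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]

all.equal(test1, test2)
```

Column c is full of 10s but is not listed in cols, so it does not influence which rows are kept.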

Select columns in data.table based on logical vector

Update 2020-04-22

In current CRAN version of data.table, DT[ , c(TRUE, TRUE, FALSE)] would work -- no need for with=FALSE. Leaving this older answer here for posterity:

We need with=FALSE

DT[, c(TRUE, TRUE, FALSE), with=FALSE]

Based on the documentation in ?data.table

By default with=TRUE and j is evaluated within the frame of x; column
names can be used as variables. When with=FALSE j is a character
vector of column names or a numeric vector of column positions to
select, and the value returned is always a data.table. with=FALSE is
often useful in data.table to select columns dynamically.
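As a sketch of the dynamic case mentioned in the documentation excerpt: when the logical vector is computed and stored in a variable, passing with = FALSE makes it explicit that the variable holds a column selection. The table and the name keep are illustrative assumptions.

```r
library(data.table)

# Hypothetical table with mixed column types
DT <- data.table(a = 1:3, b = letters[1:3], c = c(1.5, 2.5, 3.5))

# One TRUE/FALSE per column: keep the numeric columns only
keep <- unname(sapply(DT, is.numeric))

numeric_only <- DT[, keep, with = FALSE]
numeric_only   # columns a and c
```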

Subsetting multiple columns of a data.table with the same column name

You can pass a logical vector to select columns.

library(data.table)
dt[, names(dt) == 'a', with = FALSE]

#   a a
#1: 1 7
#2: 2 8
#3: 3 9
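The answer does not show how dt was built. A minimal reconstruction, assuming a table in which two columns are both named a, could look like this:

```r
library(data.table)

# Assumed input: data.table() keeps duplicate names when check.names is FALSE (the default)
dt <- data.table(a = 1:3, b = 4:6, a = 7:9)

# TRUE for every column called "a", so both of them are returned
both_a <- dt[, names(dt) == "a", with = FALSE]
both_a
```

Selecting by name (for example dt[, "a"]) cannot address both columns at once, which is why the logical vector is useful here.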

Subset data.table based on different kind of values in column of type list

Here is an answer which, while not exhaustive and certainly not the only one possible, shows how to subset rows of a data.table for each kind of value in a column of type list.

Subset DT rows containing NULL in 'ColofTypeList':

DT[lapply(ColofTypeList, is.null)==TRUE]

Subset DT rows containing NAs in 'ColofTypeList':

DT[lapply(ColofTypeList, is.na)==TRUE]

Subset DT rows containing the character 'hello' in 'ColofTypeList':

DT[lapply(ColofTypeList, grepl, pattern="hello")==TRUE]

Subset DT rows containing the integer 5 in 'ColofTypeList':

DT[lapply(ColofTypeList, match,5)==1]

So in all of these cases, the key is to use lapply() directly on the column of type list.
While the first three cases return a logical (boolean) value, the fourth uses match(), which returns the position of the match in its second argument (here always 1, because that argument is the single value 5) or NA when there is no match. That is why I wrote ==1.

I find this way of writing quite elegant and simple for subsetting rows.

Again, I might miss something already documented somewhere else for subsetting rows of a data.table based on content of a column of type list, but I did not find much about these cases.

So I hope this will be useful to those facing this.

Please share your solutions if you already had to deal with this to see if there could be a simpler way.
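For readers who want something they can run, here is a sketch of the four cases on a made-up table. Note the swap: it uses vapply() with small predicate functions instead of lapply(...)==TRUE, on the assumption that comparing a list against TRUE only works when every element of the list has length one. The table contents and the helper names is_single_na, has_hello and is_five are hypothetical.

```r
library(data.table)

# Hypothetical table with a list column holding one value of each kind
DT <- data.table(
  id = 1:4,
  ColofTypeList = list(NULL, NA, "hello", 5L)
)

# Hypothetical helpers; each returns exactly one TRUE or FALSE per element
is_single_na <- function(v) length(v) == 1L && is.na(v)
has_hello    <- function(v) is.character(v) && any(grepl("hello", v))
is_five      <- function(v) is.numeric(v) && isTRUE(any(v == 5))

DT[vapply(ColofTypeList, is.null, logical(1))]       # row with NULL
DT[vapply(ColofTypeList, is_single_na, logical(1))]  # row with NA
DT[vapply(ColofTypeList, has_hello, logical(1))]     # row with "hello"
DT[vapply(ColofTypeList, is_five, logical(1))]       # row with 5
```

The logical(1) template makes vapply() fail loudly if a predicate ever returns something other than a single logical value, which is the main reason for preferring it in this sketch.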

subset rows in data table with many columns

I could not get your example to load in my console session, but this is a much more "minimal" example that demonstrates a method. I am not sure it has the usual data.table efficiency, though:

library(data.table)
DT <- setDT( data.frame(x=1:2, y=0,z=0))
DT[, apply(.SD, 1, function(x){any(x>=2)}) ] # gets you one logical value per row
# [1] FALSE  TRUE
DT[ DT[, apply(.SD, 1, function(x){any(x>=2)}) ]] # uses that vector to select rows
#    x y z
# 1: 2 0 0

This should succeed as well:

DT[ as.logical(rowSums(DT >= 2))]
#    x y z
# 1: 2 0 0

For the second part consider this:

cols <- sapply(DT, function(x){ any(x>0)})
DT2[ ,.SD, .SDcols=names(cols[cols])]
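DT2 is not defined in this answer; presumably it comes from the question. As a sketch only, the same column filter applied to the small DT from this answer would look like this:

```r
library(data.table)

DT <- setDT(data.frame(x = 1:2, y = 0, z = 0))

# TRUE for each column that has at least one value above zero
cols <- sapply(DT, function(x) any(x > 0))

# Keep only those columns; with this data that is just x
positive_cols <- DT[, .SD, .SDcols = names(cols[cols])]
positive_cols
```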

using %in% to subset a data.table

The expression

 DT[x==a | x==b]

returns all rows in DT where, within the same row, x equals a or x equals b. This is the desired result.

On the other hand

 DT[x%in%c(a,b)]

returns all rows where x matches any value in c(a, b), not just the corresponding value. Thus your second row appears because x == 3 and 3 appears (somewhere) in a.
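The difference is easy to see on a small table. The data below is hypothetical, built so that the value of x in row 2 appears in a, but in a different row:

```r
library(data.table)

# Hypothetical data: in row 2, x is 3, and 3 occurs in a only in row 3
DT <- data.table(x = c(1L, 3L, 5L), a = c(1L, 2L, 3L), b = c(0L, 0L, 5L))

rowwise <- DT[x == a | x == b]   # element-wise comparison: rows 1 and 3
anywhere <- DT[x %in% c(a, b)]   # set membership: rows 1, 2 and 3

rowwise
anywhere
```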

(R, Data.Tables): Subset rows based on logical values in columns with dynamically assigned column names

Use get() to look up each column from the name stored in a character variable:

library(data.table)
DT[, (nmMatch) := FALSE ]
DT[get(nm1)== get(nm2), (nmMatch) := TRUE]
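Here is a self-contained version of the same idea. The table and the values of nm1, nm2 and nmMatch are assumptions made for illustration, since the question's data is not shown.

```r
library(data.table)

# Hypothetical data and column names held in character variables
DT <- data.table(p = c(1, 2, 3), q = c(1, 5, 3))
nm1 <- "p"
nm2 <- "q"
nmMatch <- "p_equals_q"

# Create the logical column, then flag the rows where the two columns agree
DT[, (nmMatch) := FALSE]
DT[get(nm1) == get(nm2), (nmMatch) := TRUE]

# Subset on the dynamically named logical column
matched <- DT[get(nmMatch)]
matched   # rows 1 and 3
```

Because get(nmMatch) is a function call, it is evaluated with the columns of DT visible, so no extra parentheses are needed in the last step.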

Selecting single elements in data.table based on logical condition

%in% does not work directly on a whole data.frame/data.table. Use lapply to iterate over the columns and replace every value found in invalid_values with -99.

library(data.table)
DT[, lapply(.SD, function(x) replace(x, x %in% invalid_values, -99))]

#      a   b   c d
# 1: -99 -99   1 4
# 2:   9   5 -99 8
# 3: -99 -99   1 4
# 4:   9   5 -99 8
# 5: -99 -99   1 4
# 6:   9   5 -99 8
# 7: -99 -99   1 4
# 8:   9   5 -99 8
# 9: -99 -99   1 4
#10:   9   5 -99 8
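The input behind the output above is not shown, so here is a smaller sketch with made-up data. DT and invalid_values are assumptions; the replacement expression is the one from the answer.

```r
library(data.table)

# Hypothetical input and set of values to treat as invalid
DT <- data.table(a = c(3, 9), b = c(7, 5), c = c(1, 3))
invalid_values <- c(3, 7)

# %in% is applied to one column (a plain vector) at a time
cleaned <- DT[, lapply(.SD, function(x) replace(x, x %in% invalid_values, -99))]
cleaned
```

This returns a new table and leaves DT untouched.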
