How to Subset a Table Object in R

How to subset a table object in R?

You need to use the computed value twice, so it's useful to store it in an intermediate variable:

x <- with(chickwts, table(feed))
x[x>11]
feed
   casein   linseed   soybean sunflower
       12        12        14        12
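The same intermediate-variable idea extends to a two-way table. The following is a minimal sketch, not part of the original answer: it assumes the built-in mtcars data and keeps only the rows of the table whose marginal count exceeds a threshold (the names tab, keep and sub are illustrative).

```r
# Two-way table of cylinders by gears (built-in mtcars data, used here as an assumption)
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Row totals are needed for the condition, so compute the logical index once
keep <- rowSums(tab) > 10

# Keep only the cyl rows that have more than 10 cars in total
sub <- tab[keep, ]
sub
```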

Subsetting a table in R

Subset the data before running table(), for example:

ftable(table(mtcars[, c("cyl", "gear", "vs")]))
#          vs  0  1
# cyl gear
# 4   3        0  1
#     4        0  8
#     5        1  1
# 6   3        0  2
#     4        2  2
#     5        1  0
# 8   3       12  0
#     4        0  0
#     5        2  0

# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
#          vs 0 1
# cyl gear
# 4   4       0 8
# 6   4       2 2

Embed subset in a table function

xtabs() has a subset argument so one option is to use that instead of table().

xtabs(~ Sex + tabac, retinol, subset = tabac != 3)
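The retinol data from the question is not available here, so the following sketch applies the same subset argument to the built-in mtcars data instead. The choice of columns is an assumption made purely for illustration.

```r
# Cross-tabulate cyl by vs, counting only cars with 4 gears.
# The subset expression is evaluated inside the data frame given in `data`.
tab <- xtabs(~ cyl + vs, data = mtcars, subset = gear == 4)
tab
```

Because cyl is numeric, the table only contains the cyl values that survive the subset.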

How to subset a data.table object based on passed parameter within a function?

data.table way

In most cases you can do everything in data.table without any explicit iteration (for or lapply):

library(data.table)
dt <- data.table(iris)
group.by.name <- "Species"
res <- dt[, .(count = .N), by = group.by.name]
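Since the question is about a parameter passed into a function, the snippet above can be wrapped roughly as follows. This is a sketch only: the function name count_by is hypothetical, and it assumes the grouping columns arrive as a character vector.

```r
library(data.table)

# Hypothetical helper: count rows per group, with the grouping
# column name(s) passed in as a character vector
count_by <- function(dt, group.by.name) {
  dt[, .(count = .N), by = group.by.name]
}

res1 <- count_by(data.table(iris), "Species")
res2 <- count_by(as.data.table(mtcars), c("cyl", "gear"))
```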

split-transform-rbind strategy:

If you need to do complex transformations on a data.table, you can split, transform, and rbind the data like this:

library('data.table')
dt <- data.table(iris)
group.by.name <- "Species"
res <- lapply(split(dt, by = group.by.name), function(data) {
  data[, .(count = .N)]
})
res <- rbindlist(res, idcol = group.by.name)

You have a trade-off between readability and speed.
With mclapply (from the parallel package) you might even gain speed on larger inputs.

Usually you will be able to move complex logic into vectorized functions and do it the data.table way without losing readability.
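For reference, a parallel variant of the split-transform-rbind code might look like the sketch below. It assumes parallel::mclapply; the variable n.cores is illustrative and is set to 1 so the code runs everywhere (forking with more cores only works on Unix-alikes).

```r
library(data.table)
library(parallel)

dt <- data.table(iris)
group.by.name <- "Species"

# Assumption: 1 core keeps the example portable; raise it on Linux/macOS
n.cores <- 1L

# Same transformation as before, but each piece may run in its own process
res <- mclapply(split(dt, by = group.by.name), function(data) {
  data[, .(count = .N)]
}, mc.cores = n.cores)

# Bind the pieces back together; list names become the grouping column
res <- rbindlist(res, idcol = group.by.name)
```

Whether this is actually faster depends on how expensive the per-group work is relative to the cost of forking and copying data.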

Efficient way to subset data.table based on value in any of selected columns

One option is to specify the columns of interest in .SDcols, loop over the Subset of Data.table (.SD) to generate a list of logical vectors, Reduce it to a single logical vector with |, and use that vector to subset the rows:

i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
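The snippet above relies on dt, cols and test1 from the original question. A self-contained version could look like this sketch, where the sample data and the hand-written reference result test1 are assumptions made for illustration:

```r
library(data.table)

# Made-up sample data; column z is deliberately not among the selected columns
dt <- data.table(id = 1:5,
                 a = c(10, 2, 3, 4, 5),
                 b = c(1, 2, 10, 4, 5),
                 z = c(10, 10, 10, 10, 10))
cols <- c("a", "b")

# Reference result written out by hand
test1 <- dt[a == 10 | b == 10]

# One logical vector per selected column, combined element-wise with |
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
```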

Subset data.table by another data.table without merging all columns

Following @akrun's answer, you can identify the rows in the join and use them to subset the table:

w = sort(DT1[DT2, on=.(A,B), which=TRUE, nomatch=0])
DT1[w]

# A B C
# 1: 1 1 1
# 2: 3 1 3
# 3: 2 3 1

or more compactly

DT1[sort(DT1[DT2, on=.(A,B), which=TRUE, nomatch=0])]

If you want to keep rows in the order from DT2, don't sort; and if you want unmatched rows included, skip nomatch=0.
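To try this out without the data from the question, here is a sketch with made-up tables. DT1, DT2 and their contents are assumptions; only the join pattern is taken from the answer.

```r
library(data.table)

# Made-up data: DT2 carries an extra column D that should not end up in the result
DT1 <- data.table(A = c(1, 3, 2, 4), B = c(1, 1, 3, 2), C = c(1, 3, 1, 9))
DT2 <- data.table(A = c(2, 1, 3, 5), B = c(3, 1, 1, 5), D = c("x", "y", "z", "w"))

# Row numbers of DT1 that match a row of DT2, in DT2's order;
# unmatched DT2 rows are dropped because nomatch = 0
w <- DT1[DT2, on = .(A, B), which = TRUE, nomatch = 0]

# Sorting restores DT1's original row order
res <- DT1[sort(w)]
```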

Subsetting a data.table with a variable (when varname identical to colname)

If you don't mind doing it in 2 steps, you can just subset out of the scope of your data.table (though it's usually not what you want to do when working with data.table...):

wh_v1 <- my_data_table[, V1]==V1
my_data_table[wh_v1]
# V1 V2
#1: A 1
#2: A 4
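A self-contained version of the two-step approach, with sample data invented to reproduce the shape of the output above:

```r
library(data.table)

# Made-up data; the variable V1 below has the same name as the column
my_data_table <- data.table(V1 = c("A", "B", "C", "A"), V2 = 1:4)
V1 <- "A"

# Step 1: inside [ ], V1 is the column; outside, V1 is the variable
wh_v1 <- my_data_table[, V1] == V1

# Step 2: use the logical vector to pick the rows
res <- my_data_table[wh_v1]
```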

data.table: transforming subset of columns with a function, row by row

If what you need is really to scale by row, you can try doing it in 2 steps:

# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]

# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
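A small worked example may make the two steps easier to follow. The data below is made up, and the column-name pattern "keyword" is kept from the answer; kw is an illustrative helper variable.

```r
library(data.table)

# Made-up data: three "keyword" columns to be scaled within each row
DT <- data.table(id = 1:3,
                 keyword_a = c(1, 2, 3),
                 keyword_b = c(3, 4, 5),
                 keyword_c = c(5, 6, 7))
kw <- grep("keyword", colnames(DT), value = TRUE)

# Step 1: per-row mean (V1) and standard deviation (V2) over the selected columns
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by = 1:nrow(DT), .SDcols = kw]

# Step 2: scale every selected column with the per-row statistics
DT[, (kw) := lapply(.SD, function(x) (x - mean_sd$V1) / mean_sd$V2), .SDcols = kw]
```

Each row here has the values m - 2, m, m + 2, so the row standard deviation is 2 and the scaled columns come out as -1, 0 and 1.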

data.table subset rows when there is a lubridate interval object column

As it turns out, this is a bug in how data.table handles columns that are S4 objects, as reported in a GitHub issue. A workaround is to store each element of the S4 column as a list element. In my case the following fixes the issue. Note that because the S4 column is now a list, I had to switch from [ to [[ when extracting elements.

x[, authInterval := interval(x$AUTH_DT, x$AUTH_END_DT)]
x[, authInterval := as.list(authInterval)]

# Find sequential auth intervals that overlap
overlap <- sapply(1:(nrow(x) - 1), function(y) {
  int_overlaps(x$authInterval[[y]], x$authInterval[[y + 1]])
})

x[, overlap := c(NA, overlap)]

# which two rows have overlap
whichOverlap <- lapply(which(x$overlap), function(y) {c(y - 1, y)})
whichOverlap

x[unlist(whichOverlap)]
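If only consecutive rows need to be compared, the S4 column can possibly be avoided altogether. The sketch below is an alternative, not the workaround described above: it compares the plain start and end dates with data.table's shift(), and the sample dates and helper column names are made up.

```r
library(data.table)

# Made-up authorisation periods; rows 1 and 2 overlap, row 3 stands alone
x <- data.table(
  AUTH_DT     = as.Date(c("2020-01-01", "2020-01-20", "2020-03-01")),
  AUTH_END_DT = as.Date(c("2020-01-31", "2020-02-15", "2020-03-31"))
)

# Bounds of the previous row (NA for the first row)
x[, prev_start := shift(AUTH_DT)]
x[, prev_end := shift(AUTH_END_DT)]

# Two closed intervals overlap when each one starts before the other ends
x[, overlap := prev_start <= AUTH_END_DT & AUTH_DT <= prev_end]

# Rows that take part in an overlap: the flagged row and the one before it
idx <- sort(unique(c(which(x$overlap) - 1L, which(x$overlap))))
x[idx]
```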

How to apply a function to a subset of data.table using by and exposing all columns to the function?

The OP has a function that takes a list as its argument. The list should contain all columns of the data.table, including the columns used for grouping in by.

According to help(".SD"):

.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).

(emphasis mine)

.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance.

So, .BY and .SD complement each other to access all columns of the data.table.

Instead of explicitly repeating the by columns in the function call

x[, myfun(c(list(b, a), .SD)), by = .(b, a)]

we can use

x[, myfun(c(.BY, .SD)), by = .(b, a)]
   b a                                                                 V1
1: a a a a -1.02091215130492a a -0.295107569536843a a 0.77776326093429
2: a b b a -0.369037832486311b a -0.716211663822323b a -0.264799143319049
3: b c c b -1.39603530693486c b 1.4707902839894c b 0.721925347069227
4: b d d b -1.15220308230505d b -0.736782242593426d b 0.420986999145651

The OP has used debugonce() to show the argument passed to myfun():

> debugonce(myfun)
> x[, myfun(c(.BY, .SD)), by = .(b, a)]
debugging in: myfun(c(.BY, .SD))
debug at #1: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
$b
[1] "a"

$a
[1] "a"

$c
[1] -1.0209122 -0.2951076 0.7777633


Another example

With another sample data set and function it might be easier to exemplify the core of the question:

x <- data.table(a = rep(letters[3:6], each = 3), b = rep(c("x", "y"), each = 6), c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")

x[, myfun(.SD), by = .(b, a)]
   b a             V1
1: x c //1-//2-//3
2: x d //4-//5-//6
3: y e //7-//8-//9
4: y f //10-//11-//12

So, columns b and a do appear in the output as grouping variables, but they aren't passed via .SD to the function.

Now, with .BY complementing .SD

x[, myfun(c(.BY, .SD)), by = .(b, a)]
   b a                   V1
1: x c c/x/1-c/x/2-c/x/3
2: x d d/x/4-d/x/5-d/x/6
3: y e e/y/7-e/y/8-e/y/9
4: y f f/y/10-f/y/11-f/y/12

all columns of the data.table are passed to the function.

Separate arguments in the function call

Roland has suggested passing .BY and .SD to the function as separate parameters. Indeed, .BY is a list object and .SD is a data.table object (which is essentially also a list, which is what allowed us to use c(.BY, .SD)). There may be cases where the difference matters.

To verify, we can define a function which prints str() as a side effect. The function is only called for the first group (.GRP == 1L).

myfun1 <- function(y) str(y)
x[, if (.GRP == 1L) myfun1(.SD), by = .(b, a)]
Classes ‘data.table’ and 'data.frame':    3 obs. of  1 variable:
$ c: int 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, ".data.table.locked")= logi TRUE
Empty data.table (0 rows) of 2 cols: b,a
x[, if (.GRP == 1L) myfun1(.BY), by = .(b, a)]
List of 2
$ b: chr "x"
$ a: chr "c"
Empty data.table (0 rows) of 2 cols: b,a
x[, if (.GRP == 1L) myfun1(c(.BY, .SD)), by = .(b, a)]
List of 3
$ b: chr "x"
$ a: chr "c"
$ c: int [1:3] 1 2 3
Empty data.table (0 rows) of 2 cols: b,a
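Passing the two objects separately, as Roland suggested, could look like the following sketch. The function myfun2 and its parameter names are hypothetical; the data is the sample data set from above.

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)

# Hypothetical variant of myfun: grouping values and subset arrive separately,
# so .SD keeps its data.table class inside the function
myfun2 <- function(by, sd) paste(by$a, by$b, sd$c, sep = "/", collapse = "-")

res <- x[, myfun2(.BY, .SD), by = .(b, a)]
```

With this sample data the result should match the c(.BY, .SD) output shown earlier.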

Additional links

Besides help(".SD"), the comments and answers to the following SO questions might be useful:

  • What does .SD stand for in data.table in R
  • Use of lapply .SD in data.table R

