How to subset a table object in R?
You need to use the computed value twice, so it's useful to store it in an intermediate variable:
x <- with(chickwts, table(feed))
x[x>11]
feed
   casein   linseed   soybean sunflower
       12        12        14        12
Subsetting a table in R
Subset the data before running table, for example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
#          vs  0  1
# cyl gear
# 4   3        0  1
#     4        0  8
#     5        1  1
# 6   3        0  2
#     4        2  2
#     5        1  0
# 8   3       12  0
#     4        0  0
#     5        2  0
# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
#          vs 0 1
# cyl gear
# 4   4       0 8
# 6   4       2 2
Embed subset in a table function
xtabs() has a subset argument, so one option is to use that instead of table().
xtabs(~ Sex + tabac, retinol, subset = tabac != 3)
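The retinol data set used above is not shipped with R, so here is a minimal sketch of the same idea on the built-in mtcars data. The choice of columns (cyl, vs, gear) is only an assumption for illustration; it mirrors the ftable example earlier in this answer.

```r
# Cross-tabulate cyl by vs, keeping only the rows where gear == 4.
# The subset expression is evaluated inside the data frame given as `data`.
tab <- xtabs(~ cyl + vs, data = mtcars, subset = gear == 4)
tab
```

Because the rows are dropped before the counts are computed, levels that only occur in the excluded rows (here cyl == 8) do not appear in the result.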
How to subset a data.table object based on passed parameter within a function?
data.table way
In most cases you can do everything in data.table without any iteration constructs (for or lapply):
library(data.table)
dt <- data.table(iris)
group.by.name <- "Species"
res <- dt[, .(count = .N), by = group.by.name]
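Since the question is about passing the parameter into a function, the snippet above could be wrapped roughly as follows. This is only a sketch: count_by and filter_by are hypothetical helper names, and both assume the data.table has no columns that share a name with the function arguments (group_col, col, value), since those would shadow them.

```r
library(data.table)

# Hypothetical helper: count rows per group, with the grouping column
# passed as a character string.
count_by <- function(dt, group_col) {
  dt[, .(count = .N), by = group_col]
}

# Hypothetical helper: keep rows where the column named by `col` equals `value`.
# get() looks up the column by its name inside the data.table.
filter_by <- function(dt, col, value) {
  dt[get(col) == value]
}

dt <- data.table(iris)
res <- count_by(dt, "Species")
setosa <- filter_by(dt, "Species", "setosa")
```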
split-transform-rbind strategy:
If you need to do a complex transformation over a data.table, you can split-transform-rbind the data like this:
library('data.table')
dt <- data.table(iris)
group.by.name <- "Species"
res <- lapply(split(dt, by = group.by.name), function(data) {
  data[, .(count = .N)]
})
res <- rbindlist(res, idcol = group.by.name)
There is a trade-off between readability and speed here. With mclapply (from the parallel package) in place of lapply, you might even gain speed on larger data sets.
Usually you will be able to move complex logic into vectorized functions and do it the data.table way without losing readability.
Efficient way to subset data.table based on value in any of selected columns
One option is to specify the 'cols' of interest in .SDcols, loop through the Subset of Data.table (.SD), generate a list of logical vectors, Reduce it to a single logical vector with |, and use that to subset the rows:
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
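The objects dt, cols and test1 come from the question and are not shown here. A self-contained sketch with made-up data (the column names and values are assumptions for illustration only) looks like this:

```r
library(data.table)

# Hypothetical data: only columns a and b are checked for the value 10
dt <- data.table(
  id = 1:5,
  a  = c(10, 2, 3, 4, 5),
  b  = c(1, 10, 3, 4, 5),
  c  = c(10, 10, 10, 0, 0)
)
cols <- c("a", "b")

# One logical vector per selected column, combined row-wise with |
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
res <- dt[i1]
res
```

Row 3 is not selected even though column c holds 10 there, because c is not listed in cols.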
Subset data.table by another data.table without merging all columns
Following @akrun's answer, you can identify the rows in the join and use them to subset the table:
w = sort(DT1[DT2, on=.(A,B), which=TRUE, nomatch=0])
DT1[w]
#    A B C
# 1: 1 1 1
# 2: 3 1 3
# 3: 2 3 1
or more compactly
DT1[sort(DT1[DT2, on=.(A,B), which=TRUE, nomatch=0])]
If you want to keep rows in the order from DT2, don't sort; and if you want unmatched rows included, skip nomatch=0.
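DT1 and DT2 are defined in the question, not here. The following sketch uses made-up tables (all values and the extra column D are assumptions) to show that only DT1's own columns come back, and how sorting changes the row order:

```r
library(data.table)

# Hypothetical inputs: DT2 carries an extra column D that we do not want merged in
DT1 <- data.table(A = c(1, 3, 2, 4), B = c(1, 1, 3, 2), C = c(1, 3, 1, 9))
DT2 <- data.table(A = c(2, 1, 3), B = c(3, 1, 1), D = c("x", "y", "z"))

# Row numbers of DT1 that have a match in DT2, in DT2's order
w_unsorted <- DT1[DT2, on = .(A, B), which = TRUE, nomatch = 0]

# Sorting restores DT1's original row order
w <- sort(w_unsorted)
res <- DT1[w]
res
```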
Subsetting a data.table with a variable (when varname identical to colname)
If you don't mind doing it in two steps, you can compute the condition outside the scope of your data.table (though that's usually not what you want to do when working with data.table):
wh_v1 <- my_data_table[, V1]==V1
my_data_table[wh_v1]
#    V1 V2
# 1:  A  1
# 2:  A  4
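The table and the variable come from the question. A runnable version with assumed contents (the data below is invented so that it reproduces the output shown above) could look like this:

```r
library(data.table)

# Hypothetical data: a column named V1 and a variable that is also named V1
my_data_table <- data.table(V1 = c("A", "B", "C", "A"), V2 = 1:4)
V1 <- "A"

# my_data_table[, V1] extracts the column as a plain vector; the comparison
# then happens outside the data.table, so the right-hand V1 is the variable.
wh_v1 <- my_data_table[, V1] == V1
res <- my_data_table[wh_v1]
res
```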
data.table: transforming subset of columns with a function, row by row
If what you really need is to scale by row, you can try doing it in two steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
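DT is not defined in this answer. Here is a sketch of the same two steps on made-up data, assuming three columns whose names contain "keyword" (the names and values are illustrative only). The column names are stored in kw_cols to avoid repeating the grep call.

```r
library(data.table)

# Hypothetical data: three "keyword" columns plus an id column
DT <- data.table(
  id        = 1:3,
  keyword_a = c(1, 2, 3),
  keyword_b = c(4, 6, 8),
  keyword_c = c(7, 10, 13)
)
kw_cols <- grep("keyword", colnames(DT), value = TRUE)

# Step 1: per-row mean (V1) and sd (V2) over the keyword columns
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))),
              by = 1:nrow(DT), .SDcols = kw_cols]

# Step 2: scale each keyword column using its row's mean and sd
DT[, (kw_cols) := lapply(.SD, function(x) (x - mean_sd$V1) / mean_sd$V2),
   .SDcols = kw_cols]
DT
```

The keyword columns are stored as doubles here; with integer columns the scaled values would not fit the column type, so converting them first would be the safer route.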
data.table subset rows when there is a lubridate interval object column
As it turns out, and as this GitHub issue states, this is a bug in how data.table handles columns that are S4 objects. A workaround is also given there: make each element of the S4 column a list element. In my case the following fixes the issue. Note that since the S4 column is now a list, I had to change from using [ to [[.
x[, authInterval := interval(x$AUTH_DT, x$AUTH_END_DT)]
x[, authInterval := as.list(authInterval)]
# Find sequential auth intervals that overlap
overlap <- sapply(1:(nrow(x) - 1), function(y) {
  int_overlaps(x$authInterval[[y]], x$authInterval[[y + 1]])
})
x[, overlap := c(NA, overlap)]
# which two rows have overlap
whichOverlap <- lapply(which(x$overlap), function(y) {c(y - 1, y)})
whichOverlap
x[unlist(whichOverlap)]
How to apply a function to a subset of data.table using by and exposing all columns to the function?
The OP has a function which takes a list as its argument; the list should contain all columns of the data.table, including the columns used for grouping in by.
According to help(".SD"):
.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).
(emphasis mine)
.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance.
So, .BY and .SD complement each other to access all columns of the data.table.
Instead of explicitly repeating the by columns in the function call
x[, myfun(c(list(b, a), .SD)), by = .(b, a)]
we can use
x[, myfun(c(.BY, .SD)), by = .(b, a)]
b a V1
1: a a a a -1.02091215130492a a -0.295107569536843a a 0.77776326093429
2: a b b a -0.369037832486311b a -0.716211663822323b a -0.264799143319049
3: b c c b -1.39603530693486c b 1.4707902839894c b 0.721925347069227
4: b d d b -1.15220308230505d b -0.736782242593426d b 0.420986999145651
The OP has used debugonce() to show the argument passed to myfun():
> debugonce(myfun)
> x[, myfun(c(.BY, .SD)), by = .(b, a)]
debugging in: myfun(c(.BY, .SD))
debug at #1: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
$b
[1] "a"
$a
[1] "a"
$c
[1] -1.0209122 -0.2951076 0.7777633
Another example
With another sample data set and function it might be easier to exemplify the core of the question:
x <- data.table(a = rep(letters[3:6], each = 3), b = rep(c("x", "y"), each = 6), c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")
x[, myfun(.SD), by = .(b, a)]
   b a             V1
1: x c    //1-//2-//3
2: x d    //4-//5-//6
3: y e    //7-//8-//9
4: y f //10-//11-//12
So, columns b and a do appear in the output as grouping variables, but they aren't passed via .SD to the function.
Now, with .BY complementing .SD
x[, myfun(c(.BY, .SD)), by = .(b, a)]
   b a                   V1
1: x c    c/x/1-c/x/2-c/x/3
2: x d    d/x/4-d/x/5-d/x/6
3: y e    e/y/7-e/y/8-e/y/9
4: y f f/y/10-f/y/11-f/y/12
all columns of the data.table are passed to the function.
Separate arguments in the function call
Roland has suggested passing .BY and .SD as separate parameters to the function. Indeed, .BY is a list object and .SD is a data.table object (which is essentially also a list, which is what allowed us to use c(.BY, .SD)). There might be cases where the difference matters.
To verify, we can define a function which prints str() as a side effect. The function is only called for the first group (.GRP == 1L).
myfun1 <- function(y) str(y)
x[, if (.GRP == 1L) myfun1(.SD), by = .(b, a)]
Classes ‘data.table’ and 'data.frame': 3 obs. of 1 variable:
$ c: int 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, ".data.table.locked")= logi TRUE
Empty data.table (0 rows) of 2 cols: b,a
x[, if (.GRP == 1L) myfun1(.BY), by = .(b, a)]
List of 2
$ b: chr "x"
$ a: chr "c"
Empty data.table (0 rows) of 2 cols: b,a
x[, if (.GRP == 1L) myfun1(c(.BY, .SD)), by = .(b, a)]
List of 3
$ b: chr "x"
$ a: chr "c"
$ c: int [1:3] 1 2 3
Empty data.table (0 rows) of 2 cols: b,a
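The separate-arguments variant mentioned above is not spelled out in this answer. A minimal sketch, assuming a hypothetical two-argument function myfun2 (the name and signature are not from the original discussion), could look like this:

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)

# Hypothetical variant of myfun: grouping values and group data arrive separately.
# by_vals is the .BY list (length-1 vectors); sd is the .SD data.table.
myfun2 <- function(by_vals, sd) {
  paste(by_vals$a, by_vals$b, sd$c, sep = "/", collapse = "-")
}

res <- x[, myfun2(.BY, .SD), by = .(b, a)]
res
```

Keeping the two arguments apart means sd stays a data.table inside the function, whereas c(.BY, .SD) yields a plain list.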
Additional links
Besides help(".SD"), the comments and answers to the following SO questions might be useful:
- What does .SD stand for in data.table in R
- Use of lapply .SD in data.table R