Dynamic column names in data.table
From data.table 1.9.4
, you can just do this:
## A parenthesized symbol, `(cn)`, gets evaluated to "blah" before `:=` is carried out
test_dtb[, (cn) := mean(a), by = id]
head(test_dtb, 4)
# a b id blah
# 1: 41 19 1 54.2
# 2: 4 99 2 50.0
# 3: 49 85 3 46.7
# 4: 61 4 4 57.1
See Details in ?:=
:
DT[i, (colvector) := val]
[...] NOW PREFERRED [...] syntax. The parens are enough to stop the LHS being a symbol; same as
c(colvector)
Original answer:
You were on exactly the right track: constructing an expression to be evaluated within the call to [.data.table
is the data.table way to do this sort of thing. Going just a bit further, why not construct an expression that evaluates to the entire j
argument (rather than just its left hand side)?
Something like this should do the trick:
## Your code so far
library(data.table)
test_dtb <- data.table(a=sample(1:100, 100),b=sample(1:100, 100),id=rep(1:10,10))
cn <- "blah"
## One solution
expr <- parse(text = paste0(cn, ":=mean(a)"))
test_dtb[,eval(expr), by=id]
## Checking the result
head(test_dtb, 4)
# a b id blah
# 1: 30 26 1 38.4
# 2: 83 82 2 47.4
# 3: 47 66 3 39.5
# 4: 87 23 4 65.2
How to assign dynamic column names in data.table under `:=`?
We can place the values in a list
or use .(...)
and then assign (:=
) it to new columns
carsDT[speed < 15, paste0("col", 1:2) := list(1, 2)]
R data.table dynamic column name of group by returning new table
We can use setNames
library(data.table)
dt[, setNames(list(mean(a)), column_name), by = id]
# id mean
# 1: 1 56.8
# 2: 2 50.5
# 3: 3 50.5
# 4: 4 42.4
# 5: 5 49.9
# 6: 6 47.8
# 7: 7 60.6
# 8: 8 57.4
# 9: 9 54.6
#10: 10 34.5
data
set.seed(123)
dt <- data.table(a = sample(1:100, 100), b = sample(1:100, 100), id = rep(1:10,10))
column_name <- "mean"
Using dynamic column names in `data.table`
you should use .SDcols
(especially if you've too many columns and you require a particular operation to be performed only on a subset of the columns (apart from the grouping variable columns).
dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]
# condition var1 var2 var3
# 1: one 101.0 1001.0 10001.0
# 2: two 104.0 1004.0 10004.0
# 3: three 107.0 1007.0 10007.0
# 4: four 109.5 1009.5 10009.5
You could also get all the column names you'd want to take mean of first in a variable and then pass it to .SDcols
like this:
keys <- setdiff(names(dtb), "condition")
# keys = var1, var2, var3
dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
Edit: As Matthew Dowle rightly pointed out, since you require mean to be computed on every other column after grouping by condition
, you could just do:
dtb[, lapply(.SD, mean), by=condition]
David's edit: (which got rejected): Read more about .SD
from this post. I find this is relevant here. Thanks @David.
Edit 2: Suppose you have a data.table
with 1000 rows and 301 columns (one column for grouping and 300 numeric columns):
require(data.table)
set.seed(45)
dt <- data.table(grp = sample(letters[1:15], 1000, replace=T))
m <- matrix(rnorm(300*1000), ncol=300)
dt <- cbind(dt, m)
setkey(dt, "grp")
and you wanted to find the mean of the columns, say, 251:300 alone,
you can compute the mean of all the columns and then subset these columns (which is not very efficient as you'll compute on the whole data).
dt.out <- dt[, lapply(.SD, mean), by=grp]
dim(dt.out) # 15 * 301, not efficient.you can filter the
data.table
first to just these columns and then compute the mean (which is again not necessarily the best solution as you have to create an extra subset'd data.table every time you want operations on certain columns.dt.sub <- dt[, c(1, 251:300)]
setkey(dt.sub, "grp")
dt.out <- dt.sub[, lapply(.SD, mean), by=grp]you can specify each of the columns one by one as you'd normally do (but this is desirable for smaller data.tables)
# if you just need one or few columns
dt.out <- dt[, list(m.v251 = mean(V251)), by = grp]
So what's the best solution? The answer is .SDcols.
As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD.
This basically implicitly filters the columns that will be passed to .SD instead of creating a subset (as we did before), only it is VERY efficient and FAST!
How can we do this?
By specifiying either the column numbers:
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
dim(dt.out) # 15 * 51 (what we expect)Or alternatively by specifying the column id:
ids <- paste0("V", 251:300) # get column ids
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
dim(dt.out) # 15 * 51 (what we expect)
It accepts both column names and numbers as arguments. In both these cases, .SD will be provided only with these columns we've specified.
Hope this helps.
dynamic column names seem to work when := is used but not when = is used in data.table
One option is to use the base R function setNames
aggregate_mtcars <- mtcars_copy[, setNames(.(sum(carb)), new_col)]
Or you could use data.table::setnames
aggregate_mtcars <- setnames(mtcars_copy[, .(sum(carb))], new_col)
Dynamically add column names to data.table when aggregating
As mentioned in the comments by lukeA, setNames
can be used:
m <- c("blah", "foo")
test_dtb[ , setNames(list(mean(b), median(b)), m), by = id]
Pass column name in data.table using variable
Use the quote()
and eval()
functions to pass a variable to j
. You don't need double-quotes on the column names when you do it this way, because the quote()
-ed string will be evaluated inside the DT[]
temp <- quote(x)
DT[ , eval(temp)]
# [1] "b" "b" "b" "a" "a"
With a single column name, the result is a vector. If you want a data.table result, or several columns, use list form
temp <- quote(list(x, v))
DT[ , eval(temp)]
# x v
# 1: b 1.52566586
# 2: b 0.66057253
# 3: b -1.29654641
# 4: a -1.71998260
# 5: a 0.03159933
Related Topics
Issue When Importing Dataset: 'Error in Scan(...): Line 1 Did Not Have 145 Elements'
Displaying a Greater Than or Equal Sign
Cumulative Count of Each Value
Split Dataframe by Levels of a Factor and Name Dataframes by Those Levels
How to Reorder Data.Table Columns (Without Copying)
How to Aggregate a Dataframe by Week
How to Extract Month from Date in R
Include Space for Missing Factor Level Used in Fill Aesthetics in Geom_Boxplot
How to Generate Distributions Given, Mean, Sd, Skew and Kurtosis in R
Why Is the Terminology of Labels and Levels in Factors So Weird
Emulate Split() with Dplyr Group_By: Return a List of Data Frames
R Error "Sum Not Meaningful for Factors"
Increase Resolution of Color Scale for Values Close to Zero
What's the Difference Between Integer Class and Numeric Class in R
How to Tell Cran to Install Package Dependencies Automatically