Data.table meta-programming
While your functions certainly look interesting, I believe you are asking if there are other ways to go about it.
Personally, I like to use something like this:
## SAMPLE DATA
DT1 <- data.table(id=sample(LETTERS[1:4], 20, TRUE), Col1=1:20, Col2=rnorm(20))
DT2 <- data.table(id=sample(LETTERS[3:8], 20, TRUE), Col1=sample(100:500, 20), Col2=rnorm(20))
DT3 <- data.table(id=sample(LETTERS[19:20], 20, TRUE), Col1=sample(100:500, 20), Col2=rnorm(20))
ACCESSING A TABLE BY REFERENCE TO THE TABLE NAME:
This is straightforward, much like any object in R
# use strings to select the table
tablesSelected <- "DT3"
# use get to access them
get(tablesSelected)
# and we can perform operations:
get(tablesSelected)[, list(C1mean=mean(Col1), C2mean=mean(Col2))]
SELECTING COLUMNS BY REFERENCE
To select columns by reference to their names, use the .SDcols
argument.
Given a vector of column names:
columnsSelected <- c("Col1", "Col2")
Assign that vector to the .SDcols argument:
## Here we are simply accessing those columns
DT3[, .SD, .SDcols = columnsSelected]
We can also apply a function to each column named in the string vector:
## apply a function to each column
DT3[, lapply(.SD, mean), .SDcols = columnsSelected]
Note that if our goal is simply to output the columns we can turn off with
:
# This works for displaying
DT3[, columnsSelected, with=FALSE]
Note: a more "modern" way of doing this is to use the ..
shortcut to access columnsSelected
from "up one level":
DT3[ , ..columnsSelected]
However, if using with=FALSE
, we cannot then operate directly on the columns in the usual fashion
## This does NOT work:
DT3[, someFunc(columnsSelected), with=FALSE]
## This DOES work:
DT3[, someFunc(.SD), .SDcols=columnsSelected]
## This also works, but is less ideal, ie assigning to new columns is more cumbersome
DT3[, columnsSelected, with=FALSE][, someFunc(.SD)]
We can also use get
, but it is a bit trickier.
I am leaving it here for reference, but .SDcols
is the way to go
## we need to use `get`, but inside `j`
## AND IN A WRAPPER FUNCTION <~~~~~ THIS IS VITAL
DT3[, lapply(columnsSelected, function(.col) get(.col))]
## We can execute functions on the columns:
DT3[, lapply(columnsSelected, function(.col) mean( get(.col) ))]
## And of course, we can use more involved-functions, much like any *ply call:
# using .SDcols
DT3[, lapply(.SD, function(.col) c(mean(.col) + 2*sd(.col), mean(.col) - 2*sd(.col))), .SDcols = columnsSelected]
# using `get` and assigning the value to a var.
# Note that this method has memory drawbacks, so using .SDcols is preferred
DT3[, lapply(columnsSelected, function(.col) {TheCol <- get(.col); c(mean(TheCol) + 2*sd(TheCol), mean(TheCol) - 2*sd(TheCol))})]
For reference, if you try the following, you will notice that they do not produce the results we are after.
## this DOES NOT work (need ..columnsSelected)
DT3[, columnsSelected]
## netiher does this
DT3[, eval(columnsSelected)]
## still does not work:
DT3[, lapply(columnsSelected, get)]
If you want to change the name of the columns:
# Using the `.SDcols` method: change names using `setnames` (lowercase "n")
DT3[, setnames(.SD, c("new.Name1", "new.Name2")), .SDcols =columnsSelected]
# Using the `get` method:
## The names of the new columns will be the names of the `columnsSelected` vector
## Thus, if we want to preserve the names, use the following:
names(columnsSelected) <- columnsSelected
DT3[, lapply(columnsSelected, function(.col) get(.col))]
## we can also use this trick to give the columns new names
names(columnsSelected) <- c("new.Name1", "new.Name2")
DT3[, lapply(columnsSelected, function(.col) get(.col))]
Clearly, using .SDcols is easier and more elegant.
What about by
?
# `by` is straight forward, you can use a vector of strings in the `by` argument.
# lets add another column to show how to use two columns in `by`
DT3[, secondID := sample(letters[1:2], 20, TRUE)]
# here is our string vector:
byCols <- c("id", "secondID")
# and here is our call
DT3[, lapply(columnsSelected, function(.col) mean(get(.col))), by=byCols]
PUTTING IT ALL TOGETHER
We can access the data.table by reference to its name and then select its columns also by name:
get(tablesSelected)[, .SD, .SDcols=columnsSelected]
## OR WITH MULTIPLE TABLES
tablesSelected <- c("DT1", "DT3")
lapply(tablesSelected, function(.T) get(.T)[, .SD, .SDcols=columnsSelected])
# we may want to name the vector for neatness, since
# the resulting list inherits the names.
names(tablesSelected) <- tablesSelected
THIS IS THE BEST PART:
Since so much in data.table
is pass-by-reference, it is easy to have a list of tables, a separate list of columns to add and yet another list of columns to operate on, and put all together to add perform similar operations -- but with different inputs -- on all your tables.
As opposed to doing something similar with data.frame
, there is no need to reassign the end result.
newColumnsToAdd <- c("UpperBound", "LowerBound")
FunctionToExecute <- function(vec) c(mean(vec) - 2*sd(vec), mean(vec) + 2*sd(vec))
# note the list of column names per table!
columnsUsingPerTable <- list("DT1" = "Col1", DT2 = "Col2", DT3 = "Col1")
tablesSelected <- names(columnsUsingPerTable)
byCols <- c("id")
# TADA:
dummyVar <- # I use `dummyVar` because I do not want to display the output
lapply(tablesSelected, function(.T)
get(.T)[, c(newColumnsToAdd) := lapply(.SD, FunctionToExecute), .SDcols=columnsUsingPerTable[[.T]], by=byCols ] )
# Take a look at the tables now:
DT1
DT2
DT3
r data.table functional programming / metaprogramming / computing on the language
I don't recommend eval(parse())
. You can achieve the same as in approach three without it:
fn_dt_agg4 <-
function(dt, metric, metric_name, dimension, dimension_name, agg_type) {
e <- function(x) getFunction(agg_type)(x, na.rm = T)
temp <- dt[, setNames(lapply(.SD, e),
metric_name),
keyby = dimension, .SDcols = metric]
temp[]
}
This also avoids some security risks.
PS: You can check what data.table is doing regarding optimizations by setting options("datatable.verbose" = TRUE)
.
metaprogramming map on data.table list-columns
First Variation: using variables from the nested objetive
lapply
is enough. See the @diaggy's Answer.
Second Variation: using variables from and Outside the nested objetive
If you have to load a parameter from other column, it is neccesary pass from lapply
, to mapply
.
for(i in 1:length(calc_name)){
set(DT, NULL, j = calc_name[i],
value = mapply(function(x, m){
calc_metric_mean(a = x[[choose_a[i]]], b = x[[choose_b[i]]], metric = m)
}, x = DT$data, m = DT$metric, SIMPLIFY = FALSE
)
)
}
> DT
gp data metric d1d3 d2d1 d2d2
1: A <data.table[3x3]> max 1.7 3.5 4
2: B <data.table[3x3]> min 5.1 15 20
SIMPLIFY = FALSE
is required if it will return a list instead a vector.
data.table assign new columns based on variable
Not pretty but you could do:
dt[yes_no == 1,
(dt[yes_no == 1, as.character(dates)]) := as.data.table(diag(.N))]
Aggregation in data.table by reference to column names
We can use c
with names
x[, sum(Sepal.Width), by = c(names(x)[c(1, 5)])]
How do I correctly use the env variable for data.tables within a function
It's because dots
isn't a call, it's a list of calls. So when data.table evaluates j
it's trying to insert that list into a new column.
To fix this you need to splice the list of calls into a single call. You can do this in a call to ':='()
directly (Option 1 below), but you can also break this into multiple steps that mirrors what you were doing above by converting dots
to be a call to list()
(Option 2).
library(data.table)
data <- data.table::data.table(a = 1:5, b = 2:6)
# Option 1 - call to ':='
test <- function(data, ...) {
dots <- eval(substitute(alist(...)))
j <- bquote(':='(..(dots)), splice = TRUE)
print(j)
data[, j, env = list(j = j)][]
}
# # Option 2 - convert dots to a call to a list
# test <- function(data, ...) {
# dots <- eval(substitute(alist(...)))
# dots_names <- names(dots)
# dots <- bquote(list(..(unname(dots))), splice = TRUE)
# j <- call(":=", dots_names, dots)
# print(j)
# data[, j, env = list(j = j)][]
# }
test(data = data, c = a + 1, double_b = b * 2)
#> `:=`(c = a + 1, double_b = b * 2)
#> a b c double_b
#> <int> <int> <num> <num>
#> 1: 1 2 2 4
#> 2: 2 3 3 6
#> 3: 3 4 4 8
#> 4: 4 5 5 10
#> 5: 5 6 6 12
Edit: You can also use test2()
if you want to be able to edit the same column or use newly made columns.
test2 <- function(data, ...) {
dots <- eval(substitute(alist(...)))
dots_names <- names(dots)
for (i in seq_along(dots)) {
dot_name <- dots_names[[i]]
dot <- dots[[i]]
j <- call(":=", dot_name, dot)
print(j)
data[, j, env = list(j = j)]
}
data[]
}
Related Topics
Using the %≫% Pipe, and Dot (.) Notation
Clang-7: Error: Linker Command Failed With Exit Code 1 For Macos Big Sur
Read All Worksheets in an Excel Workbook into an R List With Data.Frames
Intelligent Point Label Placement in R
How to Replace Na Values in a Table For Selected Columns
Ggplot2 Two-Line Label With Expression
How to Use Facets With a Dual Y-Axis Ggplot
Why Is the Parallel Package Slower Than Just Using Apply
Chopping a String into a Vector of Fixed Width Character Elements
Collapsing Rows Where Some Are All Na, Others Are Disjoint With Some Nas
Rename Multiple Columns by Names
Convert Data.Frame Column Format from Character to Factor
Limit Ggplot2 Axes Without Removing Data (Outside Limits): Zoom
Ggplot2 Geom_Bar - How to Keep Order of Data.Frame
How to Sum a Numeric List Elements