Apply a function to every specified column in a data.table and update by reference
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
- Because there are parentheses in
(cols) :=
, the result is assigned to the columns specified incols
, instead of to some new variable named "cols". .SDcols
tells the call that we're only looking at those columns, and allows us to use.SD
, theS
ubset of theD
ata associated with those columns.lapply(.SD, ...)
operates on.SD
, which is a list of columns (like all data.frames and data.tables).lapply
returns a list, so in the endj
looks likecols := list(...)
.
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
Apply a function to every column
In dplyr
, you can use across
to apply a function to multiple columns.
library(dplyr)
df <- df %>% mutate(across(starts_with('var'), ~./sd(.)))
df
# var1 var2 var3
# <dbl> <dbl> <dbl>
# 1 0.0384 0.118 0.707
# 2 0.0767 0.237 0.354
# 3 1.34 0.474 1.06
# 4 0.192 0.632 1.06
# 5 0.844 0.809 1.24
# 6 1.02 0.987 1.41
In base R, we can use lapply
-
df[] <- lapply(df, function(x) x/sd(x))
To apply this to selected columns (1:168
) you can do
df[1:168] <- lapply(df[1:168], function(x) x/sd(x))
Apply a function to a subset of data.table columns, by column-indices instead of name
The idiomatic approach is to use .SD
and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]
data.table apply function on each column
Seems like you are setting the years columns with 1 if it is between begy
and begy + SeqLen - 1
for each row. Here is another way to do this:
y[order(rn),
(grep("^20", names(y), value=TRUE)) :=
dcast(y[, seq(begy, by=1, length.out=SeqLen), by=.(rn)], rn ~ V1, length)[,-1L]]
y
output:
rn number sex birthy 2004 2005 2006 2007 2008 2009 2010 begy SeqLen
1: 2 150 1 1980 1 1 1 0 0 0 0 2004 3
2: 3 99 1 1980 1 1 1 1 0 0 0 2004 4
3: 4 899 1 1980 0 0 0 1 1 1 1 2007 4
4: 5 789 1 1982 1 1 1 0 0 0 0 2004 3
Explanation:
Create a sequence of years for each row, then use dcast
to do a one-hot encoding. Use the output to overwrite the years columns.
order(rn)
will ensure that we don't overwrite rows incorrectly after dcast
Frank's method is way faster:
y[, as.character(2004:2010) :=
lapply(2004:2010, function(x) as.integer(between(x, begy, begy + SeqLen - 1)))]
R: convert mutate across character call to data.table
We can use
library(data.table)
nm1 <- names(mtcars)[sapply(mtcars, is.character)]
setDT(mtcars)[, (nm1) := lapply(.SD, function(x) replace(x, x == "", NA)), .SDcols = nm1]
How do I reference a function parameter inside inside a data.table with a column of the same name?
- One possible option is this:
myfunc <- function(dt, t){
env <- environment()
dt <- dt[t==get('t',env)]
mean(dt$b)
}
- Another approach: while perhaps not strictly a "solution" to your current problem, you may find it of interest. Consider
data.table
version>= 1.14.3. In this case, we can useenv
param ofDT[i,j,by,env,...]
, to indicate the datatable column as"t"
, and the function parameter ast
. Notice that this will work on columnt
with function parametert
, even ifdt
contains columns namedcol
andval
myfunc <- function(dt, t){
dt <- dt[col==val, env=list(col="t", val=t)]
mean(dt$b)
}
In both case, usage and output is as below:
Usage
myfunc(dt = foo, t = 3)
Output:
[1] 0.1292877
Input:
set.seed(123)
foo <- data.table(t = c(1,1,2,2,3), b = rnorm(5))
foo
looks like this:
> foo
t b
1: 1 -0.56047565
2: 1 -0.23017749
3: 2 1.55870831
4: 2 0.07050839
5: 3 0.12928774
How to apply function over subset of columns in data.table while grouping by some other column?
We could specify the .SDcols
and assign the output back to the same columns.
foo[, names(foo)[3:5] := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = x:z]
Note that the type should be the same for the output and the input. If the input is integer
and output is numeric
it would have problem. So, change the class
to numeric
first and then do the assignment
nm1 <- names(foo)[3:5]
#or programmatically based on checking whether column is numeric
#nm1 <- foo[, which(unlist(lapply(.SD, is.numeric)))]
foo[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1
][, (nm1) := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = nm1][]
A tidyverse
approach to the above would be
library(dplyr)
foo %>%
group_by(group) %>%
mutate_if(is.numeric, funs(./sum(.)))
data.table, apply function to portion of a table
data.table
needs more work to apply scale
:
library(data.table)
cols <- names(dt)[-1]
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
dt[2:5, (cols) := Map(scale, .SD, c(1,2,3), c(4,5,6)), .SDcols = cols]
Related Topics
How to Find the Difference in Value in Every Two Consecutive Rows in R
Append Data Frames Together in a for Loop
How to Keep Columns When Grouping/Summarizing
Subtracting Two Columns to Give a New Column in R
How to Remove the Negative Values from a Data Frame in R
Removing Columns That Are All 0
How to Change Y Axis Limits in Decimal Points in R
Minimum (Or Maximum) Value of Each Row Across Multiple Columns
How to Find the Closest Date to a Given Date
Pass a String as Variable Name in Dplyr::Filter
Setting Individual Axis Limits With Facet_Wrap and Scales = "Free" in Ggplot2
Join 3 Columns of Different Lengths in R
I Want to Split Street Address into Two Columns. One With Street Number Other With Street Name
How to Combine Multiple Variable Data to a Single Variable Data
R Markdown - Changing Font Size and Font Type in HTML Output
R - Test If a String Vector Contains Any Element of Another List