Apply a Function to Every Specified Column in a Data.Table and Update by Reference

Apply a function to every specified column in a data.table and update by reference

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

There are a few tricks here:

  • Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
  • .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
  • lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])

Apply a function to every column

In dplyr, you can use across to apply a function to multiple columns.

library(dplyr)
df <- df %>% mutate(across(starts_with('var'), ~./sd(.)))
df

# var1 var2 var3
# <dbl> <dbl> <dbl>
# 1 0.0384 0.118 0.707
# 2 0.0767 0.237 0.354
# 3 1.34 0.474 1.06
# 4 0.192 0.632 1.06
# 5 0.844 0.809 1.24
# 6 1.02 0.987 1.41

In base R, we can use lapply -

df[] <- lapply(df, function(x) x/sd(x))

To apply this to selected columns (1:168) you can do

df[1:168] <- lapply(df[1:168], function(x) x/sd(x))

Apply a function to a subset of data.table columns, by column-indices instead of name

The idiomatic approach is to use .SD and .SDcols

You can force the RHS to be evaluated in the parent frame by wrapping in ()

a[, (b) := lapply(.SD, as.numeric), .SDcols = b]

For columns 2:3

a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]

or

mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]

data.table apply function on each column

Seems like you are setting the years columns with 1 if it is between begy and begy + SeqLen - 1 for each row. Here is another way to do this:

y[order(rn), 
(grep("^20", names(y), value=TRUE)) :=
dcast(y[, seq(begy, by=1, length.out=SeqLen), by=.(rn)], rn ~ V1, length)[,-1L]]
y

output:

   rn number sex birthy 2004 2005 2006 2007 2008 2009 2010 begy SeqLen
1: 2 150 1 1980 1 1 1 0 0 0 0 2004 3
2: 3 99 1 1980 1 1 1 1 0 0 0 2004 4
3: 4 899 1 1980 0 0 0 1 1 1 1 2007 4
4: 5 789 1 1982 1 1 1 0 0 0 0 2004 3

Explanation:

Create a sequence of years for each row, then use dcast to do a one-hot encoding. Use the output to overwrite the years columns.

order(rn) will ensure that we don't overwrite rows incorrectly after dcast


Frank's method is way faster:

y[, as.character(2004:2010) := 
lapply(2004:2010, function(x) as.integer(between(x, begy, begy + SeqLen - 1)))]

R: convert mutate across character call to data.table

We can use

library(data.table)
nm1 <- names(mtcars)[sapply(mtcars, is.character)]
setDT(mtcars)[, (nm1) := lapply(.SD, function(x) replace(x, x == "", NA)), .SDcols = nm1]

How do I reference a function parameter inside inside a data.table with a column of the same name?

  1. One possible option is this:
myfunc <- function(dt, t){
env <- environment()
dt <- dt[t==get('t',env)]
mean(dt$b)
}

  1. Another approach: while perhaps not strictly a "solution" to your current problem, you may find it of interest. Consider data.table version>= 1.14.3. In this case, we can use env param of DT[i,j,by,env,...], to indicate the datatable column as "t", and the function parameter as t. Notice that this will work on column t with function parameter t, even if dt contains columns named col and val
myfunc <- function(dt, t){
dt <- dt[col==val, env=list(col="t", val=t)]
mean(dt$b)
}

In both case, usage and output is as below:

Usage

myfunc(dt = foo, t = 3)

Output:

[1] 0.1292877

Input:

set.seed(123)
foo <- data.table(t = c(1,1,2,2,3), b = rnorm(5))

foo looks like this:

> foo
t b
1: 1 -0.56047565
2: 1 -0.23017749
3: 2 1.55870831
4: 2 0.07050839
5: 3 0.12928774

How to apply function over subset of columns in data.table while grouping by some other column?

We could specify the .SDcols and assign the output back to the same columns.

foo[, names(foo)[3:5]  := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = x:z]

Note that the type should be the same for the output and the input. If the input is integer and output is numeric it would have problem. So, change the class to numeric first and then do the assignment

nm1 <- names(foo)[3:5]
#or programmatically based on checking whether column is numeric
#nm1 <- foo[, which(unlist(lapply(.SD, is.numeric)))]
foo[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1
][, (nm1) := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = nm1][]

A tidyverse approach to the above would be

library(dplyr)
foo %>%
group_by(group) %>%
mutate_if(is.numeric, funs(./sum(.)))

data.table, apply function to portion of a table

data.table needs more work to apply scale :

library(data.table)

cols <- names(dt)[-1]
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
dt[2:5, (cols) := Map(scale, .SD, c(1,2,3), c(4,5,6)), .SDcols = cols]


Related Topics



Leave a reply



Submit