R Data.Table Apply Function to Rows Using Columns as Arguments

R data.table apply function to rows using columns as arguments

The best way is to write a vectorized function, but if you can't, then perhaps this will do:

x[, func.text(f1, f2), by = seq_len(nrow(x))]
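For concreteness, here is a minimal sketch of the two approaches; the toy table and the body of func.text are assumptions, not from the question:

library(data.table)
x <- data.table(f1 = 1:3, f2 = c(10, 20, 30))

# assumed toy function of the two columns
func.text <- function(a, b) paste(a, b, sep = "-")

# vectorized: one call over the whole columns (preferred)
x[, out := func.text(f1, f2)]

# row-by-row: one call per row, as in the snippet above
x[, out2 := func.text(f1, f2), by = seq_len(nrow(x))]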

Apply function by row in data.table using columns as arguments

You could do something like this:

DF <- read.table(text = "    Cycle   Tab ID  colA    colB    colC    colG    high1   high1a
1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC
2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC
3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC
4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC
5 0 45513 -89.719 -504.643 1298.476 131.32 1298.476 colC
6 0 45513 -250.11 -30.862 1877.049 -184.772 1877.049 colC", header = TRUE)

library(data.table)
setDT(DF)

maxTwo <- function(x) {
  ind <- length(x) - (1:0)   # the index is equal for all rows, so it could be
                             # made a function parameter for better efficiency
  as.list(sort.int(x, partial = ind)[ind])   # partial sorting
}
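A quick standalone check of maxTwo on the first row's value columns (a sketch, using the numbers from the data above):

maxTwo(c(-233.781, -84.087, -3.141, 3740.916))
# list(-3.141, 3740.916): the two largest values, in ascending order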

DF[, paste0("max", 1:2) := maxTwo(unlist(.SD)),
   by = seq_len(nrow(DF)), .SDcols = 4:7]
DF[, diffMax := max2 - max1]

# Cycle Tab ID colA colB colC colG high1 high1a max1 max2 diffMax
#1: 1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC -3.141 3740.916 3744.057
#2: 2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC 357.071 2900.866 2543.795
#3: 3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC 353.479 4036.636 3683.157
#4: 4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC 384.945 4354.994 3970.049
#5: 5 0 45513 -89.719 -504.643 1298.476 131.320 1298.476 colC 131.320 1298.476 1167.156
#6: 6 0 45513 -250.110 -30.862 1877.049 -184.772 1877.049 colC -30.862 1877.049 1907.911

However, you'd still be looping over the rows, which means nrow(DF) calls to the function. You could try Rcpp to do the looping in compiled code.
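If that is too slow, the top-two case can also be done fully vectorized. The following is a sketch, not part of the original answer, that works on a plain matrix of the value columns (positions 4:7 as above):

m    <- as.matrix(DF[, .SD, .SDcols = 4:7])
max2 <- do.call(pmax, as.data.frame(m))                  # row-wise maximum
m[cbind(seq_len(nrow(m)), max.col(m, ties.method = "first"))] <- -Inf
max1 <- do.call(pmax, as.data.frame(m))                  # second-largest per row
DF[, c("max1", "max2") := .(max1, max2)][, diffMax := max2 - max1]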

How to apply a different multi-argument function to each row of a data.table?

So we want to do a row-wise calculation and store the result as a new column, o.

mapply is definitely the right family of functions, but mapply (and sapply) will simplify their output, collapsing the list before they return it. data.table loves lists. Map is just an expressive shortcut to mapply(..., SIMPLIFY = FALSE), which does not modify the return value.
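The toy data.table used below is not shown in the question; a reconstruction (an assumption, inferred from the printed results) is:

library(data.table)
dt <- data.table(l = c("apple", "ball", "cat"),
                 m = 1:3,
                 n = c("I ate apple", "I played ball", "cat ate pudding"))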

The following does the calculation we're after, but it's still not quite right. (data.table interprets the list-output as separate columns)

> dt[, Map(sub, l, '', n)]
   apple     ball          cat
1: I ate I played  ate pudding

So we want to go one further and wrap it in a list to get the output we're after:

> dt[, .(Map(sub, l, '', n))]
V1
1: I ate
2: I played
3: ate pudding

Now we can assign this using :=

> dt[, o := Map(sub, l, '', n)]
> dt
l m n o
1: apple 1 I ate apple I ate
2: ball 2 I played ball I played
3: cat 3 cat ate pudding ate pudding

EDIT: As was pointed out, this results in o being a list-column.

We can avoid this by using standard mapply, though I tend to prefer the one-size-fits-all approach of Map: each row creates a single output, which goes in a list; regardless of what that output looks like, this will always work, and we can type-convert at the end.

dt[, o := mapply(sub, l, '', n)]
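A quick sanity check (sketch): with mapply the new column is an atomic character vector rather than a list-column.

class(dt$o)
# "character" here; the Map version above gives "list"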

Apply a function to every specified column in a data.table and update by reference
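The setup implied by the result shown below is roughly (a reconstruction, not the asker's exact objects):

library(data.table)
dt   <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")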

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

There are a few tricks here:

  • Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
  • .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
  • lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...) (see the expanded sketch just after this list).
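To make the last point concrete: with cols as above, the j expression is effectively equivalent to this expanded sketch:

dt[, c("a", "b") := list(a * -1, b * -1)]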

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])

Function in data.table with two columns as arguments

One option, if you don't mind adding quotes around the variable names:

fun <- function(DT, fun, ...){
  fun_args <- c(...)
  DT[, new_col := do.call(fun, setNames(mget(fun_args), names(fun_args)))]
}
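Assuming a toy DT like the following (reconstructed from the printed result below):

library(data.table)
DT <- data.table(col1 = 1:4, col2 = 2:5)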

fun(DT, fun = function(x, y){y - x}, x = 'col1', y = 'col2')

DT
# col1 col2 new_col
# 1: 1 2 1
# 2: 2 3 1
# 3: 3 4 1
# 4: 4 5 1

Or use .SDcols (same result as above)

fun <- function(DT, fun, ...){
  fun_args <- c(...)
  DT[, new_col := do.call(fun, setNames(.SD, names(fun_args))),
     .SDcols = fun_args]
}

R data.table - Apply function A to some columns and function B to some others

Here is one way to do it with Map or mapply:

Let's make some toy data first:

dt <- data.table(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  variable3 = rnorm(100),
  variable4 = rnorm(100),
  grp = sample(letters[1:5], 100, replace = TRUE)
)

colsToMean <- c("variable1", "variable2")
colsToMax <- c("variable3")
colsToSd <- c("variable4")

Then,

scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c(mean, max, sd), lengths(scols))  # repeat each function to line up with its columns: list(mean, mean, max, sd)

# summary
dt[, Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]

# or replace the original values with summary statistics as in OP
dt[, unlist(scols) := Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]
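By default the summary columns come back named V1, V2, and so on; if you would rather keep the original column names, one option (a sketch) is to name the list that Map returns:

dt[, setNames(Map(function(f, x) f(x), funs, .SD), unlist(scols)),
   by = grp, .SDcols = unlist(scols)]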

Another option with GForce on:

scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c('mean', 'max', 'sd'), lengths(scols))

jexp <- paste0('list(', paste0(funs, '(', unlist(scols), ')', collapse = ', '), ')')
# jexp is now "list(mean(variable1), mean(variable2), max(variable3), sd(variable4))"
dt[, eval(parse(text = jexp)), by = grp, verbose = TRUE]

# Detected that j uses these columns: variable1,variable2,variable3,variable4
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# Getting back original order ... 0.000sec
# lapply optimization is on, j unchanged as 'list(mean(variable1), mean(variable2), max(variable3), sd(variable4))'
# GForce optimized j to 'list(gmean(variable1), gmean(variable2), gmax(variable3), gsd(variable4))'
# Making each group and running j (GForce TRUE) ... 0.000sec

Applying function over data.table and storing results in a list

You can use Map to get the output as a list:

setNames(Map(opt, df$xvalue, df$yvalue), df$ColName)

#$Column1
#[1] 15

#$Column2
#[1] 8

#$Column3
#[1] 6
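For a self-contained run, here is a toy opt and df that produce output of this shape; both are hypothetical stand-ins, not the original poster's objects:

library(data.table)
opt <- function(x, y) x + y   # hypothetical stand-in for the real opt()
df  <- data.table(ColName = paste0("Column", 1:3),
                  xvalue  = c(10, 5, 4),
                  yvalue  = c(5, 3, 2))
setNames(Map(opt, df$xvalue, df$yvalue), df$ColName)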

Applying a function to every row on each n number of columns in R

Here is one approach:

Let d be your 3 rows x 2000 columns frame, with column names as.character(1:2000) (see below for generation of the fake data). We add a row identifier using .I, then melt the data long, adding grp, a column-group identifier (i.e. identifying the 20 sets of 100 columns). Then apply your function myfunc (see below for the stand-in function used in this example), by row and group, and swing wide. (I used stringr::str_pad to add a leading 0 to the group number.)

# add row identifier
d[, row:=.I]

# melt and add col group identifier
dm = melt(d, id.vars = "row", variable.factor = FALSE)[
       , variable := as.numeric(variable)][
       order(variable, row), grp := rep(1:20, each = 300)]

# get the result (180 rows long), applying myfunc to each set of columns, by row
result = dm[, myfunc(value), by = .(row, grp)][, frow := rep(1:3, times = 60)]

# swing wide (3 rows long, 60 columns wide)
dcast(
  result[, v := paste0("grp", stringr::str_pad(grp, 2, pad = "0"), "_", row)],
  frow ~ v, value.var = "V1"
)[, frow := NULL][]

Output: (first six columns only)

      grp01_1    grp01_2    grp01_3    grp02_1    grp02_2    grp02_3
        <num>      <num>      <num>      <num>      <num>      <num>
1: 0.54187168 0.47650694 0.48045694 0.51278399 0.51777319 0.46607845
2: 0.06671367 0.08763655 0.08076939 0.07930063 0.09830116 0.07807937
3: 0.25828989 0.29603471 0.28419957 0.28160367 0.31353016 0.27942687

Input:

d = data.table()
alloc.col(d, 2000)
set.seed(123)
for (c in 1:2000) set(d, j = as.character(c), value = runif(3))

myfunc Function (toy example for this answer):

myfunc <- function(x) c(mean(x), var(x), sd(x))

