Applying a Function to Each Row of a Data.Table

R data.table apply function to rows using columns as arguments

The best way is to write a vectorized function, but if you can't, then perhaps this will do:

x[, func.text(f1, f2), by = seq_len(nrow(x))]
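As a minimal sketch of the row-wise idiom above: the table x, its columns f1 and f2, and the body of func.text are all made up here for illustration, since the original question's function is not shown. Grouping by seq_len(nrow(x)) makes every row its own group, so the function only ever sees length-1 inputs.

```r
library(data.table)

# Hypothetical data and function, for illustration only
x <- data.table(f1 = c("a", "b", "c"), f2 = c(1L, 2L, 3L))

# Not vectorized in the way we need: called on whole columns it would
# collapse everything into a single string
func.text <- function(s, n) paste(rep(s, n), collapse = "")

# One group per row, so func.text is called once per row
res <- x[, func.text(f1, f2), by = seq_len(nrow(x))]
print(res)
```

The result has one row per input row, with the function's value in V1.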

Applying a function to each row of a data.table

How about:

x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

x[,list(a=rep(a,each=2), V1=unlist(strsplit(b," ")))]
   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

Generalized solution, following the comment (handles any number of space-separated values per row):

x[,{s=strsplit(b," ");list(a=rep(a,sapply(s,length)), V1=unlist(s))}]
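A small sketch to show why the generalized version is needed, using made-up data in which the rows of b hold different numbers of values. It uses lengths(s), which should be equivalent to sapply(s, length) here.

```r
library(data.table)

# Illustrative data: b holds 2, 1 and 3 space-separated values
x <- data.table(a = c(1, 2, 3), b = c("12 13", "14", "15 16 17"))

res <- x[, {
  s <- strsplit(b, " ")                   # list with one character vector per row
  list(a  = rep(a, lengths(s)),           # repeat a once per extracted value
       V1 = unlist(s))
}]
print(res)
```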

Apply a function to every specified column in a data.table and update by reference

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

There are a few tricks here:

  • Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
  • .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
  • lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])
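For a self-contained check that both approaches give the same result, here is a sketch with a toy table shaped like the output shown above. The starting values and the choice of cols are assumptions.

```r
library(data.table)

cols <- c("a", "b")                       # columns to update (assumed)
dt1 <- data.table(a = c(1, 2, 3), b = c(1, 2, 3), d = c(1, 2, 3))
dt2 <- copy(dt1)                          # independent copy for the second approach

# Approach 1: lapply over .SD, assigned back by reference
dt1[, (cols) := lapply(.SD, "*", -1), .SDcols = cols]

# Approach 2: set() in a loop, avoiding the overhead of [.data.table
for (j in cols) set(dt2, j = j, value = -dt2[[j]])

print(dt1)
print(all.equal(dt1, dt2))
```

Column d is left untouched in both cases because it is not listed in cols.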

Applying a function to every row on each n number of columns in R

Here is one approach:

Let d be your 3-row by 2000-column table, with column names as.character(1:2000) (see below for how the fake data is generated). We add a row identifier using .I, then melt the data long and add grp, a column-group identifier (i.e. identifying the 20 sets of 100 columns). Then we apply your function myfunc (see below for a stand-in function for this example) by row and group, and swing wide. (I used stringr::str_pad to add a leading 0 to the group number.)

# add row identifier
d[, row:=.I]

# melt and add col group identifier
dm = melt(d,id.vars = "row",variable.factor = FALSE)[,variable:=as.numeric(variable)][order(variable,row), grp:=rep(1:20, each=300)]

# get the result (180 rows long), applying myfunc to each set of columns, by row
result = dm[, myfunc(value), by=.(row,grp)][,frow:=rep(1:3,times=60)]

# swing wide (3 rows long, 60 columns wide)
dcast(
result[,v:=paste0("grp",stringr::str_pad(grp,2,pad = "0"),"_",row)],
frow~v,value.var="V1"
)[, frow:=NULL][]

Output: (first six columns only)

      grp01_1    grp01_2    grp01_3    grp02_1    grp02_2    grp02_3
        <num>      <num>      <num>      <num>      <num>      <num>
1: 0.54187168 0.47650694 0.48045694 0.51278399 0.51777319 0.46607845
2: 0.06671367 0.08763655 0.08076939 0.07930063 0.09830116 0.07807937
3: 0.25828989 0.29603471 0.28419957 0.28160367 0.31353016 0.27942687

Input:

d = data.table()
alloc.col(d,2000)
set.seed(123)
for(c in 1:2000) set(d,j=as.character(c), value=runif(3))

myfunc Function (toy example for this answer):

myfunc <- function(x) c(mean(x), var(x), sd(x))
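To see the melt-and-group steps at a size that can be checked by hand, here is a scaled-down sketch: 2 rows, 6 columns, groups of 3 columns. The sizes and values are assumptions chosen for illustration, and the group id is computed arithmetically rather than with rep().

```r
library(data.table)

myfunc <- function(x) c(mean(x), var(x), sd(x))

# Toy table with columns named "1".."6"
d <- data.table(`1` = c(1, 2), `2` = c(3, 4), `3` = c(5, 6),
                `4` = c(7, 8), `5` = c(9, 10), `6` = c(11, 12))

d[, row := .I]                                   # row identifier
dm <- melt(d, id.vars = "row", variable.factor = FALSE)
dm[, variable := as.numeric(variable)]
dm[, grp := (variable - 1) %/% 3 + 1]            # columns 1-3 -> grp 1, 4-6 -> grp 2

# three values (mean, var, sd) per row and column group
result <- dm[, myfunc(value), by = .(row, grp)]
print(result)
```

For row 1, group 1, the inputs are 1, 3 and 5, so the three output values are the mean 3, variance 4 and standard deviation 2.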

Method to operate on each row of data.table without using apply function

If you really need speed, as always it's best to move to C++ using Rcpp, which gives us a solution that's over 100x faster.

Data

I did make some different example data to test this on that had 1000 rows instead of 5:

set.seed(123)
dat <- data.table(A = rnorm(1e3, sd=4), B = rnorm(1e3, sd=4), C = rnorm(1e3, sd=4),
D = rnorm(1e3, sd=4), E = rnorm(1e3, sd=4))

Solution

I used the following C++ code to do the same thing as your function, but now the looping is done in C++ instead of in R through apply, which saves considerable time:

#include <Rcpp.h>
#include <algorithm>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector mcs2(DataFrame x) {
    int n = x.nrows();
    int m = x.size();

    // copy the data frame columns into a numeric matrix
    NumericMatrix mat(n, m);
    for ( int j = 0; j < m; ++j ) {
        mat(_, j) = NumericVector(x[j]);
    }

    NumericVector result(n);
    for ( int i = 0; i < n; ++i ) {
        // sort a copy of row i
        NumericVector tmp = mat(i, _);
        std::sort(tmp.begin(), tmp.end());

        // NA if any two neighbouring sorted values differ by more than 6
        bool do_sd = true;
        for ( int j = 1; j < m; ++j ) {
            if ( tmp[j] - tmp[j-1] > 6.0 ) {
                result[i] = NA_REAL;
                do_sd = false;
                break;
            }
        }

        // otherwise the standard deviation of the row
        if ( do_sd ) {
            result[i] = sd(tmp);
        }
        do_sd = true;
    }
    return result;
}

We can make sure it's returning the same values:

all.equal(apply(dat[, 2:4], 1, mcs1), mcs2(dat[,2:4]))

[1] TRUE
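The R function mcs1 comes from the question and is not reproduced in this answer. Judging only from the C++ code above, a plain-R function with the same per-row logic might look like the sketch below. The name mcs1_sketch and the body are assumptions, not the asker's actual code.

```r
# Assumed R equivalent of the per-row logic in mcs2():
# sort the values, return NA if any neighbouring gap exceeds 6,
# otherwise return the standard deviation.
mcs1_sketch <- function(x) {
  x <- sort(x)
  if (any(diff(x) > 6)) NA_real_ else sd(x)
}

m <- rbind(c(1, 2, 3),     # gaps of 1   -> sd
           c(0, 10, 20))   # gaps of 10  -> NA
out <- apply(m, 1, mcs1_sketch)
print(out)
```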

Now let's benchmark:

benchmark(mcs1 = dat[, sd:=apply(.SD, 1, mcs1), .SDcols=(c(2,3,4))],
mcs2 = dat[, sd:=mcs2(.SD), .SDcols=(c(2,3,4))],
order = 'relative',
columns = c('test', 'elapsed', 'relative', 'user.self'))

  test elapsed relative user.self
2 mcs2    0.19    1.000     0.183
1 mcs1   21.34  112.316    20.044

How to compile this code

As an introduction to using C++ code through Rcpp, I'd suggest this chapter of Hadley Wickham's Advanced R. If you intend on doing anything further with Rcpp I'd strongly recommend you also read the official documentation and vignettes, but Wickham's book is probably a little more beginner friendly to use as a starting point. For your purposes, you just need to get Rcpp up and running so that you can compile the code above.

For this code to work for you, you'll need the Rcpp package if you don't already have it. You can get the package by running

install.packages("Rcpp")

from R. Note you'll also need a compiler; if you're on a Debian-based Linux system such as Ubuntu you can run

sudo apt install r-base-dev

from the terminal. If you are on Mac or Windows, check here for some instructions on getting this set up, or in the Wickham chapter linked above.

Once you have Rcpp installed, save the C++ code above into a file. Let's say for our example the file is named "SOanswer.cpp". Then you can make its mcs2() function available from R by putting the following two lines in your R script:

library(Rcpp)
sourceCpp("SOanswer.cpp") # assuming the file is in your working directory

That's it! Now your R script can call mcs2() and run much faster. If you want to learn more about Rcpp, besides the Wickham chapter above, I'd check out the reference manual and the vignettes available here, this page from RStudio (which also has tons of links, some of which are linked to here), and you can also find some really useful stuff by looking around the Rcpp gallery.

How to apply function in each row in data.table

For 50 columns, it is better to use max.col

dt$index <- max.col(dt, 'first') *(!!rowSums(dt))

Or as @David Arenburg mentioned, more idiomatic code would be

dt[, indx := max.col(.SD,ties.method="first")*(!!rowSums(.SD))]

If we need 9999 instead of 0 for rows whose sum is zero:

 (max.col(dt)*(!!rowSums(dt))) + (!rowSums(dt))*9999
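A small sketch of both variants on made-up data follows. The column names, the values, and the assumption that all entries are non-negative (so a zero row sum means an all-zero row) are mine. The data columns are passed through .SDcols so that the newly added index columns do not feed back into the calculation.

```r
library(data.table)

# Illustrative 0/1 data; row 3 is all zeros, row 2 has a tie
dt <- data.table(v1 = c(0, 1, 0, 0), v2 = c(1, 1, 0, 0), v3 = c(0, 0, 0, 1))
cols <- c("v1", "v2", "v3")

# index of the first maximum per row, or 0 when the row sums to zero
dt[, indx := max.col(.SD, ties.method = "first") * (!!rowSums(.SD)),
   .SDcols = cols]

# same, but 9999 instead of 0 for rows that sum to zero
dt[, indx9999 := max.col(.SD, ties.method = "first") * (!!rowSums(.SD)) +
                 (!rowSums(.SD)) * 9999,
   .SDcols = cols]
print(dt)
```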

Apply a function row-wise to a data.table

We can use data.table methods

setnames(sdata[,  {un1 <- unlist(.SD)
as.list(`length<-`(rev(un1[!is.na(un1)]), length(un1)))
} , by = .(grp=1:nrow(sdata))][, grp := NULL], paste0("Z", 1:5))[]
# Z1 Z2 Z3 Z4 Z5
#1: 15 32 21 NA NA
#2: 17 6 NA NA NA
#3: 230 9 7 2 NA
#4: 0 30 28 19 5
#5: 0 105 30 29 16
#6: 0 0 0 2 NA
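The original sdata is not shown, so here is the same idea on a made-up three-column table: for each row, drop the NAs, reverse what is left, and pad back to the original width with NA. This sketch passes use.names = FALSE to unlist() so that every group returns an unnamed list. That is a small deviation from the code above.

```r
library(data.table)

# Hypothetical input standing in for sdata
sdata <- data.table(A = c(21, NA, 2), B = c(32, 6, 7), C = c(15, 17, NA))

res <- sdata[, {
  un1 <- unlist(.SD, use.names = FALSE)          # the current row as a plain vector
  # reverse the non-NA values, then pad with NA up to the original length
  as.list(`length<-`(rev(un1[!is.na(un1)]), length(un1)))
}, by = .(grp = seq_len(nrow(sdata)))][, grp := NULL]

setnames(res, paste0("Z", 1:3))
print(res)
```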

