R data.table apply function to rows using columns as arguments
The best way is to write a vectorized function, but if you can't, then perhaps this will do:
x[, func.text(f1, f2), by = seq_len(nrow(x))]
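To make the difference concrete, here is a minimal sketch. The data, the column values, and the scalar-only function f_scalar are all made up for illustration (f_scalar stands in for func.text, which is not shown here):

```r
library(data.table)

# Hypothetical data and function, invented for illustration only
x <- data.table(f1 = c(1, 5, 3), f2 = c(4, 2, 6))

# Works on scalars only: if() needs a length-one condition
f_scalar <- function(a, b) if (a > b) a - b else b - a

# Row by row: each group is a single row, so f_scalar sees scalars
by_row <- x[, f_scalar(f1, f2), by = seq_len(nrow(x))]$V1

# Vectorized equivalent: one call over the whole columns
vectorized <- x[, abs(f1 - f2)]
```

Both give the same values here; the grouped version calls the function once per row, so it should be expected to slow down as the table grows.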
Applying a function to each row of a data.table
How about:
x
a b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19
x[,list(a=rep(a,each=2), V1=unlist(strsplit(b," ")))]
a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19
A generalized solution, following the comment, for rows that hold a varying number of values:
x[,{s=strsplit(b," ");list(a=rep(a,sapply(s,length)), V1=unlist(s))}]
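A small sketch of the generalized form on input where the number of values per row varies. The data is made up, and lengths(s) is used as a shorthand that should be equivalent to sapply(s, length):

```r
library(data.table)

# Hypothetical input: b holds space-separated values, count varies by row
x <- data.table(a = c(1, 2, 3), b = c("12 13", "14", "15 16 17"))

long <- x[, {
  s <- strsplit(b, " ")                     # list of character vectors
  list(a = rep(a, lengths(s)),              # repeat a once per split value
       V1 = unlist(s))                      # flatten the split values
}]
```

Note that V1 stays character after the split; wrap it in as.numeric() if numbers are needed.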
Apply a function to every specified column in a data.table and update by reference
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
- Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
- .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
- lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
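Here is a self-contained sketch of both variants side by side. The input dt and cols are an assumption, reconstructed from the result printed above:

```r
library(data.table)

# Assumed input, reconstructed from the result shown above
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# Variant 1: lapply() over .SD, assigned back to the same columns
dt1 <- copy(dt)
dt1[, (cols) := lapply(.SD, "*", -1), .SDcols = cols]

# Variant 2: set() in a loop, which skips the overhead of [.data.table
dt2 <- copy(dt)
for (j in cols) set(dt2, j = j, value = -dt2[[j]])
```

copy() is only there so the two variants start from the same data; both modify their table by reference and leave column d untouched.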
Applying a function to every row on each n number of columns in R
Here is one approach:
Let d be your 3 rows x 2000 columns frame, with column names as.character(1:2000) (see below for generation of fake data). We add a row identifier using .I, then melt the data long, adding grp, a column-group identifier (i.e. identifying the 20 sets of 100). Then apply your function myfunc (see below for a stand-in function for this example), by row and group, and swing wide. (I used stringr::str_pad to add 0 to the front of the group number.)
# add row identifier
d[, row:=.I]
# melt and add col group identifier
dm = melt(d,id.vars = "row",variable.factor = FALSE)[,variable:=as.numeric(variable)][order(variable,row), grp:=rep(1:20, each=300)]
# get the result (180 rows long), applying myfunc to each set of columns, by row
result = dm[, myfunc(value), by=.(row,grp)][,frow:=rep(1:3,times=60)]
# swing wide (3 rows long, 60 columns wide)
dcast(
result[,v:=paste0("grp",stringr::str_pad(grp,2,pad = "0"),"_",row)],
frow~v,value.var="V1"
)[, frow:=NULL][]
Output: (first six columns only)
grp01_1 grp01_2 grp01_3 grp02_1 grp02_2 grp02_3
<num> <num> <num> <num> <num> <num>
1: 0.54187168 0.47650694 0.48045694 0.51278399 0.51777319 0.46607845
2: 0.06671367 0.08763655 0.08076939 0.07930063 0.09830116 0.07807937
3: 0.25828989 0.29603471 0.28419957 0.28160367 0.31353016 0.27942687
Input:
d = data.table()
alloc.col(d,2000)
set.seed(123)
for(c in 1:2000) set(d,j=as.character(c), value=runif(3))
myfunc function (toy example for this answer):
myfunc <- function(x) c(mean(x), var(x), sd(x))
Method to operate on each row of data.table without using apply function
If you really need speed, as always it's best to move to C++ using Rcpp, which gives us a solution that's over 100x faster.
Data
I made different example data to test this on, with 1000 rows instead of 5:
set.seed(123)
dat <- data.table(A = rnorm(1e3, sd=4), B = rnorm(1e3, sd=4), C = rnorm(1e3, sd=4),
D = rnorm(1e3, sd=4), E = rnorm(1e3, sd=4))
Solution
I used the following C++ code to do the same thing as your function, but the looping is now done in C++ rather than in R through apply, which saves considerable time:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector mcs2(DataFrame x) {
int n = x.nrows();
int m = x.size();
NumericMatrix mat(n, m);
for ( int j = 0; j < m; ++j ) {
mat(_, j) = NumericVector(x[j]);
}
NumericVector result(n);
for ( int i = 0; i < n; ++i ) {
NumericVector tmp = mat(i, _);
std::sort(tmp.begin(), tmp.end());
bool do_sd = true;
for ( int j = 1; j < m; ++j ) {
if ( tmp[j] - tmp[j-1] > 6.0 ) {
result[i] = NA_REAL;
do_sd = false;
break;
}
}
if ( do_sd ) {
result[i] = sd(tmp);
}
do_sd = true;
}
return result;
}
We can make sure it's returning the same values:
all.equal(apply(dat[, 2:4], 1, mcs1), mcs2(dat[,2:4]))
[1] TRUE
Now let's benchmark:
benchmark(mcs1 = dat[, sd:=apply(.SD, 1, mcs1), .SDcols=(c(2,3,4))],
mcs2 = dat[, sd:=mcs2(.SD), .SDcols=(c(2,3,4))],
order = 'relative',
columns = c('test', 'elapsed', 'relative', 'user.self'))
test elapsed relative user.self
2 mcs2 0.19 1.000 0.183
1 mcs1 21.34 112.316 20.044
How to compile this code
As an introduction to using C++ code through Rcpp, I'd suggest this chapter of Hadley Wickham's Advanced R. If you intend on doing anything further with Rcpp I'd strongly recommend you also read the official documentation and vignettes, but Wickham's book is probably a little more beginner friendly to use as a starting point. For your purposes, you just need to get Rcpp up and running so that you can compile the code above.
For this code to work for you, you'll need the Rcpp package if you don't already have it. You can get the package by running
install.packages("Rcpp")
from R. Note you'll also need a compiler; if you're on a Debian-based Linux system such as Ubuntu you can run
sudo apt install r-base-dev
from the terminal. If you are on Mac or Windows, check here for some instructions on getting this set up, or in the Wickham chapter linked above.
Once you have Rcpp installed, save the C++ code above into a file. Let's say for our example the file is named "SOanswer.cpp". Then you can make its mcs2() function available from R by putting the following two lines in your R script:
library(Rcpp)
sourceCpp("SOanswer.cpp") # assuming the file is in your working directory
That's it! Now your R script can call mcs2() and run much faster. If you want to learn more about Rcpp, besides the Wickham chapter above, I'd check out the reference manual and the vignettes available here, this page from RStudio (which also has tons of links, some of which are linked to here), and you can also find some really useful stuff looking around the Rcpp gallery.
How to apply function in each row in data.table
For 50 columns, it is better to use max.col
dt$index <- max.col(dt, 'first') *(!!rowSums(dt))
Or as @David Arenburg mentioned, more idiomatic code would be
dt[, indx := max.col(.SD,ties.method="first")*(!!rowSums(.SD))]
If we need 9999 instead of 0 for the rows whose sum is zero:
(max.col(dt)*(!!rowSums(dt))) + (!rowSums(dt))*9999
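A small worked sketch of the max.col approach, on made-up 0/1 data (the table and its column names are assumptions, and the rowSums test assumes non-negative values so that a zero sum means an all-zero row):

```r
library(data.table)

# Hypothetical 0/1 data; the third row contains no 1 at all
dt <- data.table(V1 = c(0, 1, 0), V2 = c(1, 1, 0), V3 = c(1, 0, 0))

# Column index of the first maximum in each row
first_max <- max.col(dt, ties.method = "first")

# Rows that sum to zero have no match, so multiply their index by FALSE (0)
idx <- first_max * (!!rowSums(dt))

# Same, but mark the no-match rows with 9999 instead of 0
idx_9999 <- idx + (!rowSums(dt)) * 9999
```

With this input, idx is 2, 1, 0 and idx_9999 is 2, 1, 9999. Passing ties.method = "first" matters, because the default for max.col breaks ties at random.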
Apply a function row-wise to a data.table
We can use data.table methods
setnames(sdata[, {un1 <- unlist(.SD)
as.list(`length<-`(rev(un1[!is.na(un1)]), length(un1)))
} , by = .(grp=1:nrow(sdata))][, grp := NULL], paste0("Z", 1:5))[]
# Z1 Z2 Z3 Z4 Z5
#1: 15 32 21 NA NA
#2: 17 6 NA NA NA
#3: 230 9 7 2 NA
#4: 0 30 28 19 5
#5: 0 105 30 29 16
#6: 0 0 0 2 NA
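The core of the j expression can be tried on a single vector in plain R. The input below is made up, chosen so that it reproduces the first output row above:

```r
# Hypothetical single row, invented to match the first output row
un1 <- c(21, NA, 32, 15, NA)

# Drop the NAs, reverse what is left, then pad back to the original length;
# `length<-` fills the added trailing positions with NA
res <- `length<-`(rev(un1[!is.na(un1)]), length(un1))
```

Here res is 15 32 21 NA NA. In the data.table call, as.list() then turns that vector into one value per output column, and grouping by row number applies it to each row in turn.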