How to Define Multiple Variables with Lapply

How to define multiple variables with lapply?

General solution

Try outer:

c(outer(1:10, 2:4, Vectorize(function(x, y) x*y)))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

If function is Vectorized already

If the function is already vectorized, as it is here, then we can omit Vectorize:

c(outer(1:10, 2:4, function(x, y) x * y))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40
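
To see why the Vectorize wrapper matters in the general case, here is a minimal sketch with a hypothetical function f (not from the question) that only handles scalars because it branches with if:

```r
# Hypothetical scalar-only function: `if` expects a single TRUE/FALSE condition
f <- function(x, y) if (x > y) x - y else x * y

# outer() calls FUN once with two long vectors, so f is wrapped with
# Vectorize() to be applied pair by pair
res <- outer(1:3, 2:4, Vectorize(f))
res
```

Each cell res[i, j] holds f applied to the i-th element of 1:3 and the j-th element of 2:4.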

Particular example shown in question

In fact, in this particular case the anonymous function shown is the default so this would work:

c(outer(1:10, 2:4))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

Also in this particular case we could use:

c(1:10 %o% 2:4)
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

If input is list X

If your starting point is list X shown in the question then:

c(outer(X[[1]], X[[2]], Vectorize(function(x, y) x * y)))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

or

c(do.call("outer", c(unname(X), Vectorize(function(x, y) x*y))))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

The shortcuts from the earlier sections (dropping Vectorize, or dropping the function entirely) can be used to shorten these calls where they apply.
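
The list X from the question is not reproduced here, so the sketch below assumes it is a named list of two numeric vectors. Under that assumption it checks that the do.call form and the direct form agree:

```r
# Assumed shape of X (hypothetical): a named list of two numeric vectors
X <- list(x = 1:10, y = 2:4)

# Direct call using the default multiplication
direct <- c(outer(X[[1]], X[[2]]))

# unname() makes do.call pass the elements positionally; outer()'s formal
# arguments are X, Y and FUN, so the names "x" and "y" would not match them
via_do_call <- c(do.call("outer", unname(X)))

identical(direct, via_do_call)
```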

R apply function with multiple parameters

Just pass var2 as an extra argument to one of the apply functions.

mylist <- list(a = 1, b = 2, c = 3)
myfxn <- function(var1, var2) {
  var1 * var2
}
var2 <- 2

sapply(mylist, myfxn, var2 = var2)

This passes the same var2 to every call of myfxn. If instead you want each call of myfxn to get the 1st/2nd/3rd/etc. element of both mylist and var2, then you're in mapply's domain.
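
A minimal sketch of the mapply case, reusing mylist and myfxn from above; the vector var2_each is a made-up example with one value per element of mylist:

```r
mylist <- list(a = 1, b = 2, c = 3)
myfxn <- function(var1, var2) var1 * var2

# Hypothetical second argument: one value per element of mylist
var2_each <- c(10, 20, 30)

# mapply() walks both inputs in parallel:
# myfxn(1, 10), myfxn(2, 20), myfxn(3, 30)
res <- mapply(myfxn, mylist, var2_each)
res
```

The result keeps the names of mylist because it is the first vector argument passed to mapply.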

using lapply() with multiple variables

You need to put the mode calculation in the function too.

sapply(data[, 2:ncol(data)], function(x) {
  mode <- data$CAG[which.max(x)]
  B <- sum(x[data$CAG >= mode])
  B / sum(x)
})
## A01 A02
## 1.0000000 0.5882353

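Here which.max(x) gives the position of the largest value in x. It selects the same element as indexing with x == max(x) as long as the maximum is unique; with ties, which.max returns only the first position.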
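
A small sketch of the difference when the maximum is tied, using a made-up vector:

```r
# Made-up vector with a tied maximum at positions 2 and 4
x <- c(1, 5, 3, 5)

first_max <- which.max(x)       # position of the first maximum only
all_max <- which(x == max(x))   # positions of every maximum
```

With tied maxima the two approaches pick different elements, so the mode calculation above relies on each column having a single peak.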

Using lapply to create new variables based on multiple conditions and subsets

Here is a base R method that uses ave with lapply. Loop through the columns of the data set except 'cluster'; for each column, use ave to get the minimum within each 'cluster' group, subtract the column from that minimum, and assign the resulting list of vectors to new columns:

df[paste0(names(df)[-1], ".var")] <- lapply(df[-1], function(x)
  ave(x, df$cluster, FUN = min) - x)
df
# cluster x y x.var y.var
#1 A 3 4 -1 -3
#2 B 4 5 -3 -2
#3 B 1 3 0 0
#4 A 5 1 -3 0
#5 A 2 2 0 -1
#6 B 6 6 -5 -3
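
To make the role of ave clearer, here is a sketch on an input data frame reconstructed from the printed result above (the original input is not shown, so treat the construction as an assumption):

```r
# Input reconstructed from the printed output above (assumption)
df <- data.frame(
  cluster = c("A", "B", "B", "A", "A", "B"),
  x = c(3, 4, 1, 5, 2, 6),
  y = c(4, 5, 3, 1, 2, 6)
)

# ave() returns a vector as long as its input, with each element
# replaced by the minimum of its own cluster
group_min <- ave(df$x, df$cluster, FUN = min)

# Subtracting the column gives the values stored in x.var
x_var <- group_min - df$x
```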

Applying a function and assigning multiple variables in a single call in R

Use [ extraction on the left-hand side data.frame rather than $ extraction, so that several columns can be assigned at once:

df[, c('NewX2', 'NewY2')] <- mapply(find.key,
                                    list(df$x, df$y),
                                    list(x2, y2),
                                    SIMPLIFY = FALSE)
# df
# x y NewX2 NewY2
# 1 a e Alpha Epi
# 2 b f Beta OtherY
# 3 c g Other OtherY
# 4 d h Other OtherY

Or, if you don't like writing mapply you can use Vectorize, which will create an mapply-based function for you to obtain the same result:

find.keys <- Vectorize(find.key, c("x","li"), SIMPLIFY=FALSE)
df[,c('NewX2','NewY2')] <- find.keys(list(df$x, df$y), list(x2, y2))
df
# x y NewX2 NewY2
# 1 a e Alpha Epi
# 2 b f Beta OtherY
# 3 c g Other OtherY
# 4 d h Other OtherY
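
Since find.key, x2 and y2 are not shown here, the following sketch uses a hypothetical stand-in function, tag, purely to illustrate the mechanics: mapply with SIMPLIFY = FALSE returns a list of vectors, and [ assignment stores each list element in its own new column.

```r
df <- data.frame(x = c("a", "b"), y = c("e", "f"), stringsAsFactors = FALSE)

# Hypothetical stand-in for find.key: upper-case a vector and append a suffix
tag <- function(v, suffix) paste0(toupper(v), suffix)

# One call per column: tag(df$x, "_x") and tag(df$y, "_y");
# the resulting list of two vectors fills the two new columns
df[, c("NewX", "NewY")] <- mapply(tag,
                                  list(df$x, df$y),
                                  list("_x", "_y"),
                                  SIMPLIFY = FALSE)
df
```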

Use lapply to create new variable over multiple data frames

According to the OP, there are 100 data.frames with identical column names. The OP wants to create a new column in all of the data.frames using exactly the same formula.

This points to a fundamental flaw in the design of the data structure. I doubt any database admin would create 100 identical tables that differ only in their contents. Instead, they would create one table with an additional column identifying the origin of each row. All subsequent operations can then be applied once to that single table instead of being repeated for each of the many.
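
The idea can be sketched in base R alone, without data.table. The two small data frames and the total column below are made up for illustration:

```r
# Two hypothetical data sets with identical columns
tables <- list(
  rep1 = data.frame(w1 = c(0.1, 0.2), w2 = c(1, 2)),
  rep2 = data.frame(w1 = c(0.3, 0.4), w2 = c(3, 4))
)

# Tag each table with its name, then stack all rows into one data.frame
tagged <- Map(function(d, nm) cbind(origin = nm, d), tables, names(tables))
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL

# A single operation now covers every data set
combined$total <- combined$w1 + combined$w2
combined
```

This is a base R analogue only; for many or large tables the data.table approach is likely to be more convenient.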

In R, the data.table package has the convenient rbindlist() function which can be used for this purpose:

library(data.table)   # CRAN version 1.10.4 used
# get list of data.frames from the given names and
# combine the rows of all data sets into one large data.table
DT <- rbindlist(mget(temp), idcol = "origin")
# now create new column for all rows across all data sets
DT[, ps_true := (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1]
DT
                origin ARAND   w1   w2   w3   w4   w5   w6   w7   w8   w9  w10   ps_true
1: sim_rep1.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep1.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep1.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep1.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep1.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
199996: sim_rep100.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
199997: sim_rep100.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
199998: sim_rep100.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
199999: sim_rep100.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
200000: sim_rep100.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867
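
The expression (1 + exp(-z))^-1 is the logistic function, which base R provides as plogis(). As a rough check, this sketch recomputes ps_true for the first printed row by hand, with the w1 to w7 values copied from the output above:

```r
# w1..w7 of the first printed row
w <- c(w1 = -0.5, w2 = 0.2, w3 = -0.7, w4 = 0.5, w5 = 2.4, w6 = -0.2, w7 = -0.9)

# Coefficients from the formula, in the same order
coefs <- c(0.8, -0.25, 0.6, -0.4, -0.8, -0.5, 0.7)

z <- sum(coefs * w)              # linear predictor
ps_manual <- (1 + exp(-z))^-1    # formula as written above
ps_plogis <- plogis(z)           # same value via the built-in logistic function
```

Both values should agree with the printed ps_true of the first row up to rounding.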

DT now consists of 200 K rows. Performance is nothing to worry about, as data.table was built to handle large (and much larger) data efficiently.


The origin of each row can be identified in case the data of the individual data sets need to be treated separately. E.g.,

DT[origin == "sim_rep47.dat"]
             origin ARAND   w1   w2   w3   w4   w5   w6   w7   w8   w9  w10   ps_true
1: sim_rep47.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep47.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep47.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep47.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep47.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
1996: sim_rep47.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
1997: sim_rep47.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
1998: sim_rep47.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
1999: sim_rep47.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
2000: sim_rep47.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867

extracts all rows belonging to data set sim_rep47.dat.

Data

For test and demonstration, I've created 100 sample data.frames using the code below:

# create vector of file names
temp <- paste0("sim_rep", 1:100, ".dat")
# create one sample data.frame
nr <- 2000L
nc <- 11L
set.seed(123L)
foo <- as.data.frame(matrix(round(rnorm(nr * nc), 1), nrow = nr))
names(foo) <- c("ARAND", paste0("w", 1:10))
str(foo)
# create 100 individually named data.frames by "copying" foo
for (t in temp) assign(t, foo)
# print warning message on using assign
fortunes::fortune(236)
# verify objects have been created
ls()


Addendum: Reading all files at once

The OP has named the single data.frames sim_rep1.dat, sim_rep2.dat, etc. which resemble typical file names. Just in case the OP indeed has 100 files on disk I would like to suggest a way to read all files at once. Let's suppose all files are stored in one directory.

# path to data directory
data_dir <- file.path("path", "to", "data", "directory")
# create vector of file paths
files <- dir(data_dir, pattern = "sim_rep\\d+\\.dat", full.names = TRUE)
# read all files and create one large data.table
# NB: it might be necessary to add parameters to fread()
# or to use another file reader depending on the file type
DT <- rbindlist(lapply(files, fread), idcol = "origin")
# rename origin to contain the file names without path
DT[, origin := factor(origin, labels = basename(files))]
DT
               origin ARAND   w1   w2   w3   w4   w5   w6   w7   w8   w9  w10   ps_true
1: sim_rep1.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep1.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep1.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep1.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep1.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
199996: sim_rep99.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
199997: sim_rep99.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
199998: sim_rep99.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
199999: sim_rep99.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
200000: sim_rep99.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867

All data sets are now stored in one large data.table DT consisting of 200 k rows. However, the order of data sets is different as files is sorted alphabetically, i.e.,

head(files)
[1] "./data/sim_rep1.dat"   "./data/sim_rep10.dat"  "./data/sim_rep100.dat"
[4] "./data/sim_rep11.dat" "./data/sim_rep12.dat" "./data/sim_rep13.dat"
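
If the numeric order of the data sets matters, one option is to reorder files by the number embedded in each name before reading them. A minimal sketch, assuming the file names follow the sim_rep<number>.dat pattern used above:

```r
# Example paths in the alphabetical order returned by dir()
files <- c("./data/sim_rep1.dat", "./data/sim_rep10.dat",
           "./data/sim_rep100.dat", "./data/sim_rep2.dat")

# Extract the number between "sim_rep" and ".dat" and sort by it
rep_no <- as.integer(sub("^sim_rep([0-9]+)\\.dat$", "\\1", basename(files)))
files_sorted <- files[order(rep_no)]
files_sorted
```

Passing files_sorted instead of files to lapply would then keep the data sets in numeric order.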

use function on multiple columns (variables) in r

Common parameters to the function need to be passed to ... within lapply. Like this:

library(car)  # provides leveneTest()
lapply(subset(iris, select = -Species), leveneTest, group = iris$Species)

help("lapply") explains that ... is for "optional arguments to FUN" (meaning optional for lapply not for FUN) and provides lapply(x, quantile, probs = 1:3/4) as an example.
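
As a small sketch of what passing an argument through ... amounts to, the two calls below should give the same result on the numeric columns of the built-in iris data:

```r
num_cols <- subset(iris, select = -Species)

# Extra argument forwarded to quantile() through lapply's ...
via_dots <- lapply(num_cols, quantile, probs = 1:3/4)

# Equivalent spelled out with an anonymous function
via_anon <- lapply(num_cols, function(col) quantile(col, probs = 1:3/4))

identical(via_dots, via_anon)
```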


