R Memory Management Advice (Caret, Model Matrices, Data Frames)

Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.

Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can't fit one directly, you won't be able to use caret to fit many in one go.

At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn't balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you've determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
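As a rough illustration of fitting one forest by hand with the extras switched off, here is a minimal sketch. It assumes the randomForest package is installed and uses the built-in iris data as a stand-in for your own; the ntree and mtry values are arbitrary choices, not recommendations.

```r
library(randomForest)

set.seed(42)
x <- iris[, 1:4]
y <- iris$Species

# Lean fit: do not keep the trees, skip importance and proximity measures
fit_lean <- randomForest(x = x, y = y, ntree = 200, mtry = 2,
                         keep.forest = FALSE, importance = FALSE,
                         proximity = FALSE)

# Full fit: keeps the forest and computes importance, for comparison
fit_full <- randomForest(x = x, y = y, ntree = 200, mtry = 2,
                         keep.forest = TRUE, importance = TRUE)

# Compare the memory footprint of the two fitted objects
print(object.size(fit_lean), units = "Kb")
print(object.size(fit_full), units = "Kb")

# The out-of-bag error is still available without the stored forest
fit_lean$err.rate[200, "OOB"]
```

A fit without the stored forest cannot be used with predict() on new data, so this is only suitable for comparing settings such as mtry by their out-of-bag error.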

How to reliably use a matrix (multivariable) response in R using a formula?

Generally speaking, formulas in R work well with data frames. rpart works on matrices, and while data frames can hold matrices, they tend to get converted to separate columns. To avoid this, wrap the matrix in I():

# Same as your code to start...then this:

predict(mymodel, newdata = data.frame(x = I(newx)))
#>    1    2    3    4    5 
#> 0.04 0.04 0.04 0.04 0.04
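To see what I() changes, here is a small base R sketch. The matrix m and the data frame names are made up for illustration and are not taken from the question.

```r
m <- matrix(1:6, nrow = 3)

# Without I(): the matrix is split into one data frame column per matrix column
df_split <- data.frame(x = m)
names(df_split)

# With I(): the matrix is stored as a single column named x
df_kept <- data.frame(x = I(m))
names(df_kept)
dim(df_kept$x)
```

In the second case a formula that refers to x sees the whole matrix, which is what the predict() call above relies on.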

In the second part of your question, you are creating a formula in the mywrapper function, so that's where it will look for variables if they aren't contained in the newdata dataframe. "Environments" in R are similar to "stack frames" in other languages; the main difference is that environments have a single parent and searches proceed there if the object isn't found in the original.

Generally speaking, the parent is not the frame of the caller; it is the environment where the function was defined, or something specially listed as the parent.

So what happens if you run predict on the returned value from mywrapper is that it looks at the formula to find what variables it needs. Only the variables on the right hand side are needed for predictions, so that's just x. If you supply x in your newdata argument to predict, everything will be fine and proceed as before, but if you don't, things are different.

Since x was not found in the newdata, it goes to the environment of the formula. That's the evaluation frame of mywrapper, and it will see x there, since it was an argument to that function.

If it was looking for z instead, it wouldn't find it there. The next place to look is the parent environment, which is the one in effect when mywrapper was created, i.e. the global environment. If there's no z there, it would search through the chain of environments listed by search(), which are typically package exports. The search() list is chained together so that each entry is the parent of the one before.
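The lookup described above can be checked directly. This is a sketch only: the body of mywrapper here is an assumption (a plain lm() call standing in for the wrapper in the question), and the data are random.

```r
# Stand-in for the wrapper in the question; its body is assumed
mywrapper <- function(x, y) {
  # The formula is created here, so its environment is this call's frame
  lm(y ~ x)
}

set.seed(1)
fit <- mywrapper(x = 1:10, y = rnorm(10))

# The environment attached to the formula
f_env <- environment(formula(fit))

# x is found in the evaluation frame of mywrapper itself
exists("x", envir = f_env, inherits = FALSE)

# The parent of that frame is where mywrapper was defined, not its caller
identical(parent.env(f_env), environment(mywrapper))

# newdata has no column named x, so predict() falls back to x in f_env
p <- predict(fit, newdata = data.frame(unrelated = 1:10))
all.equal(unname(p), unname(fitted(fit)))
```

The last call returns the fitted values because the x it finds is the original training x, which is rarely what you want; supplying x in newdata avoids the fallback.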

I hope this isn't too much information....

R memory efficient way to store many data frames?

Your example and your mention of the apply family of functions suggest that the data frames share the same structure, i.e., they all have the same columns.

If this is the case, and if all the data frames together still fit in available RAM, one solution is to pack all the data into one large data.table with an extra id column. This can be done with rbindlist():

library(data.table)
# both columns are given the same length, so nothing is silently recycled
x <- data.table(A = rnorm(200), B = rnorm(200))
y <- data.table(A = rnorm(300), B = rnorm(300))
z <- data.table(A = rnorm(600), B = rnorm(600))
dt <- rbindlist(list(x, y, z), idcol = TRUE)
dt
      .id           A           B
   1:   1 -0.10981198 -0.55483251
   2:   1 -0.09501871 -0.39602767
   3:   1  2.07894635  0.09838722
   4:   1 -2.16227936  0.04620932
   5:   1 -0.85767886 -0.02500463
  ---                            
1096:   3  1.65858606 -1.10010088
1097:   3 -0.52939876 -0.09720765
1098:   3  0.59847826  0.78347801
1099:   3  0.02024844 -0.37545346
1100:   3 -1.44481850 -0.02598364

The rows originating from the individual source data frames can be distinguished by the .id variable. All the memory-efficient data.table operations can be applied to all rows, to selected rows (dt[.id == 1, some_function(A)]), or group-wise (dt[, another_function(B), by = .id]).
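A smaller, self-contained sketch of the same pattern follows, assuming the data.table package is installed. The list names and the id column name src are hypothetical; mean() stands in for whatever function you need.

```r
library(data.table)

set.seed(1)
# A named list of data frames with identical columns
frames <- list(
  first  = data.frame(A = rnorm(4), B = rnorm(4)),
  second = data.frame(A = rnorm(3), B = rnorm(3))
)

# With a named list, idcol stores the list names instead of 1, 2, ...
dt <- rbindlist(frames, idcol = "src")

# Operate on the rows of one source only
mean_first <- dt[src == "first", mean(A)]

# Operate group-wise, one result row per source
by_src <- dt[, .(mean_B = mean(B), n = .N), by = src]
by_src
```

Naming the list elements makes the id column self-describing, which helps once the original objects have been removed.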

Although the data.table operations are memory efficient, RAM might still be a limiting factor. Use the tables() function to monitor memory consumption of all created data.table objects:

tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B    
[2,] x      200    2  1 A,B        
[3,] y      300    2  1 A,B        
[4,] z      600    2  1 A,B        
Total: 4MB

Then remove objects from memory that are no longer needed:

rm(x, y, z)
tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B    
Total: 1MB

Memory issue with lm

Memory usage goes up because each call to biglm() makes a copy of the data in memory. Since sapply() is basically a for loop, using doMC (or doParallel) lets you run the loop with a single copy of the data in memory. Here is one possibility:

EDIT: As @moho wu pointed out, parallel fitting helps, but not quite enough. Factors are more memory-efficient than plain character vectors, so that helps too. ff can help even more, as it keeps bigger data sets on disk rather than in memory. I updated the code below to make it a complete toy example using ff and doMC.

library(tidyverse)
library(pryr)

# toy data
df <- sample_n(mtcars, size = 1e7, replace = TRUE)
df$A <- as.factor(letters[1:5])  # the 5 levels are recycled to fill all rows

# get objects / save on disk
all_vars <- names(df) 
y <- "mpg"  
vars.model <- "cyl"
vars.remaining <- all_vars[-c(1:2)]
save(y, vars.model, vars.remaining, file = "all_vars.RData") 
# readr >= 1.4 names this argument `file`; older versions used `path`
readr::write_delim(df, file = "df.csv", delim = ";")

# close R session and start fresh

library(ff)
library(biglm)
library(doMC)
library(tidyverse)

# read flat file as "ff" ; also read variables
ff_df <- read.table.ffdf(file = "df.csv", sep = ";", header = TRUE)
load("all_vars.RData") 

# prepare the "cluster"
nc <- 2 # number of cores to use
registerDoMC(cores = nc)

# build one formula per remaining variable, named after that variable
fo <- paste0(y, " ~ ", vars.model, " + ", vars.remaining)
fo <- map(fo, as.formula) %>%
  set_names(vars.remaining)

# fit models in parallel; each iteration returns the R-squared of one model
all_rsq <- foreach(fo = fo) %dopar% {
  fit <- biglm(formula = fo, data = ff_df)
  summary(fit)$rsq
}
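The remark about factors versus characters can be checked with object.size(). This is a base R sketch on a made-up vector; the exact sizes depend on the platform, and the comparison assumes a typical 64-bit build of R.

```r
set.seed(1)
# One million strings drawn from only five distinct values
chr <- sample(letters[1:5], 1e6, replace = TRUE)

# The same values as a factor: integer codes plus five level labels
fct <- factor(chr)

print(object.size(chr), units = "Mb")
print(object.size(fct), units = "Mb")
```

The saving comes from storing a small integer code per element instead of a pointer to a string, so it matters most for long columns with few distinct values.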

Efficient way to create market basket matrix in R

You don't really need reshape2 for this; table is what you are looking for.

m1 <- as.matrix(as.data.frame.matrix(table(input)))

all.equal(m, m1)
TRUE
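Since the original input is not shown here, the following sketch uses a made-up two-column data frame (order_id and item are hypothetical names) to show what the table() approach produces.

```r
# Hypothetical transactions: one row per (order, item) pair
input <- data.frame(
  order_id = c(1, 1, 2, 3, 3, 3),
  item     = c("bread", "milk", "milk", "bread", "eggs", "milk")
)

# Cross-tabulate orders against items, then convert to a plain matrix
m1 <- as.matrix(as.data.frame.matrix(table(input)))
m1
```

Each row is an order and each column an item, with the cell holding how many times that item appears in that order.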

Automate Machine Learning process with R on multiple datasets

Here's one way (of several) to do this:

# Correlation
library(caret)
library(dplyr)

set.seed(99)

H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))   
C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))

# Combine input datasets into a list
inputs <- list(H, C, R, E)
# Empty list to hold results
outputs <- list()

# Loop over each dataset, one at a time
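for(df in inputs){
  data.cor <- cor(df)
  high.cor <- findCorrelation(data.cor, cutoff=0.40)
  # Subset the dataset based on `high.cor`; if nothing is flagged, keep every
  # column (negative indexing with an empty vector would drop them all)
  kept <- if (length(high.cor) > 0) df[, -high.cor, drop = FALSE] else df
  # Add the subsetted dataset to the output list of datasets
  outputs <- append(outputs, list(kept))
}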

# This is the first dataset processed by the loop
outputs[[1]]
# Second...
outputs[[2]]
# Third...
outputs[[3]]
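One thing to watch with negative indexing in a loop like this: if no columns are flagged for removal, the index vector is empty, and negative indexing with an empty vector selects no columns at all. A base R sketch, where drop_cols is a hypothetical helper and flagged stands in for the result of findCorrelation():

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Stand-in for a result where nothing was flagged
flagged <- integer(0)

# Naive negative indexing returns a data frame with zero columns
ncol(df[, -flagged])

# Hypothetical helper: only drop columns when there is something to drop
drop_cols <- function(df, idx) {
  if (length(idx) > 0) df[, -idx, drop = FALSE] else df
}

ncol(drop_cols(df, flagged))
ncol(drop_cols(df, c(1L, 3L)))
```

The drop = FALSE argument also keeps the result a data frame when only one column survives.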

edit: integrating your lasso routine

library(glmnet)
library(caret)

set.seed(99)

## Define data (independent variables)
H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))   
C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
inputs <- list(H, C, R, E)

## Define targets (dependent variables)
Y_H <- data.frame(label_1 = replicate(1,sample(20:35, 10, rep = TRUE)))
Y_C <- data.frame(label_2 = replicate(1,sample(15:65, 10, rep = TRUE)))
Y_R <- data.frame(label_3 = replicate(1,sample(25:45, 10, rep = TRUE)))
Y_E <- data.frame(label_4 = replicate(1,sample(21:80, 10, rep = TRUE)))
targets <- list(Y_H, Y_C, Y_R, Y_E)

## Remove correlated independent variables
outputs <- list()

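for(df in inputs){
  data.cor <- cor(df)
  high.cor <- findCorrelation(data.cor, cutoff=0.40)
  # keep every column when nothing is flagged (an empty negative index drops all)
  kept <- if (length(high.cor) > 0) df[, -high.cor, drop = FALSE] else df
  outputs <- append(outputs, list(kept))
}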

## Do lasso regression
lasso_cv <- list()
lasso_model <- list()

# Pair each predictor set with its own target (H with Y_H, C with Y_C, ...);
# this pairing is assumed from the variable names
for(i in seq_along(outputs)){
  x <- as.matrix(outputs[[i]])
  y <- as.matrix(targets[[i]])

  # Cross-validate to choose lambda
  lasso_cv[[i]] <- cv.glmnet(
    x, y, standardize = TRUE, type.measure = "mse", alpha = 1, nfolds = 3)

  # Refit at the lambda with the lowest cross-validated error
  lasso_model[[i]] <- glmnet(
    x, y, lambda = lasso_cv[[i]]$lambda.min, standardize = TRUE, alpha = 1)
}
