R memory management advice (caret, model matrices, data frames)
Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.
Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can't fit one directly, you won't be able to use caret to fit many in one go.
At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn't balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you've determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
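As a rough illustration of fitting one forest by hand with the extras switched off, here is a minimal sketch. It assumes the randomForest package is installed and uses the built-in mtcars data as a stand-in for the real data; the mtry and ntree values are arbitrary choices for the example.

```r
library(randomForest)

set.seed(1)
x <- mtcars[, -1]   # predictors (stand-in data, not the asker's)
y <- mtcars$mpg     # response

# Lean fit: do not keep the trees, do not compute variable importance
fit_lean <- randomForest(x, y, mtry = 3, ntree = 200,
                         keep.forest = FALSE, importance = FALSE)

# Full fit with the extras, for comparison
fit_full <- randomForest(x, y, mtry = 3, ntree = 200,
                         keep.forest = TRUE, importance = TRUE)

is.null(fit_lean$forest)                        # the forest was not stored
object.size(fit_lean) < object.size(fit_full)   # the lean object is smaller
```

A fit without the stored forest cannot be passed to predict(), so the lean version is mainly useful while comparing out-of-bag error across candidate mtry values.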
How to reliably use a matrix (multivariable) response in R using a formula?
Generally speaking, formulas in R work well with dataframes. rpart works on matrices, and while dataframes can hold matrices, they tend to get converted to separate columns. To avoid this, wrap the matrix in I():
# Same as your code to start...then this:
predict(mymodel, newdata = data.frame(x = I(newx)))
#> 1 2 3 4 5
#> 0.04 0.04 0.04 0.04 0.04
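To see what I() changes, here is a small base R sketch. The matrix newx below is made-up data for illustration, not the one from the question.

```r
set.seed(1)
newx <- matrix(rnorm(10), nrow = 5)   # hypothetical 5 x 2 predictor matrix

split_df <- data.frame(x = newx)      # matrix is split into separate columns
kept_df  <- data.frame(x = I(newx))   # matrix is kept as one column named x

names(split_df)
names(kept_df)
dim(kept_df$x)
```

With I(), the data frame has a single column x that is still a matrix, which is what a formula referring to x expects.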
In the second part of your question, you are creating a formula in the mywrapper function, so that's where it will look for variables if they aren't contained in the newdata dataframe. "Environments" in R are similar to "stack frames" in other languages; the main difference is that environments have a single parent and searches proceed there if the object isn't found in the original.
Generally speaking the parent is not the frame of the caller, it is the frame where the environment was created, or something specially listed as the parent.
So what happens if you run predict on the returned value from mywrapper is that it looks at the formula to find what variables it needs. Only the variables on the right hand side are needed for predictions, so that's just x. If you supply x in your newdata argument to predict, everything will be fine and proceed as before, but if you don't, things are different.
Since x was not found in the newdata, it goes to the environment of the formula. That's the evaluation frame of mywrapper, and it will see x there, since it was an argument to that function.
If it was looking for z instead, it wouldn't find it there. The next place to look is the parent environment, which is the one in effect when mywrapper was created, i.e. the global environment. If there's no z there, it would search through the chain of environments listed by search(), which are typically package exports. The search() list is chained together so that each entry is the parent of the one before.
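The lookup chain described above can be checked directly in base R. This is only a sketch: the mywrapper here is a hypothetical stand-in that just returns a formula, and caller() and z_local are names invented for the example.

```r
# Hypothetical stand-in for the wrapper in the question
mywrapper <- function(x) {
  y ~ x   # the formula remembers the frame it was created in
}

f <- mywrapper(x = 1:3)
env <- environment(f)

# x is found directly in the formula's environment (mywrapper's frame)
exists("x", envir = env, inherits = FALSE)

# The parent of that frame is where mywrapper was defined
identical(parent.env(env), environment(mywrapper))

# A variable in the caller's frame is NOT on the search chain
caller <- function() {
  z_local <- 99
  mywrapper(1:3)
}
g <- caller()
exists("z_local", envir = environment(g))

# The global environment's parent is the second entry of search()
identical(parent.env(globalenv()), as.environment(2))
```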
I hope this isn't too much information....
R memory efficient way to store many data frames?
Your example and your mention of the apply family of functions suggest that the structure of the data frames is identical, i.e., they all have the same columns.
If this is the case, and if the total volume of data (all data frames together) still fits in available RAM, then a solution could be to pack all data into one large data.table with an extra id column. This can be achieved with the function rbindlist:
library(data.table)
# Columns within each table have equal lengths, so nothing relies on recycling
x <- data.table(A = rnorm(200), B = rnorm(200))
y <- data.table(A = rnorm(300), B = rnorm(300))
z <- data.table(A = rnorm(600), B = rnorm(600))
dt <- rbindlist(list(x, y, z), idcol = TRUE)
dt
      .id           A           B
   1:   1 -0.10981198 -0.55483251
   2:   1 -0.09501871 -0.39602767
   3:   1  2.07894635  0.09838722
   4:   1 -2.16227936  0.04620932
   5:   1 -0.85767886 -0.02500463
  ---
1096:   3  1.65858606 -1.10010088
1097:   3 -0.52939876 -0.09720765
1098:   3  0.59847826  0.78347801
1099:   3  0.02024844 -0.37545346
1100:   3 -1.44481850 -0.02598364
The rows originating from the individual source data frames can be distinguished by the .id variable. All the memory-efficient data.table operations can be applied to all rows, to selected rows (dt[.id == 1, some_function(A)]), or group-wise (dt[, another_function(B), by = .id]).
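For a concrete version of those two patterns, here is a small sketch with made-up data, where mean() and .N stand in for the placeholder functions. It assumes the data.table package is installed.

```r
library(data.table)

set.seed(1)
# Three small tables with identical columns (made-up data)
parts <- list(
  data.table(A = rnorm(4), B = rnorm(4)),
  data.table(A = rnorm(2), B = rnorm(2)),
  data.table(A = rnorm(3), B = rnorm(3))
)
dt <- rbindlist(parts, idcol = TRUE)

# Selected rows: mean of A for the first source table only
mean_first <- dt[.id == 1, mean(A)]

# Group-wise: row count and mean of B per source table
by_source <- dt[, .(n = .N, mean_B = mean(B)), by = .id]
by_source
```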
Although the data.table operations are memory efficient, RAM might still be a limiting factor. Use the tables() function to monitor memory consumption of all created data.table objects:
tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B
[2,] x      200    2  1 A,B
[3,] y      300    2  1 A,B
[4,] z      600    2  1 A,B
Total: 4MB
Then remove objects from memory that are no longer needed:
rm(x, y, z)
tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B
Total: 1MB
Memory issue with lm
Memory usage goes up because each call to biglm() makes a copy of the data in memory. Since sapply() is basically a for loop, using doMC (or doParallel) lets you go through the loop with a single copy of the data in memory. Here is one possibility:
EDIT: As @moho wu pointed out, parallel fitting helps, but not quite enough. Factors are more efficient than plain characters, so that helps too. Then ff can help even more, as it keeps bigger data sets on disk rather than in memory. I updated the code below to make it a complete toy example using ff and doMC.
library(tidyverse)
library(pryr)
# toy data
df <- sample_n(mtcars, size = 1e7, replace = TRUE)
df$A <- as.factor(letters[1:5])
# get objects / save on disk
all_vars <- names(df)
y <- "mpg"
vars.model <- "cyl"
vars.remaining <- all_vars[-c(1:2)]
save(y, vars.model, vars.remaining, file = "all_vars.RData")
readr::write_delim(df, "df.csv", delim = ";")
# close R session and start fresh
library(ff)
library(biglm)
library(doMC)
library(tidyverse)
# read flat file as "ff" ; also read variables
ff_df <- read.table.ffdf(file = "df.csv", sep = ";", header = TRUE)
load("all_vars.RData")
# prepare the "cluster"
nc <- 2 # number of cores to use
registerDoMC(cores = nc)
# build one formula per remaining variable
fo <- paste0(y, "~", vars.model, "+", vars.remaining)
fo <- map(fo, as.formula) %>%
  set_names(vars.remaining)
# fit models in parallel; each iteration returns that model's R-squared
all_rsq <- foreach(fo = fo) %dopar% {
  fit <- biglm(formula = fo, data = ff_df)
  summary(fit)$rsq
}
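On the point about factors versus plain characters, the difference can be measured with object.size(). This is a sketch on made-up data; the exact sizes depend on the platform, and the comparison below assumes a typical 64-bit build of R.

```r
set.seed(1)
n <- 1e6
chr <- sample(letters[1:5], n, replace = TRUE)   # character column
fct <- factor(chr)                               # same values stored as a factor

object.size(chr)   # one pointer per element
object.size(fct)   # one integer code per element, plus the levels
```

The factor stores a small integer code per row and keeps the distinct strings once as levels, which is where the saving comes from when a column has few unique values.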
Efficient way to create market basket matrix in R
You don't really need reshape2 for this; table is what you are looking for.
m1 <- as.matrix(as.data.frame.matrix(table(input)))
all.equal(m, m1)
#> [1] TRUE
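Since input and m come from the question and are not shown here, the following self-contained sketch uses a made-up two-column data frame (customer, product) to show what the table() approach produces.

```r
# Hypothetical transactions: one row per (customer, product) purchase
input <- data.frame(
  customer = c("a", "a", "b", "c", "c", "c"),
  product  = c("milk", "bread", "milk", "eggs", "milk", "bread")
)

# Cross-tabulate, then convert the table to a plain matrix
m1 <- as.matrix(as.data.frame.matrix(table(input)))
m1
```

Rows are customers and columns are products, with counts in the cells. If a 0/1 incidence matrix is wanted instead of counts, comparing with `m1 > 0` is one option.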
Automate Machine Learning process with R on multiple datasets
Here's one way (of several) to do this:
# Correlation
library(caret)
library(dplyr)
set.seed(99)
H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
# Combine input datasets into a list
inputs <- list(H, C, R, E)
# Empty list to hold results
outputs <- list()
# Loop over each dataset, one at a time
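for(df in inputs){
  data.cor <- cor(df)
  high.cor <- findCorrelation(data.cor, cutoff=0.40)
  # Subset the dataset based on `high.cor`; setdiff() keeps every column
  # when `high.cor` is empty (df[,-high.cor] would drop them all)
  keep <- setdiff(seq_along(df), high.cor)
  # Add the subsetted dataset to the output list of datasets
  outputs <- append(outputs, list(df[, keep, drop = FALSE]))
}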
# This is the first dataset processed by the loop
outputs[[1]]
# Second...
outputs[[2]]
# Third...
outputs[[3]]
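One detail worth checking in loops like the one above: negative indexing with an empty vector selects no columns rather than all of them. Assuming findCorrelation() returns an empty vector when no pair exceeds the cutoff, the subset would silently lose every column. The base R sketch below shows the effect and one possible guard; the data frame and the helper drop_cols() are made up for illustration.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

none <- integer(0)     # what a "nothing to remove" result looks like
ncol(df[, -none])      # every column is gone

# Hypothetical helper: only drop columns when there is something to drop
drop_cols <- function(data, idx) {
  if (length(idx) == 0) data else data[, -idx, drop = FALSE]
}

ncol(drop_cols(df, none))        # all columns kept
ncol(drop_cols(df, c(1L, 3L)))   # columns a and c removed
```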
edit: integrating your lasso routine
library(glmnet)
library(caret)
set.seed(99)
## Define data (independent variables)
H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
inputs <- list(H, C, R, E)
## Define targets (dependent variables)
Y_H <- data.frame(label_1 = replicate(1,sample(20:35, 10, rep = TRUE)))
Y_C <- data.frame(label_2 = replicate(1,sample(15:65, 10, rep = TRUE)))
Y_R <- data.frame(label_3 = replicate(1,sample(25:45, 10, rep = TRUE)))
Y_E <- data.frame(label_4 = replicate(1,sample(21:80, 10, rep = TRUE)))
targets <- list(Y_H, Y_C, Y_R, Y_E)
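## Remove correlated independent variables
outputs <- list()
for(df in inputs){
  data.cor <- cor(df)
  high.cor <- findCorrelation(data.cor, cutoff=0.40)
  # setdiff() keeps every column when `high.cor` is empty
  keep <- setdiff(seq_along(df), high.cor)
  outputs <- append(outputs, list(df[, keep, drop = FALSE]))
}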
## Do lasso regression, pairing each dataset with its own target
## (H with Y_H, C with Y_C, ...); this pairing is assumed from the names
lasso_cv <- list()
lasso_model <- list()
for(i in seq_along(outputs)){
  x <- as.matrix(outputs[[i]])
  y <- as.matrix(targets[[i]])
  lasso_cv[[i]] <- cv.glmnet(
    x, y, standardize = TRUE, type.measure = "mse", alpha = 1, nfolds = 3)
  # cv.glmnet stores the cross-validated optimum in `lambda.min`
  lasso_model[[i]] <- glmnet(
    x, y, lambda = lasso_cv[[i]]$lambda.min, standardize = TRUE, alpha = 1)
}