Loading Data Bigger Than the Memory Size in H2O

Loading data bigger than the memory size in H2O

Swap-to-disk was disabled by default a while ago, because performance was so bad. The bleeding-edge build (not the latest stable) has a flag to enable it: "--cleaner" (for "memory cleaner").

Note that your cluster has an EXTREMELY tiny memory:

H2O cluster total memory: 0.06 GB

That's 60 MB! Barely enough to start a JVM with, much less run H2O. I would be surprised if H2O could come up properly there at all, never mind the swap-to-disk. Swapping is limited to swapping the data alone. If you're trying to do a swap test, up your JVM to 1 or 2 GB of RAM, and then load datasets that sum to more than that.

Cliff
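
As a rough sketch of the sizing advice above (not of the swap-to-disk flag itself, which has to be enabled when the H2O JVM is launched): assuming the h2o R package and placeholder file names, giving the JVM a workable heap from R looks roughly like this:

library(h2o)

# start a single-node cluster with a 1-2 GB heap instead of 60 MB
h2o.init(max_mem_size = "2g")

# confirm how much memory the cluster actually reports
h2o.clusterInfo()

# placeholder import; for a swap test you would load data summing to more than the heap
train <- h2o.importFile("bigdata.csv")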

H2O: Iterating over data bigger than memory without loading all data into memory

The short answer is that this isn't what H2O was designed to do, so unfortunately the answer today is no.


The longer answer... (Assuming that the intent of the question is regarding model training in H2O-3.x...)

I can think of at least two ways one might want to use H2O in this way: one-pass streaming, and swapping.

Think of one-pass streaming as having a continuous data stream feeding in, and the data constantly being acted on and then thrown away (or passed along).

Think of swapping as the computer science equivalent of swapping, where there is fast storage (memory) and slow storage (disk) and the algorithms are continuously sweeping over the data and faulting (swapping) data from disk to memory.

Swapping just gets worse and worse from a performance perspective the bigger the data gets; H2O isn't ever tested this way. You may be able to figure out how to enable an unsupported swapping mode from clues/hints in the other referenced Stack Overflow question (or the source code), but nobody runs that way, and you are on your own. H2O was architected to be fast for machine learning by holding data in memory. Machine learning algorithms sweep over the data iteratively, again and again, and if every data touch hits the disk, that's just not the experience the in-memory H2O-3 platform was designed to provide.

The streaming use case, especially for some algorithms like Deep Learning and DRF, definitely makes more sense for H2O. H2O algorithms support checkpoints, and you can imagine a scenario where you read some data, train a model, then purge that data and read in new data, and continue training from the checkpoint. In the deep learning case, you'd be updating the neural network weights with the new data. In the DRF case, you'd be adding new trees based on the new data.
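
Here is a rough sketch of that checkpoint pattern for Deep Learning (hedged: the chunk file names, the column selection, and the response name "label" are placeholders, and each chunk still has to fit in memory on its own):

library(h2o)
h2o.init(max_mem_size = "2g")

# train on the first chunk of the stream
chunk1 <- h2o.importFile("chunk1.csv")
m1 <- h2o.deeplearning(x = 1:10, y = "label",
                       training_frame = chunk1,
                       epochs = 1)

# drop the first chunk and bring in the next one
h2o.rm(chunk1)
chunk2 <- h2o.importFile("chunk2.csv")

# continue training the same network from the checkpoint
# (epochs must be higher than in the checkpointed model)
m2 <- h2o.deeplearning(x = 1:10, y = "label",
                       training_frame = chunk2,
                       checkpoint = m1@model_id,
                       epochs = 2)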

How to deal with a big dataset with H2O

If this is in any way commercial, buy more RAM, or pay a few dollars to rent a few hours on a cloud server.

This is because the extra time and effort to do machine learning on a machine that is too small is just not worth it.

If it is a learning project, with no budget at all: cut the data set into 8 equal-sized parts (*), and just use the first part to make and tune your models. (If the data is not randomly ordered, cut it into 32 equal parts and then concatenate parts 1, 9, 17 and 25, or something like that.)

If you really, really, really must build a model using the whole data set, then still do the above. But then save the model and move on to the 2nd of your 8 data sets. You will already have tuned your hyperparameters by this point, so you are just generating a model, and it will be quick. Repeat for parts 3 to 8. Now you have 8 models, and can use them in an ensemble.

*: I chose 8, which gives you a 0.5 GB data set, a quarter of the available memory. For the early experiments I'd actually recommend going even smaller, e.g. 50 MB, as it will make the iterations so much quicker.
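
A minimal sketch of that split-and-ensemble recipe (hedged: it assumes the parts already exist as part1.csv ... part8.csv, a numeric response column named "y", a regression problem, and hyperparameters you already tuned on part 1; GBM is just an example algorithm):

library(h2o)
h2o.init(max_mem_size = "2g")

part_files <- sprintf("part%d.csv", 1:8)
models <- list()

for (i in seq_along(part_files)) {
  part <- h2o.importFile(part_files[i])
  models[[i]] <- h2o.gbm(y = "y", training_frame = part,
                         ntrees = 100, max_depth = 5)  # your tuned hyperparameters
  h2o.rm(part)  # free the part before loading the next one
}

# simple ensemble: average the 8 models' predictions on new data
newdata <- h2o.importFile("test.csv")
preds <- lapply(models, function(m) h2o.predict(m, newdata)$predict)
ensemble_pred <- Reduce(`+`, preds) / length(preds)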

A couple more thoughts:

  • H2O compresses data in-memory, so if the 4 GB was the uncompressed data size, you might get by with less memory. (However, remember that the recommendation is for memory that is 3-4x the size of your data.)
  • If you have some friends with similar small-memory computers, you could network them together; 4 to 8 computers might be enough to load your data. It might work well, or it might be horribly slow; it depends on the algorithm (and how fast your network is).

Read a large (1.5 GB) file in H2O R

train = h2o.importFile(path = normalizePath("C:\\Users\\All data\\traindt.rds"))

Are you trying to load an .rds file? That's an R binary format which is not readable by h2o.importFile(), so that won't work. You will need to store your training data in a cross-platform storage format (e.g. CSV, SVMLight, etc.) if you want to read it into H2O directly. If you don't have a copy in another format, then just save one from R:

# read the .rds file back into a `train` data.frame (readRDS() is the right call for .rds files)
train <- readRDS("C:\\Users\\All data\\traindt.rds")

# save as CSV (row.names = FALSE avoids writing an extra index column)
write.csv(train, "C:\\Users\\All data\\traindt.csv", row.names = FALSE)

# import from CSV into H2O cluster directly
train = h2o.importFile(path = normalizePath("C:\\Users\\All data\\traindt.csv"))

Another option is to load it into R from the .rds file and use the as.h2o() function:

# read the .rds file back into a `train` data.frame
train <- readRDS("C:\\Users\\All data\\traindt.rds")

# send to H2O cluster
hf <- as.h2o(train)

How to allow H2O to access all available memory?

The max_mem_size argument in the h2o R package is functional, so you can use it to start an H2O cluster of whatever size you want -- you don't need to start it from the command line using -Xmx.

What seems to be happening in your case is that you are connecting to an existing H2O cluster at localhost:54321 that was limited to "10G" (in reality, 9.78 GB). So when you run h2o.init() from R, it just connects to the existing cluster (with its fixed memory) rather than starting a new H2O cluster with the memory you specified in max_mem_size, and so the memory request gets ignored.

To fix this, do one of the following:

  • Kill the existing H2O cluster at localhost:54321 and restart it from R with the desired memory requirement, or
  • start a cluster from R at a different IP/port than the one that's already running (see the sketch below).
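
For example, both options can be scripted from R roughly like this (the 8 GB heap and port 54322 are placeholders):

library(h2o)

# Option 1: connect to the existing cluster, shut it down, restart with more memory
h2o.init()                      # attaches to the cluster at localhost:54321
h2o.shutdown(prompt = FALSE)
Sys.sleep(5)                    # give the JVM a moment to exit
h2o.init(max_mem_size = "8g")

# Option 2: leave the old cluster running and start a new one on another port
h2o.init(port = 54322, max_mem_size = "8g")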

H2O: Cannot read LARGE model from disk via `h2o.loadModel`

The difference in size of the models (169 MB vs 37 GB) is surprising. Can you please make sure that H2O recognizes all your numeric columns as numeric, and not as categorical columns with very high cardinality?

Do you use automatic detection of column types, or do you specify them manually?
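
If the column types do turn out to be the problem, one way to pin them down is the col.types argument of h2o.importFile() (a sketch; the path and the type vector are placeholders for your own columns):

library(h2o)
h2o.init()

# force the parser: here columns 1-3 are numeric and column 4 is categorical
df <- h2o.importFile("train.csv",
                     col.types = c("numeric", "numeric", "numeric", "enum"))

h2o.describe(df)  # verify the types H2O actually assigned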

H2O not fully accessing memory on a cluster

I have solved the issue by putting the following before initializing H2O:

options(java.parameters = "-Xmx500000m")

