Recommended Package for Very Large Dataset Processing and Machine Learning in R

Recommended package for very large dataset processing and machine learning in R

Have a look at the "Large memory and out-of-memory data" subsection of the high-performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics and bigtabulate), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading Adler, Oehlschlägel, and colleagues' excellent slide presentations on the ff website.

Also, consider storing data in a database and reading in smaller batches for analysis. There are any number of approaches to consider. To get started, consider looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.
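For instance, a rough sketch of chunked regression with biglm could look like the following; the file name, chunk size, and column names (y, x1, x2) are placeholders, not something from the original answer, and the header is assumed to be a plain unquoted comma-separated line.

library(biglm)

con <- file("bigdata.csv", open = "r")
# read the header line once to get the column names
header <- strsplit(readLines(con, n = 1), ",")[[1]]

# fit the model on the first chunk
chunk <- read.csv(con, header = FALSE, col.names = header, nrows = 100000)
fit <- biglm(y ~ x1 + x2, data = chunk)

# fold the remaining chunks into the same model, one batch at a time
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = 100000),
    error = function(e) NULL)   # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)
summary(fit)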

And do investigate the other packages on the high-performance computing task view and those mentioned in the other answers. The packages I mention above are simply the ones I happen to have more experience with.

Work in R with very large data set

If you are working with package ff and have your data in SQL, you can easily get it into ff using package ETLUtils; see the documentation for an example using ROracle.
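As a rough sketch of that route (modelled loosely on the ETLUtils documentation; connection details, table name, and chunk sizes are placeholders), pulling an Oracle table into an ffdf chunk by chunk looks something like this:

library(ETLUtils)
library(ROracle)

# query and connection details are placeholders
dat <- read.dbi.ffdf(
  query = "SELECT * FROM my_big_table",
  dbConnect.args = list(drv = dbDriver("Oracle"),
                        username = "user", password = "passwd", dbname = "mydb"),
  first.rows = 100000, next.rows = 500000, VERBOSE = TRUE)
class(dat)  # an ffdf, living mostly on disk rather than in RAM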

In my experience, ff is perfectly suited for the type of dataset you are working with (21 million rows and 15 columns) - in fact your setup is kind of small for ff, unless your columns contain a lot of character data that will be converted to factors (meaning all your factor levels should be able to fit in your RAM).
Packages ETLUtils, ff, and ffbase allow you to get your data into R using ff and do some basic statistics on it. Depending on what you will do with your data and on your hardware, you might have to consider sampling when you build models. I prefer having my data in R, building a model based on a sample, and scoring the full data using the tools in ff (like chunking) or from package ffbase.
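A minimal sketch of that sample-then-score workflow, assuming dat is an ffdf and using invented column names (churn, age, income):

library(ff)
library(ffbase)

# fit on a sample that comfortably fits in RAM
idx <- sort(sample(nrow(dat), 1e5))
fit <- glm(churn ~ age + income, data = dat[idx, ], family = binomial())

# score the full ffdf chunk by chunk, writing the predictions to an ff vector
scores <- ff(vmode = "double", length = nrow(dat))
for (i in chunk(dat)) {
  scores[i] <- predict(fit, newdata = dat[i, ], type = "response")
}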

The drawback is that you have to get used to the fact that your data are ffdf objects and that might take some time - especially if you are new to R.

clustering very large dataset in R

You can use kmeans, which is normally suitable for this amount of data, to compute a large number of centers (1000, 2000, ...) and then perform hierarchical clustering on the coordinates of these centers. That way the distance matrix becomes much smaller.

## Example
# Simulated data: two Gaussian clouds, 70,000 points in total
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering (CAH) directly on the raw data: does not necessarily work
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)

# CAH on the k-means centers: works quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(cah, choice = "tree")

How to handle huge data and build model in R

As the name suggests, machine learning needs a machine (a PC), and more than that, it needs a machine suited to the specific job. Still, there are some techniques for dealing with limited hardware:

1. Down-Sampling

Most of the time you don't need all of your data for machine learning; you can sample your data to get a much smaller set that can be used on your laptop.

Of course, you may need to use some tool (e.g. a database) to do the sampling work on your laptop.
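One possible sketch of such a sampling tool in R uses sqldf; the file name and sample size below are placeholders. read.csv.sql loads the csv into a temporary on-disk SQLite database rather than into R's memory, so only the random sample comes back into your session.

library(sqldf)

# 'file' is the name sqldf gives to the imported csv inside the SQL statement
train <- read.csv.sql("huge_file.csv",
                      sql = "select * from file order by random() limit 1000000")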

2. Data Points

Depending on the number of variables you have, each record may not be unique. You can "aggregate" your data by your key variables. Each unique combination of variables is called a data point, and the number of duplicates can be used as a weight for clustering methods.

But depending on the chosen clustering method and the purpose of the project, this aggregated data may not give you the best model.
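As a short illustration of the aggregation step with data.table (the column names are invented):

library(data.table)

dt <- fread("sample.csv")
# collapse duplicate rows into unique data points, keeping the count as a weight
points <- dt[, .(weight = .N), by = .(age, region, product)]
head(points)

The weight column can then be passed to any clustering method that accepts case weights, or the rows can be replicated by weight for methods that do not.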

3. Split into Parts

Assuming you have all your data in one csv file, you can read it in chunks using data.table::fread by specifying the number of rows that fit on your laptop.

https://stackoverflow.com/a/21801701/5645311

You can process each data chunk in R separately and build a model on each of them. Eventually, you will have lots of clustering results, as a kind of bagging method.
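A rough sketch of that chunked loop with data.table::fread; the file name, chunk size, and the per-chunk step (k-means on numeric columns) are illustrative assumptions:

library(data.table)

chunk_size <- 5e6
header <- names(fread("huge_file.csv", nrows = 0))   # column names only
skip <- 1                                            # skip the header line
results <- list()

repeat {
  chunk <- tryCatch(
    fread("huge_file.csv", skip = skip, nrows = chunk_size,
          header = FALSE, col.names = header),
    error = function(e) NULL)                        # stop at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  # per-chunk model: here k-means, assuming all columns are numeric
  results[[length(results) + 1]] <- kmeans(chunk, centers = 100)$centers
  skip <- skip + chunk_size
}

# combine the per-chunk results, e.g. by clustering the pooled centers
all_centers <- do.call(rbind, results)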

4. Cloud Solution

Nowadays, cloud solutions are really popular, and you can move your work to the cloud for data manipulation and modelling.

If that feels too expensive for the whole project, you can do the down-sampling in the cloud and then come back to your laptop, in case you cannot find a suitable local tool for the sampling work.

5. A New Machine

This is the option I would think of first. A new machine may still not handle all your data (depending on the number of variables in it), but it will definitely make the other computations more efficient.

For a personal project, 32 GB of RAM with an i7 CPU would be good enough to start machine learning. A Titan GPU would give you a speed boost for some machine learning methods (e.g. xgboost, lightgbm, keras, etc.).

For commercial purposes, a server or cluster solution makes more sense for a clustering job on 70 million records.

Big Data Process and Analysis in R

If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.

(The Twitter streaming API returns a pretty rich object: a single 140-character tweet can weigh a couple of kilobytes of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)

On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.

Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.

A couple of pointers:

  • an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.

  • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.

How to handle large yet not big-data datasets?

There are different methods

Chunk up the dataset (saves time later but requires an initial time investment)

Chunking makes many operations, such as shuffling, much easier.

Make sure each subset/chunk is representative of the whole dataset. Each chunk file should have the same number of lines.

This can be done by appending lines to one file after another. You will quickly realize that it is inefficient to open each file and write a single line, especially while reading and writing on the same drive.

-> add a writing and reading buffer that fits into memory.


Choose a chunk size that fits your needs. I chose this particular size because my default text editor can still open the files fairly quickly.

Smaller chunks can boost performance, especially if you want metrics such as the class distribution, because you only have to loop through one representative file to get an estimate for the whole dataset, which might be enough.

Bigger chunk files give a better representation of the whole dataset in each file, but you could just as well go through x smaller chunk files.

I use C# for this because I am far more experienced there and can therefore use the full feature set, such as splitting the reading / processing / writing tasks onto different threads.

If you are experienced with Python or R, I suspect similar functionality exists there as well. Parallelizing can be a huge factor on such large datasets.
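As a rough R equivalent of the chunking step described above (paths, chunk count, and buffer size are made-up defaults), the sketch below reads the big csv through a connection with a buffer of lines, shuffles each buffer, and deals its lines out round-robin to a fixed number of chunk files so every chunk stays roughly representative:

split_into_chunks <- function(infile, out_dir, n_chunks = 50, buffer_lines = 1e6) {
  dir.create(out_dir, showWarnings = FALSE)
  out_files <- file.path(out_dir, sprintf("chunk_%03d.csv", seq_len(n_chunks)))
  con <- file(infile, open = "r")
  header <- readLines(con, n = 1)
  for (f in out_files) writeLines(header, f)         # every chunk file gets the header
  repeat {
    buf <- readLines(con, n = buffer_lines)          # read a buffer that fits in memory
    if (length(buf) == 0) break
    buf <- sample(buf)                               # shuffle within the buffer
    grp <- rep_len(seq_len(n_chunks), length(buf))   # deal lines out round-robin
    for (k in seq_len(n_chunks)) {
      cat(paste0(buf[grp == k], "\n"), file = out_files[k], sep = "", append = TRUE)
    }
  }
  close(con)
}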

Chunked datasets can be combined into one interleaved dataset that you can process with tensor processing units. That would probably yield some of the best performance and can run locally as well as in the cloud on really big machines. But it requires a lot of learning about TensorFlow.

Use a reader and read the file step by step

Instead of doing something like all_of_it = file.read(), you want to use some kind of stream reader. The following function reads through one of the chunk files (or your whole 300 GB dataset) line by line to count the classes within the file. By processing one line at a time, your program will not overflow memory.

You might want to add some progress indication, such as X lines/s or X MB/s, in order to estimate the total processing time.

def getClassDistribution(path):
    classes = dict()
    # open sample file and count classes
    with open(path, "r", encoding="utf-8", errors='ignore') as f:
        line = f.readline()
        while line:
            if line != '':
                # the label is the last csv field: the character before the trailing
                # newline, or the very last character on the final line (single-digit labels)
                labelstring = line[-2:-1]
                if labelstring == ',':
                    labelstring = line[-1:]
                label = int(labelstring)
                if label in classes:
                    classes[label] += 1
                else:
                    classes[label] = 1
            line = f.readline()
    return classes


I use a combination of chunked datasets and estimation.

Pitfalls for performance

  • whenever possible, avoid nested loops. Each loop inside another loop multiplies the complexity by n
  • whenever possible, process the data in one go. Each additional pass over the data adds another factor of n
  • if your data comes in csv format, avoid premade functions such as cells = int(line.Split(',')[8]); this very quickly leads to a memory throughput bottleneck. One proper example of this can be found in getClassDistribution, where I only want to get the label.

The following C# function splits a csv line into its elements very quickly.

// Call function
ThreadPool.QueueUserWorkItem((c) => AnalyzeLine("05.02.2020,12.20,10.13").Wait());

// Parallelize this on multiple cores/threads for ultimate performance
private async Task AnalyzeLine(string line)
{
    PriceElement elementToAdd = new PriceElement();
    int counter = 0;
    string temp = "";
    foreach (char c in line)
    {
        if (c == ',')
        {
            switch (counter)
            {
                case 0:
                    elementToAdd.spotTime = DateTime.Parse(temp, CultureInfo.InvariantCulture);
                    break;
                case 1:
                    elementToAdd.buyPrice = decimal.Parse(temp);
                    break;
                case 2:
                    elementToAdd.sellPrice = decimal.Parse(temp);
                    break;
            }
            temp = "";
            counter++;
        }
        else temp += c;
    }
    // the last field has no trailing comma, so parse whatever is left over
    if (counter == 2 && temp != "")
        elementToAdd.sellPrice = decimal.Parse(temp);
    // compare the price element to conditions on another thread
    Observate(elementToAdd);
}

Create a database and load the data

When processing csv-like data, you can load the data into a database.

Databases are built to handle huge amounts of data, and you can expect very high performance.

A database will likely use up far more space on your disk than the raw data. This is one reason why I moved away from using a database.

Hardware Optimisations

If your code is well optimized, your bottleneck will most likely be the hard drive throughput.

  • If the data fits onto your local hard drive, use it locally, as this gets rid of network latencies (imagine 2-5 ms for each record in a local network and 10-100 ms in remote locations).
  • Use a modern hard drive. A 1 TB NVMe SSD costs around 130 today (e.g. the Intel 600p 1 TB). An NVMe SSD uses PCIe and is around 5 times faster than a normal SSD and 50 times faster than a normal hard drive, especially when writing quickly to different locations (chunking up data). SSDs have caught up vastly in capacity in recent years, and for a task like this they are ideal.

The following screenshot provides a performance comparison of TensorFlow training with the same data on the same machine: once with the data saved locally on a standard SSD and once on network-attached storage in the local network (a normal hard disk).

[Screenshot: TensorFlow training speed, data on local SSD vs. network-attached storage]


