What Are Helpful Optimizations in R for Big Data Sets

What are helpful optimizations in R for big data sets?

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

Use the data.table package:

library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)


1.

Grouping by many columns is something that data.table is excellent at.

See the barchart at the very bottom of the second plot for a comparison against dplyr, spark and others for exactly this kind of grouping:

https://h2oai.github.io/db-benchmark

by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]

Note that I changed date to date_sequence; I think you meant that as a column name.

2.

It is unclear which fields you want to merge the tables on: dataframe_two does not have them specified, so the query is invalid as written.

Please clarify.
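
For illustration only, once the join fields are known, a plain data.table equi-join would look like the sketch below; key_a is just a placeholder for whatever column(s) the two tables actually share.

# placeholder sketch: replace key_a with the real merge column(s) shared by d1 and d2
a2 = d2[d1, on = "key_a"]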

3.

data.table has a very useful type of join called a rolling join, which does exactly what you need:

a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
#   Join results in more than 2^31 rows (internal vecseq reached physical limit).
#   Very likely misspecified join. Check for duplicate key values in i each of which
#   join to the same group in x over and over again. If that's ok, try by=.EACHI to
#   run j for each group to avoid the large allocation. Otherwise, please search for
#   this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker
#   for advice.

It results in an error. The error is in fact very useful: on your real data it may work perfectly fine, as the reason behind the error (the cardinality of matching rows) may be related to how the sample data were generated. It is very tricky to produce good dummy data for joining.
If you are getting the same error on your real data, you may want to review the design of that query, because it attempts a row explosion by doing a many-to-many join, even after reducing each match to a single date_sequence value (taking roll into account). I don't see this kind of query as valid for that data (strictly speaking, given the cardinalities of the join fields). You may want to introduce a data-quality-check layer in your workflow to ensure there are no duplicates on key_a and date_sequence combined.
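
As a sketch of such a check (using the column names from the example above), you could count repeated key_a/date_sequence combinations with data.table before joining; any combination appearing more than once is what triggers the row explosion.

# count rows per key_a/date_sequence combination in the right-hand table (d2)
dup_keys = d2[, .N, by = .(key_a, date_sequence)][N > 1L]
nrow(dup_keys)   # 0 means the combination is unique, so the rolling join cannot explode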

Optimizing for loop in big data frame

You could try

library(data.table)
dcast.data.table(setDT(df)[, c('.id', 'Seq') := list(c('entry', 'exit'), gl(.N, 2, .N))],
                 id + Seq ~ .id, value.var = 'time')

#   id Seq            entry             exit
#1:  1   1 15/12/2014 06:30 15/12/2014 06:31
#2:  1   2 15/12/2014 06:34 15/12/2014 06:35
#3:  2   3 15/12/2014 06:36 15/12/2014 06:37
#4:  3   4 15/12/2014 06:38 15/12/2014 06:39

data

df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
                     time = structure(1:8,
                                      .Label = c("15/12/2014 06:30", "15/12/2014 06:31",
                                                 "15/12/2014 06:34", "15/12/2014 06:35",
                                                 "15/12/2014 06:36", "15/12/2014 06:37",
                                                 "15/12/2014 06:38", "15/12/2014 06:39"),
                                      class = "factor")),
                .Names = c("id", "time"), class = "data.frame", row.names = c(NA, -8L))

Optimization/Parallelization R - Handle large data sets calculating SPI in R

I find a pretty linear, not exponential, processing time:

system.time(spi(data[,1], 6))
system.time(spi(data[,1:10], 6))
system.time(spi(data[,1:100], 6))

If you see exponential growth, it is probably caused by a problem with excess RAM allocation.

One easy solution is to split the computation over the matrix:

spi6 <- data*NA
system.time(
for (i in 1:100) spi6[,i] <- spi(data[,i], 6)$fitted
)

or, with a similar efficiency:

system.time(
spi6 <- apply(data[,1:100], 2, function(x) spi(x, 6)$fitted)
)

As you can see, the computation time is pretty similar to the original option in which you provide the whole matrix as input to the spi() function. But, if you are experiencing memory problems, this might solve them.

On the other hand, if you have access to a multi-core computer (as is most likely the case nowadays), you may see an improvement in computation time by parallelising the calculation:

library(snowfall)
sfInit(parallel=TRUE, cpus=2)
sfLibrary(SPEI)

system.time(
spi6 <- sfApply(data[,1:100], 2, function(x) spi(x, 6)$fitted)
)

sfStop()

You may want to set a higher value for cpus for a greater speed gain, depending on how many threads your computer supports. However, sfApply will not solve your memory problem with very large datasets, because the function splits the dataset between the number of CPUs allocated. Since the total memory of the system does not change, you will run out of memory just the same.

The solution is to merge both approaches: split your dataset and then parallelize. Something like this:

data <- replicate(10000, rnorm(240))

sfInit(parallel=TRUE, cpus=10)
sfLibrary(SPEI)

spi6 <- data*NA
for (i in 0:9) {
  chunk <- (i*1000) + (1:1000)
  spi6[,chunk] <- sfApply(data[,chunk], 2, function(x) spi(x, 6)$fitted)
}

sfStop()

Now you just need to find the maximum size of the chunk, that is, the number of data columns you pass to sfApply, in order not to overflow your available RAM. That's pretty easy to find with some trial and error.
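
As a base-R sketch of that idea (assuming spi6 and the snowfall cluster have been set up as above), you can let split() build the column chunks so that ncol(data) does not have to be a multiple of the chunk size; chunk_size is the value you tune by trial and error.

chunk_size <- 1000   # tune this against your available RAM
chunks <- split(seq_len(ncol(data)), ceiling(seq_len(ncol(data)) / chunk_size))
for (chunk in chunks) {
  spi6[, chunk] <- sfApply(data[, chunk, drop = FALSE], 2, function(x) spi(x, 6)$fitted)
}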

How to optimize the r parameter when fitting a single model to several datasets in spatstat?

This is not currently implemented in a neat way for mppm, that is, for fitting models to several point pattern datasets. It is on the "to do" list.
(For fitting models to a single point pattern dataset, see the last paragraph below.)

Your code is OK, except for one problem: it assumes that it is valid to compare the log pseudolikelihood values of two models fitted with different values of r. That's not always true because, by default, ppm and mppm use the border method of edge correction, and by default, the border distance rbord is chosen to be equal to the interaction distance r. In your code, rbord is different for each model, so the pseudolikelihoods are not strictly comparable (effectively the models are based on different "sample sizes").

To avoid this problem, you can either explicitly set the border distance rbord to be equal to the maximum value of r that will be used:

mppm(Points ~ something, Strauss(r), rbord=rmax)

or specify another edge correction such as correction="iso" or correction="none" in the call to mppm. Either of these strategies will ensure that the pseudolikelihood values are comparable.
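
Schematically (following the call style of the example above), the second strategy would look like:

mppm(Points ~ something, Strauss(r), correction = "iso")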

You noted that estimates of r obtained by your search procedure were unrealistic. This could be attributable to the issue discussed above. But it does also happen sometimes when the search domain is chosen to be too large (if you allowed the software to try unrealistic values).

Another option, which would be much faster, is to use mppm to fit a model with the PairPiece interaction, specifying a sequence of jump points r, and then to plot the resulting fitted interaction (extracted from the model using fitin). This will allow you to judge the most appropriate value of r, or at least the range of r that is appropriate. It also gives you a way to judge whether a threshold-type interaction model is appropriate. See the bottom of page 517 and the left panel of Figure 13.19 in the spatstat book.
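
A rough sketch of that approach, in the same schematic style as the earlier call (the jump points below are placeholders, not recommendations):

fitP <- mppm(Points ~ something, PairPiece(r = c(2, 4, 6, 8, 10)))   # placeholder jump points
plot(fitin(fitP))   # step-function estimate of the pairwise interaction, to judge a sensible r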

In the jargon, r is called an irregular parameter. As explained in the spatstat package documentation and the spatstat book, only the regular parameters are estimated by ppm or mppm, and the irregular parameters must be determined in some other way. See Sections 9.12 and 13.6.3 of the spatstat book.

Because r is a distance threshold, the likelihood or pseudolikelihood is not continuous as a function of r. So the answer to your second question is strictly "no", there is no analytic formula for estimating r.

For fitting a model to a single point pattern dataset, there are two functions that allow estimation of the interaction range: profilepl which uses a method similar to the one you have implemented above, and ippm which solves the score equation analytically. To fit a Strauss model to a single dataset where r has to be estimated, the only option supported is profilepl because the Strauss model is not differentiable with respect to r. Very little information is known about the statistical performance of estimators of interaction range. This is still a research problem.
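
For the single-dataset case, a minimal profilepl sketch (using the built-in cells pattern and a purely illustrative range of r values) looks like this:

library(spatstat)
rvals <- data.frame(r = seq(0.05, 0.15, by = 0.01))   # candidate interaction radii (illustrative)
fitp  <- profilepl(rvals, Strauss, cells)             # profile the log pseudolikelihood over r
fitp          # reports the optimal r
plot(fitp)    # profile of the log pseudolikelihood against r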

How to optimize subsetting from a large dataset?

Note: The post has been edited, changing the function being calculated from rowSums to colSums (using lapply in the case of data.table).

I don't think you could get the result faster than data.table. Here's a benchmark between plyr and data.table. Of course, if the time-consuming part is your function, then you could use doMC to run in parallel using plyr (assuming you have a lot of cores or you work on a cluster). Otherwise, I'd stick to data.table. Here's an analysis with huge test data and a dummy function:

# create a huge data.frame with repeating id values
len <- 1e5
reps <- sample(1:20, len, replace = TRUE)
x <- data.frame(id = rep(1:len, reps))
x <- transform(x, v1 = rnorm(nrow(x)), v2 = rnorm(nrow(x)))

> nrow(x)
[1] 1048534 # 1 million rows

# construct functions for data.table and plyr
# method 1
# using data.table
DATA.TABLE <- function() {
  require(data.table)
  x.dt <- data.table(x, key = "id")
  x.dt.out <- x.dt[, lapply(.SD, sum), by = id]
}

# method 2
# using plyr
PLYR <- function() {
  require(plyr)
  x.plyr.out <- ddply(x, .(id), colSums)
}

# let's benchmark
> require(rbenchmark)
> benchmark(DATA.TABLE(), PLYR(), order = "elapsed", replications = 1)[1:5]
          test replications elapsed relative user.self
1 DATA.TABLE()            1   1.006    1.000     0.992
2       PLYR()            1  67.755   67.351    67.688

On a data.frame with 1 million rows, data.table takes 0.992 seconds. The speed-up using data.table compared to plyr (admittedly, on computing column sums) is about 68x. Depending on the computation time in your function, this speed-up will vary, but data.table will still be much faster. plyr is a split-apply-combine strategy. I don't think you'd get a comparable speed-up by using base R to split, apply and combine yourself, but of course you can try it; a base version is sketched below.
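
For reference, a base-R split/apply/combine version of the same computation (not benchmarked here) might look like this:

# method 3 (sketch)
# using base R: split by id, apply colSums to each piece, combine back into a data.frame
BASE <- function() {
  sp  <- split(x[c("v1", "v2")], x$id)
  out <- t(sapply(sp, colSums))
  data.frame(id = as.integer(rownames(out)), out)
}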

I ran the code with 10 million rows. data.table ran in 5.893 seconds. plyr took 6300 seconds.

A faster way to combine a large number of datasets in R?

So, I've done some tinkering and come to a first solution. Batch combining the datasets with rbindlist indeed proved to be a great efficiency boost.

I believe this mostly has to do with the size pre-allocation that rbindlist does automatically. I was struggling a bit to find an elegant way to pre-allocate the dataset that did not involve hand-naming all 88 columns and wildly guessing the final size of the dataset (due to the removal of duplicates in the process), so I didn't benchmark the rbindlist solution against an rbind solution with a pre-allocated dataset.

So here's a solution for batching the files and combining them via rbindlist. Combining all 858 datasets onto an initial dataset of ~236,000 tweets from an initial search clocked in at 909.68 seconds (per system.time()) and produced a dataset of ~2.5 million rows.

filenames <- list.files(pattern = "dataset_*")

full_data <- as.data.table(data_init)

for (i in seq(1, length(filenames), 13)) {   # sequence into batches (here: 13)
  files <- filenames[i:(i+12)]               # the batched files to load
  for (j in 1:length(files)) {
    load(files[j])
    print(paste((i+j-1), "/", length(filenames), ":", files[j]))   # current dataset out of total
  }
  full_data <- rbindlist(mget(c(ls(pattern="dataset"), ls(pattern="full_data"))))   # add new datasets
  print("- batch combined -")
  rm(list = ls(pattern="dataset"))                                    # remove loaded datasets
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")   # remove duplicate tweets
}

I've divided them into batches of 13, since that divides evenly into the 858 datasets. Some testing showed that batches of ~8 datasets might be more efficient. I'm not sure what would be an elegant way to deal with the number of files not dividing evenly into batches, though; one option is sketched below.
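
One option, sketched here with base split() (and assuming, as in the code above, that the loaded objects have "dataset" in their names and that full_data and status_id exist), is to let the last batch simply be shorter, so no indexing arithmetic depends on the batch size dividing the file count evenly:

batch_size <- 13
batches <- split(filenames, ceiling(seq_along(filenames) / batch_size))
for (batch in batches) {
  for (f in batch) load(f)                                              # load one batch of files
  full_data <- rbindlist(mget(c(ls(pattern = "dataset"), "full_data")))  # add the new datasets
  rm(list = ls(pattern = "dataset"))                                    # drop the loaded objects
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")     # remove duplicate tweets
}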

Optimizing ifelse on a large data frame

There has been some discussion about how ifelse is not the best option for code where speed is an important factor. You might instead try:

df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]

To see what's going on here, let's break down the command. df$A > 0.05 & df$B > 0.05 returns TRUE if both A and B exceed 0.05, and FALSE otherwise. Therefore, (df$A > 0.05 & df$B > 0.05)+1 returns 2 if both A and B exceed 0.05 and 1 otherwise. These values are used as indices into the vector c("", "Equal"), so we get "Equal" when both exceed 0.05 and "" otherwise.
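
A tiny illustration of that indexing trick:

cond <- c(TRUE, FALSE, TRUE)
c("", "Equal")[cond + 1]
# [1] "Equal" ""      "Equal"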

Here's a comparison on a data frame with 1 million rows:

# Build dataset and functions
set.seed(144)
big.df <- data.frame(A = runif(1000000), B = runif(1000000))
OP <- function(df) {
  df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "")
  df
}
josilber <- function(df) {
  df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]
  df
}
all.equal(OP(big.df), josilber(big.df))
# [1] TRUE

# Benchmark
library(microbenchmark)
microbenchmark(OP(big.df), josilber(big.df))
# Unit: milliseconds
#              expr      min        lq      mean    median        uq      max neval
#        OP(big.df) 299.6265 311.56167 352.26841 318.51825 348.09461 540.0971   100
#  josilber(big.df)  40.4256  48.66967  60.72864  53.18471  59.72079 267.3886   100

The approach with vector indexing is about 6x faster in median runtime.

What is an efficient programming way to find the closest time of a dataset to a reference (larger) dataset

One can use a rolling join from data.table:

library(data.table)
set.seed(1)   # reproducibility on Stack Overflow
DF_A <- data.table(x = seq(-500, by = 0.5, length.out = 26908),
                   idx = seq_len(26908))

DF_HZ <- data.table(x = round(runif(16020209, first(DF_A$x), last(DF_A$x)), 3),
                    idx_hz = seq_len(16020209))

DF_HZ[, x_hz := x + 0] # so we can check
DF_A[, x_a := x + 0] # so we can check

setkey(DF_A, x)
setkey(DF_HZ, x)

# The order(idx_hz) returns the result in the same order as
# DF_HZ but it is not necessary to match joins.
DF_A[DF_HZ, roll = "nearest"][order(idx_hz)]
#>                   x   idx     x_a   idx_hz      x_hz
#>        1:  3072.021  7145  3072.0        1  3072.021
#>        2:  4506.369 10014  4506.5        2  4506.369
#>        3:  7206.883 15415  7207.0        3  7206.883
#>        4: 11718.574 24438 11718.5        4 11718.574
#>        5:  2213.328  5428  2213.5        5  2213.328
#>       ---
#> 16020205: 10517.477 22036 10517.5 16020205 10517.477
#> 16020206: 11407.776 23817 11408.0 16020206 11407.776
#> 16020207: 12051.919 25105 12052.0 16020207 12051.919
#> 16020208:  3482.463  7966  3482.5 16020208  3482.463
#> 16020209:   817.366  2636   817.5 16020209   817.366

Created on 2020-11-11 by the reprex package (v0.3.0)

On my machine, the above (not including the creation of the dummy data) takes about 3 s.

Faster functions than lm() in R

Apparently, it should not be a problem to run a regression on a dataset of ~100,000 observations.

After receiving helpful comments on the main post, I found that one of the independent variables used as input to the regression was coded as character. I discovered this by using the following command to check the data type of every column in the data frame (df):

str(df)

$ var1 : chr "x1" "x2" "x1" "x1"
$ var2 : Factor w/ 2 levels "factor1" "factor2": 1 1 1 0
$ var3 : Factor w/ 2 levels "factorx" "factory": 0 1 1 0
$ var4 : num 1 8 3 2

Changing var1 to a factor variable:

df$var1 <- as.factor(df$var1)

After changing var1 to a factor variable, the regression indeed runs within a few seconds.
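
More generally, a sketch for catching this up front is to convert every character column to a factor before fitting (the formula below is illustrative, using the variable names from the str() output above):

# convert every character column of df to a factor in place
df[] <- lapply(df, function(col) if (is.character(col)) factor(col) else col)
fit <- lm(var4 ~ var1 + var2 + var3, data = df)   # illustrative formula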


