Aggregation Using the ffdfdply Function in R

How to split/aggregate a large data frame (ffdf) by multiple columns?

## Use this; it makes sure your data does not come into RAM completely, but only in chunks of 100000 records
ffshares$splitBy <- with(ffshares[c("articleID", "measure")],
                         paste(articleID, measure, sep = ""),
                         by = 100000)
length(levels(ffshares$splitBy)) ## how many levels are in there - don't know from your question

tmp <- ffdfdply(ffshares, split = ffshares$splitBy, FUN = function(x) {
  ## In x you get a data.frame in RAM with all records of possibly several articleID/measure combinations.
  ## You should write a function which returns a data.frame. E.g. the following returns the mean value
  ## by articleID/measure together with the first and last timepoint.
  x <- data.table::setDT(x)
  xagg <- x[, list(value = mean(value),
                   first.timepoint = min(time),
                   last.timepoint = max(time)), by = list(articleID, measure)]
  ## the function should return a data.frame as indicated in the help of ffdfdply, not a list
  data.table::setDF(xagg)
})
## tmp is an ffdf

What does the by argument in ffbase::as.character do?

Since as.character.ff works using the default as.character internally, and since ff vectors can be larger than RAM, the data needs to be processed in chunks. The partitioning into chunks is handled by the chunk function; in this case the relevant method is chunk.ff_vector. By default, this calculates the chunk size by dividing getOption("ffbatchbytes") by the record size. However, this behaviour can be overridden by supplying the chunk size via by.

In the example you give, the ff vector will be converted to character 250000 members at a time.

The end result will be the same for any value of by, or with no by at all. Larger values lead to greater temporary use of RAM but potentially faster operation.
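
For illustration, a minimal sketch (the vector x below is made up for the example):

require(ffbase)
x <- ff(sample(1:1000, 1e6, replace = TRUE))
## convert to character in chunks of 250000 elements; without 'by' the chunk
## size is derived from getOption("ffbatchbytes") divided by the record size
y <- as.character(x, by = 250000)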

Using tapply, ave functions for ff vectors in R

There is currently no tapply or ave for ff_vectors implemented in package ff, but you can use functionality from package ffbase instead.
Let's illustrate this on a bigger dataset:

require(ffbase)
a <- ffrep.int(ff(1:100000), times=500) ## 50Mio records on disk - not in RAM
b <- ffrandom(n=length(a), rfun = runif)
c <- ffseq_len(length(a))
df <- ffdf(a = a, b = b, c = c) ## on disk
dim(df)

For your simple aggregation, you can use binned_sum, from which you can also extract the group counts easily, as follows. Note that binned_sum needs an ff factor object as the bin, which can be obtained by applying as.character.ff as shown below.

df$groupbyfactor <- as.character(df$a)
agg <- binned_sum(x=df$b, bin=df$groupbyfactor, nbins = length(levels(df$groupbyfactor)))
head(agg)
agg[, "count"]
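
If you also need the group means rather than just the counts, they can be derived from the same result. This is a minimal sketch, assuming the matrix returned by binned_sum carries a "sum" column next to the "count" column shown above:

## group mean = per-bin sum divided by per-bin count
group_means <- agg[, "sum"] / agg[, "count"]
head(group_means)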

For more complex aggregations you can use ffdfdply in ffbase. What I frequently do is combine it with some data.table statements like this:

require(data.table)
agg <- ffdfdply(df, split = df$groupbyfactor, FUN = function(x){
  x <- as.data.table(x)
  result <- x[, list(b.mean = mean(b),
                     b.median = median(b),
                     b.length = length(b),
                     whatever = b[c == max(c)][1]), by = list(a)]
  result <- as.data.frame(result)
  result
})
class(agg)
agg_in_ram <- as.data.frame(agg) ## Puts the data in RAM!

This will bring your data into RAM in chunks of groups of split elements, on which you can apply a function (like the data.table statements above) that needs the data in RAM. The results from all chunks are then combined into a new ffdf, so that you can work with it further, or pull it into RAM if your RAM allows that size.

The sizes of the chunks are controlled by getOption("ffbatchbytes"). So the more RAM you have, the better: a larger setting allows each chunk to bring more data into RAM at once.
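
For example, a minimal sketch of raising the chunk size (the 1 Gb value is only an illustration; pick what your RAM allows):

## allow roughly 1 Gb of data per chunk instead of the default
options(ffbatchbytes = 1024 ^ 3)
getOption("ffbatchbytes")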

ffdfdply, splitting and memory limit in R

The most difficult part about using ff/ffbase is making sure your data stays in ff and is not accidentally pulled into RAM. Once you have put your data in RAM (mostly due to some misunderstanding of when data is put in RAM and when it is not), it is hard to get that RAM back from R, and if you are working near your RAM limit, a small extra request for RAM will give you the 'Error: cannot allocate vector of size' message.

Now, I think you misspecified the input to ikey. Look at ?ikey: it requires an ffdf as input argument, not several ff vectors. That has probably put your data in RAM, while what you wanted was probably ikey(x[c("id_1", "id_2", "month", "year")]).

I simulated some data on my computer as follows to get an ffdf with 24 Mio rows, and the following does not give me RAM trouble (it uses approximately 3.5 Gb of RAM on my machine):

require(ffbase)
require(data.table)
x <- expand.ffgrid(id_1 = ffseq(1, 1000), id_2 = ffseq(1, 1000), year = as.ff(c(2012,2013)), month = as.ff(1:12))
x$Amount <- ffrandom(nrow(x), rnorm, mean = 10, sd = 5)
x$key <- ikey(x[c("id_1","id_2","month","year")])
x$key <- as.character(x$key)
summary <- ffdfdply(x, split = x$key, FUN = function(df) {
  df <- data.table(df)
  df <- df[, list(
    id_1 = id_1[1],
    id_2 = id_2[1],
    month = month[1],
    year = year[1],
    withdraw = sum(Amount * (Amount > 0), na.rm = TRUE)
  ), by = key]
  df
}, trace = TRUE)

Another reason might be that you have too much other data in RAM that you are not mentioning. Note also that in ff, all your factor levels are kept in RAM; this can also become an issue if you are working with a lot of character/factor data. In that case you need to ask yourself whether you really need these data in your analysis or not.

R {ff} - use apply on an ffdf

It seems you want to apply a function over the rows. You can use chunk for that: get your chunk in RAM, use apply, and store the result where you want (in RAM or in ff).

require(ff)
ffiris <- as.ffdf(iris)
for(i in chunk(ffiris)){
  x <- ffiris[i, ]
  apply(x, MARGIN = 1, FUN = yourfunction)
}
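
If you want to keep the result on disk as well, a minimal sketch is to preallocate an ff vector and fill it chunk by chunk. Here a row-wise sum over the numeric columns of iris stands in for yourfunction, which is assumed to return one number per row:

result <- ff(vmode = "double", length = nrow(ffiris))
for(i in chunk(ffiris)){
  x <- ffiris[i, ]
  ## the row-wise sum over the numeric columns stands in for yourfunction here
  result[i] <- apply(x[, 1:4], MARGIN = 1, FUN = sum)
}
head(result[])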

Functions for creating and reshaping big data in R using the ff package

The function reshape does not explicitly exist for ffdf objects, but it is quite straightforward to do with functionality from package ffbase: just use ffdfdply from package ffbase, split by Subject, and apply reshape inside the function.

An example on the Indometh dataset with 1000000 subjects.

require(ffbase)
require(datasets)
data(Indometh)

## Generate some random data
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n=nrow(x), rfun = rnorm)
dim(x)
[1] 11000000 3

## and reshape to wide format
result <- ffdfdply(x = x, split = x$Subject, FUN = function(datawithseveralsplitelements){
  df <- reshape(datawithseveralsplitelements,
                v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
[1] "ffdf"
colnames(result)
[1] "Subject" "conc.0.25" "conc.0.5" "conc.0.75" "conc.1" "conc.1.25" "conc.2" "conc.3" "conc.4" "conc.5" "conc.6" "conc.8"
dim(result)
[1] 1000000 12

