Python's xrange Alternative for R, or How to Loop Over a Large Dataset Lazily

Python's xrange alternative for R, or how to loop over a large dataset lazily?

One (arguably more "proper") way to approach this would be to write your own iterator for the iterators package that @BenBolker suggested (pdf on writing extensions is here). Lacking something more formal, here is a poor-man's iterator, similar to expand.grid but manually advancing. (Note: this suffices as long as the computation on each iteration is more expensive than this function itself. It could certainly be improved, but it works.)

This function returns a named list (with the provided factors) each time the returned function is called. It is lazy in that it does not expand the entire grid of possibilities; it is not lazy with the arguments themselves, which are consumed immediately.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  # mixed-radix counter over the factors; starts "one before" the first combo
  indices <- c(0, rep(1, length(dots) - 1))
  function() {
    indices[1] <<- indices[1] + 1
    # carry: when a position overflows its factor's length, reset it and
    # advance the next position; overflowing the last position means done
    while (any(rolls <- (indices > sizes))) {
      if (tail(rolls, n = 1)) return(FALSE)
      indices[rolls] <<- 1
      indices[1 + which(rolls)] <<- indices[1 + which(rolls)] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}

Sample usage:

nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
#   a  b  c
# 1 1 15 21
nxt()
#   a  b  c
# 1 2 15 21
nxt()
#   a  b  c
# 1 3 15 21
nxt()
#   a  b  c
# 1 1 16 21

## <yawn>

nxt()
#   a  b  c
# 1 3 16 22
nxt()
# [1] FALSE

NB: for brevity of display, I wrapped the returned list in as.data.frame(...) for the example output; it works either way, and if a named list suits you then the conversion to a data.frame isn't necessary.
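
Since the goal is lazy looping, the typical consumption pattern is a while loop that runs until the iterator returns FALSE. A minimal sketch (assuming R >= 3.5 for isFALSE; the loop body is a placeholder for your real per-combination work):

nxt <- lazyExpandGrid(a = 1:3, b = 15:16, c = 21:22)
while (!isFALSE(combo <- nxt())) {
  # 'combo' is a named list such as list(a = 1, b = 15, c = 21)
  print(sum(unlist(combo)))
}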

EDIT

Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  # cumulative products give each factor's "place value" in a
  # mixed-radix representation of the combination index
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[length(indices)]
  i <- 0
  function(index) {
    i <<- if (missing(index)) (i + 1L) else index
    # vector argument: recurse on each index and row-bind the results
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    if (i > maxcount || i < 1L) return(FALSE)
    # decode the scalar index into one position per factor
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L),
             argnames)
  }
}

It works with no arguments (auto-incrementing the internal counter), one argument (seeking to that index and setting the internal counter), or a vector argument (seeking to each index, setting the counter to the last, and returning a data.frame).
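
For instance, a quick sketch of the single-argument seek (the printed values follow from the index arithmetic above):

nxt <- lazyExpandGrid(a = 1:3, b = 15:16)
nxt(4)   # seek directly to the 4th of the 6 combinations
# $a
# [1] 1
# $b
# [1] 16
nxt()    # the counter was set by the seek, so this is combination 5
# $a
# [1] 2
# $b
# [1] 16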

The vector-argument use-case allows for sampling a subset of the design space:

set.seed(42)
nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
#   a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
#      a  b  c  d  e  f
# 2   69 61  7  7 49 92
# 21  72 28 55 40 62 29
# 3   88 32 53 46 18 65
# 4   88 33 31 89 66 74
# 5   57 75 31 93 70 66
# 6  100 86 79 42 78 46
# 7   55 41 25 73 47 94

Thanks to alexis_laz for the cumprod, Map, and index-calculation improvements!

How can I avoid writing nested for loops for large data sets?

This is combinations with repetition. RcppAlgos is likely your best out-of-the-box option, but at n = 1000L that's just over 167 million combinations (about 500 million integer values) to go through, which will take up ~2 GB of RAM.

library(RcppAlgos)
n = 1000L
mat <- comboGeneral(n, 3L, repetition = TRUE)
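
You can check the size before allocating anything (comboCount is part of RcppAlgos; the byte math assumes R's 4-byte integers):

comboCount(n, 3L, repetition = TRUE)
# [1] 167167000
# 167167000 rows * 3 columns * 4 bytes per integer ~ 2 GB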

Now there are two routes to go. If you have the RAM and your function can be vectorized, you can do the above very quickly. Let's say that if the sum of a combination is greater than 1000 you want its mean, otherwise you want its sum:

res <- ifelse(rowSums(mat) > 1000L,
              rowMeans(mat),
              rowSums(mat))

## Error: cannot allocate vector of size 1.2 Gb

Oh no! I get the dreaded cannot-allocate-vector error. RcppAlgos also allows you to apply a function to each combination as it is generated, via the FUN argument. But note that this returns a list and is a lot slower, because it has to evaluate your R function instead of staying in C++. Because of this I changed to n = 100L, as I do not have all day...

comboGeneral(100L, 3L, repetition = TRUE,
             FUN = function(x) {
               if (sum(x) > 100L)
                 mean(x)
               else
                 sum(x)
             })

If I had a static setup where I was always choosing combinations of 3 out of n, I would likely use Rcpp code directly, depending on what foo(a,b,c) and bar(a,b,c) are, but first I would want to know more about those functions.
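
In the meantime, a middle ground that keeps memory bounded without dropping to C++ is to process the combinations in chunks using comboGeneral's lower/upper arguments. A minimal sketch (the chunk size and the per-chunk work are placeholders):

library(RcppAlgos)
n     <- 1000L
chunk <- 1e6L
total <- comboCount(n, 3L, repetition = TRUE)
for (start in seq(1L, total, by = chunk)) {
  m <- comboGeneral(n, 3L, repetition = TRUE,
                    lower = start, upper = min(start + chunk - 1L, total))
  # vectorized work on this chunk goes here, e.g.:
  res <- ifelse(rowSums(m) > 1000L, rowMeans(m), rowSums(m))
}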

What can substitute a nested loop in R

This seems more like a design-of-experiments problem, in a sense, where you are iterating over different possible values of x and y.

xs <- 2:6
ys <- 5:15
eg <- expand.grid(x = xs, y = ys)
head(eg)
#   x y
# 1 2 5
# 2 3 5
# 3 4 5
# 4 5 5
# 5 6 5
# 6 2 6

I think your %% filtering should be done outside/before this, so:

xs <- xs[!xs %% 2]
ys <- ys[!ys %% 5]
eg <- expand.grid(x = xs, y = ys)
head(eg)
#   x  y
# 1 2  5
# 2 4  5
# 3 6  5
# 4 2 10
# 5 4 10
# 6 6 10

From here, you can just iterate over the rows:

eg$out <- sapply(seq_len(nrow(eg)), function(r) {
  sum(input$value[ complete.cases(input) & input$xcol < eg$x[r] & input$ycol < eg$y[r] ])
})
eg
#   x  y out
# 1 2  5   0
# 2 4  5   0
# 3 6  5   0
# 4 2 10   4
# 5 4 10  21
# 6 6 10  28
# 7 2 15   4
# 8 4 15  21
# 9 6 15  36

I think your output variable is a little off, since "2,15" should only include input$value[1] (x < 2 is the limiting factor). (Other differences exist.)

Regardless of your actual indexing logic, I suggest this methodology over a double-for or double-lapply implementation.

NB:

  1. These commands are functionally equivalent with this input:

    complete.cases(input)                                          # 1
    complete.cases(input[c("xcol","ycol","value")])                # 2
    !is.na(input$xcol) & !is.na(input$ycol) & !is.na(input$value)  # 3

    I used the first for brevity ("code golf"), but if your actual input data.frame contains other columns, you may prefer the second so as to be more selective about which columns must be non-NA.

  2. expand.grid works great for this type of expansion. However, if you are looking at significantly larger datasets (including if your filtering is more complex than %% offers), then it can get expensive, since it must create the entire data.frame in memory. Python's use of lazy iterators would be useful here, in which case you might prefer https://stackoverflow.com/a/36144255/3358272 (expanded function in a github gist with some docs: https://gist.github.com/r2evans/e5531cbab8cf421d14ed).

How do I calculate all the cor() between all members of a large dataset using apply instead of for loops?

Put your data into a data frame or matrix and use the built-in cor() function. Generally, you want to avoid explicit loops in R.

cor(yourData)
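
For instance, a minimal sketch with made-up data (the matrix and its column names here are hypothetical):

set.seed(1)
yourData <- matrix(rnorm(150), nrow = 50,
                   dimnames = list(NULL, c("v1", "v2", "v3")))
cor(yourData)   # 3x3 matrix of pairwise correlations between the columns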

replace nested loop with expand.grid and call inner function with multiple arguments

Suppose that df <- data.frame(a = 1:2, b = 3:4) and we call apply(df, 1, function(x) fun(x)). Then the two values of x passed to fun are the vectors c(1, 3) and c(2, 4).

However, when df <- expand.grid(c(1,2,3), c(median, mean)) and apply(df, 1, function(x) fun(x)) is done, we can no longer store, e.g., 1 and median in a single atomic vector because their types are too different. Instead, x is a list, e.g., x <- list(1, median). Consequently, x[1] or x[2] does not give 1 or median as desired; these are lists with a single element each (hence the error object 'b' of mode 'function' was not found). This can actually be seen in your debugging example.
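
A quick way to see this for yourself:

df <- expand.grid(c(1, 2, 3), c(median, mean))
apply(df, 1, function(x) class(x))       # each row arrives as a "list"
apply(df, 1, function(x) class(x[[1]]))  # [[ ]] is needed to reach the value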

So, here are some ways to use apply in your case:

1) do not modify testFunc but recognize that a list is passed by apply; in that case do.call helps, but it also cares about the names of the columns of df, so I also use unname:

apply(unname(df), 1, do.call, what = testFunc)

2) same as 1) but without do.call:

apply(df, 1, function(x) testFunc(x[[1]], x[[2]]))

3) testFunc redefined to have a single argument:

# rollapply() here comes from the zoo package
testFunc <- function(a) rollapply(mtcars, width = a[[1]], by = a[[1]], FUN = a[[2]], align = "left")
apply(df, 1, testFunc)

Find each combination of values from a list of ranges in R

With smaller numbers of combinations, this will work:

t(do.call(expand.grid, mapply(seq, Date1, Date2, SIMPLIFY = FALSE)))
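
For a concrete illustration with hypothetical integer stand-ins for the dates:

Date1 <- c(1, 5)
Date2 <- c(2, 6)
t(do.call(expand.grid, mapply(seq, Date1, Date2, SIMPLIFY = FALSE)))
#      [,1] [,2] [,3] [,4]
# Var1    1    2    1    2
# Var2    5    5    6    6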

Unfortunately, from your comment I infer that you have a relatively large number of combinations, which crushes any chance of dealing with all of them at once. I suggest you may find use in https://stackoverflow.com/a/36144255/3358272, slightly updated at https://gist.github.com/r2evans/e5531cbab8cf421d14ed. The point is to iterate over each combination and handle it individually.

Create all combinations of items with many items

With 50 participants you would create a data.frame with 3^50 ≈ 7.18e+23 rows, which is impossible to hold in memory. So this is a scaling problem.
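A back-of-the-envelope check (the 1-byte-per-cell figure is deliberately optimistic):

3^50
# [1] 7.178979e+23
# even at 1 byte per cell with 50 columns that is ~3.6e13 TB,
# so full enumeration is out; iterate lazily or sample instead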

Use outer instead of expand.grid

Using rep.int:

expand.grid.alt <- function(seq1, seq2) {
  cbind(rep.int(seq1, length(seq2)),
        c(t(matrix(rep.int(seq2, length(seq1)), nrow = length(seq2)))))
}

expand.grid.alt(seq_len(nrow(dat)), seq_len(ncol(dat)))

On my computer this is about 6 times faster than expand.grid.
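
A rough way to reproduce the comparison (assuming the microbenchmark package is installed; the exact speedup will vary with machine and input size):

library(microbenchmark)
microbenchmark(
  base = expand.grid(1:1000, 1:1000),
  alt  = expand.grid.alt(1:1000, 1:1000),
  times = 20
)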


