Python's xrange alternative for R OR how to loop over large dataset lazily?
One (arguably more "proper") way to approach this would be to write your own iterator for the iterators package that @BenBolker suggested (pdf on writing extensions is here). Lacking something more formal, here is a poor-man's iterator, similar to expand.grid but manually-advancing. (Note: this will suffice given that the computation on each iteration is "more expensive" than this function itself. This could really be improved, but "it works".)
This function returns a named list (with the provided factors) each time the returned function is called. It is lazy in that it does not expand the entire list of possibilities; it is not lazy with the arguments themselves, which are 'consumed' immediately.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  # start one step "before" the first combination
  indices <- c(0, rep(1, length(dots)-1))
  function() {
    indices[1] <<- indices[1] + 1
    # carry any position that ran past its size into the next position
    while (any(rolls <- (indices > sizes))) {
      # the last position rolled over: every combination has been visited
      if (tail(rolls, n=1)) return(FALSE)
      indices[rolls] <<- 1
      indices[ 1+which(rolls) ] <<- indices[ 1+which(rolls) ] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}
Sample usage:
nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
# a b c
# 1 1 15 21
nxt()
# a b c
# 1 2 15 21
nxt()
# a b c
# 1 3 15 21
nxt()
# a b c
# 1 1 16 21
## <yawn>
nxt()
# a b c
# 1 3 16 22
nxt()
# [1] FALSE
NB: for brevity of display, I used as.data.frame(mapply(...)) for the example; it works either way, but if a named list works fine for you then the conversion to a data.frame isn't necessary.
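A minimal sketch of how the iterator above might be consumed in a loop, stopping when it returns FALSE. The function definition is repeated from the answer only so that the block runs on its own; the `count` and `total` variables and the `a * b` computation are hypothetical stand-ins for real per-combination work.

```r
# Definition repeated from the answer so this block is self-contained.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  indices <- c(0, rep(1, length(dots) - 1))
  function() {
    indices[1] <<- indices[1] + 1
    while (any(rolls <- (indices > sizes))) {
      if (tail(rolls, n = 1)) return(FALSE)
      indices[rolls] <<- 1
      indices[1 + which(rolls)] <<- indices[1 + which(rolls)] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}

nxt <- lazyExpandGrid(a = 1:3, b = 15:16, c = 21:22)

count <- 0L
total <- 0
# nxt() yields a named list until the grid is exhausted, then FALSE.
while (!isFALSE(combo <- nxt())) {
  count <- count + 1L
  total <- total + combo$a * combo$b  # hypothetical "expensive" work
}
count  # 12 combinations (3 * 2 * 2), none of them held in memory at once
```

Only one combination exists at a time, so memory use stays flat regardless of how large the full grid would be.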
EDIT
Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[ length(indices) ]
  i <- 0
  function(index) {
    # no argument: advance the counter; otherwise seek to 'index'
    i <<- if (missing(index)) (i + 1L) else index
    # vector argument: look up each index in turn and row-bind the results
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    if (i > maxcount || i < 1L) return(FALSE)
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L ),
             argnames)
  }
}
It works with no arguments (auto-increment the internal counter), one argument (seek and set the internal counter), or a vector argument (seek to each and set the counter to the last, returns a data.frame).
This last use-case allows for sampling a subset of the design space:
set.seed(42)
nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
# a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
# a b c d e f
# 2 69 61 7 7 49 92
# 21 72 28 55 40 62 29
# 3 88 32 53 46 18 65
# 4 88 33 31 89 66 74
# 5 57 75 31 93 70 66
# 6 100 86 79 42 78 46
# 7 55 41 25 73 47 94
Thanks alexis_laz for the improvements of cumprod, Map, and index calculations!
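The index calculation in the improved version is a mixed-radix decomposition of the row number. As a small check, the sketch below applies the same expression to a 3 x 2 x 2 grid and compares every row against expand.grid. The helper name `index_to_positions` and the variable `cp` are made up for this illustration.

```r
sizes <- c(3L, 2L, 2L)
cp <- cumprod(c(1L, sizes))  # 1, 3, 6, 12

# Same arithmetic as in the closure: row number -> position in each argument.
index_to_positions <- function(i) {
  (i - 1L) %% cp[-1L] %/% cp[-length(cp)] + 1L
}

index_to_positions(4L)  # 1 2 1: fourth row is a[1], b[2], c[1]

# Compare all rows against the eager expansion.
eg <- expand.grid(a = 1:3, b = 1:2, c = 1:2)
pos <- t(sapply(seq_len(nrow(eg)), index_to_positions))
all(pos == as.matrix(eg))  # TRUE
```

Because each row is computed directly from its number, seeking to an arbitrary row costs the same as stepping to the next one.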
How can I avoid writing nested for loops for large data sets?
This is combinations with repetition. RcppAlgos is likely your best option out of the box, but at n = 1000L that's about 167 million combinations (just over 500 million integers in a three-column matrix) to go through, which will take up ~2 GB of RAM.
library(RcppAlgos)
n = 1000L
mat <- comboGeneral(n, 3L, repetition = TRUE)
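The size of that matrix can be estimated before building it. The sketch below uses only base R and assumes the result is stored as 4-byte integers; the variable names are illustrative.

```r
n <- 1000L
k <- 3L

# Combinations with repetition: choose(n + k - 1, k) rows.
n_combos <- choose(n + k - 1L, k)
n_combos            # 167167000

# One integer per cell, k cells per row.
n_entries <- n_combos * k
n_entries           # 501501000

# Assumption: 4 bytes per integer.
gib <- n_entries * 4 / 2^30
gib                 # roughly 1.87 GiB
```

A double vector with one element per row (such as the result of rowSums) would add about another 1.2 GiB on top of this.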
Now there are two routes to go. If you have the RAM and your function can be vectorized, you can do the above very quickly. Let's say that if the sum of a combination is greater than 1000 you want the mean of the combination; otherwise you want its sum.
res <- ifelse(rowSums(mat) > 1000L,
              rowMeans(mat),
              rowSums(mat))
## Error: cannot allocate vector of size 1.2 Gb
Oh no! I get the dreaded allocate vector error. RcppAlgos also lets you apply a function to each combination through the FUN argument and returns the results. But note that it returns a list and is a lot slower because it has to evaluate your R function instead of staying in C++. Because of this, I changed to n = 100L because I do not have all day...
comboGeneral(100L, 3L, repetition = TRUE,
             FUN = function(x) {
               if (sum(x) > 100L)
                 mean(x)
               else
                 sum(x)
             }
)
If I had a static set where I was always choosing 3 out of n, I would likely use Rcpp code directly, depending on what foo(a,b,c) and bar(a,b,c) are, but first I would like to know more about the functions.
What can substitute a nested loop in R
This seems more like a design-of-experiments, in a sense, where you are iterating over different possible values of x and y.
xs <- 2:6
ys <- 5:15
eg <- expand.grid(x = xs, y = ys)
head(eg)
# x y
# 1 2 5
# 2 3 5
# 3 4 5
# 4 5 5
# 5 6 5
# 6 2 6
I think your %% filtering should be done outside/before this, so:
xs <- xs[!xs %% 2]
ys <- ys[!ys %% 5]
eg <- expand.grid(x = xs, y = ys)
head(eg)
# x y
# 1 2 5
# 2 4 5
# 3 6 5
# 4 2 10
# 5 4 10
# 6 6 10
From here, you can just iterate over the rows:
eg$out <- sapply(seq_len(nrow(eg)), function(r) {
  sum(input$value[ complete.cases(input) & input$xcol < eg$x[r] & input$ycol < eg$y[r] ])
})
eg
# x y out
# 1 2 5 0
# 2 4 5 0
# 3 6 5 0
# 4 2 10 4
# 5 4 10 21
# 6 6 10 28
# 7 2 15 4
# 8 4 15 21
# 9 6 15 36
I think your output variable is a little off, since "2,15" should only include input$value[1] (x < 2 is the limiting factor). (Other differences exist.)
Regardless of your actual indexing logic, I suggest this methodology over a double-for or double-lapply implementation.
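For a version that can be run without the question's data, here is a minimal sketch of the same pattern. The `input` data frame below is invented for illustration (the question's actual data is not reproduced here), and vapply is used in place of sapply only to pin down the return type.

```r
# Hypothetical data; rows 4 and 5 are incomplete on purpose.
input <- data.frame(
  xcol  = c(1, 3, 5, NA, 2),
  ycol  = c(4, 8, 12, 6, NA),
  value = c(10, 20, 30, 40, 50)
)

eg <- expand.grid(x = c(2, 4, 6), y = c(5, 10, 15))

# Compute the completeness mask once rather than once per grid row.
ok <- complete.cases(input)

eg$out <- vapply(seq_len(nrow(eg)), function(r) {
  sum(input$value[ok & input$xcol < eg$x[r] & input$ycol < eg$y[r]])
}, numeric(1))

eg$out  # 10 10 10 10 30 30 10 30 60
```

Each grid row is one (x, y) threshold pair, so a single pass over the rows replaces the two nested loops.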
NB:
These commands are functionally equivalent with this input:
complete.cases(input) # 1
complete.cases(input[c("xcol","ycol","value")]) # 2
!is.na(input$xcol) & !is.na(input$ycol) & !is.na(input$value) # 3
I used the first since "code golf", but if your actual input data.frame contains other columns, you may prefer the second to be more selective of which columns require non-NA values.
expand.grid works great for this type of expansion. However, if you are looking at significantly larger datasets (including if your filtering is more complex than %% offers), then it can be a little expensive as it must create the entire data.frame in memory. Python's use of lazy iterators would be useful here, in which case you might prefer to use https://stackoverflow.com/a/36144255/3358272 (expanded function in a github gist with some docs: https://gist.github.com/r2evans/e5531cbab8cf421d14ed).
How do I calculate all the cor() between all members of a large dataset using apply instead of for loops?
Put your data into a data frame or matrix and use the built-in cor() function, which returns the correlation between every pair of columns in one call. Generally, you want to avoid using loops in R.
cor(yourData)
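As an illustration with a built-in dataset (mtcars stands in for yourData here, and the three columns are an arbitrary choice), the sketch below computes the full matrix and then lists each pair once.

```r
vars <- mtcars[, c("mpg", "hp", "wt")]

# One call gives the full pairwise correlation matrix.
cm <- cor(vars)

# Optional: list each unordered pair once, using the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
long <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
long
```

The matrix is symmetric with ones on the diagonal, so the upper triangle already contains every distinct pair.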
replace nested loop with expand.grid and call inner function with multiple arguments
Suppose that df <- data.frame(a = 1:2, b = 3:4) and we apply apply(df, 1, function(x) fun(x)). Then the two passed arguments x are vectors c(1, 3) and c(2, 4).
However, when df <- expand.grid(c(1,2,3), c(median, mean)) and apply(df, 1, function(x) fun(x)) is done, we can no longer store, e.g., 1 and median in a single vector because they are of different types. Then x happens to be a list, e.g., x <- list(1, median). Then, doing x[1] or x[2] does not give 1 and median as desired; instead these are lists with a single element (hence the error object 'b' of mode 'function' was not found). This can actually be seen in your debugging example.
So, here are some ways to use apply in your case:
1) do not modify testFunc but recognize that a list is passed by apply; in that case do.call helps, but it also cares about the names of the columns of df, so I also use unname:
apply(unname(df), 1, do.call, what = testFunc)
2) same as 1) but without do.call:
apply(dframe, 1, function(x) testFunc(x[[1]], x[[2]]))
3) testFunc redefined to have a single argument:
testFunc <- function(a) rollapply(mtcars, width = a[[1]], by = a[[1]], FUN = a[[2]], align="left")
apply(dframe, 1, testFunc)
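The list behaviour described above can be checked with a small stand-in. In this sketch `f` is a hypothetical replacement for testFunc (it just applies a function to 1..n), and KEEP.OUT.ATTRS = FALSE is passed only to skip building labels from the function objects. The final Map call is an alternative not mentioned in the answer: it walks the two columns in parallel and avoids the row conversion entirely.

```r
# Hypothetical stand-in for testFunc: apply `fun` to the numbers 1..n.
f <- function(n, fun) fun(seq_len(n))

grid <- expand.grid(c(1, 2, 3), c(median, mean), KEEP.OUT.ATTRS = FALSE)

# Each row reaches FUN as a list, so x[1] is still a list ...
row_classes <- apply(grid, 1, function(x) class(x[1]))

# ... while x[[1]] and x[[2]] extract the number and the function.
res_apply <- apply(grid, 1, function(x) f(x[[1]], x[[2]]))

# Alternative: iterate over the two columns in parallel.
res_map <- unlist(Map(f, grid[[1]], grid[[2]]))

res_apply  # 1.0 1.5 2.0 1.0 1.5 2.0
```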
Find each combination of values from a list of ranges in R
With smaller numbers of combinations, this will work:
t(do.call(expand.grid, mapply(seq, Date1, Date2, SIMPLIFY = FALSE)))
Unfortunately, from your comment I infer that you have a relatively large number of combinations, thereby crushing the chance of dealing with all of your combinations at once. I suggest you may find use out of https://stackoverflow.com/a/36144255/3358272, slightly updated at https://gist.github.com/r2evans/e5531cbab8cf421d14ed. The point is to iterate over each combination and do something with it individually.
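Before expanding anything, it may help to count how many combinations the ranges imply. The sketch below uses made-up numeric bounds `lo` and `hi` in place of Date1 and Date2, counts the combinations first, and only then runs the eager expansion from above on this small case.

```r
# Hypothetical bounds: element i of `lo` and `hi` delimits the i-th range.
lo <- c(1, 10, 100)
hi <- c(2, 12, 101)

ranges <- mapply(seq, lo, hi, SIMPLIFY = FALSE)

# Number of combinations, computed without building them.
n_combos <- prod(lengths(ranges))
n_combos     # 12

# Small enough here, so the eager expansion is fine.
combos <- t(do.call(expand.grid, ranges))
dim(combos)  # 3 12: one column per combination
```

When `n_combos` is too large to hold in memory, that is the point at which iterating over combinations one at a time becomes the more practical route.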
Create all combinations of items with many items
With 50 participants you create a data frame with 3^50 = 7.17898e+23 rows, which is impossible to hold in memory. So I think it is a scaling problem.
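To put a number on it, here is a rough size estimate, assuming one 4-byte integer per participant per row (an assumption about storage, not something stated in the question). The last lines sketch one possible workaround, drawing a single random combination directly instead of enumerating all of them; the seed and variable names are illustrative.

```r
n_rows <- 3^50
n_rows                    # about 7.18e+23

# Assumption: 50 integer columns at 4 bytes each.
bytes <- n_rows * 50 * 4
bytes / 2^40              # size in TiB, far beyond any machine

# One random combination, drawn without building the grid.
set.seed(1)
one_row <- sample(3L, 50L, replace = TRUE)
length(one_row)           # 50
```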
Use outer instead of expand.grid
Using rep.int:
expand.grid.alt <- function(seq1,seq2) {
  cbind(rep.int(seq1, length(seq2)),
        c(t(matrix(rep.int(seq2, length(seq1)), nrow=length(seq2)))))
}
expand.grid.alt(seq_len(nrow(dat)), seq_len(ncol(dat)))
On my computer this is about 6 times faster than expand.grid.
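A quick way to confirm that the rep.int version produces the same rows, in the same order, as expand.grid is sketched below. The function is repeated so the block runs on its own; the test vectors and the timing sizes are arbitrary, and timings will differ by machine.

```r
expand.grid.alt <- function(seq1, seq2) {
  cbind(rep.int(seq1, length(seq2)),
        c(t(matrix(rep.int(seq2, length(seq1)), nrow = length(seq2)))))
}

s1 <- 1:3
s2 <- 1:2
alt <- expand.grid.alt(s1, s2)
ref <- as.matrix(expand.grid(s1, s2))

same <- all(alt == ref)  # TRUE: same rows in the same order

# The second column is just each element of seq2 repeated length(seq1) times.
same_each <- identical(alt[, 2], rep(s2, each = length(s1)))

# Rough timing comparison on a larger case.
big1 <- seq_len(2000)
big2 <- seq_len(500)
system.time(expand.grid(big1, big2))
system.time(expand.grid.alt(big1, big2))
```

Note that the result is a plain matrix rather than a data frame, which is part of why it is cheaper to build.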