How to Optimize the Following Code with Nested While-Loop? Multicore an Option

As @CarlWitthoft already mentioned, you should rethink your data structure, because it contains a lot of duplicated data.

Here is a simple vectorized approach:

## create all possible ranges of months
ranges <- mapply(function(mi, ma) seq(from = mi, to = ma, by = "month"),
                 mi = extremes$min, ma = extremes$max)

## how many months per ID?
n <- unlist(lapply(ranges, length))

## create new data.frame
output <- data.frame(X_BusinessIDDescription = rep(extremes$X_BusinessIDDescription, n),
                     min = rep(extremes$min, n),
                     max = rep(extremes$max, n),
                     month = as.Date(unlist(ranges), origin = "1970-01-01"),
                     stringsAsFactors = FALSE)

Comparison to your approach:

extremes <- data.frame(X_BusinessIDDescription = c("ID105", "ID206", "ID204", "ID785", "ID125", "ID107"),
                       min = as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
                       max = as.Date(c("2008-06-01", "2009-07-01", "2008-02-01", "2010-08-01", "2008-07-01", "2011-06-01")),
                       month = as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
                       stringsAsFactors = FALSE)

approachWhile <- function(extremes) {
  output <- data.frame(X_BusinessIDDescription = NA, min = as.Date("1970-01-01"),
                       max = as.Date("1970-01-01"), month = as.Date("1970-01-01"),
                       stringsAsFactors = FALSE)
  IDcounter <- 1
  IDmax <- nrow(extremes)
  linecounter <- 1
  while (IDcounter <= IDmax) {
    start <- extremes$min[IDcounter]
    end <- extremes$max[IDcounter] # add three months
    while (start <= end) {
      output[linecounter, ] <- extremes[IDcounter, ]
      output$month[linecounter] <- start
      linecounter <- linecounter + 1
      start <- seq(start, by = "month", length = 2)[2]
    }
    IDcounter <- IDcounter + 1
  }
  return(output)
}

approachMapply <- function(extremes) {
  ranges <- mapply(function(mi, ma) seq(from = mi, to = ma, by = "month"),
                   mi = extremes$min, ma = extremes$max)

  n <- unlist(lapply(ranges, length))

  output <- data.frame(X_BusinessIDDescription = rep(extremes$X_BusinessIDDescription, n),
                       min = rep(extremes$min, n),
                       max = rep(extremes$max, n),
                       month = as.Date(unlist(ranges), origin = "1970-01-01"),
                       stringsAsFactors = FALSE)
  return(output)
}

identical(approachWhile(extremes), approachMapply(extremes)) ## TRUE

library("rbenchmark")

benchmark(approachWhile(extremes), approachMapply(extremes), order="relative")
#                       test replications elapsed relative user.self sys.self
# 2 approachMapply(extremes)          100   0.176     1.00     0.172    0.000
# 1  approachWhile(extremes)          100   6.102    34.67     6.077    0.008

How to use multicore with loops in R

The variables M_xx, G_xx, S_xx and Q_xx are independent, and a lot of duplicate values are generated. I've split these 4 variables into separate loops and then generated all combinations considering only the unique values. See the code:

M <- NULL

for(M_P in 0:9) for(M_D in 0:(9-M_P)) for(M_A in 0:(9-M_P-M_D)) for(M_CC in 0:(9-M_P-M_D-M_A)) for(M_CD in (9-M_P-M_D-M_A-M_CC))
{
M[length(M)+1] <- 1.1*M_P+2.1*M_D+3.1*M_A+4.1*M_CC+4*M_CD
}

G <- NULL

for(G_D in 0:9) for(G_A in 0:(9-G_D)) for(G_CC in 0:(9-G_D-G_A)) for(G_CD in (9-G_D-G_A-G_CC))
{
G[length(G)+1] <- 2*G_D+5*G_A+1.5*G_CC+3*G_CD
}

S <- NULL

for(S_D in 0:9) for(S_A in 0:(9-S_D)) for(S_CC in 0:(9-S_D-S_A)) for(S_CD in (9-S_D-S_A-S_CC))
{
S[length(S)+1] <- 5*S_D+4*S_A+3*S_CC+6*S_CD
}

Q <- NULL

for(Q_P in 0:3) for(Q_D in 0:(3-Q_P)) for(Q_A in 0:(3-Q_P-Q_D)) for(Q_CC in 0:(3-Q_P-Q_D-Q_A)) for(Q_CD in (3-Q_P-Q_D-Q_A-Q_CC))
{
Q[length(Q)+1] <- 2*Q_P+3*Q_D+2.2*Q_A+3*Q_CC+4*Q_CD
}

max(apply(expand.grid(unique(M), unique(G), unique(S), unique(Q)), 1, sum))
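
Since the question asks about multicore: the four blocks above are independent, so as a minimal sketch (not part of the original answer) they could be wrapped in functions and evaluated on separate cores with the parallel package. mclapply() forks, so this assumes a Unix-alike system; on Windows a cluster with parLapply() would be needed instead.

library(parallel)

## wrap each of the four independent blocks above in a function ...
makeM <- function() {
  M <- NULL
  for (M_P in 0:9) for (M_D in 0:(9-M_P)) for (M_A in 0:(9-M_P-M_D))
    for (M_CC in 0:(9-M_P-M_D-M_A)) for (M_CD in (9-M_P-M_D-M_A-M_CC))
      M[length(M)+1] <- 1.1*M_P + 2.1*M_D + 3.1*M_A + 4.1*M_CC + 4*M_CD
  M
}
makeG <- function() {
  G <- NULL
  for (G_D in 0:9) for (G_A in 0:(9-G_D)) for (G_CC in 0:(9-G_D-G_A))
    for (G_CD in (9-G_D-G_A-G_CC))
      G[length(G)+1] <- 2*G_D + 5*G_A + 1.5*G_CC + 3*G_CD
  G
}
makeS <- function() {
  S <- NULL
  for (S_D in 0:9) for (S_A in 0:(9-S_D)) for (S_CC in 0:(9-S_D-S_A))
    for (S_CD in (9-S_D-S_A-S_CC))
      S[length(S)+1] <- 5*S_D + 4*S_A + 3*S_CC + 6*S_CD
  S
}
makeQ <- function() {
  Q <- NULL
  for (Q_P in 0:3) for (Q_D in 0:(3-Q_P)) for (Q_A in 0:(3-Q_P-Q_D))
    for (Q_CC in 0:(3-Q_P-Q_D-Q_A)) for (Q_CD in (3-Q_P-Q_D-Q_A-Q_CC))
      Q[length(Q)+1] <- 2*Q_P + 3*Q_D + 2.2*Q_A + 3*Q_CC + 4*Q_CD
  Q
}

## ... then evaluate each generator on its own core and combine as before
parts <- mclapply(list(makeM, makeG, makeS, makeQ),
                  function(f) unique(f()), mc.cores = 4)
max(apply(expand.grid(parts[[1]], parts[[2]], parts[[3]], parts[[4]]), 1, sum))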

The nested loop cannot be stopped

Using next skips the increment of k, so the inner loop never advances. Try this version of your loop:

test = data.frame(ID = c(1, 2, 3), s = c(0.4, 0.3, 0.3),
                  j1 = c(0.3, 0.22, 0.15), j2 = c(0.11, 0.58, 0.02))

j = 1
k = 1

firstsum = 0
tm1 <- system.time(
  while (j <= nrow(test)) {
    while (k <= nrow(test)) {
      if (k != j) {
        for (i in 3:4) {
          normindator = 0
          denominator = 0

          normindator = normindator + (test[j, i] * test[k, i])
          denominator = denominator + test[j, i] * test[j, i]
          firstsum = firstsum + normindator / denominator * test[k, 2]
        }
      }
      k = k + 1
    }
    secondsum = 0
    secondsum = secondsum + firstsum * test[j, 2]
    j = j + 1
    k = 1
  }
)

Optimizing (Vectorizing?) For loop with nested loop in R

Several tips:

0) Profile your code using Rprof, with the line.profiling option (a minimal sketch follows this list).

1) Matrices in R are stored column-wise. Because you compare your vectors against each other, it will be much faster if you store them as columns of your matrix.

2) I do not know where the rdist function comes from, but you should avoid as.matrix(rdist(tab[i,], tab)), which copies and creates a new matrix.

3) You can optimize your knn() function, which sorts the same vector 4 times.

4) Why not just rdist(tab)?
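
For tip 0, here is a minimal line-profiling sketch (the file name "my_knn_script.R" is a placeholder for wherever the code being profiled lives):

Rprof("profile.out", line.profiling = TRUE)   # start the sampling profiler
source("my_knn_script.R", keep.source = TRUE) # run the code to be profiled
Rprof(NULL)                                   # stop profiling
summaryRprof("profile.out", lines = "show")   # report time spent per source line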

Optimize a sorting while loop

First use the profiler (Rprof) to find where time is being spent. Next, try to replace data frames with matrices -- data frames are very slow to access. Then you will know where to focus.
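
As a rough illustration of the data-frame-vs-matrix point, with made-up data (not taken from the question):

df <- data.frame(a = runif(1e4), b = runif(1e4))
m  <- as.matrix(df)
system.time(for (i in seq_len(1e4)) x <- df[i, 1])  # element access on a data.frame
system.time(for (i in seq_len(1e4)) x <- m[i, 1])   # the same access on a matrix is much faster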

How to optimize a simple loop?

  1. This loop is absolutely vectorizable by the compiler. But make sure the loop was actually vectorized (using the compiler's -qopt-report=5, the assembly output, Intel (Vectorization) Advisor, or whatever other technique). One more, overkill, way to do that is to create a performance baseline using the -no-vec option (which disables both ivdep-driven and auto-vectorization) and then compare execution time against it. This is not a good way to check for the presence of vectorization, but it is useful for the general performance analysis in the next bullets.

If the loop hasn't actually been vectorized, make sure you push the compiler to auto-vectorize it; how to do that is covered in the next bullet. Note that the next bullet can be useful even if the loop was successfully auto-vectorized.


  2. To push the compiler to vectorize it, use: (a) the restrict keyword to "disambiguate" the a and b pointers (someone has already suggested it to you); (b) #pragma omp simd (which has the extra bonus of being more portable and much more flexible than ivdep, but also the drawback of being unsupported in old compilers before Intel Compiler version 14 and of being more "dangerous" for other loops). To re-emphasize: this bullet may seem to do the same thing as ivdep, but depending on the circumstances it can be the better and more powerful option.

  3. The given loop has fine-grained iterations (too little computation per iteration) and overall is not purely compute-bound (the effort/cycles the CPU spends loading/storing data from/to cache/memory is comparable to, if not greater than, the effort/cycles spent on the multiplication). Unrolling is often a good way to mitigate such disadvantages slightly, but I would recommend explicitly asking the compiler to unroll it, using #pragma unroll. In fact, for certain compiler versions the unrolling will happen automatically. Again, you can check whether the compiler did it using -qopt-report=5, the loop assembly, or Intel (Vectorization) Advisor.

  4. In the given loop you are dealing with a "streaming" access pattern, i.e. you contiguously load/store data from/to memory (and the cache subsystem will not help much for large values of n). So, depending on the target hardware, the use of multi-threading (on top of SIMD), etc., your loop will likely become memory-bandwidth bound in the end. Once you are memory-bandwidth bound, you can use techniques like loop blocking, non-temporal stores and aggressive prefetching. All of these techniques are worth a separate article, although for prefetching/NT-stores the Intel Compiler has some pragmas to play with.

  5. If n is huge and you are already prepared for memory-bandwidth trouble, you could use something like #pragma omp parallel for simd, which will simultaneously thread-parallelize and vectorize the loop. However, the quality of this feature became decent only in very fresh compiler versions AFAIK, so maybe you'd prefer to split n semi-manually, i.e. n = n1 x n2 x n3, where n1 is the number of iterations to distribute among threads, n2 is for cache blocking and n3 is for vectorization. Rewrite the given loop as a nest of 3 loops, where the outer loop has n1 iterations (and #pragma omp parallel for is applied), the next level has n2 iterations, and n3 is the innermost (where #pragma omp simd is applied).

Some up-to-date links with syntax examples and more info:

  • unroll: https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling
  • OpenMP SIMD pragma (not so fresh and detailed, but still relevant): https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
  • restrict vs. ivdep
  • NT-stores and prefetching : https://software.intel.com/sites/default/files/managed/22/a3/mtaap2013-prefetch-streaming-stores.pdf

Note 1: I apologize for not providing code snippets here. There are at least 2 justifiable reasons for not doing so: 1. My 5 bullets are applicable to very many kernels, not just yours. 2. The specific combination of pragmas/manual rewriting techniques, and the corresponding performance results, will vary depending on the target platform, ISA and compiler version.

Note 2: a last comment regarding your GPU question. Think of your loop versus simple industry benchmarks like LINPACK or STREAM; in fact, your loop could end up looking very similar to some of them. Now think of the characteristics of x86 CPUs, and especially the Intel Xeon Phi platform, on LINPACK/STREAM. They are very good indeed and will become even better with high-bandwidth-memory platforms (like the 2nd-generation Xeon Phi). So theoretically there is no single reason to think that your given loop does not map well to at least some variants of x86 hardware (note that I did not claim the same for an arbitrary kernel).

Row col extraction given random generated numbers

'RX' has just a single row, so we need to replicate it to make the dimensions the same:

FX == RX[col(FX)]

Also, these are floating-point values, so precision is involved, i.e. the values may not be exactly equal unless we round them or compare with a tolerance.
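
A small illustration with made-up numbers (FX and RX are the names used in the question; here RX is simply taken from a row of FX):

FX <- matrix(round(runif(12), 2), nrow = 4)
RX <- FX[2, , drop = FALSE]       # a single-row matrix standing in for RX
FX == RX[col(FX)]                 # RX is recycled so the dimensions match; row 2 is all TRUE
abs(FX - RX[col(FX)]) < 1e-8      # tolerance-based comparison for floating-point values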

Find minimums with R (1 Variable X, n times a fixed parameter U)

Since X * log(X) = 1 / (1 - U[i]) can be solved numerically for any U[i], each term (X*log(X) - 1/(1-U[i]))^2 can be driven to zero, so there is a solution for each distinct U[i]. If the U[i] are all distinct, that means there are length(U) solutions. The solutions are given by the following (the unique can be omitted if the U[i] are all distinct):

f <- function (X, a) ((X*log(X)-(1/(1-a)))^2)
unique(sapply(U, function(a) optimize(f, c(0, 1000000), a = a)$minimum))
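
A quick check with made-up U values (U itself comes from the question; f is the function defined just above). Each returned X should satisfy X * log(X) = 1 / (1 - U[i]) up to numerical precision:

U <- c(0.2, 0.5, 0.9)
sol <- sapply(U, function(a) optimize(f, c(0, 1000000), a = a)$minimum)
sol * log(sol)    # compare with the targets below
1 / (1 - U)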

