Digging into R profiling information

It seems that the short answer is "No" and the long answer is "Yes, but you're not going to enjoy this." Even answering this question is going to take some time (so stick around, I may be updating it).

There are several basic things to get one's head around when working with profiling in R:

First, there are many different ways to think about profiling. It is quite typical to think in terms of a call stack. At any given instant, this is the sequence of function calls that are active, essentially nested within each other (subroutines, if you will). This is quite useful for understanding the state of evaluations, where functions will return, and lots of other things that are important for seeing things as the computer / interpreter / OS may see them. Rprof does call stack profiling.
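
For concreteness, here is a minimal sketch of turning the sampler on and off (the file name and sampling interval are my own choices, not anything from the original analysis):

Rprof("profile.out", interval = 0.02)        # start sampling the call stack
res <- replicate(20, sum(sort(runif(1e5))))  # some work to profile
Rprof(NULL)                                  # stop profiling
head(readLines("profile.out"))               # each line after the header is one sampled stack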

Second, a different perspective is that I've got a bunch of code and a particular call is taking a long time: which line in my code caused that call to be made? This is line profiling. R doesn't have line profiling, as far as I can tell. This is in contrast with Python and Matlab, which both have line profilers.

Third, the map from lines to calls is surjective, but it is not bijective: given a particular call stack, we cannot guarantee that we can map it back to the code. In fact, call stack analyses often summarize the calls completely out of context of the whole stack (i.e. cumulative times are reported no matter where that call was on all of the different stacks in which it occurred).

Fourth, even though we have these constraints, we can put on our statistical hats and analyze the call stack data carefully and see what we can make of it. The call stack information is data and we like data, don't we? :)

A quick intro to the call stack. Let's assume that our call stack looked like this:

"C" "B" "A"

This means that function A called B, which then called C (the order is reversed), and the call stack is 3 levels deep. In my code, the call stack gets as deep as 41 levels. Since the stacks can be so deep and are presented in reverse order, this is more interpretable by software than by a human. Naturally, we begin cleaning and transforming this data. :)

Now, our data really comes along looking like:

".Call" "subCsp_cols" "[" "standardGeneric" "[" "eval" "eval" "callGeneric"
"[" "standardGeneric" "[" "myFunc2" "myFunc1" "eval" "eval" "doTryCatch"
"tryCatchOne" "tryCatchList" "tryCatch" "FUN" "lapply" "mclapply"
"<Anonymous>" "%dopar%"

Miserable, isn't it? It even has duplicates of things like eval, some guy called <Anonymous> - probably some darn hacker. (Anonymous is legion, by the way. :-))

The first step in transforming this into something useful was to split each line of Rprof() output and reverse the entries (via strsplit and rev). The first 12 entries (last 12 if you look at the raw call stack, rather than the post-rev version) were the same for every line (of which there were about 12000, the sampling interval was 0.5 seconds - so about 100 minutes of profiling), and these can be discarded.
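
A sketch of that cleaning step might look like the following (the file name and the shared prefix length of 12 are just the values from my runs; adjust to taste):

raw <- readLines("profile.out")[-1]                      # drop the "sample.interval=" header
stacks <- lapply(strsplit(raw, " ", fixed = TRUE),
                 function(s) rev(gsub('"', "", s)))      # strip quotes, outermost call first
n_common <- 12                                           # entries shared by every stack
stacks <- lapply(stacks, function(s) s[-seq_len(n_common)])
stacks <- Filter(length, stacks)                         # drop stacks that were only the shared prefix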

Remember, we're still interested in knowing which line(s) led to .Call, which took so much time. Before we get to that question, we put on our statistical caps: the profiling reports, e.g. from summaryRprof, profr, ggplot, etc., only reflect the cumulative time spent for a given call or for calls beneath a given call. What does this cumulative information not tell us? Bingo: whether that call was made many times, or a few, and whether the time spent was constant over all invocations of that call or whether there are some outliers. A particular function might be executed 100 times or 100K times, but all of the cost may come from a single invocation (it shouldn't, but we don't know until we look at the data).
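
Continuing the sketch above, the cumulative view boils down to counting samples, which says nothing about invocations (".Call" here is just the call from my example stacks):

total_hits <- sum(vapply(stacks, function(s) ".Call" %in% s, logical(1)))
self_hits  <- sum(vapply(stacks, function(s) identical(tail(s, 1), ".Call"), logical(1)))
c(total = total_hits, self = self_hits)   # sample counts, not invocation counts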

This only begins to describe the fun. The A->B->C example doesn't reflect the way things may really appear, such as A->B->C->D->B->E. Now, "B" may be counted a couple of times. What's more, suppose that a lot of time is spent in the C level, but we never sample at precisely that level, only seeing its child calls in the stack. We may see a sizable time for "total.time", but not for "self.time". If there are lots of different child calls under C, we may lose sight of what to optimize - should we take out C altogether or tweak the children, B, D, and E?

Just to account for the time spent, I took the sequences and ran them through digest, storing counts for the digested values, via hash. I also split up the sequences, storing {(A),(A,B), (A,B,C), etc.}. This doesn't seem so interesting, but removing singletons from the counts helps a lot in cleaning up the data. We can also store the time spent in each call by using rle(). This is useful for analyzing the distribution of time spent for a given call.
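
A rough sketch of this bookkeeping, using digest and hash for the prefix counts, and one way of using rle() on the innermost calls (run lengths are in units of the sampling interval; this is illustrative, not a polished tool):

library(digest)
library(hash)

prefix_counts <- hash()
for (s in stacks) {
  for (k in seq_along(s)) {                   # store (A), (A,B), (A,B,C), etc.
    key <- digest(s[seq_len(k)])
    count <- if (has.key(key, prefix_counts)) prefix_counts[[key]] + 1L else 1L
    prefix_counts[[key]] <- count
  }
}

top_calls <- vapply(stacks, function(s) tail(s, 1), character(1))
runs <- rle(top_calls)                        # consecutive samples with the same innermost call
tapply(runs$lengths, runs$values, summary)    # distribution of "time on top" per call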

Still, we're no closer to finding the actual time spent per line of code. We'll never get lines of code from the call stack. A simpler way to do this is to store a list of times throughout the code, recording the output of proc.time() at each point for a given invocation. Taking the differences of these times reveals which lines or sections of code are taking a long time. (Hint: that's what we're really looking for, not the actual calls.)
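
A minimal version of that instrumentation might look like this (the mark() helper and the labels are mine; the two sections are placeholders for real code):

checkpoints <- list()
mark <- function(label) checkpoints[[label]] <<- proc.time()

mark("start")
X <- matrix(rnorm(1e6), ncol = 100)      # section 1: build the data
mark("built data")
fit <- lm(X[, 1] ~ X[, -1])              # section 2: fit a model
mark("fit model")

# elapsed seconds spent between consecutive checkpoints
diff(vapply(checkpoints, function(p) p[["elapsed"]], numeric(1)))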

However, we have this call stack and we might as well do something useful. Going up the stack is somewhat interesting, but if we rewind the profile information to a little earlier, we can find which calls tend to precede the longer running calls. This allows us to look for landmarks in the call stack - positions where we can tie a call to a particular line of code. This makes it a bit easier to map more calls back to code, if all we have is the call stack, rather than instrumented code. (As I keep mentioning: out of context, there isn't a 1:1 mapping, but at a fine enough granularity, especially in repeatedly hit calls that are distinctive, you may be able to find landmarks in the calls that map to code.)

Altogether, I was able to find which calls were taking a lot of time, whether that was based on 1 long interval or many small ones, what the distribution of time spent was like, and, with some effort, I was able to map the most important & time consuming calls back to the code and discover which parts of the code could benefit the most from rewriting or from a change in algorithms.

Statistical analyses of the call stack are loads of fun, but investigating a particular call based on cumulative time consumption is not a very good way to go. The cumulative time consumed by a call is informative on a relative basis, but it doesn't tell us whether one call or many consumed this time, nor the depth of the call in the stack, nor the section of code responsible for the invocations. The first two things can be addressed via a bit more R code, while the latter is best pursued through instrumented code.

As R doesn't yet have line profilers like Python and Matlab, the simplest way to handle this is to just instrument one's code.

Profiling data.table's setkey operation with Rprof

The problem was that the darwin machine was running Snow Leopard with a 64-bit kernel, which is not the default for that OS X version.

I also verified that this is not a problem for another darwin machine running Mountain Lion, which uses a 64-bit kernel by default. So it's specifically an interaction between Snow Leopard and running a 64-bit kernel.

As a side note, the official OS X binary installer for R is still built with Snow Leopard, so I do think that this issue is still relevant, as Snow Leopard is still a widely-used OS X version.

When the 64-bit kernel in Snow Leopard is enabled, no kernel extensions that are compatible only with the 32-bit kernel are loaded. After booting into the default 32-bit kernel for Snow Leopard, kextfind shows that these 32-bit only kernel extensions are on the machine and (most likely) loaded:

$ kextfind -not -arch x86_64
/System/Library/Extensions/ACard6280ATA.kext
/System/Library/Extensions/ACard62xxM.kext
/System/Library/Extensions/ACard67162.kext
/System/Library/Extensions/ACard671xSCSI.kext
/System/Library/Extensions/ACard6885M.kext
/System/Library/Extensions/ACard68xxM.kext
/System/Library/Extensions/AppleIntelGMA950.kext
/System/Library/Extensions/AppleIntelGMAX3100.kext
/System/Library/Extensions/AppleIntelGMAX3100FB.kext
/System/Library/Extensions/AppleIntelIntegratedFramebuffer.kext
/System/Library/Extensions/AppleProfileFamily.kext/Contents/PlugIns/AppleIntelYonahProfile.kext
/System/Library/Extensions/IO80211Family.kext/Contents/PlugIns/AirPortAtheros.kext
/System/Library/Extensions/IONetworkingFamily.kext/Contents/PlugIns/AppleRTL8139Ethernet.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/InternalModemSupport.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/MotorolaSM56KUSB.kext
/System/Library/Extensions/JMicronATA.kext
/System/Library/Extensions/System.kext/PlugIns/BSDKernel6.0.kext
/System/Library/Extensions/System.kext/PlugIns/IOKit6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Libkern6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Mach6.0.kext
/System/Library/Extensions/System.kext/PlugIns/System6.0.kext
/System/Library/Extensions/ufs.kext

So it could be any one of those loaded extensions that enables something for Rprof to use, so that the setkey operation in data.table is profiled correctly.

If anyone wants to investigate this further, dig a bit deeper, and get to the root cause of the problem, please post an answer and I'll happily accept that one.

Preventing performance regressions in R

This is a very challenging question, and one that I'm frequently dealing with, as I swap out different code in a package to speed things up. Sometimes a performance regression comes along with a change in algorithms or implementation, but it may also arise due to changes in the data structures used.

What is a good workflow for detecting performance regressions in R packages?

In my case, I tend to have very specific use cases that I'm trying to speed up, with different fixed data sets. As Spacedman wrote, it's important to have a fixed computing system, but that's almost infeasible: sometimes a shared computer may have other processes that slow things down 10-20%, even when it looks quite idle.

My steps:

  1. Standardize the platform (e.g. one or a few machines, a particular virtual machine, or a virtual machine + specific infrastructure, a la Amazon's EC2 instance types).
  2. Standardize the data set that will be used for speed testing.
  3. Create scripts and fixed intermediate data output (i.e. saved to .rdat files) that involve very minimal data transformations. My focus is on some kind of modeling, rather than data manipulation or transformation. This means that I want to give exactly the same block of data to the modeling functions. If, however, data transformation is the goal, then be sure that the pre-transformed/manipulated data is as close as possible to standard across tests of different versions of the package. (See this question for examples of memoization, caching, etc., that can be used to standardize or speed up non-focal computations. It references several packages by the OP.)
  4. Repeat tests multiple times.
  5. Scale the results relative to fixed benchmarks, e.g. the time to perform a linear regression, to sort a matrix, etc. This can allow for "local" or transient variations in infrastructure, such as may be due to I/O, the memory system, dependent packages, etc. (See the sketch after this list.)
  6. Examine the profiling output as vigorously as possible (see this question for some insights, also referencing tools from the OP).
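
For step 5, a small sketch of what I mean by scaling against a fixed benchmark (the reference task, repetition count, and the function under test are placeholders):

reference_time <- function(reps = 5) {
  x <- matrix(rnorm(2e5), ncol = 20)
  median(replicate(reps, system.time(lm(x[, 1] ~ x[, -1]))[["elapsed"]]))
}

scaled_timing <- function(f, reps = 5) {
  target <- median(replicate(reps, system.time(f())[["elapsed"]]))
  target / reference_time(reps)             # unitless ratio, more comparable across machines and runs
}

# scaled_timing(function() my_model_function(standard_data))   # hypothetical usage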

    Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.

    Unfortunately, I don't have an answer for this.

    What is a good workflow in general?

    For me, it's quite similar to general dynamic code testing: is the output (execution time in this case) reproducible, optimal, and transparent? Transparency comes from understanding what affects the overall time. This is where Mike Dunlavey's suggestions are important, but I prefer to go further, with a line profiler.

    Regarding a line profiler, see my previous question, which refers to options in Python and Matlab for other examples. It's most important to examine clock time, but also very important to track memory allocation, number of times the line is executed, and call stack depth.

    What other languages provide good tools?

    Almost all other languages have better tools. :) Interpreted languages like Python and Matlab have the good & possibly familiar examples of tools that can be adapted for this purpose. Although dynamic analysis is very important, static analysis can help identify where there may be some serious problems. Matlab has a great static analyzer that can report when objects (e.g. vectors, matrices) are growing inside of loops, for instance. It is terrible to find this only via dynamic analysis - you've already wasted execution time to discover something like this, and it's not always discernible if your execution context is pretty simple (e.g. just a few iterations, or small objects).

    As far as language-agnostic methods, you can look at:

    1. Valgrind & cachegrind
    2. Monitoring of disk I/O, dirty buffers, etc.
    3. Monitoring of RAM (Cachegrind is helpful, but you could also just monitor RAM allocation and other details of RAM usage)
    4. Usage of multiple cores

    Is it something that can be built on top of unit testing, or that is usually done separately?

    This is hard to answer. For static analysis, it can occur before unit testing. For dynamic analysis, one may want to add more tests. Think of it as sequential design (i.e. from an experimental design framework): if the execution costs appear to be, within some statistical allowances for variation, the same, then no further tests are needed. If, however, method B seems to have an average execution cost greater than method A, then one should perform more intensive tests.
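
As a concrete illustration of that sequential decision (the method names and data are placeholders), one could compare timing samples from the two methods directly and only dig further when the difference looks real:

library(microbenchmark)

bm <- microbenchmark(A = method_A(standard_data),
                     B = method_B(standard_data),
                     times = 200)
summary(bm)
with(bm, wilcox.test(time[expr == "A"], time[expr == "B"]))  # timings are in nanoseconds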


Update 1: If I may be so bold, there's another question that I'd recommend including, which is: "What are some gotchas in comparing the execution time of two versions of a package?" This is analogous to assuming that two programs that implement the same algorithm should have the same intermediate objects. That's not exactly true (see this question - not that I'm promoting my own questions, here - it's just hard work to make things better and faster...leading to multiple SO questions on this topic :)). In a similar way, two executions of the same code can differ in time consumed due to factors other than the implementation.

So, some gotchas that can occur, either within the same language or across languages, within the same execution instance or across "identical" instances, which can affect runtime:

  1. Garbage collection - different implementations or languages can hit garbage collection under different circumstances. This can make two executions appear different, though it can be very dependent on context, parameters, data sets, etc. The GC-obsessive execution will look slower. (See the small illustration after this list.)
  2. Caching at the level of the disk, motherboard (e.g. L1, L2, L3 caches), or other levels (e.g. memoization). Often, the first execution will pay a penalty.
  3. Dynamic voltage scaling - This one sucks. When there is a problem, this may be one of the hardest beasties to find, since it can go away quickly. It looks like cacheing, but it isn't.
  4. Any job priority manager that you don't know about.
  5. One method uses multiple cores or does some clever stuff about how work is parceled among cores or CPUs. For instance, getting a process locked to a core can be useful in some scenarios. One execution of an R package may be luckier in this regard, another package may be very clever...
  6. Unused variables, excessive data transfer, dirty caches, unflushed buffers, ... the list goes on.
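
Regarding gotcha 1, a small illustration of keeping garbage collection visible and starting each timed run from a clean slate (purely illustrative):

gcinfo(TRUE)                      # print a message whenever a GC happens during the run
invisible(gc(reset = TRUE))       # collect now and reset the max-memory counters
t_run <- system.time(replicate(50, mean(rnorm(1e6))))
gc()                              # how much was allocated and collected during the run
gcinfo(FALSE)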

The key question is: Ideally, how should we test for differences in expected values, subject to the randomness created due to order effects? Well, pretty simple: go back to experimental design. :)

When the empirical differences in execution times are different from the "expected" differences, it's great to have enabled additional system and execution monitoring so that we don't have to re-run the experiments until we're blue in the face.

Does a line profiler for code require a parse tree and is that sufficient?

I'd say that yes, you require a parse tree (and the source) - how else would you know what constitutes a "line" and a valid statement?

A practical simplification, though, might be a "statement profiler" instead of a "line profiler".
In R, the parse tree is readily available: body(theFunction), so it should be fairly easy to insert measuring code around each statement. With some more work you can insert it around a group of statements that belong to the same line.

In R, the body of a function loaded from a file typically also has an attribute srcref that lists the source for each "line" (actually each statement):

Here's a sample function (put in "example.R"):

f <- function(x, y=3)
{
    a <- 0; a <- 1  # Two statements on one line
    a <- (x + 1) *  # One statement on two lines
         (y + 2)

    a <- "foo
bar"                # One string on two lines
}

Then in R:

source("example.R")
dput(attr(body(f), "srcref"))

Which prints this line/column information:

list(structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L), srcfile = <environment>, class = "srcref"), 
structure(c(3L, 2L, 3L, 7L, 9L, 14L, 3L, 3L), srcfile = <environment>, class = "srcref"),
structure(c(3L, 10L, 3L, 15L, 17L, 22L, 3L, 3L), srcfile = <environment>, class = "srcref"),
structure(c(4L, 2L, 5L, 15L, 9L, 15L, 4L, 5L), srcfile = <environment>, class = "srcref"),
structure(c(7L, 2L, 8L, 6L, 9L, 20L, 7L, 8L), srcfile = <environment>, class = "srcref"))

As you can "see" (the last two numbers in each structure are begin/end line), the expressions a <- 0 and a <- 1 map to the same line...
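
Building on that, here is a rough sketch of the "statement profiler" idea: rewrap each statement of a (braced) function body with timing code (the helper name and the use of proc.time() are my own choices; this ignores corner cases such as non-braced bodies, return() calls, and name clashes):

instrument_statements <- function(fun) {
  stmts <- as.list(body(fun))[-1]                 # drop the leading `{`
  prof_env <- new.env(parent = environment(fun))  # holds the per-statement timings
  prof_env$.timings <- numeric(length(stmts))

  wrapped <- lapply(seq_along(stmts), function(i) {
    bquote({
      .t0 <- proc.time()[["elapsed"]]
      .res <- .(stmts[[i]])
      .timings[.(i)] <<- .timings[.(i)] + proc.time()[["elapsed"]] - .t0
      .res
    })
  })

  body(fun) <- as.call(c(as.name("{"), wrapped))
  environment(fun) <- prof_env
  list(fun = fun, timings = function() prof_env$.timings)
}

# p <- instrument_statements(f); p$fun(10); p$timings()   # seconds per statement (hypothetical usage)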

Good luck!

Profiling for runtime in C

You said you don't want to wrap each call, but let me just show you how I handle timer calls so they can be removed at compile time, and maybe that can be wrapped up in another macro later:

#ifndef NOTIME
# include <ctime>
# define CLOCK_TICK(acc, ctr) ctr = std::clock()
# define CLOCK_TOCK(acc, ctr) acc += (std::clock() - ctr)
# define CLOCK_RESET(acc) acc = 0
# define CLOCK_REPORT(acc) 1000. * double(acc) / double(CLOCKS_PER_SEC)

static clock_t t1a, t1c, t2a, t2c; // as many pairs as you need

#else
# define CLOCK_TICK(acc, ctr)
# define CLOCK_TOCK(acc, ctr)
# define CLOCK_RESET(acc)
# define CLOCK_REPORT(acc) 0
#endif

Now you can use it like this:

CLOCK_RESET(t1a);
for (int i = 0; i != 250000; ++i)
{
CLOCK_TICK(t1a, t1c);
some_expensive_function();
CLOCK_TOCK(t1a, t1c);
}
std::cout << "Time spent in some_expensive_function(): " << CLOCK_REPORT(t1a) << "ms.\n";

With variadic macros you might even be able to wrap up the function call:

#define timed_call(ACC, CTR, func, ...) do { CLOCK_TICK(ACC, CTR); \
func(__VA_ARGS__); \
CLOCK_TOCK(ACC,CTR); \
} while (false)

Optimizing R objective function with Rcpp slower, why?

In general if you are able to use vectorized functions, you will find it to be (almost) as fast as running your code directly in Rcpp. This is because many vectorized functions in R (almost all vectorized functions in Base R) are written in C, Cpp or Fortran and as such there is often little to gain.

That said, there are improvements to gain both in your R and Rcpp code. Optimization comes from carefully studying the code, and removing unnecessary steps (memory assignment, sums, etc.).

Let's start with the Rcpp code optimization.

In your case the main optimization is to remove unnecessary matrix and vector calculations. The code is in essence

  1. Shift beta
  2. Calculate the log of the sum of exp(shifted beta) [log-sum-exp]
  3. Use Obs as an index for the shifted beta and sum over all the probabilities
  4. Subtract the log-sum-exp

Using this observation we can reduce your code to 2 for-loops. Note that sum is simply another for-loop (more or less: for(i = 0; i < max; i++){ sum += x }), so avoiding the sums can speed up one's code further (in most situations this is an unnecessary optimization!). In addition, your input Obs is an integer vector, and we can further optimize the code by using the IntegerVector type to avoid casting its elements between double and integer values (credit to Ralf Stubner's answer).

cppFunction('double llmnl_int_C_v2(NumericVector beta, IntegerVector Obs, int n_cat)
{
    int n_Obs = Obs.size();

    NumericVector betas(beta.size() + 1);
    // 1: shift beta (betas[0] stays 0)
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i - 1];
    }
    // 2: calculate the log-sum-exp denominator only once
    double expBetas_log_sum = log(sum(exp(betas)));
    // pre-allocate the running sum
    double ll_sum = 0;

    // 3: use n_Obs to avoid calling Obs.size() every iteration
    for (int i = 0; i < n_Obs; i++) {
        ll_sum += betas[Obs[i] - 1];
    }
    // 4: the denominator is the same for every observation
    ll_sum = ll_sum - expBetas_log_sum * n_Obs;
    return ll_sum;
}')

Note that I have removed quite a few memory allocations and unnecessary calculations in the for-loop. I have also used the fact that denom is the same for all iterations and simply multiplied it once for the final result.

We can perform similar optimizations in your R-code, which results in the below function:

llmnl_int_R_v2 <- function(beta, Obs, n_cat) {
    n_Obs <- length(Obs)
    betas <- c(0, beta)
    # note: denom = log(sum(exp(betas)))
    sum(betas[Obs]) - log(sum(exp(betas))) * n_Obs
}

Note that the complexity of the function has been drastically reduced, making it simpler for others to read. Just to be sure that I haven't messed up in the code somewhere, let's check that they return the same results:

set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))

beta = c(4,2,1)
Obs = mnl_sample
n_cat = 4
xr <- llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xr2 <- llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc <- llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc2 <- llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
all.equal(c(xr, xr2), c(xc, xc2))
#[1] TRUE

Well, that's a relief.

Performance:

I'll use microbenchmark to illustrate the performance. The optimized functions are fast, so I'll run the functions 1e5 times to reduce the effect of the garbage collector.

microbenchmark("llmml_int_R" = llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C" = llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmnl_int_R_v2" = llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C_v2" = llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
times = 1e5)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmml_int_R 202.701 206.801 288.219673 227.601 334.301 57368.902 1e+05
# llmml_int_C 250.101 252.802 342.190342 272.001 399.251 112459.601 1e+05
# llmnl_int_R_v2 4.800 5.601 8.930027 6.401 9.702 5232.001 1e+05
# llmml_int_C_v2 5.100 5.801 8.834646 6.700 10.101 7154.901 1e+05

Here we see the same result as before. The new functions are roughly 35x (R) and 40x (Cpp) faster than their first counterparts. Interestingly enough, the optimized R function is still very slightly (about 0.3 microseconds, or 4%) faster than my optimized Cpp function. My best bet here is that there is some overhead from the Rcpp package, and if this was removed the two would be identical, or the R version slightly faster.

Similarly, we can check performance using optim():

microbenchmark("llmnl_int" = optim(beta, llmnl_int, Obs = mnl_sample, 
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C" = optim(beta, llmnl_int_C, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
times = 1e3)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmnl_int 29541.301 53156.801 70304.446 76753.851 83528.101 196415.5 1000
# llmnl_int_C 36879.501 59981.901 83134.218 92419.551 100208.451 190099.1 1000
# llmnl_int_R_v2 667.802 1253.452 1962.875 1585.101 1984.151 22718.3 1000
# llmnl_int_C_v2 704.401 1248.200 1983.247 1671.151 2033.401 11540.3 1000

Once again the result is the same.

Conclusion:

As a short conclusion, it is worth noting that this is one example where converting your code to Rcpp is not really worth the trouble. This is not always the case, but it is often worth taking a second look at your function to see if there are areas where unnecessary calculations are performed. Especially in situations where one uses built-in vectorized functions, it is often not worth the time to convert code to Rcpp. More often one sees great improvements when converting for-loops containing code that can't easily be vectorized, thereby removing the for-loop.

Profiling ruby memory usage on require call

Yep, it is the ruby stack. Try

$ irb
>> `pmap #{Process.pid} | tail -1`

You should get a similar result.

Mine is: 144512K

If instead you run:

$ ruby -e 'system "pmap #{Process.pid} | tail -1"'

you get a smaller value: 27788K (mine).

So, to better inspect what's happening, go back to irb:

$ irb
>> puts `pmap #{Process.pid}`

When you need to track Padrino dependencies, load your lib inside irb:

$ cd my_padrino_project
$ irb -r /path/to/my/lib.rb
>> require_relative 'config/boot'

and check your results.


