perf top result about nested functions
Short Answer
No. Each reported percentage is for that specific function only, so the func_inside samples are not counted in func_outside.
Details
The way perf works is that it periodically collects performance samples. By default, perf top simply checks which function is currently running and adds that sample to the count for that function.
I was fairly sure this was the case, but I wanted to verify that this is how perf top displays the results, so I wrote a quick test program. It has two functions of interest, outer and inner. The outer function calls inner in a loop, and the amount of work that inner does is controlled by an argument. When compiling, be sure to use -O0 to avoid inlining. The command-line arguments control the ratio of work between the two functions.
Running with ./a.out 1 1 1000000000 gives:
49.20% a.out [.] outer
23.69% a.out [.] main
21.32% a.out [.] inner
Running with parameters ./a.out 1 10 1000000000
gives results:
66.06% a.out [.] inner
17.77% a.out [.] outer
9.50% a.out [.] main
Running with parameters ./a.out 1 100 1000000000
gives results:
88.53% a.out [.] inner
2.85% a.out [.] outer
1.09% a.out [.] main
If the count for inner were included in outer, then the runtime percentage for outer would always be higher than inner's. As these results show, that is not the case.
The test program I used is below; it was compiled with gcc -O0 -g --std=c11 test.c.
#include <stdlib.h>
#include <stdio.h>

long inner(int count) {
    long sum = 0;
    for(int i = 0; i < count; i++) {
        sum += i;
    }
    return sum;
}

long outer(int count_out, int count_in) {
    long sum = 0;
    for(int i = 0; i < count_out; i++) {
        sum += inner(count_in);
    }
    return sum;
}

int main(int argc, char **argv) {
    if(argc < 4) {
        printf("Usage: %s <outer_cnt> <inner_cnt> <loop>\n", argv[0]);
        exit(-1);
    }
    int outer_cnt = atoi(argv[1]);
    int inner_cnt = atoi(argv[2]);
    int loops = atoi(argv[3]);
    long res = 0;
    for(int i = 0; i < loops; i++) {
        res += outer(outer_cnt, inner_cnt);
    }
    printf("res is %ld\n", res);
    return 0;
}
Dart: Is it efficient to write nested functions in methods that are called several times
It depends.
If the _bar function can be defined outside of the foo method, then we can presume that it doesn't reference any local variables of the foo method.
In that case, the compiler can optimize the local function to be just as efficient as the instance method. Maybe it does, maybe it doesn't, so let's check.
See: https://dartpad.dev/4a53a91bf4e0006e4af4c8a598b68ee6
This is written (or at least attempts to be written) so that the compiler can't optimize away the invocation of _someOtherLogic.
I also tried making the invocation a call to a static method (but then having to pass the object itself as an argument to give access to the instance getter for flag).
Running this in dartpad gives me a final set of results of
A: 918460 /ms
B: 798960 /ms
S: 918380 /ms
It seems dart2js is more efficient with the local function than with the instance method.
Running the same code on the VM gives a benchmark result of:
A: 625960.0 /ms
B: 723245.0 /ms
S: 625075.0 /ms
showing that its performance characteristics are exactly the opposite of dart2js's.
It's very likely that dart2js inlines the _bar function when it's statically known. The dart2js compiler tends to be more aggressive about inlining than the VM.
All in all, I wouldn't start to worry about this difference unless the function call shows up heavily in the performance profile of a real-world program.
If your program's performance really depends critically on this one function call, I'd probably inline the function. If not, write whatever is more readable and maintainable, and don't start micro-optimizing until you know it matters.
Get `perf` to provide alphabetized list of functions
I don't think it's possible with perf directly (to be confirmed by perf experts). Nevertheless, you can easily redirect perf report output to a file:
perf report --stdio > out
Then remove the heading lines preceding the data and use sort:
sort -b -k 3 out
Are nested functions faster than global functions in Python?
There are several parts that have an effect here:
1) The time to define the function (create the function object),
2) The time to look up the function object by name,
3) The time to actually call the function.
Your global function example is faster for 1) (no need to redefine a on each call to b1). However, it is slower for 2), because a global-variable lookup is slower than a local lookup.
Why can't we have both, then?
I've extended your benchmark with a solution that uses the global function but avoids the global lookup by binding it to a local variable. It seems to be the fastest of the three on my machine:
5 6 7
b1: 0.04147 0.44421 4.46508
b2: 0.03399 0.43321 4.41121
b3: 0.03258 0.41821 4.25542
b1: 0.03240 0.42998 4.39774
b2: 0.03320 0.43465 4.42229
b3: 0.03155 0.42109 4.23669
b1: 0.03273 0.43321 4.37266
b2: 0.03326 0.43551 4.42208
b3: 0.03137 0.42356 4.25341
b1: 0.03253 0.43104 4.40466
b2: 0.03401 0.43719 4.42996
b3: 0.03155 0.41681 4.24132
b1: 0.03244 0.42965 4.37192
b2: 0.03310 0.43629 4.42727
b3: 0.03117 0.41701 4.23932
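A minimal sketch of what the three variants look like (the exact bodies are assumptions, since the original benchmark isn't shown): b1 redefines the helper inside each call, b2 calls the global helper directly, and b3 binds the global helper to a local name once so the loop uses the faster local lookup.

```python
import timeit

def a(x):
    return x * x

def b1(n):
    # variant 1: redefine the helper on every call (nested function)
    def a_local(x):
        return x * x
    s = 0
    for i in range(n):
        s += a_local(i)
    return s

def b2(n):
    # variant 2: call the global helper directly (global lookup each iteration)
    s = 0
    for i in range(n):
        s += a(i)
    return s

def b3(n):
    # variant 3: bind the global helper to a local name once up front
    a_fast = a
    s = 0
    for i in range(n):
        s += a_fast(i)
    return s

if __name__ == "__main__":
    for f in (b1, b2, b3):
        t = timeit.timeit(lambda: f(10**4), number=5)
        print(f"{f.__name__}: {t:.5f}")
```

All three compute the same result; only where the name `a` is resolved differs.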
Is there an overhead when nesting functions in Python?
Yes, a new function object is created each time. It's likely not an issue unless it's in a tight loop; profiling will tell you whether it's a problem.
In [80]: def foo():
....: def bar():
....: pass
....: return bar
....:
In [81]: id(foo())
Out[81]: 29654024
In [82]: id(foo())
Out[82]: 29651384
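To put a number on that per-call creation cost, you can time a function that defines a nested function against one that doesn't; a small sketch (the function names are made up for illustration, and the absolute numbers will vary by machine):

```python
import timeit

def with_nested():
    # creating the function object is a small constant cost per call
    def bar():
        pass
    return bar

def without_nested():
    return None

t_nested = timeit.timeit(with_nested, number=100_000)
t_plain = timeit.timeit(without_nested, number=100_000)
print(f"nested: {t_nested:.4f}s")
print(f"plain:  {t_plain:.4f}s")
```

The difference is typically tens of nanoseconds per call, which is why it only matters in tight loops.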
Performance with nested functions in C#
Break them down:
//we start with a space on the stack from a local or parameter
//holding whatever normalizationThreshold is
//1. Create a space on the stack to hold a decimal[][]
//Call CreateOriginalFormTableau() method
//assign result to that space.
decimal[][] originalFormTableau = CreateOriginalFormTableau();
//2. Create a space on the stack to hold a decimal[][]
//Call InsertSlackVariables() method, passing in the first decimal[][]
//assign result to that second space.
decimal[][] standardFormTableau = InsertSlackVariables(originalFormTableau);
//3. Call the Engine setter with the second space or assign it to the Engine field.
this.Engine = new SimplexEngine(standardFormTableau, normalizationThreshold)
VS:
//we start with a space on the stack from a local or parameter
//holding whatever normalizationThreshold is
//1. Create a space on the stack to hold a decimal[][]
//Call CreateOriginalFormTableau() method
//assign result to that space.
//2. Create a space on the stack to hold a decimal[][]
//Call InsertSlackVariables() method, passing in the first decimal[][]
//assign result to that second space.
//3. Call the Engine setter with the second space or assign it to the Engine field.
this.Engine = new SimplexEngine(
InsertSlackVariables(
CreateOriginalFormTableau()),
normalizationThreshold);
They're exactly the same code, but one version gives a name to what another leaves unnamed.
They might have slightly different performance in debug builds, though even there I'd doubt it. Release builds sometimes reuse stack slots where debug builds don't (so that you can examine in-scope variables even after their last use), but a debug build most likely won't reuse the unnamed slots either, and even if it did, the impact would be minute.
Should we pass a variable that can be acquired in O(1) into nested functions, or recall that function within every nested function?
I think that B is preferred because it avoids some problems with changing the number or names of parameters. Option A has the "data clumps" code smell, and B is its standard solution, called "preserve whole object".
Both approaches also have the "tramp data" smell, which can be solved by an OO redesign that makes the passed variables members of the class, or by introducing a global or context variable (but this is quite rare).
If there are no unwanted side effects and no copying of large objects by value, then the performance difference can be neglected. I would advise you to adhere to the following rule when developing: don't do premature optimization.
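As a minimal sketch of the two options (the class and function names are hypothetical, since the original code isn't shown): Option A passes the individual values, Option B passes the whole object.

```python
class Config:
    def __init__(self, width, height, depth):
        self.width = width
        self.height = height
        self.depth = depth

# Option A: pass each value separately ("data clumps" smell --
# every call site and signature changes if a field is added or renamed)
def volume_a(width, height, depth):
    return width * height * depth

# Option B: pass the whole object ("preserve whole object" --
# the signature is stable when fields change)
def volume_b(config):
    return config.width * config.height * config.depth

cfg = Config(2, 3, 4)
assert volume_a(cfg.width, cfg.height, cfg.depth) == volume_b(cfg) == 24
```

Only references to `cfg` are passed around, so Option B does not copy the object; the cost difference between the two is negligible in Python.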
The word "never" does not apply in software design; it all depends on the context. That holds not only across different languages, but also for the problem being solved within one language, the tools used, libraries, frameworks, etc.
Javascript nested function call optimization
OK, I've been researching this issue for a while now and TL;DR - it's complicated.
Turns out that many performance questions really depend on the platform, browser and even minor browser revision number. And not by a little, either. There are many examples on jsPerf that show things such as 'for vs while; or 'typed arrays vs standard arrays' wildly swinging back and forth in terms of favorable execution speed with different browser releases. Presumably this is due to JIT optimization trade-offs.
Short answer to the general performance questions - just test everything in jsPerf. None of the suggestions I got in this thread were helpful in all cases. The JIT makes things complicated. This is particularly important if you have a background like mine and are used to C programs having certain rule-of-thumb coding patterns that tend to speed things up. Don't assume anything - just test it.
NOTE: many of the weird issues I listed in the original question were due to using the default Chrome profiler (i.e., the one you get from the Ctrl+Shift+I menu). If you are doing a lot of really fast loops (such as in graphics rendering), DO NOT USE THIS PROFILER. Its time resolution of 1 ms is much too coarse for proper performance debugging.
In fact, the ENTIRE issue I had with case 2 being so much faster than the others is due to the profiler simply not "seeing" many of the function calls and improperly reporting CPU percentages. In the heat map, I could clearly see huge stretches where inner-loop functions were firing but not being recorded by the profiler.
Solution: http://www.html5rocks.com/en/tutorials/games/abouttracing/#
Chrome has a less-obvious and much more powerful profiler built into about:tracing. It has microsecond resolution, the ability to read code tags for sub-function resolution, and is generally much more kickass. As soon as I started using this profiler, the results fell into line with what I saw on jsPerf and helped me reduce my rendering time by nearly half. How did I do that? Again, it wasn't simple. In some cases, calling out to subroutines helped; in others it didn't. Refactoring the whole rendering engine from an object literal to the module pattern seemed to help a bit. Precalculating any multiplication operations in for loops did seem to have big effects. Etc., etc.
Quick notes about the about:tracing profiler: Zooming and panning is with ASWD on the keyboard - that took me a while to figure out. Also, it profiles all tabs and operates in a tab outside the page being analyzed. So minimize the number of extraneous tabs you have open since they will clutter up the profiler view. Also, if testing Canvas apps, be sure to switch tabs over to the app since RequestAnimationFrame generally doesn't fire when a tab is not active and visible.