perf top result about nested functions
Short Answer
No. Each reported percentage is for that specific function only, so the func_inside samples are not counted in func_outside.
Details
The way perf works is that it periodically collects performance samples. By default, perf top simply checks which function is currently running and adds that sample to the count for that function.
I was fairly sure this was the case, but I wanted to verify that this is how perf top displays the results, so I wrote a quick test program. It has two functions of interest, outer and inner. The outer function calls inner in a loop, and the amount of work that inner does is controlled by an argument. When compiling, be sure to use -O0 to avoid inlining. The command-line arguments control the ratio of work between the two functions.
Running with ./a.out 1 1 1000000000 gives:
49.20% a.out [.] outer
23.69% a.out [.] main
21.32% a.out [.] inner
Running with parameters ./a.out 1 10 1000000000
gives results:
66.06% a.out [.] inner
17.77% a.out [.] outer
9.50% a.out [.] main
Running with parameters ./a.out 1 100 1000000000
gives results:
88.53% a.out [.] inner
2.85% a.out [.] outer
1.09% a.out [.] main
If the count for inner were included in outer, then the runtime percentage for outer would always be higher than inner's. As these results show, that is not the case.
The test program I used is below; it was compiled with gcc -O0 -g --std=c11 test.c.
#include <stdlib.h>
#include <stdio.h>

long inner(int count) {
    long sum = 0;
    for(int i = 0; i < count; i++) {
        sum += i;
    }
    return sum;
}

long outer(int count_out, int count_in) {
    long sum = 0;
    for(int i = 0; i < count_out; i++) {
        sum += inner(count_in);
    }
    return sum;
}

int main(int argc, char **argv) {
    if(argc < 4) {
        printf("Usage: %s <outer_cnt> <inner_cnt> <loop>\n", argv[0]);
        exit(-1);
    }
    int outer_cnt = atoi(argv[1]);
    int inner_cnt = atoi(argv[2]);
    int loops = atoi(argv[3]);
    long res = 0;
    for(int i = 0; i < loops; i++) {
        res += outer(outer_cnt, inner_cnt);
    }
    printf("res is %ld\n", res);
    return 0;
}
Dart: Is it efficient to write nested functions in methods that are called several times
It depends.
If the _bar function can be defined outside of the foo method, then we can presume that it doesn't reference any local variables of the foo method.
In that case, the compiler can optimize the local function to be just as efficient as the instance method. Maybe it does, maybe it doesn't, so let's check.
See: https://dartpad.dev/4a53a91bf4e0006e4af4c8a598b68ee6
This is written (or at least attempts to be written) so that the compiler can't optimize away the invocation of _someOtherLogic.
I also tried making the invocation a call to a static method (but then having to pass the object itself as an argument to give access to the instance getter for flag).
Running this in dartpad gives me a final set of results of
A: 918460 /ms
B: 798960 /ms
S: 918380 /ms
It seems dart2js is more efficient with the local function than with the instance method.
Running the same code on the VM gives a benchmark result of:
A: 625960.0 /ms
B: 723245.0 /ms
S: 625075.0 /ms
showing that its performance characteristics are exactly the opposite of dart2js's.
It's very likely that dart2js inlines the _bar function when it's statically known. The dart2js compiler tends to be more aggressive about inlining than the VM.
All in all, I wouldn't start to worry about this difference unless the function call shows up heavily in the performance profile of a real-world program.
If your program's performance really depends critically on this one function call, I'd probably inline the function. If not, write whatever is more readable and maintainable, and don't start micro-optimizing until you know it matters.
Get `perf` to provide alphabetized list of functions
I don't think it's possible with perf directly (to be confirmed by perf experts). Nevertheless, you can easily redirect perf report output to a file:
perf report --stdio > out
Then remove the heading lines preceding the data and use sort:
sort -b -k 3 out
Are nested functions faster than global functions in Python?
There are several parts that have an effect here:
1) The time to define the function (create the function object),
2) The time to look up the function object by name,
3) The time to actually call the function.
Your global function example is faster for 1) (no need to redefine a on each call to b1). However, it is slower for 2), because a global-variable lookup is slower than a local lookup.
Why can't we have both, then?
I've extended your benchmark with a solution that uses the global function but avoids the global lookup by binding it to a local variable. It seems to be the fastest of the three on my machine:
5 6 7
b1: 0.04147 0.44421 4.46508
b2: 0.03399 0.43321 4.41121
b3: 0.03258 0.41821 4.25542
b1: 0.03240 0.42998 4.39774
b2: 0.03320 0.43465 4.42229
b3: 0.03155 0.42109 4.23669
b1: 0.03273 0.43321 4.37266
b2: 0.03326 0.43551 4.42208
b3: 0.03137 0.42356 4.25341
b1: 0.03253 0.43104 4.40466
b2: 0.03401 0.43719 4.42996
b3: 0.03155 0.41681 4.24132
b1: 0.03244 0.42965 4.37192
b2: 0.03310 0.43629 4.42727
b3: 0.03117 0.41701 4.23932
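A minimal sketch of what the three variants look like (the exact bodies are assumptions, since the original benchmark isn't shown): b1 redefines the helper inside each call, b2 calls the global helper directly, and b3 binds the global helper to a local name once so the loop uses the faster local lookup.

```python
import timeit

def a(x):
    return x * x

def b1(n):
    # variant 1: redefine the helper on every call (nested function)
    def a_local(x):
        return x * x
    s = 0
    for i in range(n):
        s += a_local(i)
    return s

def b2(n):
    # variant 2: call the global helper directly (global lookup each iteration)
    s = 0
    for i in range(n):
        s += a(i)
    return s

def b3(n):
    # variant 3: bind the global helper to a local name once up front
    a_fast = a
    s = 0
    for i in range(n):
        s += a_fast(i)
    return s

if __name__ == "__main__":
    for f in (b1, b2, b3):
        t = timeit.timeit(lambda: f(10**4), number=5)
        print(f"{f.__name__}: {t:.5f}")
```

All three compute the same result; only where the name `a` is resolved differs.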
Is there an overhead when nesting functions in Python?
Yes, a new function object is created each time. It's likely not an issue unless it's in a tight loop; profiling will tell you whether it's a problem.
In [80]: def foo():
....: def bar():
....: pass
....: return bar
....:
In [81]: id(foo())
Out[81]: 29654024
In [82]: id(foo())
Out[82]: 29651384
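To put a number on that per-call creation cost, you can time a function that defines a nested function against one that doesn't; a small sketch (the function names are made up for illustration, and the absolute numbers will vary by machine):

```python
import timeit

def with_nested():
    # creating the function object is a small constant cost per call
    def bar():
        pass
    return bar

def without_nested():
    return None

t_nested = timeit.timeit(with_nested, number=100_000)
t_plain = timeit.timeit(without_nested, number=100_000)
print(f"nested: {t_nested:.4f}s")
print(f"plain:  {t_plain:.4f}s")
```

The difference is typically tens of nanoseconds per call, which is why it only matters in tight loops.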
Performance with nested functions in C#
Break them down:
//we start with a space on the stack from a local or parameter
//holding whatever normalizationThreshold is
//1. Create a space on the stack to hold a decimal[][]
//Call CreateOriginalFormTableau() method
//assign result to that space.
decimal[][] originalFormTableau = CreateOriginalFormTableau();
//2. Create a space on the stack to hold a decimal[][]
//Call InsertSlackVariables() method, passing in the first decimal[][]
//assign result to that second space.
decimal[][] standardFormTableau = InsertSlackVariables(originalFormTableau);
//3. Call the Engine setter with the second space or assign it to the Engine field.
this.Engine = new SimplexEngine(standardFormTableau, normalizationThreshold)
VS:
//we start with a space on the stack from a local or parameter
//holding whatever normalizationThreshold is
//1. Create a space on the stack to hold a decimal[][]
//Call CreateOriginalFormTableau() method
//assign result to that space.
//2. Create a space on the stack to hold a decimal[][]
//Call InsertSlackVariables() method, passing in the first decimal[][]
//assign result to that second space.
//3. Call the Engine setter with the second space or assign it to the Engine field.
this.Engine = new SimplexEngine(
InsertSlackVariables(
CreateOriginalFormTableau()),
normalizationThreshold);
They're exactly the same code, but one version gives a name to what another leaves unnamed.
They might have slightly different performance in debug builds, though even there I'd doubt it. Release builds sometimes reuse stack slots where debug builds don't (so that you can examine in-scope variables even after their last use), but a debug build most likely won't reuse the unnamed slots either, and even if it did, the impact would be minute.
Should we pass a variable that can be acquired in O(1) into nested functions, or recall that function within every nested function?
I think that B is preferred because it avoids some problems with changing the number or names of parameters. Option A has the "data clumps" code smell, and B is its standard solution, called "preserve whole object".
Both approaches also have the "tramp data" smell, which can be solved by an OO redesign that makes the passed variables members of the class, or by introducing a global or context variable (but this is quite rare).
If there are no unwanted side effects and no copying of large objects by value, then the performance difference can be neglected. I would advise you to adhere to the following rule when developing: don't do premature optimization.
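As a minimal sketch of the two options (the class and function names are hypothetical, since the original code isn't shown): Option A passes the individual values, Option B passes the whole object.

```python
class Config:
    def __init__(self, width, height, depth):
        self.width = width
        self.height = height
        self.depth = depth

# Option A: pass each value separately ("data clumps" smell --
# every call site and signature changes if a field is added or renamed)
def volume_a(width, height, depth):
    return width * height * depth

# Option B: pass the whole object ("preserve whole object" --
# the signature is stable when fields change)
def volume_b(config):
    return config.width * config.height * config.depth

cfg = Config(2, 3, 4)
assert volume_a(cfg.width, cfg.height, cfg.depth) == volume_b(cfg) == 24
```

Only references to `cfg` are passed around, so Option B does not copy the object; the cost difference between the two is negligible in Python.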
The word "never" does not apply in software design; it all depends on the context. That holds not only across different languages, but also for the problem being solved within one language, the tools used, libraries, frameworks, etc.
Javascript nested function call optimization
OK, I've been researching this issue for a while now and TL;DR - it's complicated.
Turns out that many performance questions really depend on the platform, browser and even minor browser revision number. And not by a little, either. There are many examples on jsPerf that show things such as 'for vs while; or 'typed arrays vs standard arrays' wildly swinging back and forth in terms of favorable execution speed with different browser releases. Presumably this is due to JIT optimization trade-offs.
Short answer to the general performance questions - just test everything in jsPerf. None of the suggestions I got in this thread were helpful in all cases. The JIT makes things complicated. This is particularly important if you have a background like mine and are used to C programs having certain rule-of-thumb coding patterns that tend to speed things up. Don't assume anything - just test it.
NOTE: many of the weird issues I listed in the original question were due to using the default Chrome profiler (i.e., the one you get from the Ctrl+Shift+I menu). If you are doing a lot of really fast loops (such as in graphics rendering), DO NOT USE THIS PROFILER. Its time resolution of 1 ms is much too coarse for proper performance debugging.
In fact, the ENTIRE issue I had with case 2 being so much faster than the others is due to the profiler simply not "seeing" many of the function calls and improperly reporting CPU percentages. In the heat map, I could clearly see huge stretches where inner-loop functions were firing but not being recorded by the profiler.
Solution: http://www.html5rocks.com/en/tutorials/games/abouttracing/#
Chrome has a less-obvious and much more powerful profiler built into about:tracing. It has microsecond resolution, the ability to read code tags for sub-function resolution, and is generally much more kickass. As soon as I started using this profiler, the results fell into line with what I saw on jsPerf and helped me reduce my rendering time by nearly half. How did I do that? Again, it wasn't simple. In some cases, calling out to subroutines helped; in others it didn't. Refactoring the whole rendering engine from an object literal to the module pattern seemed to help a bit. Precalculating any multiplication operations in for loops did seem to have big effects. Etc., etc.
Quick notes about the about:tracing profiler: Zooming and panning is with ASWD on the keyboard - that took me a while to figure out. Also, it profiles all tabs and operates in a tab outside the page being analyzed. So minimize the number of extraneous tabs you have open since they will clutter up the profiler view. Also, if testing Canvas apps, be sure to switch tabs over to the app since RequestAnimationFrame generally doesn't fire when a tab is not active and visible.