If Profiler Is Not the Answer, What Other Choices Do We Have

If profiler is not the answer, what other choices do we have?

Oh, man, where to begin?

First, I'm amazed that this is news. Second, the problem is not that profilers are bad, it is that some profilers are bad.
The authors built one that, they feel, is good, just by avoiding some of the mistakes they found in the ones they evaluated.
Mistakes are common because of some persistent myths about performance profiling.

But let's be positive.
If one wants to find opportunities for speedup, it is really very simple:

  • Sampling should be uncorrelated with the state of the program.

    That means happening at a truly random time, regardless of whether the program is in I/O (except for user input), or in GC, or in a tight CPU loop, or whatever.

  • Sampling should read the function call stack,

    so as to determine which statements were "active" at the time of the sample.
    The reason is that every call site (point at which a function is called) has a percentage cost equal to the fraction of time it is on the stack.
    (Note: the paper is concerned entirely with self-time, ignoring the massive impact of avoidable function calls in large software. In fact, the reason behind the original gprof was to help find those calls.)

  • Reporting should show percent by line (not by function).

    If a "hot" function is identified, one still has to hunt inside it for the "hot" lines of code accounting for the time. That information is in the samples! Why hide it?

An almost universal mistake (which the paper shares) is to be too concerned with accuracy of measurement, and not enough with accuracy of location.
For example, here is a performance-tuning exercise in which a series of performance problems were identified and fixed, resulting in a compounded speedup of 43 times.
It was not essential to know the precise size of each problem before fixing it, but it was essential to know its location.
A phenomenon of performance tuning is that fixing one problem, by reducing the time, magnifies the percentages of remaining problems, so they are easier to find.
As long as any problem is found and fixed, progress is made toward the goal of finding and fixing all the problems.
It is not essential to fix them in decreasing size order, but it is essential to pinpoint them.

On the subject of statistical accuracy of measurement: if a call point is on the stack some fraction F of the time (say 20%), and N (say 100) random-time samples are taken, then the number of samples that show the call point follows a binomial distribution, with mean = NF = 20 and standard deviation = sqrt(NF(1-F)) = sqrt(100 * 0.2 * 0.8) = sqrt(16) = 4. So the percent of samples that show it will be 20% +/- 4%.
So is that accurate? Not really, but has the problem been found? Precisely.
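
The arithmetic is easy to check by simulation; this short Python sketch reproduces the F = 0.2, N = 100 example above:

    import random

    F, N, TRIALS = 0.20, 100, 100_000
    counts = [sum(random.random() < F for _ in range(N)) for _ in range(TRIALS)]
    mean = sum(counts) / TRIALS
    std = (sum((c - mean) ** 2 for c in counts) / TRIALS) ** 0.5
    print(mean, std)   # roughly 20 and 4, as the binomial formulas predict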

In fact, the larger a problem is, in terms of percent, the fewer samples are needed to locate it. For example, if 3 samples are taken, and a call point shows up on 2 of them, it is highly likely to be very costly.
(Specifically, the cost follows a beta distribution: if you generate 4 uniform(0,1) random numbers and sort them, the distribution of the 3rd one is the distribution of cost for that call point.
Its mean is (2+1)/(3+2) = 0.6, so that is the expected savings, given those samples.)
Added: the speedup factor you get is governed by another distribution, BetaPrime, whose average is 4. So if you take 3 samples, see a problem on 2 of them, and eliminate that problem, on average you will make the program four times faster.
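
Both figures are easy to verify by simulation, using exactly the order-statistic construction described above:

    import random

    TRIALS = 1_000_000
    # Cost of the call point, given 2 hits in 3 samples: the 3rd smallest
    # of 4 uniform(0,1) numbers, i.e. a Beta(3,2) variate.
    costs = [sorted(random.random() for _ in range(4))[2] for _ in range(TRIALS)]
    print(sum(costs) / TRIALS)                       # ~0.6, the expected savings
    # Speedup if that cost is removed: 1/(1 - cost). Its mean is 4, though
    # the estimate converges slowly because the distribution is heavy-tailed.
    print(sum(1 / (1 - c) for c in costs) / TRIALS)  # ~4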

It's high time we programmers blew the cobwebs out of our heads on the subject of profiling.

Disclaimer - the paper failed to reference my article: Dunlavey, “Performance tuning with instruction-level cost derived from call-stack sampling”, ACM SIGPLAN Notices 42, 8 (August, 2007), pp. 4-8.

Need to know how to use profilers / which one to use

If you're confused, so are most of the profiler vendors.

The first thing to understand is - programs aren't slow because they have slow parts (bottlenecks).
They are slow because they do more than they have to. They have time drains.

It's like a horse race.
A bottleneck would be a narrow place in the track, where the horses have to pile up and slow down.
A time drain would be like another track fused with the first, into which horses wander and have to travel an extra distance.
Then there could be another track fused with that one, and another, and so on.

A function call that could be avoided is an example of a time drain.

Here's how I find time drains. It's simple and language-independent.
You can even do it with a simple tool like jstack.

Profiler makers mean well, but they are hampered by a bunch of confusing concepts.

Such as "where the time is spent". If what this means is "where the program counter is found most often", that's like the horse being in the wrong racetrack.
You could try to shorten that racetrack, but the real problem is the horse shouldn't even be there. i.e. there's a function call that should have been avoided.

Such as "statistical accuracy of measurement". Do you need to measure how long the horse takes to get around the wrong racetrack to know it's on the wrong racetrack?
No, you just need to take a snapshot (stack sample).
The longer it's on the wrong racetrack, the more likely you'll see it.
If you see it there twice, you know what the problem is.

Such as calling it a "CPU Profiler", that being an excuse to ignore I/O time.
Sometimes the time drain is needless I/O you were not aware of.
That would be like the horse stopping to munch on a bag of oats.
If you can only take your snapshot while the horse is running, you would never notice.
You would only notice that the time was suspiciously long.

There's more where those came from ...

Does attaching a profiler cause some things to run slower than others?

Profilers regularly give misleading information, but in general they are usually right. Where they tend to skew the result is in very simple methods that might have been further optimised if profiling wasn't enabled.

If in doubt, I suggest you use another profiler, such as YourKit (the evaluation version should be fine). It has more lightweight recording, but it can have the same issues.

Profiling by line with Python 3

While line_profiler only works for Python 2.x, it doesn't seem too hard to make the necessary changes to get it to work with 3.x.

I've done this myself here. It's quick and dirty and pretty much untested, so use it at your peril, but it's possibly a start.
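
For context, the usual line_profiler workflow looks like this once a working build is in hand (the file and function names here are illustrative; under kernprof the profile decorator is injected as a builtin):

    # demo.py
    @profile                      # injected by kernprof; no import needed
    def slow_sum(n):
        total = 0
        for i in range(n):        # per-line hit counts and timings show up here
            total += i * i
        return total

    if __name__ == "__main__":
        slow_sum(1_000_000)

Run it with kernprof -l -v demo.py to get the line-by-line report (or view it later with python -m line_profiler demo.py.lprof).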

One could use a profiler, but why not just halt the program?

On Java servers it's always been a neat trick to do 2-3 quick Ctrl-Breaks in a row and get 2-3 thread dumps of all running threads. Simply looking at where all the threads "are" may extremely quickly pinpoint where your performance problems are.

This technique can reveal more performance problems in 2 minutes than any other technique I know of.
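
The trick is language-independent. In Java the dumps come from Ctrl-Break or jstack; in Python, for comparison, a rough equivalent can be built on sys._current_frames() (a sketch, not production code):

    import sys, threading, traceback

    def dump_all_threads():
        # Rough analogue of a jstack / Ctrl-Break thread dump: print the
        # current stack of every live thread in this process.
        names = {t.ident: t.name for t in threading.enumerate()}
        for ident, frame in sys._current_frames().items():
            print(f"--- thread {names.get(ident, ident)} ---")
            traceback.print_stack(frame)

Call it two or three times a few seconds apart while the program is busy, and whatever shows up repeatedly is your problem.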

Can a profiler change how long recursive calls take to run in Java?

Yes. Most profilers do instrumentation at the level of method invocations. In recursive form, the profiler must take a lot more measurements than in iterative form. While profilers do try to subtract their overhead from the reported numbers, this is very difficult to do reliably. Different profilers will be better or worse at this.
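
The effect is not specific to Java. Python's cProfile hooks every call in the same way, so this small, illustrative demo shows the same skew: the recursive version pays the per-call instrumentation cost at every level, while the iterative version pays it once per invocation:

    import cProfile

    def fact_rec(n):
        return 1 if n <= 1 else n * fact_rec(n - 1)

    def fact_iter(n):
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result

    # fact_rec(500) triggers the profiler's call hook ~500 times per
    # invocation; fact_iter(500) triggers it once. Compare the reported
    # times against an unprofiled run to see the distortion.
    cProfile.run("for _ in range(1000): fact_rec(500)")
    cProfile.run("for _ in range(1000): fact_iter(500)")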


