Profiling a (Possibly I/O-Bound) Process to Reduce Latency

Use the stack-sampling method described below: manually pause the program and examine the call stack.

It is simple and effective at pinpointing optimization opportunities, whether they are in CPU-bound or I/O-bound code.

If you are right that the biggest opportunities are in a particular function or module, it will find them there. If they are elsewhere, it will find them there too.

Of the tools you mentioned and discarded, it is most similar to poor man's profiler, but still not very similar.

EDIT: Since you say it is triggered by a user interaction and blocks further input until it completes, here's how I would do it.

First, I assume it does not block a manual interrupt signal to the debugger, because otherwise you'd have no way to stop an infinite loop.
Second, I would wrap a loop of 10, 100, or 1000 iterations around the routine in question, so it runs long enough to be interrupted manually (a sketch follows).
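
Here is a minimal sketch of that wrapper in Python, purely illustrative and not from the original answer; handle_click() is a hypothetical stand-in for the user-triggered routine. Interrupting the run (Ctrl-C, or a breakpoint in a debugger) shows the stack at the moment of interruption.

import traceback

def handle_click():
    # Hypothetical stand-in for the user-triggered routine under study.
    sum(i * i for i in range(10**6))

def sampling_harness(iterations=1000):
    # Repeat the routine enough times that a manual interrupt is easy to land.
    try:
        for _ in range(iterations):
            handle_click()
    except KeyboardInterrupt:
        traceback.print_exc()   # the printed traceback is one stack sample
        raise

if __name__ == "__main__":
    sampling_harness()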

Now, suppose it is spending some fraction of its time, say 50%, doing I/O. Then when you interrupt it, you have a 50% chance of catching it in the I/O.
If you do catch it in the I/O (the call stack will tell you), you can also see in great detail where the I/O is being requested from and why.

It will show you what's going on, which is almost certainly something surprising.
If you see it doing something on as few as two (2) samples that you could find a way to eliminate, then you will get a considerable speedup.
In fact, if you eliminate that activity you don't know in advance exactly how much time you will save, but on average you can expect to save a fraction F = (s+1)/(n+2), where n is the total number of samples you took and s is the number of samples that showed the activity (the Rule of Succession).
For example, if you took 4 stack samples and saw the activity on 2 of them, on average it would save you F = 3/6 = 1/2, corresponding to a speedup factor of 1/(1-F), or 2.
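
As a quick check of that arithmetic, here is the estimate in a few lines of Python (a sketch, not part of the original answer):

def expected_saving(s, n):
    # Rule of Succession: expected fraction of time saved if an activity
    # seen on s out of n stack samples is eliminated.
    f = (s + 1) / (n + 2)
    return f, 1.0 / (1.0 - f)     # (fraction saved, speedup factor)

print(expected_saving(2, 4))      # (0.5, 2.0) -- the example above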

Once you've done that, you can do it again and find something else to fix.
The speedup factors multiply together like compound interest.
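
The compounding is just multiplication; for instance, a 2x speedup from one fix followed by a 1.5x speedup from another gives 3x overall (illustrative numbers, not from the answer):

from math import prod

speedups = [2.0, 1.5]
print(prod(speedups))             # 3.0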

Then of course you remove the outer loop and "cash in" all the speedups you got.

If you are wondering how this differs from profiling, it is that by carefully examining each stack sample, and possibly related data, you can recognize activities you could remove, whereas if all you have is measurements, you are left trying to intuit what is going on.
The actual amount of time you save is what it is, regardless of any measurements.
The important thing is to find the problem.
No matter how precisely a profiler might measure it, if you can't find it, you're not winning.
These pages are full of people saying either they don't understand what their profiler is telling them, or it seems to be saying there is nothing to fix, which they are only too willing to accept.
That's a case of rose-tinted glasses.

More on all that.

How to profile time spent in memory access in C/C++ applications?

The Roofline model proposes the notion of Arithmetic Intensity: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/. Simply put, it is the number of arithmetic instructions executed per memory access.

Computing the Arithmetic Intensity is usually done with hardware performance counters.
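
Once the intensity and the machine peaks are known, the Roofline bound itself is simple arithmetic; this small sketch uses illustrative numbers that are not from the original answer:

def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bw_gb_s):
    # Roofline model: performance is capped either by peak compute or by
    # memory traffic (FLOPs per byte times peak bandwidth in GB/s).
    return min(peak_gflops, arithmetic_intensity * peak_bw_gb_s)

print(attainable_gflops(0.25, peak_gflops=500, peak_bw_gb_s=50))   # 12.5 -> memory-bound
print(attainable_gflops(25.0, peak_gflops=500, peak_bw_gb_s=50))   # 500  -> compute-bound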

How should I optimize this filesystem I/O bound program?

Python already does I/O buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long, unless you force the OS to write them immediately, for example by closing the file after each write or opening the file in O_SYNC mode.

If the OS isn't doing the right thing, you can try raising the buffer size (the third parameter to open()). For some guidance on appropriate values: given a 100 MB/s, 10 ms latency I/O system, a 1 MB I/O size will result in approximately 50% latency overhead, while a 10 MB I/O size will result in about 9% overhead. If it's still I/O bound, you probably just need more bandwidth. Use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
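
A minimal sketch of both points (the file names are made up): pass a larger buffer size as the third argument to open(), and the overhead figures above follow from latency / (latency + size / bandwidth).

BUF = 10 * 1024 * 1024                      # 10 MB buffer (third argument to open())
with open("input.dat", "rb", BUF) as src, open("output.dat", "wb", BUF) as dst:
    for chunk in iter(lambda: src.read(BUF), b""):
        dst.write(chunk)

def latency_overhead(io_size_mb, bandwidth_mb_s=100, latency_s=0.010):
    transfer_s = io_size_mb / bandwidth_mb_s
    return latency_s / (latency_s + transfer_s)

print(latency_overhead(1))                  # ~0.50 -> about 50% overhead
print(latency_overhead(10))                 # ~0.09 -> about 9% overhead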

It is also useful to check whether step 4 is spending its time executing or waiting on I/O. If it's executing, you'll need to spend more time finding which part is the culprit and optimizing it, or split the work out to different processes.
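
One way to make that check, sketched here on the assumption that the step is callable as a function (step4 is a hypothetical name): compare wall-clock time with CPU time; the difference is roughly time spent waiting.

import time

def time_step(fn, *args, **kwargs):
    wall0, cpu0 = time.perf_counter(), time.process_time()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    print(f"wall {wall:.2f}s  cpu {cpu:.2f}s  waiting ~{wall - cpu:.2f}s")
    return result

# result = time_step(step4, data)           # hypothetical call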

What profiler should I use to measure _real_ time (including time waiting for syscalls) spent in this function, not _CPU_ time?

A quickly hacked-up, trivial sampling profiler for Linux: http://vi-server.org/vi/simple_sampling_profiler.html

It appends backtrace(3) output to a file on SIGUSR1, and then converts it to annotated source.

Because it probes the program periodically, it also catches functions that are waiting for something.

And because it walks the stack, we see the callers too.
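
The linked profiler is for compiled programs (backtrace(3) plus a SIGUSR1 handler). A rough Python analog of the same idea, assuming a Unix system, is to register a handler that dumps every thread's stack whenever the process receives SIGUSR1:

import faulthandler, signal, sys

# Dump all thread stacks to stderr on SIGUSR1 (Unix only).
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Then sample from a shell while the program runs, e.g.:
#   while sleep 0.5; do kill -USR1 <pid>; done
# Stacks caught while the program is blocked show the waiting function and its callers.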

People answering similar questions also recommend Zoom.

How can I tell why my Perl code is spending so much time in kernel space (sy in top)?

You could use strace(1), which traces system calls and signals.
The -c flag produces aggregate results grouped by syscall, and -p NUM attaches to the running process with PID NUM.
Monitoring a crawler, for instance, looks like this:

stas@crawler2:~/ScrapMe$ strace -c -p 32184
Process 32184 attached - interrupt to quit
^CProcess 32184 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.00    0.000024           0       128           sendto
 26.00    0.000013           0        76           write
 26.00    0.000013           0      2652      2652 stat
  0.00    0.000000           0        24           close
  0.00    0.000000           0       226           poll
  0.00    0.000000           0        78           rt_sigaction
  0.00    0.000000           0        26           rt_sigprocmask
  0.00    0.000000           0        52           alarm
  0.00    0.000000           0        26           socket
  0.00    0.000000           0        26        26 connect
  0.00    0.000000           0        75           recvfrom
  0.00    0.000000           0        26           getsockname
  0.00    0.000000           0        26           getpeername
  0.00    0.000000           0        26           getsockopt
  0.00    0.000000           0       310           fcntl
  0.00    0.000000           0        13           epoll_wait
  0.00    0.000000           0        26           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000050                  3816      2678 total

