Finding Latency Issues (Stalls) in Embedded Linux Systems

How can I identify what is stalling the system in Linux?

For an embedded system with many processes running, there can be a multitude of reasons, so you may need to investigate from several perspectives.

Check the code for race conditions and deadlocks. The kernel might be busy-looping under a certain condition. Your application could be waiting on a select call, blocked on a read, or out of CPU time (CPU exhaustion is ruled out here based on the top output you shared).
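
If a blocking select() is one of the suspects, giving it a timeout turns a silent stall into something you can log. A minimal sketch, assuming an already-open descriptor fd and an arbitrary 5-second budget:

    #include <stdio.h>
    #include <sys/select.h>

    /* Wait up to 5 seconds for fd to become readable. Returns like
     * select(): >0 ready, 0 timeout, <0 error. A timeout means the
     * stall is on this descriptor, which can be logged instead of
     * hanging silently. */
    int wait_readable(int fd)
    {
        fd_set rfds;
        struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        int ret = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ret == 0)
            fprintf(stderr, "still waiting on fd %d after 5s\n", fd);
        return ret;
    }

    int main(void)
    {
        return wait_readable(0) > 0 ? 0 : 1;   /* fd 0 (stdin), just for demo */
    }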

If a process performs a blocking I/O operation, it goes onto a wait queue and moves back to the execution path (ready queue) only after the request completes. That is, it is removed from the scheduler's run queue and parked with a special state; it returns to the run queue only when it wakes from sleep because the resource it was waiting on has become available.
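
That special state is visible from userspace: the third field of /proc/<pid>/stat is S for an interruptible sleep and D for an uninterruptible one, which is the classic signature of a task stuck in I/O. A quick sketch to read it (error handling kept minimal):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Returns the one-character scheduler state of a process: R (running),
     * S (interruptible sleep), D (uninterruptible sleep - usually stuck
     * in I/O), Z (zombie), and so on. */
    char proc_state(int pid)
    {
        char path[64], comm[64], state = '?';
        FILE *f;

        snprintf(path, sizeof path, "/proc/%d/stat", pid);
        f = fopen(path, "r");
        if (f) {
            /* line format: pid (comm) state ... */
            if (fscanf(f, "%*d (%63[^)]) %c", comm, &state) != 2)
                state = '?';
            fclose(f);
        }
        return state;
    }

    int main(int argc, char **argv)
    {
        int pid = argc > 1 ? atoi(argv[1]) : getpid();
        printf("pid %d is in state %c\n", pid, proc_state(pid));
        return 0;
    }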

The immediate step should be to try 'strace'. It intercepts and records the system calls made by a process, together with the signals the process receives, so it can show you the order of events and where each call returned or resumed. This can take you very close to the problem area.
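
For example, attaching to the stalled process with 'strace -f -tt -T -p <pid>' follows its threads (-f), timestamps every call (-tt) and prints the time spent inside each call (-T), so an abnormally long write() or ioctl() stands out immediately.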

There are many other handy tools that can be tried, depending on your development environment and setup. Key tools are as below:

'iotop' - Provides a table of current I/O usage by processes or threads on the system, based on the I/O accounting information exported by the kernel.
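
Running 'iotop -b -o' (batch mode, showing only tasks actually doing I/O) is handy for logging across the window in which the stall happens.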

'LTTng' - Makes tracing of race conditions and interrupt cascades possible. It is the successor to LTT and combines kprobes, tracepoint and perf functionality.
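
A typical kernel-side session looks roughly like: 'lttng create stalldebug', 'lttng enable-event --kernel sched_switch,block_rq_issue', 'lttng start', reproduce the stall, then 'lttng stop' and 'lttng view'. The event names here are only examples; enable the subsystems you actually suspect.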

'Ftrace' - The Linux kernel's internal tracer, with which you can analyze and debug latency and performance related issues.
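
Assuming the latency tracers are compiled into your kernel, 'echo irqsoff > /sys/kernel/debug/tracing/current_tracer' records the longest path executed with interrupts disabled, and 'tracing_max_latency' in the same directory reports the worst case observed; the 'preemptoff' and 'wakeup' tracers work the same way.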

If your system is based on a TI processor, CCS (Code Composer Studio) with its trace analyzer provides the capability to perform non-intrusive debug and analysis of system activity. So, depending on your setup, you may also need to reach for the relevant vendor tool.

A few more ideas:
The magic SysRq key is another option in Linux. If a driver is stuck, the SysRq p command can take you to the exact routine that is causing the problem.
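
If no console is attached, the same commands can be issued through procfs: enable the interface with 'echo 1 > /proc/sys/kernel/sysrq', then 'echo p > /proc/sysrq-trigger' dumps the current CPU's registers and program counter to the kernel log, and 'echo t > /proc/sysrq-trigger' dumps a stack trace of every task.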

Profiling can tell you where exactly the kernel is spending its time. There are a couple of tools for this, readprofile and OProfile. OProfile is enabled by building the kernel with CONFIG_PROFILING and CONFIG_OPROFILE. The other option is to rebuild the kernel with profiling enabled, boot with profile=2 on the command line, and read the profile counters using the readprofile utility.
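
With profile=2 set, 'readprofile -m /boot/System.map | sort -nr | head' lists the kernel functions accumulating the most profiling ticks, and 'readprofile -r' resets the counters so you can capture just the window around a stall.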

'mpstat' can give 'the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request' via its %iowait column.
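
For example, 'mpstat -P ALL 1' prints these statistics for every CPU once a second; a %iowait spike that lines up with the stall points at storage rather than computation.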

Bursty writes to SD/USB stalling my time-critical apps on embedded Linux

For the record, there turned out to be two main aspects that seem to have eliminated the problem in all but the most extreme cases. This system is still in development and hasn't been thoroughly torture-tested yet but is working fairly well (touch wood).

The big win came from making the userland writer app multi-threaded. It is the calls to write() that block sometimes: other processes and threads still run. So long as I have a thread servicing the device driver and updating frame counts and other data to synchronise with other apps that are running, the data can be buffered and written out a few seconds later without breaking any deadlines. I tried a simple ping-pong double buffer first, but that wasn't enough; small buffers were overwhelmed and big ones just caused bigger pauses while the filesystem digested the writes. A pool of ten 1 MB buffers queued between threads is working well now.
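
A minimal sketch of that producer/writer split, assuming pthreads; the buffer count, the malloc-per-frame copy, and the stdout stand-in are illustrative, not the actual code:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NBUF  10                 /* pool depth, as described above */
    #define BUFSZ (1024 * 1024)      /* 1 MB per buffer */

    struct buf { char *data; size_t len; };

    static struct buf queue[NBUF];   /* ring of filled buffers */
    static int q_head, q_tail, q_count;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  q_nonfull  = PTHREAD_COND_INITIALIZER;

    /* Producer side, called from the time-critical thread. Blocks only
     * if the writer has fallen a full NBUF buffers behind. */
    static void submit_buffer(const char *data, size_t len)
    {
        pthread_mutex_lock(&q_lock);
        while (q_count == NBUF)
            pthread_cond_wait(&q_nonfull, &q_lock);
        queue[q_tail].data = malloc(len);   /* real code would recycle a pool */
        memcpy(queue[q_tail].data, data, len);
        queue[q_tail].len = len;
        q_tail = (q_tail + 1) % NBUF;
        q_count++;
        pthread_cond_signal(&q_nonempty);
        pthread_mutex_unlock(&q_lock);
    }

    /* Writer thread: the only place write() is called, so a filesystem
     * stall is absorbed here while everything else keeps running. */
    static void *writer_thread(void *arg)
    {
        int fd = *(int *)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (q_count == 0)
                pthread_cond_wait(&q_nonempty, &q_lock);
            struct buf b = queue[q_head];
            q_head = (q_head + 1) % NBUF;
            q_count--;
            pthread_cond_signal(&q_nonfull);
            pthread_mutex_unlock(&q_lock);

            if (write(fd, b.data, b.len) < 0)   /* may block for seconds */
                perror("write");
            free(b.data);
        }
        return NULL;
    }

    int main(void)
    {
        static char frame[BUFSZ];
        int fd = STDOUT_FILENO;      /* stand-in for the real media file */
        pthread_t tid;

        pthread_create(&tid, NULL, writer_thread, &fd);
        for (int i = 0; i < 3; i++) {
            memset(frame, 'a' + i, sizeof frame);  /* stand-in for encoder output */
            submit_buffer(frame, sizeof frame);
        }
        sleep(1);                    /* crude: let the writer drain for the demo */
        return 0;
    }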

The other aspect is keeping an eye on ultimate write throughput to physical media. For this I watch the Dirty: figure reported by /proc/meminfo. I have some rough-and-ready code to throttle the encoder if Dirty: climbs above a certain threshold, and it seems to vaguely work; more testing and tuning needed later. Fortunately I have lots of RAM (128 MB) to play with, giving me a few seconds to see my backlog building up and throttle down smoothly.
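
A sketch of the Dirty:-watching part, assuming the threshold is a plain number of kilobytes (the 20 MB figure below is made up for illustration, not a recommendation):

    #include <stdio.h>

    /* Returns the current "Dirty:" figure from /proc/meminfo in kB,
     * or -1 on error. */
    long dirty_kb(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];
        long kb = -1;

        if (!f)
            return -1;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
                break;
        fclose(f);
        return kb;
    }

    int main(void)
    {
        /* illustrative 20 MB threshold - tune against the real media speed */
        long kb = dirty_kb();
        printf("Dirty: %ld kB -> %s\n", kb,
               kb > 20 * 1024 ? "throttle encoder" : "full speed");
        return 0;
    }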

I'll try to remember to pop back and update this answer if I find I need to do anything else to deal with this issue. Thanks to the other answerers.


