Automatically Kill Processes That Consume Too Much Memory or Stall on Linux


For the first requirement, you might want to look into either using ulimit, or tweaking the kernel OOM-killer settings on your system.
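For concreteness, here is a minimal Python sketch of the ulimit approach, using the standard resource and subprocess modules (the program name is hypothetical):

import resource
import subprocess

def run_capped(cmd, max_bytes=512 * 2**20):
    # Cap the child's address space, the rough equivalent of `ulimit -v`.
    # Allocations past the cap fail inside the child instead of dragging
    # the whole machine into swap.
    def cap():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(cmd, preexec_fn=cap)

run_capped(["/usr/bin/some-leaky-program"])  # hypothetical command

On the OOM-killer side, you can make a process a preferred victim by writing a positive value to /proc/<pid>/oom_score_adj.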

Monitoring daemons exist for this sort of thing as well. God is a recent example.

What is an uninterruptible process?

An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.

To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.

While the process is sleeping in the system call, it can receive an asynchronous Unix signal (say, SIGTERM), and then the following happens:

  • The system call exits prematurely, and is set up to return -EINTR to user space.
  • The signal handler is executed.
  • If the process is still running, it gets the return value from the system call, and it can make the same call again.

Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
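Here is a minimal, self-contained Python sketch of that sequence; SIGALRM stands in for SIGTERM, and the read() blocks on a pipe that never receives data:

import os
import signal

def on_alarm(signum, frame):
    # Raising here aborts the interrupted system call instead of
    # letting the runtime transparently restart it.
    raise TimeoutError("read() interrupted by a signal")

signal.signal(signal.SIGALRM, on_alarm)
read_end, write_end = os.pipe()
signal.alarm(2)  # deliver SIGALRM in two seconds

try:
    os.read(read_end, 1024)  # sleeps in the read() system call
except TimeoutError as exc:
    print(exc)  # the call returned early; we could retry it here
finally:
    signal.alarm(0)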

On the other hand, some system calls are not allowed to be interrupted in this way. If the system call stalls for some reason, the process can remain indefinitely in this unkillable state.

LWN ran a nice article that touched on this topic in July.

To answer the original question:

  • How to prevent this from happening: figure out which driver is causing you trouble, and either stop using it, or become a kernel hacker and fix it. (The sketch after this list shows one way to spot processes stuck in this state.)

  • How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective way to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.
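For the first bullet, a small Python sketch (Linux-only, since it reads /proc) that lists processes currently stuck in uninterruptible sleep, which is often the quickest way to see what the offending driver is blocking:

import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        continue  # the process exited while we were scanning
    # The command name is parenthesized and may contain spaces,
    # so parse from the last ')'; the state letter follows it.
    rparen = data.rfind(")")
    comm = data[data.index("(") + 1:rparen]
    state = data[rparen + 2]
    if state == "D":  # D = uninterruptible sleep
        print(pid, comm)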

Finding a Perl memory leak

How do you know that it's a memory leak? I can think of many other reasons why the OS would kill a program.

The first question I would ask is "Does this program always work correctly from the command line?". If the answer is "No" then I'd fix these issues first.

On the other hand, if the answer is "Yes", I would investigate all the differences between running the program under cron and running it from the command line, to find out why it misbehaves only in the former.

Dask high memory usage when computing two values with common dependency

The way the array is constructed, every time a chunk is created it has to generate every column of the array. So one opportunity for optimization (if possible) is to generate/load the array in a way that allows for column-wise processing. This will reduce the memory load of a single task.

Another avenue for optimization is to explicitly specify the common dependencies; for example, dask.compute(df[['0', '1']].sum()) will run efficiently.

However, the more important point is that, by default, dask follows some rules of thumb on how to prioritize work (see the Dask documentation on scheduling policies). You have several options to intervene (not sure if this list is exhaustive): custom priorities, resource constraints, or modifying the compute graph (to allow workers to release memory from intermediate tasks without waiting for the final task to complete).

A simple way to modify the graph is to break down the dependency between the final sum figure and all the intermediate tasks by computing intermediate sums manually:

[results] = dask.compute([df["0"].map_partitions(sum), df["1"].map_partitions(sum)])

Note that results will be a list of two sublists, but it's trivial to calculate the sum of each sublist (trying to run sum on a delayed object would trigger computation, so it's more efficient to run sum after results are computed).
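Putting that together into a self-contained sketch (the small pandas frame is a hypothetical stand-in for the original array-backed data):

import dask
import dask.dataframe as dd
import pandas as pd

# Hypothetical stand-in for the original DataFrame.
pdf = pd.DataFrame({"0": range(1_000), "1": range(1_000)})
df = dd.from_pandas(pdf, npartitions=4)

# Per-partition sums are independent tasks, so a worker can release
# each partition as soon as its two small sums are extracted.
[results] = dask.compute([df["0"].map_partitions(sum),
                          df["1"].map_partitions(sum)])

# results holds two collections of per-partition sums; finish the
# reduction cheaply after compute().
total_0, total_1 = (sum(parts) for parts in results)
print(total_0, total_1)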

What should I do when Ubuntu freezes?


When a single program stops working:

When a program window stops responding, you can usually stop it by clicking the X-shaped close button at the top left of the window. That will generally result in a dialog box saying that the program is not responding (but you already knew that) and presenting you with the option to kill the program or to continue to wait for it to respond.

Sometimes this does not work as expected. If you can't close a window by normal means, you can hit Alt+F2, type xkill, and press Enter. Your mouse cursor will then turn into an X. Hover over the offending window and left-click to kill it; right-clicking will cancel and return your mouse to normal.

If your program is running from a terminal, on the other hand, you can usually halt it with Ctrl+C. If not, find the name and process ID of its command, and tell the program to end as soon as possible with kill [process ID here]. This sends the default signal, SIGTERM (15). If all else fails, send SIGKILL (9) as a last resort: kill -9 [process ID here]. Use SIGKILL only as a last resort, because the kernel terminates the process immediately, with no opportunity for cleanup. It does not even receive the signal; it simply ceases to exist.

(Killing a process with kill -9 always works if you have permission to kill it. In some special cases the process is still listed by ps or top (as a "zombie"): the program was killed, but its process table entry is kept until the parent reads its exit status via wait().)
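That escalation is easy to script. Here is a small Python sketch that sends SIGTERM first and falls back to SIGKILL only if the process is still alive after a grace period (you need permission to signal the target):

import os
import signal
import time

def terminate(pid, grace=5.0):
    os.kill(pid, signal.SIGTERM)        # polite request: signal 15
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)             # signal 0 only checks existence
        except ProcessLookupError:
            return True                 # it exited on its own
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)        # last resort: signal 9
    return False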

When the mouse stops working:

If the keyboard still works, press Alt+F2 and run gnome-terminal (or, if these fail to launch, press Alt+Ctrl+F1 and login with your username and password). From there you can troubleshoot things. I'm not going to get into mouse troubleshooting here, as I haven't researched it. If you just want to try restarting the GUI, run sudo service lightdm restart. This should bring down the GUI, which will then attempt to respawn, bringing you back to the login screen.

When you have an Intel Bay Trail CPU

See https://askubuntu.com/a/803649/225694.

When everything, keys and mouse and all, stop working:

First try the Magic SysRq method outlined in Phoenix's answer. If that doesn't work, press the Reset button on the computer case. If even that doesn't work, you'll just have to power-cycle the machine.

May you never reach this point.

Memory Allocation/Deallocation Bottleneck?

It's significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications write their own fixed-size block allocators (e.g., they ask the OS for memory 16 MB at a time and then parcel it out in fixed blocks of 4 KB, 16 KB, etc.) to avoid this issue.
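The idea carries over to any language. Here is a minimal Python sketch of such a fixed-size block allocator; the 16 MB arena and 4 KB blocks mirror the numbers above:

class FixedBlockAllocator:
    def __init__(self, arena_size=16 * 2**20, block_size=4 * 2**10):
        # One large allocation from the runtime/OS up front...
        self.arena = bytearray(arena_size)
        self.block_size = block_size
        # ...then parcel it out: a free list of block offsets makes
        # alloc and release O(1), with no fragmentation by construction.
        self.free_list = list(range(0, arena_size, block_size))

    def alloc(self):
        if not self.free_list:
            raise MemoryError("arena exhausted")
        offset = self.free_list.pop()
        return offset, memoryview(self.arena)[offset:offset + self.block_size]

    def release(self, offset):
        self.free_list.append(offset)

pool = FixedBlockAllocator()
offset, block = pool.alloc()
block[:4] = b"data"
pool.release(offset)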

In games I've seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or, with carefully written and optimized block allocators, as little as 5%. Given that a game has to run at a consistent sixty hertz, having it stall for 500 ms every now and then while a garbage collector runs isn't practical.

Performance optimization strategies of last resort

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobb's Journal article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds and the source code size was reduced by a factor of 4. My diagnostic tool was random stack sampling. The sequence of changes was as follows:

  • The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.

  • Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.

  • Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.

Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.

  • That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

  • More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.

  • Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.

  • Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.

  • Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

Total speedup factor: 43.6

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.

P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.

I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.
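For flavor, here is a crude in-process Python sketch of that random-pausing idea: sample a running thread's stack a handful of times and tally which call sites keep showing up (the busy-loop workload is purely illustrative):

import collections
import sys
import threading
import time
import traceback

def sample_stacks(thread, n_samples=20, interval=0.25):
    # Whatever call sites appear on most samples are where the time goes.
    hits = collections.Counter()
    for _ in range(n_samples):
        time.sleep(interval)
        frame = sys._current_frames().get(thread.ident)
        if frame is None:
            break  # the thread has finished
        for entry in traceback.extract_stack(frame):
            hits[(entry.filename, entry.lineno, entry.name)] += 1
    return hits.most_common(10)

def busy():  # stand-in workload
    return sum(i * i for i in range(10**8))

worker = threading.Thread(target=busy)
worker.start()
for site, count in sample_stacks(worker):
    print(count, site)
worker.join()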

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:

 /* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist)){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task);

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

 /* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask);
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn);

These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.
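ILST's internals aren't shown here, but the shape of the fix is easy to sketch with a toy singly linked list in Python: appending by walking to the tail costs O(n) per item (O(n²) overall), while collecting the items first and linking once is O(n):

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def append_walking(head, value):
    # The ILST_APPEND-style pattern: walk to the tail on every append.
    node = Node(value)
    if head is None:
        return node
    cur = head
    while cur.next is not None:
        cur = cur.next
    cur.next = node
    return head

def build_all_at_once(values):
    # The fix: collect items in an array first, then link in one pass.
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head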

I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED:
The source code, both original and redesigned, can be found in www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.

EDIT 2011/11/26:
There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.


