Checkpoint/Restart Using Core Dump in Linux

Checkpoint/restart using Core Dump in Linux

No, this is not possible in general without special support from the kernel. The kernel maintains a LOT of per-process state, such as the file descriptor table, IPC objects, etc.

If you were willing to make lots of simplifying assumptions, such as no open files, no open sockets, no living IPC objects, no shared memory regions, and more, then in theory it would be possible, but in practice I don't believe it's possible with Linux even with those concessions.

Is it possible to interrupt a process and checkpoint it to resume it later on?

In general terms, checkpointing a process is not entirely possible (because a process is not only an address space, but also has other resources likes file descriptors, and TCP/IP sockets ...).

In practice, you can use some checkpointing libraries like BLCR etc. With certain limiting conditions, you might be able to migrate a checkpoint image from one system to another one (very similar to the source one: same kernel, same versions of libraries & compilers, etc.).

Migrating images is also possible at the virtual machine level. Some of them are quite good for that.

You could also design and implement your software with your own checkpointing machinery. Then, you should think of using garbage collection techniques and terminology. Look also into Emacs (or Xemacs) unexec.c file (which is heavily machine dependent).

Some languages implementation & runtime have checkpointing primitives. SBCL (a free Common Lisp implementation) is able to save a core image and restart it later. SML/NJ is able to export an image. Squeak (a Smalltalk implementation) also has such ability.

As an other example of checkpointing, the GCC compiler is actually able to compile a single *.h header (into a pre-compiled header file which is a persistent image of GCC heap) by using persistence techniques.

Read more about orthogonal persistence. It is also a research subject. serialization is also relevant (and you might want to use textual formats à la JSON, YAML, XML, ...). You might also use hibernation techniques (on the whole system level).

Fork and core dump with threads

Are you familiar with process checkpoint-restart? In particular, CRIU? It seems to me it might provide an easy option for you.

I want to obtain a core dump of a running Linux process without interrupting the process [and] to somehow obtain the relevant data of the other, original threads.

Forget about not interrupting the process. If you think about it, a core dump has to interrupt the process for the duration of the dump; your true goal must therefore be to minimize the duration of this interruption. Your original idea of using fork() does interrupt the process, it just does so for a very short time.

Is the memory that contains all of the threads' stacks still available and accessible in the forked process?

No. The fork() only retains the thread that does the actual call, and the stacks for the rest of the threads are lost.

Here is the procedure I'd use, assuming CRIU is unsuitable:

Have a parent process that generates a core dump of the child process whenever the child is stopped. (Note that more than one consecutive stop event may be generated; only the first one until the next continue event should be acted on.)
You can detect the stop/continue events using waitpid(child,,WUNTRACED|WCONTINUED).
Optional: Use sched_setaffinity() to restrict the process to a single CPU, and sched_setscheduler() (and perhaps sched_setparam()) to drop the process priority to IDLE.
You can do this from the parent process, which only needs the CAP_SYS_NICE capability (which you can give it using setcap 'cap_sys_nice=pe' parent-binary to the parent binary, if you have filesystem capabilities enabled like most current Linux distributions do), in both the effective and permitted sets.
The intent is to minimize the progress of the other threads between the moment a thread decides it wants a snapshot/dump, and the moment when all threads have been stopped. I have not tested how long it takes for the changes to take effect -- certainly they only happen at the end of their current timeslices at the very earliest. So, this step should probably be done a bit beforehand.
Personally, I don't bother. On my four-core machine, the following SIGSTOP alone yields similar latencies between threads as a mutex or a semaphore does, so I don't see any need to strive for even better synchronization.
When a thread in the child process decides it wants to take a snapshot of itself, it sends a SIGSTOP to itself (via kill(getpid(), SIGSTOP)). This stops all threads in the process.
The parent process will receive the notification that the child was stopped. It will first examines /proc/PID/task/ to obtain the TIDs for each thread of the child process (and perhaps /proc/PID/task/TID/ pseudofiles for other information), then attaches to each TID using ptrace(PTRACE_ATTACH, TID). Obviously, ptrace(PTRACE_GETREGS, TID, ...) will obtain the per-thread register states, which can be used in conjunction with /proc/PID/task/TID/smaps and /proc/PID/task/TID/mem to obtain the per-thread stack trace, and whatever other information you're interested in. (For example, you could create a debugger-compatible core file for each thread.)
When the parent process is done grabbing the dump, it lets the child process continue. I believe you need to send a separate SIGCONT signal to let the entire child process continue, instead of just relying on ptrace(PTRACE_CONT, TID), but I haven't checked this; do verify this, please.

I do believe that the above will yield a minimal delay in wall clock time between the threads in the process stopping. Quick tests on AMD Athlon II X4 640 on Xubuntu and a 3.8.0-29-generic kernel indicates tight loops incrementing a volatile variable in the other threads only advance the counters by a few thousand, depending on the number of threads (there's too much noise in the few tests I made to say anything more specific).

Limiting the process to a single CPU, and even to IDLE priority, will drastically reduce that delay even further. CAP_SYS_NICE capability allows the parent to not only reduce the priority of the child process, but also lift the priority back to original levels; filesystem capabilities mean the parent process does not even have to be setuid, as CAP_SYS_NICE alone suffices. (I think it'd be safe enough -- with some good checks in the parent program -- to be installed in e.g. university computers, where students are quite active in finding interesting ways to exploit the installed programs.)

It is possible to create a kernel patch (or module) that provides a boosted kill(getpid(), SIGSTOP) that also tries to kick off the other threads from running CPUs, and thus try to make the delay between the threads stopping even smaller. Personally, I wouldn't bother. Even without the CPU/priority manipulation I get sufficient synchronization (small enough delays between the times the threads are stopped).

Do you need some example code to illustrate my ideas above?

Can you freeze a C/C++ process and continue it on a different host?

On modern systems, not from a core file, no you can't. For freezing and restoring an individual process on Linux, CryoPID and the new Kernel-based checkpoint and restart are in the works, but their abilities are currently quite limited. OpenVZ and other virtualization-like softwares can freeze and restore an entire system.

Restarting threads in forked process

Threads execution state is not only the data in stack. It is also set of CPU registers, which is lost.

do_fork() system call just don't copy any thread other from thread, which executes a syscall do_fork -> copy_process and there is a single call to copy_thread at line 1181

retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);

Checkpoint/Restart Using Core Dump in Linux