How to Use Ptrace(2) to Change Behaviour of Syscalls

Python3 Ptrace duplicate syscalls

After digging a bit in other threads here, I found that every syscall is supposed to appear twice, once before it was called, and another time after it was called.

So the solution is simply to add the syscall to the list only once every two iterations.
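The question itself was about Python, but to stay in the language of the other examples on this page, here is a minimal sketch of that bookkeeping in C. The names syscall_toggle and toggle_is_entry are made up for illustration, and the sketch assumes that every stop passed to it really is a syscall stop, in strict entry/exit alternation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: remembers whether the tracee is inside a syscall. */
struct syscall_toggle {
    int in_syscall;   /* 0: next stop is an entry; 1: next stop is an exit */
};

/* Call once per syscall stop. Returns 1 for an entry stop (record the
 * syscall) and 0 for the matching exit stop (skip it). */
int toggle_is_entry(struct syscall_toggle *t)
{
    assert(t != NULL);
    t->in_syscall = !t->in_syscall;
    return t->in_syscall;
}
```

Start with in_syscall set to 0. Other kinds of stops, such as the exec notification discussed in the -38 answer further down, must not be fed to the toggle, or entries and exits get swapped.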

Can't trace a subprocess's syscalls which calls execve using ptrace and seccomp

I have found a way to solve this problem. To set up the tracer for child processes as well, or at least to avoid the ENOSYS problem for sub-children, we can specify the PTRACE_O_TRACEFORK and PTRACE_O_TRACECLONE flags when setting the ptrace options, like this:

ptrace(PTRACE_SETOPTIONS, child, 0, PTRACE_O_TRACESECCOMP | PTRACE_O_TRACEFORK | PTRACE_O_TRACECLONE);

The reason why we need both is not easy to explain briefly. First, which syscalls exist on the system, and which of them programs actually use (usually through libc), depends on the architecture and on the libc implementation. The list may not even be complete: we may also have to track vfork and other ways of cloning (or spawning) a thread or a process (remember, threads are light-weight processes in Linux). What these options do is described in the man page:

PTRACE_O_TRACECLONE (since Linux 2.5.46)
Stop the tracee at the next clone(2) and automatically
start tracing the newly cloned process, which will
start with a SIGSTOP, or PTRACE_EVENT_STOP if
PTRACE_SEIZE was used. A waitpid(2) by the tracer will
return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8))
The PID of the new process can be retrieved with PTRACE_GETEVENTMSG. This option may not catch clone(2) calls in all cases. If the tracee calls clone(2) with the CLONE_VFORK flag, PTRACE_EVENT_VFORK will be delivered instead if PTRACE_O_TRACEVFORK is set; otherwise if the tracee calls clone(2) with the exit signal set to SIGCHLD, PTRACE_EVENT_FORK will be delivered if PTRACE_O_TRACEFORK is set.

The reason why it works in my case is that after simple cloning, seccomp rules were copied to the cloned process, but the tracer wasn't. By specifying these flags, the parent process becomes the tracer automatically for every child process, and so, as rules are copied, and tracer is specified, everything works like a charm.

NOTE
Because the parent process becomes the tracer of every descendant this way, you also need to wait for all children and sub-children, not only the process you spawned directly. To do this, pass -1 as the pid argument to waitpid or similar syscalls:

const pid_t childWaited = waitpid(-1, &status, 0);
// but not const pid_t result = waitpid(myChildPid, &status, 0);
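Once waitpid(-1, ...) reports stops from several processes, the tracer has to tell the fork/clone notifications apart from ordinary stops. The following is a minimal sketch of decoding the status value in the way the man page excerpt above describes; the function names ptrace_event_of and is_new_child_event are invented for this example.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Returns the PTRACE_EVENT_* code carried by a wait status, or 0 if the
 * status is not a SIGTRAP stop carrying an event. */
int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    /* Same test as status>>8 == (SIGTRAP | (EVENT<<8)), solved for EVENT. */
    return (status >> 16) & 0xffff;
}

/* Returns 1 if the stop announces a new child that is now traced too. */
int is_new_child_event(int status)
{
    int event = ptrace_event_of(status);

    return event == PTRACE_EVENT_FORK ||
           event == PTRACE_EVENT_VFORK ||
           event == PTRACE_EVENT_CLONE;
}
```

When is_new_child_event() returns 1, the PID of the new process can be fetched with PTRACE_GETEVENTMSG on the PID that waitpid returned, as the man page excerpt notes.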

Why does this ptrace program say syscall returned -38?

The code doesn't account for the notification of the exec from the child, and so ends up handling syscall entry as syscall exit, and syscall exit as syscall entry. That's why you see "syscall 12 returned" before "syscall 12 called", etc. (-38 is ENOSYS which is put into RAX as a default return value by the kernel's syscall entry code.)

As the ptrace(2) man page states:

PTRACE_TRACEME
Indicates that this process is to be traced by its parent. Any signal (except SIGKILL) delivered to this process will cause it to stop and its parent to be notified via wait(). Also, all subsequent calls to exec() by this process will cause a SIGTRAP to be sent to it, giving the parent a chance to gain control before the new program begins execution. [...]

You said that the original code you were running was "the same as this one except that I'm running execl("/bin/ls", "ls", NULL);". Well, it clearly isn't, because you're working with x86_64 rather than 32-bit and have changed the messages at least.

But, assuming you didn't change too much else, the first time the wait() wakes up the parent, it's not for syscall entry or exit - the parent hasn't executed ptrace(PTRACE_SYSCALL,...) yet. Instead, you're seeing this notification that the child has performed an exec (on x86_64, syscall 59 is execve).

The code incorrectly interprets that as syscall entry. Then it calls ptrace(PTRACE_SYSCALL,...), and the next time the parent is woken it is for a syscall entry (syscall 12), but the code reports it as syscall exit.

Note that in this original case, you never see the execve syscall entry/exit - only the additional notification - because the parent does not execute ptrace(PTRACE_SYSCALL,...) until after it happens.

If you do arrange the code so that the execve syscall entry/exit are caught, you will see the new behaviour that you observe. The parent will be woken three times: once for execve syscall entry (due to use of ptrace(PTRACE_SYSCALL,...)), once for execve syscall exit (also due to use of ptrace(PTRACE_SYSCALL,...)), and a third time for the exec notification (which happens anyway).

Here is a complete example (for x86 or x86_64) which takes care to show the behaviour of the exec itself by stopping the child first:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ptrace.h>
#include <sys/reg.h>

#ifdef __x86_64__
#define SC_NUMBER  (8 * ORIG_RAX)
#define SC_RETCODE (8 * RAX)
#else
#define SC_NUMBER  (4 * ORIG_EAX)
#define SC_RETCODE (4 * EAX)
#endif

static void child(void)
{
    /* Request tracing by parent: */
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);

    /* Stop before doing anything, giving parent a chance to catch the exec: */
    kill(getpid(), SIGSTOP);

    /* Now exec: */
    execl("/bin/ls", "ls", NULL);
}

static void parent(pid_t child_pid)
{
    int status;
    long sc_number, sc_retcode;

    while (1)
    {
        /* Wait for child status to change: */
        wait(&status);

        if (WIFEXITED(status)) {
            printf("Child exit with status %d\n", WEXITSTATUS(status));
            exit(0);
        }
        if (WIFSIGNALED(status)) {
            printf("Child exit due to signal %d\n", WTERMSIG(status));
            exit(0);
        }
        if (!WIFSTOPPED(status)) {
            printf("wait() returned unhandled status 0x%x\n", status);
            exit(0);
        }
        if (WSTOPSIG(status) == SIGTRAP) {
            /* Note that there are *three* reasons why the child might stop
             * with SIGTRAP:
             *  1) syscall entry
             *  2) syscall exit
             *  3) child calls exec
             */
            sc_number = ptrace(PTRACE_PEEKUSER, child_pid, SC_NUMBER, NULL);
            sc_retcode = ptrace(PTRACE_PEEKUSER, child_pid, SC_RETCODE, NULL);
            printf("SIGTRAP: syscall %ld, rc = %ld\n", sc_number, sc_retcode);
        } else {
            printf("Child stopped due to signal %d\n", WSTOPSIG(status));
        }
        fflush(stdout);

        /* Resume child, requesting that it stops again on syscall enter/exit
         * (in addition to any other reason why it might stop):
         */
        ptrace(PTRACE_SYSCALL, child_pid, NULL, NULL);
    }
}

int main(void)
{
    pid_t pid = fork();

    if (pid == 0)
        child();
    else
        parent(pid);

    return 0;
}

which gives something like this (this is for 64-bit - system call numbers are different for 32-bit; in particular execve is 11, rather than 59):

Child stopped due to signal 19
SIGTRAP: syscall 59, rc = -38
SIGTRAP: syscall 59, rc = 0
SIGTRAP: syscall 59, rc = 0
SIGTRAP: syscall 63, rc = -38
SIGTRAP: syscall 63, rc = 0
SIGTRAP: syscall 12, rc = -38
SIGTRAP: syscall 12, rc = 5324800
...

Signal 19 is the explicit SIGSTOP; the child stops three times for the execve as just described above; then twice (entry and exit) for other system calls.
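The three SIGTRAP reasons listed in the code comment above look identical to the parent. One way to tell them apart, not used in the example above, is to set PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEEXEC with PTRACE_SETOPTIONS: syscall stops are then reported as SIGTRAP | 0x80, and the exec notification arrives as a PTRACE_EVENT_EXEC stop. A minimal sketch of the resulting classification, assuming both options are set (enum stop_kind and classify_stop are hypothetical names):

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

enum stop_kind { STOP_NONE, STOP_SYSCALL, STOP_EXEC, STOP_SIGNAL };

/* Classifies a wait status, assuming the tracer set both
 * PTRACE_O_TRACESYSGOOD and PTRACE_O_TRACEEXEC on the tracee. */
enum stop_kind classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_NONE;                 /* exited, killed or continued */
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;              /* syscall entry or exit */
    if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
        return STOP_EXEC;                 /* exec notification */
    return STOP_SIGNAL;                   /* any other stop */
}
```

Entry and exit still have to be told apart by alternation, but the exec notification can no longer shift that alternation by one. Without PTRACE_O_TRACEEXEC the exec notification is a plain SIGTRAP and this sketch reports it as STOP_SIGNAL.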

If you're really interested in all the gory details of ptrace(), the best documentation I'm aware of is the README-linux-ptrace file in the strace source. As it says, the "API is complex and has subtle quirks"....

How to use PTRACE to get a consistent view of multiple threads?

I wrote a second test case. I had to add a separate answer, since it was too long to fit into the first one with example output included.

First, here is tracer.c:

#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <dirent.h>
#include <string.h>
#include <signal.h>
#include <errno.h>
#include <stdio.h>
#ifndef   SINGLESTEPS
#define   SINGLESTEPS 10
#endif

/* Similar to getline(), except gets process pid task IDs.
 * Returns positive (number of TIDs in list) if success,
 * otherwise 0 with errno set. */
size_t get_tids(pid_t **const listptr, size_t *const sizeptr, const pid_t pid)
{
    char     dirname[64];
    DIR     *dir;
    pid_t   *list;
    size_t   size, used = 0;

    if (!listptr || !sizeptr || pid < (pid_t)1) {
        errno = EINVAL;
        return (size_t)0;
    }

    if (*sizeptr > 0) {
        list = *listptr;
        size = *sizeptr;
    } else {
        list = *listptr = NULL;
        size = *sizeptr = 0;
    }

    if (snprintf(dirname, sizeof dirname, "/proc/%d/task/", (int)pid) >= (int)sizeof dirname) {
        errno = ENOTSUP;
        return (size_t)0;
    }

    dir = opendir(dirname);
    if (!dir) {
        errno = ESRCH;
        return (size_t)0;
    }

    while (1) {
        struct dirent *ent;
        int            value;
        char           dummy;

        errno = 0;
        ent = readdir(dir);
        if (!ent)
            break;

        /* Parse TIDs. Ignore non-numeric entries. */
        if (sscanf(ent->d_name, "%d%c", &value, &dummy) != 1)
            continue;

        /* Ignore obviously invalid entries. */
        if (value < 1)
            continue;

        /* Make sure there is room for another TID. */
        if (used >= size) {
            size = (used | 127) + 128;
            list = realloc(list, size * sizeof list[0]);
            if (!list) {
                closedir(dir);
                errno = ENOMEM;
                return (size_t)0;
            }
            *listptr = list;
            *sizeptr = size;
        }

        /* Add to list. */
        list[used++] = (pid_t)value;
    }
    if (errno) {
        const int saved_errno = errno;
        closedir(dir);
        errno = saved_errno;
        return (size_t)0;
    }
    if (closedir(dir)) {
        errno = EIO;
        return (size_t)0;
    }

    /* None? */
    if (used < 1) {
        errno = ESRCH;
        return (size_t)0;
    }

    /* Make sure there is room for a terminating (pid_t)0. */
    if (used >= size) {
        size = used + 1;
        list = realloc(list, size * sizeof list[0]);
        if (!list) {
            errno = ENOMEM;
            return (size_t)0;
        }
        *listptr = list;
        *sizeptr = size;
    }

    /* Terminate list; done. */
    list[used] = (pid_t)0;
    errno = 0;
    return used;
}

static int wait_process(const pid_t pid, int *const statusptr)
{
    int   status;
    pid_t p;

    do {
        status = 0;
        p = waitpid(pid, &status, WUNTRACED | WCONTINUED);
    } while (p == (pid_t)-1 && errno == EINTR);
    if (p != pid)
        return errno = ESRCH;

    if (statusptr)
        *statusptr = status;

    return errno = 0;
}

static int continue_process(const pid_t pid, int *const statusptr)
{
    int   status;
    pid_t p;

    do {

        if (kill(pid, SIGCONT) == -1)
            return errno = ESRCH;

        do {
            status = 0;
            p = waitpid(pid, &status, WUNTRACED | WCONTINUED);
        } while (p == (pid_t)-1 && errno == EINTR);

        if (p != pid)
            return errno = ESRCH;

    } while (WIFSTOPPED(status));

    if (statusptr)
        *statusptr = status;

    return errno = 0;
}

void show_registers(FILE *const out, pid_t tid, const char *const note)
{
    struct user_regs_struct regs;
    long                    r;

    do {
        r = ptrace(PTRACE_GETREGS, tid, &regs, &regs);
    } while (r == -1L && errno == ESRCH);
    if (r == -1L)
        return;

#if (defined(__x86_64__) || defined(__i386__)) && __WORDSIZE == 64
    if (note && *note)
        fprintf(out, "Task %d: RIP=0x%016lx, RSP=0x%016lx. %s\n", (int)tid, regs.rip, regs.rsp, note);
    else
        fprintf(out, "Task %d: RIP=0x%016lx, RSP=0x%016lx.\n", (int)tid, regs.rip, regs.rsp);
#elif (defined(__x86_64__) || defined(__i386__)) && __WORDSIZE == 32
    if (note && *note)
        fprintf(out, "Task %d: EIP=0x%08lx, ESP=0x%08lx. %s\n", (int)tid, regs.eip, regs.esp, note);
    else
        fprintf(out, "Task %d: EIP=0x%08lx, ESP=0x%08lx.\n", (int)tid, regs.eip, regs.esp);
#endif
}

int main(int argc, char *argv[])
{
    pid_t *tid = 0;
    size_t tids = 0;
    size_t tids_max = 0;
    size_t t, s;
    long   r;

    pid_t child;
    int   status;

    if (argc < 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COMMAND [ ARGS ... ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program executes COMMAND in a child process,\n");
        fprintf(stderr, "and waits for it to stop (via a SIGSTOP signal).\n");
        fprintf(stderr, "When that occurs, the register state of each thread\n");
        fprintf(stderr, "is dumped to standard output, then the child process\n");
        fprintf(stderr, "is sent a SIGCONT signal.\n");
        fprintf(stderr, "\n");
        return 1;
    }

    child = fork();
    if (child == (pid_t)-1) {
        fprintf(stderr, "fork() failed: %s.\n", strerror(errno));
        return 1;
    }

    if (!child) {
        prctl(PR_SET_DUMPABLE, (long)1);
        prctl(PR_SET_PTRACER, (long)getppid());
        fflush(stdout);
        fflush(stderr);
        execvp(argv[1], argv + 1);
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return 127;
    }

    fprintf(stderr, "Tracer: Waiting for child (pid %d) events.\n\n", (int)child);
    fflush(stderr);

    while (1) {

        /* Wait for a child event. */
        if (wait_process(child, &status))
            break;

        /* Exited? */
        if (WIFEXITED(status) || WIFSIGNALED(status)) {
            errno = 0;
            break;
        }

        /* At this point, only stopped events are interesting. */
        if (!WIFSTOPPED(status))
            continue;

        /* Obtain task IDs. */
        tids = get_tids(&tid, &tids_max, child);
        if (!tids)
            break;

        printf("Process %d has %d tasks,", (int)child, (int)tids);
        fflush(stdout);

        /* Attach to all tasks. */
        for (t = 0; t < tids; t++) {
            do {
                r = ptrace(PTRACE_ATTACH, tid[t], (void *)0, (void *)0);
            } while (r == -1L && (errno == EBUSY || errno == EFAULT || errno == ESRCH));
            if (r == -1L) {
                const int saved_errno = errno;
                while (t-->0)
                    do {
                        r = ptrace(PTRACE_DETACH, tid[t], (void *)0, (void *)0);
                    } while (r == -1L && (errno == EBUSY || errno == EFAULT || errno == ESRCH));
                tids = 0;
                errno = saved_errno;
                break;
            }
        }
        if (!tids) {
            const int saved_errno = errno;
            if (continue_process(child, &status))
                break;
            printf(" failed to attach (%s).\n", strerror(saved_errno));
            fflush(stdout);
            if (WIFCONTINUED(status))
                continue;
            errno = 0;
            break;
        }

        printf(" attached to all.\n\n");
        fflush(stdout);

        /* Dump the registers of each task. */
        for (t = 0; t < tids; t++)
            show_registers(stdout, tid[t], "");
        printf("\n");
        fflush(stdout);

        for (s = 0; s < SINGLESTEPS; s++) {
            do {
                r = ptrace(PTRACE_SINGLESTEP, tid[tids-1], (void *)0, (void *)0);
            } while (r == -1L && errno == ESRCH);
            if (!r) {
                for (t = 0; t < tids - 1; t++)
                    show_registers(stdout, tid[t], "");
                show_registers(stdout, tid[tids-1], "Advanced by one step.");
                printf("\n");
                fflush(stdout);
            } else {
                fprintf(stderr, "Single-step failed: %s.\n", strerror(errno));
                fflush(stderr);
            }
        }

        /* Detach from all tasks. */
        for (t = 0; t < tids; t++)
            do {
                r = ptrace(PTRACE_DETACH, tid[t], (void *)0, (void *)0);
            } while (r == -1 && (errno == EBUSY || errno == EFAULT || errno == ESRCH));
        tids = 0;
        if (continue_process(child, &status))
            break;
        if (WIFCONTINUED(status)) {
            printf("Detached. Waiting for new stop events.\n\n");
            fflush(stdout);
            continue;
        }
        errno = 0;
        break;
    }
    if (errno)
        fprintf(stderr, "Tracer: Child lost (%s)\n", strerror(errno));
    else
    if (WIFEXITED(status))
        fprintf(stderr, "Tracer: Child exited (%d)\n", WEXITSTATUS(status));
    else
    if (WIFSIGNALED(status))
        fprintf(stderr, "Tracer: Child died from signal %d\n", WTERMSIG(status));
    else
        fprintf(stderr, "Tracer: Child vanished\n");
    fflush(stderr);

    return status;
}

tracer.c executes the specified command, waiting for the command to receive a SIGSTOP signal. (tracer.c does not send it itself; you can either have the tracee stop itself, or send the signal externally.)

When the command has stopped, tracer.c attaches a ptrace to every thread, and single-steps one of the threads a fixed number of steps (SINGLESTEPS compile-time constant), showing the pertinent register state for each thread.

After that, it detaches from the command, and sends it a SIGCONT signal to let it continue its operation normally.

Here is a simple test program, worker.c, I used for testing:

#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

#ifndef   THREADS
#define   THREADS  2
#endif

volatile sig_atomic_t   done = 0;

void catch_done(int signum)
{
    done = signum;
}

int install_done(const int signum)
{
    struct sigaction act;

    sigemptyset(&act.sa_mask);
    act.sa_handler = catch_done;
    act.sa_flags = 0;
    if (sigaction(signum, &act, NULL))
        return errno;
    else
        return 0;
}

void *worker(void *data)
{
    volatile unsigned long *const counter = data;

    while (!done)
        __sync_add_and_fetch(counter, 1UL);

    return (void *)(unsigned long)__sync_or_and_fetch(counter, 0UL);
}

int main(void)
{
    unsigned long   counter = 0UL;
    pthread_t       thread[THREADS];
    pthread_attr_t  attrs;
    size_t          i;

    if (install_done(SIGHUP) ||
        install_done(SIGTERM) ||
        install_done(SIGUSR1)) {
        fprintf(stderr, "Worker: Cannot install signal handlers: %s.\n", strerror(errno));
        return 1;
    }

    pthread_attr_init(&attrs);
    pthread_attr_setstacksize(&attrs, 65536);
    for (i = 0; i < THREADS; i++)
        if (pthread_create(&thread[i], &attrs, worker, &counter)) {
            done = 1;
            fprintf(stderr, "Worker: Cannot create thread: %s.\n", strerror(errno));
            return 1;
        }
    pthread_attr_destroy(&attrs);

    /* Let the original thread also do the worker dance. */
    worker(&counter);

    for (i = 0; i < THREADS; i++)
        pthread_join(thread[i], NULL);

    return 0;
}

Compile both using e.g.

gcc -W -Wall -O3 -fomit-frame-pointer worker.c -pthread -o worker
gcc -W -Wall -O3 -fomit-frame-pointer tracer.c -o tracer

and run either in a separate terminal, or in the background, using e.g.

./tracer ./worker &

The tracer shows the PID of the worker:

Tracer: Waiting for child (pid 24275) events.

At this point, the child is running normally. The action starts when you send a SIGSTOP to the child. The tracer detects it, does the desired tracing, then detaches and lets the child continue normally:

kill -STOP 24275

Process 24275 has 3 tasks, attached to all.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a5d, RSP=0x00007f399cfa6ee8.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a5d, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a63, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a65, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a58, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a5d, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a63, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a65, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a58, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a5d, RSP=0x00007f399cfa6ee8. Advanced by one step.

Task 24275: RIP=0x0000000000400a5d, RSP=0x00007fff6895c428.
Task 24276: RIP=0x0000000000400a5d, RSP=0x00007f399cfb7ee8.
Task 24277: RIP=0x0000000000400a63, RSP=0x00007f399cfa6ee8. Advanced by one step.

Detached. Waiting for new stop events.

You can repeat the above as many times as you wish. Note that I picked the SIGSTOP signal as the trigger, because this way tracer.c is also useful as a basis for generating complex multithreaded core dumps per request (as the multithreaded process can simply trigger it by sending itself a SIGSTOP).

Here is the disassembly of the worker() function, in which all the threads are spinning in the above example:

0x400a50: eb 0b                 jmp          0x400a5d
0x400a52: 66 0f 1f 44 00 00     nopw         0x0(%rax,%rax,1)
0x400a58: f0 48 83 07 01        lock addq    $0x1,(%rdi)          = fourth step
0x400a5d: 8b 05 00 00 00 00     mov          0x0(%rip),%eax       = first step
0x400a63: 85 c0                 test         %eax,%eax            = second step
0x400a65: 74 f1                 je           0x400a58             = third step
0x400a67: 48 8b 07              mov          (%rdi),%rax
0x400a6a: 48 89 c2              mov          %rax,%rdx
0x400a6d: f0 48 0f b1 07        lock cmpxchg %rax,(%rdi)
0x400a72: 75 f6                 jne          0x400a6a
0x400a74: 48 89 d0              mov          %rdx,%rax
0x400a77: c3                    retq

Now, this test program only shows how to stop a process, attach to all of its threads, single-step one of the threads a desired number of instructions, and then let all the threads continue normally; it does not yet prove that the same applies for letting specific threads continue normally (via PTRACE_CONT). However, the detail I describe below indicates, to me, that the same approach should work fine for PTRACE_CONT.

The main problem or surprise I encountered while writing the above test programs was the necessity of the

long r;

do {
    r = ptrace(PTRACE_cmd, tid, ...);
} while (r == -1L && (errno == EBUSY || errno == EFAULT || errno == ESRCH));

loop, especially for the ESRCH case (the others I only added due to the ptrace man page description).

You see, most ptrace commands are only allowed when the task is stopped. However, the task is not stopped when it is still completing e.g. a single-step command. Thus, using the above loop -- perhaps adding a millisecond nanosleep or similar to avoid wasting CPU -- makes sure the previous ptrace command has completed (and thus the task stopped) before we try to supply the new one.
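As a sketch of that idea, the loop can be wrapped in a helper that sleeps one millisecond between attempts. The names is_transient_ptrace_errno and ptrace_retry are made up here, and the retry limit is my own addition (the loop shown above retries indefinitely), so treat it as one possible variant rather than the method used in tracer.c.

```c
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <time.h>

/* The errno values the loop above treats as "try again". */
int is_transient_ptrace_errno(int err)
{
    return err == EBUSY || err == EFAULT || err == ESRCH;
}

/* Hypothetical wrapper: repeats a ptrace request while it fails with a
 * transient error, at most max_tries times, pausing 1 ms in between.
 * Returns the last ptrace() result; errno is that of the last attempt. */
long ptrace_retry(int request, pid_t tid, void *addr, void *data, int max_tries)
{
    const struct timespec pause = { 0, 1000000L };   /* 1 millisecond */
    long r = -1L;
    int  i, saved_errno;

    for (i = 0; i < max_tries; i++) {
        errno = 0;
        r = ptrace(request, tid, addr, data);
        if (r != -1L || !is_transient_ptrace_errno(errno))
            break;
        saved_errno = errno;          /* nanosleep() may overwrite errno */
        nanosleep(&pause, NULL);
        errno = saved_errno;
    }
    return r;
}
```

A call such as ptrace_retry(PTRACE_DETACH, tid, NULL, NULL, 1000) would then replace the do/while loop. The bound keeps the tracer from spinning forever if the task has exited, in which case ESRCH never goes away.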

Kerrek SB, I do believe at least some of the troubles you've had with your test programs are due to this issue? To me, personally, it was a kind of a D'oh! moment to realize that of course this is necessary, as ptracing is inherently asynchronous, not synchronous.

(This asynchronicity is also the cause for the SIGCONT-PTRACE_CONT interaction I mentioned above. I do believe with proper handling using the loop shown above, that interaction is no longer a problem -- and is actually quite understandable.)

Adding to the comments to this answer:

The Linux kernel uses a set of task state flags in the task_struct structure (see include/linux/sched.h for definition) to keep track of the state of each task. The userspace-facing side of ptrace() is defined in kernel/ptrace.c.

When PTRACE_SINGLESTEP or PTRACE_CONT is called, kernel/ptrace.c:ptrace_continue() handles most of the details. It finishes by calling wake_up_state(child, __TASK_TRACED) (kernel/sched/core.c::try_to_wake_up(child, __TASK_TRACED, 0)).

When a process is stopped via SIGSTOP signal, all tasks will be stopped, and end up in the "stopped, not traced" state.
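This state can be observed from user space: the third field of /proc/PID/task/TID/stat is a state character, where 'T' means stopped by a signal and 't' means a tracing stop. Below is a minimal sketch for reading it; parse_stat_state and task_state are hypothetical helper names, not part of tracer.c.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Extracts the state character from the contents of a stat file. The
 * state follows the last ')' because the command name, which is printed
 * in parentheses, may itself contain ')'. Returns 0 on malformed input. */
char parse_stat_state(const char *line)
{
    const char *p = strrchr(line, ')');

    if (!p || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* Reads the state of one task of a process. Returns 0 on any error. */
char task_state(pid_t pid, pid_t tid)
{
    char   path[64], buf[512];
    FILE  *f;
    size_t n;

    snprintf(path, sizeof path, "/proc/%d/task/%d/stat", (int)pid, (int)tid);
    f = fopen(path, "r");
    if (!f)
        return 0;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_stat_state(buf);
}
```

Calling task_state() for every TID returned by get_tids() right after the SIGSTOP should report 'T' for each task; after a successful attach the tasks are expected to show 't' instead.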

Attaching to every task (via PTRACE_ATTACH or PTRACE_SEIZE, see kernel/ptrace.c:ptrace_attach()) modifies the task state. However, ptrace state bits (see include/linux/ptrace.h:PT_ constants) are separate from the task runnable state bits (see include/linux/sched.h:TASK_ constants).

After attaching to the tasks, and sending the process a SIGCONT signal, the stopped state is not immediately modified (I believe), since the task is also being traced. Doing PTRACE_SINGLESTEP or PTRACE_CONT ends up in kernel/sched/core.c::try_to_wake_up(child, __TASK_TRACED, 0), which updates the task state, and moves the task to the run queue.

Now, the complicated part, for which I haven't yet found the code path, is how the task state gets updated in the kernel when the task is next scheduled. My tests indicate that with single-stepping (which is yet another task state flag), only the task state gets updated, with the single-step flag cleared. It seems that PTRACE_CONT is not as reliable; I believe it is because the single-step flag "forces" that task state change. Perhaps there is a "race condition" wrt. the continue signal delivery and state change?

(Further edit: the kernel developers definitely expect wait() to be called, see for example this thread.)

In other words, after noticing that the process has stopped (note that you can use /proc/PID/stat or /proc/PID/status if the process is not a child, and not yet attached to), I believe the following procedure is the most robust one:

pid_t  pid, p; /* Process owning the tasks */
pid_t *tid;    /* Task ID array (task IDs are plain pid_t values) */
size_t tids;   /* Tasks */
long   result;
int    status;
size_t i;

for (i = 0; i < tids; i++) {
    while (1) {
        result = ptrace(PTRACE_ATTACH, tid[i], (void *)0, (void *)0);
        if (result == -1L && (errno == ESRCH || errno == EBUSY || errno == EFAULT || errno == EIO)) {
            /* To avoid burning up CPU for nothing: */
            sched_yield(); /* or nanosleep(), or usleep() */
            continue;
        }
        break;
    }       
    if (result == -1L) {
        /*
         * Fatal error. First detach from tid[0..i-1], then exit.
        */
    }
}

/* Send SIGCONT to the process. */
if (kill(pid, SIGCONT)) {
    /*
     * Fatal error, see errno. Exit.
    */
}

/* Since we are attached to the process,
 * we can wait() on it. */
while (1) {
    errno = 0;
    status = 0;
    p = waitpid(pid, &status, WCONTINUED);
