Low-Overhead Way to Access the Memory Space of a Traced Process

Injected mprotect system call into traced process fails with EFAULT

The problem is related to this question: i386 and x86_64 use different calling conventions for system calls. Your example code uses int 0x80, the i386 variant, but sets syscall_number = 10, which is the 64-bit syscall number for mprotect. In 32-bit environments, syscall 10 corresponds to unlink, according to this list, which can return EFAULT (Bad address).

On 64-bit platforms, using either the 32-bit or the 64-bit variant consistently, rather than a mixture of the two, solves the problem.
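
As an illustration, here is a minimal sketch of consistent 64-bit injection (my own example, not the asker's code, and assuming an x86_64 tracee that is already stopped under ptrace; error handling omitted): it pokes the x86_64 syscall instruction (bytes 0x0f 0x05) at the current rip, loads the registers according to the 64-bit convention (number in rax, arguments in rdi/rsi/rdx), and single-steps the tracee.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Inject mprotect(addr, len, prot) into an x86_64 tracee that is already
 * stopped under ptrace, using the 64-bit convention throughout. */
long inject_mprotect(pid_t pid, unsigned long addr, unsigned long len,
                     unsigned long prot)
{
    struct user_regs_struct saved, regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &saved);
    regs = saved;

    /* Save the word at rip and patch in "syscall" (bytes 0x0f 0x05). */
    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)regs.rip, NULL);
    long patched = (orig & ~0xffffL) | 0x050f;
    ptrace(PTRACE_POKETEXT, pid, (void *)regs.rip, (void *)patched);

    regs.rax = 10;   /* 10 = mprotect in the 64-bit syscall table */
    regs.rdi = addr;
    regs.rsi = len;
    regs.rdx = prot;
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);

    /* Execute just the injected instruction. */
    int status;
    ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
    waitpid(pid, &status, 0);

    /* Collect the result, then restore the original code and registers. */
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    long result = (long)regs.rax;
    ptrace(PTRACE_POKETEXT, pid, (void *)saved.rip, (void *)orig);
    ptrace(PTRACE_SETREGS, pid, NULL, &saved);
    return result;   /* 0 on success, -errno on failure */
}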

ptrace with request PTRACE_POKETEXT fails

Those strange bytes that you get from the second ptrace(PTRACE_PEEKTEXT, ...) should match the address of data; compare them with the value of &data.

Although the manual page of ptrace(2) shows the data argument as void *, for the PTRACE_POKETEXT request data holds the value to be written, not a pointer to it. By applying the address-of operator you actually poke the address of the value instead of the value itself. The correct invocation is as follows:

res = ptrace(PTRACE_POKETEXT, pid, (void *)regs.eip, (void *)data); // w/o &
if (res != 0) {
    // handle error
}
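
A quick way to sanity-check the fix is to read the word back after poking it; a minimal fragment (reusing pid, regs, and data from the question's code):

long check;
ptrace(PTRACE_POKETEXT, pid, (void *)regs.eip, (void *)data);
check = ptrace(PTRACE_PEEKTEXT, pid, (void *)regs.eip, NULL);
printf("wrote %#lx, read back %#lx\n", data, check); // values should match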

ptrace'ing of parent process

This question really interested me. So I wrote some code to try it out.

First, keep in mind that when tracing a process, the tracer becomes the parent for most purposes, except in name (i.e., getppid()). A snippet of the PTRACE_ATTACH section of the manual is helpful:

   PTRACE_ATTACH
       Attaches to the process specified in pid, making it a traced
       "child" of the calling process; the behavior of the child is as
       if it had done a PTRACE_TRACEME. The calling process actually
       becomes the parent of the child process for most purposes (e.g.,
       it will receive notification of child events and appears in
       ps(1) output as the child's parent), but a getppid(2) by the
       child will still return the PID of the original parent. The
       child is sent a SIGSTOP, but will not necessarily have stopped
       by the completion of this call; use wait(2) to wait for the
       child to stop. (addr and data are ignored.)

Now here is the code I wrote to test and verify that you can in fact ptrace() your parent (you can build it by dumping it into a file named blah.c and running make blah):

#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main()
{
    pid_t pid = fork();
    assert(pid != -1);
    int status;
    long readme = 0;
    if (pid)
    {
        readme = 42;
        printf("parent: child pid is %d\n", pid);
        assert(pid == wait(&status));
        printf("parent: child terminated?\n");
        assert(0 == status);
    }
    else
    {
        pid_t tracee = getppid();
        printf("child: parent pid is %d\n", tracee);
        sleep(1); // give parent time to set readme
        assert(0 == ptrace(PTRACE_ATTACH, tracee, NULL, NULL));
        assert(tracee == waitpid(tracee, &status, 0));
        printf("child: parent should be stopped\n");
        printf("child: peeking at parent: %ld\n",
               ptrace(PTRACE_PEEKDATA, tracee, &readme, NULL));
    }
    return 0;
}

Note that I'm exploiting the replication of the parent's virtual address space to know where to look. Also note that when the child terminates, I suspect there's an implicit detach that allows the parent to continue; I didn't investigate further.

How to intercept memory accesses/changes in the Hotspot JVM?

Effectively, you want to monitor changes to Java objects. Tracking memory changes at a level below the JVM is one option. Maximum precision could be achieved using

  • page write protection and a signal handler for generating write notifications (care must be taken not to interfere with the GC write barrier); see the sketch after this list
  • dynamic instrumentation using an instrumentation framework such as Valgrind (static instrumentation is not an option because it does not cover the JIT output)
  • virtualization based on a custom hypervisor
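
To make the first option concrete, here is a minimal, Linux-specific sketch of the page-protection technique (my own illustration, not production code; inside a JVM you would additionally have to coexist with the GC's own use of protection faults): a page is mapped read-only, the first write raises SIGSEGV, and the handler records the write and unprotects the page so the faulting store can be retried.

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesize;

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* Only async-signal-safe calls belong here; write(2) qualifies. */
    write(STDOUT_FILENO, "write detected\n", 15);
    /* Unprotect the faulting page so the store is re-executed. */
    char *fault_page = (char *)((unsigned long)si->si_addr & ~(pagesize - 1));
    mprotect(fault_page, pagesize, PROT_READ | PROT_WRITE);
}

int main(void)
{
    pagesize = sysconf(_SC_PAGESIZE);
    char *page = mmap(NULL, pagesize, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    page[0] = 1;   /* faults once; handler logs it and unprotects the page */
    page[1] = 2;   /* no fault: the page is writable now */
    return 0;
}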

For snapshotting, you could use

  • ptrace for process suspension and gaining access to process memory
  • fork-based asynchronous snapshots using custom code / core dumps (taking advantage of copy-on-write memory, the main process does not have to be suspended)
  • the maximum-precision implementation strategies listed above, in a relaxed version

The downside of those options is that you'd also be forced to track writes that are unrelated to the Java heap itself (JVM internals, garbage collection, monitors, libraries, ...); writes affecting the Java heap are only a subset of all writes taking place in the process at any given time. It would also be less straightforward to extract the actual Java objects from those process snapshots/dumps without actual JVM code.

A more favorable strategy is monitoring changes at the JVM level. There, maximum precision could be achieved using

  • bytecode instrumentation (doesn't cover JNI-based writes)

    • high-overhead approach: record every single write
    • low-overhead approach: add a write barrier that sets a flag whenever a write occurs and dump flagged objects at regular intervals
  • a custom OpenJDK build that includes your own monitoring layer

    • could take advantage of the garbage collector write barrier to identify changes

      • usually implemented by means of a flag set on every write, or
      • a flag that is only set on the first write, by write-protecting the memory page associated with an object and setting the flag in the resulting segmentation-fault handler

For snapshotting, you could use

  • custom heap snapshots based on JVMTI's IterateThroughHeap and/or FollowReferences
  • heap dumps triggered externally using JMX or internally:

    HotSpotDiagnosticMXBean mxbean = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);
    mxbean.dumpHeap("dump.hprof", true);
  • maximum precision implementation strategies in a relaxed version

The "right" approach depends on desired performance characteristics, target platform, portability (can it rely on a specific JVM implementation/version), and precision/resolution (snapshots/sampling [aggregating writes] vs. instrumentation [recording each individual write]).

In terms of performance, doing the monitoring at the JVM level tends to be more efficient as only the actual Java heap writes have to be taken into account. Integrating your monitoring solution into the VM and taking advantage of the GC write barrier could be a low-overhead solution, but would also be the least portable one (tied to a specific JVM implementation/version).

If you need to record each individual write, you have to go the instrumentation route and it will most likely turn out to have a significant runtime overhead. You cannot aggregate writes, so there's no optimization potential.

In terms of sampling/snapshotting, implementing a JVMTI agent could be a good compromise. It provides high portability (works with many JVMs) and high flexibility (the iteration and processing can be tailored to your needs, as opposed to relying on standard HPROF heap dumps).
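
As an illustration of that compromise, here is a skeletal JVMTI agent (my own sketch; error checking omitted) that walks the live heap with IterateThroughHeap and aggregates object counts and sizes. take_snapshot() is meant to be invoked from a live-phase event of your choosing, e.g. JVMTI_EVENT_DATA_DUMP_REQUEST.

#include <stdio.h>
#include <string.h>
#include <jvmti.h>

static jvmtiEnv *jvmti;

/* Called once per live object during the heap walk. */
static jint JNICALL visit(jlong class_tag, jlong size, jlong *tag_ptr,
                          jint length, void *user_data)
{
    jlong *totals = user_data;   /* totals[0] = objects, totals[1] = bytes */
    totals[0]++;
    totals[1] += size;
    return JVMTI_VISIT_OBJECTS;  /* keep iterating */
}

void take_snapshot(void)
{
    jlong totals[2] = {0, 0};
    jvmtiHeapCallbacks callbacks;
    memset(&callbacks, 0, sizeof(callbacks));
    callbacks.heap_iteration_callback = visit;
    (*jvmti)->IterateThroughHeap(jvmti, 0, NULL, &callbacks, totals);
    printf("heap snapshot: %lld objects, %lld bytes\n",
           (long long)totals[0], (long long)totals[1]);
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved)
{
    jvmtiCapabilities caps;
    (*vm)->GetEnv(vm, (void **)&jvmti, JVMTI_VERSION_1_2);
    memset(&caps, 0, sizeof(caps));
    caps.can_tag_objects = 1;    /* required by IterateThroughHeap */
    (*jvmti)->AddCapabilities(jvmti, &caps);
    return JNI_OK;
}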

Determining exactly what is pickled during Python multiprocessing

Multiprocessing isn't exactly a simple library, but once you're familiar with how it works, it's pretty easy to poke around and figure it out.

You usually want to start with context.py. This is where all the useful classes get bound depending on the OS and... well... the "context" you have active. There are four basic contexts: Fork, ForkServer, and Spawn for POSIX, plus a separate Spawn for Windows. Each of these in turn has its own Popen (called at start()) that launches a new process and handles the implementation-specific details.

popen_fork.py

Creating a process literally calls os.fork(), and then in the child arranges to run BaseProcess._bootstrap(), which sets up some cleanup stuff and then calls self.run() to execute the code you gave it. No pickling occurs to start a process this way, because the entire memory space gets copied (with some exceptions; see fork(2)).

popen_spawn_xxxxx.py

I am most familiar with Windows, but I assume the win32 and posix versions operate in a very similar manner. A new Python process is created with a simple crafted command-line string that includes a pair of pipe handles to read from and write to. The new process will import the __main__ module (generally equal to sys.argv[0]) in order to have access to all the needed references. Then it will execute a simple bootstrap function (from the command string) that attempts to read and un-pickle a Process object from the pipe it was created with. Once it has the Process instance (a new object which is a copy, not just a reference to the original), it will again arrange to call _bootstrap().

popen_forkserver.py

The first time a new process is created with the "forkserver" context, a new process will be "spawn"ed running a simple server (listening on a pipe) which handles new-process requests. Subsequent process requests all go to the same server (based on import mechanics and a module-level global for the server instance). New processes are then "fork"ed from that server in order to save the time of spinning up a new Python instance. These new processes however can't have any of the same (as in same object, not a copy) Process objects, because the Python process they were forked from was itself "spawn"ed. Therefore the Process instance is pickled and sent much like with "spawn". The benefits of this method include:

  • The process doing the forking is single-threaded, which avoids deadlocks.
  • The cost of spinning up a new Python interpreter is only paid once.
  • The memory consumption of the interpreter, and of any modules imported by __main__, can largely be shared because "fork" generally uses copy-on-write memory pages.


In all cases, once the split has occurred, you should consider the memory spaces totally separate; the only communication between them is via pipes or shared memory. Locks and Semaphores are handled by an extension library (written in C), but are basically named semaphores managed by the OS. Queue, Pipe, and multiprocessing.Manager use pickling to synchronize changes to the proxy objects they return. The newish multiprocessing.shared_memory uses a memory-mapped file or buffer to share data (managed by the OS like semaphores).

To address your concern:

the code may have a bug and an object which is supposed to be read-only is inadvertently modified, leading to it being pickled and transferred to other processes.

This only really applies to multiprocessing.Manager proxy objects, as everything else requires you to be very intentional about sending and receiving data, or uses some transfer mechanism other than pickling.

How to find all read-write memory address of a process in Linux/UNIX with C/C++ language?

Read the proc file like you would read a normal file, e.g.:

  FILE *filep = fopen("/proc/9322/maps", "r");
  int ch;   /* int, not char, so that EOF can be detected */
  while ((ch = fgetc(filep)) != EOF)
      putchar(ch);
  fclose(filep);
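
Since the question asks specifically for read-write regions, here is a hedged sketch of the next step (my own illustration): parse each maps line and keep only those whose permission field starts with "rw", with the field layout as documented in proc(5).

#include <stdio.h>

int main(void)
{
    // Substitute the pid of interest for "self".
    FILE *filep = fopen("/proc/self/maps", "r");
    if (filep == NULL)
        return 1;
    char line[512];
    while (fgets(line, sizeof line, filep)) {
        unsigned long start, end;
        char perms[8];
        // Each line starts with "start-end perms", e.g. "00400000-00452000 r-xp".
        if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3
                && perms[0] == 'r' && perms[1] == 'w')
            fputs(line, stdout);   // a read-write mapping
    }
    fclose(filep);
    return 0;
}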

