Does Linux Support Memory Isolation for Processes

Does Linux support memory isolation for processes?

Side note: As far as I know, this is a poorly documented topic given its importance as a security issue.

Too Long; Don't Read: A process's virtual address space is fully isolated from another. The Linux kernel has access to the whole memory as it runs in kernel mode. It provides system calls that allow a process, under certain circumstances (see Ptrace access mode checking below), to access the memory of another.

There are system calls in the Linux kernel that allow reading/writing memory of other process:

process_vm_readv() and process_vm_writev() (same manual page)
These system calls transfer data between the address space of the calling process ("the local process") and the process identified by pid ("the remote process"). The data moves directly between the address spaces of the two processes, without passing through kernel space.
The last sentence refers to what happens in kernel mode (the kernel actually copies between two physical addresses). The user mode cannot access other virtual address space. For technical details, take a look at the implementation patch.
Regarding the permissions needed:
Permission to read from or write to another process is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace().
ptrace()
The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers.

Regarding the permissions needed, according to ptrace() manual page:

Ptrace access mode checking
Various parts of the kernel-user-space API (not just ptrace() operations), require so-called "ptrace access mode" checks, whose outcome determines whether an operation is permitted (or, in a few cases, causes a "read" operation to return sanitized data). These checks are performed in cases where one process can inspect sensitive information about, or in some cases modify the state of, another process. The checks are based on factors such as the credentials and capabilities of the two processes, whether or not the "target" process is dumpable, and the results of checks performed by any enabled Linux Security Module (LSM)—for example, SELinux, Yama, or Smack—and by the commoncap LSM (which is always invoked).

Related stuff:

CAP_SYS_PTRACE capability. See capabilities manual page.
List with all manual pages to Linux kernel system calls.
Meltdown and Spectre vulnerabilities.

How Virtual Memory isolate different processes?

It is logical memory translation that separates processes; not virtual memory.

Processes see logical memory addresses and have no access to the underlying physical memory. Each process has tables that tell the CPU how to translate logical addresses to physical addresses. The operating system maintains these tables.

The location the tables are identified using protected hardware registers. When Process A switches out and Process B switches in, the operating system (assisted by the underlying hardware) changes the value of the registers so that B's tables are used. After that, the
logical address 0X800000 no longer refers to "A"s physical memory location and instead points to "B"'s.

Is it possible to free leaked memory from another process?

Assuming we are talking about Unix-like operating systems (this also applies to Windows and the majority of other modern operating systems)...

What is happening here? Why is it not possible to free the leaked memory?

First of all: every running process have it's own virtual address space (or VAS). This VAS is a way the operating system have to lay out and organize the physical memory between different processes. It ranges from 0x0 to 0xFFFFFFFF on 32 bits processors and contains all the memory of the process - its code, static data, stack, heap, etc, it is all in the process VAS. A virtual address (or VA) is a specific address inside the virtual address space.

When you allocate memory with malloc the system will search for a valid unallocated memory on the process heap and if it finds, return a pointer of it (i.e. malloc essentially returns a virtual address).

After the process ends its VAS will be automatically "freed" by the operating system, so that memory is no longer valid, or allocated in that matter. Asides that, each process have its own virtual address space. You can not directly access a process VAS (virtual address space) using a VA (virtual address) of another process - by doing that what you actually end up doing is trying to access that VA in the running process which in your example is very likely to result in an unhandled ACCESS_VIOLATION exception and crashes the process.

Memory Isolation in new Linux Kernels, or what?

A bit of troubleshooting shows the following:

Of course, none of userspace programs stopped using read(). They still keep calling it.
There is no "memory isolation". The syscalls table is succesfully modified during the module initialization and the pointer to sys_read() is successfully replaced with pointer to hacked_read_test().
When the module is loaded, the read() syscall works as if it was the original one.
The change in the behavior happened between kernels 4.16 and 4.16.2 (i.e. between April 1, 2018 and April 12, 2018).

Considering this, we have pretty narrow list of commits to check, and the changes are likely to be in the syscalls mechanism. Well, looks like this commit is what we are looking for (and few more around).

The crucial part of this commit is that it changes signatures of the functions defined by SYSCALL_DEFINEx so that they accept a pointer to struct pt_regs instead of syscall arguments, i.e. sys_read(unsigned int fd, char __user * buf, size_t count) becomes sys_read(const struct pt_regs *regs). This means, that hacked_read_test(unsigned int fd, char *buf, size_t count) is no longer a valid replacement for sys_read()!

So, with new kernels you replace sys_read(const struct pt_regs *regs) with hacked_read_test(unsigned int fd, char *buf, size_t count). Why this does not crash and instead works as if it was the original sys_read()? Consider the simplified version of hacked_read_test() again:

unsigned long hacked_read_test( unsigned int fd, char *buf, size_t count ) {
    if ( fd != 0 ) {
        return original_read( fd, buf, count );
    } else {
        // ...
    }
}

Well. The first function argument is passed via %rdi register. The caller of sys_read() places a pointer to struct pt_regs into %rdi and performs a call. The execution flow goes inside hacked_read_test(), and the first argument, fd, is checked for not being zero. Considering that this argument contains a valid pointer instead of file descriptor, this condition succeeds and the control flow goes directly to original_read(), which receives the fd value (i.e., actually, the pointer to struct pt_regs) as a first argument, which, in turn, then gets successfully used as it was originally meant to be. So, since kernel 4.16.2 your hacked_read_test() effectively works as follows:

unsigned long hacked_read_test( const struct pt_regs *regs ) {
    return original_read( regs );
}

To make sure about it, you can try the alternative version of hacked_read_test():

unsigned long hacked_read_test( void *ptr ) {    
    if ( ptr != 0 ) {
        info( "invocation of hacked_read_test(): 1st arg is %d (%p)", ptr, ptr );
        return original_read( ptr );
    } else {
        return -EINVAL;
    }
}

After compiling and insmoding this version, you get the following:

invocation of hacked_read_test(): 1st arg is 35569496 (00000000c3a0dc9e)

You may create a working version of hacked_read_test(), but it seems that the implementation will be platform-dependent, as you will have to extract the arguments from the appropriate register fields of regs (for x86_84 these are %rdi, %rsi and %rdx for 1st, 2nd and 3rd syscall arguments respectively).

The working x86_64 implementation is below (tested on kernel 4.19).

#include <asm/ptrace.h>

// ...

unsigned long ( *original_read ) ( const struct pt_regs *regs );

// ...

unsigned long hacked_read_test( const struct pt_regs *regs ) {
    unsigned int fd = regs->di;
    char *buf = (char*) regs->si;
    unsigned long r = 1;
    if ( fd != 0 ) { // fd == 0 --> stdin (sh, sshd)
        return original_read( regs );
    } else {
        icounter++;
        if ( icounter % 1000 == 0 ) {
            info( "test2 icounter = %ld\n", icounter );
            info( "strlen( debug_buffer ) = %ld\n", strlen( debug_buffer ) );
        }
        r = original_read( regs );
        strncat( debug_buffer, buf, 1 );
        if ( strlen( debug_buffer ) > BUFFER_SIZE - 100 )
            debug_buffer[0] = '\0';
        return r;
    }
}

OS memory isolation

Here are a few suggestions / hints, which are necessarily somewhat incomplete, as developing a from-scratch hypervisor is an involved task.

Make your hypervisor "multiboot-compliant" at first. This will enable it to reside as a typical entry in a bootloader configuration file, e.g., /boot/grub/menu.lst or /boot/grub/grub.cfg.

You want to set aside your 100MB at the top of memory, e.g., from 5.9GB up to 6GB. Since you mentioned Windows I'm assuming you're interested in the x86 architecture. The long history of x86 means that the first few megabytes are filled with all kinds of legacy device complexities. There is plenty of material on the web about the "hole" between 640K and 1MB (plenty of information on the web detailing this). Older ISA devices (many of which still survive in modern systems in "Super I/O chips") are restricted to performing DMA to the first 16 MB of physical memory. If you try to get in between Windows or Linux and its relationship with these first few MB of RAM, you will have a lot more complexity to wrestle with. Save that for later, once you've got something that boots.

As physical addresses approach 4GB (2^32, hence the physical memory limit on a basic 32-bit architecture), things get complex again, as many devices are memory-mapped into this region. For example (referencing the other answer), the IOMMU that Intel provides with its VT-d technology tends to have its configuration registers mapped to physical addresses beginning with 0xfedNNNNN.

This is doubly true for a system with multiple processors. I would suggest you start on a uniprocessor system, disable other processors from within BIOS, or at least manually configure your guest OS not to enable the other processors (e.g., for Linux, include 'nosmp'
on the kernel command line -- e.g., in your /boot/grub/menu.lst).

Next, learn about the "e820" map. Again there is plenty of material on the web, but perhaps the best place to start is to boot up a Linux system and look near the top of the output 'dmesg'. This is how the BIOS communicates to the OS which portions of physical memory space are "reserved" for devices or other platform-specific BIOS/firmware uses (e.g., to emulate a PS/2 keyboard on a system with only USB I/O ports).

One way for your hypervisor to "hide" its 100MB from the guest OS is to add an entry to the system's e820 map. A quick and dirty way to get things started is to use the Linux kernel command line option "mem=" or the Windows boot.ini / bcdedit flag "/maxmem".

There are a lot more details and things you are likely to encounter (e.g., x86 processors begin in 16-bit mode when first powered-up), but if you do a little homework on the ones listed here, then hopefully you will be in a better position to ask follow-up questions.

Does Linux Support Memory Isolation for Processes