How to Write-Protect Every Page in the Address Space of a Linux Process

Can I write-protect every page in the address space of a Linux process?

You receive ENOMEM from mprotect() if you call it on pages that aren't mapped.

Your best bet is to open /proc/self/maps, and read it a line at a time with fgets() to find all the mappings in your process. For each writeable mapping (indicated in the second field) that isn't the stack (indicated in the last field), call mprotect() with the right base address and length (calculated from the start and end addresses in the first field).

Note that you'll need to have your fault handler already set up at this point, because the act of reading the maps file itself will likely cause writes within your address space.
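A sketch of that approach (the function name and the dry_run flag are my own, for illustration; error handling is trimmed):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Walk /proc/self/maps with fgets() and write-protect every writable
 * mapping that is not the stack. With dry_run != 0 it only counts the
 * mappings it would touch. Returns the count, or -1 on error. */
int protect_writable_mappings(int dry_run)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];
    int count = 0;

    if (!maps)
        return -1;
    while (fgets(line, sizeof line, maps)) {
        unsigned long start, end;
        char perms[5];

        /* First field: start-end addresses. Second field: rwxp flags. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (perms[1] != 'w')          /* not writable */
            continue;
        if (strstr(line, "[stack]"))  /* last field names the stack */
            continue;
        count++;
        if (!dry_run)
            mprotect((void *)start, end - start, PROT_READ);
    }
    fclose(maps);
    return count;
}
```

With dry_run set, this just enumerates the mappings it would touch, which is a safe way to check the parsing before actually revoking write access.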

Sharing memory from kernel to user space by disabling the Write Protect Bit (CR0:16)

I experimented on my own computer, and to answer my own question: no, it is not possible to share the kernel's memory directly with a user-space process this way.

I did try to write my own kernel module based on Tempesta's stack-like region-based memory manager (pool.c):

[...]

static int __init
tfw_pool_init(void)
{
	printk(KERN_ALERT "HIJACK INIT\n");
	write_cr0(read_cr0() & ~0x10000);	/* clear CR0.WP (bit 16) */

	pg_cache = alloc_percpu(unsigned long [TFW_POOL_PGCACHE_SZ]);
	if (pg_cache == NULL)
		return -ENOMEM;

	printk(KERN_NOTICE "__tfw_pool_new = %p\n", __tfw_pool_new);
	printk(KERN_NOTICE "tfw_pool_alloc = %p\n", tfw_pool_alloc);
	printk(KERN_NOTICE "tfw_pool_realloc = %p\n", tfw_pool_realloc);
	printk(KERN_NOTICE "tfw_pool_free = %p\n", tfw_pool_free);
	printk(KERN_NOTICE "tfw_pool_destroy = %p\n", tfw_pool_destroy);

	return 0;
}

static void __exit
tfw_pool_exit(void)
{
	free_percpu(pg_cache);

	write_cr0(read_cr0() | 0x10000);	/* set CR0.WP again */
	printk(KERN_ALERT "MODULE EXIT\n");
}

module_init(tfw_pool_init);
module_exit(tfw_pool_exit);
MODULE_LICENSE("GPL");

Not only did calling the functions whose addresses were printed segfault, but my system also became highly unstable after loading the module, so DO NOT TRY THIS AT HOME.

Is an entire process's virtual address space split into pages?

First note that "pages" are simply regions of an address space. A region that is "non-pageable" (by which I assume you mean it cannot be swapped to disk) is still logically divided into pages, but the OS might implement a different policy on those pages.

The most common page size is 4096 bytes. Many architectures support use of multiple page sizes at the same time (e.g. 4K pages as well as 1MB pages). However, operating systems often stick with just one page size, since under most circumstances, the costs of managing multiple page sizes are much higher than the benefits this provides. Exceptions exist, but I don't think you need to worry about them.
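Because a page is just a fixed-size, aligned region of the address space, finding the page that contains an address is simple bit masking. A small sketch (the helper names are mine):

```c
#include <stdint.h>
#include <unistd.h>

/* Round an address down to the start of the page containing it.
 * Works for any power-of-two page size. */
uintptr_t page_base(uintptr_t addr, long pagesize)
{
    return addr & ~((uintptr_t)pagesize - 1);
}

/* The running system's page size, typically 4096 on x86 Linux. */
long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```

This masking is exactly what you need when computing base addresses to pass to mprotect(), which requires page-aligned arguments.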

Every virtual page has certain permissions attached to it, like whether it's readable, writeable, executable (varies depending on hardware support). The OS can use this to help enforce security, cache coherency (for shared memory), and swapping pages out of physical memory.

The .text, .bss and .data regions need not be known to the OS (though most OSes do know about them, for security and performance reasons).

The OS may not actually allocate memory for a stack/heap page until the first time that page is accessed. The OS may provide system calls to request more pages of heap/stack space. Some OSes provide shared memory or shared library functionality which leads to more regions appearing in the address space. Depends on the OS.

Setting protection bits for the whole address space

To avoid protecting the entire address space, only protect the pages in use. And, trap the system calls that change the address space (mmap, brk, possibly thread creation, etc) so you can protect those pages.

Note that your signal handler cannot run without being able to write to its stack (nor being able to execute its code), so there are some fundamental problems beyond just 64-bits being large.

Cleared RW (write protect) flag for PTEs of a process in kernel yet no segmentation fault on write

If you modify a PTE when it's still cached in the TLB, the effect of the modification may be unseen for a while (until the PTE gets evicted from the TLB and has to be reread from the page table).

You need to invalidate the TLB entry with the invlpg instruction (I'm assuming x86) after modifying the PTE, and it has to be done on all CPUs (a "TLB shootdown"). The kernel has dedicated helpers for this purpose; on Linux, see flush_tlb_page() and flush_tlb_mm_range().

Also it wouldn't hurt to double check that the compiler didn't reorder or throw away anything from the above code.

How does each process's private address space get mapped to physical addresses?

That is one of the jobs of the MMU (Memory Management Unit). Process A's 0x400000 might be physical address 0x12300000 while process B's 0x400000 might be physical address 0x32100000, for example. Simplified: when process A is running, the MMU table for the core it is running on replaces 0x004xxxxx with 0x123xxxxx. It is a direct replacement of the upper address bits. The size of the translated blocks can vary: one MMU may support multiple block sizes at the same time, and MMUs differ from one architecture to another. It could be a case of 0x0040 becoming 0x1230, 0x0041 becoming 0x1231, and so on.

It can also be that 0x0040 becomes 0x1111, 0x0041 becomes 0x2123, and 0x0042 becomes 0x1312: each block in the MMU can have a different virtual-to-physical replacement. Likewise, each block can carry different protection and other attributes. For code, the processor may offer a read-only attribute, and it should have at least a basic cache enable/disable attribute, since you want code cached but some data areas not. Marking the rest of the address space invalid traps the process if it strays outside its allowed regions.

For understanding purposes, assume there is one block size and some table (this table might live at physical address 0x40010000):

virtual     physical
...
0x00400000  0x12300000
0x00410000  0x12310000
0x00420000  0x12320000
...

for process A; the operating system created this table at some physical address.

If/when that core switches to process B, then ideally a single register redirects the MMU to some other physical address for the table (this table might live at physical address 0x40020000):

...
0x00400000 0x32100000
0x00410000 0x32110000
0x00420000 0x32120000
...

As an example.

So you can link every program for the same address space, but physically they are in separate memory that doesn't interfere with each other.

And if process A and process B are on different cores and/or behind different MMUs, each still has its own MMU table in front of it, but they can run concurrently.

Now I don't know the x86 implementation, and I expect that over the years different chips have had different MMUs, but in some architecture designs the MMU is per core: one process per core at a time, one MMU table for that core at a time. If the MMU sits further down and multiple cores share it, then there has to be some sort of process ID so that each access still lands on a unique set of entries per process.

MMUs these days make this virtual-to-physical mapping happen, control caching, and protect address spaces to keep the application from wandering through memory. If you think about it, this also greatly helps memory allocation and management. Say, in my made-up table above, the MMU actually operates on 64 Kbyte blocks. If I want to allocate 256 Kbytes of data, the operating system doesn't need to find a contiguous 256 Kbytes; it only needs to find four 64 Kbyte chunks, anywhere. It doesn't need to shuffle memory around as in the old days, or copy it somewhere and put it back before the previous owner of that memory wants to access it.

0x00500000 0x11210000
0x00510000 0x31230000
0x00520000 0x12120000
0x00530000 0x43210000

To the process, that looks like 256 Kbytes of linear memory, but in reality it is smaller chunks nowhere near each other.
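The block-replacement idea can be sketched as a simple lookup table. This toy translate() mirrors the 64 Kbyte example above (the names and the linear-search table are mine, purely for illustration; real MMUs do this in hardware):

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SHIFT 16            /* 64 Kbyte blocks, as in the tables above */
#define BLOCK_MASK  ((1u << BLOCK_SHIFT) - 1)

/* One translation entry: virtual block base -> physical block base. */
struct map_entry { uint32_t vbase, pbase; };

/* The 256 Kbyte "allocation": four scattered 64 Kbyte chunks. */
static const struct map_entry table[] = {
    { 0x00500000, 0x11210000 },
    { 0x00510000, 0x31230000 },
    { 0x00520000, 0x12120000 },
    { 0x00530000, 0x43210000 },
};

/* Translate a virtual address by replacing its block bits and
 * keeping the low 16 offset bits. Returns 0 if unmapped. */
uint32_t translate(uint32_t vaddr)
{
    uint32_t vblock = vaddr & ~BLOCK_MASK;
    uint32_t offset = vaddr & BLOCK_MASK;

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].vbase == vblock)
            return table[i].pbase | offset;
    return 0;   /* no entry: a real MMU would raise a fault here */
}
```

Consecutive virtual addresses that cross a block boundary land in physically unrelated chunks, which is exactly why the OS never needs a contiguous physical run.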

Is there a better way than parsing /proc/self/maps to figure out memory protection?

I do not know an equivalent of VirtualQuery on Linux. But some other ways to do it which may or may not work are:

  • you set up a signal handler trapping SIGBUS/SIGSEGV and go ahead with your read or write. If the memory is protected, your signal trapping code will be called. If not, your signal trapping code is not called. Either way you win.

  • you could track each call you make to mprotect and build a corresponding data structure that tells you whether a region is read- or write-protected. This works if you have access to all the code that uses mprotect.

  • you can monitor all the mprotect calls in your process by linking your code with a library redefining the function mprotect. You can then build the necessary data structure for knowing if a region is read or write protected and then call the system mprotect for really setting the protection.

  • you may try to use inotify to monitor /proc/self/maps for changes. I guess this one does not work, but it should be worth the try.

memory write protection against shared library


shared libraries such as glibc (on Linux) and kernel32.dll (on Windows) are physically shared among processes.

Correct, but with COW (copy on write) property. Once your process writes to a shared page, it gets a copy of the page that is no longer shared with any other process.

I think a malicious process could ... mess up every contents to crash all the other process sharing them.

No, it can't. It can only mess up contents of its own copy and crash itself.


