4GB/4GB Kernel VM Split

If I understand the article correctly, the kernel and userspace don't share the same address space after the patch. This, however, costs a page-table switch on every userspace/kernel transition.

Linux 3/1 virtual address split

The reason that kernel virtual space is a limiting factor on usable physical memory is that the kernel needs access to all physical memory, and the way it accesses physical memory is through kernel virtual addresses. The kernel doesn't use special instructions that allow direct access to physical memory locations; it has to set up page table entries for any physical ranges it wants to talk to.

In the "old style" scheme, the kernel set things up so that every process's page tables mapped virtual addresses from 0xC0000000 to 0xFFFFFFFF directly to physical addresses from 0x00000000 to 0x3FFFFFFF (these pages were marked so that they were only accessible in ring 0 - kernel mode). These are the "kernel virtual addresses". Under this scheme, the kernel could directly read and write any physical memory location without having to fiddle with the MMU to change the mappings.

Under the HIGHMEM scheme, the mappings from kernel virtual addresses to physical addresses aren't fixed - parts of physical memory are mapped in and out of the kernel virtual address space as the kernel needs access to that memory. This allows more physical memory to be used, but at the cost of having to constantly change the virtual-to-physical mappings, which is quite an expensive operation.
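
A rough sketch of what that mapping-in-and-out looks like from kernel code, using the classic kmap()/kunmap() interface (illustrative only, not a complete or current driver):

/* Sketch only: temporarily mapping a ZONE_HIGHMEM page inside the kernel.
 * kmap() installs a kernel virtual mapping for the page, kunmap() tears it
 * down again - exactly the "map in and out" cost described above. */
#include <linux/highmem.h>
#include <linux/string.h>

static void zero_highmem_page(struct page *page)
{
    void *vaddr = kmap(page);      /* create a temporary kernel mapping */
    memset(vaddr, 0, PAGE_SIZE);   /* now the page is addressable */
    kunmap(page);                  /* drop the mapping again */
}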

Why can't processes address 4GB of memory on 32-bit systems?

why in a 32-bit Windows system processes cannot address 4GB of memory

But they most certainly can. Your code just doesn't typically have the required access rights to address the upper portion of the address space. User-mode code runs at ring 3; to get to the upper part you need ring 0 access rights. Kernel mode.

Okay, that was a bit tongue-in-cheek; the operating system kernel and drivers that have ring 0 access are not typically thought of as being part of the process. Even though they logically are, they are mapped to the same addresses in every process. Technically it would have been possible to map pages dynamically as the process switches from ring 3 to ring 0 mode, but that would make kernel-mode transitions too expensive and cumbersome.

Intuitively: a file buffer that's filled by ReadFile() could then have an address that overlaps a chunk of operating system code or data. Worst case, it could overlap the file system driver code. Or, more likely, the file system cache. The required page flipping and double-copying would make the reading unpredictably slow. The simplest architectural choice, and the one made in 1992 when nobody was rich enough to afford a gigabyte of RAM, was to simply cut the address space in two so no overlap was ever possible.

It is otherwise a solved problem: 32-bit versions of Windows are getting rare, and a 32-bit process can address 4 gigabytes on the 64-bit version of Windows. It just needs an option bit in the EXE header, the one set by the /LARGEADDRESSAWARE option available in the linker and in editbin.exe.
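
A small Win32 sketch of what that header bit is: it checks whether the running image was linked with /LARGEADDRESSAWARE by inspecting the PE file header's Characteristics field (illustrative only):

#include <windows.h>
#include <stdio.h>

/* Check whether the current executable was linked /LARGEADDRESSAWARE by
 * reading the Characteristics field of its PE file header in memory. */
int main(void)
{
    BYTE *base = (BYTE *)GetModuleHandle(NULL);            /* image base */
    IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
    IMAGE_NT_HEADERS *nt  = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);

    if (nt->FileHeader.Characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)
        printf("large-address-aware: this 32-bit process can use up to 4 GB\n");
    else
        printf("not large-address-aware: limited to 2 GB of user address space\n");
    return 0;
}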

Why does Windows reserve 1 GB (or 2 GB) for its system address space?

Two different user processes have different virtual address spaces. Because the virtual↔physical address mappings are different, the TLB cache is invalidated when switching contexts from one user process to another. This is very expensive, as without the address already cached in the TLB, any memory access will result in a fault and a walk of the PTEs.

Syscalls involve two context switches: user→kernel, and then kernel→user. In order to speed this up, it is common to reserve the top 1GB or 2GB of virtual address space for kernel use. Because the virtual address space does not change across these context switches, no TLB flushes are necessary. This is enabled by a user/supervisor bit in each PTE, which ensures that kernel memory is only accessible while in kernel mode; userspace has no access even though the page tables are the same.
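
For reference, on 32-bit x86 these are ordinary flag bits in each PTE; the constants below are the standard architecture bits, while the helper function is just an illustration:

#include <stdint.h>
#include <stdio.h>

/* Low flag bits of a 32-bit x86 page-table entry (standard architecture bits). */
#define PTE_PRESENT  (1u << 0)   /* page is mapped */
#define PTE_WRITABLE (1u << 1)   /* writes allowed */
#define PTE_USER     (1u << 2)   /* accessible from ring 3; clear = kernel only */

static int user_can_access(uint32_t pte)
{
    /* Ring 3 needs both Present and User/Supervisor set. */
    return (pte & PTE_PRESENT) && (pte & PTE_USER);
}

int main(void)
{
    uint32_t kernel_pte = 0xC0100000u | PTE_PRESENT | PTE_WRITABLE; /* no PTE_USER */
    printf("user access to kernel page: %d\n", user_can_access(kernel_pte)); /* prints 0 */
    return 0;
}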

If there were hardware support for two separate TLBs, with one exclusively for kernel use, then this optimization would no longer be useful. However, if you have enough space to dedicate, it's probably more worthwhile to just make one larger TLB.

Linux on x86 once supported a mode known as "4G/4G split". In this mode, userspace has full access to the entire 4GB virtual address space, and the kernel also has a full 4GB virtual address space. The cost, as mentioned above, is that every syscall requires a TLB flush, along with more complex routines to copy data between user and kernel memory. This has been measured to impose up to a 30% performance penalty.


Times have changed since this question was originally asked and answered: 64-bit operating systems are now much more prevalent. In current OSes on x86-64, virtual addresses from 0 to 2^47 - 1 (0-128 TB) are allowed for user programs while the kernel permanently resides within virtual addresses from 2^47 × (2^17 - 1) to 2^64 - 1 (or from -2^47 to -1, if you treat addresses as signed integers).

What happens if you run a 32-bit executable on 64-bit Windows? You would think that all virtual addresses from 0 to 2^32 (0-4 GB) would easily be available, but in order to avoid exposing bugs in existing programs, 32-bit executables are still limited to 0-2 GB unless they are recompiled with /LARGEADDRESSAWARE. For those that are, they get access to 0-4 GB. (This is not a new flag; the same applied in 32-bit Windows kernels running with the /3GB switch, which changed the default 2G/2G user/kernel split to 3G/1G, although of course 3-4 GB would still be out of range.)

What sorts of bugs might there be? As an example, suppose you are implementing quicksort and have two pointers, a and b, pointing at the start and just past the end of an array. If you choose the middle as the pivot with (a+b)/2, it'll work as long as both addresses are below 2 GB, but if they are both above, the addition overflows and the result lands outside the array. (The correct expression is a+(b-a)/2.)
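
A minimal demonstration of that midpoint bug, using 32-bit unsigned values standing in for addresses above the 2 GB line (the addresses are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Demonstration of the midpoint bug with 32-bit address arithmetic.
 * The addresses are made-up examples above the 2 GB line. */
int main(void)
{
    uint32_t a = 0x90000000u;              /* start of array, above 2 GB */
    uint32_t b = 0x90001000u;              /* one past the end */

    uint32_t bad  = (a + b) / 2;           /* a+b wraps past 2^32: 0x10000800 */
    uint32_t good = a + (b - a) / 2;       /* stays inside the array: 0x90000800 */

    printf("bad  midpoint: 0x%08x\n", bad);
    printf("good midpoint: 0x%08x\n", good);
    return 0;
}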

As an aside, 32-bit Linux, with its default 3G/1G user/kernel split, has historically run programs with their stack located in the 2-3 GB range, so any such programming errors would likely have been flushed out quickly. 64-bit Linux gives 32-bit programs access to 0-4 GB.

Infinite loop malloc in a 32-bit kernel with 4 GB RAM and 10 GB swap partition

Why do you think OOM will kill the process? In your example you will exhaust your address space before you use up all of your real memory. So in your example (32-bit) you will be able to allocate around 3 GB of address space (minus text/data/stack and other segments, and also possible memory fragmentation) and then the system will just return ENOMEM.

If you're on a 64-bit system, things get more interesting. Now the result will heavily depend on whether you actually use this memory or not. If you're not using it (just allocating), then thanks to overcommit (if, of course, you don't have it disabled) you will be able to allocate huge amounts of address space, but if you try to use it, that's where (around the 10+4 GB boundary) you will trigger the OOM killer, which will kill the program.

Update

It's actually quite easy to check with a program like this:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    /* 10M pointer slots x 1 MB chunks = 10 TB of requested allocations */
    static char *allocs[10 * 1024 * 1024];
    static unsigned int size;
    unsigned int i;
    unsigned int const chunk = 1024 * 1024;

    /* First phase: allocate 1 MB chunks until malloc() fails */
    for (i = 0; i < (sizeof(allocs)/sizeof(allocs[0])); i++)
        if ((allocs[i] = malloc(chunk)) == NULL) {
            printf("failed to allocate after %u MB: %s\n", i, strerror(errno));
            break;
        }

    size = i;
    if (size == (sizeof(allocs)/sizeof(allocs[0])))
        printf("allocated %u MB successfully!\n", size);

    /* Second phase: touch every allocated chunk so it is actually backed */
    for (i = 0; i < size; i++) {
        memset(allocs[i], 1, chunk);
        if (i % 100 == 0)
            printf("memset %u MB\n", i);
    }
    return 0;
}

It tries to allocate 10M memory chunks, each 1 MB in size, so effectively that's 10 TB. On a 32-bit system with 4 GB of RAM it gives this result (shortened a bit):

failed to allocate after 3016 MB: Cannot allocate memory
memset 0 MB
memset 100 MB
...
memset 2900 MB
memset 3000 MB

As expected, there is just not enough address space, and adding swap (I've done my tests with 5 GB instead of 10) doesn't help in any way.

With 64-bit it behaves a bit unexpectedly (this is without swap, just 4 GB of RAM) in that it doesn't output anything and just gets killed by the OOM killer while still doing allocations. But that's explainable by looking at the OOM killer log:

[  242.827042] [ 5171]  1000  5171 156707933   612960  306023        0             0 a.out
[ 242.827044] Out of memory: Kill process 5171 (a.out) score 905 or sacrifice child
[ 242.827046] Killed process 5171 (a.out) total-vm:626831732kB, anon-rss:2451812kB, file-rss:28kB

So it was able to allocate around 620 GB (!) of virtual memory with about 2.4 GB really mapped. Remember also that we're allocating with malloc(), so the C library has to do its own housekeeping, and even if you estimate it to be around 0.5% overhead with such big allocations, you get substantial numbers just for that (to be free of that overhead you can try using mmap() - see the sketch after the outputs below). At this stage enough memory is actually in use that the system runs short of it, so the process gets killed while still doing allocations. If you make it less greedy and change allocs to [100 * 1024] (which is just 100 GB effectively), you can easily see the effect of swap, because without it on a 4 GB system you have:

allocated 102400 MB successfully!
memset 0 MB
memset 100 MB
...
memset 2800 MB
memset 2900 MB
Killed

And with 5 GB of swap added:

allocated 102400 MB successfully!
memset 0 MB
memset 100 MB
...
memset 7900 MB
memset 8000 MB
Killed

All as expected.
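
As mentioned above, malloc()'s own bookkeeping adds overhead on top of the requested chunks. Here is a sketch of the same experiment using anonymous mmap() instead, so nothing but the chunks themselves is allocated (assumes a Linux/POSIX system):

#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Same experiment, but reserving each 1 MB chunk with an anonymous mmap()
 * instead of malloc(), so no allocator bookkeeping is added on top. */
int main(void)
{
    const size_t chunk = 1024 * 1024;
    size_t mb;

    for (mb = 0; ; mb++) {
        void *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            printf("mmap failed after %zu MB: %s\n", mb, strerror(errno));
            break;
        }
        memset(p, 1, chunk);          /* actually touch the memory */
        if (mb % 100 == 0)
            printf("mapped and touched %zu MB\n", mb);
    }
    return 0;
}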

How does the linux kernel manage less than 1GB physical memory?

Not every virtual (linear) address must be mapped to anything. If code accesses an unmapped page, a page fault is raised.

A physical page can be mapped to several virtual addresses simultaneously.

In the 4 GB of virtual memory there are two sections: 0x0 .. 0xBFFFFFFF is process (user) virtual memory and 0xC0000000 .. 0xFFFFFFFF is kernel virtual memory.

  • How can the kernel map 896 MB from only 512 MB?

It maps up to 896 MB. So, if you have only 512 MB, only 512 MB will be mapped.

If your physical memory spans 0x00000000 to 0x20000000, it will be mapped for direct kernel access at virtual addresses 0xC0000000 to 0xE0000000 (a linear mapping).

  • What about user mode processes in this situation?

Physical memory for user processes will be mapped (not sequentially, but with an essentially arbitrary page-to-page mapping) to virtual addresses 0x0 .. 0xBFFFFFFF. For pages below 896 MB this is a second mapping, in addition to the kernel's direct mapping. The pages are taken from the free page lists.

  • Where are user mode processes in phys RAM?

Anywhere.

  • Every article explains only the situation, when you've installed 4 GB of memory and the

No. Every article explains how 4 GB of virtual address space is mapped. The virtual address space is always 4 GB for a 32-bit machine (extensions like PAE/PSE on x86 add physical address bits, not virtual ones).

As stated in section 8.1.3, Memory Zones, of the book Linux Kernel Development by Robert Love (I'm using the third edition), there are several zones of physical memory:

  • ZONE_DMA - Contains page frames of memory below 16 MB
  • ZONE_NORMAL - Contains page frames of memory at and above 16 MB and below 896 MB
  • ZONE_HIGHMEM - Contains page frames of memory at and above 896 MB

So, if you have 512 MB, your ZONE_HIGHMEM will be empty, ZONE_DMA will hold the first 16 MB, and ZONE_NORMAL will have the remaining 496 MB of physical memory mapped.

Also, take a look at the section 2.5.5.2, Final kernel Page Table when RAM size is less than 896 MB, of the book. It covers the case where you have less than 896 MB of memory.

Also, for ARM there is a description of the virtual memory layout: http://www.mjmwired.net/kernel/Documentation/arm/memory.txt

Line 63 there, PAGE_OFFSET .. high_memory-1, is the direct-mapped part of memory.

64-bit DLL entry point override

You have several options. Unfortunately, you can only pick two out of these three: 100% solid, easy to implement, cheap.

There is a very high likelihood that you'll find unused space at the end of the .text section. This is because Windows maps image sections into memory in 4 KB chunks, and the .text section typically isn't an exact multiple of that.
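
A sketch of how you could measure that slack for a loaded module by walking its in-memory PE section headers; it reports the padding between each section's used bytes and the next alignment boundary (illustrative only):

#include <windows.h>
#include <stdio.h>

/* Report the slack between the end of each section's used bytes and the next
 * section-alignment boundary in a loaded module - the "unused space at the
 * end of .text" mentioned above. Illustrative sketch only. */
int main(void)
{
    BYTE *base = (BYTE *)GetModuleHandle(NULL);
    IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
    IMAGE_NT_HEADERS *nt  = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);
    IMAGE_SECTION_HEADER *sec = IMAGE_FIRST_SECTION(nt);
    DWORD align = nt->OptionalHeader.SectionAlignment;
    WORD i;

    for (i = 0; i < nt->FileHeader.NumberOfSections; i++, sec++) {
        DWORD used  = sec->Misc.VirtualSize;
        DWORD slack = (align - used % align) % align;
        printf("%-8.8s used %6lu bytes, %4lu bytes of padding to next section\n",
               (char *)sec->Name, (unsigned long)used, (unsigned long)slack);
    }
    return 0;
}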

Another option that is easy to implement is to use the PE header. A very safe area to overwrite is the DOS stub. The problem there is that there is no guarantee the PE header is in the same section as the entry routine (the Microsoft linker puts it in the same section, though; I don't know about GNU or the others).

Another easy option, which only works for system DLLs, is to do what 'hot patching' does and reuse the padding bytes set to 'nop' in front of each function together with the 'mov edi,edi' instruction at its start. This is the case for all DLLs released with Windows, to support hot patching.

The reliable-but-hard option is to do what @David Heffernan suggests. This technique is called a 'landing function': you copy the first 12 bytes of the target into a landing function, which then jmps to the remainder of the original function.
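
A sketch of the 12-byte x64 patch such a hook typically writes over the start of the target (mov rax, imm64 followed by jmp rax); the original 12 bytes must already have been copied to the landing function, and error handling is omitted:

#include <windows.h>
#include <string.h>

/* Sketch: build the 12-byte x64 patch "mov rax, target; jmp rax" that
 * overwrites the start of the hooked function. The original 12 bytes must
 * first be copied into the landing function, as described above. */
void write_jump_patch(void *func, void *target)
{
    unsigned char patch[12] = { 0x48, 0xB8,             /* mov rax, imm64 */
                                0, 0, 0, 0, 0, 0, 0, 0, /* target address  */
                                0xFF, 0xE0 };           /* jmp rax         */
    DWORD old;

    memcpy(&patch[2], &target, sizeof(target));
    VirtualProtect(func, sizeof(patch), PAGE_EXECUTE_READWRITE, &old);
    memcpy(func, patch, sizeof(patch));
    VirtualProtect(func, sizeof(patch), old, &old);
    FlushInstructionCache(GetCurrentProcess(), func, sizeof(patch));
}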

The easy and reliable option is to use Microsoft Detours. Detours is a product from Microsoft Research that does exactly this; it works great, it is supported, it takes care of a bunch of corner cases and race conditions that can pop up (along with other stuff), and its x86 version is open source. The downside is that commercial usage is very expensive - last time I checked it was 10k.
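
For reference, the canonical Detours attach/detach pattern looks roughly like this (Sleep() is hooked here purely as an example; requires the Detours headers and library):

#include <windows.h>
#include <detours.h>   /* requires the Detours package and detours.lib */

/* Roughly the standard Detours attach/detach sequence from the samples;
 * Sleep() is hooked here purely as an example target. */
static VOID (WINAPI *Real_Sleep)(DWORD) = Sleep;

static VOID WINAPI Hooked_Sleep(DWORD ms)
{
    /* ...extra work could go here... */
    Real_Sleep(ms);     /* call through to the original */
}

int main(void)
{
    DetourTransactionBegin();
    DetourUpdateThread(GetCurrentThread());
    DetourAttach((PVOID *)&Real_Sleep, (PVOID)Hooked_Sleep);  /* install hook */
    DetourTransactionCommit();

    Sleep(10);          /* now routed through Hooked_Sleep */

    DetourTransactionBegin();
    DetourUpdateThread(GetCurrentThread());
    DetourDetach((PVOID *)&Real_Sleep, (PVOID)Hooked_Sleep);  /* remove hook */
    DetourTransactionCommit();
    return 0;
}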

Each process has its own kernel stack, right?

Application code is loaded into memory (from the executable file) by the kernel, but the kernel doesn't disassemble it. So the kernel cannot tell whether the code is short, whether it uses system calls, and so on.

Because of that, the kernel needs to create a full execution context for every application, so allocating a kernel stack is needed in any case.

Note also that a system call is not the only case in which the kernel executes code in the context of an application's process. Preemption of the process and exception handling are also performed by the kernel and require a kernel stack.


