Linux Process Memory Scheme

What does the Kernel Virtual Memory of each process contain?

  1. When a system uses virtual memory, the kernel uses virtual memory as well. Windows will use the upper 2GB (or 1GB if you've specified the /3GB switch in the Windows bootloader) for its own use. This includes kernel code, data (or at least the data that is paged in -- that's right, Windows can page out portions of the kernel address space to the hard disk), and page tables.

  2. Each process has its own VM address space. When a process switch occurs, the page tables of the outgoing process are exchanged for those of the incoming one. This is simple to do on an x86 processor: changing the page table base address in the CR3 control register suffices, and the entire 4GB address space is instantly replaced by a completely different one. Having said that, there will typically be regions of address space that are shared between processes. Those regions are marked in the page tables with special flags that indicate to the processor that those entries do not need to be invalidated in the processor's translation lookaside buffer.

  3. As I mentioned earlier, the kernel's code, data, and the page tables themselves need to be located somewhere. This information is located in the kernel address space. It is possible that certain parts of the kernel's code, data, and page tables can themselves be swapped out to disk as needed. Some portions are deemed more critical than others and are never swapped out at all.

  4. See (3)

  5. It depends. User-mode shared memory is located in the user-mode address space. Parts of the kernel-mode address space might very well be shared between processes as well; for example, it would not be uncommon for the kernel's code to be shared between all processes in the system. Note that the virtual address of a shared region need not be the same in every process. I'm using arbitrary addresses here, but shared memory located at 0x100000 in one process might be located at 0x101000 inside another process. Two pages in different address spaces, at completely different virtual addresses, can point to the same physical memory (see the sketch after this list).

  6. I'm not sure what you mean here. Open file handles are not global to all processes. The file system stored on the hard disk is global to all processes. Under Windows, file handles are managed by the kernel, and the objects are stored in the kernel address space and managed by the kernel object manager.

  7. For Windows NT based systems, I'd recommend Windows Internals, 5th edition, by Mark Russinovich and David Solomon.
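To make point 5 concrete, here is a minimal sketch using POSIX shared memory (the object name /demo_page is arbitrary, error handling is mostly omitted, and older glibc may need -lrt). It maps the same underlying page at two different virtual addresses within one process; the same thing happens when two separate processes map the same object:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Create a one-page shared memory object. */
        int fd = shm_open("/demo_page", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, 4096) < 0)
            return 1;

        /* Two mappings of the same object: two virtual addresses, one physical page. */
        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(a, "hello");                                     /* write through one mapping... */
        printf("a=%p b=%p *b=%s\n", (void *)a, (void *)b, b);   /* ...read through the other */

        shm_unlink("/demo_page");
        return 0;
    }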

Response to comment:

And now this 3GB is shared between all processes? Or does each process have 4GB of space?

It depends on the OS. Some kernels (such as the L4 microkernel) use the same page table for multiple processes and separate the address spaces using segmentation. On Windows each process gets its own page tables. Remember that even though each process might get its own virtual address space, that doesn't mean that the physical memory is always different. For example, the image for kernel32.dll loaded in process A is shared with kernel32.dll in process B. Much of the kernel address space is also shared between processes.

Why does each process have kernel virtual memory?

The best way to think of this is to ask yourself, "How would a kernel work if it didn't execute using virtual memory?" In this hypothetical situation, every time your program caused a context switch into the kernel (let's say you made a system call), virtual memory would have to be disabled while the CPU was executing in kernel space. There's a cost to doing that and there's a cost to turning it back on when you switch back to user space.

Furthermore, let's suppose that the user program passed in a pointer to some data for its system call. This pointer is a virtual address. You've got virtual memory turned off, so that pointer needs to be translated to a physical address before the kernel can do anything with it. If you had virtual memory turned on, you'd get that for free thanks to the memory-management unit on the CPU. Instead you'd have to translate the addresses manually in software. There are all kinds of examples and scenarios I could describe (some involving hardware, some involving page table maintenance, and so on), but the gist of it is that it's much easier to have a homogeneous memory management scheme. If user space is using virtual memory, it's going to be easier to write a kernel if you maintain that scheme in kernel space. At least that has been my experience.

There will be only one instance of the OS kernel, right? Then why does each process have a separate kernel virtual space?

As I mentioned above, quite a bit of that address space will be shared across processes. There is per-process data that is in the kernel space that gets swapped out during a context switch between processes, but lots of it is shared because there is only one kernel.

How can I measure the actual memory usage of an application or process?

With ps or similar tools you will only get the number of memory pages allocated by that process. This number is correct, but:

  • does not reflect the actual amount of memory used by the application, only the amount of memory reserved for it (the sketch after this list makes that gap visible)

  • can be misleading if pages are shared, for example by several threads or by using dynamically linked libraries
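You can see the reserved-versus-used gap for yourself by comparing VmSize (reserved virtual memory) with VmRSS (pages actually resident) in /proc/<pid>/status. A minimal sketch, assuming a Linux /proc filesystem:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* VmSize = reserved virtual memory; VmRSS = pages actually resident. */
        FILE *f = fopen("/proc/self/status", "r");
        if (!f)
            return 1;
        char line[256];
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "VmSize", 6) == 0 || strncmp(line, "VmRSS", 5) == 0)
                fputs(line, stdout);
        fclose(f);
        return 0;
    }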

If you really want to know how much memory your application actually uses, you need to run it within a profiler. For example, Valgrind can give you insights into the amount of memory used and, more importantly, into possible memory leaks in your program. Valgrind's heap profiler tool is called 'massif':

Massif is a heap profiler. It performs detailed heap profiling by taking regular snapshots of a program's heap. It produces a graph showing heap usage over time, including information about which parts of the program are responsible for the most memory allocations. The graph is supplemented by a text or HTML file that includes more information for determining where the most memory is being allocated. Massif runs programs about 20x slower than normal.

As explained in the Valgrind documentation, you need to run the program through Valgrind:

valgrind --tool=massif <executable> <arguments>

Massif writes a dump of memory usage snapshots (e.g. massif.out.12345). These provide (1) a timeline of memory usage and (2), for each snapshot, a record of where in your program memory was allocated. A great graphical tool for analyzing these files is massif-visualizer, but I found ms_print, a simple text-based tool shipped with Valgrind, to be of great help already.

To find memory leaks, use the (default) memcheck tool of valgrind.
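For example, a deliberately leaky toy program like the following sketch, run under valgrind --leak-check=full, will have the unfreed block reported as definitely lost:

    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *kept = malloc(64);
        char *lost = malloc(128);            /* never freed: memcheck flags this */
        strcpy(lost, "this allocation leaks");
        free(kept);
        return 0;                            /* 'lost' is now definitely lost */
    }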

How does the linux kernel manage less than 1GB physical memory?

Not all virtual (linear) addresses have to be mapped to anything. If code accesses an unmapped page, a page fault is raised.

A physical page can be mapped to several virtual addresses simultaneously.

In the 4 GB virtual address space there are two regions: 0x0 .. 0xbfffffff is process (user) virtual memory, and 0xc0000000 .. 0xffffffff is kernel virtual memory.

  • How can the kernel map 896 MB from only 512 MB?

It maps up to 896 MB. So, if you have only 512 MB, only 512 MB will be mapped.

If your physical memory is in 0x00000000 to 0x20000000, it will be mapped for direct kernel access to virtual addresses 0xC0000000 to 0xE0000000 (linear mapping).
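The arithmetic of that linear mapping is just a fixed offset. A sketch, assuming the classic 32-bit PAGE_OFFSET of 0xC0000000 from the text (the kernel's own __pa()/__va() macros perform the equivalent conversion):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_OFFSET 0xC0000000u   /* start of the kernel's linear mapping */

    static uint32_t virt_to_phys(uint32_t va) { return va - PAGE_OFFSET; }
    static uint32_t phys_to_virt(uint32_t pa) { return pa + PAGE_OFFSET; }

    int main(void) {
        /* 512 MB of RAM: phys 0x00000000..0x1FFFFFFF <-> virt 0xC0000000..0xDFFFFFFF */
        printf("phys 0x00000000 -> virt 0x%08X\n", phys_to_virt(0x00000000u));
        printf("phys 0x1FFFFFFF -> virt 0x%08X\n", phys_to_virt(0x1FFFFFFFu));
        printf("virt 0xD0000000 -> phys 0x%08X\n", virt_to_phys(0xD0000000u));
        return 0;
    }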

  • What about user mode processes in this situation?

Physical memory for user processes will be mapped (not sequentially, but with an arbitrary page-to-page mapping) to virtual addresses 0x0 .. 0xc0000000. For pages below 896 MB this is a second mapping, alongside the kernel's linear one. The pages will be taken from the free page lists.

  • Where are user mode processes in phys RAM?

Anywhere.

  • Every article explains only the situation when you've installed 4 GB of memory and the ...

No. Every article explains how the 4 GB of virtual address space is mapped. The size of virtual memory is always 4 GB (for a 32-bit machine without memory extensions like PAE/PSE/etc. on x86).

As stated in section 8.1.3. Memory Zones of the book Linux Kernel Development by Robert Love (I use the third edition), there are several zones of physical memory:

  • ZONE_DMA - Contains page frames of memory below 16 MB
  • ZONE_NORMAL - Contains page frames of memory at and above 16 MB and below 896 MB
  • ZONE_HIGHMEM - Contains page frames of memory at and above 896 MB

So, if you have 512 MB, your ZONE_HIGHMEM will be empty, and ZONE_NORMAL will have 496 MB of physical memory mapped.
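You can inspect these zones on a running system via /proc/zoneinfo. A minimal sketch (on 64-bit kernels the zone names differ, e.g. DMA32 appears and HighMem is gone):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        FILE *f = fopen("/proc/zoneinfo", "r");
        if (!f)
            return 1;
        char line[256];
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "Node", 4) == 0)   /* e.g. "Node 0, zone   Normal" */
                fputs(line, stdout);
        fclose(f);
        return 0;
    }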

Also, take a look at section 2.5.5.2. Final kernel Page Table when RAM size is less than 896 MB of the book. It covers the case when you have less than 896 MB of memory.

Also, for ARM there is a description of the virtual memory layout: http://www.mjmwired.net/kernel/Documentation/arm/memory.txt

Line 63 there, PAGE_OFFSET .. high_memory-1, is the direct-mapped part of memory.

What's under 0x400000 in virtual memory?

As Maxim says, it's simply unmapped. The pages in that region are marked as "not present" in the CPU's page tables, so that accessing them causes a page fault; and the kernel knows they are not backed by any physical memory, file, or swap space, so that such a page fault will be handled by delivering a segmentation fault signal (SIGSEGV) to the process, normally killing it.

It is desirable for at least the lowest page of a program's virtual address space to be unmapped, so that accesses to address 0 (null pointer dereference) will cause a segmentation fault instead of allowing a buggy program to continue running. Leaving a larger region unmapped is also nice so that, for instance, if the program tries to access p[i] where p is a null pointer and i is somewhat greater than 4096, the program will again get a segfault. In 32-bit mode, the value 0x400000 is convenient because this is 4 MB and corresponds to one entry in the page directory. See https://wiki.osdev.org/Paging for an introduction to x86 paging.
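A small sketch makes this visible: reading through a null pointer touches the unmapped low region and raises SIGSEGV, which we intercept here only to print a message:

    #include <signal.h>
    #include <unistd.h>

    static void on_segv(int sig) {
        (void)sig;
        /* write() is async-signal-safe, unlike printf(). */
        static const char msg[] = "SIGSEGV: the low pages are unmapped\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(1);
    }

    int main(void) {
        signal(SIGSEGV, on_segv);
        volatile char *p = 0;
        return p[8];   /* address 8 lies in the unmapped low region -> fault */
    }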

Does calling free or delete ever release memory back to the system?

There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).
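A minimal sketch of such a tracing layer (the xmalloc/xfree names are purely illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    static long live_allocs;   /* allocations not yet freed */

    void *xmalloc(size_t n) { live_allocs++; return malloc(n); }
    void  xfree(void *p)    { if (p) live_allocs--; free(p); }

    static void report(void) {
        /* Balanced bookkeeping prints 0; anything else hints at a leak. */
        fprintf(stderr, "outstanding allocations: %ld\n", live_allocs);
    }

    int main(void) {
        atexit(report);
        void *a = xmalloc(16);
        void *b = xmalloc(32);   /* deliberately never freed */
        (void)b;
        xfree(a);
        return 0;                /* report() prints: outstanding allocations: 1 */
    }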

For your specific implementation, there is no reason to call free(), since the underlying malloc won't release the memory back to the system anyway; it would only be released back to your own allocator.

Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.
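A sketch of such a fixed-size free list (node_alloc/node_free and the payload size are hypothetical):

    #include <stdlib.h>

    struct node {
        struct node *next;   /* links the free list while the node is unused */
        char payload[56];
    };

    static struct node *free_list;

    static struct node *node_alloc(void) {
        if (free_list) {                        /* reuse a recycled node first */
            struct node *n = free_list;
            free_list = n->next;
            return n;
        }
        return malloc(sizeof(struct node));     /* fall back to the heap */
    }

    static void node_free(struct node *n) {
        n->next = free_list;   /* push back for reuse; free() is never called */
        free_list = n;
    }

    int main(void) {
        struct node *a = node_alloc();
        node_free(a);                      /* recycled, not returned to malloc */
        struct node *b = node_alloc();     /* same block handed back out */
        node_free(b);
        return (a == b) ? 0 : 1;
    }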

A term for when a computer process exceeds allocated memory by OS

when a computer process uses too much memory, and the OS has to terminate it

What you describe here doesn't happen. The behavior differs from OS to OS, but none behaves as you describe. On Windows, for example, a memory allocation may fail, but that does not imply the OS terminating the process: the call to allocate memory returns an error code, and the process decides how to handle the situation. Linux has this crazy memory allocation scheme in which allocation succeeds without any backing, and the actual first reference to the memory may then fail. In this case Linux runs the oom-killer:

It is the job of the linux 'oom killer' to sacrifice one or more processes in order to free up memory for the system when all else fails.

Note that the oom-killer kills a process chosen by the badness() function, not necessarily the process that actually touched a page that had no backing (i.e. not necessarily the process that requested the memory). In Linux it is also important to distinguish between memory being 'allocated' and memory being 'referenced' for the first time (i.e. the PTE shenanigans).

So, strictly speaking, what you describe doesn't exist. However, the common name for a process 'running out of memory' is out of memory, commonly abbreviated as OOM. In most modern systems the OOM condition manifests itself as an exception, and it is an exception raised voluntarily by the process, not by the OS.

One situation in which an OS kills a process on the spot is when an OOM occurs during a guard-page PTE miss (i.e. the OS cannot commit the virtual page). As the OS has no room to actually allocate the guard page, it has no room to write the exception record for the process, and it cannot raise the exception (that would be a stack overflow exception, since we're talking about a guard page). The OS has no choice but to obliterate the process (technically it is not a kill, since a kill is a type of signal).
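The allocated-versus-referenced distinction mentioned above is easy to observe on Linux. A sketch (whether the huge allocation succeeds depends on the vm.overcommit_memory setting and on a 64-bit build):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t huge = (size_t)1 << 40;   /* ask for 1 TiB */
        char *p = malloc(huge);
        /* Under default overcommit this often succeeds: nothing is backed yet. */
        printf("malloc(1 TiB) %s\n", p ? "succeeded without backing" : "failed");
        /* Writing to every page would commit them one fault at a time and
           could eventually wake the oom-killer, so we deliberately don't. */
        free(p);
        return 0;
    }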

Confusion on Memory Layout vs Memory Management Schemes

Does the memory layout of a program in execution (i.e. text, data, stack, heap) only make sense in the context of its virtual address space? If a program is organized ("laid out") into these logical sections in its virtual address space, don't these sections just get messed up as soon as addresses start getting converted from virtual to physical addresses using a memory management scheme like paging or segmentation?

That's correct. The sections are contiguous in virtual memory, but not contiguous in physical memory. This isn't an issue since the operating system maintains page tables; the processor's MMU uses those to translate virtual to physical addresses transparently on each access, and the operating system itself can use them to figure out which (scattered) physical pages to interact with e.g. when the process ends and its memory is to be reclaimed.
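To make the translation concrete: with the two-level paging of classic 32-bit x86, the MMU splits each virtual address into a page-directory index, a page-table index, and a byte offset. A sketch of just that bit arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t va = 0x00400123u;               /* arbitrary example address */
        uint32_t dir    = (va >> 22) & 0x3FFu;   /* top 10 bits: page-directory index */
        uint32_t table  = (va >> 12) & 0x3FFu;   /* next 10 bits: page-table index */
        uint32_t offset =  va        & 0xFFFu;   /* low 12 bits: byte within the page */
        printf("va 0x%08X -> dir %u, table %u, offset 0x%03X\n",
               va, dir, table, offset);
        return 0;
    }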

As far as I'm aware, these two schemes allow for non-contiguous partitioning in the physical address space. So if my "text" section was from address 0 to 100 (random size I picked) in the virtual address space, and I choose to use paging, and my page sizes were 20 addresses in length each (i.e. there would be 5 pages for the text section), once these pages get placed in the physical address space non-contiguously (based on wherever free space is available), wouldn't the notion of a TEXT "section" kinda not make sense anymore (as it's been chunked and scattered)?

The idea of a section is still applicable in any context where virtual addresses are used. Your user-mode program deals with virtual addresses (i.e. pointers essentially are virtual addresses), and a lot of the operating system still deals with virtual addresses as well. The translation to scattered physical addresses is done on demand by the MMU, and only a subset of kernel code needs to deal with physical addresses directly.

An aside: Those aren't realistic sizes due to the overhead of bookkeeping for pages; a typical page size is 4096 bytes, and there are ways of creating larger pages on some platforms to reduce this overhead further.

Lastly, are the variable-sized segments in segmentation that end up in the physical address space the exact same size as the logical categories (text, data, stack, heap) of the memory layout present in the virtual space? Is the only caveat here that in the physical space the segments are scattered non-contiguously (are not adjacent to one another) but still exist whole within their specific category (i.e. all the "data" remains together/contiguous in the physical space)?

Nope, they are scattered on a page-by-page basis, and not every virtual page will be backed by a physical page of memory. One example is demand paging, where a page only gets physical backing lazily, when it is actually needed. Pages of .text that haven't been used yet might not be loaded from disk until a page fault actually induces the kernel to load them.

Likewise if physical memory is scarce, unused pages might be evicted from virtual memory and be placed onto disk; when they're next accessed a pagefault will induce the kernel to load them back in from disk.

A virtual address might also map to a physical address that doesn't represent a physical page of DRAM memory on a DIMM somewhere. It's possible to map virtual addresses to physical addresses that represent memory-mapped IO, or a page of virtual memory might be shared between two processes as a form of cooperative communication.

There are further tricks done for the sake of optimization. For example, Linux's fork syscall doesn't copy pages; rather it sets up the page tables to enable a feature called copy on write, where pages are only copied when either the parent or child writes to them, and pages which are only read are shared between the two.
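A sketch that makes copy on write observable: after fork, parent and child share the heap page; the child's write triggers a private copy, so the parent's value is unchanged:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int *x = malloc(sizeof *x);
        *x = 1;

        pid_t pid = fork();   /* no pages are copied at this point */
        if (pid == 0) {
            *x = 2;           /* first write: the kernel copies the page for the child */
            printf("child  sees %d\n", *x);   /* prints 2 */
            exit(0);
        }
        wait(NULL);
        printf("parent sees %d\n", *x);       /* still prints 1 */
        free(x);
        return 0;
    }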


