Is Kernel Space Mapped into User Space on Linux x86?

Is kernel space mapped into user space on Linux x86?

Actually, on 32-bit Windows, without the /3GB boot option, the kernel is mapped into the top 2GB of the linear address space, leaving 2GB for the user process.

Linux does a similar thing, but it maps the kernel into the top 1GB of linear space, thus leaving 3GB for the user process.

I don't know if you can inspect the entire memory layout just by using the /proc filesystem. For a lab I designed for my students, I created a tiny device driver that lets a user peek at a physical memory address and read the contents of several control registers, such as CR3 (the page directory base address).

By using these two operations, one can walk the page directory of the current process (the one performing the operation) and see which pages are present, which are owned by the user and the kernel or just by the kernel, which are read/write or read-only, and so on. With that information, the students have to display a map showing memory usage, including kernel space.
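To make the decoding concrete, here is a minimal, hedged C sketch of how a single page-directory entry could be classified into the map symbols used further below. The bit positions come from the Intel SDM for 32-bit (non-PAE) paging; actually obtaining each PDE requires a kernel helper like the lab's driver, which is assumed here and not shown.

    #include <stdint.h>

    /* 32-bit (non-PAE) page-directory entry bits, per the Intel SDM. */
    #define PDE_PRESENT (1u << 0)   /* entry maps something */
    #define PDE_RW      (1u << 1)   /* writable (else read-only) */
    #define PDE_USER    (1u << 2)   /* user-accessible (else supervisor) */
    #define PDE_PSE     (1u << 7)   /* 4MB page (requires CR4.PSE) */

    /* Classify one PDE into the legend used by the memory map below.
     * The page-table cases ('x', 'r', '+') need a further walk over the
     * 1024 PTEs, which the lab driver performs; here we just signal it. */
    static char classify_pde(uint32_t pde)
    {
        if (!(pde & PDE_PRESENT))
            return '.';                          /* not present */
        if (pde & PDE_PSE) {                     /* a 4MB page */
            if (!(pde & PDE_USER))
                return 'X';                      /* supervisor page */
            return (pde & PDE_RW) ? '*' : 'R';   /* user rw / user ro */
        }
        return '?';  /* points to a page table: walk its PTEs for x/r/+ */
    }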

Take a look at this PDF. It's the compiled version of all the labs we did in my course.
http://www.atc.us.es/asignaturas/tpbn/PracticasTPBN2011.pdf

On page 36 of the PDF (page 30 of the document) you can see what a memory map looks like. This is the result of doing exercise #3.2 from lab #3.

The text is in Spanish, but I'm sure you can use a translator or something like that if there are things you cannot understand. This lab assumes the student has previously read about how the paging system works and how to interpret the layout of directory and page entries.

The map looks like this: a 16x64 block, where each cell represents 4MB of the current process's virtual address space. The map should really be three-dimensional, because each 4MB region is described by a page table with 1024 entries (pages), and not all of those pages may be present. To keep the map readable, the exercise requires the student to collapse each such region, showing the attributes of the first page entry that describes a present page, in the hope that all subsequent pages in that page table share the same attributes (which may or may not actually be true).

This map is from a 2.6.x kernel in which PAE is not used and PSE is used (PAE and PSE being two bit fields in control register CR4). PAE enables 2MB pages and PSE enables 4MB pages; 4KB pages are always available.

. : PDE not present, or page table empty.
X : 4MB page, supervisor.
R : 4MB page, user, read only.
* : 4MB page, user, read/write.
x : Page table with at least one entry describing a supervisor page.
r : Page table with at least one entry describing a user page, read only.
+ : Page table with at least one entry describing a user page, read/write.

................................r...............................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
...............................+..............................+.
xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX..x...........................xx

You can see the vast, almost empty 3GB space of user memory (the process is just a small C application and uses less than 4MB, all contained in a single page table, whose first present page is read-only, presumably part of the program code, or perhaps static strings).

Near the 3GB border there are two small read/write regions, which may belong to shared libraries loaded by the user program.

The last 4 rows (256 directory entries) belong almost entirely to the kernel. 224 of those entries are actually present and in use; they map the first 896MB of physical memory (224 × 4MB = 896MB), which is the space where the kernel lives. The last 32 entries are used by the kernel to access physical memory beyond the 896MB mark on systems with more than 896MB of RAM.

Why is the kernel mapped into the same address space as processes?

A process "owns" the entire virtual address space here, the kernel and the user portions of it.

Its inability to peek at and poke the kernel's code and data is not due to different address spaces; it's due to different access rights/permissions set in the page tables. Kernel pages are set up in such a way that regular applications can't access them.
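A quick, hedged way to see this from user space: the tiny C program below tries to read an address in the kernel half (0xC0000000 is the traditional start of kernel space on 32-bit x86 Linux with the default 3G/1G split; the exact boundary is an assumption here, as it is configuration-dependent) and catches the resulting fault.

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void on_segv(int sig)
    {
        (void)sig;
        /* write() and _exit() are async-signal-safe */
        static const char msg[] = "SIGSEGV: kernel memory is protected\n";
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        _exit(0);
    }

    int main(void)
    {
        signal(SIGSEGV, on_segv);
        /* Assumed kernel-space start for a 32-bit 3G/1G layout. */
        volatile char *kaddr = (volatile char *)0xC0000000u;
        char c = *kaddr;   /* faults: supervisor-only (or unmapped) page */
        printf("read %d (unexpected!)\n", c);
        return 1;
    }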

It is, however, customary to refer to the two parts of one whole thing as the kernel space and the user space and that can be confusing.

How are virtual addresses corresponding to kernel stack mapped?

Note: this is the OS-agnostic answer. Details do vary slightly with the OS in question (e.g. Darwin and continuations...) and possibly with architecture-specific (ARMv8, x86, etc.) implementations.

When a process performs a system call, the user-mode state (registers) is saved, including the user-mode stack pointer. At that point, a kernel-mode stack pointer is loaded, which is usually maintained somewhere in the thread control block.
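As an illustration of the idea only (this hypothetical struct is not any real kernel's layout), the per-thread bookkeeping might look like this:

    /* Hypothetical sketch: each thread's control block records where its
     * kernel-mode stack lives, so the syscall/interrupt entry path can
     * switch stacks before running any kernel code. */
    struct thread_control_block {
        void *kernel_stack_top;   /* loaded into the stack pointer on entry */
        void *saved_user_sp;      /* user stack pointer saved at entry */
        /* ... saved registers, scheduling state, etc. ... */
    };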

You are correct in saying that there is only one kernel space. It follows that (in theory) one thread in kernel space could easily see and/or tamper with any other's kernel-space state (just as threads of the same process can "see" each other in user space). In practice, however, this is (almost always) only a theoretical concern, since the kernel code presumably respects memory boundaries (just as user mode is assumed to, with thread-local storage, etc.). That said, "almost always", because if the kernel code can be exploited, then all of kernel memory is laid bare to the exploiter, and can potentially be read and/or compromised.

Is it possible to map a process into memory without mapping the kernel?


It is traditional and generally good to have your kernel mapped in every user process.

That way, when you make a system call, the kernel doesn't have to change the page tables to access its own memory. Having all physical memory mapped all the time makes it cheaper for a read system call to copy stuff from anywhere in the page cache, for example; see the sketch below.
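For instance, here is a heavily simplified, hypothetical read() handler in kernel C, just to show the point: because the data is already reachable through the kernel's permanent mapping, a single copy_to_user() suffices, with no page-table switching. (A real driver additionally needs registration, locking, and so on.)

    #include <linux/fs.h>
    #include <linux/uaccess.h>

    /* Hypothetical buffer assumed to live in the kernel's direct mapping. */
    static char demo_data[4096];

    static ssize_t demo_read(struct file *file, char __user *buf,
                             size_t len, loff_t *off)
    {
        if (*off >= sizeof(demo_data))
            return 0;                       /* EOF */
        if (len > sizeof(demo_data) - *off)
            len = sizeof(demo_data) - *off;

        /* No remapping needed: demo_data is always mapped in kernel space. */
        if (copy_to_user(buf, demo_data + *off, len))
            return -EFAULT;

        *off += len;
        return len;
    }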

The GDT and IDT base addresses are virtual (lidt / lgdt), so interrupt handling requires that at least the page containing the IDT, and the interrupt-handler code it points to, be mapped while user-space is executing.

But as a mitigation for Meltdown on Intel CPUs, where user-space speculative reads can bypass the user/supervisor page-table permission bit, Linux does actually unmap most of the kernel while user-space executes (kernel page-table isolation, KPTI). It needs to keep a "trampoline" mapped that swaps page tables to remap the kernel proper before jumping to the regular entry points, so interrupt handlers and system calls can still work.

Is it possible to access kernel space from user space, and why would I do that?

Usually the kernel disables this. Page-table entries have a user/supervisor bit which controls whether the page can be used when not in kernel mode (i.e. from ring 3). The kernel can thus leave its memory mapped while still protecting it from read/write by user-space. (See also this for a diagram of the nesting of page directories.)

CPUs have a performance feature to support this use case: there's a "global" bit in each PTE which (if set) means the CPU can keep the translation cached in the TLB even when CR3 changes (i.e. across context switches, when the kernel installs a new page table). The kernel sets this bit for the kernel mappings it includes in every process.
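For reference, the relevant x86 bits (positions per the Intel SDM; shown here only as an illustration):

    /* x86 paging: the Global bit is bit 8 of a PTE; it only takes effect
     * when CR4.PGE (bit 7) is set. Translations marked global survive a
     * MOV to CR3 (they must be flushed explicitly, e.g. with invlpg). */
    #define PTE_GLOBAL (1u << 8)   /* keep this TLB entry across CR3 reloads */
    #define CR4_PGE    (1u << 7)   /* CPU honors the Global bit when set */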

And by the way, there's probably only one physical copy of the tables for those kernel mappings: the top-level Page Map Level 4 Table (PML4) of each user-space page-table tree simply points to the same kernel PDPTE structures (most or all of which are actually 1GiB hugepage mappings, rather than pointers to further levels of entries). See the diagram linked above.


There is actually a small amount of kernel-owned memory that user-space is allowed to read (and execute): the kernel maps a few 4k pages, called the VDSO area, into the address space of every process (at the very top of virtual memory).

For a few simple but common system calls like gettimeofday() and getpid(), user-space can call functions in these pages (which, for example, run rdtsc and scale the result by constants exported by the kernel) instead of using syscall to enter kernel mode and do the same thing there. This saves maybe 50 to 100 clock cycles for a round trip to kernel mode on a modern x86 CPU, plus more from not needing all the save/restore work inside the kernel before dispatching to the right system call.
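A hedged demonstration: the ordinary C calls below typically go through the vDSO on x86-64 Linux (glibc decides; no special code is needed). Running the program under strace should show no gettimeofday or clock_gettime syscall when the vDSO fast path is taken.

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval tv;

        /* Usually serviced entirely in user space via the vDSO page,
         * so no mode switch happens on a typical x86-64 Linux system. */
        gettimeofday(&tv, NULL);

        printf("sec=%ld usec=%ld pid=%ld\n",
               (long)tv.tv_sec, (long)tv.tv_usec, (long)getpid());
        return 0;
    }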


Is it possible to map a process into memory without mapping the kernel?

With a 32-bit process on a 64-bit kernel, the entire 4GiB virtual address space is available for user-space. (Except for 3 or so 4k VDSO pages.)

Otherwise (when user-space virtual addresses are as wide as kernel-space virtual addresses), Linux uses the upper half for the kernel, including its mapping of all physical memory (with 1G hugepages on x86).

i386 Linux has config options to change the split (IIRC to 1:3, further cramping the kernel but allowing more virtual address space for user-space processes); see the illustration below. I don't know whether this is common for 32-bit kernels on other architectures, or only x86.
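As a point of reference, on 32-bit x86 Linux the boundary is the kernel's PAGE_OFFSET constant, and the VMSPLIT_* kernel config options move it. The value below is the common default for the 3G/1G split; treat the snippet as an illustration, not as code from any specific kernel tree.

    /* Default i386 layout: user space is 0 .. PAGE_OFFSET-1,
     * kernel space is PAGE_OFFSET .. 4GB. The CONFIG_VMSPLIT_*
     * options choose a different boundary at build time. */
    #define PAGE_OFFSET 0xC0000000UL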

wouldn't that be a waste of space?

It takes up some virtual address space, but you're supposed to have more virtual address space than physical memory. If you don't, you have to pay the speed cost of remapping memory more often.

This is why we have x86-64: virtual address space is huge. 48 bits gives 256 TiB, so half of that is 128 TiB of user address space. Future CPUs could implement hardware support for wider virtual addresses if it becomes necessary or useful. (The page-table format supports up to 52-bit physical addresses.) Maybe this will become more of an issue with non-volatile DIMMs providing memory-mapped storage at higher density than DRAM, giving a reason to use a lot of both kinds of address space.

If you need more than 2GiB of virtual address space in a single process, use a 64-bit system. (Or if you need a zillion processes/threads, at least use a 64-bit kernel; a 32-bit kernel with PAE sometimes runs into memory-allocation problems. See some https://serverfault.com/ questions.)

Someone reposted on their blog some of Linus Torvalds' comments about PAE (Physical Address Extension), which allows having more than 4GB of physical memory on a 32-bit-only x86 system. Summary: yuck. Even with a good kernel-side implementation it's definitely slower than a 64-bit kernel, and the comments come with some amusing insults for the Intel engineers who thought it would be a good idea and would solve the problem for 32-bit OSes.


