Why the Linux Kernel's ZONE_NORMAL Is Limited to 896 MB

Confusion about different meanings of HighMem in Linux Kernel

I think your examples of usage 2 are actually (sometimes mangled) descriptions of usage 1, or its consequences. There is no separate meaning, it's all just things that follow from not having enough kernel virtual address space to keep all the physical memory mapped all the time.

(So with a 3:1 user:kernel split, you only have 1GiB of lowmem, the rest is highmem, even if you don't need to enable PAE paging to see all of it.)

This article https://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae quotes a Linus Torvalds rant about how much it sucks to have less virtual address space than physical (which is what PAE does), with highmem being the way Linux tries to get some use out of the memory it can't keep mapped.

PAE is a 32-bit x86 extension that switches the CPU to using an alternate page-table format with wider PTEs (the same one adopted by AMD64, including an exec permission bit, and room for up to 52-bit physical addresses, although the initial CPUs to support it only supported 36-bit physical addresses). If you use PAE at all, you use it for all your page tables.
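For illustration, here is roughly what that wider PTE format looks like; the bit positions follow the x86 architecture manuals, but treat this as a sketch, not a definition lifted from the kernel:

    /* Sketch of the 64-bit PTE layout used by PAE (and later AMD64) paging.
     * Bit positions follow the x86 manuals; illustration only. */
    #include <stdint.h>

    #define PTE_PRESENT   (1ULL << 0)    /* page is mapped */
    #define PTE_WRITABLE  (1ULL << 1)
    #define PTE_USER      (1ULL << 2)
    #define PTE_NX        (1ULL << 63)   /* no-execute bit: only exists in this wide format */

    /* Bits 12..51 hold the physical page-frame address: room for 52-bit
     * physical addresses, even though the first PAE CPUs only wired up 36 bits. */
    #define PTE_PFN_MASK  0x000FFFFFFFFFF000ULL

    static inline uint64_t pte_phys_addr(uint64_t pte)
    {
        return pte & PTE_PFN_MASK;
    }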


A normal kernel using a high-half-kernel memory layout reserves the upper half of virtual address space for itself, even when user-space is running. Or to leave user-space more room, 32-bit Linux moved to a 3G:1G user:kernel split.

See for example modern x86-64 Linux's virtual memory map (Documentation/x86/x86_64/mm.txt), and note that it includes a 64TB direct mapping of all physical memory (using 1G hugepages), so given a physical address, the kernel can access it by adding the physical address to the start of that virtual range. (kmalloc reserves a range of addresses in this region without actually having to modify the page tables at all, just the bookkeeping.)
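As a toy model of that direct mapping (the real kernel uses its __va()/__pa() helpers; the base constant below is just a stand-in for the kernel's actual direct-map start):

    /* Toy model of a linear ("direct") mapping: kernel virtual address =
     * physical address + fixed offset.  DIRECT_MAP_BASE is a placeholder
     * for the kernel's real direct-map start address. */
    #include <stdint.h>

    #define DIRECT_MAP_BASE 0xffff888000000000ULL   /* hypothetical base of the direct map */

    static inline uint64_t toy_phys_to_virt(uint64_t phys)
    {
        return DIRECT_MAP_BASE + phys;   /* no page-table walk or mapping needed */
    }

    static inline uint64_t toy_virt_to_phys(uint64_t virt)
    {
        return virt - DIRECT_MAP_BASE;   /* only valid for direct-mapped addresses */
    }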

The kernel also wants other mappings of the same pages, for vmalloc kernel memory allocations that are virtually contiguous but don't need to be physically contiguous. And of course for the kernel's static code/data, but that's relatively small.

This is the normal/good situation without any highmem, which also applies to 32-bit Linux on systems with significantly less than 1GiB of physical memory. This is why Linus says:

virtual space needs to be bigger than physical space. Not “as big”. Not “smaller”. It needs to be bigger, by a factor of at least two, and that’s quite frankly pushing it, and you’re much better off having a factor of ten or more.

This is why Linus later says "Even before PAE, the practical limit was around 1GB..." With a 3:1 split to leave 3GB of virtual address space for user-space, that only leaves 1GiB of virtual address space for the kernel, just enough to map most of the physical memory. Or with a 2:2 split, to map all of it and have room for vmalloc stuff.

Hopefully this answer sheds more light on the subject than Linus's amusing "Anybody who doesn’t get that is a moron. End of discussion." (From context, he's actually aiming that insult at CPU architects who made PAE, not people learning about OSes, don't worry :P)


So what can the kernel do with highmem? It can use it to hold user-space virtual pages, because the per-process user-space page tables can refer to that memory without a problem.

Many of the times when the kernel has to access that memory are when the task is the current one, using a user pointer. e.g. read/write system calls invoke copy_to/from_user with the user-space address (copying to/from the pagecache for a file read/write), reaching the highmem through the user page table entries.
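For instance, a character-device read handler copies into the user buffer through the current task's page tables, so the destination can be a highmem page without the kernel needing its own permanent mapping of it. A minimal sketch, where dev_buf is a hypothetical driver buffer:

    /* Minimal sketch of a read() handler: `buf` is a user-space address,
     * possibly backed by a highmem page; copy_to_user() resolves it through
     * the current process's page tables. */
    #include <linux/fs.h>
    #include <linux/uaccess.h>
    #include <linux/errno.h>

    static char dev_buf[4096];

    static ssize_t demo_read(struct file *file, char __user *buf,
                             size_t count, loff_t *ppos)
    {
        if (*ppos >= sizeof(dev_buf))
            return 0;                          /* EOF */
        if (count > sizeof(dev_buf) - *ppos)
            count = sizeof(dev_buf) - *ppos;

        if (copy_to_user(buf, dev_buf + *ppos, count))
            return -EFAULT;                    /* user page not accessible */

        *ppos += count;
        return count;
    }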

If the data isn't hot in the pagecache, the read will block while DMA from disk (or the network, for NFS or whatever) is queued up. But that just brings file data into the pagecache, and I guess the copying into user-owned pages will happen after a context switch back to the task with the suspended read call.

But what if the kernel wants to swap out some pages from a process that isn't running? DMA works on physical addresses, so it can probably calculate the right physical address, as long as it doesn't need to actually load any of that user-space data.

(But it's usually not that simple, IIRC: DMA devices in 32-bit systems may not support high physical addresses. So the kernel might actually need bounce buffers in lowmem... I concur with Linus: highmem sucked, and using a 64-bit kernel is obviously much better, even if you want to run a pure 32-bit user-space.)

Anything like zswap that compresses pages on the fly, or any driver that does need to copy data using the CPU, would need a virtual mapping of the page it was copying to/from.
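That temporary mapping is what the kernel's kmap helpers provide. A minimal sketch, assuming a struct page that may live in highmem and a lowmem destination buffer:

    /* Sketch: to let the CPU touch a page that may be in highmem, create a
     * short-lived kernel virtual mapping for it.  kmap_atomic() maps the page
     * into a small reserved window of kernel address space; on 64-bit kernels
     * (no highmem) it just resolves to the direct-map address. */
    #include <linux/highmem.h>
    #include <linux/string.h>

    static void copy_from_maybe_highmem(struct page *page, void *dst, size_t len)
    {
        void *src = kmap_atomic(page);   /* temporary kernel mapping */

        memcpy(dst, src, len);           /* the CPU copy needs a virtual address */
        kunmap_atomic(src);              /* release the mapping window */
    }

(Newer kernels prefer kmap_local_page()/kunmap_local() for the same job.)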

Another problem is POSIX async I/O that lets the kernel complete I/O while the process isn't active (and thus its page table isn't in use). Copying from user-space to the pagecache / write buffer can happen right away if there's enough free space, but if not you'd want to let the kernel read pages when convenient. Especially for direct I/O (bypassing pagecache).


Brendan also points out that MMIO (and the VGA aperture) need virtual address space for the kernel to access them; often 128MiB, so your 1GiB of kernel virtual address space is 128MiB of I/O space, 896MiB of lowmem (permanently mapped memory).
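In numbers, with the traditional 32-bit x86 defaults (both figures are configuration-dependent):

    /* The classic 32-bit x86 arithmetic behind the 896 MiB figure; both
     * constants reflect the traditional defaults and are configurable. */
    #define KERNEL_VADDR_SPACE  (1024u << 20)   /* 1 GiB above PAGE_OFFSET (3G:1G split) */
    #define VMALLOC_IO_RESERVE  (128u << 20)    /* vmalloc + ioremap/MMIO window */
    #define LOWMEM_LIMIT        (KERNEL_VADDR_SPACE - VMALLOC_IO_RESERVE)   /* 896 MiB */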


The kernel needs lowmem for per-process things including kernel stacks for every task (aka thread), and for page tables themselves. (Because the kernel has to be able to read and modify the page tables for any process efficiently.) When Linux moved to using 8kiB kernel stacks, that meant that it had to find 2 contiguous pages (because they're allocated out of the direct-mapped region of address space). Fragmentation of lowmem apparently was a problem for some people unwisely running 32-bit kernels on big servers with tons of threads.
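The "2 contiguous pages" part is just an order-1 page allocation out of the direct-mapped region, roughly like this sketch:

    /* Sketch: an 8 KiB (order-1) allocation needs two physically contiguous
     * page frames from the direct-mapped region, which is why lowmem
     * fragmentation could make kernel-stack allocation fail. */
    #include <linux/gfp.h>

    #define STACK_ORDER 1   /* 2^1 pages = 8 KiB with 4 KiB pages */

    static unsigned long alloc_demo_stack(void)
    {
        /* returns a lowmem (direct-mapped) virtual address, or 0 on failure */
        return __get_free_pages(GFP_KERNEL, STACK_ORDER);
    }

    static void free_demo_stack(unsigned long stack)
    {
        free_pages(stack, STACK_ORDER);
    }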

Why does high-memory not exist for 64-bit cpu?

A 32-bit system can only address 4GB of virtual memory. In Linux this is divided into 3GB of user space and 1GB of kernel space. That 1GB is sometimes not enough to keep all physical memory permanently mapped, so the kernel might need to map and unmap areas of memory on the fly, which incurs a fairly significant performance penalty. The physical memory that can't stay permanently mapped is the "high" memory, hence the name "high memory problem".

A 64-bit system can address a huge amount of memory (16 EB), so this issue does not occur there.

Confusion regarding kernel version, device tree, and buildroot

As I stated in a comment, the crash looks very similar to the one in Freescale 3.0.35 kernel crash. If so, the crash happens in memset.S. The top two commits to memset.S in the upstream kernel, whose SHA1s begin with c2459d3 and 1bd4678 respectively, should solve that issue.

How to get working rootfs (initrd) on an ARM board?

Have you already tried Buildroot? It provides various options for packaging your rootfs. One of these options is to integrate the initramfs directly into the kernel; I would start with that one. You don't even need special kernel parameters to start the rootfs if it is embedded in your kernel.

Please be aware that a kernel with an embedded initramfs is bigger, and hence the bootloader must reserve enough space for it, i.e. the address range between the load address of the kernel binary and the address where the kernel will be extracted and started.

If you already have a working kernel tree, you can configure Buildroot to use it via local.mk.
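A minimal sketch of the relevant settings (option names as in Buildroot's menuconfig; the path is a placeholder):

    # Buildroot .config: pack the rootfs as a cpio archive and link it into
    # the kernel as an initramfs
    BR2_TARGET_ROOTFS_CPIO=y
    BR2_TARGET_ROOTFS_INITRAMFS=y

    # local.mk: point Buildroot's linux package at your existing kernel tree
    LINUX_OVERRIDE_SRCDIR = /path/to/your/kernel-tree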

Difference between Kernel Virtual Address and Kernel Logical Address?

The Linux kernel maps most of the virtual address space that belongs to the kernel as a 1:1 mapping (with an offset) of the first part of physical memory (slightly less than 1GiB for 32-bit x86; it can be different for other processors or configurations). For example, for kernel code on x86, virtual address 0xc0000001 is mapped to physical address 0x1.

This is called logical mapping - a 1:1 mapping (with an offset) that allows the kernel to access most of the physical memory of the machine.

But this is not enough: sometimes we have more than 1GiB of physical memory on a 32-bit machine, sometimes we want to reference non-contiguous physical memory blocks as if they were contiguous to keep things simple, and sometimes we want to map memory-mapped I/O regions, which are not RAM.

For this, the kernel keeps a region at the top of its virtual address space where it does "random" page-to-page mappings. The mappings there do not follow the 1:1 pattern of the logical mapping area. This is what we call virtual mapping.

It is important to add that on many platforms (x86 is an example), both the logical and virtual mappings are done using the same hardware mechanism (the MMU's page tables and TLB). In many cases, the "logical mapping" is actually done using the virtual memory facility of the processor, so this can be a little confusing. The difference, therefore, is the pattern according to which the mapping is done: 1:1 for logical, arbitrary for virtual.
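A short kernel-module-style sketch of the practical difference (error handling trimmed):

    /* Sketch: kmalloc() returns a *logical* address (direct-mapped and
     * physically contiguous), so converting it to a physical address is just
     * offset arithmetic.  vmalloc() returns an address from the separately
     * mapped region; it is only virtually contiguous, and must not be fed to
     * virt_to_phys(). */
    #include <linux/slab.h>
    #include <linux/vmalloc.h>
    #include <linux/mm.h>
    #include <linux/io.h>
    #include <linux/printk.h>

    static void mapping_demo(void)
    {
        void *k = kmalloc(PAGE_SIZE, GFP_KERNEL);   /* logical address */
        void *v = vmalloc(4 * PAGE_SIZE);           /* virtual-only address */

        if (k && v) {
            pr_info("kmalloc phys = 0x%llx\n",
                    (unsigned long long)virt_to_phys(k));   /* fine: 1:1 region */
            pr_info("is_vmalloc_addr(v) = %d\n", is_vmalloc_addr(v));   /* true */
        }
        kfree(k);    /* kfree(NULL) is a no-op */
        vfree(v);    /* vfree(NULL) is a no-op */
    }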


