How to Find or Calculate a Linux Process's Page Table Size and Other Kernel Accounting

Why does Linux favor 0x7f mappings?

First and foremost, assuming that you are talking about x86-64, we can see that the virtual memory map for x86-64 is:

========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
 ...              |    ...     | ...              |  ...

Userspace addresses are always in the canonical form in x86-64, using only the lower 48 bits with 4-level page tables or 57 bits with 5-level page tables (note that the highest bit is sign extended and only set to 1 for the kernel, therefore in reality you only see at most 47 or 56 bits set in userspace with the most significant always set to 0).

See:

x86-64 canonical address?
Address canonical form and pointer arithmetic

This puts the end of user-space virtual memory at 0x7fffffffffff. This is where the stack of new programs starts: that is, 0x7ffffffff000 (minus some random offset due to ASLR) and growing to lower addresses.

Let me address the simple question first:

Will there be a problem if I manually mmap pages outside of these prefixes?

Not at all, the mmap syscall always checks the address that is being requested, and it will refuse to map pages that overlap an already mapped memory area or pages at completely invalid addresses (e.g. addr < mmap_min_addr or addr > 0x7ffffffff000).

Now... diving straight into the Linux kernel code, precisely in the kernel ELF loader (fs/binfmt_elf.c:960), we can see a pretty long and esplicative comment:

/*
 * This logic is run once for the first LOAD Program
 * Header for ET_DYN binaries to calculate the
 * randomization (load_bias) for all the LOAD
 * Program Headers, and to calculate the entire
 * size of the ELF mapping (total_size). (Note that
 * load_addr_set is set to true later once the
 * initial mapping is performed.)
 *
 * There are effectively two types of ET_DYN
 * binaries: programs (i.e. PIE: ET_DYN with INTERP)
 * and loaders (ET_DYN without INTERP, since they
 * _are_ the ELF interpreter). The loaders must
 * be loaded away from programs since the program
 * may otherwise collide with the loader (especially
 * for ET_EXEC which does not have a randomized
 * position). For example to handle invocations of
 * "./ld.so someprog" to test out a new version of
 * the loader, the subsequent program that the
 * loader loads must avoid the loader itself, so
 * they cannot share the same load range. Sufficient
 * room for the brk must be allocated with the
 * loader as well, since brk must be available with
 * the loader.
 *
 * Therefore, programs are loaded offset from
 * ELF_ET_DYN_BASE and loaders are loaded into the
 * independently randomized mmap region (0 load_bias
 * without MAP_FIXED).
 */
if (interpreter) {
    load_bias = ELF_ET_DYN_BASE;
    if (current->flags & PF_RANDOMIZE)
        load_bias += arch_mmap_rnd();
    elf_flags |= MAP_FIXED;
} else
    load_bias = 0;

In short, there are two types of ELF Position Independent Executables:

Normal programs: they require a loader in order to run. This represents basically 99.9% of the ELF programs on a normal Linux system. The path of the loader is specified in the ELF program headers, with a program header of type PT_INTERP.
Loaders: a loader is an ELF that does not specify a PT_INTERP program header, and that is responsible for loading and starting normal programs. It also does a bunch of fancy stuff behind the scenes (resolve relocations, load needed libraries, etc.) before actually starting the program that is being loaded.

When the kernel executes a new ELF through an execve syscall, it needs to map into memory the program itself and the loader. Control will then be passed to the loader that will resolve and map all needed shared libraries and finally pass control to the program. Since both the program and its loader need to be mapped, the kernel needs to make sure that those mappings don't overlap (and also that future mapping requests by the loader will not overlap).

In order to do this, the loader is mapped near the stack, (at a lower address than the stack, but with some tolerance, since the stack is allowed to grow by adding more pages if needed), leaving the duty of applying ASLR to mmap itself. The program is then mapped using a load_bias (as seen in the above snippet) to put it far enough from the loader (at a much lower address).

If we take a look at ELF_ET_DYN_BASE, we see that it is architecture dependent and on x86-64 it evaluates to:

((1ULL << 47) - (1 << 12)) / 3 * 2 == 0x555555554aaa

Basically around 2/3 of TASK_SIZE. That load_bias is then adjusted adding arch_mmap_rnd() bytes if ASLR is enabled, and finally page-aligned. At the end of the day, this is the reason why we usually see addresses starting with 0x55 for programs.

When control is passed to the loader, the virtual memory area for the process has already been defined, and successive mmap syscalls that do not specify an address will return decreasing addresses starting near the loader. As we just saw the loader is mapped near the stack, and the stack is at the very end of the user address space: this is the reason why we usually see addresses starting with 0x7f for libraries.

There is a common exception to the above. In the case the loader is invoked directly, like for example:

/lib/x86_64-linux-gnu/ld-2.24.so ./myprog

The kernel will not map ./mpyprog in this case and will leave that to the loader. As a consequence, ./myprog will be mapped at some 0x7f... address by the loader.

You may be wondering: why doesn't the kernel always let the loader map the program then, or why isn't the program just mapped right before/after the loader? I don't have a 100% definitive answer for this, but a few reasons come to mind:

Consistency: making the kernel itself load the ELF into memory without depending on the loader avoids trouble. If this wasn't the case, the kernel would fully depend on the userspace loader, which is not advisable at all (this may also partially be a security concern).
Efficiency: we are sure that at least both the executable and its loader need to be mapped (regardless of any linked libraries), might as well save precious time and do it right away rather than wait for another syscall with associated context switch.
Security: in the default scenario, mapping the program at a different randomized address than the loader and other libraries provides a sort of "isolation" between the program itself and the loaded libraries. In other words, "leaking" any library address won't reveal the program position in memory, and vice-versa. Mapping the program at a predefined offset from the loader and other libraries would instead partially defeat the purpose of ASLR.
In an ideal security-driven scenario, every single mmap (i.e. any needed library) would also be placed at a randomized address independent of previous mappings, but this would hurt performance significantly. Keeping allocations grouped results in faster page table lookups: see Understanding The Linux Kernel (3rd edition), page 606: Table 15-3. Highest index and maximum file size for each radix tree height. It would also cause much greater virtual memory fragmentation, becoming a real problem for programs that need to map large files to memory. The substantial part of isolation between program code and library code is already done, going further has more cons than pros.
Ease of debugging: seeing RIP=0x55... vs RIP=0x7f... instantly helps figuring out where to look (program itself or library code).

Total number of bytes read/written by a Linux process and its children

A little awk, and strace is what you want.

strace -e trace=read,write -o ls.log ls

gives you a log of the read and write syscalls. Now you can take this log and sum the last column like this

cat ls.log | grep read | awk 'BEGIN {FS="="}{ sum += $2} END {print sum}'

You might wan't to change the grep to match only a read at the beginning of the line.

How to Find or Calculate a Linux Process's Page Table Size and Other Kernel Accounting

Why does Linux favor 0x7f mappings?

Total number of bytes read/written by a Linux process and its children

Related Topics

Leave a reply