Hugepages on Raspberry Pi 4

Huge pages are a way to improve application performance by reducing the number of TLB misses. The mechanism coalesces contiguous standard physical pages (typically 4 KB) into a single big one (e.g. 2 MB). Linux implements this feature in two flavors: transparent huge pages and explicit huge pages.

Transparent Huge Pages

Transparent huge pages (THP) are managed transparently by the kernel; user space applications have no control over them. The kernel does its best to allocate huge pages whenever possible, but it is not guaranteed. Moreover, THP may introduce overhead, as an underlying "garbage collector" kernel daemon named khugepaged is in charge of coalescing physical pages into huge pages. This may consume CPU time, with undesirable effects on the performance of the running applications. On systems with time-critical applications, it is generally advised to disable THP.

THP can be disabled on the boot command line (cf. the end of this answer) or from the shell in sysfs:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

N.B.: Some interesting papers exist on the performance evaluation/issues of THP:

  • Transparent Hugepages: measuring the performance impact;
  • Settling the Myth of Transparent HugePages for Databases.

Explicit huge pages

If huge pages are required at application level (i.e. from user space), the HUGETLBFS kernel configuration parameter must be set to activate the hugetlbfs pseudo-filesystem (the menu entry in the kernel configurator is: "File systems" --> "Pseudo filesystems" --> "HugeTLB file system support"). In the kernel source tree, this parameter is defined in fs/Kconfig:

config HUGETLBFS
    bool "HugeTLB file system support"
    depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
               SYS_SUPPORTS_HUGETLBFS || BROKEN
    help
      hugetlbfs is a filesystem backing for HugeTLB pages, based on
      ramfs. For architectures that support it, say Y here and read
      <file:Documentation/admin-guide/mm/hugetlbpage.rst> for details.

      If unsure, say N.

For example, on an Ubuntu system, we can check:

$ cat /boot/config-5.4.0-53-generic | grep HUGETLBFS
CONFIG_HUGETLBFS=y

N.B.: On Raspberry Pi, the kernel can be configured to expose /proc/config.gz, so the same check can be done with zcat. The corresponding configuration menu entries are: "General setup" --> "Kernel .config support" + "Enable access to .config through /proc/config.gz"

When this parameter is set, hugetlbfs pseudo-filesystem is added into the kernel build (cf. fs/Makefile):

obj-$(CONFIG_HUGETLBFS)     += hugetlbfs/

The source code of hugetlbfs is located in fs/hugetlbfs/inode.c. At startup, the kernel will mount internal hugetlbfs file systems to support all the available huge page sizes for the architecture it is running on:

static int __init init_hugetlbfs_fs(void)
{
    struct vfsmount *mnt;
    struct hstate *h;
    int error;
    int i;

    if (!hugepages_supported()) {
        pr_info("disabling because there are no supported hugepage sizes\n");
        return -ENOTSUPP;
    }

    error = -ENOMEM;
    hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
                                               sizeof(struct hugetlbfs_inode_info),
                                               0, SLAB_ACCOUNT, init_once);
    if (hugetlbfs_inode_cachep == NULL)
        goto out;

    error = register_filesystem(&hugetlbfs_fs_type);
    if (error)
        goto out_free;

    /* default hstate mount is required */
    mnt = mount_one_hugetlbfs(&hstates[default_hstate_idx]);
    if (IS_ERR(mnt)) {
        error = PTR_ERR(mnt);
        goto out_unreg;
    }
    hugetlbfs_vfsmount[default_hstate_idx] = mnt;

    /* other hstates are optional */
    i = 0;
    for_each_hstate(h) {
        if (i == default_hstate_idx) {
            i++;
            continue;
        }

        mnt = mount_one_hugetlbfs(h);
        if (IS_ERR(mnt))
            hugetlbfs_vfsmount[i] = NULL;
        else
            hugetlbfs_vfsmount[i] = mnt;
        i++;
    }

    return 0;

out_unreg:
    (void)unregister_filesystem(&hugetlbfs_fs_type);
out_free:
    kmem_cache_destroy(hugetlbfs_inode_cachep);
out:
    return error;
}

A hugetlbfs file system is a sort of RAM file system into which the kernel creates files to back the memory regions mapped by the applications.

The needed amount of huge pages can be reserved by writing the number of huge pages into /sys/kernel/mm/hugepages/hugepages-<hugepagesize>/nr_hugepages.

Then, mmap() is able to map some part of the application address space onto huge pages. Here is an example showing how to do it:

#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

#define HP_SIZE (2 * 1024 * 1024) // <-- Adjust to a huge page size supported on your system

int main(void)
{
    char *addr;

    // Map a huge page
    addr = mmap(NULL, HP_SIZE, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap()");
        return 1;
    }

    printf("Mapping located at address: %p\n", addr);

    pause();

    return 0;
}

In the preceding program, the memory pointed to by addr is backed by huge pages. Example of usage:

$ gcc alloc_hp.c -o alloc_hp
$ ./alloc_hp
mmap(): Cannot allocate memory
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
0
$ sudo sh -c "echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
1
$ ./alloc_hp
Mapping located at address: 0x7f7ef6c00000

In another terminal, the process memory map can be observed to verify the size of the memory page (the process is blocked in the pause() system call):

$ pidof alloc_hp
13009
$ cat /proc/13009/smaps
[...]
7f7ef6c00000-7f7ef6e00000 rw-s 00000000 00:0f 331939 /anon_hugepage (deleted)
Size: 2048 kB
KernelPageSize: 2048 kB <----- The page size is 2MB
MMUPageSize: 2048 kB
[...]

In the preceding map, the file name /anon_hugepage for the huge page region is generated internally by the kernel. It is marked "deleted" because the kernel removes the associated memory file, which makes the file disappear as soon as there are no more references to it (e.g. when the calling process ends, the underlying file is closed upon exit(), the reference counter on the file drops to 0, and the pending remove operation completes, making the file disappear).

Allocation of other huge page sizes

On Raspberry Pi 4B, the default huge page size is 2 MB, but the board supports several other huge page sizes:

$ ls -l /sys/kernel/mm/hugepages
total 0
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-1048576kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-2048kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-32768kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-64kB

To use them, it is necessary to mount a hugetlbfs type file system corresponding to the size of the desired huge page. The kernel documentation provides details on the available mount options. For example, to mount a hugetlbfs file system on /mnt/huge with 8 Huge Pages of size 64KB, the command is:

mount -t hugetlbfs -o pagesize=64K,size=512K,min_size=512K none /mnt/huge

Then it is possible to map huge pages of 64 KB in a user program. The following program creates the /tmp/hpfs directory, on which it mounts a hugetlbfs file system sized for 4 huge pages of 64 KB. A file named memfile_01 is created in it and extended to the size of 2 huge pages. The file is mapped into memory with the mmap() system call. The MAP_HUGETLB flag is not passed, as the provided file descriptor refers to a file created on a hugetlbfs filesystem. Then the program calls pause() to suspend its execution, so that some observations can be made in another terminal:

#include <sys/types.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <fcntl.h>

#define ERR(fmt, ...) do {                           \
    fprintf(stderr,                                  \
            "ERROR@%s#%d: "fmt,                      \
            __FUNCTION__, __LINE__, ## __VA_ARGS__); \
} while(0)

#define HP_SIZE   (64 * 1024)
#define HPFS_DIR  "/tmp/hpfs"
#define HPFS_SIZE (4 * HP_SIZE)

int main(void)
{
    void *addr;
    char mount_opts[256];
    int rc;
    int fd;

    rc = mkdir(HPFS_DIR, 0777);
    if (0 != rc && EEXIST != errno) {
        ERR("mkdir(): %m (%d)\n", errno);
        return 1;
    }

    snprintf(mount_opts, sizeof(mount_opts),
             "pagesize=%d,size=%d,min_size=%d",
             HP_SIZE, HPFS_SIZE, HP_SIZE);

    rc = mount("none", HPFS_DIR, "hugetlbfs", 0, mount_opts);
    if (0 != rc) {
        ERR("mount(): %m (%d)\n", errno);
        return 1;
    }

    fd = open(HPFS_DIR"/memfile_01", O_RDWR|O_CREAT, 0777);
    if (fd < 0) {
        ERR("open(%s): %m (%d)\n", "memfile_01", errno);
        return 1;
    }

    rc = ftruncate(fd, 2 * HP_SIZE);
    if (0 != rc) {
        ERR("ftruncate(): %m (%d)\n", errno);
        return 1;
    }

    addr = mmap(NULL, 2 * HP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (MAP_FAILED == addr) {
        ERR("mmap(): %m (%d)\n", errno);
        return 1;
    }

    // The file can be closed, the mapping stays valid
    rc = close(fd);
    if (0 != rc) {
        ERR("close(%d): %m (%d)\n", fd, errno);
        return 1;
    }

    pause();

    return 0;
} // main

The preceding program must be run as root as it calls mount():

$ gcc mount_tlbfs.c -o mount_tlbfs
$ cat /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages
0
$ sudo sh -c "echo 8 > /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages"
$ cat /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages
8
$ sudo ./mount_tlbfs

In another terminal, the /proc/<pid>/smaps file can be displayed to check the huge page allocation. As soon as the program writes into the huge pages, the lazy allocation mechanism triggers the effective allocation of the huge pages.

Cf. this article for further details.

Early reservation

Huge pages are made of consecutive physical memory pages. The reservation should be done early in the system startup (especially on heavily loaded systems), as physical memory may become so fragmented that it is sometimes impossible to allocate huge pages afterward. Reserving as early as possible can be done on the kernel boot command line:

hugepages=  
[HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
the default huge page size. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: <integer>

hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
conjunction with hugepages (above) to allocate huge
pages of a specific size at boot. The pair
hugepagesz=X hugepages=Y can be specified once for
each supported huge page size. Huge page sizes are
architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]

transparent_hugepage=
[KNL]
Format: [always|madvise|never]
Can be used to control the default behavior of the system
with respect to transparent hugepages.
See Documentation/admin-guide/mm/transhuge.rst
for more details.

On Raspberry Pi, the boot command line can typically be updated in /boot/cmdline.txt and the current boot command line used by the running kernel can be seen in /proc/cmdline.
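For example, the following parameters could be appended to the existing single line of /boot/cmdline.txt to reserve 8 huge pages of 64 KB and 4 of 2 MB at boot and disable THP (the sizes here are an assumption for illustration; use sizes actually listed under /sys/kernel/mm/hugepages on your board):

```
hugepagesz=64K hugepages=8 hugepagesz=2M hugepages=4 transparent_hugepage=never
```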

N.B.:

  • This recipe is explained in more detail here and here;
  • There is a user space library called libhugetlbfs which offers a layer of abstraction on top of the kernel's hugetlbfs mechanism described here. It comes with library services like get_huge_pages() and accompanying tools like hugectl. The goal of this user space service is to map the heap and text+data segments of statically linked executables into huge pages (the mapping of dynamically linked programs is not supported). All of this relies on the kernel features described in this answer.

Why doesn't the Linux Kernel use huge pages?

The Linux kernel's approach to huge pages is to mainly let system administrators manage them from userspace. This is mostly because as cool as they might sound, huge pages can also have drawbacks: for example, they cannot be swapped to disk. This LWN series on huge pages gives a lot of information on the topic.

By default there are no huge pages reserved, and one can reserve them at boot time through the boot parameters hugepagesz= and hugepages= (specified multiple times for multiple huge page sizes). Huge pages can also be reserved at runtime through /proc/sys/vm/nr_hugepages and /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages. Furthermore, they can be "dynamically" reserved by the kernel if .../nr_overcommit_hugepages is set higher than .../nr_hugepages. These numbers are reflected in /proc/meminfo under the various HugePages_XXX stats, which are for the default huge page size (Hugepagesize).
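These counters can be inspected directly; for instance (the exact values are system-dependent):

```shell
# Pool counters and default huge page size, as exposed in /proc/meminfo
grep -i '^huge' /proc/meminfo

# Runtime reservation for the default size requires root; equivalent to
# writing into /sys/kernel/mm/hugepages/hugepages-<size>kB/nr_hugepages:
#   sysctl vm.nr_hugepages=16
```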

File-backed mappings only support huge pages if the file resides in a hugetlbfs filesystem, and only of the specific size specified at mount time (mount option pagesize=). The hugeadm command-line tool, among other things, can give info about currently mounted hugetlbfs FSs with --list-all-mounts. One major reason for wanting a hugetlbfs mounted on your system is to enable huge page support in QEMU/libvirt guests.

All of the above covers "voluntary" huge pages allocations done with MAP_HUGETLB.


Linux also supports transparent huge pages (THP): normal pages can be transparently made huge (or, vice versa, existing transparent huge pages can be broken back into normal pages) when needed by the kernel. This happens without the need for MAP_HUGETLB, and regardless of nr_hugepages in sysfs.

There are some sysfs knobs to control THPs too, the most notable one being /sys/kernel/mm/transparent_hugepage/enabled: always means that the kernel will try to create THPs even without userspace programs actively suggesting it; madvise means that it will do so only if a userspace program suggests it through madvise(addr, len, MADV_HUGEPAGE); never means they are disabled. You'll probably see this set to always by default in modern Linux distros, e.g. recent releases of Debian or Ubuntu.

As an example, doing mmap(0x123 << 21, 2*1024*1024, 7, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) with /sys/kernel/mm/transparent_hugepage/enabled set to always should result in a 2M transparent huge page since the requested mapping is aligned to 2M (notice the absence of MAP_HUGETLB).



Does it mean the kernel has no need in huge pages at all? What are some examples of scenarios where huge pages are a must?

In general, you don't really need huge pages of any kind, you can very well live without them. They are just an optimization. Scenarios where they can be useful are, as mentioned by @Mgetz in the comments above, cases where you have a lot of random memory accesses on very large files (common for databases). Minimizing TLB pressure in such cases can result in significant performance improvements.

How to implement MAP_HUGETLB in a character device driver?

This isn't possible. You can only mmap files with MAP_HUGETLB if they reside in a hugetlbfs filesystem. Since /proc is a procfs filesystem, you have no way of mapping those files through huge pages.

You can also see this from the checks performed in mmap by the kernel:

    /* ... */

    if (!(flags & MAP_ANONYMOUS)) {                 // <== File-backed mapping?
        audit_mmap_fd(fd, flags);
        file = fget(fd);
        if (!file)
            return -EBADF;
        if (is_file_hugepages(file)) {              // <== Check that FS is hugetlbfs
            len = ALIGN(len, huge_page_size(hstate_file(file)));
        } else if (unlikely(flags & MAP_HUGETLB)) { // <== If not, MAP_HUGETLB isn't allowed
            retval = -EINVAL;
            goto out_fput;
        }
    } else if (flags & MAP_HUGETLB) {               // <== Anonymous MAP_HUGETLB mapping?

    /* ... */

See also:

  • How to use Linux hugetlbfs for shared memory maps of files?
  • This answer providing some detailed explanations and examples

Huge pages for memory mapped files on Linux

It looks like the underlying filesystem you are using does not support memory-mapping files using huge pages.

For example, for ext4 this support is still under development as of January 2017, and not included in the kernel yet (as of May 19, 2017).

If you run a kernel with that patchset applied, do note that you need to enable huge page support in the filesystem mount options, for example adding huge=always to the fourth column in /etc/fstab for the filesystems desired, or using sudo mount -o remount,huge=always /mountpoint.


