How to Use Linux Hugetlbfs for Shared Memory Maps of Files

How to implement MAP_HUGETLB in a character device driver?

This isn't possible: you can only mmap files with MAP_HUGETLB if they reside on a hugetlbfs filesystem. Since /proc is a procfs filesystem, there is no way to map its files through huge pages.

You can also see this from the checks the kernel performs on the mmap path (mm/mmap.c):

    /* ... */

    if (!(flags & MAP_ANONYMOUS)) {                 // <== File-backed mapping?
            audit_mmap_fd(fd, flags);
            file = fget(fd);
            if (!file)
                    return -EBADF;
            if (is_file_hugepages(file)) {          // <== Check that FS is hugetlbfs
                    len = ALIGN(len, huge_page_size(hstate_file(file)));
            } else if (unlikely(flags & MAP_HUGETLB)) { // <== If not, MAP_HUGETLB isn't allowed
                    retval = -EINVAL;
                    goto out_fput;
            }
    } else if (flags & MAP_HUGETLB) {               // <== Anonymous MAP_HUGETLB mapping?

    /* ... */
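
From user space, the working approach is therefore to create the file on a hugetlbfs mount and map it with a plain mmap(); the MAP_HUGETLB flag is unnecessary because is_file_hugepages() already identifies the file. A minimal sketch, assuming a hugetlbfs mount at /mnt/huge (the mount point and file name are assumptions for illustration):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LEN (2UL * 1024 * 1024)         /* one 2 MiB huge page */

    int main(void)
    {
        /* /mnt/huge is an assumed hugetlbfs mount point. */
        int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* No MAP_HUGETLB needed: the fd already lives on hugetlbfs. */
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        p[0] = 42;                          /* touch the mapping */

        munmap(p, LEN);
        close(fd);
        return 0;
    }

Since the mapping is MAP_SHARED, any other process that opens the same path gets a shared-memory view of the same huge pages.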


Is the Hugepage Memory Reserved During Linux System Startup Contiguous?

Each hugetlbfs page is itself physically contiguous, but consecutive huge pages are not guaranteed to be physically adjacent to one another.

The reason for allocating the larger page sizes (e.g., 1GB) at boot time is that memory fragments over time on a running system, and finding large, physically contiguous chunks for huge pages becomes increasingly rare.

So, to answer your question: if your architecture supports creating 16GB pages, then yes, a single 16GB page will be physically contiguous.

Source: https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetlbpage.html#hugetlbpage:

hugepagesz
    Specify a huge page size. Used in conjunction with the hugepages
    parameter to preallocate a number of huge pages of the specified
    size. Hence, hugepagesz and hugepages are typically specified in
    pairs, such as:

        hugepagesz=2M hugepages=512

    hugepagesz can only be specified once on the command line for a
    specific huge page size. Valid huge page sizes are architecture
    dependent.

However, this comment in the documentation leads me to believe that page sizes larger than 1GB are not typically supported:

For example, x86 CPUs normally support 4K and 2M (1G if
architecturally supported) page sizes
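
If you want to check at run time how many huge pages were actually reserved, the kernel exposes the default-size pool in /proc/meminfo. A small sketch that prints the relevant fields (HugePages_Total and Hugepagesize are standard /proc/meminfo entries):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];

        if (!f) {
            perror("fopen");
            return 1;
        }

        /* Print the huge page pool size and the default huge page size. */
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "HugePages_Total:", 16) == 0 ||
                strncmp(line, "Hugepagesize:", 13) == 0)
                fputs(line, stdout);
        }

        fclose(f);
        return 0;
    }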

using O_TMPFILE to clean up huge pages... or other methods?

It looks like O_TMPFILE is not implemented yet for hugetlbfs; indeed, this flag requires support from the underlying filesystem:

O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and shmem filesystems. XFS support was added in Linux 3.15.

This is confirmed by looking at the kernel source code, where hugetlbfs provides no tmpfile() implementation in its inode operations.
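
You can also observe the missing support from user space: opening a hugetlbfs directory with O_TMPFILE fails, typically with EOPNOTSUPP. A quick sketch, assuming a hugetlbfs mount at /mnt/huge (an assumed mount point):

    #define _GNU_SOURCE                     /* for O_TMPFILE */
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        /* /mnt/huge is an assumed hugetlbfs mount point. */
        int fd = open("/mnt/huge", O_TMPFILE | O_RDWR, 0600);

        if (fd < 0)
            perror("open(O_TMPFILE)");      /* expected failure on hugetlbfs */
        else
            printf("unexpected: O_TMPFILE worked (fd=%d)\n", fd);
        return 0;
    }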

I believe that the right answer here is to work on this implementation...


I noticed your comment about the unlink() option; still, the following approach may not be that risky:

  • open the file (by name) with O_CREAT | O_TRUNC, so you can assume its size is 0
  • unlink() it immediately
  • mmap() it with your target size

If your program gets killed in the middle, the worst case is a leftover empty file, as the sketch below shows.
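
A sketch of that sequence, again assuming a hugetlbfs mount at /mnt/huge (mount point and file name are assumptions):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LEN (2UL * 1024 * 1024)         /* one 2 MiB huge page */

    int main(void)
    {
        const char *path = "/mnt/huge/scratch";
        int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Unlink right away: the pages live on only as long as the
         * mapping/fd do. If we crash before this line, the worst
         * case is a leftover empty file. */
        unlink(path);

        void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* ... use the mapping ... */

        munmap(p, LEN);
        close(fd);
        return 0;
    }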

mremap(2) with HugeTLB to change virtual address?

Right now it looks like you do have to use hugetlbfs.

Unless I'm mistaken, the problem occurs because mm/mremap.c:mremap_to() calls mm/mremap.c:vma_to_resize(), which fails with EINVAL for huge pages.

Perhaps the test is incorrect, or the function lacks the code to handle huge pages correctly. It may be worth asking on the linux-kernel and linux-mm mailing lists whether this is a bug that could or should be fixed. However, that won't help users running current (and older) kernels.

Remember that when you mmap() a file descriptor, you usually go through a different code path, as each filesystem can provide its own mmap handler; for hugetlbfs, that code is fs/hugetlbfs/inode.c:hugetlbfs_file_mmap(). And, as you said, that code path seems to work fine for you.
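
Given that, a practical workaround on current kernels is to avoid mremap() entirely: map the same hugetlbfs file a second time at the desired address and unmap the old range. Both mappings are views of the same file pages, so no data is copied. A sketch with a hypothetical helper (move_huge_mapping() is an illustration, not a real API); note that MAP_FIXED silently replaces anything already mapped at target:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical helper: move a MAP_SHARED hugetlbfs mapping.
     * 'target' must be aligned to the huge page size. */
    static void *move_huge_mapping(int hfd, void *old, size_t len,
                                   void *target)
    {
        void *p = mmap(target, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, hfd, 0);
        if (p == MAP_FAILED)
            return MAP_FAILED;

        /* Drop the old view; the file pages themselves persist. */
        munmap(old, len);
        return p;
    }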

Note that it is best to let the user configure the hugetlbfs mount point rather than scanning /proc/mounts for one; that way, the sysadmin can set up multiple hugetlbfs mount points, each configured differently for a particular service running on the server. (I'm hoping your service does not require running as root.)


