mmap flag MAP_UNINITIALIZED not defined
To understand what to do about the fact that #include <sys/mman.h> does not define MAP_UNINITIALIZED, it is helpful to understand how the interface to the kernel is defined.
To build a kernel module, you need the kernel headers for the exact kernel version you are building the module for. Since you want to run in userspace, you won't need these.
The headers that define the kernel API for userspace are largely in /usr/include/linux and /usr/include/asm (see this for how they are generated). One of the more important consumers of these headers is the C standard library, e.g., glibc, which must be built against some version of these headers. Since the Linux kernel API is backwards compatible, you may have a glibc (or other library implementation) built against an older version of these headers than the kernel you are running. I'm by no means an expert on how the various distros distribute glibc, but it is my impression that the installed kernel headers defining the userspace API are generally the version that glibc was built against.
Finally, glibc defines its API through headers also installed under /usr/include, such as /usr/include/sys. I don't know exactly what, if any, backward or forward compatibility is provided for applications built with older or newer glibc headers, but I'm guessing that the library .so version number gets bumped when backward compatibility would be broken.
So now we can understand your problem to be that the glibc headers don't actually define MAP_UNINITIALIZED for the distros/versions that you tried.
However, the Linux kernel API has exposed MAP_UNINITIALIZED, as this patch demonstrates. If the glibc headers don't define it for you, you can use the Linux kernel API headers and #include <linux/mman.h> if this defines it. Note that you will still need to #include <sys/mman.h> in order to get the prototype for mmap, among other things.
If your Linux kernel API headers don't define MAP_UNINITIALIZED but you have a kernel version that implements it, you can define it yourself:
#define MAP_UNINITIALIZED 0x4000000
You don't have to worry that you are effectively using "newer" headers than your glibc was built with, because the glibc implementation of mmap is very thin:
#include <sys/types.h>
#include <sys/mman.h>
#include <errno.h>
#include <sysdep.h>

#ifndef MMAP_PAGE_SHIFT
#define MMAP_PAGE_SHIFT 12
#endif

__ptr_t
__mmap (__ptr_t addr, size_t len, int prot, int flags, int fd, off_t offset)
{
  if (offset & ((1 << MMAP_PAGE_SHIFT) - 1))
    {
      __set_errno (EINVAL);
      return MAP_FAILED;
    }

  return (__ptr_t) INLINE_SYSCALL (mmap2, 6, addr, len, prot, flags, fd,
                                   offset >> MMAP_PAGE_SHIFT);
}

weak_alias (__mmap, mmap)
It is just passing your flags straight through to the kernel.
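As a minimal sketch of the approach described above (not the original poster's code), the fallback definition can be guarded so that it only applies when no header has already supplied the flag. The helper name map_uninitialized is hypothetical, and the sketch assumes a Linux system where the value 0x4000000 quoted above is the right one for the target kernel.

```c
#define _GNU_SOURCE          /* expose MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>        /* mmap prototype and the MAP_* flags glibc knows */

/* Fallback for headers that do not define the flag. The value is the one
   quoted in the answer above; it is only used if nothing defined it yet. */
#ifndef MAP_UNINITIALIZED
#define MAP_UNINITIALIZED 0x4000000
#endif

/* Hypothetical helper: map len bytes of private anonymous memory and ask
   the kernel not to clear it. Returns MAP_FAILED on error, like mmap. */
static void *map_uninitialized(size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);
}
```

A call such as map_uninitialized(4096) is expected to succeed even on a kernel that does not honor the flag; in that case the pages should simply come back zeroed as usual.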
Do I have to add the length of the mapping to a pointer returned by mmap with the MAP_GROWSDOWN and MAP_STACK flags?
Yes, you have to add 65536 to the resulting pointer. Note, not 65535. Most architectures implement push(x) as *--sp = x; so having the sp above the stack is ok to start with. More importantly it has to be aligned, and 65535 is not.
The documentation appears to be wrong. I think it means to say "is one page higher than the...". That better matches the source implementation and the result of the little sample program below:
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

/* Offset being probed; global so the signal handler can report it. */
volatile int sp;

void segv(int signo) {
    char buf[80];
    /* %d so that negative offsets print as in the output shown below */
    int n = snprintf(buf, 80, "(%d): sp = %d\n", signo, sp);
    write(1, buf, n);
    _exit(1);
}

int main(void) {
    int N = 65535;
    signal(SIGSEGV, segv);
    signal(SIGBUS, segv);
    char *stack = (char *)mmap(NULL,
                               N,
                               PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_STACK |
                               MAP_GROWSDOWN | /*MAP_UNINITIALIZED |*/
                               MAP_ANONYMOUS,
                               -1,
                               0);
    printf("stack %p\n", stack);
    /* Probe upwards from the returned address, one page at a time. */
    for (sp = 0; sp < N; sp += 4096) {
        if (stack[sp]) {
            printf("stack[%d] = %x\n", sp, stack[sp]);
        }
    }
    /* Probe downwards; the first page below the mapping faults. */
    for (sp = 0; sp > -N; sp -= 4096) {
        if (stack[sp]) {
            printf("stack[%d] = %x\n", sp, stack[sp]);
        }
    }
    return 0;
}
which prints out:
$ ./a.out
stack 0x7f805c5fb000
(11): sp = -4096
on my system:
$ uname -a
Linux u2 4.15.0-42-generic #45-Ubuntu SMP Thu Nov 15 19:32:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
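To make the rule from the start of this answer concrete, here is a small sketch that computes the initial stack pointer as the returned address plus the full mapping length. The helper name alloc_stack_top and the STACK_LEN constant are hypothetical, and the sketch assumes Linux with a length that is a multiple of the page size.

```c
#define _GNU_SOURCE          /* expose MAP_ANONYMOUS, MAP_STACK, MAP_GROWSDOWN */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define STACK_LEN 65536      /* a multiple of the page size, unlike 65535 */

/* Hypothetical helper: map a downward-growing stack of len bytes and
   return its initial stack pointer, i.e. the address one past the highest
   mapped byte. Returns NULL if the mapping fails. */
static void *alloc_stack_top(size_t len)
{
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN,
                      -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* A push is *--sp = x, so the stack pointer starts just above the mapping. */
    return base + len;
}
```

With alloc_stack_top(STACK_LEN), the byte just below the returned pointer is the highest usable byte of the stack, and the returned pointer itself stays aligned because both the base and the length are page multiples.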
higher page reclaims when using munmap
The important thing to remember about mmap is that MAP_ANONYMOUS memory must be zeroed. So what usually happens is that the kernel first maps a page frame containing only zeroes, and only when a write hits the page does it map a private, read-write zeroed page in its place.
However, this is the reason why the kernel cannot reuse the originally mapped page right away: it does not know that only the first byte of the page is dirty. Instead, it must zero all 4 KiB of that page before the page can be given back to the process in a new anonymous mapping. Hence in both examples there are at least 1024 page faults occurring.
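The zeroing guarantee can be observed from userspace. The following is a rough sketch, with a hypothetical helper name, that dirties one byte of an anonymous page, unmaps it, maps a fresh page, and checks that every byte of the new page reads as zero. It assumes a mapping made without MAP_UNINITIALIZED.

```c
#define _GNU_SOURCE          /* expose MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a fresh anonymous page is all zeroes
   after a previous page was dirtied and unmapped, 0 if a non-zero byte is
   found, and -1 if mmap fails. */
static int fresh_mapping_is_zeroed(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 0xff;             /* the first write faults in a private page */
    munmap(p, page);

    p = mmap(NULL, page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int zeroed = 1;
    for (size_t i = 0; i < page; i++) {
        if (p[i] != 0) {     /* any leftover byte would break the guarantee */
            zeroed = 0;
            break;
        }
    }
    munmap(p, page);
    return zeroed;
}
```

The dirty byte never shows up in the second mapping, whichever page frame the kernel hands back.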
If the memory does not need to be zeroed, Linux has an extra flag called MAP_UNINITIALIZED that tells the kernel that the pages need not be cleared, but it is only available on embedded devices:
MAP_UNINITIALIZED (since Linux 2.6.33)
Don't clear anonymous pages. This flag is intended to improve performance on embedded devices. This flag is honored only if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory).
I guess the reason it is not available in generic Linux kernels is that the kernel does not keep track of which process previously had the page frame mapped, so the page could leak information from a sensitive process.
bzeroing the page yourself would not help performance: the kernel would not know that the page was zeroed, because no architecture supports tracking that in hardware, and it is cheaper to write zeroes over the page than to check whether the page is all zeroes and then, in 99.9999999 % of cases, write zeroes over it anyway.
Change user space memory protection flags from kernel module
After some more research, I found a function called get_user_pages() (the best documentation I've found is here) that returns a list of pages from userspace at a given address. These pages can be mapped into kernel space with kmap() and written to that way (in my case, using kernel_read()). This can be used as a replacement for copy_to_user() because it allows forcing write permissions on the pages retrieved. The only drawback is that you have to write page by page instead of all in one go, but it does solve the problem I described in my question.