Does madvise(_, _, MADV_DONTNEED) instruct the OS to lazily write to disk?

madvise: not understood

For random read/write access to a mmap()ed file, MADV_SEQUENTIAL is probably not very useful (and may in fact cause undesired behavior). MADV_RANDOM or MADV_DONTNEED would be better options in this case. However, be aware that the kernel is free to ignore any madvise() - although in my understanding, Linux currently does not, as it tends to treat madvise() more as a command than an advisory...

Another option would be to mmap() only selected sections of the file as needed, and munmap() them as you're done with them, perhaps maintaining a pool of some small number of currently active mappings (i.e. mapping more than one region at once if needed, but still keeping it limited).
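As a rough sketch of the "map only the section you need" idea, the helper below maps a window at an arbitrary file offset by rounding the offset down to a page boundary. The names struct window, window_map, window_unmap, and window_demo are hypothetical, and a real pool would keep several such windows and recycle the least recently used one.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper type: one mapped window into a file. */
struct window {
    void   *base;    /* page-aligned address returned by mmap() */
    size_t  maplen;  /* length to pass to munmap() */
    char   *data;    /* first byte of the requested range */
};

/* Map len bytes of fd starting at offset; offset need not be page-aligned.
   Returns 0 on success, -1 (with errno set by mmap) on failure. */
static int window_map(struct window *w, int fd, off_t offset, size_t len)
{
    const off_t  page  = (off_t)sysconf(_SC_PAGESIZE);
    const off_t  start = offset - (offset % page);   /* round down to a page */
    const size_t skip  = (size_t)(offset - start);

    w->base = mmap(NULL, len + skip, PROT_READ | PROT_WRITE, MAP_SHARED, fd, start);
    if (w->base == MAP_FAILED)
        return -1;
    w->maplen = len + skip;
    w->data   = (char *)w->base + skip;
    return 0;
}

/* Release a window obtained from window_map(). */
static int window_unmap(struct window *w)
{
    return munmap(w->base, w->maplen);
}

/* Usage example: write four bytes through a window at an unaligned offset
   of a scratch file, then read them back with pread(). Returns 0 on success. */
static int window_demo(void)
{
    char path[] = "/tmp/window-demo-XXXXXX";
    char check[4];
    struct window w;
    ssize_t n;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);
    if (ftruncate(fd, 3 * (off_t)sysconf(_SC_PAGESIZE)) == -1 ||
        window_map(&w, fd, 5000, sizeof check) == -1) {
        close(fd);
        return -1;
    }
    memcpy(w.data, "ping", sizeof check);
    msync(w.base, w.maplen, MS_SYNC);
    window_unmap(&w);
    n = pread(fd, check, sizeof check, 5000);
    close(fd);
    return (n == (ssize_t)sizeof check && memcmp(check, "ping", sizeof check) == 0) ? 0 : -1;
}
```

Keeping base and maplen separate from data matters because munmap() needs the page-aligned address and the full mapped length, not the pointer handed to the caller.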

optimizing mmap on very large file

You have stumbled upon the line of reasoning that leads to the B-tree data structure. The optimization you are imagining is worth doing, but to get as much as possible out of it, you will need to reorganize the data on disk substantially and use more complicated algorithms than binary search. You should probably look into existing open source B-tree libraries rather than implementing from scratch.

Because you are using mmap, the minimum granularity of access is not the disk block size, but the memory "page" size, which can be queried with sysconf(_SC_PAGESIZE). Some OSes will read and populate a larger chunk of memory on random access to a file-backed region, but I don't know of any portable way to find out how much. You might also get some benefit from madvise(MADV_RANDOM).
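To make the page-granularity point concrete, here is a small sketch. pages_touched and hint_random are hypothetical helper names; the first only does the arithmetic, the second wraps the madvise(MADV_RANDOM) call mentioned above.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

/* How many pages does an access of len bytes at byte offset off touch,
   given a page size of page bytes (as returned by sysconf(_SC_PAGESIZE))? */
static size_t pages_touched(size_t off, size_t len, size_t page)
{
    if (len == 0)
        return 0;
    return (off + len - 1) / page - off / page + 1;
}

/* Hint that [addr, addr + len) will be accessed in random order.
   A failure is harmless: the mapping works the same without the hint. */
static int hint_random(void *addr, size_t len)
{
    return madvise(addr, len, MADV_RANDOM);
}
```

A record that straddles a page boundary costs two page faults instead of one, which is one reason B-tree nodes are usually sized and aligned to whole pages.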

Zero a large memory mapping with `madvise`

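There is a much easier solution to your problem that is fairly portable: call mmap() again on the same address range with MAP_FIXED, so that fresh zero-filled pages replace the old ones.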

Since MAP_FIXED is permitted to fail for fairly arbitrary implementation-specific reasons, falling back to memset if it returns MAP_FAILED would be advisable.
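A minimal sketch of that MAP_FIXED approach with the memset fallback, assuming the region was originally obtained as a private anonymous mapping and that ptr and len are page-aligned. zero_region and zero_region_demo are hypothetical names.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

/* Zero a page-aligned region by mapping fresh anonymous pages over it.
   Assumption: the region is a private anonymous mapping owned by the caller. */
static void zero_region(void *ptr, size_t len)
{
    void *p = mmap(ptr, len, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        memset(ptr, 0, len);   /* fallback: clear the old pages by hand */
}

/* Usage example: dirty two pages, zero them, verify. Returns 0 on success. */
static int zero_region_demo(void)
{
    const size_t len = 2 * (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t i;

    if (p == MAP_FAILED)
        return -1;
    memset(p, 0xAA, len);
    zero_region(p, len);
    for (i = 0; i < len; i++)
        if (p[i] != 0) {
            munmap(p, len);
            return -1;
        }
    munmap(p, len);
    return 0;
}
```

The remap discards the old pages instead of writing to each of them, so the cost is paid lazily, one page fault at a time, when the memory is next touched.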

file mapping vs file system synchronization

Use addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, offset) to map the file.

If the size of the file changes, use newaddr = mremap(addr, len, newlen, MREMAP_MAYMOVE) to update the mapping to reflect it. To extend the file, use ftruncate(fd, newlen) before remapping the file.
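The extend-then-remap sequence might look like the following sketch. mremap() is Linux-specific; grow_mapping and grow_demo are hypothetical names, and the demo uses a scratch file under /tmp.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

/* Extend the file to newlen bytes, then grow the mapping to match.
   Returns the (possibly moved) mapping, or MAP_FAILED on error. */
static void *grow_mapping(int fd, void *addr, size_t len, size_t newlen)
{
    if (ftruncate(fd, (off_t)newlen) == -1)
        return MAP_FAILED;
    return mremap(addr, len, newlen, MREMAP_MAYMOVE);
}

/* Usage example: map one page, grow to four, check that old data survived
   and that the new tail is writable. Returns 0 on success. */
static int grow_demo(void)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/grow-demo-XXXXXX";
    char *map, *bigger;
    int ok, fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)page) == -1) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    map[0] = 'a';
    bigger = grow_mapping(fd, map, page, 4 * page);
    if (bigger == MAP_FAILED) {
        munmap(map, page);
        close(fd);
        return -1;
    }
    bigger[4 * page - 1] = 'z';
    ok = (bigger[0] == 'a' && bigger[4 * page - 1] == 'z');
    munmap(bigger, 4 * page);
    close(fd);
    return ok ? 0 : -1;
}
```

Because MREMAP_MAYMOVE lets the kernel relocate the mapping, any pointers into the old mapping must be recomputed from the returned address.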

You can use mprotect(addr, len, protflags) to change the protection (read/write) on any pages in the mapping (both must be aligned on a page boundary). You can also tell the kernel about your future accesses via madvise(), if the mapping is too large to fit in memory at once, but the kernel seems pretty darned good at managing readahead etc. even without those.

When you make changes to the mapping, use msync(partaddr, partlen, MS_SYNC | MS_INVALIDATE) or msync(partaddr, partlen, MS_ASYNC | MS_INVALIDATE) to ensure that the changes in the partlen bytes starting at partaddr are visible to other mappings and file readers (partaddr must be page-aligned). If you use MS_SYNC, the call returns only when the update is complete. The MS_ASYNC call tells the kernel to do the update, but does not wait until it is done. If there are no other memory maps of the file, MS_INVALIDATE does nothing; but if there are, it tells the kernel to ensure the changes are reflected in those too.

In Linux kernels since 2.6.19, MS_ASYNC does nothing, as the kernel tracks the changes properly anyway (no msync() is needed, except possibly before munmap()). I don't know if Android kernels have patches that change that behaviour; I suspect not. It is still a good idea to keep them in the code, for portability across POSIXy systems.
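Since msync() rejects addresses that are not page-aligned, a small wrapper can round the start of the range down first. This is only a sketch; sync_range and sync_demo are hypothetical names, and the demo uses a scratch file under /tmp.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Flush partlen bytes starting at partaddr, which may sit anywhere inside
   a mapping. wait != 0 selects MS_SYNC, otherwise MS_ASYNC. */
static int sync_range(void *partaddr, size_t partlen, int wait)
{
    const uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
    const uintptr_t addr  = (uintptr_t)partaddr;
    const uintptr_t start = addr - (addr % page);   /* round down to a page */

    return msync((void *)start, partlen + (size_t)(addr - start),
                 (wait ? MS_SYNC : MS_ASYNC) | MS_INVALIDATE);
}

/* Usage example: modify a few bytes through the mapping, sync them, and
   read them back through the file descriptor. Returns 0 on success. */
static int sync_demo(void)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/sync-demo-XXXXXX";
    char check[5];
    char *map;
    int rc, fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)page) == -1) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map + 100, "hello", 5);
    rc = sync_range(map + 100, 5, 1);
    if (rc == 0 && (pread(fd, check, 5, 100) != 5 || memcmp(check, "hello", 5) != 0))
        rc = -1;
    munmap(map, page);
    close(fd);
    return rc;
}
```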

mapped data turns-out to be inconsistent temporarily

Well, unless you do use msync(partaddr, partlen, MS_SYNC | MS_INVALIDATE), the kernel will do the update when it sees best.

So, if you need some changes to be visible to file readers before proceeding, use msync(areaptr, arealen, MS_SYNC | MS_INVALIDATE) in the process doing those updates.

If you don't care about the exact moment, use msync(areaptr, arealen, MS_ASYNC | MS_INVALIDATE). It'll be a no-op on current Linux kernels, but it's a good idea to keep them for portability (perhaps commented out, if necessary for performance) and to remind developers about the (lack of) synchronization expectations.

As I commented to OP, I cannot observe the synchronization issues on Linux at all. (That does not mean it does not happen on Android, because Android kernels are derivatives of Linux kernels, not exactly the same.)

I do believe the msync() call is not needed on Linux kernels since 2.6.19 at all, as long as the mapping uses flags MAP_SHARED | MAP_NORESERVE, and the underlying file is not opened using the O_DIRECT flag. The reason for this belief is that in this case, both mapping and file accesses should use the exact same page cache pages.

Here are two test programs, that can be used to explore this on Linux. First, a single-process test, test-single.c:

#define  _POSIX_C_SOURCE  200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

/* Read exactly len bytes from fd at offset. Returns 0 on success, errno otherwise. */
static inline int read_from(const int fd, void *const to, const size_t len, const off_t offset)
{
    char *p = (char *)to;
    char *const q = (char *)to + len;
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) != offset)
        return errno = EIO;

    while (p < q) {
        n = read(fd, p, (size_t)(q - p));
        if (n > 0)
            p += n;
        else if (n != -1)
            return errno = EIO;   /* Unexpected end of file. */
        else if (errno != EINTR)
            return errno;
    }

    return 0;
}

/* Write exactly len bytes to fd at offset. Returns 0 on success, errno otherwise. */
static inline int write_to(const int fd, const void *const from, const size_t len, const off_t offset)
{
    const char *const q = (const char *)from + len;
    const char *p = (const char *)from;
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) != offset)
        return errno = EIO;

    while (p < q) {
        n = write(fd, p, (size_t)(q - p));
        if (n > 0)
            p += n;
        else if (n != -1)
            return errno = EIO;   /* write() returned zero. */
        else if (errno != EINTR)
            return errno;
    }

    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long tests, n, merrs = 0, werrs = 0;
    size_t page;
    long *map, data[2];
    int fd;
    char dummy;

    if (argc != 3) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s FILENAME COUNT\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program will test synchronization between a memory map\n");
        fprintf(stderr, "and reading/writing the underlying file, COUNT times.\n");
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    if (sscanf(argv[2], " %lu %c", &tests, &dummy) != 1 || tests < 1) {
        fprintf(stderr, "%s: Invalid number of tests to run.\n", argv[2]);
        return EXIT_FAILURE;
    }

    /* Create the file. */
    page = sysconf(_SC_PAGESIZE);
    fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
    if (fd == -1) {
        fprintf(stderr, "%s: Cannot create file: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }
    if (ftruncate(fd, page) == -1) {
        fprintf(stderr, "%s: Cannot resize file: %s.\n", argv[1], strerror(errno));
        unlink(argv[1]);
        close(fd);
        return EXIT_FAILURE;
    }

    /* Map it. */
    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (map == MAP_FAILED) {
        fprintf(stderr, "%s: Cannot map file: %s.\n", argv[1], strerror(errno));
        unlink(argv[1]);
        close(fd);
        return EXIT_FAILURE;
    }

    /* Test loop. */
    for (n = 0; n < tests; n++) {

        /* Update map. */
        map[0] = (long)(n + 1);
        map[1] = (long)(~n);

        /* msync(map, 2 * sizeof map[0], MS_SYNC | MS_INVALIDATE); */

        /* Check the file contents. */
        if (read_from(fd, data, sizeof data, 0)) {
            fprintf(stderr, "read_from() failed: %s.\n", strerror(errno));
            munmap(map, page);
            unlink(argv[1]);
            close(fd);
            return EXIT_FAILURE;
        }
        werrs += (data[0] != (long)(n + 1) || data[1] != (long)(~n));

        /* Update data. */
        data[0] = (long)(n * 386131);
        data[1] = (long)(n * -257);
        if (write_to(fd, data, sizeof data, 0)) {
            fprintf(stderr, "write_to() failed: %s.\n", strerror(errno));
            munmap(map, page);
            unlink(argv[1]);
            close(fd);
            return EXIT_FAILURE;
        }
        merrs += (map[0] != (long)(n * 386131) || map[1] != (long)(n * -257));
    }

    munmap(map, page);
    unlink(argv[1]);
    close(fd);

    if (!werrs && !merrs)
        printf("No errors detected.\n");
    else {
        if (werrs)
            printf("Detected %lu times (%.3f%%) when file contents were incorrect.\n",
                   werrs, 100.0 * (double)werrs / (double)tests);
        if (merrs)
            printf("Detected %lu times (%.3f%%) when mapping was incorrect.\n",
                   merrs, 100.0 * (double)merrs / (double)tests);
    }

    return EXIT_SUCCESS;
}


Compile and run using e.g.

gcc -Wall -O2 test-single.c -o single
./single temp 1000000

to test a million times, whether the mapping and the file contents stay in sync, when both accesses are done in the same process. Note that the msync() call is commented out, because on my machine it is not needed: I never see any errors/desynchronization during testing even without it.

The test rate on my machine is about 550,000 tests per second. Note that each test checks both directions, so it includes a read and a write. I just cannot get this to detect any errors, even though it is written to be quite sensitive to them.

The second test program uses two child processes and a POSIX realtime signal to tell the other process to check the contents. test-multi.c:

#define  _POSIX_C_SOURCE  200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>


#define NOTIFY_SIGNAL (SIGRTMIN+0)

int mapper_process(const int fd, const size_t len)
{
    long value = 1;
    unsigned long count[2] = { 0, 0 };
    long *data;
    siginfo_t info;
    sigset_t sigs;
    int signum;

    if (fd == -1) {
        fprintf(stderr, "mapper_process(): Invalid file descriptor.\n");
        return EXIT_FAILURE;
    }

    data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (data == MAP_FAILED) {
        fprintf(stderr, "mapper_process(): Cannot map file.\n");
        return EXIT_FAILURE;
    }

    sigemptyset(&sigs);
    sigaddset(&sigs, NOTIFY_SIGNAL);
    sigaddset(&sigs, SIGINT);
    sigaddset(&sigs, SIGHUP);
    sigaddset(&sigs, SIGTERM);

    while (1) {
        /* Wait for the notification. */
        signum = sigwaitinfo(&sigs, &info);
        if (signum == -1) {
            if (errno == EINTR)
                continue;
            fprintf(stderr, "mapper_process(): sigwaitinfo() failed: %s.\n", strerror(errno));
            munmap(data, len);
            return EXIT_FAILURE;
        }
        if (signum != NOTIFY_SIGNAL)
            break;

        /* A notify signal was received. Check the write counter. */
        count[ (data[0] == value) ]++;

        /* Update. */
        data[0] = value++;
        data[1] = -(value++);

        /* Synchronize */
        /* msync(data, 2 * sizeof (data[0]), MS_SYNC | MS_INVALIDATE); */

        /* And let the writer know. */
        kill(info.si_pid, NOTIFY_SIGNAL);
    }

    /* Print statistics. count[0] holds mismatches, count[1] matches. */
    printf("mapper_process(): %lu errors out of %lu cycles (%.3f%%)\n",
           count[0], count[0] + count[1], 100.0 * (double)count[0] / (double)(count[0] + count[1]));
    fflush(stdout);

    munmap(data, len);
    return EXIT_SUCCESS;
}

/* Read exactly len bytes from fd at offset. Returns 0 on success, errno otherwise. */
static inline int read_from(const int fd, void *const to, const size_t len, const off_t offset)
{
    char *p = (char *)to;
    char *const q = (char *)to + len;
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) != offset)
        return errno = EIO;

    while (p < q) {
        n = read(fd, p, (size_t)(q - p));
        if (n > 0)
            p += n;
        else if (n != -1)
            return errno = EIO;   /* Unexpected end of file. */
        else if (errno != EINTR)
            return errno;
    }

    return 0;
}

/* Write exactly len bytes to fd at offset. Returns 0 on success, errno otherwise. */
static inline int write_to(const int fd, const void *const from, const size_t len, const off_t offset)
{
    const char *const q = (const char *)from + len;
    const char *p = (const char *)from;
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) != offset)
        return errno = EIO;

    while (p < q) {
        n = write(fd, p, (size_t)(q - p));
        if (n > 0)
            p += n;
        else if (n != -1)
            return errno = EIO;   /* write() returned zero. */
        else if (errno != EINTR)
            return errno;
    }

    return 0;
}

int writer_process(const int fd, const size_t len, const pid_t other)
{
    long data[2] = { 0, 0 };
    unsigned long count[2] = { 0, 0 };
    long value = 0;
    siginfo_t info;
    sigset_t sigs;
    int signum;

    sigemptyset(&sigs);
    sigaddset(&sigs, NOTIFY_SIGNAL);
    sigaddset(&sigs, SIGINT);
    sigaddset(&sigs, SIGHUP);
    sigaddset(&sigs, SIGTERM);

    while (1) {

        /* Update. */
        data[0] = ++value;
        data[1] = -(value++);

        /* then write the data. */
        if (write_to(fd, data, sizeof data, 0)) {
            fprintf(stderr, "writer_process(): write_to() failed: %s.\n", strerror(errno));
            return EXIT_FAILURE;
        }

        /* Let the mapper know. */
        kill(other, NOTIFY_SIGNAL);

        /* Wait for the notification. */
        signum = sigwaitinfo(&sigs, &info);
        if (signum == -1) {
            if (errno == EINTR)
                continue;
            fprintf(stderr, "writer_process(): sigwaitinfo() failed: %s.\n", strerror(errno));
            return EXIT_FAILURE;
        }
        if (signum != NOTIFY_SIGNAL || info.si_pid != other)
            break;

        /* Reread the file. */
        if (read_from(fd, data, sizeof data, 0)) {
            fprintf(stderr, "writer_process(): read_from() failed: %s.\n", strerror(errno));
            return EXIT_FAILURE;
        }

        /* Check the read counter. */
        count[ (data[1] == -value) ]++;
    }

    /* Print statistics. count[0] holds mismatches, count[1] matches. */
    printf("writer_process(): %lu errors out of %lu cycles (%.3f%%)\n",
           count[0], count[0] + count[1], 100.0 * (double)count[0] / (double)(count[0] + count[1]));
    fflush(stdout);

    return EXIT_SUCCESS;
}

int main(int argc, char *argv[])
{
    struct timespec duration;
    double seconds;
    pid_t mapper, writer, p;
    size_t page;
    siginfo_t info;
    sigset_t sigs;
    int fd, status;
    char dummy;

    if (argc != 3) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s FILENAME SECONDS\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program will test synchronization between a memory map\n");
        fprintf(stderr, "and reading/writing the underlying file.\n");
        fprintf(stderr, "The test will run for the specified time, or indefinitely\n");
        fprintf(stderr, "if SECONDS is zero, but you can also interrupt it with\n");
        fprintf(stderr, "Ctrl+C (INT signal).\n");
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    if (sscanf(argv[2], " %lf %c", &seconds, &dummy) != 1) {
        fprintf(stderr, "%s: Invalid number of seconds to run.\n", argv[2]);
        return EXIT_FAILURE;
    }
    if (seconds > 0) {
        duration.tv_sec = (time_t)seconds;
        duration.tv_nsec = (long)(1000000000 * (seconds - (double)(duration.tv_sec)));
    } else {
        duration.tv_sec = 0;
        duration.tv_nsec = 0;
    }

    /* Block INT, HUP, CHLD, and the notification signal. */
    sigemptyset(&sigs);
    sigaddset(&sigs, SIGINT);
    sigaddset(&sigs, SIGHUP);
    sigaddset(&sigs, SIGCHLD);
    sigaddset(&sigs, NOTIFY_SIGNAL);
    if (sigprocmask(SIG_BLOCK, &sigs, NULL) == -1) {
        fprintf(stderr, "Cannot block the necessary signals: %s.\n", strerror(errno));
        return EXIT_FAILURE;
    }

    /* Create the file. */
    page = sysconf(_SC_PAGESIZE);
    fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
    if (fd == -1) {
        fprintf(stderr, "%s: Cannot create file: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }
    if (ftruncate(fd, page) == -1) {
        fprintf(stderr, "%s: Cannot resize file: %s.\n", argv[1], strerror(errno));
        unlink(argv[1]);
        return EXIT_FAILURE;
    }
    close(fd);
    fd = -1;

    /* Ensure streams are flushed before forking. They should be, we're just paranoid here. */
    fflush(stdout);
    fflush(stderr);

    /* Fork the mapper child process. */
    mapper = fork();
    if (mapper == -1) {
        fprintf(stderr, "Cannot fork mapper child process: %s.\n", strerror(errno));
        unlink(argv[1]);
        return EXIT_FAILURE;
    }
    if (!mapper) {
        fd = open(argv[1], O_RDWR);
        if (fd == -1) {
            fprintf(stderr, "mapper_process(): %s: Cannot open file: %s.\n", argv[1], strerror(errno));
            return EXIT_FAILURE;
        }
        status = mapper_process(fd, page);
        close(fd);
        return status;
    }

    /* Fork the writer child process. (mapper contains the PID of the mapper process.) */
    writer = fork();
    if (writer == -1) {
        fprintf(stderr, "Cannot fork writer child process: %s.\n", strerror(errno));
        unlink(argv[1]);
        kill(mapper, SIGKILL);
        return EXIT_FAILURE;
    }
    if (!writer) {
        fd = open(argv[1], O_RDWR);
        if (fd == -1) {
            fprintf(stderr, "writer_process(): %s: Cannot open file: %s.\n", argv[1], strerror(errno));
            return EXIT_FAILURE;
        }
        status = writer_process(fd, page, mapper);
        close(fd);
        return status;
    }

    /* Wait for a signal. */
    if (duration.tv_sec || duration.tv_nsec)
        status = sigtimedwait(&sigs, &info, &duration);
    else
        status = sigwaitinfo(&sigs, &info);

    /* Whatever it was, we kill the child processes. */
    kill(mapper, SIGHUP);
    kill(writer, SIGHUP);
    do {
        p = waitpid(-1, NULL, 0);
    } while (p != -1 || errno == EINTR);

    /* Cleanup. */
    unlink(argv[1]);
    return EXIT_SUCCESS;
}

Note that the child processes open the temporary file separately. To compile and run, use e.g.

gcc -Wall -O2 test-multi.c -o multi
./multi temp 10

The second parameter is the duration of the test, in seconds. (You can interrupt the testing safely using SIGINT (Ctrl+C) or SIGHUP.)

On my machine, the test rate is roughly 120,000 tests per second; the msync() call is commented out here also, because I don't ever see any errors/desynchronization even without it. (Plus, msync(ptr, len, MS_SYNC) and msync(ptr, len, MS_SYNC | MS_INVALIDATE) are horribly slow; with either, I can get less than 1000 tests per second, with absolutely no difference in the results. That's a 100x slowdown.)

The MAP_NORESERVE flag to mmap tells it to use the file itself as backing storage when under memory pressure, rather than swap. If you compile the code on a system that does not recognize that flag, you can omit it. As long as the mapping is not evicted from RAM, the flag does not affect the operation at all.

Is it possible to discard dirty pages on a shared mapping?

Going over the comments, it appears that using the swap is fine for your needs as an alternative to file storage. If that's the case, I think your best bet is to use a file, as you've done, on a tmpfs partition. The best tmpfs partition to use for that purpose is at /dev/shm.

Just open a file in /dev/shm, truncate it to the size you need, mmap it and unlink, precisely like you've already done. /dev/shm uses main memory as its "backing store", but that will get swapped out if memory is short.

The advantage of using swap is that pages that still fit in memory when the program exits are never forcibly flushed. Once the program exits, those pages are immediately recognized as unneeded and discarded. This should solve your problem while still allowing you to resize etc.
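For illustration, the open/unlink/truncate/mmap sequence might look like this sketch. It assumes /dev/shm is a mounted tmpfs, as on typical Linux systems; scratch_map is a hypothetical name.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

/* Create an unlinked scratch file of len bytes in /dev/shm and map it.
   Returns the mapping, or MAP_FAILED on error. On success *fd_out receives
   the descriptor, kept open so the file can be resized with ftruncate(). */
static void *scratch_map(size_t len, int *fd_out)
{
    char path[] = "/dev/shm/scratch-XXXXXX";
    void *p;
    int fd = mkstemp(path);

    if (fd == -1)
        return MAP_FAILED;
    unlink(path);   /* the file lives on until the last reference goes away */
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        return MAP_FAILED;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }
    *fd_out = fd;
    return p;
}
```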

It has the extra benefit of requiring almost no change to your current code :-)

mmap for writing sequential log file for speed?

I wrote my bachelor thesis about the comparison of fwrite vs. mmap ("An Experiment to Measure the Performance Trade-off between Traditional I/O and Memory-mapped Files"). First of all, for writing, you don't have to go for memory-mapped files, especially for large files. fwrite is totally fine and will nearly always outperform approaches using mmap. mmap will give you the biggest performance boost for parallel data reading; for sequential data writing, your real limitation with fwrite is your hardware.

In my examples remapSize is the initial size of the file and the size by which the file gets increased on each remapping.
fileSize keeps track of the size of the file, mappedSpace represents the size of the current mmap (its length), and alreadyWrittenBytes is the number of bytes that have already been written to the file.

Here is the example initialization:

void init() {
    fileDescriptor = open(outputPath, O_RDWR | O_CREAT | O_TRUNC, (mode_t) 0600); // Open file
    result = ftruncate(fileDescriptor, remapSize); // Set initial size
    fsync(fileDescriptor); // Flush
    memoryMappedFile = (char*) mmap64(0, remapSize, PROT_WRITE, MAP_SHARED, fileDescriptor, 0); // Create mmap
    fileSize = remapSize; // Store file size
    mappedSpace = remapSize; // Store mapped size
}

Ad Q1:

I used an "Unmap-Remap"-mechanism.


Unmap

  • first flushes (msync)
  • and then unmaps the memory-mapped file.

This could look as follows:

void unmap() {
    msync(memoryMappedFile, mappedSpace, MS_SYNC); // Flush
    munmap(memoryMappedFile, mappedSpace); // Release the mapping
}

For Remap, you have the choice to remap the whole file or only the newly appended part.

Remap basically

  • increases the file size
  • creates the new memory map

Example implementation for a full remap:

void fullRemap() {
    ftruncate(fileDescriptor, mappedSpace + remapSize); // Make file bigger
    fsync(fileDescriptor); // Flush file
    memoryMappedFile = (char*) mmap64(0, mappedSpace + remapSize, PROT_WRITE, MAP_SHARED, fileDescriptor, 0); // Create new mapping on the bigger file
    fileSize += remapSize;
    mappedSpace += remapSize; // Set mappedSpace to new size
}

Example implementation for the small remap:

void smallRemap() {
    ftruncate(fileDescriptor, fileSize + remapSize); // Make file bigger
    fsync(fileDescriptor); // Flush file
    remapAt = alreadyWrittenBytes - (alreadyWrittenBytes % pageSize); // Round remap location down to a page boundary
    memoryMappedFile = (char*) mmap64(0, fileSize + remapSize - remapAt, PROT_WRITE, MAP_SHARED, fileDescriptor, remapAt); // Create memory-map
    fileSize += remapSize;
    mappedSpace = fileSize - remapAt;
}

There is also an mremap function, but its man page states:

This call is Linux-specific, and should not be used in programs
intended to be portable.

Ad Q2:

I'm not sure if I understood that point right. If you want to tell the kernel "and now load the next page", then no, this is not possible (at least to my knowledge). But see Ad Q3 on how to advise the kernel.

Ad Q3:

You can use madvise with the flag MADV_SEQUENTIAL, but keep in mind that this does not force the kernel to read ahead; it only advises it to.

Excerpt from the man page:

This may cause the kernel to aggressively read-ahead

Personal conclusion:

Do not use mmap for sequential data writing. It will just cause much more overhead and will lead to much more "unnatural" code than a simple writing algorithm using fwrite.

Use mmap for random access reads to large files.

These are also the results that were obtained during my thesis. I was not able to achieve any speedup by using mmap for sequential writing; in fact, it was always slower for this purpose.
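For comparison with the mmap-based snippets, the plain fwrite approach recommended in the conclusion could look like the sketch below. append_records is a hypothetical name, and the 1 MiB buffer size is an arbitrary choice, not a measured optimum.

```c
#include <stdio.h>
#include <string.h>

/* Append count text records to the log at path, one per line.
   Returns 0 on success, -1 on any I/O error. */
static int append_records(const char *path, const char *const *lines, size_t count)
{
    FILE *out = fopen(path, "ab");
    size_t i;

    if (!out)
        return -1;

    /* Let stdio batch small writes into large ones (buffer size is a guess). */
    setvbuf(out, NULL, _IOFBF, 1 << 20);

    for (i = 0; i < count; i++) {
        const size_t len = strlen(lines[i]);
        if (fwrite(lines[i], 1, len, out) != len || fputc('\n', out) == EOF) {
            fclose(out);
            return -1;
        }
    }

    /* fclose() flushes the buffer; its result reports late write errors. */
    return fclose(out) == 0 ? 0 : -1;
}
```

There is no remapping, page alignment, or file pre-sizing to manage here, which is the "more natural code" argument made above.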

