Memory-Mapped Files and Low-Memory Scenarios

Memory-mapped files bring data from disk into memory one page at a time. Unused pages are free to be swapped out, the same as any other virtual memory, unless they have been wired into physical memory with mlock(2). Memory mapping leaves it to the OS to decide what to copy from disk into memory, and when.
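A minimal sketch of both ideas, mapping a file and wiring its first page; the file name data.bin and the abbreviated error handling are illustrative, and a non-empty file is assumed:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Pages are faulted in lazily, only as they are touched. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Wire the first page so it cannot be swapped out
       (may fail if RLIMIT_MEMLOCK is too low). */
    if (mlock(p, 4096) != 0) perror("mlock");

    printf("first byte: %d\n", p[0]);

    munlock(p, 4096);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}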

Dropping from the Foundation level to the BSD level to use mmap is unlikely to make much difference, beyond making code that has to interface with other Foundation code somewhat more awkward.

Memory-mapped files: is optional writing possible?

I don't think you can. By that I mean you may be able to, but it doesn't make any sense to me :-)

The whole point of a memory-mapped file is that it's a window onto the actual file. If you don't want changes reflected in the file, you'll probably have to do something like batch up the changes in a data structure (e.g., an array of base address, size, and data) and apply them when saving.

In which case, you wouldn't actually need the memory mapped file, just read in and maintain the chunks you want to change (lock the file first if there's a chance of multi-user access).
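A minimal sketch of that batching approach; the struct and function names here are illustrative, not a standard API:

#include <stdio.h>

/* One queued change: where it goes in the file, and the new bytes. */
struct pending_edit {
    long   offset;  /* base offset of the change in the file */
    size_t size;    /* number of changed bytes */
    char  *data;    /* the new bytes */
};

/* Apply all queued edits to the file only when the user saves. */
int apply_edits(const char *path, struct pending_edit *edits, size_t n) {
    FILE *f = fopen(path, "r+b");
    if (!f) return -1;
    for (size_t i = 0; i < n; i++) {
        if (fseek(f, edits[i].offset, SEEK_SET) != 0 ||
            fwrite(edits[i].data, 1, edits[i].size, f) != edits[i].size) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f);
}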

Update:

Have you thought about, when saving, deleting the original file and just renaming the temporary file to the original file name? That's likely to be much faster than copying 1 GB of data from the temporary file back to the original. That way, if you don't want the changes saved, you just delete the temporary file and keep the original.

You'll still have to copy the original data to the temporary file when loading, but you won't have to copy the temporary data back (whether you save it or not) - that would roughly halve the time taken.
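A sketch of that save path; note that on POSIX systems rename(2) already replaces the target atomically, so a separate delete isn't needed. The file names are examples:

#include <stdio.h>

/* Save: swap the temporary working copy in place of the original. */
int save(void) {
    /* ...all changes have already been written to "document.tmp"... */
    if (rename("document.tmp", "document.dat") != 0) {
        perror("rename");
        return -1;
    }
    return 0;
}

/* Discard: just delete the temporary file; the original is untouched. */
int discard(void) {
    return remove("document.tmp");
}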

Understanding memory mapping conceptually

QUESTION #1: Isn't this the case for a standard file I/O operation as well? If an application tries to read from a part of a file that is not yet cached, it will result in a syscall that will cause the kernel to load the relevant page/block from the device. And on top of that, the page needs to be copied back to the user-space buffer.

You issue the read into a buffer and the I/O device copies the data there. There are also asynchronous reads, or AIO, where the kernel transfers the data in the background as the device provides it. You can do the same thing with threads and read. With mmap, you have no control over, and no way of knowing, whether a page is mapped or not. The read case is more explicit. This follows from its signature:

ssize_t read(int fd, void *buf, size_t count);

You specify a buf and a count, so you can place the data exactly where you want it in your program. As the programmer, you may know that the data will not be used again, so subsequent calls to read can reuse the same buf from the last call. This has multiple benefits; the easiest to see is lower memory use (or at least less address space and fewer MMU table entries). mmap cannot know whether a page will be accessed again in the future, and it cannot know that only some of the data in a page was of interest. Hence, read is more explicit.
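A sketch of that reuse pattern, processing a file in fixed-size chunks through one buffer; data.bin, the chunk size, and process_chunk are placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

/* Placeholder for whatever work is done on each chunk. */
static void process_chunk(const char *buf, ssize_t n) { (void)buf; (void)n; }

int main(void) {
    static char buf[CHUNK];  /* one buffer, reused on every iteration */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        process_chunk(buf, n);  /* the same pages stay hot in cache/TLB */

    close(fd);
    return 0;
}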

Imagine you have 4096 records of 4095 bytes each on disk. You need to read/look at two random records and perform an operation on them. With read, you can allocate two 4095-byte buffers with malloc() or use a static char buffer[2][4095] array. With mmap(), each record straddles a page boundary on average, so the mapping must cover 8192 bytes per record to fill two pages, or 16k total. Accessing each mmap'd record spans two pages, which results in two page faults per record access, and the kernel must allocate four TLB/MMU entries to hold the data.

Alternatively, if you read into sequential buffers, only two pages are needed, with only two syscalls (read). Also, if the computation on the records is extensive, the locality of the buffers will make it much faster (CPU cache hits) than working on the mmap'd data.
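A sketch of the read side of that example: each record lands in its own tight buffer via pread(2), one syscall per record. The file name and record numbers are arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define RECSIZE 4095

int main(void) {
    static char rec[2][RECSIZE];  /* two records, about two pages total */
    int fd = open("records.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* pread(2): one explicit-offset read per record. */
    if (pread(fd, rec[0], RECSIZE, (off_t)17 * RECSIZE) != RECSIZE ||
        pread(fd, rec[1], RECSIZE, (off_t)4000 * RECSIZE) != RECSIZE) {
        perror("pread");
        return 1;
    }

    /* ...operate on rec[0] and rec[1], both cache- and TLB-friendly... */
    close(fd);
    return 0;
}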

And on top of that, the page needs to be copied back to the user-space buffer.

This copy may not be as bad as you believe. The CPU caches data so that the next access doesn't have to reload from main memory, which can be 100x slower than L1 CPU cache.

In the case above, mmap can take over two times as long as a read.

Is the concern here that page faults are somehow more expensive than syscalls in general - my interpretation of what Linus Torvalds says here? Is it because page faults are blocking => the thread is not scheduled off the CPU => we are wasting precious time? Or is there something I'm missing here?

I think the main point is that you don't have control with mmap. You mmap the file and have no idea whether any part of it is in memory. If you just access the file randomly, it will keep reading pages back from disk, and you may get thrashing, depending on the access pattern, without knowing it. If the access is purely sequential, mmap may not seem better at first glance. However, by re-reading each new chunk into the same user buffer, the CPU's L1/L2 caches and TLB will be better utilized, both for your process and for others on the system. If you read all chunks into unique buffers and process them sequentially, the two will be about the same (see the note below).

QUESTION #2: Is there an architectural limitation with supporting async I/O for memory-mapped files, or is it just that no one got around to doing it?

mmap is already similar to AIO, but it works in fixed sizes of 4k; i.e., the full mmap'd file doesn't need to be in memory to start operating on it. Functionally, they are different mechanisms that achieve a similar effect, and they are architecturally different.

QUESTION #3: Vaguely related, but my interpretation of this article is that the kernel can read ahead for standard I/O (even without fadvise()) but does not read ahead for memory-mapped files (unless issued an advisory with madvise()). Is this accurate? If this statement is in fact true, is that why syscalls for standard I/O may be faster, as opposed to a memory-mapped file, which will almost always cause a page fault?

Poorly programmed read can be just as bad as mmap, and mmap can use madvise. The difference is more related to all the Linux MM machinery that has to run to make mmap work. It all depends on your use case; either can work better depending on the access pattern. I think Linus was just saying that neither is a magic bullet.

For instance, if you read into a buffer that is larger than the memory the system has, and the system resorts to swap, which does the same sort of thing as mmap, you will be worse off. You may have a system without swap, where mmap for random read access will be fine and will let you manage files bigger than actual memory. The setup to do this with read requires a lot more code, which often means more bugs, or, if you are naive, you will just get an OOM-kill message. Note: however, if the access is sequential, read is not as much code and will probably be faster than mmap.
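For reference, a sketch of both advisory calls mentioned above, posix_fadvise(2) for the read path and madvise(2) for a mapping, using the sequential-access hints; data.bin is an example name:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Hint for read()-style I/O: the kernel may read ahead aggressively. */
    posix_fadvise(fd, 0, st.st_size, POSIX_FADV_SEQUENTIAL);

    /* The same idea for the mmap path. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        madvise(p, st.st_size, MADV_SEQUENTIAL);
        /* MADV_WILLNEED would ask for pages to be faulted in early. */
        munmap(p, st.st_size);
    }
    close(fd);
    return 0;
}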


Additional read benefits

For some uses, read offers access to sockets and pipes. Also, char devices such as ttyS0 will only work with read. This can be beneficial if you are writing a command-line program that takes file names from the command line. If you structure it around mmap, it may be difficult to support these files.
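A sketch of that flexibility: a filter built around read works unchanged whether stdin is a regular file, a pipe, or a char device like ttyS0, where mmap would fail:

#include <unistd.h>

int main(void) {
    char buf[4096];
    ssize_t n;
    /* read works on files, pipes, sockets, and ttys alike. */
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        if (write(STDOUT_FILENO, buf, (size_t)n) != n) return 1;
    }
    return 0;
}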


