Linux: Large int array: mmap vs seek file?

I'd say performance should be similar if access is truly random. The OS will use a similar caching strategy whether the data page is mapped from a file or the file data is simply held in the page cache without being mapped into your address space.

Assuming cache is ineffective:

  • You can use posix_fadvise to declare your access pattern in advance and disable readahead (see the sketch below).
  • Due to address space layout randomization, there might not be a contiguous block of 4 TB in your virtual address space.
  • If your data set ever expands, the address space issue might become more pressing.

So I'd go with explicit reads.
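
For instance, a minimal sketch of the explicit-read approach (the file path and helper name are mine, not from the answer; posix_fadvise here just declares the random access pattern):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: fetch the int at a given index with one pread. */
static int read_int_at(int fd, off_t index)
{
    int value;
    if (pread(fd, &value, sizeof(value), index * (off_t)sizeof(value))
            != (ssize_t)sizeof(value)) {
        perror("pread");
        exit(1);
    }
    return value;
}

int main(void)
{
    int fd = open("/tmp/bigarray.bin", O_RDONLY);  /* hypothetical file */
    if (fd < 0) { perror("open"); exit(1); }

    /* Declare the access pattern: random, so the kernel skips readahead. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    printf("%d\n", read_int_at(fd, 123456789));
    close(fd);
    return 0;
}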

mmaping large files (for persistent large arrays)

All pointers that are stored inside the mmap'd region should be done as offsets from the base of the mmap'd region, not as real pointers! You won't necessarily be getting the same base address when you mmap the region on the next run of the program. (I have had to clean up code that made incorrect assumptions about mmap region base address constancy).
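
A minimal sketch of the offset-based convention, assuming a linked node type (the struct and helpers are illustrative, not from the answer):

#include <stddef.h>

struct node {
    long value;
    size_t next;   /* offset from the region base; 0 means "null" */
};

/* Turn a stored offset back into a pointer valid for this run. */
static struct node *node_at(void *region_base, size_t off)
{
    return off ? (struct node *)((char *)region_base + off) : NULL;
}

/* Turn a pointer into an offset that stays valid across runs. */
static size_t off_of(void *region_base, struct node *n)
{
    return n ? (size_t)((char *)n - (char *)region_base) : 0;
}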

When should I use mmap for file access?

mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.

mmap also allows the operating system to optimize paging operations. For example, consider two programs: program A, which reads a 1MB file into a buffer created with malloc, and program B, which mmaps the 1MB file into memory. If the operating system has to swap part of A's memory out, it must write the contents of the buffer to swap before it can reuse the memory. In B's case any unmodified mmap'd pages can be reused immediately because the OS knows how to restore them from the existing file they were mmap'd from. (The OS can detect which pages are unmodified by initially marking writable mmap'd pages as read only and catching seg faults, similar to the copy-on-write strategy.)

mmap is also useful for inter process communication. You can mmap a file as read / write in the processes that need to communicate and then use synchronization primitives in the mmap'd region (this is what the MAP_HASSEMAPHORE flag is for).
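
MAP_HASSEMAPHORE is a BSD flag; on Linux the usual route is a process-shared pthread mutex living inside the shared mapping. A minimal sketch, with a hypothetical file name and struct layout:

#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

struct shared_area {
    pthread_mutex_t lock;
    long counter;
};

/* Map a file both processes open, and (in the first process only)
   initialize a mutex that works across process boundaries. */
static struct shared_area *setup_shared(const char *path, int init)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, sizeof(struct shared_area)) == -1)
        return NULL;

    struct shared_area *sh = mmap(NULL, sizeof(*sh),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping outlives the fd */
    if (sh == MAP_FAILED)
        return NULL;

    if (init) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&sh->lock, &attr);
    }
    return sh;
}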

One place mmap can be awkward is if you need to work with very large files on a 32 bit machine. This is because mmap has to find a contiguous block of addresses in your process's address space that is large enough to fit the entire range of the file being mapped. This can become a problem if your address space becomes fragmented, where you might have 2 GB of address space free, but no individual range of it can fit a 1 GB file mapping. In this case you may have to map the file in smaller chunks than you would like to make it fit.

Another potential awkwardness with mmap as a replacement for read / write is that you have to start your mapping at offsets that are multiples of the page size. If you just want to get some data at offset X you will need to fix up that offset so it's compatible with mmap (see the sketch below).
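
A sketch of that fixup, with a hypothetical map_at helper: round the mapping offset down to a page boundary, then step the returned pointer forward by the remainder.

#include <unistd.h>
#include <sys/mman.h>

/* Map `len` bytes at file offset `offset`; returns a pointer to the data
   and reports the real mapping through map_base/map_len for munmap. */
static void *map_at(int fd, off_t offset, size_t len,
                    void **map_base, size_t *map_len)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t aligned = offset & ~((off_t)page - 1);   /* round down */
    size_t delta = (size_t)(offset - aligned);

    void *base = mmap(NULL, len + delta, PROT_READ, MAP_SHARED, fd, aligned);
    if (base == MAP_FAILED)
        return NULL;
    *map_base = base;
    *map_len = len + delta;
    return (char *)base + delta;    /* the data at the requested offset */
}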

And finally, read / write are the only way you can work with some types of files. mmap can't be used on things like pipes and ttys.

Mmap() an entire large file

MAP_PRIVATE mappings require a memory reservation, as writing to these pages may result in copy-on-write allocations. This means that you can't map something much larger than your physical RAM + swap. Try using a MAP_SHARED mapping instead. This means that writes to the mapping will be reflected on disk - as such, the kernel knows it can always free up memory by doing writeback, so it won't limit you.

I also note that you're mapping with PROT_WRITE, but you then go on and read from the memory mapping. You also opened the file with O_RDONLY - this itself may be another problem for you; you must specify O_RDWR if you want to use PROT_WRITE with MAP_SHARED.

As for PROT_WRITE only, this happens to work on x86, because x86 doesn't support write-only mappings, but may cause segfaults on other platforms. Request PROT_READ|PROT_WRITE - or, if you only need to read, PROT_READ.

On my system (VPS with 676MB RAM, 256MB swap), I reproduced your problem; changing to MAP_SHARED results in an EPERM error (since I'm not allowed to write to the backing file opened with O_RDONLY). Changing to PROT_READ and MAP_SHARED allows the mapping to succeed.

If you need to modify bytes in the file, one option would be to make private just the ranges of the file you're going to write to. That is, munmap and remap with MAP_PRIVATE the areas you intend to write to. Of course, if you intend to write to the entire file then you need 8GB of memory to do so.
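
One way that remapping step might look (the helper is hypothetical; on Linux MAP_FIXED atomically replaces the overlapped pages, so a separate munmap isn't strictly required):

#include <unistd.h>
#include <sys/mman.h>

/* Given a MAP_SHARED mapping of `fd` at `base`, make the range
   [off, off+len) privately copy-on-write so writes stay off the disk. */
static void *make_range_private(char *base, int fd, off_t off, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t start = off & ~((off_t)page - 1);   /* round down to a page */
    size_t span = (size_t)(off - start) + len;

    void *p = mmap(base + start, span, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_FIXED, fd, start);
    return p == MAP_FAILED ? NULL : p;
}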

Alternately, you can write 1 to /proc/sys/vm/overcommit_memory. This will allow the mapping request to succeed; however, keep in mind that if you actually try to use the full 8GB of COW memory, your program (or some other program!) will be killed by the OOM killer.

In the following case, which one is better? fread() or mmap()?

My question is, which method is more efficient? fread() or mmap()?

First of all, let's look at how fread and mmap work on Linux.

fread:

Say we're working with an ext4 file system (without encryption). fread uses an internal buffer, and when there's no data in it, it issues the read system call. After some time we reach fs/read_write.c::vfs_read, and after more work we arrive at mm/filemap.c::generic_file_read_iter. In this function the kernel fills the inode page cache and reads the data into it.

So fread does basically the same thing mmap does. The difference is that in the fread case we don't work with the pages directly; we just copy a portion of data from the kernel's inode page cache into a user-space buffer, whereas with mmap the page cache appears directly in the program's address space. Also, with fread, when a page is missing from the inode page cache we just read it in, but with mmap the access causes a page fault first, and only after that is the page read.

Both variants use a read-ahead strategy. The one possible difference is in cache policy, which in the mmap case we can control with madvise and the flags of mmap (as sketched below).

So I suppose the answer is: they are almost the same in terms of speed in a sequential-read case like yours.
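
As a rough sketch of that mmap-side tuning (the file path is mine; MADV_SEQUENTIAL asks for aggressive readahead, while MADV_RANDOM would disable it):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/tmp/data.bin", O_RDONLY);   /* hypothetical file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) == -1) { perror("open/fstat"); exit(1); }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }

    madvise(p, st.st_size, MADV_SEQUENTIAL);    /* declare sequential access */

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)      /* touch every page */
        sum += (unsigned char)p[i];
    printf("checksum: %lu\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}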

How can I keep multiple copies of a very large dataset in memory?

This sounds like a good use case for mmap.

The mmap function can be used to take an open file and map it to a region of memory. Reads and writes to the file via the returned pointer are handled internally, although you can periodically flush to disk manually. This will allow you to manipulate a data structure larger than the physical memory of the system.

This also has the advantage that you don't need to worry about moving data back and forth from disk manually. The kernel will take care of it for you.

So for each of these large arrays, you can create a memory mapping backed by a file on disk.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

#define DATA_LEN 30000000000LL

int main(void)
{
    int array1_fd = open("/tmp/array1", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (array1_fd < 0) {
        perror("open failed");
        exit(1);
    }

    /* Make sure the file is big enough to back the whole mapping:
       seek past the end and write one byte (ftruncate would also work). */
    if (lseek(array1_fd, DATA_LEN, SEEK_SET) == -1) {
        perror("seek to len failed");
        exit(1);
    }
    if (write(array1_fd, "x", 1) == -1) {
        perror("write at end failed");
        exit(1);
    }
    if (lseek(array1_fd, 0, SEEK_SET) == -1) {
        perror("seek to 0 failed");
        exit(1);
    }

    char *array1 = mmap(NULL, DATA_LEN, PROT_READ | PROT_WRITE,
                        MAP_SHARED, array1_fd, 0);
    if (array1 == MAP_FAILED) {
        perror("mmap failed");
        exit(1);
    }

    /* Use array1 as if it were an in-memory array of DATA_LEN bytes. */

    munmap(array1, DATA_LEN);
    close(array1_fd);
    return 0;
}

The important part of the mmap call is the MAP_SHARED flag. This means that updates to the mapped memory region are carried through to the underlying file descriptor.

memory map file with growing size

After some experimentation, I found a way to make it work.

First mmap the file with PROT_NONE and a large enough size. On 64-bit systems, it can be as large as 1L << 46 (64TB). This does NOT consume physical memory* (at least on Linux). It will consume address space (virtual memory) for this process.

void* ptr = mmap(NULL, (1L << 40), PROT_NONE, MAP_SHARED, fd, 0);

Then, give read (and/or write) permission to the part of the memory within the file length using mprotect. Note that the size needs to be aligned to the page size (which can be obtained with sysconf(_SC_PAGESIZE), usually 4096).

mprotect(ptr, aligned_size, PROT_READ | PROT_WRITE);

However, if the file size is not page-size aligned, reading the portion within the mapped region (with PROT_READ permission) but beyond the file length will trigger a bus error, as documented in the mmap manual.

Then you can use either the file descriptor fd or the mapped memory to read and write the file. Remember to use fsync or msync to persist the data after writing. A memory-mapped page with PROT_READ permission should see the latest file content if you write to the file through fd**. A page newly exposed with mprotect will also see the updated content.

Depending on the application, you might want to use ftruncate to make the file size aligned to system page size for the best performance. You might also want to use madvise with MADV_SEQUENTIAL to improve performance when reading those pages.

*This behavior is not mentioned in the mmap manual. However, since PROT_NONE implies those pages are not accessible in any way, it's trivial for any OS implementation not to allocate any physical memory for them at all.

**This behavior (a memory region mapped before a file write seeing the update once the write is completed with fsync or msync) is also not mentioned in the manual, or at least I did not see it. But it seems to be the case at least on recent Linux kernels (4.x onward).
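
Putting the steps together, a minimal sketch of the grow-as-you-go mapping (the file name and chunk size are mine; error handling abbreviated):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define RESERVE (1L << 40)   /* reserve 1TB of address space up front */

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("/tmp/growing.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    /* Step 1: reserve address space; nothing is accessible yet. */
    char *base = mmap(NULL, RESERVE, PROT_NONE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Step 2: grow the file by a page-aligned chunk... */
    size_t size = 0, grow = 16 * (size_t)page;
    if (ftruncate(fd, size + grow) == -1) { perror("ftruncate"); exit(1); }

    /* Step 3: ...and unlock the matching part of the mapping. */
    if (mprotect(base + size, grow, PROT_READ | PROT_WRITE) == -1) {
        perror("mprotect"); exit(1);
    }
    size += grow;

    base[0] = 'A';                  /* write through the mapping */
    msync(base, size, MS_SYNC);     /* persist to disk */

    munmap(base, RESERVE);
    close(fd);
    return 0;
}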

Is it possible to mmap a very big file and use qsort?

If the file will fit in a contiguous mapping in your address space, you can do this. If it won't, you can't.

As to the differences:

  • if the file just about fits, and then you add some more data, the mmap will fail. A normal external sort won't suddenly stop working because you have a little more data.
  • if you don't map it with MAP_PRIVATE, sorting will mutate the original file. A normal external sort won't (necessarily).
  • if you do map it with MAP_PRIVATE, you could crash at any time if the VM doesn't have room to duplicate the whole file. Again, a strictly external sort's memory requirements don't scale linearly with the data size.

tl;dr

It is possible, it may fail unpredictably and unrecoverably, you almost certainly shouldn't do it.
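
For completeness, here's roughly what the in-place approach looks like when the file does fit (the file name is mine; note that MAP_SHARED means the sort rewrites the file):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int fd = open("/tmp/ints.bin", O_RDWR);   /* hypothetical data file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) == -1) { perror("open/fstat"); exit(1); }

    int *a = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Sort the whole file as an array of ints, in place. */
    qsort(a, st.st_size / sizeof(int), sizeof(int), cmp_int);

    munmap(a, st.st_size);
    close(fd);
    return 0;
}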


