Read File Without Disk Caching in Linux


You can use posix_fadvise() with the POSIX_FADV_DONTNEED advice to request that the system free the pages you've already read.

Telling Linux not to keep a file in the cache when it is written to disk

Can I tell the OS not to keep parts of the large file in cache?

Yes, you probably want a system call like posix_fadvise(2) or madvise(2). In unusual cases, you might use readahead(2), userfaultfd(2), or Linux-specific flags to mmap(2), or very cleverly handle SIGSEGV (see signal(7), signal-safety(7), eventfd(2), and signalfd(2)). You'll need to write a C program that does this.

But I am not sure that it is worth your development efforts. In many cases, the behavior of a recent Linux kernel is good enough.

See also proc(5) and linuxatemyram.com

You may want to read the GC handbook; it is relevant to your concerns.

Consider studying, for inspiration, the source code of existing open-source software such as GCC, Qt, RefPerSys, PostgreSQL, GNU Bash, etc.

Most of the time, it is simply not worth the effort to explicitly code something to manage your page cache.

I guess that mount(2) options in your /etc/fstab file (see fstab(5)) are in practice more important. Or changing or tuning your file system (e.g. ext4(5), xfs(5)). Or read(2)-ing in large pieces (e.g. 1 MiB at a time).

Play with dd(1) to measure. See also time(7).

Most applications are not disk-bound, and for those that are, renting more disk space is often cheaper than adding and debugging extra code.

Don't forget to benchmark, e.g. using strace(1) and time(1).

PS. Don't forget your developer costs. They are often well above the price of a RAM module (or of a faster SSD).

Read file without evicting from OS page cache

Using posix_fadvise you can hint to the OS that it should drop certain file blocks from the cache. Together with information from mincore, which tells us which blocks are currently cached, we can alter applications to work without disturbing the buffer cache.

This delightful workaround for [un]implemented kernel features is described in detail:

http://insights.oetiker.ch/linux/fadvise/

[Edit] Implications of kernel read-ahead

For full read performance, make sure to drop only the pages you've already read. Otherwise you'll drop the pages that the kernel has helpfully read in ahead of you :). (I think this should be detected as a read-ahead mis-predict, which would disable read-ahead and at least avoid a lot of wasted IO. But read-ahead is seriously helpful, so you want to avoid disabling it.)

Also, I bet that if you test the pages just ahead of your last read, they will always show as in-core. That won't tell you whether anyone else was using them; all it shows is that kernel read-ahead is working :).

The code in the linked rsync patch should be fine (ignoring the "array of all the fds" hack). It tests the whole file before the first read, which is reasonable because it only requires an in-core allocation of one byte per 4 kB file page.

Are file reads served from dirtied pages in the page cache?

The file read will fetch data from the page cache without writing to disk. From Linux Kernel Development, 3rd Edition, by Robert Love:

Whenever the kernel begins a read operation—for example, when a
process issues the read() system call—it first checks if the requisite
data is in the page cache. If it is, the kernel can forgo accessing
the disk and read the data directly out of RAM. This is called a cache
hit. If the data is not in the cache, called a cache miss, the kernel
must schedule block I/O operations to read the data off the disk.

Writeback to disk happens periodically, separate from read:

The third strategy, employed by Linux, is called write-back. In a
write-back cache, processes perform write operations directly into the
page cache. The backing store is not immediately or directly updated.
Instead, the written-to pages in the page cache are marked as dirty
and are added to a dirty list. Periodically, pages in the dirty list
are written back to disk in a process called writeback, bringing the
on-disk copy in line with the in-memory cache.

Disabling disk cache in linux

You need root access to do this. You can run the hdparm -W 0 /dev/sda command to disable write caching, where you have to replace /dev/sda with the device for your drive:

#include <stdlib.h>
...
system("hdparm -W 0 /dev/sda");

You can also selectively disable write caching for individual partitions like this: hdparm -W 0 /dev/sda1.

To re-enable caching, just use the -W 1 argument.

man hdparm, man system

Does Linux read() copy data into the process address space

My understanding is that the first 4k of bytes are read into the page cache and then 64 bytes are copied into the buffer for the read() call.

In general, that is correct. (But there are always exceptions - in this case direct I/O. You really don't ever need to worry much about that unless you're dealing with some I/O corner cases...)

When you read in the data via read() and the 4k is stored in the file system cache, does that take up your process's virtual memory address space or is that just disk cache space that could/will be paged out later?

The latter - the disk cache is memory in kernel space that, well, caches contents of data on disk. And it can be paged out (as can most pages of memory).

That 64 bytes that is copied into the buffer returned by the read() to be used by the process, does this data take up process address space or just disk space cache?

The data is copied from the disk cache (kernel memory) into the buffer that's in user space. So the data is in both places. (Which is a reason for direct I/O - the extra copy step and the extra copy of the data itself is eliminated)

I/O performance is a complex subject. What's fastest in one case may not be even remotely the fastest in another. Everything from CPU speed to memory bandwidth to PCI bus bandwidth to disk controller characteristics to SATA/SAS/SCSI/FC/iSCSI bandwidth and latency to actual physical disk performance specifics matter. How data is laid out on disk(s) matters. How data is accessed matters. It's pretty much impossible to state something like mmap() is faster than read() - or the other way around.

Think of getting the best I/O performance as similar to impedance matching speakers on a high-end stereo system for the best sound, but with a whole lot more variables affecting the "best" answer. To get the absolute best performance, everything has to match - from the actual layout of data on physical disk(s) to the exact pattern(s) the user-space applications use to access the data.

And in general it's really not worth bothering with - almost every out-of-the-box setup will get you at least 80% or so of the maximum possible performance your hardware can deliver as long as you don't do something bad like read a file in reverse a single character at a time.

How to measure file read speed without caching?

Clear the Linux file cache (requires root; writing 3 instead of 1 also drops dentries and inodes)

sync && echo 1 > /proc/sys/vm/drop_caches

Create a large file that uses all your RAM

dd if=/dev/zero of=dummyfile bs=1024 count=LARGE_NUMBER

(don't forget to remove dummyfile when done).
