What Does O_DIRECT Really Mean

What does O_DIRECT really mean?

(This answer pertains to Linux - other OSes may have different caveats/semantics)

Let's start with the sub-question:

If I open a file with O_DIRECT flag, does it mean that whenever a write(blocking mode) to that file returns, the data is on disk?

No (as @michael-foukarakis commented) - if you need a guarantee your data made it to non-volatile storage you must use/add something else.

What does O_DIRECT really mean?

It's a hint that you want your I/O to bypass the Linux kernel's caches. What will actually happen depends on things like:

  • Disk configuration
  • Whether you are opening a block device or a file in a filesystem
  • If using a file within a filesystem
    • The exact filesystem used and the options in use on the filesystem and the file
    • Whether you've correctly aligned your I/O
    • Whether a filesystem has to do a new block allocation to satisfy your I/O
  • If the underlying disk is local, what layers you have in your kernel storage stack before you reach the disk block device
  • Linux kernel version
  • ...

The list above is not exhaustive.
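To make the alignment point from the list above concrete, here is a minimal Python sketch (the 4096-byte alignment and the buffered-I/O fallback are assumptions about a typical setup; the real alignment requirement is device- and filesystem-specific):

```python
import mmap
import os

ALIGN = 4096  # a common logical block size; the real requirement is device/filesystem specific

def write_direct(path: str, data: bytes) -> bool:
    """Write `data` to `path` with O_DIRECT if the filesystem allows it.

    O_DIRECT typically requires the buffer address, the file offset and
    the transfer size to all be aligned (here: to ALIGN bytes). An
    anonymous mmap is page-aligned, which satisfies the buffer-address
    requirement; the length is rounded up to a multiple of ALIGN.
    Returns True if the direct write happened, False on fallback.
    """
    size = -(-len(data) // ALIGN) * ALIGN      # round length up to a multiple of ALIGN
    buf = mmap.mmap(-1, size)                  # page-aligned, zero-filled scratch buffer
    buf[:len(data)] = data
    try:
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
            try:
                os.write(fd, buf)              # a misaligned request would fail with EINVAL
            finally:
                os.close(fd)
            return True
        except OSError:
            # e.g. tmpfs rejects O_DIRECT at open() time; fall back to buffered I/O
            with open(path, "wb") as f:
                f.write(buf[:])
            return False
    finally:
        buf.close()
```

Note that rounding the length up means the file ends with zero padding; a real application would track the logical length separately or truncate afterwards.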

In the "best" case, setting O_DIRECT will avoid making extra copies of data while transferring it, and the call will return after the transfer is complete. You are more likely to be in this case when directly opening block devices of "real" local disks. As previously stated, even this property doesn't guarantee that the data of a successful write() call will survive sudden power loss. If the data is DMA'd out of RAM to non-volatile storage (e.g. a battery-backed RAID controller) or the RAM itself is persistent storage, then you may have a guarantee that the data reached stable storage that can survive power loss. To know whether this is the case you have to qualify your hardware stack, so you can't assume this in general.

In the "worst" case, O_DIRECT can mean nothing at all even though setting it wasn't rejected and subsequent calls "succeed". Sometimes things in the Linux storage stack (like certain filesystem setups) can choose to ignore it, either because of what they have to do or because you didn't satisfy the requirements (which is legal), and silently do buffered I/O instead (i.e. write to a buffer / satisfy the read from already buffered data). It is unclear whether extra effort will be made to ensure that the data of an acknowledged write was at least "with the device" (but in the O_DIRECT and barriers thread Christoph Hellwig posts that the O_DIRECT fallback will ensure data has at least been sent to the device).

A further complication is that using O_DIRECT implies nothing about file metadata, so even if the write data is "with the device" by call completion, key file metadata (like the size of the file because you were doing an append) may not be. Thus you may not actually be able to get at the data you thought had been transferred after a crash (it may appear truncated, all zeros, etc.).

While brief testing can make it look like O_DIRECT alone ensures data is on disk after a write returns, changing the environment (e.g. using an Ext4 filesystem instead of XFS) can drastically weaken what is actually achieved.

Since you mention wanting to "guarantee that the data" (rather than metadata) is written, perhaps you're looking for O_DSYNC/fdatasync()? If you want to guarantee the metadata was written too, you will have to look at O_SYNC/fsync().
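As a concrete sketch of the fsync()/fdatasync() route in Python (a hedged example; error handling is minimal, and the directory sync only matters when the file was newly created):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write `data` and push both data and metadata to stable storage.

    os.fsync() flushes the data plus all metadata; os.fdatasync() would
    flush only the data and the metadata needed to read it back (such
    as the file size), which is often cheaper. Syncing the parent
    directory makes the new directory entry itself durable.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                     # swap in os.fdatasync(fd) for data-only
    finally:
        os.close(fd)
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)                    # persist the directory entry too
    finally:
        os.close(dfd)
```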

References

  • Ext4 Wiki: Clarifying Direct IO's Semantics. Also contains notes about what O_DIRECT does on a few non-Linux OSes.
  • The "[PATCH 1/1 linux-next] ext4: add compatibility flag check to the patch" LKML thread has a reply from Ext4 lead dev Ted Ts'o talking about how filesystems can fallback to buffered I/O for O_DIRECT rather than failing the open() call.
  • In the "ubifs: Allow O_DIRECT" LKML thread Btrfs lead developer Chris Mason states Btrfs resorts to buffered I/O when O_DIRECT is requested on compressed files.
  • ZFS on Linux commit message discussing the semantics of O_DIRECT in different scenarios. Also see the (at the time of writing mid-2020) proposed new O_DIRECT semantics for ZFS on Linux (the interactions are complex and defy a brief explanation).
  • Linux open(2) man page (search for O_DIRECT in the Description section and the Notes section)
  • Ensuring data reaches disk LWN article
  • Infamous Linus Torvalds O_DIRECT LKML thread summary (for even more context you can see the full LKML thread)

Why do writes with O_DIRECT and O_SYNC still cause I/O merging?

On Linux, doing direct I/O doesn't mean "do this exact I/O" - it is a hint to bypass Linux's page cache. At the time of writing the open man page says this about O_DIRECT:

Try to minimize cache effects of the I/O to and from this file.

This means things like the Linux I/O scheduler are still free to do their thing with regard to merges, reorderings (your use of fio's sync=1 is what stops the reordering) etc with O_DIRECT I/O.

Additionally, if you are doing I/O to a file in a filesystem, then it is legitimate for said filesystem to ignore the O_DIRECT hint and fallback to buffered I/O.

See the different parameters of nomerges in https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt for how to teach the scheduler to avoid merging/rearranging but note that you can't control the splitting of a request that is too large.
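As an illustration, the current setting can be read from sysfs with a few lines of Python (a sketch assuming the typical Linux /sys layout; writing the value back requires root):

```python
from pathlib import Path

def nomerges_settings() -> dict:
    """Read the `nomerges` value for every block device on the system.

    0 = merging enabled (the default), 1 = only simple one-hit merges,
    2 = no merging at all. Writing the file back requires root, e.g.
    `echo 2 > /sys/block/sda/queue/nomerges`.
    """
    settings = {}
    for f in Path("/sys/block").glob("*/queue/nomerges"):
        settings[f.parent.parent.name] = int(f.read_text())
    return settings
```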

Having said all the above, it doesn't look like all that much I/O merging (as given by wrqm/s) is happening in your scenario but there's still something a bit strange. The avgrq-sz is 9.36 and since that value is in 512 byte sectors, we get 4792.32 bytes as the average request size being submitted down to the disk. This value is fairly close to the 4096 byte block size fio is using. Since you can't do non-sector sized I/O to a disk and assuming the disk's block size is 512 bytes this suggests a merge of 4KBytes + 512 bytes (I assume the rest is noise) but since it's an average there could be something doing large(r) I/O at the same time fio is doing small I/O and the average is just coming out to something in-between. Because I/O is happening to a file in a filesystem, this might be explained by filesystem metadata being updated...

Why does writing files with the syscall.O_DIRECT flag make writing slower in Go?

O_DIRECT doesn't do what you think. While it does less memory copying (since it doesn't copy to the cache before copying to the device driver), that doesn't give you a performance boost.

The page cache allows the system call to return before the data reaches the device, and lets the kernel batch the data into larger, more efficient transfers.

With O_DIRECT, the system call waits until the data is completely transferred to the device.

From the man page for the open call:

O_DIRECT (since Linux 2.4.10)

Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from
user-space buffers. The O_DIRECT flag on its own makes an
effort to transfer data synchronously, but does not give
the guarantees of the O_SYNC flag that data and necessary
metadata are transferred.

See also: What does O_DIRECT really mean?

You don't need to manually release the cache after using it.
The cache is considered free available memory by the Linux kernel. If a process needs memory that is occupied by the cache, the kernel will flush/release the cache at that point. The cache doesn't "use up" memory.

When does O_SYNC have an effect?

Summarizing the comments:

  1. The main issue is that the progress bar is decorating the reader (as Yotam Salmon noted), not the writer; the delay is on the side of the writer.

  2. On most Linux systems, O_DIRECT is indeed 0o40000, but on ARM (including Raspberry Pi) it is 0o200000, with 0o40000 being O_DIRECTORY. This explains the "not a directory" error.

  3. O_SYNC is in fact the bit you want, or you can simply issue an fsync system call (use Flush if appropriate, and then Sync, as noted in When to flush a file in Go?). The O_SYNC bit implies an fsync system call as part of each write system call.

Fully synchronous I/O is a bit of a minefield: some devices lie about whether they've written data to nonvolatile storage. However, O_SYNC or fsync is the most guarantee you'll get here. O_DIRECT is likely irrelevant since you're going directly to a device partition /dev file. O_SYNC or fsync may be passed through to the device driver, which may do something with it, which may get the device to write to nonvolatile storage. There's more about this in What does O_DIRECT really mean?
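A small Python sketch of the O_SYNC route described above (hedged: it shows how the flag is used, not a proof of durability, and the device may still cache the write internally):

```python
import os

def write_o_sync(path: str, data: bytes) -> int:
    """Write with O_SYNC: each write() returns only after the data and
    the metadata needed to retrieve it are with the device, as if every
    write were followed by fsync(). os.O_DSYNC would sync data only.
    Returns the number of bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o644)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)
```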

O_DIRECT vs. O_SYNC on Linux/FreeBSD

With current hard disks, there is no assurance that a file is actually written to disk even if the disk reports the write as complete to the OS! This is due to the built-in cache in the drive.

On FreeBSD you can disable this by setting the kern.cam.ada.write_cache sysctl to 0. This will degrade write performance significantly. Last time I measured it (a WDC WD5001ABYS-01YNA0 hard disk on an ICH-7 chipset, FreeBSD 8.1 AMD64), continuous write performance (measured with dd if=/dev/zero of=/tmp/foo bs=10M count=1000) dropped from 75,000,000 bytes/sec to 12,900,000 bytes/sec.

If you want to be absolutely sure that your files are written:

  • Disable write caching with sysctl kern.cam.ada.write_cache=0 followed by camcontrol reset <bus>:<target>:<lun>.
  • Open the file with the O_SYNC option.

Note:

  • Your write performance (on a HDD) will now absolutely suck.
  • Do not mount the partition with the sync option; that will cause all I/O (including reads) to be done synchronously.
  • Do not use O_DIRECT. It will try to bypass the cache altogether. That will probably also influence reads.

