Xfs - How to Not Modify Mtime When Writing to File

Is mtime the start time or the end time of the modification process?

Conceptually, every modification happens at a specific moment in time. The mtime is the time of the most recent such event.

If you want, you can think of a large write to the file as if it is broken into a series of individual writes of one byte (or bit, if you want!) each. The one-byte writes each occur instantaneously. So after the large write which takes a lot of time, the modification time should reflect the time when the last portion of the large write was done, that is, the end of the large write.
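To make the end-of-write behavior concrete, here is a small sketch in Python (the temp file, loop counts, and timing margins are all illustrative, and it assumes a filesystem with sub-second timestamp resolution, such as ext4 or tmpfs):

```python
import os
import tempfile
import time

# A "large" write spread over ~1 second as a series of small writes.
# The resulting mtime should land near the *end* of the operation.
fd, path = tempfile.mkstemp()
start = time.time()
with os.fdopen(fd, "wb") as f:
    for _ in range(5):
        f.write(b"x" * 10)
        f.flush()          # each flush issues a write(2), updating mtime
        time.sleep(0.2)
end = time.time()

mtime = os.stat(path).st_mtime
print(mtime - start > 0.5)   # well after the start of the large write
print(end - mtime < 0.5)     # close to the end of it
os.remove(path)
```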

That covers the regular writes (write(), pwrite(), writev(), etc.). It's not as clear what should happen when a file is mapped into memory (using mmap()) and one of the memory addresses associated with the file mapping is updated. But in this case the standard has the answer. From Linux's mmap() manpage: "The st_ctime and st_mtime field for a file mapped with PROT_WRITE and MAP_SHARED will be updated after a write to the mapped region, and before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one occurs."
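A sketch of that guarantee in Python, whose mmap module maps the file shared and writable by default; the sleep just makes the mtime change observable even on a filesystem with one-second timestamps:

```python
import mmap
import os
import tempfile
import time

# Write through a shared mapping, then msync(); per the manpage quote,
# st_mtime must be updated at some point before the msync() completes.
f = tempfile.NamedTemporaryFile(delete=False)
f.write(b"\0" * 4096)
f.flush()

before = os.stat(f.name).st_mtime
time.sleep(1.1)              # tolerate coarse timestamp granularity

with mmap.mmap(f.fileno(), 4096) as m:   # shared, writable mapping
    m[0] = 0x41                          # store through the mapping
    m.flush()                            # msync(MS_SYNC)

after = os.stat(f.name).st_mtime
print(after > before)
f.close()
os.remove(f.name)
```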

Opening a file doesn't count as a modification, by the way (even if you open the file for writing). Closing the file doesn't count as a modification either. Only actually writing to it does.
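For instance (Python sketch; note that opening with O_TRUNC, e.g. mode "w", truncates the file and therefore does count as a modification, so the no-op open below uses "r+b"):

```python
import os
import tempfile
import time

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")
m1 = os.stat(path).st_mtime

time.sleep(1.1)                # tolerate coarse timestamp resolution
with open(path, "r+b"):        # open for writing, but write nothing
    pass
m2 = os.stat(path).st_mtime
print(m2 == m1)                # open + close alone changed nothing

with open(path, "r+b") as f:
    f.write(b"x")              # an actual write
m3 = os.stat(path).st_mtime
print(m3 > m1)
os.remove(path)
```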

What updates mtime after writing to memory mapped files?

When you mmap a file, you're basically sharing memory directly between your process and the kernel's page cache — the same cache that holds file data that's been read from disk, or is waiting to be written to disk. A page in the page cache that's different from what's on disk (because it's been written to) is referred to as "dirty".

There is a kernel thread that scans for dirty pages and writes them back to disk, under the control of several parameters. One important one is dirty_expire_centisecs. If any of the pages for a file have been dirty for longer than dirty_expire_centisecs then all of the dirty pages for that file will get written out. The default value is 3000 centisecs (30 seconds).

Another set of variables is dirty_writeback_centisecs, dirty_background_ratio, and dirty_ratio. dirty_writeback_centisecs controls how often the kernel thread checks for dirty pages, and defaults to 500 (5 seconds). If the percentage of dirty pages (as a fraction of the memory available for caching) is less than dirty_background_ratio then nothing happens; if it's more than dirty_background_ratio, then the kernel will start writing some pages to disk. Finally, if the percentage of dirty pages exceeds dirty_ratio, then any processes attempting to write will block until the amount of dirty data decreases. This ensures that the amount of unwritten data can't increase without bound; eventually, processes producing data faster than the disk can write it will have to slow down to match the disk's pace.
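On Linux these tunables are exposed under /proc/sys/vm; here is a quick sketch for inspecting them (presence and defaults vary by kernel and distribution, hence the existence check):

```python
from pathlib import Path

# Read the writeback tunables discussed above, if this kernel exposes them.
tunables = {}
for name in ("dirty_expire_centisecs",
             "dirty_writeback_centisecs",
             "dirty_background_ratio",
             "dirty_ratio"):
    p = Path("/proc/sys/vm") / name
    tunables[name] = p.read_text().strip() if p.exists() else None
    print(name, "=", tunables[name])
```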

The question of how the mtime gets updated is related to the question of how the kernel knows that a page is dirty in the first place. In the case of mmap, the answer is that the kernel sets the pages of the mapping to read-only. That doesn't mean that you can't write them, but it means that the first time you do, it triggers an exception in the memory-management unit, which is handled by the kernel. The exception handler does (at least) four things:

  1. Marks the page as dirty, so that it will get written back.
  2. Updates the file mtime.
  3. Marks the page as read-write, so that the write can succeed.
  4. Jumps back to the instruction in your program that writes to the mmaped page, which succeeds this time.

So when you write data to a clean page, it causes an mtime update, but it also causes the page to become read-write, so that further writes don't cause an exception (or an mtime update) [Note 1]. However, when the dirty page gets flushed to disk, it becomes clean, and also becomes "read-only" again, so that any further writes to it will trigger another eventual disk write, and also another mtime update.
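This fault-once behavior can be glimpsed from user space (Python sketch; it assumes the page is not cleaned by writeback between the two stores, which is safe well inside the 30-second expiry window):

```python
import mmap
import os
import tempfile
import time

# Only the *first* store into a clean page can trigger an mtime update;
# later stores to the now-writable page do not fault and so do not touch
# mtime again (until the page is cleaned by writeback).
f = tempfile.NamedTemporaryFile(delete=False)
f.write(b"\0" * 4096)
f.flush()

with mmap.mmap(f.fileno(), 4096) as m:
    m[0] = 1                     # first store: write fault, page dirtied
    m1 = os.stat(f.name).st_mtime
    time.sleep(1.1)
    m[1] = 2                     # second store: no fault, no mtime update
    m2 = os.stat(f.name).st_mtime
print(m2 == m1)

f.close()
os.remove(f.name)
```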

So now, with a few assumptions, we can start to piece together the puzzle.

First, dirty_background_ratio and dirty_ratio are probably not coming into play. If the pace of your writes was fast enough to trigger background flushes, then most likely you would see the "irregular" behavior on all files.

Second, the difference between the "irregular" files and the "30 second" files is the page access pattern. I surmise that the "irregular" files are being written to in some sort of append-mode or circular-buffer fashion, such that you start writing to a new page every few seconds. Every time you dirty a previously untouched page, it triggers an mtime update. But for the files displaying the 30-second pattern, you only write to one page (perhaps they are one page or less in length). In that case, the mtime is updated on first write, and then not again until the file is flushed to disk by exceeding dirty_expire_centisecs, which is 30 seconds.

Note 1: This behavior is, technically, wrong. It's unpredictable, but the standards allow for some degree of unpredictability. But they do require that the mtime be sometime at or after the last write to a file, and at or before an msync (if any). In the case where a page is written to multiple times in the interval before it's flushed to disk, this isn't what happens — the mtime gets the timestamp of the first write. This has been discussed, but a patch that would have fixed it wasn't accepted. Therefore, when using mmap, mtimes can be in error. dirty_expire_centisecs sort of limits that error, but only partially, since other disk traffic might cause the flush to have to wait, extending the window for a write to bypass mtime even further.

How to reduce the default metadata size for an XFS file system?

Based on the specific percentage of storage that you're seeing missing, it seems likely that you're being misled by the difference between binary and decimal units of storage. Since disks are measured in decimal terabytes, using software tools that measure available storage in binary terabytes (which are 10% larger) will give you results that appear to be about 9% too low. The storage hasn't actually gone anywhere, though; you're just using units that make it look smaller!

By default, the coreutils versions of the df and du commands use binary units. You can use the -H flag to make them use decimal units instead.
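The arithmetic, for the record (a decimal terabyte is 10^12 bytes, a binary tebibyte is 2^40 bytes):

```python
# Quick arithmetic behind the ~9% of "missing" space: a decimal terabyte
# (10**12 bytes) expressed in binary units (TiB, 2**40 bytes).
TB = 10**12
TiB = 2**40

ratio = TB / TiB
print(round(ratio, 4))               # a 1 TB disk shows as ~0.91 TiB
print(round((1 - ratio) * 100, 1))   # roughly 9% appears to be "missing"
```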

file.createNewFile() creates files with last-modified time before actual creation time

Filesystems do not store timestamps precisely, and often not at millisecond resolution: FAT, for example, has a 2-second resolution for creation time, and NTFS can delay updating the last-access time by up to an hour (details on MSDN). Although not relevant in your case, in general there is also the problem of synchronizing clocks if the file is created on another computer.

It seems this might be an issue for the JPoller folks, since that is where the time-handling logic is. Until it's fixed, you could work around it by manually setting the last-modified time of each file written to +4 seconds from the actual time; +4 is an arbitrary value that should be larger than the resolution of the filesystem you are working on. When the files are written to the file system, the timestamps will be rounded down, but by less than the value you have added. Not pretty, but it will work!
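The workaround, sketched in Python with os.utime (the Java original would use File.setLastModified); the 4-second skew is the arbitrary margin described above:

```python
import os
import tempfile
import time

# Push mtime ~4 seconds into the future, so even a filesystem that
# rounds timestamps down cannot place the mtime before the actual write.
SKEW = 4  # seconds; should exceed the filesystem's timestamp resolution

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"payload")

now = time.time()
st = os.stat(path)
os.utime(path, (st.st_atime, now + SKEW))   # (atime, mtime)

mtime = os.stat(path).st_mtime
print(mtime > now)   # stays after the real write time, even after rounding
os.remove(path)
```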


