Difference Between vm.dirty_ratio and vm.dirty_background_ratio

Difference between vm.dirty_ratio and vm.dirty_background_ratio?

Does the higher of dirty_background_ratio and dirty_ratio have any meaning, or is it just a matter of which value is lower and which tunable has it?

In simpler words:

vm.dirty_background_ratio is the percentage of system memory which, when dirty, causes the system to start writing data to the disk in the background.

vm.dirty_ratio is the percentage of system memory which, when dirty, causes the process doing the writes to block and write out dirty pages to the disk itself.

These tunables depend on what your system is running; if you run a large database, it is recommended to keep these values low to avoid I/O bottlenecks when the system load increases.

e.g.:

vm.dirty_background_ratio=10
vm.dirty_ratio=15

In this example, when the dirty pages exceed vm.dirty_background_ratio=10, background I/O starts, i.e. they start getting flushed/written to the disk. When the total number of dirty pages exceeds vm.dirty_ratio=15, all writes block until some of the dirty pages have been written to disk. You can think of vm.dirty_ratio=15 as the upper limit.
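
You can watch this happen: the kernel exposes its dirty-page accounting in /proc/meminfo, so while a write-heavy workload runs you can do something like:

watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'

Dirty grows toward the background threshold, and once flushing kicks in, Writeback becomes non-zero.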

What is a good way to test the use of msync on recent Linux kernels?

With apologies to @samold, "swappiness" has nothing to do with this. Swappiness just affects how the kernel trades off swapping dirty anonymous pages versus evicting page cache pages when memory is low.

You need to play with the Linux VM tunables controlling the pdflush task. For starters, I would suggest:

sysctl -w vm.dirty_writeback_centisecs=360000

By default, vm.dirty_writeback_centisecs is 3000, which means the kernel will consider any dirty page older than 30 seconds to be "too old" and try to flush it to disk. By cranking it up to 1 hour, you should be able to avoid flushing dirty pages to disk at all, at least during a short test. Except...

sysctl -w vm.dirty_background_ratio=80

By default, vm.dirty_background_ratio is 10, as in 10 percent. That means when more than 10 percent of physical memory is occupied by dirty pages, the kernel will think it needs to get busy flushing something to disk, even if it is younger than dirty_writeback_centisecs. Crank this one up to 80 or 90 and the kernel should be willing to tolerate most of RAM being occupied by dirty pages. (I would not set this too high, though, since I bet nobody ever does that and it might trigger strange behavior.) Except...

sysctl -w vm.dirty_ratio=90

By default, vm.dirty_ratio is 40, which means once 40% of RAM is dirty pages, processes attempting to create more dirty pages will block until something gets evicted. Always make this one bigger than dirty_background_ratio. Hm, come to think of it, set this one before that one, just to make sure this one is always larger.
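
Putting the three suggestions together, in that order:

sysctl -w vm.dirty_ratio=90
sysctl -w vm.dirty_background_ratio=80
sysctl -w vm.dirty_writeback_centisecs=360000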

That's it for my initial suggestions. It is possible that your kernel will start evicting pages anyway; the Linux VM is a mysterious beast and seems to get tweaked on every release. Hopefully this provides a starting point.

See Documentation/sysctl/vm.txt in the kernel sources for a complete list of VM tunables. (Preferably refer to the documentation for the kernel version you are actually using.)

Finally, use the /proc/PID/pagemap interface to see which pages are actually dirty at any time.
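
Here is a minimal sketch of reading it, assuming the entry layout documented in the kernel's pagemap documentation (one 64-bit entry per virtual page; bit 63 = present, bit 62 = swapped, bit 55 = soft-dirty on kernels that support it):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    char *buf = malloc(pagesize);
    buf[0] = 1;                 /* touch the page so it is mapped and dirtied */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open pagemap"); return 1; }

    /* Read the 64-bit entry for buf's virtual page. */
    off_t offset = (off_t)((uintptr_t)buf / pagesize) * sizeof(uint64_t);
    uint64_t entry = 0;
    if (pread(fd, &entry, sizeof(entry), offset) != (ssize_t)sizeof(entry)) {
        perror("pread");
        return 1;
    }

    printf("present:    %d\n", (int)((entry >> 63) & 1));
    printf("swapped:    %d\n", (int)((entry >> 62) & 1));
    printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));

    close(fd);
    free(buf);
    return 0;
}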

Why does Linux disable the disk write buffer when system RAM is greater than 8GB?

If you use a 32-bit kernel with more than 2G of RAM, you are running in a sub-optimal configuration where significant tradeoffs must be made. This is because in these configurations, the kernel can no longer map all of physical memory at once.

As the amount of physical memory increases beyond this point, the tradeoffs become worse and worse, because the struct page array that is used to manage all physical memory must be kept mapped at all times, and that array grows with physical memory.

The physical memory that isn't directly mapped by the kernel is called "highmem", and by default the writeback code treats highmem as undirtyable. This is what results in your zero values for the dirty thresholds.

You can change this by setting /proc/sys/vm/highmem_is_dirtyable to 1, but with that much memory you will be far better off if you install a 64-bit kernel instead.
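
For a quick experiment, that is:

sysctl -w vm.highmem_is_dirtyable=1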

Why does Java memory mapped buffer cause massive unexpected disk IO?

Memory mapping is entirely implemented by the OS. The JVM has no say in how it is flushed to disk except by means of the force() method and the "rws" option when you open the file.

Linux will flush to disk based on the kernel parameters set in sysctl.

$ sysctl -a | grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500

These are the defaults on my laptop. The background ratio of 10 means the kernel will start writing data to disk in the background when 10% of main memory is dirty. The dirty ratio of 20 means the writing program will block until the dirty percentage drops below 20%. In any case, dirty data will be written to disk after 3000 centiseconds, i.e. 30 seconds.
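
For reference, MappedByteBuffer.force() boils down to msync(2) on Linux. A minimal C sketch of the same flushing control (the file name data.bin is made up for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(p, "hello", 5);              /* dirties the mapped page */

    /* Without this call, the kernel flushes whenever the vm.dirty_* tunables
     * say so; with MS_SYNC, we block until the page is actually on disk. */
    if (msync(p, 4096, MS_SYNC) < 0) { perror("msync"); return 1; }

    munmap(p, 4096);
    close(fd);
    return 0;
}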


An interesting comparison is to memory-map a file on a tmpfs filesystem. I have /tmp mounted as tmpfs, but most systems have /dev/shm.


BTW you might find this class interesting: MemoryStore allows you to map any size of memory, i.e. >> 2 GB, and perform thread-safe operations on it, e.g. you can share the memory across processes. It supports off-heap locks, volatile read/write, ordered writes and CAS.

I have a test where two processes lock, toggle, unlock records and the latency is 50 ns on average on my laptop.

BTW2: Linux has sparse files, which means you can map regions not only larger than your main memory, but larger than your free disk space. E.g. if you map in 8 TB and only use random pieces totalling 4 GB, it will use up to 4 GB in memory and 4 GB on disk. If you use du {file} you can see the actual space used. Note: lazy allocation of disk space can lead to highly fragmented files, which can be a performance problem for HDDs.
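
A minimal C sketch of that lazy allocation (the file name sparse.bin is just for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Logical size: 1 GB. No disk blocks are allocated yet. */
    if (ftruncate(fd, 1L << 30) < 0) { perror("ftruncate"); return 1; }

    /* Write one page in the middle; only that region gets real blocks. */
    char page[4096] = {42};
    if (pwrite(fd, page, sizeof(page), 512L << 20) < 0) { perror("pwrite"); return 1; }

    close(fd);
    return 0;
}

Afterwards, ls -l reports 1 GB while du -h reports only a few kilobytes.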

Why doesn't this memory eater really eat memory?

When your malloc() implementation requests memory from the system kernel (via an sbrk() or mmap() system call), the kernel only makes a note that you have requested the memory and where it is to be placed within your address space. It does not actually map those pages yet.

When the process subsequently accesses memory within the new region, the hardware raises a page fault and alerts the kernel to the condition. The kernel then looks up the page in its own data structures, finds that you should have a zero page there, maps in a zero page (possibly first evicting a page from the page cache) and returns from the interrupt. Your process does not realize that any of this happened; the kernel's operation is perfectly transparent (except for the short delay while the kernel does its work).

This optimization allows the system call to return very quickly, and, most importantly, it avoids committing any resources to your process when the mapping is made. This allows processes to reserve rather large buffers that they never need under normal circumstances, without fear of gobbling up too much memory.


So, if you want to program a memory eater, you absolutely have to actually do something with the memory you allocate. For this, you only need to add a single line to your code:

#include <stdlib.h>

static void  *memory = NULL;
static size_t eaten_memory = 0;

int eat_kilobyte(void)
{
    if (memory == NULL)
        memory = malloc(1024);
    else
        memory = realloc(memory, (eaten_memory * 1024) + 1024);

    if (memory == NULL)
    {
        return 1;
    }
    else
    {
        // Force the kernel to map the containing memory page.
        ((char *)memory)[1024 * eaten_memory] = 42;

        eaten_memory++;
        return 0;
    }
}

Note that it is perfectly sufficient to write to a single byte within each page (which is 4096 bytes on x86). That's because all memory allocation from the kernel to a process is done at memory-page granularity, which is, in turn, because the hardware does not allow paging at smaller granularities.
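
Since the page size is not 4096 bytes on every architecture, a portable memory eater would query it at runtime rather than hard-coding it:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The kernel maps memory for a process in units of this size. */
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}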

Dedicated database server heavy iowait spikes

IO spikes are likely to happen when PostgreSQL is checkpointing.
You can verify that by logging checkpoints and seeing whether they coincide with the server's lack of response.

If that's the case, tuning checkpoint_segments and checkpoint_completion_target is likely to help.
See the wiki's advice about that and the documentation on WAL configuration.
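
For example, in postgresql.conf (illustrative values; checkpoint_segments applies to releases before 9.5, where it was replaced by max_wal_size):

log_checkpoints = on                 # confirm the spikes coincide with checkpoints
checkpoint_segments = 16             # fewer, larger checkpoint cycles
checkpoint_completion_target = 0.9   # spread checkpoint I/O over the interval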

What do the Erlang emulator info statements mean?

[async-threads:0]

Size of the async thread pool available for loaded drivers to use. This allows blocking syscalls to be performed in a kernel thread separate from the BEAM VM. Use the command-line switch +A N to adjust the size of the pool.
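
e.g.:

erl +A 10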

[hipe]

Support for native compilation of Erlang source and bytecode. It tends to be most useful for number-crunching code; I/O-bound code does fine on the bytecode interpreter.

[kernel-poll:false]

There are the old select(2) and poll(2) system calls for receiving notification that some file descriptor is ready for non-blocking reading or writing. They do not scale well to a high number of open file descriptors. Modern operating systems have alternative interfaces: Linux has epoll, FreeBSD has kqueue. Enable kernel poll with the command-line switch +K true.
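
e.g.:

erl +K true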


