How to Determine the Ideal Buffer Size When Using FileInputStream

How do you determine the ideal buffer size when using FileInputStream?

Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency.

Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (e.g. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk->RAM latency as well.

This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.

Now, this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk->RAM latency. In terms of order of magnitude, disk->RAM latency is so slow that it pretty much swamps any other latency you might be dealing with.

So, I suspect that if you ran a test with different buffer sizes (I haven't done this myself), you would probably find a big impact of buffer size up to the size of the file system block. Above that, I suspect that things would level out pretty quickly.
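
A minimal sketch of such a test, in case you want to run it yourself; the file path is a command-line argument, the sizes are arbitrary, and you should take the best of several passes so the OS page cache doesn't skew the first run:

    import java.io.FileInputStream;
    import java.io.IOException;

    public class BufferSizeSweep {
        public static void main(String[] args) throws IOException {
            int[] sizes = {512, 1024, 2048, 4096, 8192, 16384, 65536, 1 << 20};
            for (int size : sizes) {
                byte[] buf = new byte[size];
                long start = System.nanoTime();
                try (FileInputStream in = new FileInputStream(args[0])) {
                    while (in.read(buf) != -1) {
                        // discard the data; we only care about how long the reads take
                    }
                }
                System.out.printf("%8d bytes: %d ms%n", size,
                        (System.nanoTime() - start) / 1_000_000);
            }
        }
    }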

There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind bogglingly complex, and it changes with every CPU type).

This leads to the 'real world' answer: if your app is like 99% of those out there, set the buffer size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials to allow your users to test and optimize (or come up with some self-optimizing system).
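
For that 99% case, the whole thing is a few lines. A minimal sketch (the file name is just a placeholder), relying on BufferedInputStream's default 8192-byte internal buffer:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class BufferedRead {
        public static void main(String[] args) throws IOException {
            long count = 0;
            // No explicit size: BufferedInputStream's default internal buffer is 8192 bytes.
            try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
                while (in.read() != -1) {
                    count++; // even byte-at-a-time reads stay cheap behind the buffer
                }
            }
            System.out.println("Read " + count + " bytes");
        }
    }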

Buffer Size for BufferedInputStream

The default, which is deliberately undocumented, is 8192 bytes. Unless you have a compelling reason to change it, don't change it.

Determining Appropriate Buffer Size

To answer your direct question: (1) filesystems tend to use powers of 2, so you want to do the same. (2) the larger your working buffer, the less effect any mis-sizing will have.

As you say, if you allocate 4100 and the actual block size is 4096, you'll need two reads to fill the buffer. If, instead, you have a 1,000,000 byte buffer, then being one block high or low doesn't matter (because it takes 245 4096-byte blocks to fill that buffer). Moreover, the larger buffer means that the OS has a better chance to order the reads.

That said, I wouldn't use NIO for this. Instead, I'd use a simple BufferedInputStream, with maybe a 1k buffer for my read()s.

The main benefit of NIO is keeping data out of the Java heap. If you're reading and writing a file, for example, using an InputStream means that the OS reads the data into a JVM-managed buffer, the JVM copies that into an on-heap buffer, then copies it again to an off-heap buffer, then the OS reads that off-heap buffer to write the actual disk blocks (and typically adds its own buffers). In this case, NIO will eliminate those native-heap copies.

However, to compute a hash, you need to have the data in the Java heap, and the Mac SPI will move it there. So you don't get the benefit of NIO keeping the data off-heap, and IMO the "old IO" is easier to write.

Just don't forget that InputStream.read() is not guaranteed to read all the bytes you ask for.
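
Putting those pieces together, here is a sketch of that "old IO" approach (SHA-256 and the command-line file argument are just illustrative): the loop always uses the count returned by read() instead of assuming the buffer was filled.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class HashFile {
        public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buf = new byte[1024]; // small working buffer; BufferedInputStream does the real buffering
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                int n;
                while ((n = in.read(buf)) != -1) { // n may be less than buf.length
                    md.update(buf, 0, n);
                }
            }
            System.out.printf("%064x%n", new BigInteger(1, md.digest()));
        }
    }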

What would be an ideal buffer size?

Choosing a power-of-two buffer at least as large as the disk block size also typically lines up with other performance-friendly parameters affecting both reading and subsequent processing: data bus width alignment, DMA alignment, memory cache line alignment, and a whole number of virtual memory pages.

Why is FileInputStream read slower with a bigger array?

TL;DR The performance drop is caused by memory allocation, not by file reading issues.

A typical benchmarking problem: you benchmark one thing, but actually measure another.

First of all, when I rewrote the sample code using RandomAccessFile, FileChannel and ByteBuffer.allocateDirect, the threshold disappeared. File reading performance became roughly the same for 128K and 1M buffers.
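
Roughly, that rewrite looked like the following sketch (not the exact benchmark code; the 1 MB buffer size and the file argument are placeholders, and you would swap in 128K to compare):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class DirectBufferRead {
        public static void main(String[] args) throws IOException {
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // try 128 * 1024 as well
            long start = System.nanoTime();
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
                 FileChannel ch = raf.getChannel()) {
                while (ch.read(buf) != -1) {
                    buf.clear(); // reuse the buffer; the data itself is not inspected
                }
            }
            System.out.printf("%d ms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }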

Unlike direct ByteBuffer I/O, FileInputStream.read cannot load data directly into a Java byte array. It needs to get the data into a native buffer first, and then copy it to Java using the JNI SetByteArrayRegion function.

So we have to look at the native implementation of FileInputStream.read. It comes down to the following piece of code in io_util.c:

    if (len == 0) {
        return 0;
    } else if (len > BUF_SIZE) {
        buf = malloc(len);
        if (buf == NULL) {
            JNU_ThrowOutOfMemoryError(env, NULL);
            return 0;
        }
    } else {
        buf = stackBuf;
    }

Here BUF_SIZE == 8192. If the requested length is larger than this reserved stack buffer, a temporary buffer is allocated with malloc. On Windows, malloc is usually implemented via the HeapAlloc WINAPI call.

Next, I measured the performance of HeapAlloc + HeapFree calls alone without file I/O. The results were interesting:

     128K:    5 μs
     256K:   10 μs
     384K:   15 μs
     512K:   20 μs
     640K:   25 μs
     768K:   29 μs
     896K:   33 μs
    1024K:  316 μs   <-- almost 10x leap
    1152K:  356 μs
    1280K:  399 μs
    1408K:  436 μs
    1536K:  474 μs
    1664K:  511 μs
    1792K:  553 μs
    1920K:  592 μs
    2048K:  628 μs

As you can see, the performance of OS memory allocation drastically changes at 1MB boundary. This can be explained by different allocation algorithms used for small chunks and for large chunks.

UPDATE

The documentation for HeapCreate confirms the idea of a specific allocation strategy for blocks larger than 1 MB (see the dwMaximumSize description):

Also, the largest memory block that can be allocated from the heap is slightly less than 512 KB for a 32-bit process and slightly less than 1,024 KB for a 64-bit process.

...

Requests to allocate memory blocks larger than the limit for a fixed-size heap do not automatically fail; instead, the system calls the VirtualAlloc function to obtain the memory that is needed for large blocks.

Optimal Buffer size for read-process-write

See what Microsoft has to say about IO size: http://technet.microsoft.com/en-us/library/cc938632.aspx. Basically, they say you should probably do IO in 64K blocks.

On *NIX platforms, struct stat has a st_blksize member that reports the preferred block size for filesystem I/O.
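
If you are on Java 10 or later, NIO exposes a rough analogue of st_blksize through FileStore.getBlockSize(). A small sketch (the 64K figure above is used here only to illustrate rounding a buffer up to whole blocks):

    import java.io.IOException;
    import java.nio.file.FileStore;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class BlockSizeQuery {
        public static void main(String[] args) throws IOException {
            Path path = Paths.get(args[0]);
            FileStore store = Files.getFileStore(path);
            long blockSize = store.getBlockSize(); // bytes per block on this file store (Java 10+)
            // Round the 64K suggestion up to a whole number of blocks.
            long buffer = ((64 * 1024 + blockSize - 1) / blockSize) * blockSize;
            System.out.println("Block size: " + blockSize + " bytes, aligned buffer: " + buffer + " bytes");
        }
    }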


