Default Buffer Size for a File on Linux

Since you linked to the 2.7 docs, I'm assuming you're using 2.7. (In Python 3.x, this all gets a lot simpler, because a lot more of the buffering is exposed at the Python level.)

All open actually does (on POSIX systems) is call fopen and then, if you've passed anything for buffering, setvbuf. Since you're not passing anything, you end up with the default buffer from fopen, which is up to your C standard library. (See the source for details. With no buffering argument, it passes -1 to PyFile_SetBufSize, which does nothing unless bufsize >= 0.)

If you read the glibc setvbuf manpage, it explains that if you never call any of the buffering functions:

Normally all files are block buffered. When the first I/O operation occurs on a file, malloc(3) is called, and a buffer is obtained.

Note that it doesn't say what size buffer is obtained. This is intentional; it means the implementation can be smart and choose different buffer sizes for different cases. (There is a BUFSIZ constant, but that's only used when you call legacy functions like setbuf; it's not guaranteed to be used in any other case.)

So, what does happen? If you look at the glibc source, it ultimately calls the macro _IO_DOALLOCATE, which can be hooked (or overridden, because glibc unifies C++ streambuf and C stdio buffering). In the end, it allocates a buffer of _IO_BUFSIZ bytes, which is an alias for the platform-specific macro _G_BUFSIZ, which is 8192.

Of course you probably want to trace down the macros on your own system rather than trust the generic source.
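For comparison, Python 3 exposes some of these numbers directly. The sketch below is an illustration, not part of the 2.7 answer: it reports io.DEFAULT_BUFFER_SIZE and the st_blksize that os.fstat returns for a file. The helper name describe_default_buffering is made up for this example, and how the two values feed into the buffer that open actually picks depends on the Python version.

```python
import io
import os
import tempfile


def describe_default_buffering(path):
    """Report buffering-related values for a file opened in binary mode."""
    with open(path, "rb") as f:
        stat_result = os.fstat(f.fileno())
        return {
            # Class of the buffered wrapper that open() returned.
            "wrapper": type(f).__name__,
            # Fallback buffer size used by the io module.
            "io_default": io.DEFAULT_BUFFER_SIZE,
            # Preferred I/O block size reported by the OS (None if unavailable).
            "st_blksize": getattr(stat_result, "st_blksize", None),
        }


if __name__ == "__main__":
    # Assumes a POSIX system, where a named temporary file can be reopened.
    with tempfile.NamedTemporaryFile() as tmp:
        for key, value in describe_default_buffering(tmp.name).items():
            print(f"{key}: {value}")
```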


You may wonder why there is no good documented way to get this information. Presumably it's because you're not supposed to care. If you need a specific buffer size, you set one manually; if you trust that the system knows best, just trust it. Unless you're actually working on the kernel or libc, who cares? In theory, this also leaves open the possibility that the system could do something smart here, like picking a buffer size based on the block size of the file's filesystem, or even based on running stats data, although it doesn't look like Linux/glibc, FreeBSD, or OS X do anything other than use a constant. Most likely that's because it really doesn't matter for most applications. (You might want to test that yourself: use explicit buffer sizes ranging from 1KB to 2MB on some buffered-I/O-bound script and see what the performance differences are.)
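One way to run that experiment is sketched below. It is written for Python 3 (the buffering argument to open also exists in 2.7, but time.perf_counter and f-strings do not), and the function names, chunk size, and total size are arbitrary choices for illustration. Timings will vary with the machine and the filesystem, so treat the numbers as a rough comparison only.

```python
import os
import tempfile
import time


def time_buffered_write(buffer_size, total_bytes=4 * 1024 * 1024, chunk=64):
    """Write total_bytes in small chunks through a buffer of buffer_size bytes.

    Returns the elapsed wall-clock time in seconds.
    """
    payload = b"x" * chunk
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        # buffering=N asks for a buffer of N bytes on this binary file.
        with open(path, "wb", buffering=buffer_size) as f:
            for _ in range(total_bytes // chunk):
                f.write(payload)
        return time.perf_counter() - start
    finally:
        os.remove(path)


def compare_buffer_sizes(sizes=(1024, 8192, 65536, 2 * 1024 * 1024)):
    """Time the same workload once per buffer size."""
    return {size: time_buffered_write(size) for size in sizes}


if __name__ == "__main__":
    for size, seconds in compare_buffer_sizes().items():
        print(f"{size:>8} bytes: {seconds:.4f} s")
```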

Buffer size in file I/O

Note that the correct return type for main() is int, not void.

This code compiles on Linux (Ubuntu 14.04 derivative tested):

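#include <stdio.h>
#include <stdio_ext.h>   /* __fbufsize(), a nonstandard extension */
#include <stdlib.h>

int main(void)
{
    FILE *f;
    size_t bufsize;

    f = fopen("test.txt", "wb");
    if (f == NULL)
    {
        perror("fopen failed");   /* perror() appends the reason and a newline */
        return EXIT_FAILURE;
    }

    /* Before any I/O, the stream has no buffer yet. */
    bufsize = __fbufsize(f);
    printf("The buffer size is %zu\n", bufsize);

    /* The first write makes the library allocate the buffer. */
    putc('\n', f);
    bufsize = __fbufsize(f);
    printf("The buffer size is %zu\n", bufsize);

    fclose(f);
    return 0;
}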

When run, it produces:

The buffer size is 0
The buffer size is 4096

As suggested in the comments, until you use the file stream, the buffer size is not set. Until then, you could change the size with setvbuf(), so the library doesn't set the buffer size until you try to use it.

The macro BUFSIZ defined in <stdio.h> is nominally the default buffer size (it is the size setbuf() uses), but an implementation is free to pick a different size for a given stream. There's no standard way to find the buffer size set by setvbuf(). __fbufsize() is a nonstandard extension declared in <stdio_ext.h> and available in GNU libc, so whether you can use it depends on the platform you're working on.

There are numerous small improvements that should be made in the program, but they're not immediately germane.

gnu sort - default buffer size

I went digging through the coreutils sort source code and found these functions: default_sort_size and sort_buffer_size.

It turns out that --buffer-size (sort_size in the source code) isn't the target buffer size but rather the maximum buffer size. If no --buffer-size value is specified, the default_sort_size function is used to determine a safe maximum buffer size. It does this based on resource limits, available memory, and total memory. A summary of the function is as follows:

size = MIN(SIZE_MAX, resource_limit) / 2;
mem = MAX(available_memory, total_memory / 8);

if (size > total_memory * 0.75)
    size = total_memory * 0.75;

buffer_max = MIN(mem, size);
buffer_max = MAX(buffer_max, MIN_SORT_SIZE);
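The summary can be turned into a small runnable sketch to try out different memory figures. This is a Python paraphrase of the summary, not the coreutils code: the parameter names are made up, SIZE_MAX assumes a 64-bit size_t, min_sort_size stands in for MIN_SORT_SIZE (whose real value is computed inside sort), and integer arithmetic replaces the floating-point 0.75 factor.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t
GIB = 1024**3


def default_sort_size(resource_limit, available_memory, total_memory,
                      min_sort_size=0):
    """Return the maximum sort buffer size in bytes, following the summary."""
    # Half of the smaller of SIZE_MAX and the resource limit.
    size = min(SIZE_MAX, resource_limit) // 2
    # Available memory, or one eighth of total memory, whichever is greater.
    mem = max(available_memory, total_memory // 8)
    # Never plan for more than three quarters of physical memory.
    three_quarters = total_memory * 3 // 4
    if size > three_quarters:
        size = three_quarters
    # The smaller of the two figures, but no less than the minimum.
    return max(min(mem, size), min_sort_size)


if __name__ == "__main__":
    # Hypothetical machine: no resource limit, 2 GiB available, 8 GiB total.
    print(default_sort_size(SIZE_MAX, 2 * GIB, 8 * GIB))
```

With no resource limit, the result in this sketch is driven by available memory (or one eighth of total memory if that is larger); a tight resource limit takes over once half of it drops below that figure.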

The other function, sort_buffer_size, is used to determine exactly how much memory to allocate for the given input files. A summary of the function is as follows:

if (sort_size is set)
    size_bound = sort_size;
else
    size_bound = default_sort_size();

worst_case_per_input_byte = line_bytes + 1;
size = line_bytes + 2;

for each input_file
    if (input_file is regular)
        file_size = input_file_size;
    else
        if (sort_size is set)
            return sort_size;
        else
            file_size = guess;

    worst_case = file_size * worst_case_per_input_byte + 1;

    if (worst_case overflows || size + worst_case >= size_bound)
        return size_bound;
    else
        size += worst_case;

return size;

Possibly the most important point of the sort_buffer_size function is that if you're sorting data from STDIN or a pipe, it will automatically default to sort_size (i.e. --buffer-size) if it was provided. Otherwise, for regular files it will make some rough calculations based on the file sizes and only use sort_size as an upper limit.
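The behaviour described here can be checked with a small Python paraphrase of the summary. It is a sketch under assumptions, not the coreutils implementation: files is a made-up list of (is_regular, size_in_bytes) pairs, default_bound stands in for the result of default_sort_size, FILE_SIZE_GUESS is a placeholder for the guess sort uses for inputs of unknown size, and the overflow check is omitted because Python integers do not overflow.

```python
FILE_SIZE_GUESS = 128 * 1024  # placeholder for the guess used for pipes


def sort_buffer_size(files, line_bytes, sort_size=0, default_bound=2**30):
    """Return the buffer size in bytes for the given inputs.

    files: iterable of (is_regular, size_in_bytes) pairs (hypothetical shape).
    sort_size: value of --buffer-size, or 0 if it was not given.
    """
    size_bound = sort_size if sort_size else default_bound
    # Worst case: every input byte is a newline, i.e. its own line.
    worst_case_per_input_byte = line_bytes + 1
    # Room for one extra input line and one extra byte.
    size = worst_case_per_input_byte + 1

    for is_regular, file_size in files:
        if not is_regular:
            # Unknown size (stdin, pipe): use --buffer-size directly if given.
            if sort_size:
                return sort_size
            file_size = FILE_SIZE_GUESS
        worst_case = file_size * worst_case_per_input_byte + 1
        if size + worst_case >= size_bound:
            return size_bound
        size += worst_case
    return size


if __name__ == "__main__":
    # A pipe with --buffer-size given: the option value is used as-is.
    print(sort_buffer_size([(False, 0)], line_bytes=32, sort_size=5000))
    # A small regular file: the estimate stays well below the bound.
    print(sort_buffer_size([(True, 100)], line_bytes=32))
```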

How much can less utility buffer?

The manpage of less covers the topic:

   -bn or --buffers=n
          Specifies the amount of buffer space less will use for each file,
          in units of kilobytes (1024 bytes). By default 64K of buffer space
          is used for each file (unless the file is a pipe; see the -B
          option). The -b option specifies instead that n kilobytes of
          buffer space should be used for each file. If n is -1, buffer space
          is unlimited; that is, the entire file can be read into memory.

   -B or --auto-buffers
          By default, when data is read from a pipe, buffers are allocated
          automatically as needed. If a large amount of data is read from
          the pipe, this can cause a large amount of memory to be allocated.
          The -B option disables this automatic allocation of buffers for
          pipes, so that only 64K (or the amount of space specified by the -b
          option) is used for the pipe. Warning: use of -B can result in
          erroneous display, since only the most recently viewed part of the
          piped data is kept in memory; any earlier data is lost.

The manpage implies that the buffer will eventually grow as big as the whole input, if you do not restrict it by -B and -b options.

How to find the socket buffer size of linux

If you want to see your buffer sizes in a terminal, take a look at:

  • /proc/sys/net/ipv4/tcp_rmem (for read)
  • /proc/sys/net/ipv4/tcp_wmem (for write)

Each file contains three numbers: the minimum, default, and maximum memory size values (in bytes), respectively.
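The same values can be read from a program. The sketch below parses the three fields of either file and, separately, asks one freshly created TCP socket for its current buffer sizes via SO_RCVBUF and SO_SNDBUF. The function names are made up for this example, and the /proc paths exist only on Linux.

```python
import socket
from pathlib import Path


def parse_tcp_mem(text):
    """Split the contents of tcp_rmem or tcp_wmem into named byte counts."""
    minimum, default, maximum = (int(field) for field in text.split())
    return {"min": minimum, "default": default, "max": maximum}


def read_tcp_mem(name):
    """Read /proc/sys/net/ipv4/<name>, e.g. 'tcp_rmem' (Linux only)."""
    return parse_tcp_mem(Path("/proc/sys/net/ipv4", name).read_text())


def socket_buffer_sizes():
    """Return (receive, send) buffer sizes reported for a new TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return (
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        )


if __name__ == "__main__":
    for name in ("tcp_rmem", "tcp_wmem"):
        print(name, read_tcp_mem(name))
    print("new socket (rcv, snd):", socket_buffer_sizes())
```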

determining the optimal buffer size for file read in linux

You did not remove the line #define SIZE 100 from your source code, so the definition passed on the command line (-DSIZE=1000) only applies to code above that #define; from that line on, SIZE is 100 again. My compiler warns about this at compile time (<command-line>:0:0: note: this is the location of the previous definition).

If you comment out the #define, or wrap it in #ifndef SIZE ... #endif so that it only supplies a default, the value from -D will take effect.

Another aspect which comes to mind:

If you create a file on a machine and read it right away after that, it will be in the OS's disk cache (which is large enough to store all of this file), so the actual disk block size won't have much of an influence here.

Stevens's book was written in 1992 when RAM was way more expensive than today, so maybe some information in there is outdated. I also doubt that newer editions of the book have taken things like these out because in general they are still true.

In C, what's the size of stdout buffer?

The actual size is defined by the individual implementation; the standard doesn't mandate a minimum size (based on what I've been able to find, anyway). Don't have a clue on how you'd determine the size of the buffer.

Edit

Chapter and verse:

7.19.3 Files

...

3 When a stream is unbuffered, characters are intended to appear from the source or at the
destination as soon as possible. Otherwise characters may be accumulated and
transmitted to or from the host environment as a block. When a stream is fully buffered,
characters are intended to be transmitted to or from the host environment as a block when
a buffer is filled. When a stream is line buffered, characters are intended to be
transmitted to or from the host environment as a block when a new-line character is
encountered. Furthermore, characters are intended to be transmitted as a block to the host
environment when a buffer is filled, when input is requested on an unbuffered stream, or
when input is requested on a line buffered stream that requires the transmission of
characters from the host environment. Support for these characteristics is
implementation-defined, and may be affected via the setbuf and setvbuf functions.

Emphasis added.

"Implementation-defined" is not a euphemism for "I don't know", it's simply a statement that the language standard explicitly leaves it up to the implementation to define the behavior.

And having said that, there is a non-programmatic way to find out; consult the documentation for your compiler. "Implementation-defined" also means that the implementation must document the behavior:

3.4.1

1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made

2 EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit
when a signed integer is shifted right.

Meaning of the default buffer size (8KB) of 'BufferedInputStream'? (Java)

Could the buffer size of this class be critical in deciding performance of some programs? I'm curious if anyone uses the above form of the constructor to change the buffer size to fit/optimize his/her program.

Probably not. Changing from a buffer size of 1 to 2 will about double your performance (by reducing system calls). Changing from 2 to 4 will double it again. Changing from 4 to 8, again. You get the idea. At some point this ceases being true, as the performance ceases being dominated by system calls and starts being dominated by transfer sizes. 8k is a good place to stop. Use more if you like but you won't notice much difference.
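The effect on the number of underlying reads can be seen with a small counting experiment. The sketch below is a Python analogue of the idea, not a measurement of BufferedInputStream: it wraps an in-memory source in io.BufferedReader, reads it one byte at a time, and counts how often the buffered layer has to go back to the raw source. CountingRaw and count_raw_reads are names invented for this example.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts how often it is asked for data."""

    def __init__(self, data):
        super().__init__()
        self._source = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Each call stands in for one system call on a real file.
        self.calls += 1
        return self._source.readinto(buffer)


def count_raw_reads(total_bytes, buffer_size):
    """Read total_bytes one byte at a time; return the number of raw reads."""
    raw = CountingRaw(b"x" * total_bytes)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    while reader.read(1):
        pass
    return raw.calls


if __name__ == "__main__":
    for size in (1024, 2048, 4096, 8192, 65536):
        print(f"buffer {size:>6}: {count_raw_reads(1 << 20, size)} raw reads")
```

The count shrinks as the buffer grows, but so does the absolute saving from each further increase, which is the diminishing return described above.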

Is there any profound meaning to the default buffer size of 8KB?

There isn't. It is 8k in size. By default. That's the meaning. You can change it via a constructor. Nothing more to it.


