std::fstream buffering vs manual buffering (why 10x gain with manual buffering)

This is basically due to function call overhead and indirection. The ofstream::write() method is inherited from ostream. That function is not inlined in libstdc++, which is the first source of overhead. Then ostream::write() has to call rdbuf()->sputn() to do the actual writing, which is a virtual function call.

On top of that, libstdc++ redirects sputn() to another virtual function xsputn() which adds another virtual function call.

If you put the characters into the buffer yourself, you can avoid that overhead.
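To make the comparison concrete, here is a minimal sketch of the two approaches (file names and the record count are placeholders, and the actual ratio depends on your implementation): variant A pays the write()/sputn()/xsputn() chain on every record, while variant B assembles the bytes in a buffer it owns and issues a single write().

#include <fstream>
#include <string>
#include <vector>

int main()
{
    const std::string record = "some record\n";
    const int n = 1000000;

    // Variant A: one ostream::write() call per record, so the
    // write() -> sputn() -> xsputn() chain is paid on every iteration.
    std::ofstream a("a.txt", std::ios::binary);
    for (int i = 0; i < n; ++i)
        a.write(record.data(), static_cast<std::streamsize>(record.size()));

    // Variant B: assemble the data in a buffer we own, then hand the
    // whole thing to the stream in a single call.
    std::vector<char> buf;
    buf.reserve(record.size() * n);
    for (int i = 0; i < n; ++i)
        buf.insert(buf.end(), record.begin(), record.end());

    std::ofstream b("b.txt", std::ios::binary);
    b.write(buf.data(), static_cast<std::streamsize>(buf.size()));
}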

std::ofstream - no buffering for strings longer than 1023 characters (instant flush)

Per [filebuf.virtuals]/12:

basic_streambuf* setbuf(char_type* s, streamsize n) override;

Effects: If setbuf(0, 0) is called on a stream before any I/O has occurred on that stream, the stream becomes unbuffered. Otherwise the
results are implementation-defined. “Unbuffered” means that pbase()
and pptr() always return null and output to the file should appear
as soon as possible.

“Implementation-defined” includes “works fine”, “there is only a single write”, and other behaviours. In fact, here's what the libstdc++ 7.3.0 documentation says:

First, are you sure that you understand buffering? Particularly the
fact that C++ may not, in fact, have anything to do with it?

The rules for buffering can be a little odd, but they aren't any
different from those of C. (Maybe that's why they can be a bit odd.)
Many people think that writing a newline to an output stream
automatically flushes the output buffer. This is true only when the
output stream is, in fact, a terminal and not a file or some other
device -- and that may not even be true since C++ says nothing about
files nor terminals. All of that is system-dependent. (The
"newline-buffer-flushing only occurring on terminals" thing is mostly
true on Unix systems, though.)

Some people also believe that sending endl down an output stream only
writes a newline. This is incorrect; after a newline is written, the
buffer is also flushed. Perhaps this is the effect you want when
writing to a screen -- get the text out as soon as possible, etc --
but the buffering is largely wasted when doing this to a file:

output << "a line of text" << endl;
output << some_data_variable << endl;
output << "another line of text" << endl;

The proper thing to do in this case is just to write the data out and let
the libraries and the system worry about the buffering. If you need a
newline, just write a newline:

output << "a line of text\n"
<< some_data_variable << '\n'
<< "another line of text\n";

I have also joined the output statements into a single statement. You
could make the code prettier by moving the single newline to the start
of the quoted text on the last line, for example.

If you do need to flush the buffer above, you can send an endl if
you also need a newline, or just flush the buffer yourself:

output << ...... << flush;    // can use std::flush manipulator
output.flush(); // or call a member fn

On the other hand, there are times when writing to a file should be
like writing to standard error; no buffering should be done because
the data needs to appear quickly (a prime example is a log file for
security-related information). The way to do this is just to turn off
the buffering before any I/O operations at all have been done (note
that opening counts as an I/O operation):

std::ofstream os;
std::ifstream is;
int i;

os.rdbuf()->pubsetbuf(0,0);
is.rdbuf()->pubsetbuf(0,0);

os.open("/foo/bar/baz");
is.open("/qux/quux/quuux");
...
os << "this data is written immediately\n";
is >> i; // and this will probably cause a disk read

Since all aspects of buffering are handled by a streambuf-derived
member, it is necessary to get at that member with rdbuf(). Then the
public version of setbuf can be called. The arguments are the same
as those for the Standard C I/O Library function (a buffer area
followed by its size).

A great deal of this is implementation-dependent. For example,
streambuf does not specify any actions for its own setbuf()-ish
functions; the classes derived from streambuf each define behavior
that "makes sense" for that class: an argument of (0,0) turns off
buffering for filebuf but does nothing at all for its siblings
stringbuf and strstreambuf, and specifying anything other than
(0,0) has varying effects. User-defined classes derived from
streambuf can do whatever they want. (For filebuf and arguments
for (p,s) other than zeros, libstdc++ does what you'd expect: the
first s bytes of p are used as a buffer, which you must allocate
and deallocate.)

A last reminder: there are usually more buffers involved than just
those at the language/library level. Kernel buffers, disk buffers, and
the like will also have an effect. Inspecting and changing those are
system-dependent.
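To illustrate the non-zero (p, s) case mentioned above, here is a hedged sketch of handing a filebuf a large user-supplied buffer (this assumes libstdc++-like behaviour, and the file name is a placeholder):

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);   // 1 MiB buffer that we own
    std::ofstream out;

    // Done before open(), as the quoted advice suggests doing this before
    // any I/O. Per the libstdc++ note above, the first buf.size() bytes of
    // buf.data() are then used as the stream's buffer, and the buffer must
    // stay alive for as long as the stream uses it.
    out.rdbuf()->pubsetbuf(buf.data(),
                           static_cast<std::streamsize>(buf.size()));
    out.open("big_output.txt");

    for (int i = 0; i < 1000000; ++i)
        out << "line " << i << '\n';
}   // 'out' is destroyed before 'buf', so the buffer outlives the stream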

Effect of using std::stringbuf for buffering while performing a write via insertion operator '<<'

It doesn't "only perform a single write operation"; you are not considering the cost of building up that string, which is not zero.

You may find that a buffer.reserve(100000000 * strlen("Hello world!\n")) helps things a little.
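For concreteness, a sketch of that approach with a std::string as the buffer (the numbers mirror the question, which means roughly 1.3 GB of memory; the output file name is illustrative): the single write() at the end is cheap, but the loop that builds the string is where the remaining time goes, and reserve() merely avoids repeated reallocation.

#include <cstring>
#include <fstream>
#include <string>

int main()
{
    const char* line = "Hello world!\n";
    const std::size_t count = 100000000;

    std::string buffer;
    buffer.reserve(count * std::strlen(line));   // avoid repeated reallocation
    for (std::size_t i = 0; i < count; ++i)
        buffer += line;                          // building the string is not free

    std::ofstream out("out.txt", std::ios::binary);
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
}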

Is std::ifstream significantly slower than FILE?

I don't think that'd make a difference. Especially if you're reading char by char, the overhead of I/O is likely to completely dominate everything else.
Why do you read single bytes at a time? Do you know how extremely inefficient that is?

On a 326kb file, the fastest solution will most likely be to just read it into memory at once.

The difference between std::ifstream and the C equivalents is basically a virtual function call or two. It may make a difference if executed a few tens of millions of times per second; otherwise, not really. File I/O is generally so slow that the API used to access it doesn't really matter. What matters far more is the read/write pattern: lots of seeks are bad, sequential reads/writes are good.
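A hedged sketch of the "read it into memory at once" suggestion (the function name is mine, not from the original code):

#include <fstream>
#include <sstream>
#include <string>

// Read the whole file into memory in one go instead of char by char.
std::string read_whole_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();   // single bulk transfer through the streambuf
    return ss.str();
}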

File buffering in lexers: is it advisable, now that the OS (and language libraries) already implement buffers internally?

As others have pointed out, whether or not the OS buffers (it does), it is very costly for your application to rely on it, because those OS/file-system buffers are not in your application's address space. Why? Because for your app to get at that data, it typically has to travel through layers of calls down to the OS buffers. If you do this one character/byte at a time, that overhead adds up.

If you are using an IO library: some of them do (or will) read ahead for performance reasons and keep the OS calls to a minimum.

If, on the other hand, you are operating without a library, then you are strongly advised to set up buffered IO yourself, for the same reasons those libraries do.

Finally, the end result of your compilation is an executable. Unless you disallow IO entirely, you will want your language-specific run-time (assuming a self-hosted language) to provide buffered IO for the same reasons. If your run-time is built on a language or set of libraries that already provide it, you should be good.
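As a sketch of what that application-level buffering might look like in a lexer's input layer (a hypothetical class, not taken from any particular lexer): it refills a block at a time with read() and then serves characters from memory, so the common case is an index increment rather than a call into the library or the OS.

#include <fstream>
#include <vector>

class BufferedReader {
public:
    explicit BufferedReader(const char* path)
        : in_(path, std::ios::binary), buf_(64 * 1024), pos_(0), end_(0) {}

    // Returns the next character, or -1 at end of file.
    int get()
    {
        if (pos_ == end_ && !refill())
            return -1;
        return static_cast<unsigned char>(buf_[pos_++]);
    }

private:
    // Pull the next block from the file; returns false at end of file.
    bool refill()
    {
        in_.read(buf_.data(), static_cast<std::streamsize>(buf_.size()));
        end_ = static_cast<std::size_t>(in_.gcount());
        pos_ = 0;
        return end_ != 0;
    }

    std::ifstream     in_;
    std::vector<char> buf_;
    std::size_t       pos_, end_;
};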

Why should buffering not be used in the following example?

For the sake of illustration, suppose the protocol allows the server to query the client for some information, e.g. (silly example follows)

hPutStr sock "Please choose between A or B"
choice <- hGetLine sock
case decode choice of
  Just A  -> handleA
  Just B  -> handleB
  Nothing -> protocolError

Everything looks fine... but the server seems to hang. Why? This is because the message was not really sent over the network by hPutStr, but merely inserted in a local buffer. Hence, the other end never receives the query, so does not reply, causing the server to get stuck in its read.

A solution here would be to insert an hFlush sock before reading. This has to be inserted manually at the "right" points, and is prone to error. A lazier option would be to disable buffering entirely -- this is safer, although it severely impacts performance.
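The same pitfall exists with C++ iostreams over a connection. A hedged sketch of the "flush before you wait for the reply" fix (the stream type is assumed here, since the standard library provides no socket-backed stream of its own):

#include <iostream>
#include <string>

// 'link' stands for any bidirectional stream wrapping the connection
// (a hypothetical socket-backed iostream, e.g. from a networking library).
std::string query_choice(std::iostream& link)
{
    link << "Please choose between A or B\n";
    link.flush();                 // without this, the request can sit in the
                                  // streambuf and the peer never sees it
    std::string choice;
    std::getline(link, choice);   // now it is safe to block on the reply
    return choice;
}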


