Why Is Std::Cout So Time Consuming

Why is std::cout so time consuming?

std::cout ultimately results in the operating system being invoked.

If you want something to compute fast, you have to make sure that no external entities are involved in the computation, especially entities that have been written with versatility more than performance in mind, like the operating system.

Want it to run faster? You have a few options:

Replace << std::endl; with << '\n'. This will refrain from flushing the internal buffer of the C++ runtime to the operating system on every single line. It should result in a huge performance improvement.
Use std::ios::sync_with_stdio(false); as user Galik Mar suggests in a comment.
Collect as much as possible of your outgoing text in a buffer, and output the entire buffer at once with a single call.
Write your output to a file instead of the console, and then keep that file displayed by a separate application such as Notepad++ which can keep track of changes and keep scrolling to the bottom.

As for why it is so "time consuming", (in other words, slow,) that's because the primary purpose of std::cout (and ultimately the operating system's standard output stream) is versatility, not performance. Think about it: std::cout is a C++ library function which will invoke the operating system; the operating system will determine that the file being written to is not really a file, but the console, so it will send the data to the console subsystem; the console subsystem will receive the data and it will start invoking the graphics subsystem to render the text in the console window; the graphics subsystem will be drawing font glyphs on a raster display, and while rendering the data, there will be scrolling of the console window, which involves copying large amounts of video RAM. That's an awful lot of work, even if the graphics card takes care of some of it in hardware.

As for the C# version, I am not sure exactly what is going on, but what is probably happening is something quite different: In C# you are not invoking Console.Out.Flush(), so your output is cached and you are not suffering the overhead incurred by C++'s std::cout << std::endl which causes each line to be flushed to the operating system. However, when the buffer does become full, C# must flush it to the operating system, and then it is hit not only by the overhead represented by the operating system, but also by the formidable managed-to-native and native-to-managed transition that is inherent in the way it's virtual machine works.

C++: Does cout statement makes code slower

As already mentioned, writing to the terminal is almost definitely going to be slower. Why?

depending on your OS, std::cout may use line buffering - which means each line may be sent to the terminal program separately. When you use std::endl rather than '\n' it definitely flushes the buffer. Writing the data in smaller chunks means extra system calls and rendering efforts that slow things down significantly.
some operating systems / compilers are even slower - for example, Visual C++: https://connect.microsoft.com/VisualStudio/feedback/details/642876/std-wcout-is-ten-times-slower-than-wprintf-performance-bug-in-c-library
terminals displaying output need to make calls to wipe out existing screen content, render the fonts, update the scroll bar, copy the lines into the history/buffer. Especially when they get new content in small pieces, they can't reliably guess how much longer they'd have to wait for some more and are likely to try to update the screen for the little bit they've received: that's costly, and a reason excessive flushing or unbuffered output is slow.
- Some terminals offer the option of "jump scrolling" which means if they find they're say 10 pages behind they immediately render the last page and the earlier 9 pages of content never appear on the screen: that can be nice and fast. Still, "jump scrolling" is not always used or wanted, as it means output is never presented to the end users eyes: perhaps the program is meant to print a huge red error message in some case - with jump scrolling there wouldn't even be a flicker of it to catch the user's attention, but without jump scrolling you'd probably notice it.
- when I worked for Bloomberg we had a constant stream of log file updates occupying several monitors - at times the displayed output would get several minutes behind; a switch from the default Solaris xterm to rxvt ensured it always kept pace
redirecting output to /dev/null is a good way to see how much your particular terminal is slowing things down

printf more than 5 times faster than std::cout?

For a true apples-to-apples comparison, re-write your test so that the only thing changing between the test cases is the print function being used:

int main(int argc, char* argv[])
{
    const char* teststring = "Test output string\n";
    std::clock_t start;
    double duration;

    std::cout << "Starting std::cout test." << std::endl;
    start = std::clock();

    for (int i = 0; i < 1000; i++)
        std::cout << teststring;
    /* Display timing results, code trimmed for brevity */

    for (int i = 0; i < 1000; i++) {
        std::printf(teststring);
        std::fflush(stdout);
    }
    /* Display timing results, code trimmed for brevity */
    return 0;
}

With that, you will be testing nothing but the differences between the printf and cout function calls. You won't incur any differences due to multiple << calls, etc. If you try this, I suspect that you'll get a much different result.

C++ cout printing slowly

NOTE: This experimental result is valid for MSVC. In some other implementation of library, the result will vary.

printf could be (much) faster than cout. Although printf parses the format string in runtime, it requires much less function calls and actually needs small number of instruction to do a same job, comparing to cout. Here is a summary of my experimentation:

The number of static instruction

In general, cout generates a lot of code than printf. Say that we have the following cout code to print out with some formats.

os << setw(width) << dec << "0x" << hex << addr << ": " << rtnname <<
  ": " << srccode << "(" << dec << lineno << ")" << endl;

On a VC++ compiler with optimizations, it generates around 188 bytes code. But, when you replace it printf-based code, only 42 bytes are required.

The number of dynamically executed instruction

The number of static instruction just tells the difference of static binary code. What is more important is the actual number of instruction that are dynamically executed in runtime. I also did a simple experimentation:

Test code:

int a = 1999;
char b = 'a';
unsigned int c = 4200000000;
long long int d = 987654321098765;
long long unsigned int e = 1234567890123456789;
float f = 3123.4578f;
double g = 3.141592654;

void Test1()
{
    cout 
        << "a:" << a << “\n”
        << "a:" << setfill('0') << setw(8) << a << “\n”
        << "b:" << b << “\n”
        << "c:" << c << “\n”
        << "d:" << d << “\n”
        << "e:" << e << “\n”
        << "f:" << setprecision(6) << f << “\n”
        << "g:" << setprecision(10) << g << endl;
}

void Test2()
{
    fprintf(stdout,
        "a:%d\n"
        "a:%08d\n"
        "b:%c\n"
        "c:%u\n"
        "d:%I64d\n"
        "e:%I64u\n"
        "f:%.2f\n"
        "g:%.9lf\n",
        a, a, b, c, d, e, f, g);
    fflush(stdout);
}

int main()
{
    DWORD A, B;
    DWORD start = GetTickCount();
    for (int i = 0; i < 10000; ++i)
        Test1();
    A = GetTickCount() - start;

    start = GetTickCount();
    for (int i = 0; i < 10000; ++i)
        Test2();
    B = GetTickCount() - start;
    
    cerr << A << endl;
    cerr << B << endl;
    return 0;
}

Here is the result of Test1 (cout):

# of executed instruction: 423,234,439
# of memory loads/stores: approx. 320,000 and 980,000
Elapsed time: 52 seconds

Then, what about printf? This is the result of Test2:

# of executed instruction: 164,800,800
# of memory loads/stores: approx. 70,000 and 180,000
Elapsed time: 13 seconds

In this machine and compiler, printf was much faster cout. In both number of executed instructions, and # of load/store (indicates # of cache misses) have 3~4 times differences.

I know this is an extreme case. Also, I should note that cout is much easier when you're handling 32/64-bit data and require 32/64-platform independence. There is always trade-off. I'm using cout when checking type is very tricky.

Okay, cout in MSVS just sucks :)

C++ Performance with/without cout

Most of this is a simple matter of optimization: as long as you don't produce any output from the loops, the compiler determines that the loops are basically dead code, and simply doesn't execute them at all. It has to do a little bit of pre-computation to determine the value that vector[5].y would have following the loops, but it can do that entirely at compile time, so at run-time, it's basically just printing out a fixed number.

When you produce visible output inside the loops, the compiler can't just eliminate executing them, so the code runs dramatically slower.

Why Is Std::Cout So Time Consuming