Why Is Printing to Stdout So Slow? Can It Be Sped Up?

Why is printing to stdout so slow? Can it be sped up?

Thanks for all the comments! I've ended up answering it myself with your help. It feels dirty answering your own question, though.

Question 1: Why is printing to stdout slow?

Answer: Printing to stdout is not inherently slow. It is the terminal you work with that is slow, and this has practically nothing to do with I/O buffering on the application side (e.g., Python file buffering). See below.

Question 2: Can it be sped up?

Answer: Yes it can, but seemingly not from the program side (the side doing the 'printing' to stdout). To speed it up, switch to a different, faster terminal emulator.

Explanation...

I tried a self-described 'lightweight' terminal program called wterm and got significantly better results. Below is the output of my test script (at the bottom of the question) when run in wterm at 1920x1200 on the same system where the basic print option took 12s under gnome-terminal:


-----
timing summary (100k lines each)
-----
print                          : 0.261 s
write to file (+fsync)         : 0.110 s
print with stdout = /dev/null  : 0.050 s

0.26s is MUCH better than 12s! I don't know whether wterm is smarter about how it renders to the screen along the lines I was suggesting (render only the 'visible' tail at a reasonable frame rate), or whether it simply "does less" than gnome-terminal. For the purposes of my question, though, I've got the answer: gnome-terminal is slow.
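Since the test script itself isn't reproduced above, here is a minimal Python sketch of the three timed cases it reports (plain print, a file write followed by fsync, and print with stdout redirected to /dev/null). This is my reconstruction, not the original script; the line text and the test.out file name are placeholders:

import os
import sys
import time

LINES = 100_000
LINE = "some reasonably long line of test output"

def time_print():
    # Case 1: plain print to whatever stdout is attached to (the terminal).
    start = time.time()
    for _ in range(LINES):
        print(LINE)
    return time.time() - start

def time_file():
    # Case 2: write the same lines to a file, then fsync to force them to disk.
    start = time.time()
    with open("test.out", "w") as f:
        for _ in range(LINES):
            f.write(LINE + "\n")
        f.flush()
        os.fsync(f.fileno())
    return time.time() - start

def time_devnull():
    # Case 3: the same print loop with stdout pointed at /dev/null.
    start = time.time()
    old_stdout = sys.stdout
    with open(os.devnull, "w") as devnull:
        sys.stdout = devnull
        try:
            for _ in range(LINES):
                print(LINE)
        finally:
            sys.stdout = old_stdout
    return time.time() - start

if __name__ == "__main__":
    results = [
        ("print", time_print()),
        ("write to file (+fsync)", time_file()),
        ("print with stdout = /dev/null", time_devnull()),
    ]
    # Report on stderr so the summary is visible even if stdout is redirected.
    print("-----", file=sys.stderr)
    print("timing summary (100k lines each)", file=sys.stderr)
    print("-----", file=sys.stderr)
    for name, secs in results:
        print(f"{name:30}: {secs:.3f} s", file=sys.stderr)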

So: if you have a long-running script that feels slow and spews massive amounts of text to stdout... try a different terminal and see if it is any better!

Note that I pretty much randomly pulled wterm from the Ubuntu/Debian repositories. This link might be the same terminal, but I'm not sure. I did not test any other terminal emulators.


Update: Because I had to scratch the itch, I tested a whole pile of other terminal emulators with the same script, full screen at 1920x1200. My manually collected stats follow:


wterm             0.3s
aterm             0.3s
rxvt              0.3s
mrxvt             0.4s
konsole           0.6s
yakuake           0.7s
lxterminal          7s
xterm               9s
gnome-terminal     12s
xfce4-terminal     12s
vala-terminal      18s
xvt                48s

The recorded times are manually collected, but they were pretty consistent. I recorded the best(ish) value. YMMV, obviously.

As a bonus, it was an interesting tour of the various terminal emulators out there! I'm amazed that my first 'alternate' pick turned out to be the best of the bunch.

C++: does printing to the terminal significantly slow down code?

Yes, rendering to the screen takes longer than writing to a file.

On Windows it is even slower, because the program doing the rendering is not the program that is running, so messages are constantly being sent between processes to get the output drawn.

I'd guess it's the same on Linux, since the terminal emulator runs in a different process from the program that is producing the output.

Printing to the console vs writing to a file (speed)

Writing to a file would be much faster. This is especially true since you are flushing the buffer after every line with endl.

On a side note, you could speed up the printing significantly by repeating cout << "text!\n"; 5000 times and then flushing the buffer once with flush().
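That advice is about C++ iostreams, but the cost of flushing on every line is easy to demonstrate in any language. Here is a hypothetical Python sketch contrasting a flush after every line with a single flush at the end; the function names are mine:

import sys
import time

LINES = 5000
LINE = "text!\n"

def flush_every_line():
    # Analogue of cout << "text!" << endl: force a flush after each line.
    start = time.time()
    for _ in range(LINES):
        sys.stdout.write(LINE)
        sys.stdout.flush()
    return time.time() - start

def flush_once():
    # Analogue of repeating cout << "text!\n" and calling flush() once:
    # let the stream buffer the lines and pay the flush cost a single time.
    start = time.time()
    for _ in range(LINES):
        sys.stdout.write(LINE)
    sys.stdout.flush()
    return time.time() - start

if __name__ == "__main__":
    every = flush_every_line()
    once = flush_once()
    # Report on stderr so the timings aren't mixed into the timed output.
    print(f"flush every line: {every:.3f}s  flush once: {once:.3f}s",
          file=sys.stderr)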

Slow print waiting too long before printing

You never tell print to flush stdout. Without flush=True, the characters just sit in stdout's buffer, and because end='' suppresses the newline, line buffering never kicks in either, so the whole string appears in a single burst once the buffer is finally flushed.

import time

def print_slow(text):
    # Print one character at a time; flush=True pushes each character
    # out of the buffer immediately instead of waiting for a newline.
    for letter in text:
        print(letter, end='', flush=True)
        time.sleep(.4)

print_slow("junk")

How to speed up printf in C

Below is a slightly unoptimized implementation (although I skipped the intermediate list and print directly) of what I think you were supposed to do. Running that program on an AMD A8-6600K under light load (mainly a YouTube music video for some personal entertainment) results in

real    0m1.211s
user    0m0.047s
sys     0m0.122s

averaged over a couple of runs. So the problem lies either in your implementation of the sieve, or you are hiding some essential facts about your hardware.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <limits.h>
#include <string.h>

/* I call it a general bitset. Others might call it an abomination. YMMV. */
#define ERAT_BITS (sizeof(uint32_t) * CHAR_BIT)
/* shift an unsigned 32-bit 1 so that a shift by 31 stays well defined */
#define GET_BIT(s,n)   ((*((s) + ((n) / ERAT_BITS)) &   ((uint32_t)1 << ((n) % ERAT_BITS))) != 0)
#define SET_BIT(s,n)    (*((s) + ((n) / ERAT_BITS)) |=  ((uint32_t)1 << ((n) % ERAT_BITS)))
#define CLEAR_BIT(s,n)  (*((s) + ((n) / ERAT_BITS)) &= ~((uint32_t)1 << ((n) % ERAT_BITS)))
#define TOG_BIT(s,n)    (*((s) + ((n) / ERAT_BITS)) ^=  ((uint32_t)1 << ((n) % ERAT_BITS)))
/* size is the size in bits, the overall size might be bigger */
typedef struct mp_bitset_t {
    uint32_t size;
    uint32_t *content;
} mp_bitset_t;

#define mp_bitset_alloc(bst, n) \
do { \
    (bst)->content = malloc((n) / sizeof(uint32_t) + 1); \
    if ((bst)->content == NULL) { \
        fprintf(stderr, "memory allocation for bitset failed"); \
        exit(EXIT_FAILURE); \
    } \
    (bst)->size = (n); \
} while (0)

#define mp_bitset_size(bst) ((bst)->size)
#define mp_bitset_setall(bst) memset((bst)->content, ~(uint32_t)0, \
        (bst)->size / sizeof(uint32_t) + 1)
#define mp_bitset_clearall(bst) memset((bst)->content, 0, \
        (bst)->size / sizeof(uint32_t) + 1)
#define mp_bitset_clear(bst,n) CLEAR_BIT((bst)->content, (n))
#define mp_bitset_set(bst,n)   SET_BIT((bst)->content, (n))
#define mp_bitset_get(bst,n)   GET_BIT((bst)->content, (n))
#define mp_bitset_free(bst) \
do { \
    free((bst)->content); \
    free(bst); \
} while (0)

uint32_t mp_bitset_nextset(mp_bitset_t * bst, uint32_t n);
uint32_t mp_bitset_prevset(mp_bitset_t * bst, uint32_t n);
void mp_eratosthenes(mp_bitset_t * bst);


/* It's called Hallek's method but it has many inventors */
static uint32_t isqrt(uint32_t n)
{
    uint32_t s, rem, root;

    if (n < 1)
        return 0;
    /* This is actually the highest square but it goes
     * downward from this, quite fast */
    s = 1 << 30;
    rem = n;
    root = 0;
    while (s > 0) {
        if (rem >= (s | root)) {
            rem -= (s | root);
            root >>= 1;
            root |= s;
        } else {
            root >>= 1;
        }
        s >>= 2;
    }
    return root;
}

uint32_t mp_bitset_nextset(mp_bitset_t *bst, uint32_t n)
{
    /* scan forward to the next set bit, or return the size if none is left */
    while ((n < mp_bitset_size(bst)) && (!mp_bitset_get(bst, n))) {
        n++;
    }
    return n;
}

/*
 * Standard method, quite antique now, but good enough for the handful
 * of primes needed here.
 */
void mp_eratosthenes(mp_bitset_t *bst)
{
    uint32_t n, k, r, j;

    mp_bitset_setall(bst);
    mp_bitset_clear(bst, 0);
    mp_bitset_clear(bst, 1);

    n = mp_bitset_size(bst);
    r = isqrt(n);
    for (k = 4; k < n; k += 2)
        mp_bitset_clear(bst, k);
    k = 0;
    while ((k = mp_bitset_nextset(bst, k + 1)) < n) {
        if (k > r) {
            break;
        }
        for (j = k * k; j < n; j += k * 2) {
            mp_bitset_clear(bst, j);
        }
    }
}

#define UPPER_LIMIT 1000000 /* one million */

int main(void) {
    mp_bitset_t *bst;
    uint32_t n, k, j;

    bst = malloc(sizeof(mp_bitset_t));
    if (bst == NULL) {
        fprintf(stderr, "failed to allocate %zu bytes\n", sizeof(mp_bitset_t));
        exit(EXIT_FAILURE);
    }
    mp_bitset_alloc(bst, UPPER_LIMIT);

    mp_bitset_setall(bst);
    mp_bitset_clear(bst, 0);    /* 0 is not prime by definition */
    mp_bitset_clear(bst, 1);    /* 1 is not prime by definition */

    n = mp_bitset_size(bst);
    /* clear every even number above 2 in one pass */
    for (k = 4; k < n; k += 2) {
        mp_bitset_clear(bst, k);
    }
    k = 0;

    while ((k = mp_bitset_nextset(bst, k + 1)) < n) {
        printf("%" PRIu32 "\n", k);
        /* strike out multiples of k from k*k on; the evens are already
         * gone, so stepping by 2*k suffices */
        for (j = k * k; j < n; j += k * 2) {
            mp_bitset_clear(bst, j);
        }
    }
    mp_bitset_free(bst);

    return EXIT_SUCCESS;
}

Compiled with

gcc-4.9 -O3 -g3 -W -Wall -Wextra -Wuninitialized -Wstrict-aliasing -pedantic  -std=c11 tests.c -o tests

(GCC is gcc-4.9.real (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4)

R writing to stdout very slow. Any ways to improve?

Binary reads are fast. Printing to stdout is slow for two reasons:

  • formatting
  • actual printing

You can benchmark or profile either one. But if you really want to be "fast", stay away from formatting when printing lots of data (see the sketch after the list below).

Compiled code can help make the conversion faster. But again, the fastest solution will be to:

  • remain with binary
  • not write to stdout or a file at all (but use, e.g., something like Redis).
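To make the formatting cost concrete, here is a hypothetical Python sketch (the question concerns R, but the text-versus-binary contrast is the same). The file names are placeholders; the first loop pays a number-to-string conversion per value, while the binary dump does not:

import time
from array import array

N = 1_000_000
values = [i * 0.5 for i in range(N)]

# Formatted text: every value goes through number-to-string conversion.
start = time.time()
with open("values.txt", "w") as f:
    for v in values:
        f.write(f"{v:.6f}\n")
print(f"formatted text: {time.time() - start:.3f}s")

# Raw binary: the same values dumped as packed doubles, no formatting at all.
start = time.time()
with open("values.bin", "wb") as f:
    array("d", values).tofile(f)
print(f"raw binary    : {time.time() - start:.3f}s")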

