Understanding Ruby and OS I/O buffering

The Ruby IO documentation is not 100% clear on how this buffering works, but this is what you can extract from the documentation:

Ruby IO has its own internal buffer.
In addition to that, the underlying operating system may or may not buffer the data further.

The relevant methods to look at:

IO.flush: Flushes Ruby's internal buffer to the OS. I also looked at the Ruby source, and a call to IO.flush also calls the underlying OS fflush(). This should be enough to get the file cached, but it does not guarantee that the data is physically written to disk.
IO.sync=: If set to true, no Ruby internal buffering is done. Everything is immediately sent to the OS, and fflush() is called for each write.
IO.sync: Returns the current sync setting (true or false).
IO.fsync: Flushes the Ruby buffers and then calls fsync() on the OS (if the OS supports it). This guarantees a full flush all the way to the physical file on disk.
IO.close: Closes the Ruby IO and writes pending data to the OS. Note that this does not imply fsync(). The POSIX documentation on close() says that it does NOT guarantee data is physically written to the file. So you need to use an explicit fsync() call for that.

Conclusion: flush and/or close should be enough to get the file cached so that it can be read fully by another process or operation. To get the file all the way to the physical media with certainty, you need to call IO.fsync.
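As a rough illustration of the difference between Ruby's buffer and the OS cache, the sketch below writes a few bytes to a temporary file and checks the size the OS reports before and after flush. The helper name sizes_around_flush is made up for this example, and it assumes a regular file with sync left at its default of false. On a platform without fsync(), the fsync call would raise NotImplementedError.

```ruby
require "tempfile"

# Hypothetical helper: returns the file size reported by the OS
# before and after IO#flush.
def sizes_around_flush
  Tempfile.create("buffering") do |file|
    file.write("hello")
    before = File.size(file.path)  # data is still in Ruby's buffer
    file.flush                     # hand the data to the OS
    after = File.size(file.path)   # now visible to other readers
    file.fsync                     # ask the OS to push it to the device
    [before, after]
  end
end

before, after = sizes_around_flush
puts "before flush: #{before} bytes, after flush: #{after} bytes"
```

Note that nothing here can observe the effect of fsync itself; the call is shown only to mark where it belongs in the sequence.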

Other related methods:

IO.syswrite: Bypass Ruby internal buffers and do a straight OS write. If you use this then do not mix it with IO.read/write.
IO.sysread: Same as above, but for reading.
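A minimal sketch of the two calls used together, assuming a pipe created with IO.pipe stands in for whatever stream you are working with. The helper name unbuffered_roundtrip is hypothetical.

```ruby
# Hypothetical helper: sends a message through a pipe using only
# unbuffered calls and returns what arrives on the other end.
def unbuffered_roundtrip(message)
  reader, writer = IO.pipe
  writer.syswrite(message)          # goes straight to the OS, no Ruby buffer
  reader.sysread(message.bytesize)  # reads straight from the OS
ensure
  reader&.close
  writer&.close
end

puts unbuffered_roundtrip("ping\n")
```

Both ends stick to the sys* calls throughout, which is the point of the warning above about not mixing them with the buffered methods.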

Consequences of unbuffered file I/O

Unbuffered I/O is slower than buffered I/O when each write operation transfers only a few bytes; for large writes the situation is reversed. In a middle range of around 1,000 to 10,000 bytes per operation, it doesn't make much difference.

You will also see slightly better performance when operations are aligned
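If you want to check these claims on your own system, a small timing harness along the following lines may help. It is only a sketch: the helper name time_writes, the chunk sizes, and the 1 MiB total are arbitrary choices made for this example, and the numbers will vary with OS, filesystem, and Ruby version.

```ruby
require "benchmark"
require "tempfile"

# Hypothetical helper: writes `total` bytes in pieces of `chunk_size`
# and returns the elapsed time in seconds.
def time_writes(chunk_size:, total:, sync:)
  chunk = "x" * chunk_size
  Tempfile.create("bench") do |file|
    file.sync = sync  # true = no Ruby-side buffering
    Benchmark.realtime do
      (total / chunk_size).times { file.write(chunk) }
      file.flush
    end
  end
end

[16, 4096, 65_536].each do |size|
  buffered   = time_writes(chunk_size: size, total: 1 << 20, sync: false)
  unbuffered = time_writes(chunk_size: size, total: 1 << 20, sync: true)
  puts format("%6d bytes/write: buffered %.4fs, unbuffered %.4fs", size, buffered, unbuffered)
end
```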

Tell Ruby to display a line before going to the next, etc.

It's a buffering issue. Try adding $stdout.sync = true before you use puts. Setting sync to true disables buffering.

x = 1
y = 2

$stdout.sync = true

puts x + y
sleep(5)
puts x + y * 2
sleep(10)
puts((x + y) * 2)

Or, you can flush stdout manually each time:

x = 1
y = 2

puts x + y
$stdout.flush
sleep(5)

puts x + y * 2 
$stdout.flush
sleep(10)

puts((x + y) * 2)
$stdout.flush

For more info on Ruby buffering, check out this well-done answer.

Why does writing concurrently to a file using different processes produce weird result?

Having multiple processes write to the same file is usually not a good idea. In most cases, unless you absolutely know what you are doing, the result will be unpredictable, as you just demonstrated with your example.

The reason you get this strange result is that the Ruby IO object has its own internal buffer. This buffer is kept in memory and is NOT guaranteed to be written to disk when you call <<.

What happens here is that the string hello from parent only gets written to the internal buffer, and not to the disk. Then when you call fork, you will be copying this buffer into the child. Then the child will append hello from child to the buffer, and only THEN will the buffer be flushed to disk.

The result is that all children will write hello from parent, in addition to writing hello from child, because this is what the internal memory buffer will contain by the time Ruby decides to write the buffer to disk.

To get around this problem, call IO.flush before forking so that the in-memory buffer is written out and left empty. The child then starts with an empty buffer, and you get the expected output:

CSV.open(...) do |target_file|
  target_file << ...
  target_file.flush  # <-- Make sure the internal buffer is flushed to disk before forking

  a.each do |num|
    ... Process.fork ...
  end
end
...
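The effect can be reproduced without CSV. The following is a minimal sketch, not the code from the question: the helper name parent_line_count, the temporary file, and the three children are assumptions made for illustration. It requires a platform where fork is available.

```ruby
require "tempfile"

# Hypothetical helper: the parent writes one line, forks `children`
# writers, and returns how often the parent's line ends up in the file.
def parent_line_count(flush_before_fork:, children: 3)
  Tempfile.create("fork-demo") do |file|
    file.puts "hello from parent"
    file.flush if flush_before_fork

    pids = children.times.map do |i|
      fork do
        file.puts "hello from child #{i}"
        file.flush  # writes out the child's copy of the buffer
        exit!(0)    # leave at once, skipping at_exit handlers
      end
    end
    pids.each { |pid| Process.wait(pid) }

    file.flush  # the parent's own buffer, if anything is left in it
    File.read(file.path).scan("hello from parent").size
  end
end

puts "without flush: #{parent_line_count(flush_before_fork: false)} copies"
puts "with flush:    #{parent_line_count(flush_before_fork: true)} copies"
```

Without the flush, every child inherits the unwritten parent line and writes it again, so the line appears once per child plus once for the parent.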

Why is IO::WaitReadable being raised differently for STDOUT than STDERR?

On most operating systems, STDOUT is buffered while STDERR is not. popen3 basically opens a pipe between the executable you launch and Ruby.

Any output that is in buffered mode is not sent through this pipe until either:

The buffer is filled (thereby forcing a flush).
The sending application exits (EOF is reached, forcing a flush).
The stream is explicitly flushed.

The reason STDERR is not buffered is that it's usually considered more important for error messages to appear instantly than to gain efficiency through buffering.

So, knowing this, you can emulate STDERR behaviour with STDOUT like this:

#!/usr/bin/env ruby

3.times do
  STDOUT.puts 'message on stdout'
  STDOUT.flush 
  STDERR.puts 'message on stderr'
  sleep 1
end

and you will see the difference.
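An alternative to calling flush after every write is to set $stdout.sync = true once in the child. The sketch below shows this from the parent's side with Open3.popen3. The child script and the helper name first_line_from_child are made up for this example. The child waits on its stdin after printing, so the parent can only receive the line if it was not held back in the child's buffer.

```ruby
require "open3"
require "rbconfig"

# Illustrative child script. Without the sync line, "ready" would stay
# in the child's buffer and the parent's gets below would block.
CHILD_SCRIPT = <<~'RUBY'
  $stdout.sync = true
  puts "ready"
  $stdin.gets  # wait until the parent answers
RUBY

# Hypothetical helper: returns the first line the child writes to stdout.
def first_line_from_child
  Open3.popen3(RbConfig.ruby, "-e", CHILD_SCRIPT) do |stdin, stdout, _stderr, wait_thr|
    line = stdout.gets  # arrives while the child is still running
    stdin.puts "ok"     # let the child finish
    stdin.close
    wait_thr.join
    line
  end
end

puts first_line_from_child
```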

You might also want to check "Understanding Ruby and OS I/O buffering".

Line-oriented streaming in Ruby (like grep)

It looks like your best bet is to use STDOUT.syswrite and STDIN.sysread. The following seemed to have reasonably good performance, despite being ugly code:

STDIN.sync = true
STDOUT.syswrite "Looking for #{ARGV[0]}\n"

def next_line
  mybuff = @overflow || ""
  until mybuff[/\n/]
    mybuff += STDIN.sysread(8)
  end
  # Split at the first newline only, so later newlines stay in the overflow
  out, rest = mybuff.split("\n", 2)
  @overflow = rest
  out
rescue EOFError => e
  false  # NB: There's a bug here, see below
end

line = next_line
while line
  STDOUT.syswrite "#{line}\n" if line =~ /#{ARGV[0]}/i
  line = next_line
end

Note: I'm not sure you need #sync with #sysread, but if you do, you should probably sync STDOUT too. Also, it reads only 8 bytes at a time into mybuff, which is highly inefficient and CPU heavy, so you should experiment with this value. Lastly, this code is hacky and needs a refactor, but it works: I tested it using ls -l ~/* | ruby rgrep.rb doc (where 'doc' is the search term).

Second note: Apparently, I was so busy trying to get it to perform well that I failed to get it to perform correctly! As Dmitry Shevkoplyas has noted, if there is text in @overflow when EOFError is raised, that text will be lost. I believe replacing the rescue clause with the following should fix the problem:

rescue EOFError => e
  # mybuff holds @overflow plus anything read since, so return all of it
  return false if mybuff.empty?
  @overflow = ""
  mybuff
end

(if you found that helpful, please upvote Dmitry's answer!)
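For comparison, here is a sketch of a different technique: Ruby's ordinary buffered IO#each_line for reading, with an explicit flush after each match. This is not the sysread approach from the answer above. The helper name grep_stream and the StringIO objects are assumptions made for this example; in a real script you would pass $stdin and $stdout.

```ruby
require "stringio"

# Hypothetical helper: copies every line of `input` that matches
# `pattern` to `output`, flushing after each match.
def grep_stream(input, output, pattern)
  input.each_line do |line|
    next unless line.match?(pattern)
    output.write(line)
    output.flush  # push the line out right away
  end
end

input  = StringIO.new("README.doc\nnotes.txt\nplan.DOC")
output = StringIO.new
grep_stream(input, output, /doc/i)
puts output.string
```

each_line also yields a final line that has no trailing newline, so this version does not need special handling at end of input.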