Understanding Ruby and OS I/O buffering
The Ruby IO documentation is not 100% clear on how this buffering works, but this is what you can extract from the documentation:
- Ruby IO has its own internal buffer.
- In addition to that, the underlying operating system may or may not further buffer data.
The relevant methods to look at:
- IO.flush: Flushes IO. I also looked at the Ruby source and a call to IO.flush also calls the underlying OS fflush(). This should be enough to get the file cached, but does not guarantee physical data to disk.
- IO.sync=: If set to true, no Ruby internal buffering is done. Everything is immediately sent to the OS, and fflush() is called for each write.
- IO.sync: Returns the current sync setting (true or false).
- IO.fsync: Flushes the Ruby buffers and calls fsync() on the OS (if it supports it). This will guarantee a full flush all the way to the physical disk file.
- IO.close: Closes the Ruby IO and writes pending data to the OS. Note that this does not imply fsync(). The POSIX documentation on close() says that it does NOT guarantee data is physically written to the file. So you need to use an explicit fsync() call for that.
Conclusion: flush and/or close should be enough to get the file cached so that it can be read fully by another process or operation. To get the file all the way to the physical media with certainty, you need to call IO.fsync.
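The difference between "visible to other readers" and "on the physical disk" can be seen with a small sketch. This is only an illustration, assuming a scratch file created with Tempfile (the "buffering-demo" name is made up for the example) and a write small enough to fit in Ruby's internal buffer:

```ruby
require "tempfile"

before_flush = after_flush = nil

# Hypothetical scratch file, used only for this demonstration.
Tempfile.create("buffering-demo") do |file|
  file.write("hello")                   # small write: stays in Ruby's internal buffer
  before_flush = File.read(file.path)   # a second reader does not see it yet

  file.flush                            # hand the bytes over to the OS
  after_flush = File.read(file.path)    # now any other reader can see them

  file.fsync                            # additionally ask the OS to commit to the device
end

puts "before flush: #{before_flush.inspect}"
puts "after flush:  #{after_flush.inspect}"
```

On a typical system the first read should come back empty and the second should return the full string. The fsync call changes nothing that a reader can observe; it only matters for durability after a crash or power loss.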
Other related methods:
- IO.syswrite: Bypass Ruby internal buffers and do a straight OS write. If you use this then do not mix it with IO.read/write.
- IO.sysread: Same as above, but for reading.
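As a rough sketch of the sys* family, the following writes with syswrite, checks that the data is visible without any flush, and reads it back with sysread. The scratch file name is hypothetical, and sysseek is used for repositioning so that the buffered and unbuffered APIs are not mixed:

```ruby
require "tempfile"

written = visible = read_back = nil

# Hypothetical scratch file, used only for this demonstration.
Tempfile.create("syswrite-demo") do |file|
  written = file.syswrite("unbuffered\n")  # returns the number of bytes written
  visible = File.read(file.path)           # already visible, no flush needed

  file.sysseek(0)                          # rewind using the unbuffered API as well
  read_back = file.sysread(written)        # read the same bytes straight from the OS
end

puts "bytes written: #{written}"
puts "visible to another reader: #{visible.inspect}"
puts "read back: #{read_back.inspect}"
```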
Consequences of unbuffered file I/O
Unbuffered I/O is slower than buffered I/O when each write operation is small; the situation is reversed for large writes. In a middle range of around 1,000 to 10,000 bytes per operation, it doesn't make much difference.
You will also see slightly better performance when operations are aligned
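If you want to check the effect of write size on your own system, a rough timing sketch might look like the following. The helper name time_writes, the chunk sizes and the operation count are all assumptions chosen for illustration, and the numbers will vary a lot between machines and filesystems:

```ruby
require "tempfile"

# Hypothetical helper: writes `count` chunks of `size` bytes and
# returns the elapsed wall-clock time in seconds.
def time_writes(sync:, size:, count:)
  chunk = "x" * size
  elapsed = nil
  Tempfile.create("sync-bench") do |file|
    file.sync = sync                      # true disables Ruby's internal buffering
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    count.times { file.write(chunk) }
    file.flush
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  elapsed
end

[16, 4_096, 32_768].each do |size|
  buffered   = time_writes(sync: false, size: size, count: 200)
  unbuffered = time_writes(sync: true,  size: size, count: 200)
  puts format("%6d bytes/op  buffered %.5fs  unbuffered %.5fs", size, buffered, unbuffered)
end
```

Treat the output as a relative comparison only; a single short run like this is noisy.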
Tell Ruby Display a Line before going to next etc
It's a buffering issue. Try adding $stdout.sync = true before you use puts. Setting sync to true disables buffering.
x = 1
y = 2
$stdout.sync = true
puts x + y
sleep(5)
puts x + y * 2
sleep(10)
puts((x + y) * 2)
Or, you can flush stdout manually each time:
x = 1
y = 2
puts x + y
$stdout.flush
sleep(5)
puts x + y * 2
$stdout.flush
sleep(10)
puts((x + y) * 2)
$stdout.flush
For more info on Ruby buffering, check out this well-done answer.
Why does writing concurrently to a file using different processes produce weird result?
Having multiple processes write to the same file is usually not a good idea. In most cases, unless you absolutely know what you are doing, the result will be unpredictable, as you just demonstrated with your example.
The reason you get this strange result is that the Ruby IO object has its own internal buffer. This buffer is kept in memory, and is NOT guaranteed to be written to disk when you call <<.
What happens here is that the string hello from parent only gets written to the internal buffer, and not to the disk. Then, when you call fork, this buffer is copied into the child. The child then appends hello from child to the buffer, and only THEN is the buffer flushed to disk.
The result is that all children will write hello from parent, in addition to writing hello from child, because this is what the internal memory buffer will contain by the time Ruby decides to write the buffer to disk.
To get around this problem you can call IO.flush before forking, so that the memory buffer is written out and empty before the fork. The buffer is then empty in the child, and you will now get your expected output:
CSV.open(...) do |target_file|
  target_file << ...
  target_file.flush # <-- Make sure the internal buffer is flushed to disk before forking
  a.each do |num|
    ... Process.fork ...
  end
end
...
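The effect can be reproduced in isolation with a small sketch. This is not the original poster's code: the helper name lines_after_fork, the scratch file and the number of children are assumptions made for the demonstration, and it needs a platform where fork is available. Each child flushes explicitly and then leaves with exit! so that no at_exit handlers run in the child:

```ruby
require "tempfile"

# Hypothetical helper: the parent writes one line, forks `children` processes
# that each write one line, and the resulting file is returned as an array of lines.
def lines_after_fork(flush_before_fork:, children: 2)
  lines = nil
  Tempfile.create("fork-demo") do |file|
    file.puts "hello from parent"          # lands in Ruby's internal buffer
    file.flush if flush_before_fork        # empty the buffer before it gets copied

    pids = Array.new(children) do
      fork do
        file.puts "hello from child"       # appended to the child's copy of the buffer
        file.flush                         # writes whatever that copy contains
        exit!(0)                           # leave without running at_exit handlers
      end
    end
    pids.each { |pid| Process.wait(pid) }

    file.flush                             # the parent's own copy, if still pending
    lines = File.readlines(file.path, chomp: true)
  end
  lines
end

without_flush = lines_after_fork(flush_before_fork: false)
with_flush    = lines_after_fork(flush_before_fork: true)

puts "parent lines without flush: #{without_flush.count('hello from parent')}"
puts "parent lines with flush:    #{with_flush.count('hello from parent')}"
```

Without the flush, the parent's line should show up once per child plus once for the parent itself; with the flush it should show up exactly once.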
Why is IO::WaitReadable being raised differently for STDOUT than STDERR?
On most operating systems, STDOUT is buffered while STDERR is not. What popen3 does is basically open a pipe between the executable you launch and Ruby.
Any output that is in buffered mode is not sent through this pipe until either:
- The buffer is filled (thereby forcing a flush).
- The sending application exits (EOF is reached, forcing a flush).
- The stream is explicitly flushed.
The reason STDERR is not buffered is that it's usually considered important for error messages to appear instantly, rather than trading immediacy for efficiency through buffering.
So, knowing this, you can emulate STDERR behaviour with STDOUT like this:
#!/usr/bin/env ruby
3.times do
  STDOUT.puts 'message on stdout'
  STDOUT.flush
  STDERR.puts 'message on stderr'
  sleep 1
end
and you will see the difference.
You might also want to check "Understanding Ruby and OS I/O buffering".
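The same behaviour can be observed inside a single process with a plain pipe. The sketch below is only an illustration and does not use popen3: it turns Ruby-level buffering on for the write end explicitly (IO.pipe may put the write end into sync mode by default), then shows that a non-blocking read finds nothing until the writer flushes:

```ruby
reader, writer = IO.pipe
writer.sync = false   # make sure writes go through Ruby's internal buffer

writer.write("message on stdout\n")
# Nothing has reached the pipe yet, so a non-blocking read has nothing to return.
before_flush = reader.read_nonblock(64, exception: false)

writer.flush
# The explicit flush pushed the buffered bytes into the pipe.
after_flush = reader.read_nonblock(64, exception: false)

puts "before flush: #{before_flush.inspect}"
puts "after flush:  #{after_flush.inspect}"

writer.close
reader.close
```

With exception: false, read_nonblock returns :wait_readable instead of raising IO::WaitReadable, which keeps the example short; the underlying condition is the same.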
Line-oriented streaming in Ruby (like grep)
It looks like your best bet is to use STDOUT.syswrite and STDIN.sysread - the following seemed to have reasonably good performance, despite being ugly code:
STDIN.sync = true
STDOUT.syswrite "Looking for #{ARGV[0]}\n"
def next_line
  mybuff = @overflow || ""
  until mybuff[/\n/]
    mybuff += STDIN.sysread(8)
  end
  overflow = mybuff.split("\n")
  out, *others = overflow
  @overflow = others.join("\n")
  out
rescue EOFError => e
  false # NB: There's a bug here, see below
end
line = next_line
while line
  STDOUT.syswrite "#{line}\n" if line =~ /#{ARGV[0]}/i
  line = next_line
end
Note: Not sure you need #sync with #sysread, but if so you should probably sync STDOUT too. Also, it reads 8 bytes at a time into mybuff, which is highly inefficient / CPU heavy, so you should experiment with this value. Lastly, this code is hacky and needs a refactor, but it works. I tested it using ls -l ~/* | ruby rgrep.rb doc (where 'doc' is the search term).
Second note: Apparently, I was so busy trying to get it to perform well, I failed to get it to perform correctly! As Dmitry Shevkoplyas has noted, if there is text in @overflow when EOFError is raised, that text will be lost. I believe if you replace the rescue clause with the following, it should fix the problem:
rescue EOFError => e
  return false unless @overflow && @overflow.length > 0
  output = @overflow
  @overflow = ""
  output
end
(if you found that helpful, please upvote Dmitry's answer!)
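For comparison, here is one possible refactor of the same sysread idea. It is a sketch rather than a drop-in replacement for the code above: each_line_unbuffered is a hypothetical helper name, the default chunk size of 4096 is an arbitrary choice, and the demonstration feeds it from a pipe instead of STDIN. It keeps blank lines and hands back any trailing text that is not terminated by a newline when EOF is reached:

```ruby
# Hypothetical helper: yields each line read from `io` using sysread only.
def each_line_unbuffered(io, chunk_size = 4096)
  pending = String.new                      # bytes read but not yet yielded
  loop do
    begin
      pending << io.sysread(chunk_size)
    rescue EOFError
      break                                 # no more input; leftovers handled below
    end
    # Hand out every complete line currently sitting in the buffer.
    while (newline_at = pending.index("\n"))
      yield pending.slice!(0..newline_at).chomp
    end
  end
  yield pending unless pending.empty?       # final line without a trailing newline
end

# Demonstration with a pipe standing in for STDIN.
reader, writer = IO.pipe
writer.write("alpha\nbeta doc\n\ngamma doc")
writer.close

lines = []
each_line_unbuffered(reader, 8) { |line| lines << line }
reader.close

matches = lines.grep(/doc/i)
puts matches
```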