Ruby File Reading Parallelism

Simple parallelism with Fibers?

Fibers by themselves will not let you achieve parallelism, at least not without using some sort of callback mechanism such as the EventMachine framework.

What you wrote simply interleaves synchronous execution among code blocks. The reason you do not get the expected sequence is that, while you did simulate the kick-off, you never resumed the fibers after they yielded.

You might find the following post useful, particularly the example at the end:

http://schmurfy.github.io/2011/09/25/on_fibers_and_threads.html

Another example showing fibers transferring control to each other:

https://gist.github.com/aprescott/971008
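
For a rough idea of what that gist demonstrates, here is a minimal sketch (not the gist's own code, and patterned after the Fiber#transfer documentation; behaviour differs on very old Rubies) of two fibers handing control to each other with transfer instead of yield/resume:

require 'fiber'

# Illustrative sketch: fiber1 hands control to fiber2 with transfer,
# fiber2 hands it back, and fiber1 finally yields to the main program.
fiber2 = nil

fiber1 = Fiber.new do
  puts "fiber1: hello"
  fiber2.transfer          # hand control to fiber2
  puts "fiber1: back again"
  Fiber.yield              # give control back to the code that resumed us
end

fiber2 = Fiber.new do
  puts "fiber2: hello"
  fiber1.transfer          # hand control back to fiber1
end

fiber1.resume
puts "main: done"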

This should give you the expected results:

#!/usr/bin/env ruby

require 'fiber'

f1 = Fiber.new do
  puts "Fiber1 starting @ #{Time.new}."
  Fiber.yield
  sleep 2
  puts "Fiber1 done @ #{Time.new}."
  1
end

f2 = Fiber.new do
  puts "Fiber2 starting @ #{Time.new}."
  Fiber.yield
  sleep 2
  puts "Fiber2 done @ #{Time.new}."
  2
end

puts "Waiting @ #{Time.new}."
r1 = f1.resume
puts "f1 back @ #{Time.new} - #{r1}."
r2 = f2.resume
puts "f2 back @ #{Time.new} - #{r2}."

# Resume right after the yield in the fiber block and
# execute until it encounters another yield or the block ends.
puts "Resuming f1"
f1.resume
puts "Resuming f2"
f2.resume

sleep 1
puts "Done @ #{Time.now}."

Output:

Waiting @ 2016-06-05 00:35:29 -0700.
Fiber1 starting @ 2016-06-05 00:35:29 -0700.
f1 back @ 2016-06-05 00:35:29 -0700 - .
Fiber2 starting @ 2016-06-05 00:35:29 -0700.
f2 back @ 2016-06-05 00:35:29 -0700 - .
Resuming f1
Fiber1 done @ 2016-06-05 00:35:31 -0700.
Resuming f2
Fiber2 done @ 2016-06-05 00:35:33 -0700.
Done @ 2016-06-05 00:35:34 -0700.
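
Note that the two sleeps still run one after the other, which is the point of the answer. If the goal were to actually overlap them, plain threads would be enough; a rough, illustrative counterpart of the example with Thread (not part of the original question) finishes in about 2 seconds instead of 4:

#!/usr/bin/env ruby

# Illustrative Thread-based counterpart: both sleeps overlap.
t1 = Thread.new do
  puts "Thread1 starting @ #{Time.now}."
  sleep 2
  puts "Thread1 done @ #{Time.now}."
  1
end

t2 = Thread.new do
  puts "Thread2 starting @ #{Time.now}."
  sleep 2
  puts "Thread2 done @ #{Time.now}."
  2
end

# Thread#value joins the thread and returns its block's result.
puts "t1 -> #{t1.value}, t2 -> #{t2.value}. Done @ #{Time.now}."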

How to pass back data from forks through memory instead of having to write files in Ruby?

Well, I've managed to solve it in a different way using the parallel gem:

https://github.com/grosser/parallel

gem install parallel

require "parallel"

# passing data to 3 processes
data = [ 1, 2, 3 ]

result = Parallel.map( data ){|d|
# some expensive computation ...
d ** 3
}

p result

[1, 8, 27]
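
If you'd rather see the general mechanism without the gem, the same idea can be sketched by hand with a pipe and Marshal. This is only an illustration of the technique (fork on Unix-like systems), not the parallel gem's actual implementation:

# The child process serialises its result and sends it back through an
# in-memory pipe, so nothing is written to disk.
reader, writer = IO.pipe

pid = fork do
  reader.close
  result = [1, 2, 3].map { |d| d ** 3 }   # the "expensive computation"
  writer.write(Marshal.dump(result))
  writer.close
end

writer.close
data = Marshal.load(reader.read)
reader.close
Process.wait(pid)

p data   # => [1, 8, 27]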

Ruby performance with multiple threads vs one thread

I think (but I'm not sure) the problem is that you are reading, from multiple threads, contents that live on the same disk, so your threads can't really run simultaneously: they all end up waiting for IO (the disk).

Some days ago I had to do a similar thing (but fetching data from the network), and the difference between the sequential and the threaded version was huge.

A possible solution could be to load each file's entire content instead of reading it the way you do in your code. In your code you read the contents line by line. If you load all the content first and then process it, you should be able to perform much better, because the threads won't have to wait for IO on every line; something like the sketch below.
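
A rough, illustrative sketch of that idea (the paths and the per-line check are made up for the example):

# Read each file in a single call inside its thread, then do the CPU work
# on the in-memory string instead of hitting the disk line by line.
paths = Dir.glob('logs/*.log')   # hypothetical input files

threads = paths.map do |path|
  Thread.new do
    contents = File.read(path)                       # one big read per file
    contents.lines.count { |line| line.include?('ERROR') }
  end
end

puts "Total matches: #{threads.sum(&:value)}"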

Using Parallel gem in Ruby; how many cores to use?

No idea how memory-intensive your program is, but those requirements could also cause major unexpected issues for people with less memory than the machine you're testing it on.

Since it's a CLI tool, why not add a flag like --procs that takes an argument for the number of processes to use, and leave it up to the user to decide?
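
For instance, a hypothetical sketch of such a flag (the option name and default are made up), using the parallel gem's in_processes option and Parallel.processor_count:

require 'optparse'
require 'parallel'

# Default to the machine's core count, but let the user override it.
options = { procs: Parallel.processor_count }

OptionParser.new do |opts|
  opts.on('--procs N', Integer, 'number of processes to use') do |n|
    options[:procs] = n
  end
end.parse!

results = Parallel.map(1..8, in_processes: options[:procs]) { |i| i * i }
p results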

Ruby String operations on HUGE String

Here's a benchmark against a real-life log file. Of the methods used to read the file, only the one using foreach is scalable because it avoids slurping the file.

Using lazy adds overhead, resulting in slower times than map alone.

Notice that foreach is right in there as far as processing speed goes, and it results in a scalable solution. Ruby won't care if the file is a zillion lines or a zillion TB; it's still only seeing a single line at a time. See "Why is 'slurping' a file not a good practice?" for some related information about reading files.

People often gravitate toward something that pulls in the entire file at once and then splits it into parts. That ignores the job Ruby then has to do to rebuild the array from the line endings using split or something similar. That cost adds up, and is why I think foreach pulls ahead.

Also notice that the results shift a little between the two benchmark runs. This is probably due to system tasks running on my Mac Pro while the jobs are running. The important thing is that the shift shows the difference is a wash, which confirms to me that foreach is the right way to process big files: it's not going to kill the machine if the input file exceeds available memory.

require 'benchmark'

REGEX = /\bfoo\z/
LOG = 'debug.log'
N = 1

# readlines: "Reads the entire file specified by name as individual lines,
# and returns those lines in an array."
#
# Because the file is read into an array of lines up front, this isn't
# scalable. It will work if Ruby has enough memory to hold the array plus all
# other variables and its overhead.
def lazy_map(filename)
  File.open("lazy_map.out", 'w') do |fo|
    fo.puts File.readlines(filename).lazy.map { |li|
      li.gsub(REGEX, 'bar')
    }.force
  end
end

# Same as above, but without lazy: readlines still reads the entire file into
# an array of lines, so this isn't scalable either. It will work if Ruby has
# enough memory to hold the array plus all other variables and its overhead.
def map(filename)
  File.open("map.out", 'w') do |fo|
    fo.puts File.readlines(filename).map { |li|
      li.gsub(REGEX, 'bar')
    }
  end
end

# "Reads the entire file specified by name as individual lines, and returns
# those lines in an array."
#
# As a result of returning all the lines in an array this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def readlines(filename)
  File.open("readlines.out", 'w') do |fo|
    File.readlines(filename).each do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

# This is completely scalable because no file slurping is involved.
# "Executes the block for every line in the named I/O port..."
#
# It's slower, but it works reliably.
def foreach(filename)
  File.open("foreach.out", 'w') do |fo|
    File.foreach(filename) do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

puts "Ruby version: #{ RUBY_VERSION }"
puts "log bytes: #{ File.size(LOG) }"
puts "log lines: #{ `wc -l #{ LOG }`.to_i }"

2.times do
  Benchmark.bm(13) do |b|
    b.report('lazy_map')  { lazy_map(LOG)  }
    b.report('map')       { map(LOG)       }
    b.report('readlines') { readlines(LOG) }
    b.report('foreach')   { foreach(LOG)   }
  end
end

%w[lazy_map map readlines foreach].each do |s|
  puts `wc #{ s }.out`
end

Which results in:

Ruby version: 2.0.0
log bytes: 733978797
log lines: 5540058
                    user     system      total        real
lazy_map       35.010000   4.120000  39.130000 ( 43.688429)
map            29.510000   7.440000  36.950000 ( 43.544893)
readlines      28.750000   9.860000  38.610000 ( 43.578684)
foreach        25.380000   4.120000  29.500000 ( 35.414149)
                    user     system      total        real
lazy_map       32.350000   9.000000  41.350000 ( 51.567903)
map            24.740000   3.410000  28.150000 ( 32.540841)
readlines      24.490000   7.330000  31.820000 ( 37.873325)
foreach        26.460000   2.540000  29.000000 ( 33.599926)
5540058 83892946 733978797 lazy_map.out
5540058 83892946 733978797 map.out
5540058 83892946 733978797 readlines.out
5540058 83892946 733978797 foreach.out

The use of gsub is innocuous since every method uses it, but it's not needed and was added for a bit of frivolous resistive loading.


