Working with a Large Data Object Between Ruby Processes

A Sinatra app will work, but serializing and deserializing the data and parsing the HTTP requests could hurt performance compared to a DRb service.

Here's an example based on the one in your related question. I'm using a hash instead of an array so you can use user ids as keys. This way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyway, since a user's interests can then be updated in one call.

# server.rb
require 'drb'

class InterestServer < Hash
  include DRbUndumped # don't send the data over!

  # Returns [match_count, user_id] pairs, sorted by ascending match count.
  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    # Indexes of the interests the current user has selected
    selected_interests = cur_interests.each_index.select{|i| cur_interests[i]}

    scores = map do |user_id, interests|
      nb_match = selected_interests.count{|i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort!
  end
end

DRb.start_service nil, InterestServer.new
puts DRb.uri

DRb.thread.join

# client.rb

uri = ARGV.shift
require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri

USERS_COUNT = 10_000
INTERESTS_COUNT = 500

# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }

# Initial send over user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end

# query at will
puts interest_server.closest(users.first[:id]).inspect

# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }

puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly

To run this in a terminal, start the server and pass the URI it prints as the argument to the client:

$ ruby server.rb
druby://mal.lan:51630

$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]

[[45, 42], [45, 178902]]

Shared Variable Among Ruby Processes

One problem is that you need to use Process.wait to wait for your forked processes to complete. The other is that you can't do interprocess communication through variables, because each forked process works on its own copy of them. To see this:

@one = nil
@two = nil
@hash = {}
pidA = fork do
  sleep 1
  @one = 1
  @hash[:one] = 1
  p [:one, @one, :hash, @hash] #=> [ :one, 1, :hash, { :one => 1 } ]
end
pidB = fork do
  sleep 2
  @two = 2
  @hash[:two] = 2
  p [:two, @two, :hash, @hash] #=> [ :two, 2, :hash, { :two => 2 } ]
end
Process.wait(pidB)
Process.wait(pidA)
p [:one, @one, :two, @two, :hash, @hash] #=> [ :one, nil, :two, nil, :hash, {} ]

One way to do interprocess communication is using a pipe (IO::pipe). Open it before you fork, then have each side of the fork close one end of the pipe.

From ri IO::pipe:

rd, wr = IO.pipe

if fork
  wr.close
  puts "Parent got: <#{rd.read}>"
  rd.close
  Process.wait
else
  rd.close
  puts "Sending message to parent"
  wr.write "Hi Dad"
  wr.close
end

This produces:

Sending message to parent
Parent got: <Hi Dad>
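
The same pipe technique can carry a whole Ruby object, not just a string. The following is a minimal sketch, assuming the data is trusted (Marshal.load should not be fed untrusted input) and small enough to read in one go. The helper name fork_with_result is made up for this illustration. Each child gets its own pipe, and the parent merges the hashes it reads back:

```ruby
# Runs the block in a forked child and returns [pid, read_end].
# The block's return value is sent back through the pipe with Marshal.
# (fork_with_result is a hypothetical helper, not a standard method.)
def fork_with_result
  rd, wr = IO.pipe
  pid = fork do
    rd.close                      # the child only writes
    wr.binmode
    wr.write Marshal.dump(yield)
    wr.close
  end
  wr.close                        # the parent only reads
  [pid, rd]
end

children = [
  fork_with_result { { one: 1 } },
  fork_with_result { { two: 2 } },
]

merged = {}
children.each do |pid, rd|
  rd.binmode
  merged.merge!(Marshal.load(rd.read)) # read returns once the child closes its end
  rd.close
  Process.wait(pid)
end

p merged
```

Unlike the instance variables in the fork example above, the merged hash in the parent ends up containing the values computed in both children.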

If you want to share variables, use threads:

@one = nil
@two = nil
@hash = {}
threadA = Thread.fork do
  sleep 1
  @one = 1
  @hash[:one] = 1
  p [:one, @one, :hash, @hash] #=> [ :one, 1, :hash, { :one => 1 } ] # (usually)
end
threadB = Thread.fork do
  sleep 2
  @two = 2
  @hash[:two] = 2
  p [:two, @two, :hash, @hash] #=> [ :two, 2, :hash, { :one => 1, :two => 2 } ] # (usually)
end
threadA.join
threadB.join
p [:one, @one, :two, @two, :hash, @hash] #=> [ :one, 1, :two, 2, :hash, { :one => 1, :two => 2 } ]

However, I'm not sure if threading will get you any gain when you're IO bound.
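
The "(usually)" caveats in the thread example are there because nothing orders the threads' access to the shared variables. If several threads update the same object, one common approach is to guard the updates with a Mutex. This is a rough sketch under that assumption; the counter and the thread count are invented for the illustration:

```ruby
counts = Hash.new(0)
lock = Mutex.new

threads = 4.times.map do
  Thread.new do
    # Only one thread at a time may run the synchronize block,
    # so the read-modify-write of counts[:total] is never interleaved.
    1_000.times { lock.synchronize { counts[:total] += 1 } }
  end
end
threads.each(&:join)

p counts[:total] #=> 4000
```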

Sharing heap between ruby processes

So I figured out this is not possible. Java can do this because of its virtual machine, but unfortunately Ruby can't.

Pass variables between separate instances of ruby (without writing to a text file or database)

You need DRb. It works by creating a distributed Ruby service (the server); a client then connects to it and can fetch Ruby objects from it.

http://www.ruby-doc.org/stdlib-1.9.3/libdoc/drb/rdoc/DRb.html

Working with multiple processes in Ruby

Combining DRb, which provides simple inter-process communication, with Queue or SizedQueue, which are both threadsafe queues, should give you what you need.
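
As a rough sketch of that combination: one process exposes a Queue as its DRb front object, and another process pushes to it and pops from it through a DRbObject. To keep the example in a single file, the server is forked as a child and reports its URI back through a pipe. The loopback address, the port-0 trick to pick a free port, and the job strings are all assumptions made for the illustration; in practice the server and the clients would be separate scripts.

```ruby
require 'drb'

# Server process: owns the queue and serves it over DRb.
rd, wr = IO.pipe
server_pid = fork do
  rd.close
  DRb.start_service('druby://127.0.0.1:0', Queue.new) # port 0: pick a free port
  wr.puts DRb.uri                                     # tell the parent where to connect
  wr.close
  DRb.thread.join
end
wr.close
uri = rd.gets.chomp
rd.close

# Client process: talks to the remote queue by reference.
DRb.start_service
jobs = DRbObject.new_with_uri(uri)

3.times { |i| jobs.push("job-#{i}") }  # arguments are marshalled to the server
received = 3.times.map { jobs.pop }    # results are marshalled back

p received

# Shut the server down again.
Process.kill('TERM', server_pid)
Process.wait(server_pid)
DRb.stop_service
```

Because the queue only ever lives in the server process, any number of clients can push and pop without sharing memory. A SizedQueue would additionally block producers once the queue is full.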

You may also want to check out beanstalkd, a work queue that is hosted on GitHub.

Respond with large amount of objects through a Rails API

You need to pass a lot of information through a Ruby process, and that's never simple. I don't think you're missing anything here.

If you decide to generate CSVs at the API level, then what do you gain by maintaining the service? If you're just streaming the response from the API host, you could ditch the service altogether, because an nginx proxy would do the same thing better.

If you decide to paginate, there will be a performance reduction for sure, but nobody can tell you exactly how much you should paginate. Bigger pages will be faster and consume more memory (reducing throughput because you can run fewer workers); smaller pages will be slower and consume less memory but demand more workers because of IO wait times.

Exact numbers will depend on the IO response times of your API app, the cloud, and your infrastructure. I'm afraid no one can give you a simple answer you can follow without experimenting with a stress test, and once you set up a stress test you will get a number of your own anyway, which is better than anybody's estimate.

A suggestion: write a bit more about your problem, the constraints you are working under, etc., and maybe someone can help you with a more radical solution. For some reason I get the feeling that what you're really looking for is a background processor like Sidekiq or Delayed Job, or maybe connecting your service to the DB directly through a DB view if you are anxious to decouple your apps, or an nginx proxy for API responses, or nothing at all... but I really can't tell without more information.

Processing large recordsets in Rails

You want to use ActiveRecord's find_each for this.

Dataset.find_each do |data|
  ...
end
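
find_each loads records in batches (1000 per batch by default, adjustable with the batch_size option) instead of instantiating the whole table at once. The idea can be sketched in plain Ruby without a database. This is only an illustration of batching by primary key, not ActiveRecord's actual implementation; ROWS, fetch_batch and each_in_batches are made-up names, with an in-memory array standing in for the table:

```ruby
# Hypothetical in-memory "table": rows sorted by id.
ROWS = (1..25).map { |id| { id: id, value: id * 10 } }

# Stand-in for: SELECT ... WHERE id > last_id ORDER BY id LIMIT batch_size
def fetch_batch(last_id, batch_size)
  ROWS.select { |row| row[:id] > last_id }.first(batch_size)
end

# Yields every row, but only holds one batch in memory at a time.
def each_in_batches(batch_size: 10)
  last_id = 0
  loop do
    batch = fetch_batch(last_id, batch_size)
    break if batch.empty?
    batch.each { |row| yield row }
    last_id = batch.last[:id] # the next batch starts after the last id seen
  end
end

total = 0
each_in_batches(batch_size: 10) { |row| total += row[:value] }
puts total #=> 3250
```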

