Performance of Arrays and Hashes in Ruby

Hashes are much faster for lookups: scanning an array with Array#any? is O(n) per search, while a Hash lookup is O(1) on average:

require 'benchmark'

Document = Struct.new(:id, :a, :b, :c)

documents_a = []
documents_h = {}

# Store 10,000 documents in an array and in a hash keyed by id
1.upto(10_000) do |n|
  d = Document.new(n)
  documents_a << d
  documents_h[d.id] = d
end

# 1,000 random ids to search for
searchlist = Array.new(1000) { rand(10_000) + 1 }

Benchmark.bm(10) do |x|
  x.report('array') { searchlist.each { |el| documents_a.any? { |d| d.id == el } } }
  x.report('hash')  { searchlist.each { |el| documents_h.has_key?(el) } }
end

#                 user     system      total        real
# array       2.240000   0.020000   2.260000 (  2.370452)
# hash        0.000000   0.000000   0.000000 (  0.000695)

Which is more efficient in Ruby, Hash or Array?

Does it simply mean that array is more efficient, because the internal representation of an array is simpler than a hash?

Yes. The algorithmic complexity may be the same, but an array will generally be faster, because the lookup procedure is much simpler. On the other hand, with an array the "keys" (indices) can only be integers, and an array is not sparse: if you store to a[100], the array will contain at least 101 elements afterwards.

(For extremely sparse arrays, a hash map should actually perform better).
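To see the non-sparseness in action, a quick sketch in irb:

a = []
a[100] = :x
a.length        # => 101; indices 0..99 are padded with nil

h = {}
h[12_398_239] = :x
h.size          # => 1; no padding, however large the key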

Unexpected access performance differences between arrays and hashes

Consider the memory layout of an array of arrays, say with dimensions 3x3... you've got something like this:

memory address        usage/content
base                  [0][0]
base + sizeof(int)    [0][1]
base + 2*sizeof(int)  [0][2]
base + 3*sizeof(int)  [1][0]
base + 4*sizeof(int)  [1][1]
...

Given an array of dimensions [M][N] (M rows of N columns), all that's needed to access an element at indices [i][j] is to add the base memory address to the data element size times (i * N + j): a tiny bit of simple arithmetic, and therefore extremely fast.
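As a sketch, here is the same arithmetic in Ruby over a flat row-major array (flat and element are illustrative names, not from the original):

M, N = 3, 3
flat = (0...M * N).to_a          # row-major storage of a 3x3 grid

def element(flat, n_cols, i, j)
  flat[i * n_cols + j]           # base + element_size * (i * N + j), in pointer terms
end

element(flat, N, 1, 2)           # => 5 (row 1, column 2)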

Hashes are far more complicated data structures and inherently slower. With a hash, you first need to take time to hash the key. The harder the hash function works to scatter different (even similar) keys statistically randomly throughout its output range, the slower it tends to be; if it doesn't make that effort, you get more collisions in the hash table and slower lookups there instead. The hash value then needs to be mapped onto the current hash table size (usually with %), and finally you compare keys to see whether you've found the hoped-for key, a colliding element, or an empty slot. It's a far more involved process than array indexing. You should probably do some background reading about hash functions and hash table implementations.
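To make those steps concrete, here is a toy separate-chaining table in Ruby (an illustrative sketch only; Ruby's real Hash is a far more optimised C implementation):

TABLE_SIZE = 8
buckets = Array.new(TABLE_SIZE) { [] }

insert = lambda do |key, value|
  idx = key.hash % TABLE_SIZE                    # hash the key, map into the table range
  buckets[idx] << [key, value]
end

lookup = lambda do |key|
  idx = key.hash % TABLE_SIZE
  pair = buckets[idx].find { |k, _| k == key }   # compare keys to skip colliding entries
  pair && pair.last
end

insert.call('foo', 1)
insert.call('bar', 2)
lookup.call('foo')   # => 1
lookup.call('baz')   # => nil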

The reason hashes are so often useful is that the key doesn't need to be numeric (you can always work out some formula to generate a number from arbitrary key data), and the keys need not be near-contiguous for memory efficiency. A hash table with capacity for, say, 5 integers could happily store the keys 1, 1000 and 12398239, whereas an array indexed by those values would waste a lot of virtual address space on all the empty indices in between (and more data packed into a memory page means more cache hits).

Further, be careful with benchmarks: when you do clearly repetitive work with unchanging values, overwriting the same variable, an optimiser may elide it and you may not be timing what you think you are. It's good to use some run-time inputs (e.g. storing different values in the containers), accumulate some dependent result (e.g. summing the element accesses rather than overwriting a variable), and then output that result, so any lazy evaluation is forced to complete. With JITs and VMs involved there can also be kinks in your benchmarks as compilation kicks in or branch-prediction results are incorporated.
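For instance, a defensive version of the hash benchmark above might accumulate and print a checksum (a sketch reusing documents_h and searchlist from the first example):

hits = 0
Benchmark.bm(10) do |x|
  x.report('hash') do
    searchlist.each { |el| hits += 1 if documents_h.key?(el) }
  end
end
puts hits   # the result depends on every lookup, so none can be optimised away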

Performance of `Hash#has_key?` and `Array#index`

To answer your question, "Method 2" should be faster. That said, it's a loaded statement that partly depends on the nature of hashes (e.g. collisions when inserting).
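To illustrate why collisions matter, here is a contrived sketch with keys whose #hash always collides, degrading lookups from O(1) toward O(n) (BadKey is a hypothetical class, not from the question):

class BadKey
  attr_reader :n

  def initialize(n)
    @n = n
  end

  def hash
    0                 # every key lands in the same bucket
  end

  def eql?(other)
    n == other.n
  end
end

h = {}
1_000.times { |i| h[BadKey.new(i)] = i }
h[BadKey.new(999)]    # works, but each lookup must scan the colliding keys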

However, for your specific use case, I think arrays and hashes are both the "wrong tool for the job". In general, if you're using a hash just to check membership in a unique collection (hint hint), use a set.

One final thought, which may or may not be valuable depending on how contrived your example is: if you're storing some finite set of ordered values ('a'-'d' in your example), an array is definitely the way to go. Why? Because you can easily map the values of your alphabet to array indices (a maps to 0, b maps to 1, and so forth) by converting the letters to their ASCII codes and subtracting to get the desired location. This gives you an O(1) lookup time.
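For example (a hypothetical sketch; values and value_for are illustrative names):

values = [10, 20, 30, 40]          # data for 'a', 'b', 'c', 'd'

def value_for(values, letter)
  values[letter.ord - 'a'.ord]     # 'a' -> 0, 'b' -> 1, ... an O(1) lookup
end

value_for(values, 'c')             # => 30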

Improve performance to find an array of ids from array of hashes in Ruby

This will improve speed:

#!/usr/bin/env ruby
require 'benchmark'

# 5,000 records with string ids
a = []
5000.times { |c| a << { "id" => "#{c}", "imageUrl" => "test#{c}" } }
b1 = (1..2500).to_a.shuffle.map(&:to_s)
b2 = b1.dup

puts "method1"
puts Benchmark.measure { b1.map! { |id| a.select { |h| h['id'] == id } } }

puts "method2"
result = Benchmark.measure do
  ah = Hash.new([])               # note: every missing id shares this one default array
  a.each { |x| ah[x["id"]] = x }  # index the records by id once...
  b2.map! { |be| ah[be] }         # ...then each lookup is O(1)
end
puts result

Results:

method1
2.820000 0.010000 2.830000 ( 2.827695)
method2
0.000000 0.000000 0.000000 ( 0.002607)

Updated benchmark: it uses 250,000 elements in b instead of 2,500 (method 1 is commented out to protect the innocent; it's too slow and I got bored waiting for it). Method 3 wraps each hit in an array so its output shape matches method 1's:

#!/usr/bin/env ruby
require 'benchmark'

a = []
5000.times { |c| a << { "id" => "#{c}", "imageUrl" => "test#{c}" } }
b1 = (1..250000).to_a.collect { |x| x % 2500 }.shuffle.map(&:to_s)
b2 = b1.dup
b3 = b1.dup

# puts "method1"
# puts Benchmark.measure { b1.map! { |id| a.select { |h| h['id'] == id } } }

puts "method2"
result = Benchmark.measure do
  ah = Hash.new([])
  a.each { |x| ah[x["id"]] = x }
  b2.map! { |be| ah[be] }
end
puts result

puts "method3"
result = Benchmark.measure do
  h = a.each_with_object({}) { |g, memo| memo.update(g['id'] => g) }
  b3.map! { |s| h.key?(s) ? [h[s]] : [] }   # wrap hits in an array to mirror method 1's output
end
puts result

And the results are:

method2
0.050000 0.000000 0.050000 ( 0.045294)
method3
0.100000 0.010000 0.110000 ( 0.109646)

Ruby Hashes vs. Arrays: quickest way to find an item?

Yes, you could use an array, but lookups would be O(log N) at best (that requires keeping the array sorted and using binary search); it is both faster and semantically better to use a set.

require 'set'

hashes = Set.new
hashes << 'foo'
hashes << 'bar'
hashes.include?('bar') # => true

In Ruby, sets are implemented with hash tables, so lookups are O(1) on average.
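For comparison, the O(log N) array approach mentioned above would use Array#bsearch on a sorted array (a minimal sketch):

sorted = %w[bar baz foo]                    # must be kept sorted
found  = sorted.bsearch { |s| s >= 'bar' }  # binary search, O(log N)
found == 'bar'                              # => true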

Performance - Ruby - Compare large array of hashes (dictionary) to primary hash; update resulting value

Assuming "no" on both of my questions in the comments:

#!/usr/bin/env ruby

require 'csv'

# Build a name => tag lookup table; the regex strips the trailing "-suffix" from each name
@dictionary_data = CSV.open('dict_data.csv') { |csv|
  Hash[csv.map { |name, tag| [name[/^.+(?=-\w+$)/], tag] }]
}

@app_data = [{ 'name' => 'server_name1.sub.emea', 'tag' => 'Europe', 'state' => 'Online' },
             { 'name' => 'server_name2.sub.us',   'tag' => 'US E.',  'state' => 'Online' },
             { 'name' => 'server_name3.sub.us',   'tag' => 'US W.',  'state' => 'Failover' }]

STATE_MAP = {
  'Online'   => 'Up',
  'Failover' => 'F.O.'
}

@app_data = @app_data.map do |server|
  name = server['name'][/^[^.]+/]   # keep only the part before the first dot
  {
    'name'  => name,
    'tag'   => @dictionary_data[name],
    'state' => STATE_MAP[server['state']],
  }
end

p @app_data
# => [{"name"=>"server_name1", "tag"=>"Paris, France", "state"=>"Up"},
#     {"name"=>"server_name2", "tag"=>"New York, USA", "state"=>"Up"},
#     {"name"=>"server_name3", "tag"=>"Portland, USA", "state"=>"F.O."}]

EDIT: I find it more convenient here to read the CSV without headers, as I don't want it to generate an array of hashes. But if you do want to read a headerless CSV as if it had headers, you don't need to touch the data itself; Ruby's CSV is quite powerful:

CSV.read('dict_data.csv', headers: %i(name tag)).map(&:to_hash)
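For a hypothetical two-line file, this approach would produce something like the following (a standalone sketch using CSV.parse, with made-up contents):

require 'csv'

csv_text = "server_name1-eu,Paris\nserver_name2-us,Boston\n"
CSV.parse(csv_text, headers: %i(name tag)).map(&:to_hash)
# => [{:name=>"server_name1-eu", :tag=>"Paris"},
#     {:name=>"server_name2-us", :tag=>"Boston"}]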

