Performance of Arrays and Hashes in Ruby
Hashes are much faster for lookups:
require 'benchmark'
Document = Struct.new(:id,:a,:b,:c)
documents_a = []
documents_h = {}
1.upto(10_000) do |n|
d = Document.new(n)
documents_a << d
documents_h[d.id] = d
end
searchlist = Array.new(1000){ rand(10_000)+1 }
Benchmark.bm(10) do |x|
x.report('array'){searchlist.each{|el| documents_a.any?{|d| d.id == el}} }
x.report('hash'){searchlist.each{|el| documents_h.has_key?(el)} }
end
# user system total real
#array 2.240000 0.020000 2.260000 ( 2.370452)
#hash 0.000000 0.000000 0.000000 ( 0.000695)
Which is more efficient in Ruby, Hash vs Array
Does it simply mean that array is more efficient, because the internal representation of an array is simpler than a hash?
Yes. The algorithmic complexity may be the same, but an array is going to be faster generally, because the lookup procedure is much simpler. On the other hand when using an array the "keys" (indices) can only be integers, and an array is not sparse - if you store to a[100]
, there will be at least 101 elements in the array afterwards.
(For extremely sparse arrays, a hash map should actually perform better).
Unexpected access performance differences between arrays and hashes
Consider the memory layout of an array of arrays, say with dimensions 3x3... you've got something like this:
memory address usage/content
base [0][0]
base+sizeof(int) [0][1]
base+2*sizeof(int) [0][2]
base+3*sizeof(int) [1][0]
base+4*sizeof(int) [1][1]
...
Given an array of dimensions [M][N]
, all that's needed to access an element in at indices [i][j]
is to add the base memory address to the data element size times (i * M + j)... a tiny bit of simple arithmetic, and therefore extremely fast.
Hashes are far more complicated data structures and inherently slower. With a hash, you need to take time to hash the key (and the harder the hash tries to make sure different keys will - statistically - scatter pretty randomly throughout the hash output range even if they're similar keys the slower the hash tends to be, if the hash function doesn't make that effort you'll have more collisions in the hash table and slower performance there), then the hash value needs to be mapped on to the current hash table size (usually using "%"), then you need to compare keys to see if you've found the hoped-for key or a colliding element or an empty element. It's a far more involved process than array indexing. You should probably do some background reading about hash function and hash table implementations....
The reason hashes are so often useful is that the key doesn't need to be numeric (you can always work out some formula to generate a number from arbitrary key data) and need not be near-contiguous for memory efficiency (i.e. a hash table with say memory capacity for 5 integers could happily store keys 1, 1000 and 12398239 - whereas for an array keyed on those values there would be a lot of virtual address space wasted for all the indices in between, which have no data anyway, and anyway more data packed into a memory page means more cache hits).
Further - you should be careful with benchmarks - when you do clearly repetitive work with unchanging values overwriting the same variable, an optimiser may avoid it and you may not be timing what you think you are. It's good to use some run-time inputs (e.g. storing different values in the containers) and accumulate some dependent result (e.g. summing the element accesses rather than overwriting it), then outputting the result so any lazy evaluation is forced to conclude. With things like JITs and VMs involved there can also be kinks in your benchmarks as compilation kicks in or branch prediction results are incorporated.
Performance of `Hash#has_key?` and `Array#index`
To answer your question, "Method 2" should be faster. Now, that is a very loaded statement which partially depends on the very nature of hashes (e.g. collisions when inserting).
However, for your specific use case, I think arrays and hashes are both the "wrong tool for the job". In general, if you're using a hash to check unique set existence (hint hint), use a set
.
One final thought, which may or may not be valuable depending on how contrived your example is. If you're storing some finite set of ordered values ('a'-'d' in your example) an array is definitely the way to go. Why? Because you can easily map the values of your alphabet to an array index (e.g. a maps to 0, b maps to 1 and so forth) by, in your case, converting the letters to ascii and subtracting to get their desired location. This would give you an O(1) lookup time.
Improve performance to find an array of ids from array of hashes in Ruby
This will improve speed:
#!/usr/bin/env ruby
require 'pp'
require 'benchmark'
a = []
5000.times {|c| a << {"id" => "#{c}", "imageUrl" => "test#{c}"}}
b1 = (1..2500).to_a.shuffle.map(&:to_s)
b2 = b1.dup()
puts "method1"
puts Benchmark.measure { b1.map! {|id| a.select {|h| h['id'] == id }} }
puts "method2"
result = Benchmark.measure do
ah = Hash.new([])
a.each{|x| ah[x["id"]] = x}
b2.map!{|be| ah[be]}
end
puts result
Results:
method1
2.820000 0.010000 2.830000 ( 2.827695)
method2
0.000000 0.000000 0.000000 ( 0.002607)
Updated benchmark - it uses 250000 elements in b instead of 2500 (method 1 commented out to protect the innocent - it's too slow and I got bored waiting for it):
#!/usr/bin/env ruby
require 'pp'
require 'benchmark'
a = []
5000.times {|c| a << {"id" => "#{c}", "imageUrl" => "test#{c}"}}
b1 = (1..250000).to_a.collect{|x| x%2500}.shuffle.map(&:to_s)
b2 = b1.dup()
b3 = b1.dup()
# puts "method1"
# puts Benchmark.measure { b1.map! {|id| a.select {|h| h['id'] == id }} }
puts "method2"
result = Benchmark.measure do
ah = Hash.new([])
a.each{|x| ah[x["id"]] = x}
b2.map!{|be| ah[be]}
end
puts result
puts "method3"
result = Benchmark.measure do
h = a.each_with_object({}) { |g,h| h.update(g['id']=>g) }
b3.map! { |s| h.key?(s) ? [h[s]] : [] }
end
puts result
And the results are:
method2
0.050000 0.000000 0.050000 ( 0.045294)
method3
0.100000 0.010000 0.110000 ( 0.109646)
Ruby Hashes vs. Arrays: quickest way to find an item?
Yes, you could use an array, which would be O(logN)
at best, but it is both faster and semantically better to use a set.
require 'set'
hashes = Set.new
hashes << 'foo'
hashes << 'bar'
hashes.include?('bar') # => true
In ruby sets are implemented with hashtables, so lookups are O(1)
.
Performance - Ruby - Compare large array of hashes (dictionary) to primary hash; update resulting value
Assuming "no" on both of my questions in the comments:
#!/usr/bin/env ruby
require 'csv'
@dictionary_data = CSV.open('dict_data.csv') { |csv|
Hash[csv.map { |name, tag| [name[/^.+(?=-\w+$)/], tag] }]
}
@app_data = [{'name' => 'server_name1.sub.emea', 'tag' => 'Europe', 'state' => 'Online'},
{'name' => 'server_name2.sub.us', 'tag' => 'US E.', 'state' => 'Online'},
{'name' => 'server_name3.sub.us', 'tag' => 'US W.', 'state' => 'Failover'}]
STATE_MAP = {
'Online' => 'Up',
'Failover' => 'F.O.'
}
@app_data = @app_data.map do |server|
name = server['name'][/^[^.]+/]
{
'name' => name,
'tag' => @dictionary_data[name],
'state' => STATE_MAP[server['state']],
}
end
p @app_data
# => [{"name"=>"server_name1", "tag"=>"Paris, France", "state"=>"Up"},
# {"name"=>"server_name2", "tag"=>"New York, USA", "state"=>"Up"},
# {"name"=>"server_name3", "tag"=>"Portland, USA", "state"=>"F.O."}]
EDIT: I find it more convenient here to read the CSV without headers, as I don't want it to generate an array of hashes. But to read a headerless CSV as if it had headers, you don't need to touch the data itself, as Ruby's CSV is quite powerful:
CSV.read('dict_data.csv', headers: %i(name tag)).map(&:to_hash)
Related Topics
Why Can't I Install Rails on Lion Using Rvm
Undefined Method 'Visit' When Using Rspec and Capybara in Rails
What Is a Regex to Match a String Not At the End of a Line
Read and Write Yaml Files Without Destroying Anchors and Aliases
Execute a Sudo Command in Ruby on Rails App
The <<- Operator on Ruby, Where Is It Documented
Undefined Symbol: Sslv2_Method When Running Bundle Install
Best Ruby on Rails Social Networking Framework
How to Access Method Arguments in Ruby
Imagemagick - "Core_Rl_Magick_.Dll Not Found" or How to Install Rmagick on Windows With Ruby 1.9.2
Rails Model, View, Controller, and Helper: What Goes Where
Devise Custom Routes and Login Pages
How to Define Action With Simple Form For
How to Access a Variable Defined in a Ruby File I Required in Irb