Why Use Symbols as Hash Keys in Ruby

Why use symbols as hash keys in Ruby?

TL;DR:

Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.

Ruby Symbols are immutable (can't be changed), which makes looking something up much easier

Short(ish) answer:

Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.

Symbols in Ruby are basically "immutable strings" .. that means that they can not be changed, and it implies that the same symbol when referenced many times throughout your source code, is always stored as the same entity, e.g. has the same object id.

Strings on the other hand are mutable, they can be changed anytime. This implies that Ruby needs to store each string you mention throughout your source code in it's separate entity, e.g. if you have a string "name" multiple times mentioned in your source code, Ruby needs to store these all in separate String objects, because they might change later on (that's the nature of a Ruby string).

If you use a string as a Hash key, Ruby needs to evaluate the string and look at it's contents (and compute a hash function on that) and compare the result against the (hashed) values of the keys which are already stored in the Hash.

If you use a symbol as a Hash key, it's implicit that it's immutable, so Ruby can basically just do a comparison of the (hash function of the) object-id against the (hashed) object-ids of keys which are already stored in the Hash. (much faster)

Downside:
Each symbol consumes a slot in the Ruby interpreter's symbol-table, which is never released.
Symbols are never garbage-collected.
So a corner-case is when you have a large number of symbols (e.g. auto-generated ones). In that case you should evaluate how this affects the size of your Ruby interpreter.

Notes:

If you do string comparisons, Ruby can compare symbols just by comparing their object ids, without having to evaluate them. That's much faster than comparing strings, which need to be evaluated.

If you access a hash, Ruby always applies a hash-function to compute a "hash-key" from whatever key you use. You can imagine something like an MD5-hash. And then Ruby compares those "hashed keys" against each other.

Every time you use a string in your code, a new instance is created - string creation is slower than referencing a symbol.

Starting with Ruby 2.1, when you use frozen strings, Ruby will use the same string object. This avoids having to create new copies of the same string, and they are stored in a space that is garbage collected.

Long answers:

https://web.archive.org/web/20180709094450/http://www.reactive.io/tips/2009/01/11/the-difference-between-ruby-symbols-and-strings

http://www.randomhacks.net.s3-website-us-east-1.amazonaws.com/2007/01/20/13-ways-of-looking-at-a-ruby-symbol/

https://www.rubyguides.com/2016/01/ruby-mutability/

What is the use of a hash key with :

Both symbols and strings can be used as hash keys. The keys which start with : are symbols.

Symbols are immutable (they can't be changed) and exist only once in memory:

:foo.object_id == :foo.object_id    #=> true
'foo'.object_id == 'foo'.object_id #=> false

Every time you use a string in Ruby it creates a new object, but this is not true for symbols. For this reason they are used frequently in Ruby as hash keys. Happily they are also one character shorter to type than strings.

Symbols are so common as hash keys in ruby that ruby 1.9 introduced a condensed syntax especially for using symbols as keys.

hash2 = Hash[:a => 100 , :b => 200]

Can also be written:

hash2 = Hash[a: 100, b: 200]

Or more commonly:

hash2 = {a: 100, b: 200}

..if you are using symbols.

More info about symbols vs strings.

Why would one use String keys for a hash over symbols

Because the source of the keys--the query string--is made up of strings, so searching through this string for keys, it is most directly convenient to index the hash via the strings.

Every Symbol that is created in the Ruby runtime is allocated and never released. There is a theoretical (but unlikely) DOS attack available by sending hundreds of thousands of requests with unique query string parameters. If these were symbolized, each request would slowly grow the runtime memory pool.

Strings, on the other hand, may be garbage collected. Thousands of unique strings handled across various requests will eventually go away, with no long-term impact.

Edit: Note that with Sinatra, symbols are also available for accessing the params hash. However, this is done by creating a hash that is indexed by strings, and converting symbols (in your code) to strings when you make a request. Unless you do something like the following:

params.each{ |key,_| key.to_sym }

...you are not at risk of any symbol pseudo-DOS attack.

What is the use of symbols?

Ok, so the misunderstanding probably stems from this:

A symbol is not a variable, it is a value. like 9 is a value that is a number.

A symbol is a value that is kinda of roughly a string... it's just not a string that you can change... and because you can't change it, we can use a shortcut -> all symbols with the same name/value are stored in the same memory-spot (to save space).

You store the symbol into a variable, or use the value somewhere - eg as the key of a hash.... this last is probably one of the most common uses of a symbol.

you make a hash that contains key-value pairs eg:

thing_attrs = {:name => "My thing", :colour => "blue", :size => 6}
thing_attrs[:colour] # 'blue'

In this has - the symbols are the keys you can use any object as a key, but symbols are good to use as they use english words, and are thus easy to understand what you're storing/fetching... much better than, say numbers. Imagine you had:

thing_attrs = {0 => "My thing", 1 => "blue", 2 => 6}
thing_attrs[1] # => "blue"

It would be annoying and hard to remember that attribute 1 is the colour... it's much nicer to give names that you can read when you're reading the code. Thus we have two options: a string, or a symbol.

There would be very little difference between the two. A string is definitely usable eg:

thing_attrs = {"name" => "My thing", "colour" => "blue", "size" => 6}
thing_attrs["colour"] # 'blue'

except that as we know... symbols use less memory. Not a lot less, but enough less that in a large program, over time, you will notice it.
So it has become a ruby-standard to use symbols instead.

Create Hash with strings or symbols as keys

Both ways are correct. In the first case, the hash key will be symbols, in the second case, they will be strings.

Generally speaking, it's common to use symbols as hash keys as they are slightly more efficient since when you type the same symbol more than once, it just gets allocated once, conversely if you type the same string N times, it will be allocated N times.

In fact, there is even a shortcut for it.

boxes = [
{ name: "playground", ip: "19" },
{ name: "elkstack", ip: "22" },
{ name: "audit", ip: "23" }
]

How are Symbols faster than Strings in Hash lookups?

There's no obligation for hash to be equivalent to object_id. Those two things serve entirely different purposes. The point of hash is to be as deterministic and yet random as possible so that the values you're inserting into your hash are evenly distributed. The point of object_id is to define a unique object identifier, though there's no requirement that these be random or evenly distributed. In fact, randomizing them is counter-productive, that'd just make things slower for no reason.

The reason symbols tend to be faster is because the memory for them is allocated once (garbage collection issues aside) and recycled for all instances of the same symbol. Strings are not like that. They can be constructed in a multitude of ways, and even two strings that are byte-for-byte identical are likely to be different objects. In fact, it's safer to presume they are than otherwise unless you know for certain they're the same object.

Now when it comes to computing hash, the value must be randomly different even if the string changes very little. Since the symbol can't change computing it can be optimized more. You could just compute a hash of the object_id since that won't change, for example, while the string needs to take into account the content of itself, which is presumably dynamic.

Try benchmarking things:

require 'benchmark'

count = 100000000

Benchmark.bm do |bm|
bm.report('Symbol:') do
count.times { :symbol.hash }
end
bm.report('String:') do
count.times { "string".hash }
end
end

This gives me results like this:

       user     system      total        real
Symbol: 6.340000 0.020000 6.360000 ( 6.420563)
String: 11.380000 0.040000 11.420000 ( 11.454172)

Which in this most trivial case is easily 2x faster. Based on some basic testing the performance of the string code degrades O(N) as the strings get longer but the symbol times remain constant.

When to use symbols in Ruby

Symbols, or "internals" as they're also referred to as, are useful for hash keys, common arguments, and other places where the overhead of having many, many duplicate strings with the same value is inefficient.

For example:

params[:name]
my_function(with: { arguments: [ ... ] })
record.state = :completed

These are generally preferable to strings because they will be repeated frequently.

The most common uses are:

  • Hash keys
  • Arguments to methods
  • Option flags or enum-type property values

It's better to use strings when handling user data of an unknown composition. Unlike strings which can be garbage collected, symbols are permanent. Converting arbitrary user data to symbols may fill up the symbol table with junk and possibly crash your application if someone's being malicious.

For example:

user_data = JSON.load(...).symbolize_keys

This would allow an attacker to create JSON data with intentionally long, randomized names that, in time, would bloat your process with all kinds of useless junk.

When to use symbols instead of strings in Ruby?

TL;DR

A simple rule of thumb is to use symbols every time you need internal identifiers. For Ruby < 2.2 only use symbols when they aren't generated dynamically, to avoid memory leaks.

Full answer

The only reason not to use them for identifiers that are generated dynamically is because of memory concerns.

This question is very common because many programming languages don't have symbols, only strings, and thus strings are also used as identifiers in your code. You should be worrying about what symbols are meant to be, not only when you should use symbols. Symbols are meant to be identifiers. If you follow this philosophy, chances are that you will do things right.

There are several differences between the implementation of symbols and strings. The most important thing about symbols is that they are immutable. This means that they will never have their value changed. Because of this, symbols are instantiated faster than strings and some operations like comparing two symbols is also faster.

The fact that a symbol is immutable allows Ruby to use the same object every time you reference the symbol, saving memory. So every time the interpreter reads :my_key it can take it from memory instead of instantiate it again. This is less expensive than initializing a new string every time.

You can get a list all symbols that are already instantiated with the command Symbol.all_symbols:

symbols_count = Symbol.all_symbols.count # all_symbols is an array with all 
# instantiated symbols.
a = :one
puts a.object_id
# prints 167778

a = :two
puts a.object_id
# prints 167858

a = :one
puts a.object_id
# prints 167778 again - the same object_id from the first time!

puts Symbol.all_symbols.count - symbols_count
# prints 2, the two objects we created.

For Ruby versions before 2.2, once a symbol is instantiated, this memory will never be free again. The only way to free the memory is restarting the application. So symbols are also a major cause of memory leaks when used incorrectly. The simplest way to generate a memory leak is using the method to_sym on user input data, since this data will always change, a new portion of the memory will be used forever in the software instance. Ruby 2.2 introduced the symbol garbage collector, which frees symbols generated dynamically, so the memory leaks generated by creating symbols dynamically it is not a concern any longer.

Answering your question:

Is it true I have to use a symbol instead of a string if there is at least two the same strings in my application or script?

If what you are looking for is an identifier to be used internally at your code, you should be using symbols. If you are printing output, you should go with strings, even if it appears more than once, even allocating two different objects in memory.

Here's the reasoning:

  1. Printing the symbols will be slower than printing strings because they are cast to strings.
  2. Having lots of different symbols will increase the overall memory usage of your application since they are never deallocated. And you are never using all strings from your code at the same time.

Use case by @AlanDert

@AlanDert: if I use many times something like %input{type: :checkbox} in haml code, what should I use as checkbox?

Me: Yes.

@AlanDert: But to print out a symbol on html page, it should be converted to string, shouldn't it? what's the point of using it then?

What is the type of an input? An identifier of the type of input you want to use or something you want to show to the user?

It is true that it will become HTML code at some point, but at the moment you are writing that line of your code, it is mean to be an identifier - it identifies what kind of input field you need. Thus, it is used over and over again in your code, and have always the same "string" of characters as the identifier and won't generate a memory leak.

That said, why don't we evaluate the data to see if strings are faster?

This is a simple benchmark I created for this:

require 'benchmark'
require 'haml'

str = Benchmark.measure do
10_000.times do
Haml::Engine.new('%input{type: "checkbox"}').render
end
end.total

sym = Benchmark.measure do
10_000.times do
Haml::Engine.new('%input{type: :checkbox}').render
end
end.total

puts "String: " + str.to_s
puts "Symbol: " + sym.to_s

Three outputs:

# first time
String: 5.14
Symbol: 5.07
#second
String: 5.29
Symbol: 5.050000000000001
#third
String: 4.7700000000000005
Symbol: 4.68

So using smbols is actually a bit faster than using strings. Why is that? It depends on the way HAML is implemented. I would need to hack a bit on HAML code to see, but if you keep using symbols in the concept of an identifier, your application will be faster and reliable. When questions strike, benchmark it and get your answers.



Related Topics



Leave a reply



Submit