Ruby Symbols Are Not Garbage Collected!? Then, Isn't It Better to Use a String

Seeing as symbols are almost always created via literals, there isn't much potential for a memory explosion here. Their behavior is pretty much required by their usage: every time you refer to a symbol, it's the same one.

Strings, by contrast, need to be distinct objects in Ruby: two strings with the same content are still separate, mutable objects. This is due to the way they're used (text processing etc.).
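A minimal sketch of that difference (the variable names are only for illustration). dup is used so that the two strings are guaranteed to be fresh objects, whatever frozen-string-literal setting is in effect:

```ruby
# Two references to the same symbol are the very same object.
sym_same = :status.equal?(:status)

# Two strings with the same content are equal, but they are separate objects.
first  = "status".dup
second = "status".dup
str_equal = (first == second)
str_same  = first.equal?(second)

puts "symbols identical: #{sym_same}"   # true
puts "strings equal:     #{str_equal}"  # true
puts "strings identical: #{str_same}"   # false
```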

Decide which one to use based on their semantics, and don't optimize prematurely.

Why don't more projects use Ruby Symbols instead of Strings?

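In Ruby, once the AST has been created, each symbol is represented by a unique integer. Using symbols as hash keys makes lookups considerably faster, because the main operation is comparison, and comparing two integers is cheaper than comparing two strings character by character.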
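A small sketch of symbol keys in use (the options hash is made up for illustration). It also shows a common pitfall: a symbol key and a string key with the same text are different keys.

```ruby
# Symbol keys: the usual choice for internal identifiers.
options = { host: "localhost", port: 5432 }

port_by_symbol = options[:port]    # the key exists
port_by_string = options["port"]   # nil: "port" and :port are different keys

puts port_by_symbol.inspect
puts port_by_string.inspect
```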

When to use symbols instead of strings in Ruby?

TL;DR

A simple rule of thumb is to use symbols every time you need internal identifiers. For Ruby < 2.2 only use symbols when they aren't generated dynamically, to avoid memory leaks.

Full answer

The only reason not to use them for dynamically generated identifiers is memory.

This question is very common because many programming languages don't have symbols, only strings, and thus strings are also used as identifiers in your code. You should be worrying about what symbols are meant to be, not only when you should use symbols. Symbols are meant to be identifiers. If you follow this philosophy, chances are that you will do things right.

There are several differences between the implementation of symbols and strings. The most important thing about symbols is that they are immutable. This means that their value will never change. Because of this, symbols are instantiated faster than strings, and some operations, like comparing two symbols, are also faster.

The fact that a symbol is immutable allows Ruby to use the same object every time you reference the symbol, saving memory. So every time the interpreter reads :my_key it can take it from memory instead of instantiating it again. This is less expensive than initializing a new string every time.
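To make the immutability concrete, here is a short sketch using only core methods (the variable names are illustrative):

```ruby
# Symbols are always frozen and have no in-place mutation methods such as <<.
symbol_frozen     = :my_key.frozen?
symbol_can_append = :my_key.respond_to?(:<<)

# A non-frozen string, on the other hand, can be changed in place.
text = "my_key".dup
text << "_v2"

# Methods such as upcase return another symbol instead of modifying the receiver.
upper = :my_key.upcase

puts symbol_frozen      # true
puts symbol_can_append  # false
puts text               # my_key_v2
puts upper.inspect      # :MY_KEY
```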

You can get a list of all symbols that are already instantiated with the method Symbol.all_symbols:

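# Symbol.all_symbols returns an array with all symbols instantiated so far.
symbols_count = Symbol.all_symbols.count

a = :one
puts a.object_id
# prints e.g. 167778 (the exact number varies between Ruby versions and runs)

a = :two
puts a.object_id
# prints e.g. 167858

a = :one
puts a.object_id
# prints 167778 again - the same object_id as the first time!

puts Symbol.all_symbols.count - symbols_count
# prints 2 (the two symbols we created) when typed line by line into irb and
# neither symbol existed before; run as a script it may print 0, because
# symbol literals are registered when the file is parsed.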

For Ruby versions before 2.2, once a symbol is instantiated, its memory is never freed. The only way to free the memory is to restart the application. So symbols are also a major cause of memory leaks when used incorrectly. The simplest way to create a memory leak is to call to_sym on user input: since this data keeps changing, each new value permanently occupies another piece of memory in the running process. Ruby 2.2 introduced symbol garbage collection, which frees dynamically generated symbols, so memory leaks caused by creating symbols dynamically are no longer a concern.
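On older Rubies (or simply to keep input validation explicit), one way to avoid calling to_sym on arbitrary input is to map the few accepted values to symbols through a fixed table. This is only a sketch; SORT_KEYS and sort_key_for are hypothetical names, not part of any library:

```ruby
# Hypothetical table of the only identifiers the application understands.
SORT_KEYS = {
  "name"       => :name,
  "created_at" => :created_at
}.freeze

# Returns a known symbol for the input, or the default. It never calls to_sym
# on the input itself, so unknown values cannot create new symbols.
def sort_key_for(user_input, default: :name)
  SORT_KEYS.fetch(user_input.to_s, default)
end

puts sort_key_for("created_at").inspect   # :created_at
puts sort_key_for("no_such_key").inspect  # :name
```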

Answering your question:

Is it true I have to use a symbol instead of a string if there is at least two the same strings in my application or script?

If what you are looking for is an identifier to be used internally in your code, you should be using symbols. If you are printing output, you should go with strings, even if the text appears more than once and even if that means allocating two different objects in memory.

Here's the reasoning:

  1. Printing symbols will be slower than printing strings because they have to be converted to strings first.
  2. Having lots of different symbols will increase the overall memory usage of your application, since (before Ruby 2.2) they are never deallocated, whereas you never use all the strings from your code at the same time.

Use case by @AlanDert

@AlanDert: if I use many times something like %input{type: :checkbox} in haml code, what should I use as checkbox?

Me: Yes.

@AlanDert: But to print out a symbol on html page, it should be converted to string, shouldn't it? what's the point of using it then?

What is the type of an input? An identifier of the type of input you want to use or something you want to show to the user?

It is true that it will become HTML code at some point, but at the moment you are writing that line of your code, it is meant to be an identifier: it identifies what kind of input field you need. Thus, it is used over and over again in your code, always has the same "string" of characters as the identifier, and won't generate a memory leak.
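As an illustration of an identifier turning into output only at the last moment, here is a hypothetical helper (render_input is made up for this sketch and is not how HAML is implemented; it also skips HTML escaping). String interpolation converts the symbol to a string when the markup is built:

```ruby
# Hypothetical helper: renders an attribute hash whose values may be symbols
# or strings. No HTML escaping is done; this is a sketch only.
def render_input(attributes)
  attrs = attributes.map { |name, value| %(#{name}="#{value}") }.join(" ")
  "<input #{attrs}>"
end

with_symbol = render_input(type: :checkbox)
with_string = render_input(type: "checkbox")

puts with_symbol   # <input type="checkbox">
puts with_string   # <input type="checkbox">
```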

That said, why don't we evaluate the data to see if strings are faster?

This is a simple benchmark I created for this:

require 'benchmark'
require 'haml'

# Note: Haml::Engine.new(...).render is the API of the Haml release this was
# written against; newer major versions may need a different call.
str = Benchmark.measure do
  10_000.times do
    Haml::Engine.new('%input{type: "checkbox"}').render
  end
end.total

sym = Benchmark.measure do
  10_000.times do
    Haml::Engine.new('%input{type: :checkbox}').render
  end
end.total

puts "String: " + str.to_s
puts "Symbol: " + sym.to_s

Three outputs:

# first time
String: 5.14
Symbol: 5.07
# second time
String: 5.29
Symbol: 5.050000000000001
# third time
String: 4.7700000000000005
Symbol: 4.68

So using symbols is actually a bit faster than using strings. Why is that? It depends on the way HAML is implemented. I would need to dig into the HAML code to see, but if you keep using symbols wherever the concept is an identifier, your application will be faster and more reliable. When questions strike, benchmark it and get your answers.

Aren't modern computers powerful enough to handle Strings without needing to use Symbols (in Ruby)?

Your computer may well be able to handle "a little bit of extra garbage collection", but what about when that "little bit" takes place in an inner loop that runs millions of times? What about when it's running on an embedded system with limited memory?
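To get a feel for how much garbage a loop produces, you can count allocations directly. A rough sketch, assuming MRI (GC.stat(:total_allocated_objects) is an MRI counter); the allocations helper is a made-up name, and String.new stands in for "building a string on every pass":

```ruby
# Counts how many objects Ruby allocates while running the given block.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

iterations = 10_000

# String.new guarantees a fresh string on each pass, whatever the
# frozen-string-literal setting is.
string_allocs = allocations { iterations.times { String.new("status") } }

# A symbol literal refers to the same object every time.
symbol_allocs = allocations { iterations.times { :status } }

puts "objects allocated with strings: #{string_allocs}"
puts "objects allocated with symbols: #{symbol_allocs}"
```

The exact numbers depend on the Ruby version, so measure on the interpreter you actually deploy.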

There are a lot of places you can get away with using strings willy-nilly, but in some you can't. It all depends on the context.

Why are symbols not frozen strings?

This answer is drastically different from my original answer, but I ran into a couple of interesting threads on the Ruby mailing list. (Both are good reads.)

So, at one point in 2006, matz implemented the Symbol class as Symbol < String. Then the Symbol class was stripped down to remove any mutability. So a Symbol was in fact an immutable String.

However, it was reverted. The reason given was

Even though it is highly against DuckTyping, people tend to use case
on classes, and Symbol < String often cause serious problems.

So the answer to your question is still: a Symbol is like a String, but it isn't.
The problem isn't that a Symbol shouldn't be String, but instead that it historically wasn't.
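The kind of code that quote refers to looks roughly like the sketch below (kind_of_value is a made-up name). With today's class hierarchy the two branches stay separate; had Symbol remained a subclass of String, the first branch would have matched symbols as well:

```ruby
# Classifies a value the way a `case` on classes does.
def kind_of_value(value)
  case value
  when String then "string"
  when Symbol then "symbol"
  else "other"
  end
end

symbol_kind = kind_of_value(:name)
string_kind = kind_of_value("name")

# A Symbol is string-like in some ways, but it is not a String.
symbol_is_string = :name.is_a?(String)
equal_to_string  = (:name == "name")
size             = :name.length
converted        = :name.to_s

puts symbol_kind        # symbol
puts string_kind        # string
puts symbol_is_string   # false
puts equal_to_string    # false
puts size               # 4
puts converted          # name
```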

How are Symbols faster than Strings in Hash lookups?

There's no obligation for hash to be equivalent to object_id. Those two things serve entirely different purposes. The point of hash is to be as deterministic and yet random as possible so that the values you're inserting into your hash are evenly distributed. The point of object_id is to define a unique object identifier, though there's no requirement that these be random or evenly distributed. In fact, randomizing them is counter-productive; that'd just make things slower for no reason.

The reason symbols tend to be faster is because the memory for them is allocated once (garbage collection issues aside) and recycled for all instances of the same symbol. Strings are not like that. They can be constructed in a multitude of ways, and even two strings that are byte-for-byte identical are likely to be different objects. In fact, it's safer to presume they are than otherwise unless you know for certain they're the same object.

Now when it comes to computing hash, the value must be randomly different even if the string changes very little. Since the symbol can't change, computing its hash can be optimized more. You could just compute a hash of the object_id since that won't change, for example, while the string needs to take into account its own content, which is presumably dynamic.
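The distinction between hash and object_id can be checked directly. A small sketch using only core methods (variable names are illustrative):

```ruby
a = "identifier".dup
b = "identifier".dup

# Equal content gives an equal hash value (within one process), even though
# the two strings are different objects.
same_string_hash = (a.hash == b.hash)
same_string_id   = (a.object_id == b.object_id)

# A string's hash depends on its content, so it changes when the string does.
hash_before  = a.hash
a << "!"
hash_changed = (a.hash != hash_before)

# A symbol is a single object, so both values are trivially stable.
same_symbol_hash = (:identifier.hash == :identifier.hash)
same_symbol_id   = (:identifier.object_id == :identifier.object_id)

puts same_string_hash   # true
puts same_string_id     # false
puts hash_changed       # true
puts same_symbol_hash   # true
puts same_symbol_id     # true
```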

Try benchmarking things:

require 'benchmark'

count = 100_000_000

Benchmark.bm do |bm|
  bm.report('Symbol:') do
    count.times { :symbol.hash }
  end
  bm.report('String:') do
    count.times { "string".hash }
  end
end

This gives me results like this:

              user     system      total        real
Symbol:   6.340000   0.020000   6.360000 (  6.420563)
String:  11.380000   0.040000  11.420000 ( 11.454172)

In this most trivial case, symbols are roughly 2x faster. Based on some basic testing, the time for the string code grows linearly, O(N), as the strings get longer, while the symbol times remain constant.

Using Ruby Symbols

In short, symbols are lightweight strings, but they are also immutable and, before Ruby 2.2, not garbage-collected.

You should not use them as immutable strings in your data processing tasks (remember that before Ruby 2.2, once a symbol is created, it can't be destroyed). You typically use symbols for naming things.

# typical use cases

# access hash value
user = User.find(params[:id])

# name something
attr_accessor :first_name

# set hash value in opts parameter
db.collection.update(query, update, multi: true, upsert: true)

Let's take the first example, params[:id]. In a moderately big Rails app there may be hundreds or thousands of those scattered around the codebase. If we accessed that value with a string, params["id"], that would mean a new string allocation each time (and that string needs to be collected afterwards). In the case of a symbol, it's actually the same symbol everywhere. Less work for the memory allocator, the garbage collector and even you (: is faster to type than "").

If you have a simple one-word string that appears often in your code and you don't do something funky to it (interpolation, gsub, upcase, etc), then it's likely a good candidate to be a symbol.

However, does this apply only to text that is used as part of the actual program logic such as naming, not text that you get while actually running the program...such as text from the user/web etc?

I cannot think of a single case where I'd want to turn data from the user/web into a symbol (except for parsing command-line options, maybe), mainly because of the consequences (before Ruby 2.2, symbols live forever once created).

Also, many editors provide different coloring for symbols, to highlight them in the code.

[Image: symbol vs string]


