Array Size Too Big - Ruby


An array with 500 million elements needs roughly 2 GiB just for its element slots (4 bytes per element on a 32-bit build), which – depending on the specific OS you are using – is typically the maximum a 32-bit process can address. In other words: your array is bigger than your address space.

So, the solutions are obvious: either make the array smaller (by, say, breaking it up into chunks) or make the address space bigger (on Linux, you can patch the kernel to get 3, 3.5 or even 4 GiB of user address space, and of course switching to a 64-bit OS and a 64-bit Ruby build(!) would also work).

Alternatively, rethink your approach: maybe use mmap instead of an array, or lazy-load only the parts you actually need.
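For example, if the data comes from a file, a chunked pass keeps only a small window in memory at a time. This is a minimal sketch; the file name, chunk size, and the process method are placeholders rather than anything from the original question:

# Process a huge data set in fixed-size chunks instead of loading it all into one array.
# "huge_data.txt", the chunk size, and `process` are illustrative placeholders.
File.foreach("huge_data.txt").each_slice(10_000) do |chunk|
  # Only ~10,000 lines are resident at a time, not 500 million elements.
  chunk.each { |line| process(line) }
end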

Is there a limit on the size of an array in Ruby?

There is no hard limit imposed by Ruby itself, but there is a limit to how much the process can support. A regular home server running Ruby can handle an array until it becomes too large, at which point the process starts to bog down, lag, and eventually crash. An extremely powerful corporate server can handle a much larger array, but it too will eventually lag or crash once the array becomes too large for the process and the hardware (memory) to handle.

I don't have any concrete numbers for you, because it all depends on the hardware and software on the server.

Does the size of the objects in a Ruby array contribute to the memory usage of the array itself?

In MRI Ruby, an array does not copy and contain the values of all the objects inside it. The array contains a series of pointers in memory. So the size of an array is the size of the struct needed for the array's internal bookkeeping (e.g. capacity, length), plus the size of one pointer (a void*: 4 bytes on a 32-bit architecture, 8 bytes on a 64-bit one) per object in the array. The objects in the array occupy their own memory space elsewhere.
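You can see this with the objspace extension that ships with MRI: ObjectSpace.memsize_of reports the array's own footprint, which stays roughly the same no matter how big the referenced objects are. A rough sketch (the numbers are implementation-dependent and approximate):

require 'objspace'

small_strings = Array.new(1_000) { "x" }
big_strings   = Array.new(1_000) { "x" * 10_000 }

# Both arrays hold 1,000 element slots, so their own footprint is nearly identical,
# even though the strings they reference differ enormously in size.
ObjectSpace.memsize_of(small_strings)  # => roughly 8,000 bytes on a 64-bit build
ObjectSpace.memsize_of(big_strings)    # => roughly the same; the strings live elsewhere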

How can I set the length of an array?

There are various ways to do this.

If you just want an array of nils of length size then:

a = Array.new(size)

If you want an array of something other than nil, then:

a = Array.new(size, default_value)
a = [default_value] * size

will work but beware that these can lead to problems if default_value is mutable. Both of these share one reference for all elements so strange things can happen if you're not aware of the reference sharing; for example:

a = Array.new(6, 'pointer')
# ["pointer", "pointer", "pointer", "pointer", "pointer", "pointer"]
a[0].upcase!
# "POINTER"
a
# ["POINTER", "POINTER", "POINTER", "POINTER", "POINTER", "POINTER"]

Many things in Ruby are mutable so Array.new(size, default_value) is rarely what you want. If you're using booleans, nils, numbers, frozen strings, symbols, and the like then this is fine because you can't change those and the reference sharing won't bite you.

You can also supply a block to Array.new:

a = Array.new(6) { 'unique' }

That block will be executed for each entry, so you won't get any surprise reference sharing:

a = Array.new(6) { 'unique' }
# ["unique", "unique", "unique", "unique", "unique", "unique"]
a[0].upcase!
# "UNIQUE"
a
# ["UNIQUE", "unique", "unique", "unique", "unique", "unique"]

This is usually the one you want. Sure, it can consume more memory and take more time to create the array but if either of those is significant then you're probably taking the wrong approach in Ruby.

Ruby big array and memory

If I use the Ruby Garbage Collection Profiler on a lightly modified version of your code:

GC::Profiler.enable
GC::Profiler.clear

a = []
5_000_000.times do
  a << [rand(36**10).to_s(36)]
end

puts "\n size is #{a.size}"
a = []

GC::Profiler.report

I get the following output (on Ruby 1.9.3; some columns and rows removed):

GC 60 invokes.
Index   Invoke Time(sec)   Use Size(byte)   Total Size(byte)   ...
1       0.109              131136           409200             ...
2       0.125              192528           409200             ...
...
58      33.484             199150344        260938656          ...
59      36.000             211394640        260955024          ...

The profile starts with 131,136 bytes in use and ends with 211,394,640 bytes in use, without ever decreasing during the run, so we can assume that no garbage collection took place.

If I then add a line of code that appends a single element to the array a, placed after a has grown to 5 million elements and then had an empty array assigned to it:

GC::Profiler.enable
GC::Profiler.clear

a = []
5_000_000.times do
  a << [rand(36**10).to_s(36)]
end

puts "\n size is #{a.size}"
a = []

# the only change is to add one element to the (now) empty array a
a << [rand(36**10).to_s(36)]

GC::Profiler.report

This changes the profiler output to (some columns and rows removed):

GC 62 invokes.
Index   Invoke Time(sec)   Use Size(byte)   Total Size(byte)   ...
1       0.156              131376           409200             ...
2       0.172              192792           409200             ...
...
59      35.375             211187736        260955024          ...
60      36.625             211395000        469679760          ...
61      41.891             2280168          307832976          ...

This profiler run again starts with 131,376 bytes in use, similar to the previous run, and grows, but it ends with only 2,280,168 bytes in use, significantly lower than the 211,394,640 bytes the previous run ended with. We can assume that garbage collection took place during this run, probably triggered by our new line of code that adds an element to a.

The short answer is no, you don't need to do anything special to remove the data that was assigned to a, but hopefully this gives you the tools to prove it.
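If you'd rather not wait for the next automatic collection, you can drop the reference and trigger one yourself. This is just a sketch of the idea; in most programs letting the GC decide on its own is perfectly fine:

a = []               # or a = nil; either way the old 5-million-element array is unreachable
GC.start             # optionally force a collection right away instead of waiting
GC::Profiler.report  # the report should now show the used size dropping back down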

Array#product RangeError: too big to product

What you want is impossible.

The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.

This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64bit of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibyte on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.

Does anyone know a workaround? Some way to chunk them?

Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
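If you want to guard against hitting this in code, a sketch of the idea is to compute the number of combinations up front before ever calling Array#product. The arrays below are an illustrative stand-in for the real 93 arrays of ~18 elements, and the threshold is arbitrary:

arrays = Array.new(93) { (1..18).to_a }   # stand-in data, not the asker's real input

total = arrays.map(&:size).reduce(1, :*)
# => a number well over 100 digits long; impossible to materialize or even iterate

if total > 1_000_000
  raise "Refusing to build #{total} combinations; rethink the algorithm instead"
else
  combos = arrays.first.product(*arrays.drop(1))
end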

NoMemoryError in Array#max

As documented, if n is given then the maximum n elements are returned, which means an array of size n is allocated as well; that allocation is the cause of your NoMemoryError.
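In other words, the n you pass to max directly controls an allocation. A small illustration (the huge n in the last line is only there to show how the error arises; don't actually run it):

[3, 1, 4, 1, 5].max      # => 5        (no extra array allocated)
[3, 1, 4, 1, 5].max(2)   # => [5, 4]   (allocates a 2-element result array)

# Asking for an enormous n makes Ruby try to set up a result of that size up front:
# [3, 1, 4, 1, 5].max(10**12)   # likely raises NoMemoryError (or RangeError)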

Quicksort not working with little larger array size

Your implementation of the quicksort algorithm is not correct. In the line:

startSort(array, pivot + 1, @tail)

you always call the startSort method with pivot + 1 and array.size - 1, because @tail is an instance variable: it is assigned @array.size - 1 only once and its value never changes. However, simply changing this line to

startSort(array, pivot + 1, tail)

is not enough to fix your code. With this change it runs fast even for large arrays, but it produces an incorrect answer. The line should actually be:

startSort(array, pivot, tail)

With this change, it works fast for large arrays and sorts the array properly.
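Since the original startSort isn't shown in full here, the following is a minimal sketch of an in-place quicksort in Ruby whose recursion matches the corrected calls above; the method name and the two-pointer partition scheme are illustrative, not the asker's exact code:

# In-place quicksort with a two-pointer partition. Note that the recursion bounds
# (head, tail) are plain local parameters, never instance variables.
def start_sort(array, head = 0, tail = array.size - 1)
  return array if head >= tail

  pivot_value = array[(head + tail) / 2]
  i = head
  j = tail
  while i <= j
    i += 1 while array[i] < pivot_value
    j -= 1 while array[j] > pivot_value
    if i <= j
      array[i], array[j] = array[j], array[i]
      i += 1
      j -= 1
    end
  end

  start_sort(array, head, j)   # left part, up to where the scans crossed
  start_sort(array, i, tail)   # right part
  array
end

p start_sort([5, 3, 8, 1, 9, 2])  # => [1, 2, 3, 5, 8, 9]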

Count, size, length...too many choices in Ruby?

For arrays and hashes size is an alias for length. They are synonyms and do exactly the same thing.

count is more versatile - it can take an element or predicate and count only those items that match.

> [1,2,3].count{|x| x > 2 }
=> 1

In the case where you don't provide a parameter to count it has basically the same effect as calling length. There can be a performance difference though.

We can see from the source code for Array that they do almost exactly the same thing. Here is the C code for the implementation of array.length:

static VALUE
rb_ary_length(VALUE ary)
{
    long len = RARRAY_LEN(ary);
    return LONG2NUM(len);
}

And here is the relevant part from the implementation of array.count:

static VALUE
rb_ary_count(int argc, VALUE *argv, VALUE ary)
{
    long n = 0;

    if (argc == 0) {
        VALUE *p, *pend;

        if (!rb_block_given_p())
            return LONG2NUM(RARRAY_LEN(ary));

        // etc..
    }
}

The code for array.count does a few extra checks but in the end calls the exact same code: LONG2NUM(RARRAY_LEN(ary)).

Hashes (source code), on the other hand, don't seem to implement their own optimized version of count, so the implementation from Enumerable (source code) is used, which iterates over all the elements and counts them one by one.
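The difference is easy to observe for a large Hash, where size reads a stored value while count walks every pair. A quick, unscientific sketch (the hash contents and iteration counts are arbitrary):

require 'benchmark'

h = Hash[(1..1_000_000).map { |i| [i, i] }]

Benchmark.bm(12) do |x|
  x.report("Hash#size")  { 1_000.times { h.size } }   # O(1) per call
  x.report("Hash#count") { 1_000.times { h.count } }  # O(n) per call: iterates every pair
end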

In general I'd advise using length (or its alias size) rather than count if you want to know how many elements there are altogether.


Regarding ActiveRecord, on the other hand, there are important differences. Check out this post:

  • Counting ActiveRecord associations: count, size or length?

Are there performance reasons to prefer size over length or count in Ruby?

It seems a bit wrong: for the most commonly used cases (Array, Hash, String), size and length are either aliases or are implemented in the same way (you can read more here or check the implementation of each method), and they run in O(1).

count however:

  • For Hash, count is not redefined and falls back to Enumerable#count, which means its complexity is O(n), since every key-value pair is traversed.
  • For Array, it is redefined (Array#count), and at the very least it checks the number of arguments given, which is something neither Array#size nor Array#length has to do.
  • For String, count counts occurrences of characters from the given set, which is a different operation entirely (see the snippet below).
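To make these differences concrete, here's a small snippet; the return values are what current MRI gives:

arr = [1, 2, 2, 3]
arr.size       # => 4  (O(1): reads the stored length)
arr.length     # => 4  (alias of size)
arr.count      # => 4  (checks its arguments, then returns the same stored length)
arr.count(2)   # => 2  (counts matching elements: O(n))

h = { a: 1, b: 2 }
h.size    # => 2  (O(1))
h.count   # => 2  (Enumerable#count: traverses every pair, O(n))

"banana".count("a")   # => 3  (counts characters from the given set)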

All in all, I would say that

Prefer size or length over count for performance reasons.

would be more accurate.


