ActiveRecord find_each Combined with Limit and Order

ActiveRecord find_each combined with limit and order

The documentation says that find_each and find_in_batches don't retain your sort order and limit because:

  • Sorting ASC on the PK is used to make the batch ordering work.
  • Limit is used to control the batch sizes.
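Under the hood, each batch query already consumes both clauses, ordering on the primary key and limiting to the batch size, roughly like this (a sketch of the pattern, not the exact SQL Rails emits):

# First batch:  SELECT * FROM things ORDER BY things.id ASC LIMIT 1000
# Next batches: SELECT * FROM things WHERE things.id > <last id seen>
#               ORDER BY things.id ASC LIMIT 1000
Thing.find_each(batch_size: 1000) do |thing|
  # ORDER BY and LIMIT are both taken by the batching itself, which is
  # why a custom order or limit on the relation is ignored
end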

You could write your own version of this function, as @rorra did. But you can get into trouble when mutating the objects: if, for example, you sort by created_at and save the object, it might come up again in one of the next batches. Similarly, you might skip objects because the order of results changed while executing the query for the next batch. Only use that solution with read-only objects.
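To make the failure mode concrete, here is a sketch of a naive offset-based batcher ordered by created_at, the kind of hand-rolled loop described above; the comment marks where mutation breaks it:

offset = 0
loop do
  batch = Thing.order(:created_at).limit(1000).offset(offset).to_a
  break if batch.empty?
  batch.each do |thing|
    # Updating created_at moves the row within the sort order, so a later
    # batch can return it again or skip one of its neighbours
    thing.update!(created_at: Time.current)
  end
  offset += 1000
end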

Now my primary concern was that I didn't want to load 30000+ objects into memory at once. My concern was not the execution time of the query itself. Therefore I used a solution that executes the original query but caches only the IDs. It then divides the array of IDs into chunks and queries/creates the objects per chunk. This way you can safely mutate the objects, because the sort order is kept in memory.

Here is a minimal example similar to what I did:

batch_size = 512
ids = Thing.order(created_at: :desc).pluck(:id) # Replace the order with your own scope
ids.each_slice(batch_size) do |chunk|
  # Arel.sql marks the raw FIELD() fragment as trusted (required on modern Rails)
  Thing.where(id: chunk).order(Arel.sql("field(id, #{chunk.join(',')})")).each do |thing|
    # Do things with thing
  end
end

The trade-offs to this solution are:

  • The complete query is executed to get the IDs
  • An array of all the IDs is kept in memory
  • Uses the MySQL-specific FIELD() function (a portable alternative is sketched below)
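If you are not on MySQL, one portable workaround (my own variation, not part of the original answer) is to fetch each chunk unordered and restore the order in Ruby from the id list itself:

ids.each_slice(batch_size) do |chunk|
  things_by_id = Thing.where(id: chunk).index_by(&:id)
  chunk.each do |id|
    thing = things_by_id[id]
    # Do things with thing, still in created_at DESC order
  end
end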

Hope this helps!

Using limit and offset in rails together with updated_at and find_each - will that cause a problem?

As many have noted in the comments, it seems like using find_each will ignore the order and limit. I found this answer (ActiveRecord find_each combined with limit and order) that seems to be working for me. It's not working 100%, but it is a definite improvement. The rest seems to be a memory issue, i.e. I cannot have too many processes running at the same time on Heroku.

Is there any way to order by specific column when using find_each?

I suppose you could add a pagination gem (you may already have one in your Gemfile, such as will_paginate or kaminari).

That would let you do...

total_batches = (Book.count / 50.0).ceil
(1..total_batches).each do |batch|
  Book.order(:name).paginate(page: batch, per_page: 50).each do |book|
    # do stuff
  end
end
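If you would rather not add a gem, the same loop can be written with plain limit and offset (a sketch; note that OFFSET pagination gets progressively slower on large tables, since each page re-scans the rows before it):

batch_size = 50
offset = 0
loop do
  batch = Book.order(:name).limit(batch_size).offset(offset).to_a
  break if batch.empty?
  batch.each do |book|
    # do stuff
  end
  offset += batch_size
end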

Ordered batches clear solution

Straight from the docs:

NOTE: It’s not possible to set the order. That is automatically set to
ascending on the primary key (“id ASC”) to make the batch ordering
work. This also means that this method only works when the primary key
is orderable (e.g. an integer or string).

The reason it is deliberately limited to primary-key order is that those values don't change. So if you mutate the data as you're traversing it, you don't get repeated records back.

In the case of id: :desc you will not get new records that were inserted after the query for the initial batch was started.
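For what it is worth, newer Rails versions (6.1 and later, if I remember the release correctly) do accept a direction for the primary-key ordering, with exactly the caveat above about late inserts:

# Walks the primary key from highest to lowest; rows inserted after the
# first batch query starts are never visited
Thing.find_each(order: :desc) do |thing|
  # process thing
end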

Refs

https://rails.lighthouseapp.com/projects/8994/tickets/2502-patch-arbase-reverse-find_in_batches

https://ww.telent.net/2012/5/4/changing_sort_order_with_activerecord_find_in_batches

ActiveRecord find_each combined with limit and order

ActiveRecord limit method does not seem to respect order of relation

As we found out in the comments, the problem is that when ORDER BY encounters two rows with identical engagement values, the database is free to return them in any order, and that order can change between queries.

What could help is adding a tie-breaking column to the ORDER clause (for example id):

Company.last.contacts.order(engagement: :desc, id: :asc)
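With the tiebreaker in place, paging the same relation becomes deterministic; a quick sketch using the models from the question:

scope = Company.last.contacts.order(engagement: :desc, id: :asc)
page1 = scope.limit(10)
page2 = scope.offset(10).limit(10)
# Ties on engagement are now broken by id, so a contact can no longer
# show up on both pages across repeated queries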

Using scopes and order with limit and offset

mu is too short's comment explained the behaviour: cmp_id had duplicate values, and the database is not required to sort equal values the same way each time. One way to fix it is to add a secondary key that breaks ties in a consistent fashion.
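A minimal sketch of that fix, assuming a hypothetical Part model with the cmp_id column from the question; a scope bakes the deterministic order in once so every page uses it:

class Part < ApplicationRecord
  # id breaks ties between rows sharing the same cmp_id
  scope :by_cmp, -> { order(:cmp_id, :id) }
end

Part.by_cmp.limit(10).offset(20)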


