Why Is Iterating Through a Large Django Queryset Consuming Massive Amounts of Memory

Why is iterating through a large Django QuerySet consuming massive amounts of memory?

Nate C was close, but not quite.

From the docs:

You can evaluate a QuerySet in the following ways:

  • Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:

    for e in Entry.objects.all():
        print e.headline

So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.

From my reading of the docs, iterator() does nothing more than bypass QuerySet's internal caching mechanisms. I think it might make sense for it to do a one-by-one thing, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.
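
To make the caching point concrete, here is a minimal sketch (it assumes the Entry model from the docs excerpt above) contrasting plain iteration, which populates the QuerySet's result cache, with .iterator(), which skips that cache:

# Plain iteration: the first pass over the QuerySet runs one big query and
# keeps every resulting model instance in the QuerySet's result cache.
entries = Entry.objects.all()
for e in entries:
    print(e.headline)   # all ten million objects are now held by `entries`

# .iterator(): results are not stored on the QuerySet, so each object can be
# garbage-collected once you are done with it. Depending on the database
# driver, the raw rows may still be fetched in one go, but the Python-side
# objects are not all kept alive at once.
for e in Entry.objects.all().iterator():
    print(e.headline)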

Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes:

  • Memory Efficient Django QuerySet iterator
  • batch querysets
  • QuerySet Foreach
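
The common idea behind snippets like the ones above is to fetch rows in bounded chunks so that only one chunk of objects is alive at a time. A minimal sketch of that pattern, chunking on the primary key (the helper name and chunk size are illustrative, not taken from any of those snippets):

def queryset_in_chunks(queryset, chunk_size=1000):
    """Yield objects from `queryset` in primary-key order, `chunk_size` at a time.

    Only one chunk of model instances is materialised at any moment, at the
    cost of one extra query per chunk.
    """
    queryset = queryset.order_by('pk')
    last_pk = None
    while True:
        chunk = queryset if last_pk is None else queryset.filter(pk__gt=last_pk)
        chunk = list(chunk[:chunk_size])   # slicing adds LIMIT to the SQL
        if not chunk:
            break
        for obj in chunk:
            yield obj
        last_pk = chunk[-1].pk

# Usage:
# for entry in queryset_in_chunks(Entry.objects.all(), chunk_size=500):
#     do_something(entry)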

How to iterate a large table in Django without running out of memory?

The first thing to try is using the iterator() method on the queryset before iterating over it:

for ii in MyModel.objects.all().iterator():
    do_something(ii)  # placeholder for whatever you do with each row
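
If you are on a newer Django release (2.0 or later), iterator() also takes a chunk_size argument that controls how many rows are pulled from the database driver per batch; a small sketch, reusing the same MyModel:

# Assumes Django 2.0+, where iterator() accepts chunk_size (default 2000).
for obj in MyModel.objects.all().iterator(chunk_size=2000):
    do_something(obj)  # placeholder for your per-row processing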

Limiting Memory Use in a *Large* Django QuerySet

So what I actually ended up doing was building something you can 'wrap' a QuerySet in. It works by making a deepcopy of the QuerySet and slicing it--e.g., some_queryset[15:45]--and then making another deepcopy of the original QuerySet once that slice has been completely iterated through. This means that only the objects returned in 'this' particular slice are stored in memory.

import copy
import logging

logger = logging.getLogger(__name__)


class MemorySavingQuerysetIterator(object):

    def __init__(self, queryset, max_obj_num=1000):
        self._base_queryset = queryset
        self._generator = self._setup()
        self.max_obj_num = max_obj_num

    def _setup(self):
        for i in xrange(0, self._base_queryset.count(), self.max_obj_num):
            # By making a copy of the queryset and using that to actually
            # access the objects we ensure that there are only `max_obj_num`
            # objects in memory at any given time
            smaller_queryset = copy.deepcopy(self._base_queryset)[i:i + self.max_obj_num]
            logger.debug('Grabbing next %s objects from DB' % self.max_obj_num)
            for obj in smaller_queryset.iterator():
                yield obj

    def __iter__(self):
        return self

    def next(self):
        return self._generator.next()

So instead of...

for obj in SomeObject.objects.filter(foo='bar'):  # something that returns *a lot* of objects
    do_something(obj)

You would do...

for obj in MemorySavingQuerysetIterator(SomeObject.objects.filter(foo='bar')):
    do_something(obj)

Please note that the intention of this is to save memory in your Python interpreter. It essentially does this by making more database queries. Usually people are trying to do the exact opposite of that--i.e., minimize database queries as much as possible without regard to memory usage. Hopefully somebody will find this useful though.
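
(As an aside, the class above is written for Python 2, where xrange and the next() method exist; on Python 3 the same idea can be written as a plain generator. A minimal sketch:)

import copy

def memory_saving_queryset_iterator(queryset, max_obj_num=1000):
    """Python 3 take on the class above, written as a plain generator."""
    for i in range(0, queryset.count(), max_obj_num):
        # Copy the queryset for each slice so the previous slice's result
        # cache can be garbage-collected once it has been consumed.
        smaller_queryset = copy.deepcopy(queryset)[i:i + max_obj_num]
        for obj in smaller_queryset.iterator():
            yield obj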

Iterating over a large Django queryset while the data is changing elsewhere

The following will do the job for you in Django 1.1, no loop necessary:

from django.db.models import F
Book.objects.all().update(activity=F('views')*4)
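
Since update() is executed entirely in the database as a single UPDATE statement, nothing is loaded into Python at all; it simply returns the number of rows it touched. A small sketch:

from django.db.models import F

# The whole calculation happens in SQL; only the affected-row count comes back.
updated = Book.objects.all().update(activity=F('views') * 4)
print('updated %d books' % updated)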

You can have a more complicated calculation too:

for book in Book.objects.all().iterator():
    Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())

Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.

Simple query causes memory leak in Django

You could try something like this:

import datetime

import pytz

from django.core.paginator import Paginator

p = Paginator(CallLog.objects.all().only('cdate'), 2000)
for page in range(1, p.num_pages + 1):
    for i in p.page(page).object_list:
        i.cdate = pytz.utc.localize(datetime.datetime.strptime(i.fixed_date, "%y-%m-%d %H:%M"))
        i.save()

Slicing a QuerySet does not load all the objects into memory just to get a subset; it adds LIMIT and OFFSET to the SQL query before hitting the database.
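
You can see this for yourself by printing the SQL that a sliced QuerySet would run (a sketch; the exact SQL text varies by database backend):

qs = CallLog.objects.all().only('cdate')[2000:4000]
# Slicing does not evaluate the QuerySet; the generated SQL already contains
# the LIMIT/OFFSET clause, e.g. "... LIMIT 2000 OFFSET 2000".
print(str(qs.query))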

Django ORM really slow iterating over QuerySet

Okay, I've found that the problem was the Python MySQL driver. Without the .iterator() method, a for loop would get stuck on the last element in the QuerySet. I have posted a more detailed answer on an expanded question here.

I was not using the Django-recommended mysqlclient. I was using the one created by Oracle/MySQL. There seems to be a bug that causes an iterator to get "stuck" on the last element of the QuerySet in a for loop and be trapped in an endless loop in certain circumstances.

Come to think of it, it may well be that this is a design feature of the MySQL driver. I remember having a similar issue with a Java version of this driver before. Maybe I should just ditch MySQL and move to PostgreSQL?

I will try to raise a bug with Oracle anyway.


