Why is iterating through a large Django QuerySet consuming massive amounts of memory?
Nate C was close, but not quite.
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print(e.headline)
So your ten million rows are retrieved, all at once, when you first enter that loop. The wait you experience is Django loading the database rows and creating an object for each one before returning anything you can actually iterate over. Then you have everything in memory, and the results come spilling out.
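The distinction is the usual list-versus-generator one in Python; a small illustration with no Django involved (the row dicts are just stand-ins for model instances):

```python
import sys

def make_rows(n):
    # Stand-in for a database cursor: yields one "row" at a time.
    for i in range(n):
        yield {"id": i}

rows_list = list(make_rows(100_000))  # all rows materialized at once
rows_gen = make_rows(100_000)         # nothing produced yet

# The list owns every row object; the generator's footprint is constant.
print(sys.getsizeof(rows_list) > sys.getsizeof(rows_gen))  # True
```

Plain iteration over a queryset behaves like the list; what you want for huge tables is something closer to the generator.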
From my reading of the docs, iterator() does nothing more than bypass QuerySet's internal caching mechanisms. I think it might make sense for it to fetch rows one by one, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.
Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes:
- Memory Efficient Django QuerySet iterator
- batch querysets
- QuerySet Foreach
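For comparison, the core idea behind those snippets can be written as a plain generator; a Django-free sketch, where `fetch_batch(offset, limit)` is an assumed callable standing in for a sliced queryset like `qs[offset:offset + limit]`:

```python
def chunked(fetch_batch, batch_size=1000):
    """Yield items batch by batch so only one batch is in memory at a time.

    `fetch_batch(offset, limit)` must return at most `limit` items starting
    at `offset`; an empty result ends the iteration.
    """
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return
        yield from batch
        offset += batch_size

# Usage: simulate a 10-row table with a list.
rows = list(range(10))
assert list(chunked(lambda off, lim: rows[off:off + lim], batch_size=3)) == rows
```

With a real queryset, each call to `fetch_batch` would be one LIMIT/OFFSET query, trading extra round trips for a bounded working set.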
How to iterate a large table in Django without running out of memory?
The first thing to try is using the iterator() method on the queryset before iterating over it:

for ii in MyModel.objects.all().iterator():
    do_something(ii)  # your per-row processing
Limiting Memory Use in a *Large* Django QuerySet
So what I actually ended up doing is building something you can 'wrap' a QuerySet in. It works by making a deepcopy of the QuerySet and using the slice syntax--e.g., some_queryset[15:45]--but it then makes another deepcopy of the original QuerySet when the slice has been completely iterated through. This means that only the objects returned in 'this' particular slice are stored in memory.
import copy
import logging

logger = logging.getLogger(__name__)

class MemorySavingQuerysetIterator(object):

    def __init__(self, queryset, max_obj_num=1000):
        self._base_queryset = queryset
        self.max_obj_num = max_obj_num
        self._generator = self._setup()

    def _setup(self):
        for i in range(0, self._base_queryset.count(), self.max_obj_num):
            # By making a copy of the queryset and using that to actually
            # access the objects, we ensure that at most `max_obj_num`
            # objects are in memory at any given time.
            smaller_queryset = copy.deepcopy(self._base_queryset)[i:i + self.max_obj_num]
            logger.debug('Grabbing next %s objects from DB', self.max_obj_num)
            for obj in smaller_queryset.iterator():
                yield obj

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._generator)
So instead of...
# Something that returns *a lot* of objects:
for obj in SomeObject.objects.filter(foo='bar'):
    do_something(obj)
You would do...
for obj in MemorySavingQuerysetIterator(SomeObject.objects.filter(foo='bar')):
    do_something(obj)
Please note that the intention of this is to save memory in your Python interpreter. It does so essentially by making more database queries. Usually people are trying to do the exact opposite of that--i.e., minimize database queries without regard to memory usage. Hopefully somebody will find this useful though.
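A related caveat: LIMIT/OFFSET slices get slower as the offset grows, because the database still has to skip over all the earlier rows. Keyset pagination on the primary key avoids that. A minimal Django-free sketch of the idea, where `fetch_after(last_pk, limit)` is an assumed callable standing in for something like `qs.filter(pk__gt=last_pk).order_by('pk')[:limit]`:

```python
def keyset_iter(fetch_after, batch_size=1000):
    """Iterate objects in pk order, keeping one batch in memory at a time.

    `fetch_after(last_pk, limit)` returns up to `limit` (pk, obj) pairs with
    pk greater than `last_pk` (None means "from the start").
    """
    last_pk = None
    while True:
        batch = fetch_after(last_pk, batch_size)
        if not batch:
            return
        for pk, obj in batch:
            yield obj
        last_pk = batch[-1][0]  # resume after the largest pk seen so far

# Usage: simulate a table of (pk, name) rows.
table = [(i, "row-%d" % i) for i in range(1, 8)]

def fetch_after(last_pk, limit):
    rows = [r for r in table if last_pk is None or r[0] > last_pk]
    return rows[:limit]

print(list(keyset_iter(fetch_after, batch_size=3)))
# ['row-1', 'row-2', 'row-3', 'row-4', 'row-5', 'row-6', 'row-7']
```

Each batch becomes an indexed `pk > last_pk` lookup instead of an ever-growing OFFSET, which matters once you're deep into ten million rows.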
Iterating over a large Django queryset while the data is changing elsewhere
The following will do the job for you in Django 1.1, no loop necessary:
from django.db.models import F
Book.objects.all().update(activity=F('views')*4)
You can have a more complicated calculation too:
for book in Book.objects.all().iterator():
    Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())
Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.
Simple query causes memory leak in Django
You could try something like this:

import datetime

import pytz
from django.core.paginator import Paginator

# Include every field you read below; otherwise each access to a deferred
# field costs an extra query per row.
p = Paginator(CallLog.objects.all().only('cdate', 'fixed_date'), 2000)
for page in range(1, p.num_pages + 1):
    for i in p.page(page).object_list:
        i.cdate = pytz.utc.localize(
            datetime.datetime.strptime(i.fixed_date, "%y-%m-%d %H:%M"))
        i.save()
Slicing a query set does not load all the objects in memory only to get a subset but adds limit and offset to the SQL query before hitting the database.
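To make that last point concrete, here is a toy, Django-free model of the behavior: slicing only records the bounds, and no rows are touched until the result is actually consumed (`LazyQuery` is a made-up class for illustration, not a Django API):

```python
class LazyQuery:
    """Toy queryset: slicing stores LIMIT/OFFSET instead of fetching rows."""

    def __init__(self, rows, offset=0, limit=None):
        self._rows = rows
        self.offset = offset
        self.limit = limit

    def __getitem__(self, s):
        # Record the bounds only; self._rows is not read here.
        # (Only slice access is supported in this sketch.)
        start = s.start or 0
        stop = len(self._rows) if s.stop is None else s.stop
        return LazyQuery(self._rows, offset=start, limit=stop - start)

    def fetch(self):
        # Only now do we "hit the database".
        end = None if self.limit is None else self.offset + self.limit
        return self._rows[self.offset:end]

q = LazyQuery(list(range(100)))
page = q[15:45]                             # cheap: just sets offset=15, limit=30
print(page.offset, page.limit)              # 15 30
print(page.fetch() == list(range(15, 45)))  # True
```

Real querysets do the same thing: `qs[15:45]` returns a new, unevaluated queryset whose SQL carries `LIMIT 30 OFFSET 15`.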
Django ORM really slow iterating over QuerySet
Okay, I've found the problem was the Python MySQL driver. Without using the .iterator() method, a for loop would get stuck on the last element in the QuerySet. I have posted a more detailed answer on an expanded question here.
I was not using the Django-recommended mysqlclient. I was using the one created by Oracle/MySQL. There seems to be a bug that causes an iterator to get "stuck" on the last element of the QuerySet in a for loop and be trapped in an endless loop in certain circumstances.
Come to think of it, it may well be that this is a design feature of the MySQL driver. I remember having a similar issue with a Java version of this driver before. Maybe I should just ditch MySQL and move to PostgreSQL? I will try to raise a bug with Oracle anyway.