Memory-Efficient Built-In SQLAlchemy Iterator/Generator


Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets hold of one result, the whole result set is in memory.

But then, the way Query works is that it fully loads the given result set by default before returning your objects. The rationale here concerns queries that are more than simple SELECT statements. For example, in joins to other tables, the same object identity may be returned multiple times in one result set (common with eager loading), so the full set of rows needs to be in memory for the correct results to be returned; otherwise collections and such might be only partially populated.
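
Here's a self-contained sketch of the eager-loading case where that full buffering matters (User and Address are hypothetical mapped classes, not from the question):

from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, joinedload, relationship

Base = declarative_base()

class User(Base):                      # hypothetical mapped class
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    addresses = relationship("Address")

class Address(Base):                   # hypothetical mapped class
    __tablename__ = "address"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("user.id"))

engine = create_engine("sqlite://")    # assumed in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    # The joined SELECT returns one row per (user, address) pair, so a
    # User identity can appear many times; the ORM must consume the full
    # result to deduplicate Users and completely populate .addresses.
    users = session.query(User).options(joinedload(User.addresses)).all()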

So Query offers an option to change this behavior through yield_per(). This call causes the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections - it's essentially for when you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, that memory overhead will still be there, so the approach only scales slightly better than not using it.
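
A minimal sketch of what that looks like (the model, engine URL, and batch size here are illustrative, not from the question):

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Widget(Base):                    # hypothetical mapped class
    __tablename__ = "widget"
    id = Column(Integer, primary_key=True)

engine = create_engine("postgresql://localhost/test")  # assumed URL

with Session(engine) as session:
    # yield_per(1000) makes the Query fetch and ORM-process rows in
    # batches of 1000 instead of buffering the whole result set; no
    # eager loading of collections may be in play when using it.
    for widget in session.query(Widget).yield_per(1000):
        print(widget.id)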

I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above, using window functions. LIMIT and OFFSET have a huge problem: a very large OFFSET makes the query slower and slower, since an OFFSET of N forces the database to page through N rows - it's like running the same query fifty times instead of once, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select, then emit individual SELECT statements that each pull from one of those windows at a time.

The window function approach is on the wiki and I use it with great success.
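
A condensed sketch of the idea (not the wiki recipe verbatim; it assumes the window column is a unique, indexed integer such as a primary key):

from sqlalchemy import func

def column_windows(session, column, windowsize):
    # Number every row in `column` order, then keep only the rows that
    # start a window (row numbers 1, windowsize+1, 2*windowsize+1, ...);
    # their values are the lower boundaries of the windows.
    rownum = func.row_number().over(order_by=column).label("rownum")
    subq = session.query(column.label("col"), rownum).subquery()
    boundaries = [
        row[0]
        for row in session.query(subq.c.col).filter(
            subq.c.rownum % windowsize == 1
        )
    ]
    boundaries.append(None)  # the last window is open-ended
    for lower, upper in zip(boundaries, boundaries[1:]):
        yield lower, upper

def windowed_query(query, column, windowsize):
    # One cheap SELECT per window; each uses the index on `column`
    # rather than paging through an ever-growing OFFSET.
    for lower, upper in column_windows(query.session, column, windowsize):
        q = query.filter(column >= lower)
        if upper is not None:
            q = q.filter(column < upper)
        yield from q

Usage would look like: for widget in windowed_query(session.query(Widget), Widget.id, 1000): ...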

Also note: not all databases support window functions; you need PostgreSQL, Oracle, or SQL Server. IMHO, using at least PostgreSQL is definitely worth it - if you're using a relational database, you might as well use the best.

SQLAlchemy - Difference Between query and query.all() in for Loops

Nope, there is no difference in DB traffic. The difference is just that for row in session.query(Model1) does the ORM work on each row as it is about to give it to you, while for row in session.query(Model1).all() does the ORM work on all rows before it starts giving them to you.

Note that q.all() is just sugar for list(q), i.e. collecting everything yielded by the generator into a list. Here is the source code for it, in the Query class (find def all in the linked source):

def all(self):
    """Return the results represented by this ``Query`` as a list.

    This results in an execution of the underlying query.

    """
    return list(self)

... where self, the query object, is an iterable, i.e. has an __iter__ method.

So logically the two ways are exactly the same in terms of DB traffic; both end up calling query.__iter__() to get a row iterator, and next()ing their way through it.

The practical difference is that the former can start giving you rows as soon as their data has arrived, “streaming” the DB result set to you, with less memory use and latency. I can't state for sure that all the current engine implementations do that (I hope they do!). In any case, the latter version prevents that efficiency for no good reason.
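
To make the contrast concrete (Model1 is the asker's class; handle() stands in for whatever per-row work you do):

# Streaming: the ORM hands each row over as it is produced, so work on
# the first row can begin before the last row has arrived.
for row in session.query(Model1):
    handle(row)

# Buffered: .all() is list(query) under the hood, so every row is
# fetched and ORM-processed before the loop body runs even once.
for row in session.query(Model1).all():
    handle(row)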


