How to Create a Large Pandas Dataframe from an SQL Query Without Running Out of Memory

How to create a large pandas DataFrame from an SQL query without running out of memory?

Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.

You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:

import pandas as pd

chunk_size = 10000
offset = 0
dfs = []

while True:
    # ORDER BY must come before LIMIT/OFFSET so the paging is stable
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    # pd.read_sql replaces the long-removed pandas.io.sql.read_frame
    dfs.append(pd.read_sql(sql, cnxn))
    offset += chunk_size
    # a short (or empty) chunk means the last page has been read
    if len(dfs[-1]) < chunk_size:
        break

full_df = pd.concat(dfs)

It might also be that the whole DataFrame is simply too large to fit in memory; in that case you will have no option other than to restrict the number of rows or columns you're selecting.
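For example, pushing that restriction into the query itself keeps the result small before it ever reaches pandas. A minimal sketch, where the extra column names and the WHERE filter are hypothetical:

# Select only the columns you actually need and filter rows on the
# database side, so pandas never sees the full table.
sql = "SELECT ID, col_a, col_b FROM MyTable WHERE ID < 1000000"
df = pd.read_sql(sql, cnxn)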

Pandas using too much memory with read_sql_table

You need to set the chunksize argument so that pandas will iterate over smaller chunks of data. See this post: https://stackoverflow.com/a/31839639/3707607
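With current pandas, passing chunksize to pd.read_sql (or read_sql_table) returns an iterator of smaller DataFrames instead of one huge frame. A minimal sketch, where the table name, the connection cnxn, and the per-chunk filter are placeholders:

import pandas as pd

# chunksize makes read_sql return an iterator of DataFrames rather than
# materialising the whole result set at once
chunks = pd.read_sql("SELECT * FROM MyTable", cnxn, chunksize=10000)

processed = []
for chunk in chunks:
    # work on each chunk while it is small, e.g. filter or aggregate it
    processed.append(chunk[chunk["ID"] % 2 == 0])

result = pd.concat(processed, ignore_index=True)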

Large (6 million rows) pandas df causes memory error with `to_sql` when chunksize=100, but can easily save file of 100,000 with no chunksize

From stepping through the code, I think it's this line, which creates a bunch of DataFrames:

chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])

That looks like it's probably a bug. Specifically, it happens prior to database insertion, during preparation.

One trick you can use is to press CTRL-C while memory is rapidly increasing and see which line execution is paused on (my bet is this one).

User Edit:

The problem was solved by using an explicit loop (rather than relying on chunksize), i.e. `for i in range(100): df.iloc[i * 100000:(i + 1) * 100000].to_sql(...)`

This still resulted in memory errors, but it allowed the user to continue from where the loop left off before the crash.

A more robust solution might be to "perhaps try a raw connection, rather than using SQLEngine", but the user didn't have a chance to try this.
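A cleaned-up version of that explicit loop might look like the sketch below, assuming df is the existing 6-million-row frame; the SQLAlchemy engine, table name, and chunk size are placeholders, and if_exists="append" makes each slice add to the same table:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")  # placeholder connection string

chunk_rows = 100000
n_chunks = (len(df) + chunk_rows - 1) // chunk_rows  # ceiling division

for i in range(n_chunks):
    # write one slice at a time so only ~100,000 rows are prepared in memory
    df.iloc[i * chunk_rows:(i + 1) * chunk_rows].to_sql(
        "my_table", engine, if_exists="append", index=False
    )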

Pandas: how to work with really big data?

Pandas has a nice built-in read_sql method that should be pretty efficient.

i.e. just do:

df = pd.read_sql("SELECT * FROM binance.zrxeth_ob_indicators", conn)

and it should just work…

On its own, 1.2 million rows isn't much. Given your column count/names, it's probably around 300 MB of RAM (30 bytes per value * 9 columns * 1.2e6 rows ≈ 324 MB) and should load in under 10 seconds on a recent computer.
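To check an estimate like that against reality, you can ask pandas directly once the frame is loaded; a minimal sketch:

# deep=True inspects object (string) columns byte by byte rather than
# just counting pointers, so the figure reflects actual RAM use
mem_bytes = df.memory_usage(deep=True).sum()
print("%.1f MB across %d rows and %d columns" % (mem_bytes / 1e6, len(df), df.shape[1]))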


