How to create a large pandas dataframe from an sql query without running out of memory?
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET, and a stable ordering
    # is needed for OFFSET-based paging to return each row exactly once.
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(pd.read_sql(sql, cnxn))
    offset += chunk_size
    # A short final chunk means the table has been exhausted.
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
(Note: the old pandas.io.sql.read_frame has long been removed; pd.read_sql is its replacement.)
It might also be that the whole dataframe is simply too large to fit in memory; in that case, you will have no option but to restrict the number of rows or columns you're selecting.
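For instance, narrowing the query itself keeps the frame bounded. A minimal sketch (the table, column names, and the in-memory SQLite database are illustrative placeholders):

```python
import sqlite3

import pandas as pd

# Build a small throwaway table so the example runs end-to-end.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(5), "a": range(5), "b": range(5)}).to_sql(
    "my_table", conn, index=False
)

# Select only the columns and rows you actually need.
df = pd.read_sql("SELECT id, a FROM my_table LIMIT 3", conn)
print(df.shape)  # (3, 2)
```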
Pandas using too much memory with read_sql_table
You need to set the chunksize argument so that pandas will iterate over smaller chunks of data. See this post: https://stackoverflow.com/a/31839639/3707607
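A minimal sketch of this chunked reading: passing chunksize makes pd.read_sql return an iterator of DataFrames instead of one large frame. The table and the in-memory SQLite connection here are placeholders:

```python
import sqlite3

import pandas as pd

# Create a small table so the example is self-contained.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(25), "value": range(25)}).to_sql(
    "my_table", conn, index=False
)

# With chunksize, read_sql yields DataFrames of at most 10 rows each,
# so the full result set never has to sit in memory at once.
chunks = pd.read_sql("SELECT * FROM my_table", conn, chunksize=10)
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)  # 25
```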
Large (6 million rows) pandas df causes memory error with `to_sql` when chunksize=100, but can easily save file of 100,000 with no chunksize
From stepping through the code I think it's this line, which creates a bunch of DataFrames:
chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
This looks like it's probably a bug; specifically, it happens in preparation, prior to the database insertion.
One trick you can do is hit CTRL-C while the memory is rapidly increasing, and see which line execution is paused on (my bet is this one).
User Edit:
The problem was solved by using an explicit loop (rather than using chunksize), i.e. for i in range(100): df.iloc[i * 100000:(i + 1) * 100000].to_sql(...)
This still resulted in memory errors, but allowed the user to continue where the loop left off before the crash.
A more robust solution might be to "perhaps try a raw connection, rather than using SQLEngine", but the user didn't have a chance to try this.
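The explicit-loop workaround described above can be sketched as follows. The slice size is shrunk for illustration (the user used 100000), and a small in-memory SQLite database stands in for the user's engine:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"x": range(10)})

step = 3  # rows per insert; the original used 100000
for i in range(0, len(df), step):
    # Write one slice at a time; if the process crashes, the loop
    # can be restarted from the last successful value of i.
    df.iloc[i:i + step].to_sql("my_table", conn, if_exists="append", index=False)

written = pd.read_sql("SELECT COUNT(*) AS n FROM my_table", conn)["n"][0]
print(written)  # 10
```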
Pandas: how to work with really big data?
Pandas has a nice built-in read_sql method that should be pretty efficient, i.e. just do:
df = pd.read_sql("SELECT * FROM binance.zrxeth_ob_indicators", conn)
and it should just work…
On its own, 1.2 million rows isn't much: given your column count/names it's probably around 300 MB of RAM (30 bytes per value * 9 columns * 1.2e6 rows ≈ 325 MB) and should take under 10 seconds on a recent computer.
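To check such an estimate against a concrete frame, pandas can report its actual memory footprint. A quick sketch with made-up data:

```python
import pandas as pd

# A toy frame: one int64 and one float64 column, 1000 rows each.
df = pd.DataFrame({"a": range(1000), "b": [1.5] * 1000})

# memory_usage(deep=True) reports per-column bytes, including the index;
# summing gives the total footprint of the frame.
bytes_total = df.memory_usage(deep=True).sum()
print(bytes_total)
```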