Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
After contacting the developers of SQLAlchemy, a way to solve this problem has emerged. Many thanks to them for the great work!
One has to use a cursor execution event and check whether the executemany flag has been raised. If it has, switch the fast_executemany option on. For example:
from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
More information on execution events can be found here.
UPDATE: Support for fast_executemany of pyodbc was added in SQLAlchemy 1.3.0, so this hack is no longer necessary.
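Since SQLAlchemy 1.3.0 the flag can be passed straight to create_engine() for the mssql+pyodbc dialect. A minimal sketch; the server, credentials, and driver name are placeholders to adapt for your environment:

```python
from sqlalchemy import create_engine

# Placeholder connection details -- substitute your own server, database,
# credentials, and installed ODBC driver version.
engine = create_engine(
    "mssql+pyodbc://user:password@myserver/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,  # supported since SQLAlchemy 1.3.0 for mssql+pyodbc
)
```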
to_sql() alternative to fast_executemany for databases other than SQL Server
fast_executemany=True is specific to the mssql+pyodbc:// dialect. It will not work with other dialects like sqlite://. For other databases you would normally use method="multi" (or a custom function for PostgreSQL, as described in this answer).
However, SQLite appears to have a limit of 999 parameter values in a single SQL statement, so you would also need to use the chunksize= argument:
# value to pass to index= argument
use_index = False
# calculate chunksize= , taking into account whether we'll be
# uploading the index
chunk_size = 999 // (len(df.columns) + (1 if use_index else 0))
df.to_sql(
"table1",
engine,
index=use_index,
if_exists="replace",
method="multi",
chunksize=chunk_size,
)
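The chunksize arithmetic above can be exercised end to end against an in-memory SQLite database. A self-contained sketch with a toy three-column frame (column names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy frame with 3 columns; names are illustrative.
df = pd.DataFrame({"a": range(10), "b": range(10), "c": range(10)})

use_index = False
# SQLite allows at most 999 bound parameters per statement (on older builds),
# so each chunk can hold floor(999 / parameters_per_row) rows.
chunk_size = 999 // (len(df.columns) + (1 if use_index else 0))  # 333 here

engine = create_engine("sqlite://")  # in-memory database for the demo
df.to_sql("table1", engine, index=use_index, if_exists="replace",
          method="multi", chunksize=chunk_size)

n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM table1", engine)["n"].iloc[0]
```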
Edit:
Based on comments below, I tested df.to_sql() inserting 1_000_000 rows into a SQLite database on my local hard drive:
- method=None (the default): 1_000_000 rows written in 53 seconds
- method="multi" (as above): 1_000_000 rows written in 505 seconds
Apparently Python's sqlite3 module has some sort of optimization similar to fast_executemany=True already built in, so a single .executemany() call with 1_000_000 rows (df.to_sql(…, method=None)) is about an order of magnitude faster than ~5 thousand .execute() calls inserting 199 rows each (df.to_sql(…, method="multi")).
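The comparison above can be reproduced in miniature. A hedged sketch shrunk to 10_000 rows so it finishes quickly; absolute times will of course differ from the figures quoted:

```python
import time

import pandas as pd
from sqlalchemy import create_engine

# Much smaller frame than the 1_000_000-row test above, so this runs fast.
df = pd.DataFrame({"x": range(10_000), "y": range(10_000)})
engine = create_engine("sqlite://")  # in-memory database

timings = {}
for method, chunksize in [(None, None), ("multi", 333)]:
    start = time.perf_counter()
    df.to_sql("bench", engine, index=False, if_exists="replace",
              method=method, chunksize=chunksize)
    timings[method] = time.perf_counter() - start

n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM bench", engine)["n"].iloc[0]
for method, elapsed in timings.items():
    print(f"method={method!r}: {elapsed:.2f}s for {n_rows} rows")
```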
Pandas to_sql() slow on one DataFrame but fast on others
Cause
The performance difference is due to an issue in pyodbc where passing None values to SQL Server INSERT statements with the fast_executemany=True option enabled results in slowdowns.
Solution
We can pack the values as JSON and use OPENJSON (supported on SQL Server 2016+) instead of fast_executemany. This solution resulted in a 30x performance improvement in my application! Here's a self-contained example, based on the documentation here, but adapted for pandas users.
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({'First Name': ['Homer', 'Ned'], 'Last Name': ['Simpson', 'Flanders']})
rows_as_json = df.to_json(orient='records')
server = '<server name>'
db = '<database name>'
table = '<table name>'
engine = create_engine(f'mssql+pyodbc://<user>:<password>@{server}/{db}')
sql = f'''
INSERT INTO {table} ([First Name], [Last Name])
SELECT [First Name], [Last Name] FROM
OPENJSON(?)
WITH (
[First Name] nvarchar(50) '$."First Name"',
[Last Name] nvarchar(50) '$."Last Name"'
)
'''
conn = engine.raw_connection()  # keep a reference so the connection stays open
cursor = conn.cursor()
cursor.execute(sql, rows_as_json)
cursor.commit()
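The shape of the JSON payload that df.to_json(orient='records') hands to OPENJSON is easy to inspect on its own:

```python
import pandas as pd

df = pd.DataFrame({'First Name': ['Homer', 'Ned'],
                   'Last Name': ['Simpson', 'Flanders']})
rows_as_json = df.to_json(orient='records')
print(rows_as_json)
# [{"First Name":"Homer","Last Name":"Simpson"},{"First Name":"Ned","Last Name":"Flanders"}]
```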
Alternative Workarounds
- Export data to CSV and use an external tool to complete the transfer (for example, the bcp utility).
- Artificially replace values that are converted to None with a non-empty filler value, add a helper column to indicate which rows were changed, complete the operation as normal via to_sql(), then reset the filler values to NULL via a separate query based on the helper column.
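The filler-value workaround in the second bullet can be sketched on the pandas side. The sentinel value and column names are hypothetical; pick something that cannot occur in your real data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Homer", None, "Ned"],
                   "score": [1.0, np.nan, 3.0]})

FILLER = "__missing__"  # hypothetical sentinel that cannot appear in real data
# Helper column recording which rows held a NULL-like name before filling.
df["name_was_null"] = df["name"].isna()
df["name"] = df["name"].fillna(FILLER)

# After df.to_sql(...) completes, one follow-up query such as
#   UPDATE mytable SET name = NULL WHERE name_was_null = 1
# restores the NULLs, and the helper column can then be dropped.
```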
Acknowledgement
Many thanks to @GordThompson for pointing me to the solution!
How to speed up pandas to_sql
Not sure if you have your issue resolved, but I did want to provide an answer here with Azure SQL Database information specific to Python, along with some useful resources for investigating and resolving this issue, as applicable.
An example of using pyodbc to directly query an Azure SQL Database:
Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance
An example of using Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe
main.py
"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
# 1. Constants
AZUREUID = 'myuserid' # Azure SQL database userid
AZUREPWD = '************' # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net' # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas'                                       # Azure SQL database name (if the table does not exist, pandas will create it)
TABLE = 'DataTable' # Azure SQL database table name
DRIVER = 'ODBC Driver 13 for SQL Server' # ODBC Driver
def main():
    """Main function"""

    # 2. Build a connection string
    connectionstring = 'mssql+pyodbc://{uid}:{password}@{server}:1433/{database}?driver={driver}'.format(
        uid=AZUREUID,
        password=AZUREPWD,
        server=AZURESRV,
        database=AZUREDB,
        driver=DRIVER.replace(' ', '+'))

    # 3. Read dummy data into a dataframe
    df = pd.read_csv('./data/data.csv')

    # 4. Create a SQLAlchemy engine and write the data to SQL
    engn = create_engine(connectionstring)
    df.to_sql(TABLE, engn, if_exists='append')

    # 5. Read the data from SQL into a dataframe
    query = 'SELECT * FROM {table}'.format(table=TABLE)
    dfsql = pd.read_sql(query, engn)
    print(dfsql.head())

if __name__ == "__main__":
    main()
And finally, the following resources should assist in comparing specific implementations and performance issues. The Stack Overflow thread is likely the best resource, but the monitoring and performance tuning document is useful for investigating and mitigating any server-side performance issues:
Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance
Regards,
Mike
Downsides or cons of using fast_executemany property?
Are there situations when it is not recommended to have it enabled for SQL Server tasks? Maybe if only doing singleton inserts all the time?
No, fast_executemany=True will have no effect on single-row inserts if pyodbc's .execute() method is called. One example is this pandas issue where the behaviour differs between a DataFrame with a single row (.execute()) and multiple rows (.executemany()). The fix for that particular issue would be for pandas to always call .executemany(), even if the DataFrame only has a single row. (Note also that fast_executemany=True does not cause the problem; it fixes the problem.)
However, there are a couple of other known issues with fast_executemany=True and .to_sql() in specific cases:
1. Databases with default "supplementary character" (_SC) collations
If the database is defined with a default "…_SC" collation, e.g.,
cnxn.execute(f"CREATE DATABASE {db_name} COLLATE Latin1_General_100_CI_AS_SC")
then .to_sql() will fail for strings longer than 2000 characters.
pyodbc issue on GitHub
2. DataFrames with a lot of NULL-like values
DataFrames that are relatively sparse (contain a lot of NULL-like values like None, NaN, NaT, etc.) can degrade the insert performance of .executemany(), although the worst case would be that fast_executemany=True runs about as slowly as fast_executemany=False.
pyodbc issue on GitHub
3. Increasing memory consumption with [n]varchar(max) columns
to_sql() defaults to creating string columns as varchar(max), and that can cause memory bloat with fast_executemany=True.
pyodbc issue on GitHub
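One way to sidestep the varchar(max) default is to give string columns an explicit length via to_sql()'s dtype argument. A sketch under stated assumptions: the table and column names are illustrative, and SQLite stands in here for a real mssql+pyodbc engine:

```python
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import NVARCHAR

df = pd.DataFrame({"city": ["Springfield", "Shelbyville"]})
engine = create_engine("sqlite://")  # stand-in; the bloat affects mssql+pyodbc

# An explicit length keeps to_sql from emitting an [n]varchar(max) column.
df.to_sql("cities", engine, index=False, if_exists="replace",
          dtype={"city": NVARCHAR(50)})

n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM cities", engine)["n"].iloc[0]
```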
Pandas dataframe insert into SQL Server taking too long with execute and executemany
You can set fast_executemany in pyodbc itself for versions >= 4.0.19. It is off by default.
import pyodbc
server_name = 'localhost'
database_name = 'AdventureWorks2019'
table_name = 'MyTable'
driver = 'ODBC Driver 17 for SQL Server'
connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes')
cursor = connection.cursor()
cursor.fast_executemany = True # reduce number of calls to server on inserts
# form SQL statement
columns = ", ".join(df.columns)
values = '('+', '.join(['?']*len(df.columns))+')'
statement = "INSERT INTO "+table_name+" ("+columns+") VALUES "+values
# extract values from DataFrame into list of tuples
insert = [tuple(x) for x in df.values]
cursor.executemany(statement, insert)
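As a check of the statement-building logic above, here is a self-contained sketch with a toy frame (the table and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
table_name = "MyTable"  # hypothetical table name

# Same construction as above: column list plus one '?' marker per column.
columns = ", ".join(df.columns)
values = "(" + ", ".join(["?"] * len(df.columns)) + ")"
statement = "INSERT INTO " + table_name + " (" + columns + ") VALUES " + values
print(statement)  # INSERT INTO MyTable (a, b) VALUES (?, ?)

# One tuple per row, ready for cursor.executemany(statement, insert).
insert = [tuple(x) for x in df.values]
```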
Or, if you prefer, use SQLAlchemy and write the DataFrame directly:
import sqlalchemy as db
engine = db.create_engine('mssql+pyodbc://@'+server_name+'/'+database_name+'?trusted_connection=yes&driver='+driver, fast_executemany=True)
df.to_sql(table_name, engine, if_exists='append', index=False)
See fast_executemany in this link.
https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API
pyodbc/sqlAchemy enable fast execute many
The error you received is caused by changes introduced in Pandas version 0.23.0, reverted in 0.23.1, and reintroduced in 0.24.0, as explained here. The generated VALUES clause contains 100,000 parameter markers, and it would seem that the count is stored in a signed 16-bit integer, so it overflows and you get the funny error:
The SQL contains -31072 parameter markers, but 100000 parameters were supplied
You can check for yourself:
In [16]: 100000 % (2 ** 16) - 2 ** 16
Out[16]: -31072
If you would like to keep on using Pandas as is, you will have to calculate and provide a suitable chunksize value, such as the 100 you were using, taking into account both the maximum row limit of 1,000 for the VALUES clause and the maximum parameter limit of 2,100 for stored procedures. The details are again explained in the linked Q/A.
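That arithmetic can be written out explicitly. A sketch assuming 10 parameters per row (nine data columns plus the index, matching the comparison frame used later in this answer); note the benchmark below uses chunksize=209, one under this bound, which leaves headroom for any parameters the driver adds internally:

```python
# Assumed shape: 9 data columns plus the index -> 10 parameters per row.
params_per_row = 10
max_params = 2100        # SQL Server limit on parameters per batch
max_values_rows = 1000   # SQL Server limit on rows in one VALUES clause

# The chunk must satisfy both limits, so take the smaller bound.
chunksize = min(max_values_rows, max_params // params_per_row)
print(chunksize)  # 210
```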
Before the change Pandas always used executemany() when inserting data. Newer versions detect whether the dialect in use supports the VALUES clause in INSERT. This detection happens in SQLTable.insert_statement() and cannot be controlled, which is a shame since pyODBC fixed their executemany() performance, given the right flag is enabled.
In order to force Pandas to use executemany() with pyODBC again, SQLTable has to be monkeypatched:
import pandas.io.sql

def insert_statement(self, data, conn):
    return self.table.insert(), data

pandas.io.sql.SQLTable.insert_statement = insert_statement
This will be horribly slow if the Cursor.fast_executemany flag is not set, so remember to set the proper event handler.
Here is a simple performance comparison, using the following dataframe:
In [12]: df = pd.DataFrame({f'X{i}': range(1000000) for i in range(9)})
Vanilla Pandas 0.24.0:
In [14]: %time df.to_sql('foo', engine, chunksize=209)
CPU times: user 2min 9s, sys: 2.16 s, total: 2min 11s
Wall time: 2min 26s
Monkeypatched Pandas with fast executemany enabled:
In [10]: %time df.to_sql('foo', engine, chunksize=500000)
CPU times: user 12.2 s, sys: 981 ms, total: 13.2 s
Wall time: 38 s