Speeding Up Pandas.Dataframe.To_SQL with Fast_Executemany of Pyodbc

Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC

After contacting the developers of SQLAlchemy, a way to solve this problem has emerged. Many thanks to them for the great work!

One has to use a cursor execution event and check if the executemany flag has been raised. If that is indeed the case, switch the fast_executemany option on. For example:

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

More information on execution events can be found here.

UPDATE: Support for fast_executemany of pyodbc was added in SQLAlchemy 1.3.0, so this hack is not longer necessary.

to_sql() alternative to fast_executemany for databases other than SQL Server

fast_executemany=True is specific to the mssql+pyodbc:// dialect. It will not work with other dialects like sqlite://. For other databases you would normally use method="multi" (or a custom function for PostgreSQL as described in this answer).

However, SQLite appears to have a limit of 999 parameter values in a single SQL statement so you would also need to use the chunksize= argument:

# value to pass to index= argument
use_index = False

# calculate chunksize= , taking into account whether we'll be
# uploading the index
chunk_size = 999 // (len(df.columns) + (1 if use_index else 0))

df.to_sql(
    "table1",
    engine,
    index=use_index,
    if_exists="replace",
    method="multi",
    chunksize=chunk_size,
)

Edit:

Based on comments below, I tested df.to_sql() inserting 1_000_000 rows into a SQLite database on my local hard drive.

method=None (the default): 1_000_000 rows written in 53 seconds

method="multi" (as above): 1_000_000 rows written in 505 seconds

Apparently Python's sqlite3 module has some sort of optimization similar to fast_executemany=True already built-in, so an .executemany() with 1_000_000 rows (pd.to_sql(…, method=None)) is about an order of magnitude faster than ~5 thousand .execute() calls inserting 199 rows each (pd.to_sql(…, method="multi")).

Pandas to_sql() slow on one DataFrame but fast on others

Cause

The performance difference is due to an issue in pyodbc where passing None values to SQL Server INSERT statements when using the fast_executemany=True option results in slow downs.

Solution

We can pack the values as JSON and use OPENJSON (supported on SQL Server 2016+) instead of fast_executemany. This solution resulted in a 30x performance improvement in my application! Here's a self-contained example, based on the documentation here, but adapted for pandas users.

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({'First Name': ['Homer', 'Ned'], 'Last Name': ['Simpson', 'Flanders']})
rows_as_json = df.to_json(orient='records')

server = '<server name>'
db = '<database name>'
table = '<table name>'
engine = create_engine(f'mssql+pyodbc://<user>:<password>@{server}/{db}')

sql = f'''
INSERT INTO {table} ([First Name], [Last Name])
SELECT [First Name], [Last Name] FROM
    OPENJSON(?)
    WITH (
        [First Name] nvarchar(50) '$."First Name"',
        [Last Name] nvarchar(50) '$."Last Name"'
    )
'''

cursor = engine.raw_connection().cursor()
cursor.execute(sql, rows_as_json)
cursor.commit()

Alternative Workarounds

Export data to CSV and use an external tool to complete the transfer (for example, the bcp utility).
Artificially replace values that are converted to None to a non-empty filler value, add a helper column to indicate which rows were changed, complete the operation as normal via to_sql(), then reset the filler values to Null via a separate query based on the helper column.

Acknowledgement

Many thanks to @GordThompson for pointing me to the solution!

How to speed up pandas to_sql

Not sure if you have your issue resolved but did want to provide an answer here for the benefit of providing Azure SQL Database libraries for Python specific information and some useful resources to investigate and resolve this issue, as applicable.

An example of using pyodbc to directly query an Azure SQL Database:
Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance

An example of using Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe

main.py

"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# 1. Constants
AZUREUID = 'myuserid'                                    # Azure SQL database userid
AZUREPWD = '************'                                # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net'   # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas'                                      # Azure SQL database name (if it does not exit, pandas will create it)
TABLE = 'DataTable'                                      # Azure SQL database table name
DRIVER = 'ODBC Driver 13 for SQL Server'                 # ODBC Driver

def main():
"""Main function"""

# 2. Build a connectionstring
connectionstring = 'mssql+pyodbc://{uid}:{password}@{server}:1433/{database}?driver={driver}'.format(
    uid=AZUREUID,
    password=AZUREPWD,
    server=AZURESRV,
    database=AZUREDB,
    driver=DRIVER.replace(' ', '+'))

# 3. Read dummydata into dataframe 
df = pd.read_csv('./data/data.csv')

# 4. Create SQL Alchemy engine and write data to SQL
engn = create_engine(connectionstring)
df.to_sql(TABLE, engn, if_exists='append')

# 5. Read data from SQL into dataframe
query = 'SELECT * FROM {table}'.format(table=TABLE)
dfsql = pd.read_sql(query, engn)

print(dfsql.head())

if __name__ == "__main__":
    main()

And finally, the following resources should assist in comparing specific implementations, with performance issues, with the below information where the Stack Overflow thread is likely the best resource but the Monitoring and Performance tuning document is useful to investigate and mitigate ay server-side performance issues, etc.

Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance

Regards,
Mike

Downsides or cons of using fast_executemany property?

Are there situations when it is not recommended to have it enabled for SQL Server tasks? Maybe if only doing singleton inserts all the time?

No, fast_executemany=True will have no effect on single-row inserts if pyodbc's .execute() method is called. One example is this pandas issue where the behaviour differs between a DataFrame with a single row (.execute()) and multiple rows (.executemany()). The fix for that particular issue would be for pandas to always call .executemany(), even if the DataFrame only has a single row. (Note also that fast_executemany=True does not cause the problem, it fixes the problem.)

However, there are a couple of other known issues with fast_executemany=True and .to_sql() in specific cases:

1. Databases with default "supplementary character" (_SC) collations

If the database is defined with a default "…_SC" collation, e.g.,

cnxn.execute(f"CREATE DATABASE {db_name} COLLATE Latin1_General_100_CI_AS_SC")

then .to_sql() will fail for strings longer than 2000 characters.

pyodbc issue on GitHub

2. DataFrames with a lot of NULL-like values

DataFrames that are relatively sparse (contain a lot of NULL-like values like None, NaN, NaT, etc.) can degrade the insert performance of .executemany(), although the worst-case would be that fast_executemany=True runs about as slowly as fast_executemany=False.

pyodbc issue on GitHub

3. Increasing memory consumption with `[n]varchar(max)` columns

to_sql() defaults to creating string columns as varchar(max) and that can cause memory bloat with fast_executemany=True.

pyodbc issue on GitHub

Pandas dataframe insert into SQL Server taking too long with execute and executemany

You can set fast_executemany in pyodbc itself for versions>=4.0.19. It is off by default.

import pyodbc

server_name = 'localhost'
database_name = 'AdventureWorks2019'
table_name = 'MyTable'
driver = 'ODBC Driver 17 for SQL Server'

connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes') 

cursor = connection.cursor()

cursor.fast_executemany = True   # reduce number of calls to server on inserts

# form SQL statement
columns = ", ".join(df.columns)

values = '('+', '.join(['?']*len(df.columns))+')'
      
statement = "INSERT INTO "+table_name+" ("+columns+") VALUES "+values

# extract values from DataFrame into list of tuples
insert = [tuple(x) for x in df.values]

cursor.executemany(statement, insert)

Or if you prefer sqlalchemy and dataframes directly.

import sqlalchemy as db

engine = db.create_engine('mssql+pyodbc://@'+server_name+'/'+database_name+'?trusted_connection=yes&driver='+driver, fast_executemany=True)

df.to_sql(table_name, engine, if_exists='append', index=False)

See fast_executemany in this link.

https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API

pyodbc/sqlAchemy enable fast execute many

The error you received is caused by changes introduced in Pandas version 0.23.0, reverted in 0.23.1, and reintroduced in 0.24.0, as explained here. The produced VALUES clause contains 100,000 parameter markers and it'd seem that the count is stored in a signed 16 bit integer, so it overflows and you get the funny

The SQL contains -31072 parameter markers, but 100000 parameters were supplied

You can check for yourself:

In [16]: 100000 % (2 ** 16) - 2 ** 16
Out[16]: -31072

If you would like to keep on using Pandas as is, you will have to calculate and provide a suitable chunksize value, such as the 100 you were using, taking into account both the maximum row limit of 1,000 for VALUES clause and the maximum parameter limit of 2,100 for stored procedures. The details are again explained in the linked Q/A.

Before the change Pandas used to always use executemany() when inserting data. Newer versions detect if the dialect in use supports VALUES clause in INSERT. This detection happens in SQLTable.insert_statement() and cannot be controlled, which is a shame since PyODBC fixed their executemany() performance, given the right flag is enabled.

In order to force Pandas to use executemany() with PyODBC again SQLTable has to be monkeypatched:

import pandas.io.sql

def insert_statement(self, data, conn):
    return self.table.insert(), data

pandas.io.sql.SQLTable.insert_statement = insert_statement

This will be horribly slow, if the Cursor.fast_executemany flag is not set, so remember to set the proper event handler.

Here is a simple performance comparison, using the following dataframe:

In [12]: df = pd.DataFrame({f'X{i}': range(1000000) for i in range(9)})

Vanilla Pandas 0.24.0:

In [14]: %time df.to_sql('foo', engine, chunksize=209)
CPU times: user 2min 9s, sys: 2.16 s, total: 2min 11s
Wall time: 2min 26s

Monkeypatched Pandas with fast executemany enabled:

In [10]: %time df.to_sql('foo', engine, chunksize=500000)
CPU times: user 12.2 s, sys: 981 ms, total: 13.2 s
Wall time: 38 s

Speeding Up Pandas.Dataframe.To_SQL with Fast_Executemany of Pyodbc