How to Speed Up Bulk Insert to MS SQL Server Using pyodbc

How to speed up bulk insert to MS SQL Server using pyodbc

Update - May 2022: bcpandas and bcpyaz are wrappers for Microsoft's bcp utility.


Update - April 2019: As noted in the comment from @SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).


BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...

1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07

... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contain commas) ...

1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07

... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...

1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07

... or a tab-delimited file (where → represents the tab character) ...

1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07

... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:

import pypyodbc

# connect through an ODBC DSN that points at the SQL Server instance
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()

# BULK INSERT runs on the SQL Server machine, so the file path must be
# readable by the SQL Server service account (see the note below)
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
    FIELDTERMINATOR='\\t',
    ROWTERMINATOR='\\n'
);
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
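
The pre-processing step mentioned above can be done with Python's standard csv module. A minimal sketch, assuming the source file is named source.csv (a placeholder name) and the tab-delimited output is written to the same path used in the BULK INSERT above:

import csv

# Sketch only: convert a comma-delimited file with text qualifiers into a
# tab-delimited file that BULK INSERT can handle. "source.csv" is a
# placeholder; the output path matches the BULK INSERT example above.
# Assumes the field values themselves contain no tab characters.
with open("source.csv", "r", newline="", encoding="utf-8") as f_in, \
        open("C:\\__tmp\\biTest.txt", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.reader(f_in)  # the csv module handles the "..." text qualifiers
    writer = csv.writer(f_out, delimiter="\t",
                        quoting=csv.QUOTE_NONE, escapechar="\\")
    for row in reader:
        writer.writerow(row)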

Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.

Speed up inserts into SQL Server from pyodbc

UPDATE: pyodbc 4.0.19 added a Cursor#fast_executemany option that can greatly improve performance by avoiding the behaviour described below. See this answer for details.
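
As a quick illustration (not part of the original answer), enabling the option is just a matter of setting the attribute on the cursor before calling .executemany; the connection string and table name here are placeholders:

import pyodbc

# placeholder connection details; adjust driver/server/database for your environment
cnxn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myServer;DATABASE=myDb;Trusted_Connection=yes;"
)
crsr = cnxn.cursor()

# with fast_executemany enabled, pyodbc sends the parameters as a bulk
# parameter array instead of one sp_prepexec round-trip per row
crsr.fast_executemany = True

sql = "INSERT INTO dbo.MyTable (id, txtcol) VALUES (?, ?)"
params = [(1, 'foo'), (2, 'bar'), (3, 'baz')]
crsr.executemany(sql, params)
cnxn.commit()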


Your code does follow proper form (aside from the few minor tweaks mentioned in the other answer), but be aware that when pyodbc performs an .executemany what it actually does is submit a separate sp_prepexec for each individual row. That is, for the code

sql = "INSERT INTO #Temp (id, txtcol) VALUES (?, ?)"
params = [(1, 'foo'), (2, 'bar'), (3, 'baz')]
crsr.executemany(sql, params)

SQL Server actually performs the following (as confirmed by SQL Profiler):

exec sp_prepexec @p1 output,N'@P1 bigint,@P2 nvarchar(3)',N'INSERT INTO #Temp (id, txtcol) VALUES (@P1, @P2)',1,N'foo'
exec sp_prepexec @p1 output,N'@P1 bigint,@P2 nvarchar(3)',N'INSERT INTO #Temp (id, txtcol) VALUES (@P1, @P2)',2,N'bar'
exec sp_prepexec @p1 output,N'@P1 bigint,@P2 nvarchar(3)',N'INSERT INTO #Temp (id, txtcol) VALUES (@P1, @P2)',3,N'baz'

So, for an .executemany "batch" of 10,000 rows you would be

  • performing 10,000 individual inserts,
  • with 10,000 round-trips to the server, and
  • sending the identical SQL command text (INSERT INTO ...) 10,000 times.

It is possible to have pyodbc send an initial sp_prepare and then do an .executemany calling sp_execute, but the nature of .executemany is that you still would do 10,000 sp_prepexec calls, just executing sp_execute instead of INSERT INTO .... That could improve performance if the SQL statement was quite long and complex, but for a short one like the example in your question it probably wouldn't make all that much difference.

One could also get creative and build "table value constructors" as illustrated in this answer, but notice that it is only offered as a "Plan B" when native bulk insert mechanisms are not a feasible solution.

pyodbc - very slow bulk insert speed

I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement, and close the connection. It then repeated this process for each row.

I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.

BULK INSERT into SQL Server table using pyodbc: cannot find file

The BULK INSERT statement is executed on the SQL Server machine, so the file path must be accessible from that machine. You are getting "The system cannot find the path specified" because the path

C:\\Users\\kdalal\\callerx_project\\caller_x\\new_file_name.csv

is a path on your machine, not the SQL Server machine.

Since you are dumping the contents of a dataframe to the CSV file, you could simply use df.to_sql to push the contents directly to SQL Server without an intermediate CSV file. To improve performance you can tell SQLAlchemy to use pyodbc's fast_executemany option, as described in the related question:

Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
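
A minimal sketch of that approach (the connection URL, table name, and sample data are placeholders, not taken from the original question):

import pandas as pd
import sqlalchemy as sa

df = pd.DataFrame({"id": [1, 2], "name": ["Gord", "Bob"]})

# fast_executemany=True makes the mssql+pyodbc dialect enable pyodbc's
# fast_executemany on the cursors it creates for the insert
engine = sa.create_engine(
    "mssql+pyodbc://@myServer/myDb"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
    fast_executemany=True,
)

# write the DataFrame straight to the table, no intermediate CSV file needed
df.to_sql("my_table", engine, schema="dbo", if_exists="append", index=False)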

basic pyodbc bulk insert

The best way to handle this is to use the pyodbc cursor's executemany method.

# read the source rows from the first connection
ds1Cursor.execute(selectSql)
result = ds1Cursor.fetchall()

# insert them into the destination table with a parameterized executemany
ds2Cursor.executemany('INSERT INTO [TableName] (Col1, Col2, Col3) VALUES (?, ?, ?)', result)
ds2Cursor.commit()

Speed up Python executemany with an Insert or Ignore statement

I made several attempts to speed up the query and gathered some insights which I wanted to share with everybody who may encounter the same issue:

Takeaways:

  1. When using Azure SQL Server, always try to use the INSERT INTO ... VALUES (...) statement instead of INSERT INTO ... SELECT ..., as it performs about 350% faster (when benchmarked for the described problem and syntax).
    • The main reason I used INSERT INTO ... SELECT ... was the specific DATEADD() cast, since you can't do that without explicitly declaring variables in Azure SQL Server.
  2. You can skip the DATEADD() in the given example if you cast the provided time to a Python datetime. If you choose this option, make sure not to use literal strings when inserting the data into your SQL table. Besides being bad practice, as addressed by @Charlieface, pyodbc has no built-in conversion logic for that datatype when it is passed as a string literal (the sequence-of-sequences input structure has no problem here).
  3. The IF NOT EXISTS statement is really expensive. Try to omit it if possible. A simple workaround, if you need to preserve your table's history, is to bulk-insert into a freshly created staging table and then insert from that table into your original table wherever no match is found (a sketch follows after the performance summary below). Here you can rely on the native SQL implementation instead of the pyodbc implementation. This way was by far the fastest.

The different design choices resulted in the following performance improvements:

  • INSERT INTO ... SELECT ... vs INSERT INTO ... VALUES (...): 350%
  • Leveraging a second table and native SQL support: 560%
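
A rough sketch of the staging-table approach from point 3 above (the DSN, table, and column names are illustrative, not taken from the original question):

import pyodbc
from datetime import datetime

# placeholder connection; adjust for your Azure SQL Server instance
cnxn = pyodbc.connect("DSN=myAzureDb;")
crsr = cnxn.cursor()
crsr.fast_executemany = True

# 1) bulk-load everything into a freshly created staging table
crsr.execute("""
    CREATE TABLE dbo.EventsStaging (
        event_id int           NOT NULL,
        event_ts datetime2(0)  NOT NULL,
        payload  nvarchar(100) NULL
    );
""")
rows = [(1, datetime(2022, 5, 1, 12, 0), 'foo'),
        (2, datetime(2022, 5, 1, 12, 5), 'bar')]
crsr.executemany(
    "INSERT INTO dbo.EventsStaging (event_id, event_ts, payload) VALUES (?, ?, ?)",
    rows,
)

# 2) let SQL Server do the "insert where no match exists" in one set-based statement
crsr.execute("""
    INSERT INTO dbo.Events (event_id, event_ts, payload)
    SELECT s.event_id, s.event_ts, s.payload
    FROM dbo.EventsStaging AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM dbo.Events AS e WHERE e.event_id = s.event_id
    );
""")
crsr.execute("DROP TABLE dbo.EventsStaging;")
cnxn.commit()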

Faster solution than executemany to insert multiple rows at once in pyodbc

Is this solution correct?

The solution you propose, which is to build a table value constructor (TVC), is not incorrect but it is really not necessary. pyodbc with fast_executemany=True and Microsoft's ODBC Driver 17 for SQL Server is about as fast as you're going to get short of using BULK INSERT or bcp as described in this answer.

Does it have some vulnerabilities?

Since you are building a TVC for a parameterized query you are protected from SQL Injection vulnerabilities, but there are still a couple of implementation considerations:

  1. A TVC can insert a maximum of 1000 rows at a time.

  2. pyodbc executes SQL statements by calling a system stored procedure, and stored procedures in SQL Server can accept a maximum of 2100 parameters, so the number of rows your TVC can insert is also limited by the requirement that number_of_rows * number_of_columns < 2100.

In other words, your TVC approach will be limited to a "chunk size" of 1000 rows or less. The actual calculation is described in this answer.
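
For illustration only (not from the original answer), a chunked, parameterized TVC insert that respects both limits might look like the sketch below; the table and column names are placeholders, and cnxn is assumed to be an open pyodbc connection:

def insert_with_tvc(cnxn, rows, table="dbo.MyTable", columns=("id", "name")):
    """Insert rows with table value constructors, chunked so each statement
    stays under both the 1000-row TVC limit and the 2100-parameter limit."""
    num_cols = len(columns)
    chunk_size = min(1000, 2100 // num_cols - 1)  # leave a little headroom
    col_list = ", ".join(columns)
    row_placeholder = "(" + ", ".join("?" * num_cols) + ")"
    crsr = cnxn.cursor()
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        sql = (f"INSERT INTO {table} ({col_list}) VALUES "
               + ", ".join([row_placeholder] * len(chunk)))
        # flatten the chunk into one flat parameter sequence for this statement
        params = [value for row in chunk for value in row]
        crsr.execute(sql, params)
    cnxn.commit()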


