How to upsert pandas DataFrame to Microsoft SQL Server table?
Update, July 2022: You can save some typing by using this function to build the MERGE statement and perform the upsert for you.
SQL Server offers the MERGE statement:
import pandas as pd
import sqlalchemy as sa

connection_string = (
    "Driver=ODBC Driver 17 for SQL Server;"
    "Server=192.168.0.199;"
    "UID=scott;PWD=tiger^5HHH;"
    "Database=test;"
    "UseFMTONLY=Yes;"
)
connection_url = sa.engine.URL.create(
    "mssql+pyodbc",
    query={"odbc_connect": connection_string}
)
engine = sa.create_engine(connection_url, fast_executemany=True)

with engine.begin() as conn:
    # step 0.0 - create test environment
    conn.exec_driver_sql("DROP TABLE IF EXISTS main_table")
    conn.exec_driver_sql(
        "CREATE TABLE main_table (id int primary key, txt varchar(50))"
    )
    conn.exec_driver_sql(
        "INSERT INTO main_table (id, txt) VALUES (1, 'row 1 old text')"
    )
    # step 0.1 - create DataFrame to UPSERT
    df = pd.DataFrame(
        [(2, "new row 2 text"), (1, "row 1 new text")], columns=["id", "txt"]
    )
    # step 1 - upload DataFrame to temporary table
    df.to_sql("#temp_table", conn, index=False, if_exists="replace")
    # step 2 - merge temp_table into main_table
    conn.exec_driver_sql(
        """\
        MERGE main_table WITH (HOLDLOCK) AS main
        USING (SELECT id, txt FROM #temp_table) AS temp
        ON (main.id = temp.id)
        WHEN MATCHED THEN
            UPDATE SET txt = temp.txt
        WHEN NOT MATCHED THEN
            INSERT (id, txt) VALUES (temp.id, temp.txt);
        """
    )
    # step 3 - confirm results
    result = conn.exec_driver_sql(
        "SELECT * FROM main_table ORDER BY id"
    ).fetchall()
    print(result)
    # [(1, 'row 1 new text'), (2, 'new row 2 text')]
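The helper function mentioned in the July 2022 update is not reproduced here, but a minimal sketch of a MERGE-statement builder might look like the following. The function name, signature, and column-handling are my own illustration, not the linked implementation:

```python
def build_merge_sql(main_table, temp_table, key_cols, value_cols):
    """Build a T-SQL MERGE statement that upserts temp_table into main_table.

    key_cols are used to match rows; value_cols are updated on a match.
    """
    all_cols = list(key_cols) + list(value_cols)
    on_clause = " AND ".join(f"main.{c} = temp.{c}" for c in key_cols)
    update_clause = ", ".join(f"{c} = temp.{c}" for c in value_cols)
    col_list = ", ".join(all_cols)
    src_list = ", ".join(f"temp.{c}" for c in all_cols)
    return (
        f"MERGE {main_table} WITH (HOLDLOCK) AS main "
        f"USING (SELECT {col_list} FROM {temp_table}) AS temp "
        f"ON ({on_clause}) "
        f"WHEN MATCHED THEN UPDATE SET {update_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list});"
    )

# reproduces the hand-written statement from the example above
print(build_merge_sql("main_table", "#temp_table", ["id"], ["txt"]))
```

The returned string can then be passed to conn.exec_driver_sql() exactly as in step 2 above. Note that table and column names are interpolated directly, so they must come from trusted code, not user input.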
How to update db faster using pyodbc
Perhaps you can preprocess your data_json into two groups: rows that need to be inserted (their item_id does not already exist in the database) and rows that need to be updated (their item_id already exists in the database). Then you can use your df.to_sql call with method="multi" and chunksize=50. You may be able to convert data_json directly into a pandas DataFrame with pandas.DataFrame(data_json) (here I am assuming data_json is a Python object, probably a dictionary or a list of dictionaries, rather than a string in JSON format).

If you can convert data_json into a DataFrame first, you may be able to run
df = pd.DataFrame(data_json)
df.loc[lambda D: D.item_id.isin(setid_db), :].to_sql(<UPDATE>)
df.loc[lambda D: ~D.item_id.isin(setid_db), :].to_sql(<INSERT>)
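Concretely, the split can be sketched as follows; the sample data_json rows, the contents of setid_db, and the column names are made up for illustration, and the actual to_sql calls are omitted:

```python
import pandas as pd

# hypothetical incoming data, already parsed into Python objects
data_json = [
    {"item_id": 1, "price": 9.99},
    {"item_id": 2, "price": 4.50},
    {"item_id": 3, "price": 7.25},
]
# hypothetical set of item_ids already present in the database
setid_db = {1, 3}

df = pd.DataFrame(data_json)

# rows whose item_id already exists -> candidates for UPDATE
to_update = df.loc[df.item_id.isin(setid_db)]
# rows whose item_id is new -> candidates for INSERT
to_insert = df.loc[~df.item_id.isin(setid_db)]

print(sorted(to_update.item_id))  # -> [1, 3]
print(sorted(to_insert.item_id))  # -> [2]
```

From here, to_insert can go straight to df.to_sql(..., if_exists="append", method="multi", chunksize=50), while to_update would feed an UPDATE path such as the temp-table MERGE shown in the first answer.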