Pandas.To_Sql Replace Old Data With New Data Based on 'Unique Id'

pandas.to_sql replace old data with new data based on 'unique id'

There doesn't seem to be a built-in feature for doing this easily. Currently, I just drop the entire table and recreate a new one:

from sqlalchemy import MetaData, Table

meta = MetaData()
table_to_drop = Table('Database count details',
                      meta, autoload=True, autoload_with=engine)
table_to_drop.drop(engine)

Pandas to_sql() to update unique values in DB?

In pandas, there is no convenient argument in to_sql to append only non-duplicates to a final table. Consider using a staging temp table that pandas always replaces, and then run a final append query to migrate the temp table records into the final table, keeping only unique PKs with a NOT EXISTS clause.

import sqlalchemy

engine = sqlalchemy.create_engine(...)

df.to_sql(name='myTempTable', con=engine, if_exists='replace')

with engine.begin() as cn:
    sql = """INSERT INTO myFinalTable (Col1, Col2, Col3, ...)
             SELECT t.Col1, t.Col2, t.Col3, ...
             FROM myTempTable t
             WHERE NOT EXISTS
                 (SELECT 1 FROM myFinalTable f
                  WHERE t.MatchColumn1 = f.MatchColumn1
                    AND t.MatchColumn2 = f.MatchColumn2)"""

    cn.execute(sql)

This is an ANSI SQL solution, not restricted to vendor-specific methods such as UPSERT, so it works in practically all SQL relational databases.

REPLACE rows in mysql database table with pandas DataFrame

As of this version (0.17.1), I am unable to find any direct way to do this in pandas, so I reported a feature request for it.
I did this in my project by executing some queries using MySQLdb and then using DataFrame.to_sql(if_exists='append').

Suppose

1) product_id is my primary key in table PRODUCT

2) feed_id is my primary key in table XML_FEED.

SIMPLE VERSION

import MySQLdb
import sqlalchemy
import pandas

con = MySQLdb.connect('localhost', 'root', 'my_password', 'database_name')
con_str = 'mysql+mysqldb://root:my_password@localhost/database_name'
engine = sqlalchemy.create_engine(con_str)  # because I am using mysql
df = pandas.read_sql('SELECT * from PRODUCT', con=engine)
df_product_id = df['product_id']
product_id_str = (str(list(df_product_id.values))).strip('[]')
delete_str = 'DELETE FROM XML_FEED WHERE feed_id IN ({0})'.format(product_id_str)
cur = con.cursor()
cur.execute(delete_str)
con.commit()
df.to_sql('XML_FEED', if_exists='append', con=engine)  # you can use flavor='mysql' if you do not want to create a sqlalchemy engine, but it is deprecated

Please note:
The REPLACE [INTO] syntax allows us to INSERT a row into a table, except that if a UNIQUE KEY (including PRIMARY KEY) violation occurs, the old row is deleted before the new INSERT, so no violation is raised.
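
For illustration, a minimal sketch of issuing REPLACE INTO through SQLAlchemy; the connection string, the title column, and the literal values are placeholders, only feed_id comes from the example above:

import sqlalchemy

# placeholder connection string; adjust user/password/database to your setup
engine = sqlalchemy.create_engine('mysql+mysqldb://root:my_password@localhost/database_name')

with engine.begin() as conn:
    # if a row with feed_id = 42 already exists, MySQL deletes it first and
    # then inserts the new row, so no UNIQUE/PRIMARY KEY violation is raised
    conn.execute(
        sqlalchemy.text("REPLACE INTO XML_FEED (feed_id, title) VALUES (:feed_id, :title)"),
        {"feed_id": 42, "title": "updated row"},
    )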

pandas dataframe to_sql for replace and add new using sqlalchemy

if_exists='replace' is not a row-wise operation, so it does not check whether each row already exists and replace only that specific row. It checks whether the whole table is already there; if it finds the table, it drops the old table and inserts your new one.

Quoted from the docs:

replace: Drop the table before inserting new values.

What I think you should do is use if_exists='append', then check for duplicate rows and remove them. That would, for now, be the safest approach.

The method you are looking for is being worked on at the moment and is called upsert. It will only insert records that do not "clash", and you can prioritise the new or the old records. See the GitHub ticket.
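
Until upsert lands, one way to follow the append-and-deduplicate advice above is to filter the DataFrame against keys that are already in the table before appending. A rough sketch with toy data; the table name my_table and the key column id are placeholders:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///:memory:')  # placeholder engine

# toy setup: an existing table plus a new batch that partly overlaps it
pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']}).to_sql('my_table', engine, index=False)
df = pd.DataFrame({'id': [2, 3], 'value': ['b2', 'c']})

# keys already present in the target table
existing = pd.read_sql('SELECT id FROM my_table', con=engine)

# append only the rows whose key is not there yet (id 3 in this example)
new_rows = df[~df['id'].isin(existing['id'])]
new_rows.to_sql('my_table', con=engine, if_exists='append', index=False)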

Pandas to_sql - Increase table's index when appending DataFrame

Even though pandas has a lot of export options, it is not intended to be used as a database management API. Managing indexes is typically something the database should take care of.

I would suggest setting index=False, if_exists='append' and creating the table with an auto-increment index:

CREATE TABLE AReg (
    id INT NOT NULL AUTO_INCREMENT,
    # your fields here
    PRIMARY KEY (id)
);
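
With the table created like that, the corresponding write could look roughly like this; the connection string and the DataFrame contents are placeholders:

import pandas as pd
import sqlalchemy

# placeholder connection string for the database that holds AReg
engine = sqlalchemy.create_engine('mysql+mysqldb://user:password@localhost/database_name')

df = pd.DataFrame({'name': ['foo', 'bar']})  # your fields, without an id column

# append rows and let MySQL fill in the auto-increment primary key
df.to_sql('AReg', con=engine, if_exists='append', index=False)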

to_sql() method of pandas sends primary key column as NULL even if the column is not present in dataframe

Please note that pandas.DataFrame.to_sql() has the parameter index=True by default, which means that it will add an extra column (df.index) when inserting the data.

Some Databases like PostgreSQL have a data type serial which allows you to sequentially fill the column with incremental numbers.

Snowflake DB doesn't have that concept but instead, there are other ways to handle it:

First Option:
You can use the CREATE SEQUENCE statement to create a sequence directly in the DB - here is the official documentation on this topic. The downside of this approach is that you would need to convert your DataFrame into a proper SQL statement:

db preparation part:

CREATE OR REPLACE SEQUENCE schema.my_sequence START = 1 INCREMENT = 1;
CREATE OR REPLACE TABLE schema.my_table (i bigint, b text);

You would need to convert the DataFrame into Snowflake's INSERT statement and use schema.my_sequence.nextval to get the next ID value

INSERT INTO schema.my_table VALUES
(schema.my_sequence.nextval, 'string_1'),
(schema.my_sequence.nextval, 'string_2');

The result will be:

i b
1 string_1
2 string_2
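
As a rough sketch, the DataFrame-to-INSERT conversion mentioned above can be done with plain string building; the toy DataFrame below matches the example table, and the resulting statement would then be executed with your Snowflake connection:

import pandas as pd

df = pd.DataFrame({'b': ['string_1', 'string_2']})  # toy data for the example table

# build one multi-row INSERT, pulling each id from the sequence
values = ",\n".join(
    "(schema.my_sequence.nextval, '{}')".format(text.replace("'", "''"))
    for text in df['b']
)
insert_sql = "INSERT INTO schema.my_table VALUES\n{};".format(values)

print(insert_sql)  # run this statement through your Snowflake connection/cursor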

Please note that there are some limitations to this approach: you need to ensure that each insert statement you run this way succeeds, because calling schema.my_sequence.nextval without actually inserting the value will leave gaps in the numbering.
To avoid that, you can have a separate script that checks whether the current insert was successful and, if it was not, recreates the sequence by calling:

CREATE OR REPLACE SEQUENCE schema.my_sequence START = (SELECT max(i) FROM schema.my_table) INCREMENT = 1;

Alternative Option:
You would need to create an extra function that runs the SQL to get the last i you inserted previously.

SELECT max(i) AS max_i FROM schema.my_table;

and then update the index in your DataFrame before running to_sql()

df.index = range(max_i+1, len(df)+max_i+1)

This will ensure that your DataFrame index continues from the last i in your table.
Once that is done, you can use:

df.to_sql(index_label='i', name='my_table', con=connection_object)

It will use your index as one of the columns you insert, allowing you to maintain a unique index in the table.
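
Putting the alternative option together, a minimal sketch with toy data; the SQLite engine and table contents here stand in for your actual connection object and table:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///:memory:')  # stand-in for your connection

# toy setup: the existing table plus a new batch to append
pd.DataFrame({'i': [1, 2], 'b': ['string_1', 'string_2']}).to_sql('my_table', engine, index=False)
df = pd.DataFrame({'b': ['string_3', 'string_4']})

# continue numbering from the last i already in the table
max_i = pd.read_sql('SELECT max(i) AS max_i FROM my_table', con=engine)['max_i'].iloc[0]
df.index = range(max_i + 1, len(df) + max_i + 1)

# write the DataFrame index as the i column
df.to_sql('my_table', con=engine, if_exists='append', index_label='i')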

How to insert a pandas dataframe to an already existing table in a database?

Make use of the if_exists parameter:

df.to_sql('db_table2', engine, if_exists='replace')

or

df.to_sql('db_table2', engine, if_exists='append')

From the docstring:

"""
if_exists : {'fail', 'replace', 'append'}, default 'fail'
- fail: If table exists, do nothing.
- replace: If table exists, drop it, recreate it, and insert data.
- append: If table exists, insert data. Create if does not exist.
"""

How do I perform an UPDATE of existing rows of a db table using a Pandas DataFrame?

I think the easiest way would be to first delete those rows that are going to be "upserted". This can be done in a loop, but it's not very efficient for bigger data sets (5K+ rows), so I'd save this slice of the DF into a temporary MySQL table:

# assuming we have already changed values in the rows and saved those changed rows in a separate DF: `x`
x = df[mask]  # `mask` should help us to find changed rows...

# make sure `x` DF has a Primary Key column as index
x = x.set_index('a')

# dump a slice with changed rows to a temporary MySQL table
x.to_sql('my_tmp', engine, if_exists='replace', index=True)

conn = engine.connect()
trans = conn.begin()

try:
    # delete those rows that we are going to "upsert" (run on the same
    # connection so the delete is part of the transaction)
    conn.execute('delete from test_upsert where a in (select a from my_tmp)')
    trans.commit()

    # insert changed rows
    x.to_sql('test_upsert', engine, if_exists='append', index=True)
except:
    trans.rollback()
    raise

PS: I didn't test this code, so it might have some small bugs, but it should give you an idea...


