How to Write Dataframe to Postgres Table

How to write DataFrame to postgres table

Starting from pandas 0.14 (released end of May 2014), postgresql is supported. The sql module now uses sqlalchemy to support different database flavors. You can pass a sqlalchemy engine for a postgresql database (see docs). E.g.:

from sqlalchemy import create_engine
engine = create_engine('postgresql://username:password@localhost:5432/mydatabase')
df.to_sql('table_name', engine)
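By default, to_sql fails if the table already exists; the if_exists and index parameters control that behaviour and whether the DataFrame index is written as a column. For example:

df.to_sql('table_name', engine, if_exists='append', index=False)  # append rows, don't write the index column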

You are correct that in pandas up to version 0.13.1 postgresql was not supported. If you need to use an older version of pandas, here is a patched version of pandas.io.sql: https://gist.github.com/jorisvandenbossche/10841234.

I wrote this a while ago, so I cannot fully guarantee that it always works, but the basis should be there. If you put that file in your working directory and import it, then you should be able to do the following (where con is a postgresql connection):

import sql  # the patched version (file is named sql.py)
sql.write_frame(df, 'table_name', con, flavor='postgresql')

How to upsert pandas DataFrame to PostgreSQL table?

If you are using PostgreSQL 9.5 or later you can perform the UPSERT using a temporary table and an INSERT ... ON CONFLICT statement:

import pandas as pd
import sqlalchemy as sa

# …

with engine.begin() as conn:
    # step 0.0 - create test environment
    conn.exec_driver_sql("DROP TABLE IF EXISTS main_table")
    conn.exec_driver_sql(
        "CREATE TABLE main_table (id int primary key, txt varchar(50))"
    )
    conn.exec_driver_sql(
        "INSERT INTO main_table (id, txt) VALUES (1, 'row 1 old text')"
    )
    # step 0.1 - create DataFrame to UPSERT
    df = pd.DataFrame(
        [(2, "new row 2 text"), (1, "row 1 new text")], columns=["id", "txt"]
    )

    # step 1 - create temporary table and upload DataFrame
    conn.exec_driver_sql(
        "CREATE TEMPORARY TABLE temp_table AS SELECT * FROM main_table WHERE false"
    )
    df.to_sql("temp_table", conn, index=False, if_exists="append")

    # step 2 - merge temp_table into main_table
    conn.exec_driver_sql(
        """\
        INSERT INTO main_table (id, txt)
        SELECT id, txt FROM temp_table
        ON CONFLICT (id) DO
            UPDATE SET txt = EXCLUDED.txt
        """
    )

    # step 3 - confirm results
    result = conn.exec_driver_sql("SELECT * FROM main_table ORDER BY id").all()
    print(result)  # [(1, 'row 1 new text'), (2, 'new row 2 text')]
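If you'd rather avoid the explicit temporary table, pandas 0.24+ also accepts a callable for to_sql's method parameter. A minimal sketch (assuming the same main_table with id as its primary key; the function name is just illustrative) that issues the INSERT ... ON CONFLICT directly through SQLAlchemy's PostgreSQL dialect:

from sqlalchemy.dialects.postgresql import insert

def insert_on_conflict_update(pd_table, conn, keys, data_iter):
    # convert each chunk of rows into dicts keyed by column name
    data = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(pd_table.table).values(data)
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],  # assumes "id" is the conflict target / primary key
        set_={k: stmt.excluded[k] for k in keys if k != "id"},
    )
    conn.execute(stmt)

df.to_sql("main_table", engine, index=False, if_exists="append",
          method=insert_on_conflict_update)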

Dataframe to PostgreSQL DB

If I'm correct in assuming that the data is already in the DataFrame, you should just be able to do:

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
df = df.drop_duplicates(subset=None)  # replace None with the list of columns that define the primary key, e.g. ['column_name1', 'column_name2']
df.to_sql('table_main', engine, if_exists='append')

Edit due to comment:

If that's the case, you have the right idea. You can make it more efficient by using to_sql to insert the data into a temp table first, like so:

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
df.to_sql('table_temp', engine, if_exists='replace')

# cur and conn below are a psycopg2 cursor and connection to the same database
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
# cur.execute("""DROP TABLE table_temp CASCADE;""")  # you can drop it if you want, but the replace option in to_sql will drop and recreate the table anyway
conn.commit()

How to write data frame to Postgres table without using SQLAlchemy engine?

You can use those connections and avoid SQLAlchemy. This is going to sound rather unintuitive, but it will be much faster than regular inserts (even if you were to drop the ORM and make a general query e.g. with executemany). Inserts are slow, even with raw queries, but you'll see that COPY is mentioned several times in How to speed up insertion performance in PostgreSQL. In this instance, my motivations for the approach below are:

  1. Use COPY instead of INSERT
  2. Don't trust Pandas to generate the correct SQL for this operation (although, as noted by Ilja Everilä, this approach actually got added to Pandas in V0.24)
  3. Don't write the data to disk to make an actual file object; keep it all in memory

Suggested approach using cursor.copy_from():

import csv
import io
import psycopg2

df = "<your_df_here>"

# drop all the columns you don't want in the insert data here

# First take the headers
headers = df.columns

# Now get a nested list of values
data = df.values.tolist()

# Create an in-memory CSV file
string_buffer = io.StringIO()
csv_writer = csv.writer(string_buffer)
csv_writer.writerows(data)

# Reset the buffer back to the first line
string_buffer.seek(0)

# Open a connection to the db (which I think you already have available)
with psycopg2.connect(dbname=current_app.config['POSTGRES_DB'],
                      user=current_app.config['POSTGRES_USER'],
                      password=current_app.config['POSTGRES_PW'],
                      host=current_app.config['POSTGRES_URL']) as conn:
    c = conn.cursor()

    # Now upload the data as though it was a file
    c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
    conn.commit()

This should be orders of magnitude faster than actually doing inserts.
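One caveat: copy_from with sep=',' does not understand CSV quoting, so values that themselves contain commas or newlines will break the load. A hedged variant (assuming the same string_buffer, headers, cursor c, and connection conn as above) is copy_expert with an explicit COPY ... FROM STDIN WITH (FORMAT csv), which lets PostgreSQL's CSV parser handle quoting:

# let PostgreSQL's CSV parser handle quoted fields and embedded commas
string_buffer.seek(0)
copy_sql = "COPY the_table_name ({}) FROM STDIN WITH (FORMAT csv)".format(
    ", ".join(headers)
)
c.copy_expert(copy_sql, string_buffer)
conn.commit()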

Writing dataframe to Postgres database psycopg2

If you use pd.DataFrame.to_sql, you can supply the index_label parameter to write the DataFrame index as a column with that name:

data_pandas.to_sql('FiguresUSAByState', con=dbConnection, index_label='Index')

If you would prefer to stick with the custom SQL and for loop you have, you will need to call reset_index first:

for row in data_pandas.reset_index().to_dict('records'):
    query = """
        INSERT INTO FiguresUSAByState(index, Province_State, NumberByState) VALUES (%i, '%s', %i);
    """ % (row['index'], row['Province_State'], row['NumberByState'])
    # execute each statement here, e.g. with a psycopg2 cursor: cursor.execute(query)

Note that the default name for the new column is index, uncapitalized, rather than Index.
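For example (a quick illustration, assuming the same DataFrame as above):

print(data_pandas.reset_index().columns.tolist())
# ['index', 'Province_State', 'NumberByState']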

How to create a postgres table from a pandas dataframe?

to_sql will create a table with the name you specify if it does not already exist: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
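A minimal sketch (table and column names are placeholders, and engine is a SQLAlchemy engine created as shown earlier):

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})

# creates the table if it does not already exist; with the default
# if_exists='fail', an existing table raises an error instead
df.to_sql('my_new_table', engine, index=False)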


