How to write DataFrame to postgres table
Starting from pandas 0.14 (released end of May 2014), PostgreSQL is supported. The sql module now uses SQLAlchemy to support different database flavors. You can pass a SQLAlchemy engine for a PostgreSQL database (see docs). E.g.:
from sqlalchemy import create_engine
engine = create_engine('postgresql://username:password@localhost:5432/mydatabase')
df.to_sql('table_name', engine)
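One thing to watch with a URL like the one above: if the password (or username) contains characters such as @ or /, they need to be percent-encoded or the URL will not parse as intended. A minimal sketch using only the standard library; the helper name make_postgres_url and the sample credentials are made up for illustration:

```python
from urllib.parse import quote_plus


def make_postgres_url(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials; the password contains '@' and '/'.
url = make_postgres_url("username", "p@ss/word", "localhost", 5432, "mydatabase")
print(url)
```

The resulting string can be passed to create_engine in place of the literal URL.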
You are correct that in pandas up to version 0.13.1 PostgreSQL was not supported. If you need to use an older version of pandas, here is a patched version of pandas.io.sql: https://gist.github.com/jorisvandenbossche/10841234.
I wrote this a while ago, so I cannot fully guarantee that it always works, but the basis should be there. If you put that file in your working directory and import it, then you should be able to do the following (where con is a PostgreSQL connection):
import sql # the patched version (file is named sql.py)
sql.write_frame(df, 'table_name', con, flavor='postgresql')
How to upsert pandas DataFrame to PostgreSQL table?
If you are using PostgreSQL 9.5 or later, you can perform the UPSERT using a temporary table and an INSERT ... ON CONFLICT statement:
import pandas as pd
import sqlalchemy as sa

# …
with engine.begin() as conn:
    # step 0.0 - create test environment
    conn.exec_driver_sql("DROP TABLE IF EXISTS main_table")
    conn.exec_driver_sql(
        "CREATE TABLE main_table (id int primary key, txt varchar(50))"
    )
    conn.exec_driver_sql(
        "INSERT INTO main_table (id, txt) VALUES (1, 'row 1 old text')"
    )
    # step 0.1 - create DataFrame to UPSERT
    df = pd.DataFrame(
        [(2, "new row 2 text"), (1, "row 1 new text")], columns=["id", "txt"]
    )
    # step 1 - create temporary table and upload DataFrame
    conn.exec_driver_sql(
        "CREATE TEMPORARY TABLE temp_table AS SELECT * FROM main_table WHERE false"
    )
    df.to_sql("temp_table", conn, index=False, if_exists="append")
    # step 2 - merge temp_table into main_table
    conn.exec_driver_sql(
        """\
        INSERT INTO main_table (id, txt)
        SELECT id, txt FROM temp_table
        ON CONFLICT (id) DO
        UPDATE SET txt = EXCLUDED.txt
        """
    )
    # step 3 - confirm results
    result = conn.exec_driver_sql("SELECT * FROM main_table ORDER BY id").all()
    print(result)  # [(1, 'row 1 new text'), (2, 'new row 2 text')]
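For tables with more columns, the merge statement in step 2 can be generated rather than typed by hand. The following is only a sketch: build_upsert_sql is a hypothetical helper, not part of pandas or SQLAlchemy, and it assumes the table and column names come from trusted code, since they are interpolated into the SQL as-is.

```python
def build_upsert_sql(target, staging, key_cols, all_cols):
    """Return an INSERT ... ON CONFLICT statement that merges staging into target.

    Identifiers are interpolated as-is, so they must come from trusted code,
    never from user input.
    """
    update_cols = [c for c in all_cols if c not in key_cols]
    col_list = ", ".join(all_cols)
    key_list = ", ".join(key_cols)
    if update_cols:
        assignments = ", ".join(f"{c} = EXCLUDED.{c}" for c in update_cols)
        action = f"DO UPDATE SET {assignments}"
    else:
        # Nothing besides the key to update, so just skip conflicting rows.
        action = "DO NOTHING"
    return (
        f"INSERT INTO {target} ({col_list}) "
        f"SELECT {col_list} FROM {staging} "
        f"ON CONFLICT ({key_list}) {action}"
    )


sql = build_upsert_sql("main_table", "temp_table", ["id"], ["id", "txt"])
print(sql)
```

The generated string can be passed to conn.exec_driver_sql() in place of the hand-written statement.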
Dataframe to PostgreSQL DB
If I'm correct in assuming that the data is in the data frame, you should just be able to do:
from sqlalchemy import create_engine
engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
# Replace None with the list of column names that define the primary key, e.g. ['column_name1', 'column_name2'].
# drop_duplicates returns a new DataFrame, so the result has to be assigned back.
df = df.drop_duplicates(subset=None)
df.to_sql('table_main', engine, if_exists='append')
Edit due to comment:
If that's the case, you have the right idea. You can make it more efficient by using to_sql to insert the data into the temp table first, like so:
engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
df.to_sql('table_temp', engine, if_exists='replace')
# cur and conn are presumably the cursor and connection from your existing code
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
# Optional: the replace option in to_sql already drops and recreates the table
# cur.execute("""DROP TABLE table_temp CASCADE;""")
conn.commit()
How to write data frame to Postgres table without using SQLAlchemy engine?
You can use those connections and avoid SQLAlchemy. This is going to sound rather unintuitive, but it will be much faster than regular inserts (even if you were to drop the ORM and make a general query, e.g. with executemany). Inserts are slow, even with raw queries, but you'll see that COPY is mentioned several times in How to speed up insertion performance in PostgreSQL. In this instance, my motivations for the approach below are:
- Use COPY instead of INSERT
- Don't trust Pandas to generate the correct SQL for this operation (although, as noted by Ilja Everilä, this approach actually got added to Pandas in V0.24)
- Don't write the data to disk to make an actual file object; keep it all in memory
Suggested approach using cursor.copy_from():
import csv
import io
import psycopg2

df = "<your_df_here>"
# drop all the columns you don't want in the insert data here
# First take the headers
headers = df.columns
# Now get a nested list of values
data = df.values.tolist()
# Create an in-memory CSV file
string_buffer = io.StringIO()
csv_writer = csv.writer(string_buffer)
csv_writer.writerows(data)
# Reset the buffer back to the first line
string_buffer.seek(0)
# Open a connection to the db (which I think you already have available)
with psycopg2.connect(dbname=current_app.config['POSTGRES_DB'],
                      user=current_app.config['POSTGRES_USER'],
                      password=current_app.config['POSTGRES_PW'],
                      host=current_app.config['POSTGRES_URL']) as conn:
    c = conn.cursor()
    # Now upload the data as though it was a file
    c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
    conn.commit()
This should be orders of magnitude faster than actually doing inserts.
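One caveat with copy_from: it reads PostgreSQL's text format, while csv.writer produces quoted CSV, so a value containing a comma, a quote, or a newline would not load cleanly. A possible variation is to let the server parse real CSV through psycopg2's cursor.copy_expert(). The sketch below has not been run against a live database; the helper names and the sample rows are assumptions for illustration. It relies on csv.writer writing None as an empty field, which COPY's CSV format reads as NULL by default.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Write rows (iterables of values) to an in-memory CSV file and rewind it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


def copy_rows(cursor, table, columns, rows):
    """Load rows through COPY ... FROM STDIN in CSV format.

    `cursor` is expected to be a psycopg2 cursor; `table` and `columns`
    are interpolated as-is and must come from trusted code.
    """
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    cursor.copy_expert(sql, rows_to_csv_buffer(rows))


# The buffer can be inspected without a database connection:
sample = rows_to_csv_buffer([(1, "plain"), (2, "has, comma"), (3, None)])
print(sample.getvalue())
```

With a DataFrame, this could be called as copy_rows(c, 'the_table_name', list(df.columns), df.itertuples(index=False, name=None)), followed by conn.commit().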
Writing dataframe to Postgres database psycopg2
If you use pd.DataFrame.to_sql, you can supply the index_label parameter to name the column that the index is written to.
data_pandas.to_sql('FiguresUSAByState', con=dbConnection, index_label='Index')
If you would prefer to stick with the custom SQL and for loop you have, you will need to reset_index first.
# 'records' gives one dict per row; the old 'rows' abbreviation is no longer accepted by recent pandas
for row in data_pandas.reset_index().to_dict('records'):
    query = """
    INSERT into FiguresUSAByState(index, Province_State, NumberByState) values(%i, '%s', %i);
    """ % (row['index'], row['Province_State'], row['NumberByState'])
Note that the default name for the new column is index, uncapitalized, rather than Index.
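Building the statement with % string formatting breaks as soon as a value contains a quote, and it is open to SQL injection. psycopg2 can bind the values itself when they are passed as a second argument to execute() or executemany(). A minimal sketch, assuming cursor is a psycopg2 cursor and the table and columns are the ones used in the loop; the helper name insert_rows is hypothetical:

```python
INSERT_SQL = (
    "INSERT INTO FiguresUSAByState (index, Province_State, NumberByState) "
    "VALUES (%s, %s, %s)"
)


def insert_rows(cursor, rows):
    """Insert dict-like rows using driver-side parameter binding.

    psycopg2 uses %s as the placeholder for every type; the values are
    passed separately and never formatted into the SQL string.
    """
    params = [
        (row["index"], row["Province_State"], row["NumberByState"])
        for row in rows
    ]
    cursor.executemany(INSERT_SQL, params)
    return len(params)
```

It could be called as insert_rows(cursor, data_pandas.reset_index().to_dict('records')), followed by a commit on the connection.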
How to create a postgres table from a pandas dataframe?
With DataFrame.to_sql, a table with the name you specify will be created if it does not already exist. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html