Bulk/batch update/upsert in PostgreSQL
I've used 3 strategies for batch transactional work:
- Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts in this way, and it was quite efficient (done against Postgres).
- JDBC has batching capabilities built in: queue statements with addBatch() and send them together with executeBatch(), inside a single transaction if they must commit as one unit. This tactic requires fewer database round trips, as the statements are all sent in one batch.
- Hibernate also supports JDBC batching along the lines of the previous example, but in this case you call the flush() method on the Hibernate Session, not on the underlying JDBC connection. It accomplishes the same thing as JDBC batching.
Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, then when fetching associations Hibernate will use IN instead of =, leading to fewer SELECT statements to load the collections.
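As a rough illustration of the first strategy, here is a minimal Python sketch that splits rows into batches and builds one multi-row INSERT per batch. It swaps string concatenation with semicolons for a single statement with bound parameters, which is a variant of the idea and not the exact method described above. The helper names (chunked, build_multirow_insert, flatten), the users table, and the %s placeholder style (as used by psycopg2) are assumptions for the example.

```python
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[object, ...]


def chunked(rows: Sequence[Row], size: int) -> Iterator[Sequence[Row]]:
    """Yield consecutive slices of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_multirow_insert(table: str, columns: Sequence[str], n_rows: int) -> str:
    """Return one INSERT statement with `n_rows` placeholder groups.

    Table and column names are interpolated as-is, so they must come from
    trusted code, never from user input.
    """
    group = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values = ", ".join([group] * n_rows)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values}"


def flatten(rows: Sequence[Row]) -> List[object]:
    """Flatten rows into the single parameter list the statement expects."""
    return [value for row in rows for value in row]


# Hypothetical data: three rows sent in batches of two.
rows = [(1, "Peter"), (2, "Anna"), (3, "Li")]
for batch in chunked(rows, 2):
    sql = build_multirow_insert("users", ["id", "username"], len(batch))
    params = flatten(batch)
    # With a real connection this would be: cursor.execute(sql, params)
    print(sql, params)
```

Each batch becomes one round trip to the database, which is where the saving comes from; the batch size of 100 mentioned above is a reasonable starting point to measure against.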
Bulk insert, update if on conflict (bulk upsert) on Postgres
It turns out that a special table named excluded contains the row proposed for insertion (a strange name, though):
insert into USERS(
id, username, profile_picture)
select unnest(array['12345']),
unnest(array['Peter']),
unnest(array['someURL'])
on conflict (id) do
update set
username = excluded.username,
profile_picture = excluded.profile_picture;
http://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT
The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the existing row using the table's name (or an alias), and to rows proposed for insertion using the special excluded table...
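To extend the one-row example to many rows, the statement can be generated and fed one array per column. The following is a sketch only: build_unnest_upsert and to_column_arrays are hypothetical helpers, and it assumes a driver that adapts Python lists to PostgreSQL arrays for %s placeholders (psycopg2 does this).

```python
from typing import List, Sequence, Tuple


def build_unnest_upsert(table: str, key: str, columns: Sequence[str]) -> str:
    """Build INSERT ... SELECT unnest(...) ... ON CONFLICT (key) DO UPDATE.

    The statement takes one array parameter per column. Every non-key
    column is overwritten with the value from the `excluded` row.
    Identifiers are interpolated as-is and must come from trusted code.
    """
    updates = [c for c in columns if c != key]
    if not updates:
        raise ValueError("need at least one non-key column to update")
    select_list = ", ".join(["unnest(%s)"] * len(columns))
    set_list = ", ".join(f"{c} = excluded.{c}" for c in updates)
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"SELECT {select_list} "
        f"ON CONFLICT ({key}) DO UPDATE SET {set_list}"
    )


def to_column_arrays(rows: Sequence[Tuple[object, ...]]) -> List[List[object]]:
    """Transpose row tuples into one list per column (equal lengths)."""
    return [list(col) for col in zip(*rows)]


# Hypothetical rows matching the USERS example above.
rows = [("12345", "Peter", "someURL"), ("67890", "Anna", "otherURL")]
sql = build_unnest_upsert("users", "id", ["id", "username", "profile_picture"])
params = to_column_arrays(rows)
# With a real connection this would be: cursor.execute(sql, params)
print(sql)
print(params)
```

Note that a single INSERT ... ON CONFLICT DO UPDATE cannot touch the same target row twice, so the batch should not contain duplicate ids.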
Most efficient way to do a bulk UPDATE with pairs of input
Normally you want to batch-update from a table with a sufficient index to make the merge easy:
CREATE TEMP TABLE updates_table
( id integer not null primary key
, val varchar
);
INSERT into updates_table(id, val) VALUES
( 1, 'foo' ) ,( 2, 'bar' ) ,( 3, 'baz' )
;
UPDATE target_table t
SET value = u.val
FROM updates_table u
WHERE t.id = u.id
;
So you should probably populate your updates_table with something like:
INSERT into updates_table(id, val)
SELECT
split_part(x,',',1)::INT AS id,
split_part(x,',',2)::VARCHAR AS val
-- x names the text column produced by UNNEST
FROM UNNEST(ARRAY['1,foo','2,bar','3,baz']) AS x
;
Remember: an index (or the primary key) on the id field in the updates_table is important (although for small sets like this one, a hash join will probably be chosen by the optimiser).
In addition: for updates, it is important to avoid updating a row to the value it already has. Such updates create extra row versions, plus the resulting VACUUM activity after the update has been committed:
UPDATE target_table t
SET value = u.val
FROM updates_table u
WHERE t.id = u.id
AND (t.value IS NULL OR t.value <> u.val)
;
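For small sets of pairs, the same update can be sketched without a temp table by joining against an inline VALUES list. This is a swapped-in variant, not the temp-table method above, and it uses IS DISTINCT FROM as a NULL-safe replacement for the IS NULL OR <> test. The helper names are hypothetical, the %s placeholders assume a psycopg2-style driver, and depending on the driver the parameters may need explicit casts (for example %s::int) so that u.id matches the type of t.id.

```python
from typing import List, Sequence, Tuple


def dedupe_last(pairs: Sequence[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Keep one pair per id (the last one seen).

    If several source rows match one target row, UPDATE ... FROM applies
    only one of them, and which one is not predictable.
    """
    return list(dict(pairs).items())


def build_values_update(table: str, key: str, column: str, n_rows: int) -> str:
    """Build UPDATE ... FROM (VALUES ...) that skips unchanged rows.

    Identifiers are interpolated as-is and must come from trusted code.
    """
    pairs = ", ".join(["(%s, %s)"] * n_rows)
    return (
        f"UPDATE {table} AS t SET {column} = u.val "
        f"FROM (VALUES {pairs}) AS u(id, val) "
        f"WHERE t.{key} = u.id AND t.{column} IS DISTINCT FROM u.val"
    )


# Hypothetical input with a repeated id.
pairs = dedupe_last([(1, "foo"), (2, "bar"), (1, "baz")])
sql = build_values_update("target_table", "id", "value", len(pairs))
params = [v for pair in pairs for v in pair]
# With a real connection this would be: cursor.execute(sql, params)
print(sql, params)
```

For large sets, the indexed temp table described above remains the better fit, since the planner can use its statistics and index for the join.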
How do I increase the speed of a bulk UPSERT in postgreSQL?
Sorting arglist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups will be hitting already cached pages. Having the table also be clustered on that index would help make sure the table pages are also accessed in a cache friendly way (although it won't stay clustered very well in the face of new data).
Also, your index is gratuitously double the size it needs to be. There is no point in doing INCLUDE on a column that is already part of the main part of the index. That is going to cost you CPU and IO to format and write the data (and the WAL) and also reduce the amount of data which fits in cache.
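The sorting advice is a one-liner on the client side. A minimal sketch, assuming arglist is a list of dicts keyed by column name (the row shape and the extra column here are made up for illustration):

```python
from operator import itemgetter
from typing import Dict, List


def sort_for_index(arglist: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return rows ordered by the leading index columns.

    Neighbouring rows then tend to hit neighbouring index pages.
    """
    return sorted(arglist, key=itemgetter("variant_name", "start"))


# Hypothetical rows; only variant_name and start come from the question.
arglist = [
    {"variant_name": "v2", "start": 500, "score": 0.1},
    {"variant_name": "v1", "start": 900, "score": 0.7},
    {"variant_name": "v1", "start": 100, "score": 0.4},
]
for row in sort_for_index(arglist):
    print(row["variant_name"], row["start"])
```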
Postgresql Batch insert and on conflict batch update
The syntax you are trying to use is obviously incorrect. Given the table
create table a_table(id serial primary key, x1 int, x2 int);
try this in psql
insert into a_table (x1, x2)
values (1,2), (3,4)
on conflict do
update set (x1, x2) = (1,2), (3,4);
to get
ERROR: syntax error at or near "3"
LINE 4: update set (x1, x2) = (1,2), (3,4);
On the other hand, ON CONFLICT makes no sense in this case. A conflict will never happen, as none of the columns used (or group of columns) is unique.
Check the INSERT syntax, and read more about UPSERT in the wiki.