Return Rows from Insert with on Conflict Without Needing to Update

It's the recurring problem of SELECT or INSERT, related to (but different from) UPSERT. The new UPSERT functionality in Postgres 9.5 is instrumental either way.
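The queries below assume a table along these lines (a minimal sketch: the table name and the unique constraint name names_name_key are taken from the query; the id column is an assumption):

CREATE TABLE names (
   id   serial PRIMARY KEY
 , name text
 , CONSTRAINT names_name_key UNIQUE (name)
);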

WITH ins AS (
   INSERT INTO names(name)
   VALUES ('bob')
   ON     CONFLICT ON CONSTRAINT names_name_key DO UPDATE
   SET    name = NULL
   WHERE  FALSE      -- never executed, but locks the row
   RETURNING id
   )
SELECT id FROM ins
UNION  ALL
SELECT id FROM names
WHERE  name = 'bob'  -- only executed if no INSERT
LIMIT  1;

This way you do not actually write a new row version without need.

I assume you are aware that in Postgres every UPDATE writes a new version of the row due to its MVCC model - even if name is set to the same value as before. This would make the operation more expensive, add to possible concurrency issues / lock contention in certain situations and bloat the table additionally.
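You can observe this directly: the system columns ctid (physical tuple location) and xmin change whenever a new row version is written, even if the UPDATE sets the same value. A quick sketch, assuming a names table with a row 'bob':

SELECT ctid, xmin, name FROM names WHERE name = 'bob';
UPDATE names SET name = name WHERE name = 'bob';  -- "empty" update
SELECT ctid, xmin, name FROM names WHERE name = 'bob';
-- ctid and xmin differ from before: a new row version was written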

However, there is still a tiny corner case for a race condition. Concurrent transactions may have added a conflicting row, which is not yet visible in the same statement. Then INSERT and SELECT come up empty.

Proper solution for single-row UPSERT:

  • Is SELECT or INSERT in a function prone to race conditions?

General solutions for bulk UPSERT:

  • How to use RETURNING with ON CONFLICT in PostgreSQL?

Without concurrent write load

If concurrent writes (from a different session) are not possible you don't need to lock the row and can simplify:

WITH ins AS (
   INSERT INTO names(name)
   VALUES ('bob')
   ON     CONFLICT ON CONSTRAINT names_name_key DO NOTHING  -- no lock needed
   RETURNING id
   )
SELECT id FROM ins
UNION  ALL
SELECT id FROM names
WHERE  name = 'bob'  -- only executed if no INSERT
LIMIT  1;

How to include excluded rows in RETURNING from INSERT ... ON CONFLICT

The error you get:

ON CONFLICT DO UPDATE command cannot affect row a second time

... indicates you are trying to upsert the same row more than once in a single command. In other words: you have dupes on (name, url, email) in your VALUES list. Fold duplicates (if that's an option) and the error goes away. This chooses an arbitrary row from each set of dupes:

INSERT INTO feeds_person (created, modified, name, url, email)
SELECT DISTINCT ON (name, url, email) *
FROM  (
   VALUES
      ('blah', 'blah', 'blah', 'blah', 'blah')
      -- ... more rows
   ) AS v(created, modified, name, url, email)  -- match column list
ON     CONFLICT (name, url, email) DO UPDATE
SET    url = feeds_person.url
RETURNING id;

Since we use a free-standing VALUES expression now, you have to add explicit type casts for non-default types. Like:

VALUES
(timestamptz '2016-03-12 02:47:56+01'
, timestamptz '2016-03-12 02:47:56+01'
, 'n3', 'u3', 'e3')
...

Your timestamptz columns need an explicit type cast, while the string types can operate with default text. (You could still cast to varchar(n) right away.)

If you want to have a say in which row to pick from each set of dupes, there are ways to do that:

  • Select first row in each GROUP BY group?

You are right, there is (currently) no way to use excluded columns in the RETURNING clause. I quote the Postgres Wiki:

Note that RETURNING does not make visible the "EXCLUDED.*" alias
from the UPDATE (just the generic "TARGET.*" alias is visible
there). Doing so is thought to create annoying ambiguity for the
simple, common cases [30] for little to no benefit. At some
point in the future, we may pursue a way of exposing if
RETURNING-projected tuples were inserted and updated, but this
probably doesn't need to make it into the first committed iteration of
the feature [31].

However, you shouldn't be updating rows that are not supposed to be updated. Empty updates are almost as expensive as regular updates - and might have unintended side effects. You don't strictly need UPSERT to begin with, your case looks more like "SELECT or INSERT". Related:

  • Is SELECT or INSERT in a function prone to race conditions?

One cleaner way to insert a set of rows would be with data-modifying CTEs:

WITH val AS (
   SELECT DISTINCT ON (name, url, email) *
   FROM  (
      VALUES
         (timestamptz '2016-1-1 0:0+1', timestamptz '2016-1-1 0:0+1', 'n', 'u', 'e')
       , ('2016-03-12 02:47:56+01', '2016-03-12 02:47:56+01', 'n1', 'u3', 'e3')
       -- more (type cast only needed in 1st row)
      ) v(created, modified, name, url, email)
   )
, ins AS (
   INSERT INTO feeds_person (created, modified, name, url, email)
   SELECT created, modified, name, url, email FROM val
   ON     CONFLICT (name, url, email) DO NOTHING
   RETURNING id, name, url, email
   )
SELECT 'inserted' AS how, id FROM ins  -- inserted
UNION  ALL
SELECT 'selected' AS how, f.id         -- not inserted
FROM   val v
JOIN   feeds_person f USING (name, url, email);

The added complexity should pay for big tables where INSERT is the rule and SELECT the exception.

Originally, I had added a NOT EXISTS predicate on the last SELECT to prevent duplicates in the result. But that was redundant. All CTEs of a single query see the same snapshot of the tables. The set returned by ON CONFLICT (name, url, email) DO NOTHING is mutually exclusive with the set returned after the INNER JOIN on the same columns.

Unfortunately this also opens a tiny window for a race condition. If ...

  • a concurrent transaction inserts conflicting rows
  • has not committed yet
  • but commits eventually

... some rows may be lost.

You might just INSERT ... ON CONFLICT DO NOTHING, followed by a separate SELECT query for all rows - within the same transaction - to overcome this. Which in turn opens another tiny window for a race condition if concurrent transactions can commit writes to the table between INSERT and SELECT (in the default READ COMMITTED isolation level). That can be avoided with REPEATABLE READ transaction isolation (or stricter), or with a (possibly expensive or even unacceptable) write lock on the whole table. You can get any behavior you need, but there may be a price to pay.
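The shape of that two-step variant might look like this (a sketch; table and constraint names follow the first example):

BEGIN ISOLATION LEVEL REPEATABLE READ;

INSERT INTO names(name)
VALUES ('bob')
ON CONFLICT ON CONSTRAINT names_name_key DO NOTHING;

SELECT id FROM names WHERE name = 'bob';

COMMIT;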

Related:

  • How to use RETURNING with ON CONFLICT in PostgreSQL?
  • Return rows from INSERT with ON CONFLICT without needing to update

How to use RETURNING with ON CONFLICT in PostgreSQL?

I had exactly the same problem, and I solved it using 'do update' instead of 'do nothing', even though I had nothing to update. In your case it would be something like this:

INSERT INTO chats ("user", "contact", "name")
VALUES ($1, $2, $3)
     , ($2, $1, NULL)
ON CONFLICT ("user", "contact")
DO UPDATE SET name = EXCLUDED.name
RETURNING id;

This query returns all the rows, regardless of whether they were just inserted or already existed.

Can PostgreSQL's update the other row on INSERT conflict?

As proposed by @Adrian, you can do it with a trigger:

CREATE OR REPLACE FUNCTION before_insert()
RETURNS trigger LANGUAGE plpgsql AS
$$
BEGIN
   UPDATE demo
   SET    current = false
   WHERE  a = NEW.a;
   RETURN NEW;
END;
$$;

CREATE TRIGGER before_insert BEFORE INSERT ON demo
FOR EACH ROW EXECUTE FUNCTION before_insert();


PS: the constraint one_per prevents having several former rows (current = false) for the same a value.
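The demo table itself is not shown here; one definition consistent with the description (the column names a and current and the constraint name one_per are taken from the answer, the rest is an assumption) could be:

CREATE TABLE demo (
   a       int
 , current bool NOT NULL DEFAULT true
 , CONSTRAINT one_per UNIQUE (a, current)  -- at most one former row per a
);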

PostgreSQL Upsert differentiate inserted and updated rows using system columns XMIN, XMAX and others

I think that this is an interesting question that deserves an in-depth answer; please bear with me if it is a little bit lengthy.

In short: Your guess is right, and you can use the following RETURNING clause to determine if the row was inserted and not updated:

RETURNING (xmax = 0) AS inserted

Now the detailed explanation:

When a row is updated, PostgreSQL does not modify the data in place, but creates a new version of the row; the old version will be deleted by autovacuum when it is no longer needed. A version of a row is called a tuple, so in PostgreSQL there can be more than one tuple per row.

xmax serves two different purposes:

  1. As stated in the documentation, it can be the transaction ID of the transaction that deleted (or updated) the tuple. Only transactions with a transaction ID between xmin and xmax can see the tuple. An old tuple can be deleted safely if there is no transaction with a transaction ID less than xmax.

  2. xmax is also used to store row locks. In PostgreSQL, row locks are not stored in the lock table, but in the tuple to avoid overflow of the lock table.

    If only one transaction has a lock on the row, xmax will contain the transaction ID of the locking transaction. If more than one transaction has a lock on the row, xmax contains the number of a so-called multixact, which is a data structure that in turn contains the transaction IDs of the locking transactions.

The documentation of xmax is not complete, because the exact meaning of this field is considered an implementation detail and cannot be understood without knowing t_infomask of the tuple, which is not immediately visible via SQL.

You can install the contrib module pageinspect to view this and other fields of a tuple.

I ran your example, and this is what I see when I use the heap_page_items function to examine details (the transaction ID numbers are of course different in my case):

SELECT *, ctid, xmin, xmax FROM t;

┌───┬────┬───────┬────────┬────────┐
│ i │ x  │ ctid  │  xmin  │  xmax  │
├───┼────┼───────┼────────┼────────┤
│ 1 │ 11 │ (0,2) │ 102508 │ 102508 │
│ 2 │ 22 │ (0,3) │ 102508 │      0 │
└───┴────┴───────┴────────┴────────┘
(2 rows)

SELECT lp, lp_off, t_xmin, t_xmax, t_ctid,
to_hex(t_infomask) AS t_infomask, to_hex(t_infomask2) AS t_infomask2
FROM heap_page_items(get_raw_page('laurenz.t', 0));

┌────┬────────┬────────┬────────┬────────┬────────────┬─────────────┐
│ lp │ lp_off │ t_xmin │ t_xmax │ t_ctid │ t_infomask │ t_infomask2 │
├────┼────────┼────────┼────────┼────────┼────────────┼─────────────┤
│  1 │   8160 │ 102507 │ 102508 │ (0,2)  │ 500        │ 4002        │
│  2 │   8128 │ 102508 │ 102508 │ (0,2)  │ 2190       │ 8002        │
│  3 │   8096 │ 102508 │      0 │ (0,3)  │ 900        │ 2           │
└────┴────────┴────────┴────────┴────────┴────────────┴─────────────┘
(3 rows)

The meanings of t_infomask and t_infomask2 can be found in src/include/access/htup_details.h. lp_off is the offset of the tuple data in the page, and t_ctid is the current tuple ID which consists of the page number and a tuple number within the page. Since the table was newly created, all data are in page 0.

Let me discuss the three rows returned by heap_page_items.

  1. At line pointer (lp) 1 we find the old, updated tuple. It originally had ctid = (0,1), but that got modified to contain the tuple ID of the current version during the update. The tuple was created by transaction 102507 and invalidated by transaction 102508 (the transaction that issued the INSERT ... ON CONFLICT). This tuple is not visible any more and will get removed during VACUUM.

    t_infomask shows that both xmin and xmax belong to committed transactions and consequently show when the tuple was created and deleted. t_infomask2 shows that the tuple was updated with a HOT (heap-only tuple) update, which means that the updated tuple is in the same page as the original tuple and no indexed column was modified (see src/backend/access/heap/README.HOT).

  2. At line pointer 2 we see the new, updated tuple that was created by the INSERT ... ON CONFLICT (transaction 102508).

    t_infomask shows that this tuple is the result of an update, xmin is valid, and xmax contains a KEY SHARE row lock (which is no longer relevant since the transaction has completed). This row lock was taken during INSERT ... ON CONFLICT processing. t_infomask2 shows that this is a HOT tuple.

  3. At line pointer 3 we see the newly inserted row.

    t_infomask shows that xmin is valid and xmax is invalid. xmax is set to 0 because this value is always used for newly inserted tuples.

So the nonzero xmax of the updated row is an implementation artifact caused by a row lock. It is conceivable that INSERT ... ON CONFLICT is reimplemented one day so that this behaviour changes, but I think that is unlikely.
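Put together, the pattern from the short answer can be used like this (a sketch, assuming the example table t with a unique constraint on i):

INSERT INTO t (i, x)
VALUES (1, 11), (2, 22)
ON CONFLICT (i) DO UPDATE
SET x = EXCLUDED.x
RETURNING *, (xmax = 0) AS inserted;  -- true only for freshly inserted rows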

query of type INSERT ON CONFLICT DO NOTHING RETURNING returns nothing

Changing DO NOTHING to a DO UPDATE SET clause (which does not modify the final result) gives the wanted results:

xsignalsbot=# INSERT INTO users_strategies (id_strategy, id_account)
VALUES (1, 48)
ON CONFLICT (id_strategy, id_account) DO UPDATE
SET id_strategy = excluded.id_strategy
RETURNING users_strategies.active, users_strategies.risk;

 active | risk
--------+------
 t      | 0.50
(1 row)

Race conditions between INSERT ON CONFLICT DO NOTHING and SELECT

To make absolutely sure that the single row in the first table is there, and its ID returned, you can create a function like the one outlined here:

  • Is SELECT or INSERT in a function prone to race conditions?

To make sure the row also stays there for the duration of the transaction, just make sure it's locked. If you INSERT the row, it's locked anyway. If you SELECT an existing id, you have to lock it explicitly - just like you suggested. FOR KEY SHARE is strong enough for our purpose as long as there is a (non-partial, non-functional) UNIQUE index on (scope, name), which is safe to assume given your ON CONFLICT clause.

CREATE OR REPLACE FUNCTION f_object_id(_scope text, _name text, OUT _object_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
   LOOP
      SELECT id FROM object
      WHERE  scope = $1
      AND    name = $2
      -- lock to prevent deletion in the tiny time frame before the next INSERT
      FOR    KEY SHARE
      INTO   _object_id;

      EXIT WHEN FOUND;

      INSERT INTO object AS o (scope, name)
      VALUES ($1, $2)
      ON     CONFLICT (scope, name) DO NOTHING
      RETURNING o.id
      INTO   _object_id;

      EXIT WHEN FOUND;
   END LOOP;
END
$func$;

You really only need to lock the row if it's conceivable that a concurrent transaction might DELETE it (you don't UPDATE) in the tiny time frame between the SELECT and the next INSERT statement.

Also, if you have a FOREIGN KEY constraint from object_member.object_id to object.id (which seems likely), referential integrity is guaranteed anyway. If you don't add the explicit lock, and the row is deleted in between, you get a foreign key violation, and the INSERT to object_member is cancelled, along with the whole transaction. Else, the other transaction with the DELETE has to wait until your transaction is done, and is then cancelled by the same FK constraint since depending rows are now there (unless it's defined to CASCADE ...) So by locking (or not) you can decide whether to prevent the DELETE or the INSERT in this scenario.

Then your call burns down to just:

query(
  `WITH o(id) AS (SELECT f_object_id($1, $2))
   INSERT INTO object_member (object_id, key, value)
   SELECT o.id, UNNEST($3::text[]), UNNEST($4::int[])
   FROM   o;`
, [$scope, $name, $keys, $values]  -- note the comma: parameters are a second argument
)

Since you obviously insert multiple rows into object_member, I moved f_object_id($1, $2) to a CTE to avoid repeated execution - which would work, but would be pointlessly expensive.

In Postgres 12 or later I would make that explicit by adding MATERIALIZED (since the INSERT is hidden in a function):

WITH o(id) AS MATERIALIZED (SELECT f_object_id($1, $2)) ...

Aside: For the multiple unnest() in the SELECT list, make sure you are on Postgres 10 or later. See:

  • What is the expected behaviour for multiple set-returning functions in SELECT clause?

Matters of detail

Will it make any difference (apart from execution time) to do this in the application logic with multiple queries in the same transaction?

Basically no. The only difference is performance. Well, that and shorter code and reliability. It's objectively more error-prone to go back and forth between database and client for each loop. But unless you have extremely competitive transactions, you would hardly ever loop anyway.

The other consideration is this: the matter is tricky, and most developers do not understand it. Encapsulated in a server-side function, it's less likely to be broken by the next application programmer (or yourself). You have to make sure that it's actually used, too. Either way, properly document the reasons you are doing it one way or another ...

I really wonder whether my second snippet is safe, or why not (given the quote about visibility in the SELECT after the INSERT).

Mostly safe, but not absolutely. While the next separate SELECT will see (now committed) rows of a transaction competing with the previous UPSERT, there is nothing to keep a third transaction from deleting the row again in the meantime. The row has not been locked, you have no way to lock it while it is not visible, and there is no generic predicate locking available in Postgres.

Consider this (T1, T2, T3 are concurrent transactions):

                            T2: BEGIN transaction
T1: BEGIN transaction
                            T2: INSERT object 666
T1: UPSERT object 666
    unique violation?
    -> wait for T2
                            T2: COMMIT
T1: unique violation -> NO ACTION
    finish statement
    can't return invisible object 666
                            T3: DELETE object 666 & COMMIT
T1: SELECT object 666 -> no row!
    BOOM!

Typically it's extremely unlikely that it ever happens.

But it's possible. Hence the loop.

The other option is SERIALIZABLE transaction isolation. Typically more expensive, and you need to prepare for serialization failures. Catch 22.
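The shape of the SERIALIZABLE variant (a sketch; the retry has to happen in the client or in a loop):

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- ... UPSERT and subsequent SELECT ...
COMMIT;
-- on serialization_failure (SQLSTATE 40001): roll back and retry the whole transaction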

Use INSERT ... ON CONFLICT DO NOTHING RETURNING failed rows

A bit verbose, but I can't think of anything else:

WITH all_tags (name) AS (
   VALUES ('tag10'), ('tag6'), ('tag11')
)
, inserted (id, name) AS (
   INSERT INTO tags (name)
   SELECT name
   FROM   all_tags
   ON CONFLICT DO NOTHING
   RETURNING id, name
)
SELECT t.id, t.name, 'already there'
FROM   tags t
JOIN   all_tags at ON at.name = t.name
UNION  ALL
SELECT id, name, 'inserted'
FROM   inserted;

The outer SELECT from tags sees the snapshot of the table as it was before the new tags were inserted. The third column with the constant is only there to test the query, so that one can identify which rows were inserted and which were not.

How can I get the INSERTED and UPDATED rows for an UPSERT operation in postgres

If you add a boolean updated column to the people table:

ALTER TABLE people ADD COLUMN updated bool DEFAULT FALSE;

then you could identify updated rows by setting updated = TRUE in the DO UPDATE SET clause:

INSERT INTO people (SELECT * FROM people_update)
ON CONFLICT (name, surname)
DO UPDATE SET
  age     = EXCLUDED.age
, street  = EXCLUDED.street
, city    = EXCLUDED.city
, postal  = EXCLUDED.postal
, updated = TRUE
WHERE (people.age, people.street, people.city, people.postal)
   IS DISTINCT FROM
      (EXCLUDED.age, EXCLUDED.street, EXCLUDED.city, EXCLUDED.postal)
RETURNING *;

For example,

CREATE TABLE test.people (
   name    text
 , surname text
 , age     float
 , street  text
 , city    text
 , postal  int
);
CREATE UNIQUE INDEX people_idx ON people (name, surname);
ALTER TABLE people ADD COLUMN updated bool;
ALTER TABLE people ADD COLUMN prior_age float;
ALTER TABLE people ADD COLUMN prior_street text;
ALTER TABLE people ADD COLUMN prior_city text;
ALTER TABLE people ADD COLUMN prior_postal int;

INSERT INTO people (name, surname, age, street, city, postal) VALUES
  ('Sancho', 'Panza', 414, '1 Manchego', 'Barcelona', 01605)
, ('Oliver', 'Twist', 182, '2 Stilton', 'London', 01837)
, ('Quasi', 'Modo', 188, $$3 Rue d'Arcole$$, 'Paris', 01831)
;

CREATE TABLE test.people_update (
   name    text
 , surname text
 , age     float
 , street  text
 , city    text
 , postal  int
);

INSERT INTO people_update (name, surname, age, street, city, postal) VALUES
  ('Sancho', 'Panza', 4140, '10 Idiazabal', 'Montserrat', 16050)
, ('Quasi', 'Modo', 1880, $$30 Champs Elysée$$, 'Paris', 18310)
, ('Pinocchio', 'Geppetto', 1380, '40 Nerbone', 'Florence', 18810)
;

INSERT INTO people (SELECT * FROM people_update)
ON CONFLICT (name,surname)
DO UPDATE SET
updated = TRUE
, prior_age = (CASE WHEN people.age = EXCLUDED.age THEN NULL ELSE people.age END)
, prior_street = (CASE WHEN people.street = EXCLUDED.street THEN NULL ELSE people.street END)
, prior_city = (CASE WHEN people.city = EXCLUDED.city THEN NULL ELSE people.city END)
, prior_postal = (CASE WHEN people.postal = EXCLUDED.postal THEN NULL ELSE people.postal END)
, age = EXCLUDED.age
, street = EXCLUDED.street
, city = EXCLUDED.city
, postal = EXCLUDED.postal
WHERE
(people.age,people.street,people.city,people.postal) IS DISTINCT FROM
(EXCLUDED.age,EXCLUDED.street,EXCLUDED.city,EXCLUDED.postal)
RETURNING *;

yields

| name       | surname  |  age | street           | city       | postal | updated | prior_age | prior_street   | prior_city | prior_postal |
|------------+----------+------+------------------+------------+--------+---------+-----------+----------------+------------+--------------|
| Sancho     | Panza    | 4140 | 10 Idiazabal     | Montserrat |  16050 | t       |       414 | 1 Manchego     | Barcelona  |         1605 |
| Quasi      | Modo     | 1880 | 30 Champs Elysée | Paris      |  18310 | t       |       188 | 3 Rue d'Arcole |            |         1831 |
| Pinocchio  | Geppetto | 1380 | 40 Nerbone       | Florence   |  18810 | f       |           |                |            |              |

The updated column shows that the ('Sancho', 'Panza') and ('Quasi', 'Modo') rows have been updated, and ('Pinocchio', 'Geppetto') is a new insert.


