PostgreSQL 9.4 - Prevent App from Always Selecting the Latest Updated Rows

Just an idea: instead of calling random() at query time, use it as the default value for a column (which can be indexed). A similar approach could use a sequence with an increment of about 0.7 * INT_MAX.
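
For the sequence variant, a minimal sketch of the idea (the name and the exact increment are made up; this is not part of the demo script below):

CREATE SEQUENCE magic_seq
INCREMENT BY 1503238553 -- roughly 0.7 * INT_MAX
MINVALUE 0
MAXVALUE 2147483647
CYCLE; -- wrap around instead of raising an error
-- used as: magic INTEGER NOT NULL DEFAULT nextval('magic_seq')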

\i tmp.sql

CREATE TABLE opportunities
( id SERIAL NOT NULL PRIMARY KEY
, deal_id INTEGER NOT NULL DEFAULT 0
, prize_id INTEGER
, opportunity_available boolean NOT NULL DEFAULT false
-- ----------------------------------------
-- precomputed random() (could be indexed)
, magic double precision NOT NULL DEFAULT random()
);

INSERT INTO opportunities(deal_id)
SELECT 341
FROM generate_series(1,20) gs
;
VACUUM ANALYZE opportunities;

PREPARE add_three (integer) AS (
WITH zzz AS (
UPDATE opportunities
SET prize_id = 21
, opportunity_available = True
-- updating magic is not *really* needed here ...
, magic = random()
WHERE opportunities.id
IN (
SELECT opportunities.id
FROM opportunities
WHERE (deal_id = $1 AND prize_id IS NULL)
-- ORDER BY RANDOM()
ORDER BY magic
LIMIT 3)
RETURNING id, magic
) --
SELECT * FROM zzz
);

PREPARE draw_one (integer) AS (
WITH upd AS (
UPDATE opportunities s
SET opportunity_available = false
FROM (
SELECT id
FROM opportunities
WHERE deal_id = $1
AND opportunity_available
AND pg_try_advisory_xact_lock(id)
ORDER BY magic
LIMIT 1
FOR UPDATE
) sub
WHERE s.id = sub.id
RETURNING s.prize_id, s.id, magic
)
SELECT * FROM upd
);

SELECT * FROM opportunities;

\echo add3
EXECUTE add_three(341);
SELECT * FROM opportunities;

\echo add3 more
EXECUTE add_three(341);
SELECT * FROM opportunities;

\echo draw1
EXECUTE draw_one(341);
SELECT * FROM opportunities;

\echo draw2
EXECUTE draw_one(341);
SELECT * FROM opportunities;

VACUUM ANALYZE opportunities;

\echo draw3
EXECUTE draw_one(341);
SELECT * FROM opportunities;

\echo draw4
EXECUTE draw_one(341);
SELECT * FROM opportunities;

PostgreSQL 9.4/9.5 - SELECT ... FOR UPDATE one single random row on a large dataset with high reads and writes

Regarding IDs: there can be huge gaps between IDs in the table as a
whole, BUT inside the 'tickets from a specific deal' (see query below)
there is not any gap between IDs (not even the smallest), which I
presume can matter for finding the most appropriate query.

This makes your life much easier. I'd use the following approach.

0) Create index on (deal_id, available, id).

1) Get MIN and MAX values of ID for the given deal_id.

SELECT MIN(id) AS MinID, MAX(id) AS MaxID
FROM tickets
WHERE deal_id = #{@deal.id}
AND available

If this query results in a full index scan instead of a direct index lookup, use two separate queries for MIN and MAX.
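
A sketch of the two separate queries (deal_id 123 is made up; each one can be satisfied by a single descent into the (deal_id, available, id) index from step 0):

SELECT id AS min_id
FROM tickets
WHERE deal_id = 123
AND available
ORDER BY id
LIMIT 1;

SELECT id AS max_id
FROM tickets
WHERE deal_id = 123
AND available
ORDER BY id DESC
LIMIT 1;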

2) Generate a random integer RandID in the found range [MinID, MaxID].
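
In SQL this step could look like the following sketch (the bounds 101 and 120 are made up; random() returns a value in [0, 1)):

SELECT (floor(random() * (120 - 101 + 1)) + 101)::int AS rand_id;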

3) Pick the row with ID = RandID. The query should use a direct index lookup.

UPDATE tickets s
SET available = false
FROM (
SELECT id
FROM tickets
WHERE deal_id = #{@deal.id}
AND available
AND id = @RandID
AND pg_try_advisory_xact_lock(id)
LIMIT 1
FOR UPDATE
) sub
WHERE s.id = sub.id
RETURNING s.name, s.id

If there are concurrent processes that can add or delete rows, consider increasing the transaction isolation level to serializable.


Having said all this, I realised that it won't work. When you say that IDs don't have gaps, you most likely mean that there are no gaps among IDs with the same deal_id (regardless of the value of the available column), but not among IDs that have the same deal_id AND available = true.

As soon as the first random row is set to available=false there will be a gap in IDs.


Second attempt

Add a float column RandomNumber to the tickets table that holds a random number in the range (0,1). Whenever you add a row to this table, generate such a random number and save it in this column.

Create index on (deal_id, available, RandomNumber).
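
A minimal DDL sketch (using random_number for the column called RandomNumber above; in Postgres 9.4, adding a column with a DEFAULT rewrites the table, evaluating the volatile default once per row, which is exactly what we want here):

ALTER TABLE tickets
ADD COLUMN random_number double precision NOT NULL DEFAULT random();

CREATE INDEX tickets_deal_avail_rnd_idx
ON tickets (deal_id, available, random_number);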

Order by this RandomNumber to pick a random row that is still available. The query should use the index.

UPDATE tickets s
SET available = false
FROM (
SELECT id
FROM tickets
WHERE deal_id = #{@deal.id}
AND available
AND pg_try_advisory_xact_lock(id)
ORDER BY RandomNumber
LIMIT 1
FOR UPDATE
) sub
WHERE s.id = sub.id
RETURNING s.name, s.id

Advisory locks or NOWAIT to avoid waiting for locked rows?

FOR UPDATE NOWAIT is only a good idea if you insist on locking a particular row, which is not what you need. You just want any qualifying, available (unlocked) row. The important difference is this (quoting the manual for Postgres 9.4):

With NOWAIT, the statement reports an error, rather than waiting, if a selected row cannot be locked immediately.

Identical queries will very likely try to lock the same arbitrary pick. FOR UPDATE NOWAIT will just bail out with an exception (which will roll back the whole transaction unless you trap the error) and you have to retry.

The solution in my referenced answer on dba.SE combines plain FOR UPDATE with pg_try_advisory_lock():

pg_try_advisory_lock is similar to pg_advisory_lock, except the
function will not wait for the lock to become available. It will
either obtain the lock immediately and return true, or return false if
the lock cannot be acquired immediately.

So your best option is ... the third alternative: the new FOR UPDATE SKIP LOCKED in Postgres 9.5, which implements the same behavior without an additional function call.

The manual for Postgres 9.5 compares the two options, explaining the difference some more:

To prevent the operation from waiting for other transactions to
commit, use either the NOWAIT or SKIP LOCKED option. With NOWAIT, the
statement reports an error, rather than waiting, if a selected row
cannot be locked immediately. With SKIP LOCKED, any selected rows that
cannot be immediately locked are skipped.

On Postgres 9.4 or older your next best option is to use pg_try_advisory_xact_lock(id) in combination with FOR UPDATE like demonstrated in the referenced answer:

  • Postgres UPDATE … LIMIT 1

(Also with an implementation with FOR UPDATE SKIP LOCKED.)
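
For illustration, a sketch of the same pattern on Postgres 9.5 with SKIP LOCKED replacing the advisory-lock function (the tickets table and random_number column from the previous section are assumed):

UPDATE tickets s
SET available = false
FROM (
SELECT id
FROM tickets
WHERE deal_id = 123
AND available
ORDER BY random_number
LIMIT 1
FOR UPDATE SKIP LOCKED -- skip rows locked by concurrent transactions
) sub
WHERE s.id = sub.id
RETURNING s.name, s.id;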

Aside

Strictly speaking you get arbitrary, not truly random picks. That can be an important distinction.

An audited version of your query is in my answer to your other question.

Put pg_try_advisory_xact_lock() in a nested subquery?

I updated my referenced answer with more explanation and links.

In Postgres 9.5 (currently beta) the new SKIP LOCKED is a superior solution:

  • Postgres UPDATE … LIMIT 1

Let me simplify a few things in your query first:

Straight query

UPDATE opportunities s
SET opportunity_available = false
FROM (
SELECT id
FROM opportunities
WHERE deal_id = #{@deal.id}
AND opportunity_available
AND pg_try_advisory_xact_lock(id)
LIMIT 1
FOR UPDATE
) sub
WHERE s.id = sub.id
RETURNING s.prize_id, s.id;
  • All the double quotes were just noise with your legal, lower-case names.
  • Since opportunity_available is a boolean column, you can simplify opportunity_available = true to just opportunity_available.
  • You don't need to return * from the subquery, just id is enough.

Typically, this works as is. Explanation below.

Avoid advisory lock on unrelated rows

To be sure, you could encapsulate all predicates in a CTE or a subquery with the OFFSET 0 hack (less overhead) before you apply pg_try_advisory_xact_lock() in the next query level:

UPDATE opportunities s
SET opportunity_available = false
FROM (
SELECT id
FROM (
SELECT id
FROM opportunities
WHERE deal_id = #{@deal.id}
AND opportunity_available
OFFSET 0
) sub1
WHERE pg_try_advisory_xact_lock(id)
LIMIT 1
FOR UPDATE
) sub2
WHERE s.id = sub2.id
RETURNING s.prize_id, s.id;

However, this is typically much more expensive.

You probably don't need this

There aren't going to be any "collateral" advisory locks if you base your query on an index covering all predicates, like this partial index:

CREATE INDEX opportunities_deal_id ON opportunities (deal_id)
WHERE opportunity_available;

Check with EXPLAIN to verify Postgres actually uses the index. This way, pg_try_advisory_xact_lock(id) will be a filter condition to the index or bitmap index scan and only qualifying rows are going to be tested (and locked) to begin with, so you can use the simple form without additional nesting. At the same time, your query performance is optimized. I would do that.
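
A quick check could look like this sketch (plain EXPLAIN does not execute the statement, so no advisory locks are taken; look for an Index Scan or Bitmap Index Scan on opportunities_deal_id in the output):

EXPLAIN
UPDATE opportunities s
SET opportunity_available = false
FROM (
SELECT id
FROM opportunities
WHERE deal_id = 341
AND opportunity_available
AND pg_try_advisory_xact_lock(id)
LIMIT 1
FOR UPDATE
) sub
WHERE s.id = sub.id;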

Even if a couple of unrelated rows should get an advisory lock once in a while, that typically just doesn't matter. Advisory locks are only relevant to queries that actually use advisory locks. Or do you really have other concurrent transactions that also use advisory locks and target other rows of the same table? Really?

The only other problematic case would be if massive amounts of unrelated rows get advisory locks, which can only happen with a sequential scan and is very unlikely even then.

Return rows of a table that actually changed in an UPDATE

Only update rows that actually change

That saves expensive updates and expensive checks after the UPDATE.

To update every column with the new value provided (if anything changes):

UPDATE accounts a
SET (status, field1, field2) -- short syntax for ..
= (m.status, m.field1, m.field2) -- .. updating multiple columns
FROM merge_accounts m
WHERE m.uid = a.uid
AND (a.status IS DISTINCT FROM m.status OR
a.field1 IS DISTINCT FROM m.field1 OR
a.field2 IS DISTINCT FROM m.field2)
RETURNING a.*;

Due to PostgreSQL's MVCC model, any change to a row writes a new row version. Updating a single column is almost as expensive as updating every column in the row at once. Rewriting the rest of the row comes at practically no cost once you have to update anything at all.

Details:

  • How do I (or can I) SELECT DISTINCT on multiple columns?
  • UPDATE a whole row in PL/pgSQL

Shorthand for whole rows

If the row types of accounts and merge_accounts are identical and you want to adopt everything from merge_accounts into accounts, there is a shortcut comparing the whole row type:

UPDATE accounts a
SET (status, field1, field2)
= (m.status, m.field1, m.field2)
FROM merge_accounts m
WHERE a.uid = m.uid
AND m IS DISTINCT FROM a
RETURNING a.*;

This even works for NULL values. Details in the manual.

But it's not going to work for your home-grown solution where (quoting your comment):

merge_accounts is identical, save that all non-pk columns are array types

It requires compatible row types, i.e. each column shares the same data type or there is at least an implicit cast between the two types.

For your special case

UPDATE accounts a
SET (status, field1, field2)
= (COALESCE(m.status[1], a.status) -- default to original ..
, COALESCE(m.field1[1], a.field1) -- .. if m.column[1] IS NULL
, COALESCE(m.field2[1], a.field2))
FROM merge_accounts m
WHERE m.uid = a.uid
AND (m.status[1] IS NOT NULL AND a.status IS DISTINCT FROM m.status[1]
OR m.field1[1] IS NOT NULL AND a.field1 IS DISTINCT FROM m.field1[1]
OR m.field2[1] IS NOT NULL AND a.field2 IS DISTINCT FROM m.field2[1])
RETURNING a.*;

m.status IS NOT NULL works if columns that shouldn't be updated are NULL in merge_accounts.

m.status <> '{}' if you operate with empty arrays.

m.status[1] IS NOT NULL covers both options.

Related:

  • Return pre-UPDATE column values using SQL only

INSERT or SELECT strategy to always return a row?

Your observation seems impossible. The above command should always return an id, either for the newly inserted row or for the pre-existing row. Concurrent writes cannot mess with this since existing conflicting rows are locked. Explanation in this related answer:

  • How to use RETURNING with ON CONFLICT in PostgreSQL?

Unless an exception is raised, of course. You get an error message instead of a result in that case. Did you check that? Do you have error-handling in place? (In case your app somehow discards error messages: 1) Fix that. 2) There is an additional entry in the DB log with default logging settings.)

I do see a FK constraint in your table definition:

prop_type text not null references prop_type(name),

If you try to insert a row that violates the constraint, that's exactly what happens. If there is no row with name = 'jargon' in table prop_type, that's what you get:

ERROR:  insert or update on table "prop" violates foreign key constraint "prop_prop_type_fkey"
DETAIL: Key (prop_type)=(jargon) is not present in table "prop_type".

Demo:

dbfiddle here

Your observation would fit the crime:

If I change prop_type = 'jargon' to prop_type = 'foo' it works!

But your explanation is based on misconceptions:

It would seem the lock isn't taken if the expression wouldn't change anything even given the where false clause.

That's not how Postgres works. The lock is taken either way (explanation in above linked answer), and the Postgres locking mechanism never even considers how the new row compares to the old.

Does this really need to depend on my guessing a value that wouldn't be in the row, though? Or is there a better way to ensure you get the lock?

No. And no.

If missing FK values are indeed the problem, you might add missing (distinct) values in a single statement with data-modifying CTEs (a sketch follows the links). Simple for single-row inserts like you demonstrate, but it works for inserting many rows at once, too. Related:

  • How do I insert a row which contains a foreign key?
  • INSERT rows into multiple tables in a single query, selecting from an involved table
  • Can INSERT [...] ON CONFLICT be used for foreign key violations?
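
For illustration, a minimal sketch of the single-statement approach (assuming a unique constraint on prop_type(name), which the FK reference implies, and a simplified prop table; other columns of prop are omitted here):

WITH ins_type AS (
INSERT INTO prop_type (name)
VALUES ('jargon')
ON CONFLICT (name) DO NOTHING -- no-op if the type already exists
)
INSERT INTO prop (prop_type) -- hypothetical: only the FK column shown
VALUES ('jargon')
RETURNING id;

The FK check runs at the end of the statement and should see the row added by the CTE, so both inserts go through in one round trip.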

How to perform update operations on columns of type JSONB in Postgres 9.4

Ideally, you don't use JSON documents for structured, regular data that you want to manipulate inside a relational database. Use a normalized relational design instead.

JSON is primarily intended to store whole documents that do not need to be manipulated inside the RDBMS. Related:

  • JSONB with indexing vs. hstore

Updating a row in Postgres always writes a new version of the whole row. That's the basic principle of Postgres' MVCC model. From a performance perspective, it hardly matters whether you change a single piece of data inside a JSON object or all of it: a new version of the row has to be written.

Thus the advice in the manual:

JSON data is subject to the same concurrency-control considerations as
any other data type when stored in a table. Although storing large
documents is practicable, keep in mind that any update acquires a
row-level lock on the whole row. Consider limiting JSON documents to a
manageable size in order to decrease lock contention among updating
transactions. Ideally, JSON documents should each represent an atomic
datum that business rules dictate cannot reasonably be further
subdivided into smaller datums that could be modified independently.

The gist of it: to modify anything inside a JSON object, you have to assign a modified object to the column. Postgres supplies limited means to build and manipulate json data in addition to its storage capabilities. The arsenal of tools has grown substantially with every new release since version 9.2. But the principle remains: you always have to assign a complete modified object to the column, and Postgres always writes a new row version for any update.

Some techniques how to work with the tools of Postgres 9.3 or later:

  • How do I modify fields inside the new PostgreSQL JSON datatype?
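
As a concrete example, a minimal sketch for a hypothetical table t(id int, doc jsonb) on Postgres 9.4, which has no jsonb_set() yet (that function arrived in 9.5): replace one top-level key by decomposing and re-aggregating the whole object.

UPDATE t
SET doc = (
SELECT json_object_agg(key, CASE WHEN key = 'name'
THEN '"new value"'::jsonb
ELSE value END)::jsonb
FROM jsonb_each(t.doc) -- assumes a non-empty object containing 'name'
)
WHERE id = 1;

-- From 9.5 on, jsonb_set() does the same more conveniently:
-- UPDATE t SET doc = jsonb_set(doc, '{name}', '"new value"') WHERE id = 1;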

This answer has attracted about as many downvotes as all my other answers on SO together. People don't seem to like the idea: a normalized design is superior for regular data. This excellent blog post by Craig Ringer explains in more detail:

  • "PostgreSQL anti-patterns: Unnecessary json/hstore dynamic columns"

Another blog post by Laurenz Albe, another official Postgres contributor like Craig and myself:

  • JSON in PostgreSQL: how to use it right

Insert, on duplicate update in PostgreSQL?

Since version 9.5, PostgreSQL has UPSERT syntax with the ON CONFLICT clause, similar to MySQL's INSERT ... ON DUPLICATE KEY UPDATE:

INSERT INTO the_table (id, column_1, column_2) 
VALUES (1, 'A', 'X'), (2, 'B', 'Y'), (3, 'C', 'Z')
ON CONFLICT (id) DO UPDATE
SET column_1 = excluded.column_1,
column_2 = excluded.column_2;

Searching the PostgreSQL mailing list archives for "upsert" leads to an example in the manual of what you possibly want to do:

Example 38-2. Exceptions with UPDATE/INSERT

This example uses exception handling to perform either UPDATE or INSERT, as appropriate:

CREATE TABLE db (a INT PRIMARY KEY, b TEXT);

CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
LOOP
-- first try to update the key
-- note that "a" must be unique
UPDATE db SET b = data WHERE a = key;
IF found THEN
RETURN;
END IF;
-- not there, so try to insert the key
-- if someone else inserts the same key concurrently,
-- we could get a unique-key failure
BEGIN
INSERT INTO db(a,b) VALUES (key, data);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing, and loop to try the UPDATE again
END;
END LOOP;
END;
$$
LANGUAGE plpgsql;

SELECT merge_db(1, 'david');
SELECT merge_db(1, 'dennis');

There's possibly an example of how to do this in bulk, using CTEs in 9.1 and above, in the hackers mailing list:

WITH foos AS (SELECT (UNNEST(%foo[])).*),
updated AS (UPDATE foo SET a = foos.a ... FROM foos WHERE foo.id = foos.id RETURNING foo.id)
INSERT INTO foo SELECT foos.* FROM foos LEFT JOIN updated USING (id)
WHERE updated.id IS NULL;

See a_horse_with_no_name's answer for a clearer example.


