Merge a table and a change log into a view in PostgreSQL

Assuming Postgres 9.1 or later.

I simplified / optimized your basic query to retrieve the latest values:

SELECT DISTINCT ON (1, 2)
       c.unique_id, a.attname AS col, c.value
FROM   pg_attribute a
LEFT   JOIN changes c ON c.column_name = a.attname
                     AND c.table_name = 'instances'
                     -- AND c.unique_id = 3 -- uncomment to fetch single row
WHERE  a.attrelid = 'instances'::regclass -- schema-qualify to be clear?
AND    a.attnum > 0         -- no system columns
AND    NOT a.attisdropped   -- no deleted columns
ORDER  BY 1, 2, c.updated_at DESC;

I query the PostgreSQL system catalog instead of the standard information_schema because that is faster. Note the special cast to ::regclass.

Now, that gives you a table. You want all values for one unique_id in a row.

To achieve that you have basically three options:

  1. One subselect (or join) per column. Expensive and unwieldy, but a valid option for just a few columns.

  2. A big CASE expression - one conditional aggregate per column (see the sketch after this list).

  3. A pivot function. PostgreSQL provides the crosstab() function in the additional module tablefunc for that.

    Basic instructions:

    • PostgreSQL Crosstab Query
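
To illustrate option 2, here is a minimal conditional-aggregation sketch. It assumes the changes table defined in the test environment further down and hard-codes the two demo columns col1 and col2:

-- latest value per (unique_id, column_name), then pivot manually with CASE
SELECT unique_id
     , max(CASE WHEN column_name = 'col1' THEN value END) AS col1
     , max(CASE WHEN column_name = 'col2' THEN value END) AS col2
FROM  (
   SELECT DISTINCT ON (unique_id, column_name)
          unique_id, column_name, value
   FROM   changes
   WHERE  table_name = 'instances'
   ORDER  BY unique_id, column_name, updated_at DESC
   ) latest
GROUP  BY unique_id;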

Basic pivot table with crosstab()

I completely rewrote the function:

SELECT *
FROM   crosstab(
   $x$
   SELECT DISTINCT ON (1, 2)
          unique_id, column_name, value
   FROM   changes
   WHERE  table_name = 'instances'
   -- AND unique_id = 3 -- un-comment to fetch single row
   ORDER  BY 1, 2, updated_at DESC;
   $x$
 , $y$
   SELECT attname
   FROM   pg_catalog.pg_attribute
   WHERE  attrelid = 'instances'::regclass -- possibly schema-qualify table name
   AND    attnum > 0
   AND    NOT attisdropped
   AND    attname <> 'unique_id'
   ORDER  BY attnum
   $y$
   )
AS tbl (
   unique_id integer
   -- !!! You have to list all columns in order here !!! --
   );

I separated the catalog lookup from the value query, as the crosstab() function with two parameters provides column names separately. Missing values (no entry in changes) are substituted with NULL automatically. A perfect match for this use case!

Assuming that attname matches column_name. Excluding unique_id, which plays a special role.
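
To make the column definition list concrete: with the two-column demo table from the test environment below, the filled-in call would look like this (assuming the additional module is installed, e.g. CREATE EXTENSION tablefunc):

-- requires: CREATE EXTENSION tablefunc;
SELECT *
FROM   crosstab(
   $x$
   SELECT DISTINCT ON (1, 2)
          unique_id, column_name, value
   FROM   changes
   WHERE  table_name = 'instances'
   ORDER  BY 1, 2, updated_at DESC
   $x$
 , $y$
   SELECT attname
   FROM   pg_catalog.pg_attribute
   WHERE  attrelid = 'instances'::regclass
   AND    attnum > 0
   AND    NOT attisdropped
   AND    attname <> 'unique_id'
   ORDER  BY attnum
   $y$
   ) AS tbl (unique_id int, col1 text, col2 text);  -- demo columns listed in order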

Full automation

Addressing your comment: There is a way to supply the column definition list automatically. It's not for the faint of heart, though.

I use a number of advanced Postgres features here: crosstab(), plpgsql function with dynamic SQL, composite type handling, advanced dollar quoting, catalog lookup, aggregate function, window function, object identifier type, ...

Test environment:

CREATE TABLE instances (
unique_id int
, col1 text
, col2 text -- two columns are enough for the demo
);

INSERT INTO instances VALUES
(1, 'foo1', 'bar1')
, (2, 'foo2', 'bar2')
, (3, 'foo3', 'bar3')
, (4, 'foo4', 'bar4');

CREATE TABLE changes (
unique_id int
, table_name text
, column_name text
, value text
, updated_at timestamp
);

INSERT INTO changes VALUES
(1, 'instances', 'col1', 'foo11', '2012-04-12 00:01')
, (1, 'instances', 'col1', 'foo12', '2012-04-12 00:02')
, (1, 'instances', 'col1', 'foo1x', '2012-04-12 00:03')
, (1, 'instances', 'col2', 'bar11', '2012-04-12 00:11')
, (1, 'instances', 'col2', 'bar17', '2012-04-12 00:12')
, (1, 'instances', 'col2', 'bar1x', '2012-04-12 00:13')

, (2, 'instances', 'col1', 'foo2x', '2012-04-12 00:01')
, (2, 'instances', 'col2', 'bar2x', '2012-04-12 00:13')

-- NO change for col1 of row 3 - to test NULLs
, (3, 'instances', 'col2', 'bar3x', '2012-04-12 00:13');

-- NO changes at all for row 4 - to test NULLs

Automated function for one table

CREATE OR REPLACE FUNCTION f_curr_instance(int, OUT t public.instances) AS
$func$
BEGIN
   EXECUTE $f$
      SELECT *
      FROM   crosstab($x$
         SELECT DISTINCT ON (1, 2)
                unique_id, column_name, value
         FROM   changes
         WHERE  table_name = 'instances'
         AND    unique_id = $f$ || $1 || $f$
         ORDER  BY 1, 2, updated_at DESC;
         $x$
       , $y$
         SELECT attname
         FROM   pg_catalog.pg_attribute
         WHERE  attrelid = 'public.instances'::regclass
         AND    attnum > 0
         AND    NOT attisdropped
         AND    attname <> 'unique_id'
         ORDER  BY attnum
         $y$) AS tbl ($f$
   || (SELECT string_agg(attname || ' ' || atttypid::regtype::text
                       , ', ' ORDER BY attnum)  -- must be in order
       FROM   pg_catalog.pg_attribute
       WHERE  attrelid = 'public.instances'::regclass
       AND    attnum > 0
       AND    NOT attisdropped)
   || ')'
   INTO t;
END
$func$ LANGUAGE plpgsql;

The table instances is hard-coded, schema qualified to be unambiguous. Note the use of the table type as return type. There is a row type registered automatically for every table in PostgreSQL. This is bound to match the return type of the crosstab() function.

This binds the function to the type of the table:

  • You will get an error message if you try to DROP the table
  • Your function will fail after an ALTER TABLE. You have to recreate it (without changes). I consider this a bug in 9.1. ALTER TABLE shouldn't silently break the function, but raise an error.

This performs very well.

Call:

SELECT * FROM f_curr_instance(3);

 unique_id | col1   | col2
-----------+--------+-------
         3 | <NULL> | bar3x

Note how col1 is NULL here.

Use in a query to display an instance with its latest values:

SELECT i.unique_id
, COALESCE(c.col1, i.col1)
, COALESCE(c.col2, i.col2)
FROM instances i
LEFT JOIN f_curr_instance(3) c USING (unique_id)
WHERE i.unique_id = 3;

Full automation for any table

(Added 2016. This is dynamite.)

Requires Postgres 9.1 or later. (It could be made to work with pg 8.4, but I didn't bother to backpatch.)

CREATE OR REPLACE FUNCTION f_curr_instance(_id int, INOUT _t ANYELEMENT) AS
$func$
DECLARE
   _type text := pg_typeof(_t);
BEGIN
   EXECUTE (
      SELECT format(
         $f$
         SELECT *
         FROM   crosstab(
            $x$
            SELECT DISTINCT ON (1, 2)
                   unique_id, column_name, value
            FROM   changes
            WHERE  table_name = %1$L
            AND    unique_id = %2$s
            ORDER  BY 1, 2, updated_at DESC;
            $x$
          , $y$
            SELECT attname
            FROM   pg_catalog.pg_attribute
            WHERE  attrelid = %1$L::regclass
            AND    attnum > 0
            AND    NOT attisdropped
            AND    attname <> 'unique_id'
            ORDER  BY attnum
            $y$) AS ct (%3$s)
         $f$
       , _type, _id
       , string_agg(attname || ' ' || atttypid::regtype::text
                  , ', ' ORDER BY attnum)  -- must be in order
         )
      FROM   pg_catalog.pg_attribute
      WHERE  attrelid = _type::regclass
      AND    attnum > 0
      AND    NOT attisdropped
      )
   INTO _t;
END
$func$ LANGUAGE plpgsql;

Call (providing the table type with NULL::public.instances):

SELECT * FROM f_curr_instance(3, NULL::public.instances);

Related:

  • Refactor a PL/pgSQL function to return the output of various SELECT queries
  • How to set value of composite variable field using dynamic SQL

How to merge multiple tables in postgreSQL (using wildcard)

As others have mentioned, there is no mechanism to prepend columns without dropping/recreating tables or individual columns.

To append columns to every table that starts with "m", you can use an anonymous plpgsql code block:

DO
$$
DECLARE
   tbl_ident regclass;
BEGIN
   FOR tbl_ident IN
      SELECT oid::regclass AS fqn  -- gives the schema-qualified, properly quoted table name
      FROM   pg_class
      WHERE  relnamespace = 'my_schema'::regnamespace::oid  -- pick the schema(s) you care about
      AND    relkind = 'r'  -- ordinary tables only; excludes views (v), materialized views (m),
                            -- partitioned tables (p) and foreign tables (f)
      AND    relname ILIKE 'm%'
   LOOP
      -- append the columns you want to each table
      EXECUTE 'ALTER TABLE ' || tbl_ident || ' ADD COLUMN col1 integer, ADD COLUMN col2 varchar(1337)';
   END LOOP;
END
$$;

PostgreSQL Table merge

To "merge" this tables you need:

1) Add "machinetype" and "location" columns to the "Network Status" table:

ALTER TABLE "Network Status"
ADD COLUMN "machinetype" TEXT;
ALTER TABLE "Network Status"
ADD COLUMN "location" TEXT;

2) Populate them with data

UPDATE "Network Status" NS
SET "machinetype" = FL."machinetype",
"location" = FL."location"
FROM "Firewall Log" FL
WHERE FL."ipaddress" = NS."ipaddress"

Or you can create a VIEW and use it to SELECT data as if it were a table:

CREATE VIEW "Status and log" AS
SELECT NS."ipaddress", NS."time", NS."violation", FL."machinetype", FL."location"
FROM "Network Status" NS
JOIN "Firewall Log" FL ON FL."ipaddress" = NS."ipaddress"

Combining 2 views into one in PostgreSQL

"from my understanding UNION and UNION ALL only work when combining 2 SELECT Queries"

Ummmm...no. You can do something like the following:

select col1,col2
from table
union all
select col1,col2
from some_other_table
union all
select col1,col2
from yet_another_table;

etc, etc.
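
So, to combine two views into one, a minimal sketch (view_a and view_b are hypothetical names; the selected column lists must be union-compatible):

CREATE VIEW combined_view AS
SELECT col1, col2 FROM view_a
UNION ALL
SELECT col1, col2 FROM view_b;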

updating table rows in postgres using subquery

Postgres allows:

UPDATE dummy
SET customer=subquery.customer,
address=subquery.address,
partn=subquery.partn
FROM (SELECT address_id, customer, address, partn
FROM /* big hairy SQL */ ...) AS subquery
WHERE dummy.address_id=subquery.address_id;

This syntax is not standard SQL, but it is much more convenient for this type of query than standard SQL. I believe Oracle (at least) accepts something similar.

How to update a table that is built from an inner join of two other tables

According to the Postgres documentation, you can write a query like this:

[ WITH [ RECURSIVE ] with_query [, ...] ]
UPDATE [ ONLY ] table_name [ * ] [ [ AS ] alias ]
SET { column_name = { expression | DEFAULT } |
( column_name [, ...] ) = [ ROW ] ( { expression | DEFAULT } [, ...] ) |
( column_name [, ...] ) = ( sub-SELECT )
} [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition | WHERE CURRENT OF cursor_name ]
[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]

Sample query:

UPDATE 
customer u_c
SET
customer_id = c.customer_id,
first_name = c.first_name,
last_name = c.last_name,
payment_id = p.payment_id,
amount = p.amount
FROM
payment p
INNER JOIN customer c ON p.customer_id = c.customer_id
WHERE
c.id = u_c.id

How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?

9.5 and newer:

PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.

It is comparable to (but not identical to) MySQL's INSERT ... ON DUPLICATE KEY UPDATE.

For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.

Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
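
A minimal sketch, assuming a testtable(id integer PRIMARY KEY, somedata text) like the one used in the 9.4 examples below (the conflict target column needs a unique or primary key constraint):

INSERT INTO testtable (id, somedata)
VALUES (2, 'blah')
ON CONFLICT (id) DO UPDATE          -- 'id' is the conflict target
SET somedata = EXCLUDED.somedata;   -- EXCLUDED refers to the row proposed for insertion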

The commit adding the feature is here and the discussion around its development is here.


If you're on 9.5 or newer and don't need to be backward-compatible, you can stop reading now.



9.4 and older:

PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.

This article discusses the problem in useful detail.

In general you must choose between two options:

  • Individual insert/update operations in a retry loop; or
  • Locking the table and doing batch merge

Individual row retry loop

Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.

The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
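
A condensed sketch in the spirit of that documented example (the merge_db function in the PostgreSQL manual), adapted here to a hypothetical testtable(id integer PRIMARY KEY, somedata text):

CREATE OR REPLACE FUNCTION upsert_testtable(_id int, _somedata text)
  RETURNS void AS
$func$
BEGIN
   LOOP
      -- first, try to update an existing row
      UPDATE testtable SET somedata = _somedata WHERE id = _id;
      IF FOUND THEN
         RETURN;
      END IF;
      -- no row there, so try to insert;
      -- if another transaction inserts the same key first, we get a
      -- unique_violation and loop back to retry the UPDATE
      BEGIN
         INSERT INTO testtable (id, somedata) VALUES (_id, _somedata);
         RETURN;
      EXCEPTION WHEN unique_violation THEN
         NULL;  -- do nothing, loop and try the UPDATE again
      END;
   END LOOP;
END
$func$ LANGUAGE plpgsql;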

This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.

Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to rollback or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.

Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:

-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.

BEGIN;

UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;

-- Remember, this is WRONG. Do NOT COPY IT.

INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);

COMMIT;

then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.

This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.

Please don't roll your own solution for this. Like with message queuing, it's probably wrong.

Bulk upsert with lock

Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.

In this case, you typically follow this process:

  • CREATE a TEMPORARY table

  • COPY or bulk-insert the new data into the temp table

  • LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.

  • Do an UPDATE ... FROM of existing records using the values in the temp table;

  • Do an INSERT of rows that don't already exist in the target table;

  • COMMIT, releasing the lock.

For instance, taking the example given in the question and using a multi-valued INSERT to populate the temp table:

BEGIN;

CREATE TEMPORARY TABLE newvals(id integer, somedata text);

INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');

LOCK TABLE testtable IN EXCLUSIVE MODE;

UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;

INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;

COMMIT;

Related reading

  • UPSERT wiki page
  • UPSERTisms in Postgres
  • Insert, on duplicate update in PostgreSQL?
  • http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
  • Upsert with a transaction
  • Is SELECT or INSERT in a function prone to race conditions?
  • SQL MERGE on the PostgreSQL wiki
  • Most idiomatic way to implement UPSERT in Postgresql nowadays

What about MERGE?

SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.

It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.

Other DBs:

  • INSERT ... ON DUPLICATE KEY UPDATE in MySQL
  • MERGE from MS SQL Server (but see above about MERGE problems)
  • MERGE from Oracle (but see above about MERGE problems)

Concatenate multiple result rows of one column into one, group by another column

Simpler with the aggregate function string_agg() (Postgres 9.0 or later):

SELECT movie, string_agg(actor, ', ') AS actor_list
FROM tbl
GROUP BY 1;

The 1 in GROUP BY 1 is a positional reference and a shortcut for GROUP BY movie in this case.

string_agg() expects data type text as input. Other types need to be cast explicitly (actor::text) - unless an implicit cast to text is defined - which is the case for all other string types (varchar, character, name, ...) and some other types.

As isapir commented, you can add an ORDER BY clause in the aggregate call to get a sorted list - should you need that. Like:

SELECT movie, string_agg(actor, ', ' ORDER BY actor) AS actor_list
FROM tbl
GROUP BY 1;

But it's typically faster to sort rows in a subquery. See:

  • Create array in SELECT
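
A sketch of that subquery approach (sorting before aggregating, using the same hypothetical tbl):

SELECT movie, string_agg(actor, ', ') AS actor_list
FROM  (SELECT movie, actor FROM tbl ORDER BY movie, actor) sub
GROUP  BY movie;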

Lock level on MERGE statement

From the manual:

ROW EXCLUSIVE

Conflicts with the SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes.

The commands UPDATE, DELETE, INSERT, and MERGE acquire this lock mode on the target table (in addition to ACCESS SHARE locks on any other referenced tables). In general, this lock mode will be acquired by any command that modifies data in a table.
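
For illustration, a MERGE of this shape (Postgres 15 or later; testtable and newvals are hypothetical, borrowed from the bulk-upsert example above) acquires ROW EXCLUSIVE on its target table testtable:

MERGE INTO testtable t
USING newvals n ON t.id = n.id
WHEN MATCHED THEN
   UPDATE SET somedata = n.somedata
WHEN NOT MATCHED THEN
   INSERT (id, somedata) VALUES (n.id, n.somedata);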


