How to Do Forward Fill as a Pl/Pgsql Function

How to do forward fill as a PL/PGSQL function

Correct call

Seems like your displayed query is incorrect, and the test case is just too reduced to show it.

Assuming you want to "forward fill" partitioned by id, you'll have to say so:

SELECT row_num, id
, str, gap_fill(str) OVER w AS strx
, val, gap_fill(val) OVER w AS valx
FROM example
WINDOW w AS (PARTITION BY id ORDER BY row_num); -- !

The WINDOW clause is just a syntactical convenience to avoid spelling out the same window frame repeatedly. The important part is the added PARTITION clause.

Simpler function

Much simpler, actually:

CREATE OR REPLACE FUNCTION gap_fill_internal(s anyelement, v anyelement)
RETURNS anyelement
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN COALESCE(v, s); -- that's all!
END
$func$;

CREATE AGGREGATE gap_fill(anyelement) (
SFUNC = gap_fill_internal,
STYPE = anyelement
);

Slightly faster in a quick test.

Standard SQL

Without custom function:

SELECT row_num, id
, str, first_value(str) OVER (PARTITION BY id, ct_str ORDER BY row_num) AS strx
, val, first_value(val) OVER (PARTITION BY id, ct_val ORDER BY row_num) AS valx
FROM (
SELECT *, count(str) OVER w AS ct_str, count(val) OVER w AS ct_val
FROM example
WINDOW w AS (PARTITION BY id ORDER BY row_num)
) sub;

The query becomes more complex with a subquery. Performance is similar. Slightly slower in a quick test.

More explanation in these related answers:

  • Carry over long sequence of missing values with Postgres
  • Retrieve last known value for each column of a row
  • SQL group table by "leading rows" without pl/sql

db<>fiddle here - showing all with extended test case

Apply function to all columns in a Postgres table dynamically

What you ask is not a trivial task. You should be comfortable with PL/pgSQL. I do not advise this kind of dynamic SQL queries for beginners, too powerful.

That said, let's dive in. Buckle up!

CREATE OR REPLACE FUNCTION f_gap_fill_update(_tbl regclass, _id text, _row_num text, OUT nullable_columns int, OUT updated_rows int)
LANGUAGE plpgsql AS
$func$
DECLARE
_pk text := quote_ident(_row_num);
_sql text;
BEGIN
SELECT INTO _sql, nullable_columns
concat_ws(E'\n'
, 'UPDATE ' || _tbl || ' t'
, 'SET (' || string_agg( quote_ident(a.attname), ', ') || ')'
, ' = (' || string_agg('u.' || quote_ident(a.attname), ', ') || ')'
, 'FROM ('
, ' SELECT ' || _pk
, ' , ' || string_agg(format('gap_fill(%1$I) OVER w AS %1$I', a.attname), ', ')
, ' FROM ' || _tbl
, format(' WINDOW w AS (PARTITION BY %I ORDER BY %s)', _id, _pk)
, ' ) u'
, format('WHERE t.%1$s = u.%1$s', _pk)
, 'AND (' || string_agg('t.' || quote_ident(a.attname), ', ') || ') IS DISTINCT FROM'
, ' (' || string_agg('u.' || quote_ident(a.attname), ', ') || ')'
)
, count(*) -- AS _col_ct
FROM (
SELECT a.attname
FROM pg_attribute a
WHERE a.attrelid = _tbl
AND a.attnum > 0
AND NOT a.attisdropped
AND NOT a.attnotnull
ORDER BY a.attnum
) a;

IF nullable_columns = 0 THEN
RAISE EXCEPTION 'No nullable columns found in table >>%<<', _tbl;
ELSIF _sql IS NULL THEN
RAISE EXCEPTION 'SQL string is NULL. Should not occur!';
END IF;

-- RAISE NOTICE '%', _sql; -- debug
EXECUTE _sql; -- execute
GET DIAGNOSTICS updated_rows = ROW_COUNT;
END
$func$;

Example call:

SELECT * FROM f_gap_fill_update('example', 'id', 'row_num');

db<>fiddle here

The function is state of the art.
Generates and executes a query of the form:

UPDATE tbl t
SET (str, val, col1)
= (u.str, u.val, u.col1)
FROM (
SELECT row_num
, gap_fill(str) OVER w AS str, gap_fill(val) OVER w AS val
, gap_fill(col1) OVER w AS col1
FROM tbl
WINDOW w AS (PARTITION BY id ORDER BY row_num)
) u
WHERE t.row_num = u.row_num
AND (t.str, t.val, t.col1) IS DISTINCT FROM
(u.str, u.val, u.col1)

Using pg_catalog.pg_attribute instead of the information schema. See:

  • "Information schema vs. system catalogs"

Note the final WHERE clause to prevent (possibly expensive) empty updates. Only rows that actually change will be written. See:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

Moreover, only nullable columns (not defined NOT NULL) will even be considered, to avoid unnecessary work.

Using ROW syntax in UPDATE to keep the code simple. See:

  • SQL update fields of one table from fields of another one

The function returns two integer values: nullable_columns and updated_rows, reporting what the names suggest.

The function defends against SQL injection properly. See:

  • Table name as a PostgreSQL function parameter
  • SQL injection in Postgres functions vs prepared queries

About GET DIAGNOSTICS:

  • Calculate number of rows affected by batch query in PostgreSQL

The above function updates, but does not return rows. Here is a basic demo how to return rows of varying type:

CREATE OR REPLACE FUNCTION f_gap_fill_select(_tbl_type anyelement, _id text, _row_num text)
RETURNS SETOF anyelement
LANGUAGE plpgsql AS
$func$
DECLARE
_tbl regclass := pg_typeof(_tbl_type)::text::regclass;
_sql text;
BEGIN
SELECT INTO _sql
'SELECT ' || string_agg(CASE WHEN a.attnotnull
THEN format('%I', a.attname)
ELSE format('gap_fill(%1$I) OVER w AS %1$I', a.attname) END
, ', ' ORDER BY a.attnum)
|| E'\nFROM ' || _tbl
|| format(E'\nWINDOW w AS (PARTITION BY %I ORDER BY %I)', _id, _row_num)
FROM pg_attribute a
WHERE a.attrelid = _tbl
AND a.attnum > 0
AND NOT a.attisdropped;

IF _sql IS NULL THEN
RAISE EXCEPTION 'SQL string is NULL. Should not occur!';
END IF;

RETURN QUERY EXECUTE _sql;
-- RAISE NOTICE '%', _sql; -- debug
END
$func$;

Call (note special syntax!):

SELECT * FROM f_gap_fill_select(NULL::example, 'id', 'row_num');

db<>fiddle here

About returning a polymorphic row type:

  • Refactor a PL/pgSQL function to return the output of various SELECT queries

PL/pgSQL function won't run correctly outside pgAdmin

It seems like you can replace the whole function with this plain, much cheaper UPDATE query and a RETURNING clause:

UPDATE table1
SET val = val + 1
WHERE status IS NULL
RETURNING id, name AS t1, data AS t2;

If there can be race conditions, consider:

  • Postgres UPDATE … LIMIT 1

You can run this query as is, without function wrapper. If you want a function wrapper, prepend RETURN QUERY to return results directly. Code example:

  • PostgreSQL: ERROR: 42601: a column definition list is required for functions returning "record"

SQL group table by leading rows without pl/sql

This can be achieved by setting rows containing a to a specific value and all the other rows to a different value. Then use a cumulative sum to get the desired number for the rows. The group number is set to the next number when a new value in the val column is encountered and all the proceeding rows with a will have the same group number as the one before and this continues.

I assume that you would need a distinct number for each group and the number doesn't matter.

select id, val, sum(ex) over(order by id) cm_sum
from (select t.*
,case when val = 'a' then 0 else 1 end ex
from t) x

The result for the query above with the data in question, would be

id  val cm_sum
--------------
1 a 0
2 a 0
3 a3 1
4 a 1
5 a 1
6 a6 2
7 a 2
8 a8 3
9 a 3

Impute via fill-forward/LOCF a column over a range of sequential rows in SQL?

The following query structure will achieve fill-forward if using a PostgreSQL flavoured SQL dialect (e.g. Netezza PureData) for a datetime index (assuming past data). It will also work for multi-column index/keys.

Given the following parameters:

  • <key_cols> - list of columns uniquely identifying each time-series sample (e.g. UNIT, TIME )
  • <impute_col> - column in which values need to be imputed (e.g. VALUE )
  • <impute_over_range_col> - the sequential range column for the time-series (e.g. TIME)

and deriving:

  • <keys_no_range> - key columns except for <impute_over_range_col>

SELECT DISTINCT T1.<key_cols>, 
COALESCE(T1.<impute_col>, T2.<impute_col>) AS <impute_col>
FROM table T1
LEFT OUTER JOIN (SELECT T1.<key_cols>,
T1.<impute_col>,
LEAD(T1.<impute_over_range_col>,1)
OVER (PARTITION BY T1.<keys_no_range>
ORDER BY T1.<key_cols>)
AS NEXT_RANGE
FROM table T1
WHERE T1.<impute_col> IS NOT NULL
ORDER BY T1.<key_cols>
) T2
ON (T1.<impute_over_range_col> BETWEEN T2.<impute_over_range_col>
AND COALESCE(NEXT_RANGE, CURRENT_DATE))
AND T1.<keys_no_range>[0] = T2.<keys_no_range>[0]
AND T1.<keys_no_range>[1] = T2.<keys_no_range>[1]
-- ... for each col in <keys_no_range>

Concretely, for the example in the question:

SELECT DISTINCT T1.UNIT, T1.TIME, 
COALESCE(T1.VALUE, T2.VALUE) AS VALUE
FROM table T1
LEFT OUTER JOIN (SELECT T1.UNIT, T1.TIME,
T1.VALUE,
LEAD(T1.TIME,1)
OVER (PARTITION BY T1.UNIT
ORDER BY T1.UNIT, T1.TIME)
AS NEXT_RANGE
FROM table T1
WHERE T1.VALUE IS NOT NULL
ORDER BY T1.UNIT, T1.TIME
) T2
ON (T1.TIME BETWEEN T2.TIME
AND COALESCE(NEXT_RANGE, CURRENT_DATE))
AND T1.UNIT = T2.UNIT

Here is an SQLFiddle of the above query: http://sqlfiddle.com/#!15/d589b/1

Retrieve last known value for each column of a row

This should work but keep in mind it is an uggly solution

select * from
(select dt from
(select rank() over (order by ctid desc) idx, dt
from sometable ) cx
where idx = 1) dtz,
(
select a from
(select rank() over (order by ctid desc) idx, a
from sometable where a is not null ) ax
where idx = 1) az,
(
select b from
(select rank() over (order by ctid desc) idx, b
from sometable where b is not null ) bx
where idx = 1) bz,
(
select c from
(select rank() over (order by ctid desc) idx, c
from sometable where c is not null ) cx
where idx = 1) cz

See it here at fiddle: http://sqlfiddle.com/#!15/d5940/40

The result will be

DT                                   A        B      C
October, 16 2013 00:00:00+0000 abc died fred


Related Topics



Leave a reply



Submit