What Is the Expected Behaviour For Multiple Set-Returning Functions in Select Clause

Postgres 10 or newer

Null values are added for the smaller set(s). Demo with generate_series():

SELECT generate_series( 1,  2) AS row2
, generate_series(11, 13) AS row3
, generate_series(21, 24) AS row4;

row2 | row3 | row4
-----+------+-----
1 | 11 | 21
2 | 12 | 22
null | 13 | 23
null | null | 24


The manual for Postgres 10:

If there is more than one set-returning function in the query's select
list, the behavior is similar to what you get from putting the
functions into a single LATERAL ROWS FROM( ... ) FROM-clause item. For
each row from the underlying query, there is an output row using the
first result from each function, then an output row using the second
result, and so on. If some of the set-returning functions produce
fewer outputs than others, null values are substituted for the missing
data, so that the total number of rows emitted for one underlying row
is the same as for the set-returning function that produced the most
outputs. Thus the set-returning functions run “in lockstep” until they
are all exhausted, and then execution continues with the next
underlying row.

This ends the traditionally odd behavior.
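The ROWS FROM construct that the manual refers to can also be written out explicitly. The following is a minimal sketch (the table alias t is an arbitrary choice for illustration); it should return the same four rows as the demo above, with null values padding the shorter sets:

```sql
-- Explicit ROWS FROM: the three functions run in lockstep,
-- and shorter sets are padded with null values.
SELECT *
FROM   ROWS FROM (
          generate_series( 1,  2)
        , generate_series(11, 13)
        , generate_series(21, 24)
       ) AS t(row2, row3, row4);
```

Because the functions sit in the FROM clause here, the result does not depend on how a given Postgres version handles set-returning functions in the SELECT list.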

Postgres 9.6 or older

The number of result rows (somewhat surprisingly!) is the least common multiple of the sizes of all sets in the same SELECT list. (It only acts like a CROSS JOIN if the set sizes are pairwise coprime, so that the least common multiple equals their product!) Demo:

SELECT generate_series( 1,  2) AS row2
, generate_series(11, 13) AS row3
, generate_series(21, 24) AS row4;

row2 | row3 | row4
-----+------+-----
1 | 11 | 21
2 | 12 | 22
1 | 13 | 23
2 | 11 | 24
1 | 12 | 21
2 | 13 | 22
1 | 11 | 23
2 | 12 | 24
1 | 13 | 21
2 | 11 | 22
1 | 12 | 23
2 | 13 | 24
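
The 12 rows above follow directly from the set sizes 2, 3 and 4. As a quick arithmetic check, assuming Postgres 13 or later (where the built-in lcm() function is available), the row count of the old behavior can be compared with that of a full cross join:

```sql
-- lcm() takes two arguments, so nest the calls for three set sizes.
SELECT lcm(lcm(2, 3), 4) AS old_select_list_rows   -- 12
     , 2 * 3 * 4         AS cross_join_rows;       -- 24
```

The two numbers differ because 2 and 4 share a common divisor, which is why the old behavior did not amount to a cross join in this demo.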


Documented in the manual for Postgres 9.6, in the chapter SQL Functions Returning Sets, along with the recommendation to avoid it:

Note: The key problem with using set-returning functions in the select
list, rather than the FROM clause, is that putting more than one
set-returning function in the same select list does not behave very
sensibly. (What you actually get if you do so is a number of output
rows equal to the least common multiple of the numbers of rows
produced by each set-returning function.) The LATERAL syntax produces
less surprising results when calling multiple set-returning functions,
and should usually be used instead.

Bold emphasis mine.

A single set-returning function is OK (but still cleaner in the FROM list), but multiple in the same SELECT list is discouraged now. This was a useful feature before we had LATERAL joins. Now it's merely historical ballast.

Related:

  • Parallel unnest() and sort order in PostgreSQL
  • Unnest multiple arrays in parallel
  • What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

Inconsistent behaviour of set-returning functions in sub-query with random()

Indeed, the PostgreSQL mailing list gave a good response: it is likely a bug.

This is the answer, including workaround, from Tom Lane:


Hmm, I think this is an optimizer bug. There are two legitimate behaviors
here:

SELECT * FROM unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should (and does) re-evaluate the WHERE for every row output by unnest().

SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should evaluate WHERE only once, since that happens before expansion of the
set-returning function in the targetlist. (If you're an Oracle user and
you imagine this query as having an implicit "FROM dual", the WHERE should
be evaluated for the single row coming out of the FROM clause.)

In the case you've got here, given the placement of the WHERE in the outer
query, you'd certainly expect it to be evaluated for each row coming out
of the inner query. But the optimizer is deciding it can push the WHERE
clause down to become a WHERE of the sub-select. That is legitimate in a
lot of cases, but not when there are SRF(s) in the sub-select's
targetlist, because that pushes the WHERE to occur before the SRF(s),
analogously to the change between the two queries I wrote.

I'm a bit hesitant to change this in existing releases. Given the lack
of previous complaints, it seems more likely to break queries that were
behaving as-expected than to make people happy. But we could change it
in v10 and up, especially since some other corner-case changes in
SRF-in-tlist behavior are afoot.

In the meantime, you could force it to work as you wish by inserting the
all-purpose optimization fence "OFFSET 0" in the sub-select:

=# SELECT num FROM (
SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num OFFSET 0) AS foo WHERE random() > 0.5;
num
-----
1
4
7
9
(4 rows)

Inconsistent results with jsonb_array_elements_text() twice in the SELECT list

Combining multiple set-returning functions in the SELECT list is not in the SQL standard, where all set-returning elements go into the FROM list. You can do that in Postgres, but it used to exhibit surprising behavior before version 10, where it was finally sanitized.

All of this is not directly related to the datatype jsonb or the function jsonb_array_elements_text() - beyond it being a set-returning function.

If you want the Cartesian product, reliably and not depending on your version of Postgres, use CROSS JOIN LATERAL instead (requires at least Postgres 9.3):

SELECT t.a, jb.b, jc.c
FROM test t
, jsonb_array_elements_text(t.b) jb(b)
, jsonb_array_elements_text(t.c) jc(c)
ORDER BY t.a, ???; -- your desired order seems arbitrary beyond a

The comma in the FROM list (,) is basically short syntax for CROSS JOIN LATERAL here.
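
For comparison, here is a sketch of the same query with the join spelled out. It assumes the table test with the jsonb columns b and c from the question, and it orders by t.a only, since the desired order beyond that is not specified:

```sql
-- Same Cartesian product per row of test, with explicit join syntax.
SELECT t.a, jb.b, jc.c
FROM   test t
CROSS  JOIN LATERAL jsonb_array_elements_text(t.b) AS jb(b)
CROSS  JOIN LATERAL jsonb_array_elements_text(t.c) AS jc(c)
ORDER  BY t.a;
```

The explicit form is more verbose, but it makes the lateral reference to t visible at a glance.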

See:

  • What is the difference between LATERAL and a subquery in PostgreSQL?

Explanation for your actual question:

Why does the behavior of the query below change when the number of elements in the array changes?

  • What is the expected behaviour for multiple set-returning functions in SELECT clause?

RETURNING rows using unnest()?

Use a WITH statement (a data-modifying CTE):

WITH upd AS (
UPDATE Notis new_noti SET notis = '{}'::noti_record_type[]
FROM (SELECT * FROM Notis WHERE user_id = 2 FOR UPDATE) old_noti
WHERE old_noti.user_id = new_noti.user_id RETURNING old_noti.notis
)
SELECT unnest(notis) FROM upd;

Record returned from function has columns concatenated

Generally, to decompose rows returned from a function and get individual columns:

SELECT * FROM account_servicetier_for_day(20424, '2014-08-12');



As for the query:

Postgres 9.3 or newer

Cleaner with JOIN LATERAL:

SELECT '2014-08-12' AS day, 0 AS inbytes, 0 AS outbytes
, a.username, a.accountid, a.userid
, f.* -- but avoid duplicate column names!
FROM account_tab a
, account_servicetier_for_day(a.accountid, '2014-08-12') f -- <-- HERE
WHERE a.isdsl = 1
AND a.dslservicetypeid IS NOT NULL
AND NOT EXISTS (
SELECT FROM dailyaccounting_tab
WHERE day = '2014-08-12'
AND accountid = a.accountid
)
ORDER BY a.username;

The LATERAL keyword is implicit here, since functions can always refer to earlier FROM items. The manual:

LATERAL can also precede a function-call FROM item, but in this
case it is a noise word, because the function expression can refer to
earlier FROM items in any case.

Related:

  • Insert multiple rows in one table based on number in another table

Short notation with a comma in the FROM list is (mostly) equivalent to a CROSS JOIN LATERAL (same as [INNER] JOIN LATERAL ... ON TRUE) and thus removes rows from the result where the function call returns no row. To retain such rows, use LEFT JOIN LATERAL ... ON TRUE:

...
FROM account_tab a
LEFT JOIN LATERAL account_servicetier_for_day(a.accountid, '2014-08-12') f ON TRUE
...
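
The difference can be seen in a small self-contained sketch. It does not use the tables from the question; the inline VALUES list and the names t, id, arr and elem are made up for illustration, and unnest() stands in for a function that returns no row for some inputs:

```sql
-- Comma (CROSS JOIN LATERAL): id 2 disappears, because unnest()
-- returns no row for the empty array. Returns 2 rows.
SELECT t.id, e.elem
FROM  (VALUES (1, '{a,b}'::text[]), (2, '{}'::text[])) AS t(id, arr)
     , unnest(t.arr) AS e(elem)
ORDER  BY t.id, e.elem;

-- LEFT JOIN LATERAL ... ON TRUE: id 2 is kept with elem = null.
-- Returns 3 rows.
SELECT t.id, e.elem
FROM  (VALUES (1, '{a,b}'::text[]), (2, '{}'::text[])) AS t(id, arr)
LEFT   JOIN LATERAL unnest(t.arr) AS e(elem) ON TRUE
ORDER  BY t.id, e.elem;
```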

Also, don't use NOT IN (subquery) when you can avoid it. It's the slowest and most tricky of several ways to do that:

  • Select rows which are not present in other table

I suggest NOT EXISTS instead.

Postgres 9.2 or older

You can call a set-returning function in the SELECT list (which is a Postgres extension of standard SQL). For performance reasons, this is best done in a subquery. Decompose the (well-known!) row type in the outer query to avoid repeated evaluation of the function:

SELECT '2014-08-12' AS day, 0 AS inbytes, 0 AS outbytes
, a.username, a.accountid, a.userid
, (a.rec).* -- but be wary of duplicate column names!
FROM (
SELECT *, account_servicetier_for_day(a.accountid, '2014-08-12') AS rec
FROM account_tab a
WHERE a.isdsl = 1
AND a.dslservicetypeid IS NOT NULL
AND NOT EXISTS (
SELECT FROM dailyaccounting_tab
WHERE day = '2014-08-12'
AND accountid = a.accountid
)
) a
ORDER BY a.username;

Related answer by Craig Ringer, explaining why it is better not to decompose on the same query level:

  • How to avoid multiple function evals with the (func()).* syntax in an SQL query?

Postgres 10 removed some oddities in the behavior of set-returning functions in the SELECT list:

  • What is the expected behaviour for multiple set-returning functions in SELECT clause?

Unexpected behaviour using table functions over multivalued columns with same length

You are using table functions in the SELECT list, which is supported, but strange. You should use them in the FROM clause of your query instead.

Besides, the behavior of your query changed in PostgreSQL version 10 (and you must be using an older version). The release notes from v10 describe that:

  • Change the implementation of set-returning functions appearing in a query's SELECT list (Andres Freund)

    Set-returning functions are now evaluated before evaluation of scalar expressions in the SELECT list, much as though they had been placed in a LATERAL FROM-clause item. This allows saner semantics for cases where multiple set-returning functions are present. If they return different numbers of rows, the shorter results are extended to match the longest result by adding nulls. Previously the results were cycled until they all terminated at the same time, producing a number of rows equal to the least common multiple of the functions' periods. In addition, set-returning functions are now disallowed within CASE and COALESCE constructs. For more information see Section 37.4.8.

Your query exhibits the old "cycling" behavior: if both functions return a table of two rows, you will get two result rows. If one returns m rows and the other n, the number of result rows will be the least common multiple of m and n.

You should upgrade to a supported version of PostgreSQL.

How to use generate_series() to generate a grid of values

Move the function calls to the FROM clause:

SELECT *
FROM generate_series(1,5) a
, generate_series(1,5) b;

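Upgrading to Postgres 10 or later ends the odd least-common-multiple behavior, but set-returning functions in the SELECT list then run in lockstep and still do not produce a grid, so the FROM clause remains the way to go. Detailed explanation: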

  • What is the expected behaviour for multiple set-returning functions in SELECT clause?

Comma-separated FROM items are cross-joined. See:

  • Why does this implicit join get planned differently than an explicit join?
  • What does [FROM x, y] mean in Postgres?
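
The comma-separated form can equally be written with an explicit CROSS JOIN. A minimal sketch, with an ORDER BY added only to make the output deterministic:

```sql
-- 5 x 5 grid: every value of a paired with every value of b (25 rows).
SELECT a, b
FROM   generate_series(1, 5) AS a
CROSS  JOIN generate_series(1, 5) AS b
ORDER  BY a, b;
```

Either spelling should give the same grid on any Postgres version, since neither depends on how set-returning functions behave in the SELECT list.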

