Is There Something Like a Zip() Function in Postgresql That Combines Two Arrays

Is there something like a zip() function in PostgreSQL that combines two arrays?

Postgres 9.5 or later

has array_agg(array expression):

array_agg ( anyarray ) → anyarray

Concatenates all the input arrays into an array of one higher
dimension. (The inputs must all have the same dimensionality, and
cannot be empty or null.)

This is a drop-in replacement for my custom aggregate function array_agg_mult() demonstrated below. It's implemented in C and considerably faster. Use it.

Postgres 9.4

Use the ROWS FROM construct or the updated unnest() which takes multiple arrays to unnest in parallel. Each can have a different length. You get (per documentation):

[...] the number of result rows in this case is that of the largest function
result, with smaller results padded with null values to match.

Use this cleaner and simpler variant:

SELECT ARRAY[a,b] AS ab
FROM unnest('{a,b,c}'::text[]
, '{d,e,f}'::text[]) x(a,b);

Postgres 9.3 or older

Simple zip()

Consider the following demo for Postgres 9.3 or earlier:

SELECT ARRAY[a,b] AS ab
FROM (
SELECT unnest('{a,b,c}'::text[]) AS a
, unnest('{d,e,f}'::text[]) AS b
) x;

Result:

  ab
-------
{a,d}
{b,e}
{c,f}

Note that both arrays must have the same number of elements to unnest in parallel, or you get a cross join instead.

You can wrap this into a function, if you want to:

CREATE OR REPLACE FUNCTION zip(anyarray, anyarray)
RETURNS SETOF anyarray LANGUAGE SQL AS
$func$
SELECT ARRAY[a,b] FROM (SELECT unnest($1) AS a, unnest($2) AS b) x;
$func$;

Call:

SELECT zip('{a,b,c}'::text[],'{d,e,f}'::text[]);

Same result.

zip() to multi-dimensional array:

Now, if you want to aggregate that new set of arrays into one 2-dimenstional array, it gets more complicated.

SELECT ARRAY (SELECT ...)

or:

SELECT array_agg(ARRAY[a,b]) AS ab
FROM (
SELECT unnest('{a,b,c}'::text[]) AS a
,unnest('{d,e,f}'::text[]) AS b
) x

or:

SELECT array_agg(ARRAY[ARRAY[a,b]]) AS ab
FROM ...

will all result in the same error message (tested with pg 9.1.5):

ERROR: could not find array type for data type text[]

But there is a way around this, as we worked out under this closely related question.

Create a custom aggregate function:

CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
, STYPE = anyarray
, INITCOND = '{}'
);

And use it like this:

SELECT array_agg_mult(ARRAY[ARRAY[a,b]]) AS ab
FROM (
SELECT unnest('{a,b,c}'::text[]) AS a
, unnest('{d,e,f}'::text[]) AS b
) x

Result:

{{a,d},{b,e},{c,f}}

Note the additional ARRAY[] layer! Without it and just:

SELECT array_agg_mult(ARRAY[a,b]) AS ab
FROM ...

You get:

{a,d,b,e,c,f}

Which may be useful for other purposes.

Roll another function:

CREATE OR REPLACE FUNCTION zip2(anyarray, anyarray)
RETURNS SETOF anyarray LANGUAGE SQL AS
$func$
SELECT array_agg_mult(ARRAY[ARRAY[a,b]])
FROM (SELECT unnest($1) AS a, unnest($2) AS b) x;
$func$;

Call:

SELECT zip2('{a,b,c}'::text[],'{d,e,f}'::text[]); -- or any other array type

Result:

{{a,d},{b,e},{c,f}}

Unnest multiple arrays in parallel

You will love this new feature of Postgres 9.4:

unnest(anyarray, anyarray [, ...])

unnest() with the much anticipated (at least by me) capability to unnest multiple arrays in parallel cleanly. The manual:

expand multiple arrays (possibly of different types) to a set of rows. This is only allowed in the FROM clause;

It's a special implementation of the new ROWS FROM feature.

Your function can now just be:

CREATE OR REPLACE FUNCTION multi_unnest(_some_id int
, _amounts numeric[]
, _invoices text[])
RETURNS TABLE (some_id int, amount numeric, invoice text) AS
$func$
SELECT _some_id, u.* FROM unnest(_amounts, _invoices) u;
$func$ LANGUAGE sql;

Call:

SELECT * FROM multi_unnest(123, '{100, 40.5, 76}'::numeric[] 
, '{01-2222-05,01-3333-04,01-4444-08}'::text[]);

Of course, the simple form can be replaced with plain SQL (no additional function):

SELECT 123 AS some_id, *
FROM unnest('{100, 40.5, 76}'::numeric[]
, '{01-2222-05,01-3333-04,01-4444-08}'::text[]) AS u(amount, invoice);

In earlier versions (Postgres 9.3-), you can use the less elegant and less safe form:

SELECT 123 AS some_id
, unnest('{100, 40.5, 76}'::numeric[]) AS amount
, unnest('{01-2222-05,01-3333-04,01-4444-08}'::text[]) AS invoice;

Caveats of the old shorthand form: besides being non-standard to have set-returning function in the SELECT list, the number of rows returned would be the lowest common multiple of each arrays number of elements (with surprising results for unequal numbers). Details in these related answers:

  • Parallel unnest() and sort order in PostgreSQL
  • Is there something like a zip() function in PostgreSQL that combines two arrays?

This behavior has finally been sanitized with Postgres 10. Multiple set-returning functions in the SELECT list produce rows in "lock-step" now. See:

  • What is the expected behaviour for multiple set-returning functions in SELECT clause?

Joining arrays within group by clause

UNION ALL

You could "unpivot" with UNION ALL first:

SELECT name, array_agg(c) AS c_arr
FROM (
SELECT name, id, 1 AS rnk, col1 AS c FROM tbl
UNION ALL
SELECT name, id, 2, col2 FROM tbl
ORDER BY name, id, rnk
) sub
GROUP BY 1;

Adapted to produce the order of values you later requested. The manual:

The aggregate functions array_agg, json_agg, string_agg, and xmlagg,
as well as similar user-defined aggregate functions, produce
meaningfully different result values depending on the order of the
input values. This ordering is unspecified by default, but can be
controlled by writing an ORDER BY clause within the aggregate call, as
shown in Section 4.2.7. Alternatively, supplying the input values from
a sorted subquery will usually work.

Bold emphasis mine.

LATERAL subquery with VALUES expression

LATERAL requires Postgres 9.3 or later.

SELECT t.name, array_agg(c) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
CROSS JOIN LATERAL (VALUES (t.col1), (t.col2)) v(c)
GROUP BY 1;

Same result. Only needs a single pass over the table.

Custom aggregate function

Or you could create a custom aggregate function like discussed in these related answers:

  • Selecting data into a Postgres array
  • Is there something like a zip() function in PostgreSQL that combines two arrays?
CREATE AGGREGATE array_agg_mult (anyarray)  (
SFUNC = array_cat
, STYPE = anyarray
, INITCOND = '{}'
);

Then you can:

SELECT name, array_agg_mult(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;

Or, typically faster, while not standard SQL:

SELECT name, array_agg_mult(ARRAY[col1, col2]) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
GROUP BY 1;

The added ORDER BY id (which can be appended to such aggregate functions) guarantees your desired result:

a | {1,2,3,4}
b | {5,6,7,8}

Or you might be interested in this alternative:

SELECT name, array_agg_mult(ARRAY[ARRAY[col1, col2]] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;

Which produces 2-dimensional arrays:

a | {{1,2},{3,4}}
b | {{5,6},{7,8}}

The last one can be replaced (and should be, as it's faster!) with the built-in array_agg() in Postgres 9.5 or later - with its added capability of aggregating arrays:

SELECT name, array_agg(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;

Same result. The manual:

input arrays concatenated into array of one higher dimension (inputs
must all have same dimensionality, and cannot be empty or null)

So not exactly the same as our custom aggregate function array_agg_mult();

Writing a PostgreSQL aggregate function where the input values are arrays

You can write a pretty simple CREATE AGGREGATE statement for this; see this similar prior post.

It's not very efficient, so PL/R will be a much better option. Or writing an aggregate in C if you're brave - the PostgreSQL array API in C is pretty ugly.

Aggregating all values not in the same group

In Postgres 11 or later, use a window function with a custom frame and a frame_exclusion:

SELECT *, array_combine(values) OVER (ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
EXCLUDE CURRENT ROW) AS agg_values
FROM tbl;

If name is not UNIQUE, and since you asked:

all values where name not = John Smith

SELECT *, array_combine(values) OVER (ORDER BY name
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
EXCLUDE GROUP) AS agg_values
FROM tbl;

db<>fiddle here

The first one (also) works with arbitrary order of rows, only excluding the current one. The second requires ORDER BY to establish which rows are in the same group.

The manual:

The frame_exclusion option allows rows around the current row to be
excluded from the frame, even if they would be included according to
the frame start and frame end options. EXCLUDE CURRENT ROW
excludes the current row from the frame. EXCLUDE GROUP excludes the
current row and its ordering peers from the frame.
EXCLUDE TIES
excludes any peers of the current row from the frame, but not the
current row itself. [...]

Bold emphasis mine.

This uses the custom aggregate function array_combine(anyarray) provided by a_horse.

Or here:

  • Selecting data into a Postgres array
  • Is there something like a zip() function in PostgreSQL that combines two arrays?

Pairwise array sum aggregate function?

General solutions for any number of arrays with any number of elements. Individual elements or the the whole array can be NULL, too:

Simpler in 9.4+ using WITH ORDINALITY

SELECT ARRAY (
SELECT sum(elem)
FROM tbl t
, unnest(t.arr) WITH ORDINALITY x(elem, rn)
GROUP BY rn
ORDER BY rn
);

See:

  • PostgreSQL unnest() with element number

Postgres 9.3+

This makes use of an implicit LATERAL JOIN

SELECT ARRAY (
SELECT sum(arr[rn])
FROM tbl t
, generate_subscripts(t.arr, 1) AS rn
GROUP BY rn
ORDER BY rn
);

See:

  • What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

Postgres 9.1

SELECT ARRAY (
SELECT sum(arr[rn])
FROM (
SELECT arr, generate_subscripts(arr, 1) AS rn
FROM tbl t
) sub
GROUP BY rn
ORDER BY rn
);

The same works in later versions, but set-returning functions in the SELECT list are not standard SQL and were frowned upon by some. Should be OK since Postgres 10, though. See:

  • What is the expected behaviour for multiple set-returning functions in SELECT clause?

db<>fiddle here

Old sqlfiddle

Related:

  • Is there something like a zip() function in PostgreSQL that combines two arrays?

SQL multiple UNNEST in single select list

This isn't about unnest as such, but about PostgreSQL's very weird handling of multiple set-returning functions in the SELECT list. Set-returning functions in SELECT aren't part of the ANSI SQL standard.

You will find behaviour much saner with LATERAL queries, which should be preferred over using a a set-returning function in FROM as much as possible:

select a, b FROM unnest(array[1,2]) a, LATERAL unnest(array[1,2,3]) b;

e.g.

regress=> select a, b FROM unnest(array[1,2]) a, LATERAL unnest(array[1,2,3]) b;
a | b
---+---
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
2 | 3
(6 rows)

The only time I still use multiple set-returning functions in SELECT is when I want to pair up values from functions that both return the same number of rows. The need for that will go away in 9.4, with multi-argument unnest and with support for WITH ORDINALITY.

Aggregate single array of distinct elements from array column, excluding NULL

Plain array_agg() does this with arrays:

Concatenates all the input arrays into an array of one higher
dimension. (The inputs must all have the same dimensionality, and
cannot be empty or null.)

Not what you need. See:

  • Is there something like a zip() function in PostgreSQL that combines two arrays?

You need something like this: unnest(), process and sort elements an feed the resulting set to an ARRAY constructor:

SELECT ARRAY(
SELECT DISTINCT elem::date
FROM (SELECT unnest(date_array) FROM example) AS e(elem)
WHERE elem IS NOT NULL
ORDER BY elem DESC
);

db<>fiddle here

To be clear: we could use array_agg() (taking non-array input, different from your incorrect use) instead of the final ARRAY constructor. But the latter is faster (and simpler, too, IMO).

They happen to be timestamps of just dates (eg without time or timezone)

So cast to date and trim the noise.

Should be the fastest way:

  • A correlated subquery is a bit faster than a LATERAL one (and does the simple job).
  • An ARRAY constructor is a bit faster than the aggregate function array_agg() (and does the simple job).
  • Most importantly, sorting and applying DISTINCT in a subquery is typically faster than inline ORDER BY and DISTINCT in an aggregate function (and does the simple job).

See:

  • Unnest arrays of different dimensions

  • How to select 1d array from 2d array?

  • Why is array_agg() slower than the non-aggregate ARRAY() constructor?

  • What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

Performance comparison:

db<>fiddle here



Related Topics



Leave a reply



Submit