Concatenate/Merge Array Values During Grouping/Aggregation

Custom aggregate

Approach 1: define a custom aggregate. Here's one I wrote earlier.

CREATE TABLE my_test(title text, tags text[]);

INSERT INTO my_test(title, tags) VALUES
  ('ridealong', '{comedy,other}'),
  ('ridealong', '{comedy,tragedy}'),
  ('freddyjason', '{horror,silliness}');

CREATE AGGREGATE array_cat_agg(anyarray) (
  SFUNC = array_cat,
  STYPE = anyarray
);

SELECT title, array_cat_agg(tags) FROM my_test GROUP BY title;

LATERAL query

... or since you don't want to preserve order and want to deduplicate, you could use a LATERAL query like:

SELECT title, array_agg(DISTINCT tag ORDER BY tag) 
FROM my_test, unnest(tags) tag
GROUP BY title;

in which case you don't need the custom aggregate. This approach is probably a fair bit slower for big data sets because of the deduplication; removing the ORDER BY, if you don't need it, may help.
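For intuition, the group-flatten-deduplicate-sort semantics of the LATERAL query can be sketched in plain JavaScript (the row data below mirrors the my_test table above):

```javascript
// Sketch: group rows by title, flatten the tag arrays, deduplicate, and sort.
const rows = [
  { title: 'ridealong', tags: ['comedy', 'other'] },
  { title: 'ridealong', tags: ['comedy', 'tragedy'] },
  { title: 'freddyjason', tags: ['horror', 'silliness'] },
];

const grouped = {};
for (const { title, tags } of rows) {
  // like array_agg(DISTINCT tag ORDER BY tag) over the unnested tags
  grouped[title] = [...new Set([...(grouped[title] ?? []), ...tags])].sort();
}

console.log(grouped);
// { ridealong: ['comedy', 'other', 'tragedy'], freddyjason: ['horror', 'silliness'] }
```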

Combine arrays in MongoDB $group aggregation

You can use $reduce:

db.collection.aggregate([
  {
    $group: {
      _id: "$date",
      hourlyConsumption: { $push: "$hourlyConsumption" }
    }
  },
  {
    $set: {
      hourlyConsumption: {
        $reduce: {
          input: "$hourlyConsumption",
          initialValue: [],
          in: {
            $map: {
              input: { $range: [0, 24] }, // 24 hourly slots (indices 0-23)
              as: "h",
              in: {
                $sum: [
                  { $arrayElemAt: ["$$value", "$$h"] },
                  { $arrayElemAt: ["$$this", "$$h"] }
                ]
              }
            }
          }
        }
      }
    }
  }
])

Or you can use $unwind and $group:

db.collection.aggregate([
  {
    $unwind: {
      path: "$hourlyConsumption",
      includeArrayIndex: "hour"
    }
  },
  {
    $group: {
      _id: {
        date: "$date",
        hour: "$hour"
      },
      hourlyConsumption: { $sum: "$hourlyConsumption" }
    }
  },
  { $sort: { "_id.hour": 1 } },
  {
    $group: {
      _id: "$_id.date",
      hourlyConsumption: { $push: "$hourlyConsumption" }
    }
  }
])

Note, however, that using $unwind works against the bucketing design pattern of your schema.
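The element-wise summing both pipelines perform can be sketched in plain JavaScript (the sample documents below are made up for illustration):

```javascript
// Sketch: group documents by date and sum hourlyConsumption element-wise.
const docs = [
  { date: '2023-01-01', hourlyConsumption: [1, 2, 3] },
  { date: '2023-01-01', hourlyConsumption: [4, 5, 6] },
  { date: '2023-01-02', hourlyConsumption: [7, 8, 9] },
];

const byDate = {};
for (const { date, hourlyConsumption } of docs) {
  const acc = byDate[date] ?? [];
  hourlyConsumption.forEach((v, h) => {
    acc[h] = (acc[h] ?? 0) + v; // like $sum of $arrayElemAt per hour index
  });
  byDate[date] = acc;
}

console.log(byDate);
// { '2023-01-01': [5, 7, 9], '2023-01-02': [7, 8, 9] }
```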

How to group and merge array entries and to sum-up values on multiple common (but not all) keys?

Just test both properties in the find() callback, and add both to the new object when pushing into acc.

const arr = [
  {'ID':'1','Parent':'1','Member': '1','Code': '123','Subject': 'Org A','value': 0.3},
  {'ID':'2','Parent':'1','Member': '1','Code': '124','Subject': 'Org A','value': 0.25},
  {'ID':'3','Parent':'1','Member': '1','Code': '123','Subject': 'Org B','value': 0.45},
  {'ID':'4','Parent':'1','Member': '2','Code': '125','Subject': 'Org A','value': 0.8},
  {'ID':'5','Parent':'1','Member': '2','Code': '211','Subject': 'Org C','value': 0.3},
  {'ID':'6','Parent':'1','Member': '3','Code': '221','Subject': 'Org B','value': 0.3},
  {'ID':'7','Parent':'1','Member': '3','Code': '221','Subject': 'Org C','value': 0.25},
  {'ID':'8','Parent':'1','Member': '3','Code': '234','Subject': 'Org A','value': 0.45},
  {'ID':'9','Parent':'1','Member': '4','Code': '123','Subject': 'Org A','value': 0.8},
  {'ID':'10','Parent':'2','Member': '5','Code': '123','Subject': 'Org D','value': 0.3},
  {'ID':'11','Parent':'2','Member': '5','Code': '123','Subject': 'Org E','value': 0.3},
  {'ID':'12','Parent':'2','Member': '6','Code': '125','Subject': 'Org E','value': 0.25},
  {'ID':'13','Parent':'2','Member': '6','Code': '211','Subject': 'Org F','value': 0.45},
  {'ID':'14','Parent':'2','Member': '6','Code': '221','Subject': 'Org F','value': 0.8},
  {'ID':'15','Parent':'2','Member': '6','Code': '123','Subject': 'Org G','value': 0.3},
  {'ID':'16','Parent':'3','Member': '7','Code': '124','Subject': 'Org H','value': 0.3},
  {'ID':'17','Parent':'3','Member': '8','Code': '124','Subject': 'Org H','value': 0.25},
  {'ID':'18','Parent':'3','Member': '9','Code': '123','Subject': 'Org I','value': 0.45},
  {'ID':'19','Parent':'3','Member': '10','Code': '123','Subject': 'Org J','value': 0.8},
  {'ID':'20','Parent':'3','Member': '10','Code': '211','Subject': 'Org I','value': 0.3},
  {'ID':'21','Parent':'4','Member': '11','Code': '221','Subject': 'Org K','value': 0.3},
  {'ID':'22','Parent':'4','Member': '11','Code': '234','Subject': 'Org K','value': 0.25},
  {'ID':'23','Parent':'4','Member': '12','Code': '234','Subject': 'Org K','value': 0.45},
  {'ID':'24','Parent':'4','Member': '12','Code': '123','Subject': 'Org L','value': 0.8},
  {'ID':'25','Parent':'4','Member': '13','Code': '211','Subject': 'Org M','value': 0.3}
];

const summed = arr.reduce((acc, cur) => {
  const item = acc.find(({ Code, Parent }) =>
    Code === cur.Code && Parent === cur.Parent);
  if (item) {
    item.value += cur.value;
  } else {
    acc.push({
      Code: cur.Code,
      Parent: cur.Parent,
      value: cur.value
    });
  }
  return acc;
}, []);
console.log(arr); // not modified
console.log(summed);

Merging an array while grouping from MongoDB collection

You can use $concatArrays together with $reduce.

Instead of $project you can use $addFields (available from MongoDB v3.4) or $set (available from v4.2) to keep the other fields:

db.variants.aggregate([
  {
    "$group": {
      "_id": "$productId",
      "price": { "$min": "$price" },
      "stock": { "$sum": "$stock" },
      "unit": { "$first": "$unit" },
      "images": { "$push": "$images" },
      "variants": { "$push": "$$ROOT" }
    }
  },
  {
    "$addFields": { // or $set
      "images": {
        "$reduce": {
          "input": "$images",
          "initialValue": [],
          "in": {
            "$concatArrays": ["$$value", "$$this"]
          }
        }
      }
    }
  }
])

How to Group by and concatenate arrays in PostgreSQL

To preserve the same dimension of your array you can't directly use array_agg(), so we first unnest the arrays and apply DISTINCT to remove duplicates (1). The outer query then aggregates. To preserve the ordering of values, include ORDER BY within the aggregate function:

select time, array_agg(col order by col) as col
from (
  select distinct time, unnest(col) as col
  from yourtable
) t
group by time
order by time

(1) If you don't need duplicate removal, just drop the DISTINCT keyword.

Combine arrays when grouping documents

Depending on your MongoDB version and what is practical, you could just apply $reduce with $concatArrays to "join" the resulting "array of arrays" in the grouped document:

db.getCollection('stuff').aggregate([
{ "$group": {
"_id": {
"product": "$product", "state": "$state"
},
"nondnd": { "$push": "$nondnd" },
"dnd": { "$push": "$dnd" },
"land": { "$push": "$land" },
"emails": { "$push": "$emails" }
}},
{ "$addFields": {
"nondnd": {
"$reduce": {
"input": "$nondnd",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
},
"dnd": {
"$reduce": {
"input": "$dnd",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
},
"land": {
"$reduce": {
"input": "$land",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
},
"emails": {
"$reduce": {
"input": "$emails",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
}
}}
])

Or even "ultra-modern", for when you really don't like repeating yourself (but you probably should be generating the pipeline stages anyway):

db.getCollection('stuff').aggregate([
{ "$project": {
"product": 1,
"state": 1,
"data": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"cond": { "$in": [ "$$this.k", ["nondnd","dnd","land","emails"] ] }
}
}
}},
{ "$unwind": "$data" },
{ "$unwind": "$data.v" },
{ "$group": {
"_id": {
"product": "$product",
"state": "$state",
"k": "$data.k"
},
"v": { "$push": "$data.v" }
}},
{ "$group": {
"_id": {
"product": "$_id.product",
"state": "$_id.state"
},
"data": { "$push": { "k": "$_id.k", "v": "$v" } }
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": {
"$concatArrays": [
[{ "k": "_id", "v": "$_id" }],
{ "$map": {
"input": ["nondnd","dnd","land","emails"],
"in": {
"$cond": {
"if": { "$ne": [{ "$indexOfArray": [ "$data.k", "$$this" ] },-1] },
"then": {
"$arrayElemAt": [
"$data",
{ "$indexOfArray": [ "$data.k", "$$this" ] }
]
},
"else": { "k": "$$this", "v": [] }
}
}
}}
]
}
}
}}
])

Or you can alternatively join the arrays at the source, mapping each element to a type tag, and then reconstruct the fields after the grouping:

db.getCollection('stuff').aggregate([
{ "$project": {
"product": 1,
"state": 1,
"combined": {
"$concatArrays": [
{ "$map": {
"input": "$nondnd",
"in": { "t": "nondnd", "v": "$$this" }
}},
{ "$map": {
"input": "$dnd",
"in": { "t": "dnd", "v": "$$this" }
}},
{ "$map": {
"input": "$land",
"in": { "t": "land", "v": "$$this" }
}},
{ "$map": {
"input": "$emails",
"in": { "t": "emails", "v": "$$this" }
}}
]
}
}},
{ "$unwind": "$combined" },
{ "$group": {
"_id": {
"product": "$product", "state": "$state"
},
"combined": { "$push": "$combined" }
}},
{ "$project": {
"nondnd": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "nondnd" ] }
}
},
"in": "$$this.v"
}
},
"dnd": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "dnd" ] }
}
},
"in": "$$this.v"
}
},
"land": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "land" ] }
}
},
"in": "$$this.v"
}
},
"emails": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "emails" ] }
}
},
"in": "$$this.v"
}
}
}}
])

So this relies largely on $map and $filter to construct and deconstruct the contents of a single combined array, which is of course perfectly fine to $unwind.
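The tag-then-regroup idea can be sketched in plain JavaScript (field names follow the example above; the input document is made up):

```javascript
// Sketch: tag each element with its source field, combine into one array,
// then split the combined array back out by tag.
const doc = { nondnd: [1, 2], dnd: [10], emails: ['a'] };
const fields = ['nondnd', 'dnd', 'land', 'emails'];

// Construct: one combined array of { t, v } pairs (like the $map/$concatArrays step).
const combined = fields.flatMap(t => (doc[t] ?? []).map(v => ({ t, v })));

// Deconstruct: filter by tag and unwrap (like the $filter/$map step).
const rebuilt = Object.fromEntries(
  fields.map(t => [t, combined.filter(e => e.t === t).map(e => e.v)])
);

console.log(rebuilt);
// { nondnd: [1, 2], dnd: [10], land: [], emails: ['a'] }
```

Note how a field missing from the source document (land here) still comes back as an empty array, matching the $cond/else branch in the pipeline.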

The same result comes from each case:

/* 1 */
{
"_id" : {
"product" : "product1",
"state" : "state2"
},
"nondnd" : [
9.0,
8.0,
2.0
],
"dnd" : [
10.0,
7.0,
11.0
],
"land" : [
1.0,
3.0
],
"emails" : [
"e",
"g"
]
}

/* 2 */
{
"_id" : {
"product" : "product1",
"state" : "state1"
},
"nondnd" : [
1.0,
2.0,
3.0,
9.0,
8.0,
2.0
],
"dnd" : [
4.0,
5.0,
10.0,
7.0,
11.0
],
"land" : [
2.0,
4.0,
6.0,
8.0
],
"emails" : [
"a",
"b",
"c",
"d"
]
}

Joining arrays within group by clause

UNION ALL

You could "unpivot" with UNION ALL first:

SELECT name, array_agg(c) AS c_arr
FROM (
  SELECT name, id, 1 AS rnk, col1 AS c FROM tbl
  UNION ALL
  SELECT name, id, 2, col2 FROM tbl
  ORDER BY name, id, rnk
) sub
GROUP BY 1;

Adapted to produce the order of values you later requested. The manual:

The aggregate functions array_agg, json_agg, string_agg, and xmlagg,
as well as similar user-defined aggregate functions, produce
meaningfully different result values depending on the order of the
input values. This ordering is unspecified by default, but can be
controlled by writing an ORDER BY clause within the aggregate call, as
shown in Section 4.2.7. Alternatively, supplying the input values from
a sorted subquery will usually work.

Bold emphasis mine.

LATERAL subquery with VALUES expression

LATERAL requires Postgres 9.3 or later.

SELECT t.name, array_agg(c) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
CROSS JOIN LATERAL (VALUES (t.col1), (t.col2)) v(c)
GROUP BY 1;

Same result. Only needs a single pass over the table.
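The "unpivot" both queries perform can be sketched in plain JavaScript (the sample rows are made up to match the a/b result shown further down):

```javascript
// Sketch: unpivot each row's (col1, col2) into individual values,
// then aggregate per name in (name, id) order.
const rows = [
  { name: 'a', id: 1, col1: 1, col2: 2 },
  { name: 'a', id: 2, col1: 3, col2: 4 },
  { name: 'b', id: 3, col1: 5, col2: 6 },
  { name: 'b', id: 4, col1: 7, col2: 8 },
];

const cArr = {};
for (const r of [...rows].sort((x, y) => x.id - y.id)) {
  // (VALUES (t.col1), (t.col2)) contributes two rows per input row
  for (const c of [r.col1, r.col2]) {
    (cArr[r.name] ??= []).push(c);
  }
}

console.log(cArr); // { a: [1, 2, 3, 4], b: [5, 6, 7, 8] }
```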

Custom aggregate function

Or you could create a custom aggregate function like discussed in these related answers:

  • Selecting data into a Postgres array
  • Is there something like a zip() function in PostgreSQL that combines two arrays?

CREATE AGGREGATE array_agg_mult(anyarray) (
  SFUNC = array_cat,
  STYPE = anyarray,
  INITCOND = '{}'
);

Then you can:

SELECT name, array_agg_mult(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;

Or, typically faster, while not standard SQL:

SELECT name, array_agg_mult(ARRAY[col1, col2]) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
GROUP BY 1;

The added ORDER BY id (which can be appended to such aggregate functions) guarantees your desired result:

a | {1,2,3,4}
b | {5,6,7,8}

Or you might be interested in this alternative:

SELECT name, array_agg_mult(ARRAY[ARRAY[col1, col2]] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;

This produces 2-dimensional arrays:

a | {{1,2},{3,4}}
b | {{5,6},{7,8}}

The last one can be replaced (and should be, as it's faster!) with the built-in array_agg() in Postgres 9.5 or later, which gained the capability of aggregating arrays:

SELECT name, array_agg(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;

Same result. The manual:

input arrays concatenated into array of one higher dimension (inputs
must all have same dimensionality, and cannot be empty or null)

So it is not exactly the same as our custom aggregate function array_agg_mult().
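The distinction can be sketched in plain JavaScript: the custom aggregate concatenates (flattens), while the built-in array_agg over arrays nests one dimension higher (sample rows are made up):

```javascript
// Sketch: flattening concatenation vs. nesting aggregation of row pairs.
const rows = [
  { name: 'a', pair: [1, 2] },
  { name: 'a', pair: [3, 4] },
];

// Like array_agg_mult(ARRAY[col1, col2]): array_cat flattens each pair in.
const flattened = rows.reduce((acc, r) => acc.concat(r.pair), []);

// Like array_agg(ARRAY[col1, col2]): each pair becomes one element (2-D).
const nested = rows.map(r => r.pair);

console.log(flattened); // [1, 2, 3, 4]
console.log(nested);    // [[1, 2], [3, 4]]
```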

Spark merge/combine arrays in groupBy/aggregate

Using explode could be inefficient, but fundamentally the operation you are trying to implement is simply expensive: effectively it is just another groupByKey, and there is not much you can do here to make it better. Since you use Spark > 2.0 you can collect_list directly and flatten:

import org.apache.spark.sql.functions.{collect_list, udf}

val flatten_distinct = udf(
  (xs: Seq[Seq[String]]) => xs.flatten.distinct)

df
  .groupBy("category")
  .agg(
    flatten_distinct(collect_list("array_value_1")),
    flatten_distinct(collect_list("array_value_2"))
  )

In Spark >= 2.4 you can replace udf with composition of built-in functions:

import org.apache.spark.sql.functions.{array_distinct, flatten}

val flatten_distinct = (array_distinct _) compose (flatten _)

It is also possible to use a custom Aggregator, but I doubt any of these will make a huge difference.

If the sets are relatively large and you expect a significant number of duplicates, you could try aggregateByKey with mutable sets:

import scala.collection.mutable.{Set => MSet}

val rdd = df
  .select($"category", struct($"array_value_1", $"array_value_2"))
  .as[(Int, (Seq[String], Seq[String]))]
  .rdd

val agg = rdd
  .aggregateByKey((MSet[String](), MSet[String]()))(
    { case ((accX, accY), (xs, ys)) => (accX ++= xs, accY ++= ys) },
    { case ((accX1, accY1), (accX2, accY2)) => (accX1 ++= accX2, accY1 ++= accY2) }
  )
  .mapValues { case (xs, ys) => (xs.toArray, ys.toArray) }
  .toDF

