Concatenate/merge array values during grouping/aggregation
Custom aggregate
Approach 1: define a custom aggregate. Here's one I wrote earlier.
CREATE TABLE my_test(title text, tags text[]);
INSERT INTO my_test(title, tags) VALUES
('ridealong', '{comedy,other}'),
('ridealong', '{comedy,tragedy}'),
('freddyjason', '{horror,silliness}');
CREATE AGGREGATE array_cat_agg(anyarray) (
SFUNC=array_cat,
STYPE=anyarray
);
select title, array_cat_agg(tags) from my_test group by title;
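To make the behaviour concrete, here is a plain-Python sketch (not SQL) of what array_cat_agg computes over the sample rows inserted above. Note that the real SQL aggregate gives no ordering guarantee within a group unless you add an ORDER BY; this sketch simply processes rows in input order:

```python
# Sample rows mirroring the my_test table above: (title, tags).
rows = [
    ("ridealong", ["comedy", "other"]),
    ("ridealong", ["comedy", "tragedy"]),
    ("freddyjason", ["horror", "silliness"]),
]

def array_cat_agg(rows):
    """Emulate GROUP BY title with array_cat as the aggregate step:
    concatenate the tag arrays of all rows sharing a title."""
    out = {}
    for title, tags in rows:
        out.setdefault(title, []).extend(tags)
    return out

print(array_cat_agg(rows))
# {'ridealong': ['comedy', 'other', 'comedy', 'tragedy'],
#  'freddyjason': ['horror', 'silliness']}
```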
LATERAL query
... or, since you don't want to preserve order and want to deduplicate, you could use a LATERAL query like:
SELECT title, array_agg(DISTINCT tag ORDER BY tag)
FROM my_test, unnest(tags) tag
GROUP BY title;
in which case you don't need the custom aggregate. This one is probably a fair bit slower for big data sets due to the deduplication; removing the ORDER BY, if it's not required, may help.
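The unnest + DISTINCT + ordered array_agg combination can be sketched in plain Python over the same sample rows, to show exactly what it produces:

```python
rows = [
    ("ridealong", ["comedy", "other"]),
    ("ridealong", ["comedy", "tragedy"]),
    ("freddyjason", ["horror", "silliness"]),
]

def dedup_agg(rows):
    """Emulate: unnest(tags), then array_agg(DISTINCT tag ORDER BY tag)."""
    out = {}
    for title, tags in rows:
        out.setdefault(title, set()).update(tags)  # unnest + DISTINCT
    return {title: sorted(tags) for title, tags in out.items()}  # ORDER BY tag

print(dedup_agg(rows))
# {'ridealong': ['comedy', 'other', 'tragedy'],
#  'freddyjason': ['horror', 'silliness']}
```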
Combine arrays in MongoDB $group aggregation
You can use $reduce
db.collection.aggregate([
{
$group: {
_id: "$date",
hourlyConsumption: { $push: "$hourlyConsumption" }
}
},
{
$set: {
hourlyConsumption: {
$reduce: {
input: "$hourlyConsumption",
initialValue: [],
in: { $map: { input: { $range: [ 0, 24 ] }, // $range's end is exclusive: indexes 0-23
as: "h",
in: {
$sum: [
{ $arrayElemAt: [ "$$value", "$$h" ] },
{ $arrayElemAt: [ "$$this", "$$h" ] }
]
}
}
}
}
}
}
}
])
Or you can use $unwind and $group:
db.collection.aggregate([
{
$unwind: {
path: "$hourlyConsumption",
includeArrayIndex: "hour"
}
},
{
$group: {
_id: {
date: "$date",
hour: "$hour"
},
hourlyConsumption: { $sum: "$hourlyConsumption" }
}
},
{ $sort: { "_id.hour": 1 } },
{
$group: {
_id: "$_id.date",
hourlyConsumption: { $push: "$hourlyConsumption" }
}
}
])
However, using $unwind effectively works against the bucketing design pattern you chose in the first place.
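The element-wise summation both pipelines perform can be sketched in plain Python. The sample documents here are made up (and use 4-hour arrays for brevity rather than the usual 24):

```python
# Hypothetical grouped input: two docs for the same date, each holding a
# per-hour consumption array (4 slots here instead of 24 for brevity).
docs = [
    {"date": "2022-01-01", "hourlyConsumption": [1, 2, 3, 4]},
    {"date": "2022-01-01", "hourlyConsumption": [10, 20, 30, 40]},
]

def sum_hourly(docs, hours=4):
    """Emulate the $group + $reduce/$map stage: sum arrays element-wise per date."""
    out = {}
    for d in docs:
        acc = out.setdefault(d["date"], [0] * hours)
        for h in range(hours):                   # the $map over $range
            acc[h] += d["hourlyConsumption"][h]  # $sum of the two $arrayElemAt values
    return out

print(sum_hourly(docs))   # {'2022-01-01': [11, 22, 33, 44]}
```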
How to group and merge array entries and to sum-up values on multiple common (but not all) keys?
Just test both properties in the find() callback, and add both properties to the new object when pushing into acc.
const arr = [
{'ID':'1','Parent':'1','Member': '1','Code': '123','Subject': 'Org A','value': 0.3},
{'ID':'2','Parent':'1','Member': '1','Code': '124','Subject': 'Org A','value': 0.25},
{'ID':'3','Parent':'1','Member': '1','Code': '123','Subject': 'Org B','value': 0.45},
{'ID':'4','Parent':'1','Member': '2','Code': '125','Subject': 'Org A','value': 0.8},
{'ID':'5','Parent':'1','Member': '2','Code': '211','Subject': 'Org C','value': 0.3},
{'ID':'6','Parent':'1','Member': '3','Code': '221','Subject': 'Org B','value': 0.3},
{'ID':'7','Parent':'1','Member': '3','Code': '221','Subject': 'Org C','value': 0.25},
{'ID':'8','Parent':'1','Member': '3','Code': '234','Subject': 'Org A','value': 0.45},
{'ID':'9','Parent':'1','Member': '4','Code': '123','Subject': 'Org A','value': 0.8},
{'ID':'10','Parent':'2','Member': '5','Code': '123','Subject': 'Org D','value': 0.3},
{'ID':'11','Parent':'2','Member': '5','Code': '123','Subject': 'Org E','value': 0.3},
{'ID':'12','Parent':'2','Member': '6','Code': '125','Subject': 'Org E','value': 0.25},
{'ID':'13','Parent':'2','Member': '6','Code': '211','Subject': 'Org F','value': 0.45},
{'ID':'14','Parent':'2','Member': '6','Code': '221','Subject': 'Org F','value': 0.8},
{'ID':'15','Parent':'2','Member': '6','Code': '123','Subject': 'Org G','value': 0.3},
{'ID':'16','Parent':'3','Member': '7','Code': '124','Subject': 'Org H','value': 0.3},
{'ID':'17','Parent':'3','Member': '8','Code': '124','Subject': 'Org H','value': 0.25},
{'ID':'18','Parent':'3','Member': '9','Code': '123','Subject': 'Org I','value': 0.45},
{'ID':'19','Parent':'3','Member': '10','Code': '123','Subject': 'Org J','value': 0.8},
{'ID':'20','Parent':'3','Member': '10','Code': '211','Subject': 'Org I','value': 0.3},
{'ID':'21','Parent':'4','Member': '11','Code': '221','Subject': 'Org K','value': 0.3},
{'ID':'22','Parent':'4','Member': '11','Code': '234','Subject': 'Org K','value': 0.25},
{'ID':'23','Parent':'4','Member': '12','Code': '234','Subject': 'Org K','value': 0.45},
{'ID':'24','Parent':'4','Member': '12','Code': '123','Subject': 'Org L','value': 0.8},
{'ID':'25','Parent':'4','Member': '13','Code': '211','Subject': 'Org M','value': 0.3}
];
const summed = arr.reduce((acc, cur) => {
  const item = acc.find(({ Code, Parent }) =>
    Code === cur.Code && Parent === cur.Parent);
  if (item) {
    item.value += cur.value;
  } else {
    acc.push({ Code: cur.Code, Parent: cur.Parent, value: cur.value });
  }
  return acc;
}, []);
console.log(arr); // not modified
console.log(summed)
Merging an array while grouping from MongoDB collection
You can use $concatArrays together with $reduce. Instead of $project you can use $addFields (available from MongoDB v3.4) or $set (available from v4.2) to keep the other fields:
db.variants.aggregate([
{
"$group": {
"_id": "$productId",
"price": {
"$min": "$price"
},
"stock": {
"$sum": "$stock"
},
"unit": {
"$first": "$unit"
},
"images": {
"$push": "$images"
},
"variants": {
"$push": "$$ROOT"
}
}
},
{
"$addFields": { // or $set
"images": {
"$reduce": {
"input": "$images",
"initialValue": [],
"in": {
"$concatArrays": ["$$value", "$$this"]
}
}
}
}
}
])
How to Group by and concatenate arrays in PostgreSQL
To preserve the same dimension of your array you can't directly use array_agg(), so first unnest your arrays and apply distinct to remove duplicates (1). Then aggregate in the outer query. To preserve the ordering of values, include order by within the aggregate function:
select time, array_agg(col order by col) as col
from (
select distinct time, unnest(col) as col
from yourtable
) t
group by time
order by time
(1) If you don't need duplicate removal, just drop the distinct keyword.
Combine arrays when grouping documents
Depending on your available version and practicality, you could simply apply $reduce with $concatArrays in order to "join" the resulting "array of arrays" in the grouped document:
db.getCollection('stuff').aggregate([
{ "$group": {
"_id": {
"product": "$product", "state": "$state"
},
"nondnd": { "$push": "$nondnd" },
"dnd": { "$push": "$dnd" },
"land": { "$push": "$land" },
"emails": { "$push": "$emails" }
}},
{ "$addFields": {
"nondnd": {
"$reduce": {
"input": "$nondnd",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
},
"dnd": {
"$reduce": {
"input": "$dnd",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
},
"land": {
"$reduce": {
"input": "$land",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
},
"emails": {
"$reduce": {
"input": "$emails",
"initialValue": [],
"in": { "$concatArrays": [ "$$value", "$$this" ] }
}
}
}}
])
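The $push-then-$reduce/$concatArrays pattern above can be sketched in plain Python. The sample documents are made up for illustration, with a reduced set of array fields:

```python
# Hypothetical documents sharing the same (product, state) group key.
docs = [
    {"product": "p1", "state": "s1", "nondnd": [1, 2], "dnd": [4], "emails": ["a"]},
    {"product": "p1", "state": "s1", "nondnd": [3], "dnd": [5], "emails": ["b"]},
]

ARRAY_FIELDS = ["nondnd", "dnd", "emails"]

def group_concat(docs):
    """Emulate: $group with $push per array field, then flatten each
    resulting array-of-arrays ($reduce with $concatArrays)."""
    out = {}
    for d in docs:
        key = (d["product"], d["state"])
        acc = out.setdefault(key, {f: [] for f in ARRAY_FIELDS})
        for f in ARRAY_FIELDS:
            acc[f].extend(d.get(f, []))
    return out

print(group_concat(docs))
# {('p1', 's1'): {'nondnd': [1, 2, 3], 'dnd': [4, 5], 'emails': ['a', 'b']}}
```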
Or even "ultra-modern", if you really don't like repeating yourself (but you probably should be generating the pipeline stages programmatically anyway):
db.getCollection('stuff').aggregate([
{ "$project": {
"product": 1,
"state": 1,
"data": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"cond": { "$in": [ "$$this.k", ["nondnd","dnd","land","emails"] ] }
}
}
}},
{ "$unwind": "$data" },
{ "$unwind": "$data.v" },
{ "$group": {
"_id": {
"product": "$product",
"state": "$state",
"k": "$data.k"
},
"v": { "$push": "$data.v" }
}},
{ "$group": {
"_id": {
"product": "$_id.product",
"state": "$_id.state"
},
"data": { "$push": { "k": "$_id.k", "v": "$v" } }
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": {
"$concatArrays": [
[{ "k": "_id", "v": "$_id" }],
{ "$map": {
"input": ["nondnd","dnd","land","emails"],
"in": {
"$cond": {
"if": { "$ne": [{ "$indexOfArray": [ "$data.k", "$$this" ] },-1] },
"then": {
"$arrayElemAt": [
"$data",
{ "$indexOfArray": [ "$data.k", "$$this" ] }
]
},
"else": { "k": "$$this", "v": [] }
}
}
}}
]
}
}
}}
])
Or you can alternately join the arrays at the source and map them to a type. Then reconstruct after the grouping:
db.getCollection('stuff').aggregate([
{ "$project": {
"product": 1,
"state": 1,
"combined": {
"$concatArrays": [
{ "$map": {
"input": "$nondnd",
"in": { "t": "nondnd", "v": "$$this" }
}},
{ "$map": {
"input": "$dnd",
"in": { "t": "dnd", "v": "$$this" }
}},
{ "$map": {
"input": "$land",
"in": { "t": "land", "v": "$$this" }
}},
{ "$map": {
"input": "$emails",
"in": { "t": "emails", "v": "$$this" }
}}
]
}
}},
{ "$unwind": "$combined" },
{ "$group": {
"_id": {
"product": "$product", "state": "$state"
},
"combined": { "$push": "$combined" }
}},
{ "$project": {
"nondnd": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "nondnd" ] }
}
},
"in": "$$this.v"
}
},
"dnd": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "dnd" ] }
}
},
"in": "$$this.v"
}
},
"land": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "land" ] }
}
},
"in": "$$this.v"
}
},
"emails": {
"$map": {
"input": {
"$filter": {
"input": "$combined",
"cond": { "$eq": [ "$$this.t", "emails" ] }
}
},
"in": "$$this.v"
}
}
}}
])
So this largely depends on $map and $filter, both for constructing and for deconstructing the contents of a single joined array, which is of course perfectly fine to $unwind.
The same result comes from each case:
/* 1 */
{
"_id" : {
"product" : "product1",
"state" : "state2"
},
"nondnd" : [
9.0,
8.0,
2.0
],
"dnd" : [
10.0,
7.0,
11.0
],
"land" : [
1.0,
3.0
],
"emails" : [
"e",
"g"
]
}
/* 2 */
{
"_id" : {
"product" : "product1",
"state" : "state1"
},
"nondnd" : [
1.0,
2.0,
3.0,
9.0,
8.0,
2.0
],
"dnd" : [
4.0,
5.0,
10.0,
7.0,
11.0
],
"land" : [
2.0,
4.0,
6.0,
8.0
],
"emails" : [
"a",
"b",
"c",
"d"
]
}
Joining arrays within group by clause
UNION ALL
You could "unpivot" with UNION ALL
first:
SELECT name, array_agg(c) AS c_arr
FROM (
SELECT name, id, 1 AS rnk, col1 AS c FROM tbl
UNION ALL
SELECT name, id, 2, col2 FROM tbl
ORDER BY name, id, rnk
) sub
GROUP BY 1;
Adapted to produce the order of values you later requested. The manual:
The aggregate functions array_agg, json_agg, string_agg, and xmlagg, as well as similar user-defined aggregate functions, produce meaningfully different result values depending on the order of the input values. This ordering is unspecified by default, but can be controlled by writing an ORDER BY clause within the aggregate call, as shown in Section 4.2.7. Alternatively, supplying the input values from a sorted subquery will usually work.
Bold emphasis mine.
LATERAL subquery with VALUES expression
LATERAL requires Postgres 9.3 or later.
SELECT t.name, array_agg(c) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
CROSS JOIN LATERAL (VALUES (t.col1), (t.col2)) v(c)
GROUP BY 1;
Same result. Only needs a single pass over the table.
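Both the UNION ALL and the LATERAL VALUES variants "unpivot" two columns into two rows per source row before aggregating. A plain-Python sketch of that logic, using sample rows chosen to reproduce the expected result shown further below:

```python
# Hypothetical rows of tbl(name, id, col1, col2).
rows = [
    ("a", 1, 1, 2),
    ("a", 2, 3, 4),
    ("b", 3, 5, 6),
    ("b", 4, 7, 8),
]

def unpivot_agg(rows):
    """Emulate: unpivot each row into (col1), (col2), ordered by (name, id),
    then array_agg per name."""
    out = {}
    for name, id_, col1, col2 in sorted(rows, key=lambda r: (r[0], r[1])):
        out.setdefault(name, []).extend([col1, col2])  # VALUES (t.col1), (t.col2)
    return out

print(unpivot_agg(rows))   # {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]}
```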
Custom aggregate function
Or you could create a custom aggregate function like discussed in these related answers:
- Selecting data into a Postgres array
- Is there something like a zip() function in PostgreSQL that combines two arrays?
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
, STYPE = anyarray
, INITCOND = '{}'
);
Then you can:
SELECT name, array_agg_mult(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Or, typically faster, while not standard SQL:
SELECT name, array_agg_mult(ARRAY[col1, col2]) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
GROUP BY 1;
The added ORDER BY id
(which can be appended to such aggregate functions) guarantees your desired result:
a | {1,2,3,4}
b | {5,6,7,8}
Or you might be interested in this alternative:
SELECT name, array_agg_mult(ARRAY[ARRAY[col1, col2]] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Which produces 2-dimensional arrays:
a | {{1,2},{3,4}}
b | {{5,6},{7,8}}
The last one can be replaced (and should be, as it's faster!) with the built-in array_agg()
in Postgres 9.5 or later - with its added capability of aggregating arrays:
SELECT name, array_agg(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Same result. The manual:
input arrays concatenated into array of one higher dimension (inputs
must all have same dimensionality, and cannot be empty or null)
So it is not exactly the same as our custom aggregate function array_agg_mult().
Spark merge/combine arrays in groupBy/aggregate
It could be inefficient to explode, but fundamentally the operation you are trying to implement is simply expensive. Effectively it is just another groupByKey, and there is not much you can do here to make it better. Since you use Spark > 2.0, you can collect_list directly and flatten:
import org.apache.spark.sql.functions.{collect_list, udf}
val flatten_distinct = udf(
(xs: Seq[Seq[String]]) => xs.flatten.distinct)
df
.groupBy("category")
.agg(
flatten_distinct(collect_list("array_value_1")),
flatten_distinct(collect_list("array_value_2"))
)
In Spark >= 2.4 you can replace udf with composition of built-in functions:
import org.apache.spark.sql.functions.{array_distinct, flatten}
val flatten_distinct = (array_distinct _) compose (flatten _)
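For reference, the flatten + array_distinct composition boils down to this, sketched in plain Python (array_distinct keeps the first occurrence of each element):

```python
def flatten_distinct(xss):
    """Concatenate the inner lists, then drop duplicates while keeping
    first-occurrence order -- the analogue of array_distinct(flatten(col))."""
    seen, out = set(), []
    for xs in xss:          # flatten
        for x in xs:
            if x not in seen:  # distinct
                seen.add(x)
                out.append(x)
    return out

print(flatten_distinct([["a", "b"], ["b", "c"]]))   # ['a', 'b', 'c']
```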
It is also possible to use custom Aggregator
but I doubt any of these will make a huge difference.
If the sets are relatively large and you expect a significant number of duplicates, you could try using aggregateByKey with mutable sets:
import scala.collection.mutable.{Set => MSet}
val rdd = df
.select($"category", struct($"array_value_1", $"array_value_2"))
.as[(Int, (Seq[String], Seq[String]))]
.rdd
val agg = rdd
.aggregateByKey((MSet[String](), MSet[String]()))(
{case ((accX, accY), (xs, ys)) => (accX ++= xs, accY ++= ys)},
{case ((accX1, accY1), (accX2, accY2)) => (accX1 ++= accX2, accY1 ++= accY2)}
)
.mapValues { case (xs, ys) => (xs.toArray, ys.toArray) }
.toDF
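The seqOp/combOp structure of that aggregateByKey call can be sketched in plain Python with made-up sample data (a single sequential fold stands in for the per-partition seqOp plus the merging combOp):

```python
# Hypothetical keyed records: (category, (array_value_1, array_value_2)).
data = [
    (1, (["a", "b"], ["x"])),
    (1, (["b", "c"], ["y"])),
    (2, (["d"], ["z"])),
]

def aggregate_by_key(data):
    """Fold each record's two lists into a pair of per-key sets
    (deduplicating as we go), then materialize sorted lists."""
    acc = {}
    for k, (xs, ys) in data:
        sx, sy = acc.setdefault(k, (set(), set()))
        sx.update(xs)
        sy.update(ys)
    return {k: (sorted(sx), sorted(sy)) for k, (sx, sy) in acc.items()}

print(aggregate_by_key(data))
# {1: (['a', 'b', 'c'], ['x', 'y']), 2: (['d'], ['z'])}
```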