When to use GROUPING SETS, CUBE and ROLLUP
Firstly, for those who haven't already read up on the subject:
- Using GROUP BY with ROLLUP, CUBE, and GROUPING SETS
That being said, don't think about these grouping options as ways to get a result set. These are performance tools.
Let's take ROLLUP as a simple example.
I can use the following query to get the count of records for each value of GrpCol.
SELECT GrpCol, count(*) AS cnt
FROM dbo.MyTable
GROUP BY GrpCol
And I can use the following query to summarily "roll up" the count of ALL records.
SELECT NULL, count(*) AS cnt
FROM dbo.MyTable
And I could UNION ALL the above two queries to get the exact same results I might get if I had written the first query with the ROLLUP clause (that's why I put the NULL in there).
It might actually be more convenient for me to execute this as two different queries because then I have the grouped results separate from my totals. Why would I want my final total mixed right in with the rest of those results? The answer is that doing both together using the ROLLUP clause is more efficient. SQL Server will use an execution plan that calculates all of the aggregations together in one pass. Compare that to the UNION ALL example, which would provide the exact same results but use a less efficient execution plan (two table scans instead of one).
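To make the one-pass versus two-pass idea concrete, here is a small Python sketch. It only models the logic of the two approaches; it is not how SQL Server actually builds its plan, and the function names and sample rows are made up for illustration.

```python
from collections import Counter

def rollup_counts(rows):
    """One pass over the data: per-group counts and the grand total together,
    roughly what GROUP BY ROLLUP(GrpCol) asks for."""
    counts = Counter()
    total = 0
    for grp in rows:          # single scan
        counts[grp] += 1
        total += 1
    # None plays the role of the NULL placeholder in the total row
    return sorted(counts.items()) + [(None, total)]

def union_all_counts(rows):
    """Two passes over the data, like the UNION ALL of the two queries."""
    per_group = sorted(Counter(rows).items())        # first scan
    grand_total = [(None, sum(1 for _ in rows))]     # second scan
    return per_group + grand_total

rows = ["a", "b", "a", "c", "a"]   # hypothetical GrpCol values
print(rollup_counts(rows))
# [('a', 3), ('b', 1), ('c', 1), (None, 5)]
```

Both functions return the same rows; the difference is only how many times the input is read.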
Imagine an extreme example in which you are working on a data set so large that each scan of the data takes one whole hour. You have to provide totals on basically every possible dimension (way to slice) that data every day. Aha! I bet one of these grouping options is exactly what you need. If you save off the results of that one scan into a special schema layout, you will then be able to run reports for the rest of the day off the saved results.
So I'm basically saying that you're working on a data warehouse project. For the rest of us it mostly falls into the "neat thing to know" category.
How does Impala implement the GROUP BY extensions (CUBE, ROLLUP and GROUPING SETS) in a distributed way?
Impala introduced the group by modifiers in 7.2.2:
Added support for GROUP BY ROLLUP, CUBE and GROUPING SETS. The GROUP BY ROLLUP clause creates a group for each combination of column expressions. The CUBE clause creates groups for all possible combinations of columns. The GROUPING SETS just lets you list out the combinations of expressions that you want to GROUP BY.
This is explained in the documentation for GROUP BY starting with that version.
Should I use GROUPING SETS, CUBE, or ROLLUP in Postgres
It looks like you want to ROLLUP yourdata using a GROUPING SET:
select case grouping(studentnr)
when 0 then studentnr::text
else count(distinct studentnr) || ' students'
end studentnr
, count(distinct case careday when 'monday' then studentnr end) monday
, count(distinct case careday when 'tuesday' then studentnr end) tuesday
, count(distinct case careday when 'wednesday' then studentnr end) wednesday
, count(distinct case careday when 'thursday' then studentnr end) thursday
, count(distinct case careday when 'friday' then studentnr end) friday
, durationid
from yourdata
group by rollup ((studentnr, durationid))
Which yields the desired results:
| studentnr | monday | tuesday | wednesday | thursday | friday | durationid |
|------------|--------|---------|-----------|----------|--------|------------|
| 10177 | 1 | 1 | 1 | 1 | 1 | 1507 |
| 717208 | 1 | 1 | 1 | 1 | 1 | 1507 |
| 722301 | 1 | 1 | 1 | 1 | 0 | 1507 |
| 3 students | 3 | 3 | 3 | 3 | 2 | (null) |
The second set of parentheses in the ROLLUP indicates that studentnr and durationid should be summarized at the same level when doing the roll up.
With just one level of summarization, there's not much difference between ROLLUP and CUBE. Using GROUPING SETS, however, requires a slight change to the GROUP BY clause in order to get the lowest desired level of detail. All three of the following GROUP BY clauses produce equivalent results:
group by rollup ((studentnr, durationid))
group by cube ((studentnr, durationid))
group by grouping sets ((),(studentnr, durationid))
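One way to convince yourself of that equivalence is to expand each clause into its list of grouping sets. The sketch below does this in Python under the assumption that a parenthesized group such as (studentnr, durationid) is treated as a single element; the helper names are hypothetical and this is not how any database implements it.

```python
from itertools import combinations

def flatten(elements):
    """Merge a sequence of column tuples into one flat grouping set."""
    return tuple(col for element in elements for col in element)

def rollup_sets(elements):
    """ROLLUP: every prefix of the element list, longest first, ending with ()."""
    return [flatten(elements[:i]) for i in range(len(elements), -1, -1)]

def cube_sets(elements):
    """CUBE: every subset of the element list, largest first."""
    return [flatten(combo)
            for r in range(len(elements), -1, -1)
            for combo in combinations(elements, r)]

# (studentnr, durationid) as ONE composite element, as in rollup ((studentnr, durationid))
composite = [("studentnr", "durationid")]
# the same columns as two separate elements, as in rollup (studentnr, durationid)
separate = [("studentnr",), ("durationid",)]

print(rollup_sets(composite))  # [('studentnr', 'durationid'), ()]
print(cube_sets(composite))    # [('studentnr', 'durationid'), ()]
print(rollup_sets(separate))   # [('studentnr', 'durationid'), ('studentnr',), ()]
```

With the composite element, both expansions give exactly the two sets written out in the explicit grouping sets form. Without the extra parentheses, ROLLUP would add a per-studentnr subtotal level.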
Data warehouse rollup and grouping sets, which to use?
ROLLUP and CUBE are just shorthand for two common usages of GROUPING SETS.
GROUPING SETS gives more precise control of which aggregations you want to calculate.
What is the difference between cube, rollup and groupBy operators?
These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words
table.groupBy($"foo", $"bar")
is equivalent to:
SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Let's say you have data like this:
val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show
// +---+---+
// | x| y|
// +---+---+
// |foo| 1|
// |foo| 2|
// |bar| 2|
// |bar| 2|
// +---+---+
and you compute cube(x, y) with count as an aggregation:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
A similar function to cube is rollup, which computes hierarchical subtotals from left to right:
df.rollup($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// | foo|null| 2| <- count where x is fixed to foo
// | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2
// | foo| 1| 1| ...
// | foo| 2| 1| ...
// |null|null| 4| <- count where no column is fixed
// | bar|null| 2| <- count where x is fixed to bar
// +----+----+-----+
Just for comparison, let's see the result of plain groupBy:
df.groupBy($"x", $"y").count.show
// +---+---+-----+
// | x| y|count|
// +---+---+-----+
// |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+
To summarize:
- When using plain GROUP BY every row is included only once in its corresponding summary.
- With GROUP BY CUBE(...) every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the result shown above is equivalent to something like this (assuming we could use NULL placeholders):
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y, COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
- GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically, filling columns from left to right:
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
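The same logic can be sketched in plain Python on the four sample rows used above. This is only an illustration of the semantics, not of how Spark executes it; grouped_counts and the constant names are invented for the example, and it assumes the data itself contains no nulls (otherwise a real null and a "wildcard" None would collide).

```python
from collections import Counter

COLUMNS = ("x", "y")
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

# Grouping sets matching the UNION ALL branches listed above
CUBE = [("x", "y"), ("x",), ("y",), ()]
ROLLUP = [("x", "y"), ("x",), ()]

def grouped_counts(rows, grouping_sets):
    """Count each row once per grouping set; columns outside the set become None."""
    counts = Counter()
    for row in rows:
        record = dict(zip(COLUMNS, row))
        for gset in grouping_sets:
            key = tuple(record[c] if c in gset else None for c in COLUMNS)
            counts[key] += 1
    return counts

cube_counts = grouped_counts(rows, CUBE)
rollup_counts = grouped_counts(rows, ROLLUP)

print(len(cube_counts), len(rollup_counts))  # 8 6
print(cube_counts[(None, 2)])                # 3, the count of records where y = 2
print((None, 2) in rollup_counts)            # False, rollup has no y-only level
```

Each row contributes to four summaries under CUBE and three under ROLLUP, which is why the totals match the 8 and 6 output rows shown earlier.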
ROLLUP and CUBE come from data warehousing extensions, so if you want to get a better understanding of how this works you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5 and these are relatively well documented.
PostgreSQL: How to use GROUPING SETS, CUBE, and ROLLUP for summary totals
See the documentation:
SELECT
Zone,
State,
SUM(Sponsored),
SUM(Enrolled),
SUM(PickedUp)
FROM MasterData
GROUP BY rollup(Zone, State);
zone | state | sum | sum | sum
--------+---------------+-----+-----+-----
Zone 1 | Alaska | 0 | 0 | 0
Zone 1 | Arizona | 1 | 3 | 1
Zone 1 | California | 3 | 6 | 0
Zone 1 | Colorado | 0 | 4 | 2
Zone 1 | Guam | 0 | 0 | 0
Zone 1 | Hawaii | 0 | 1 | 0
Zone 1 | | 4 | 14 | 3
Zone 2 | Idaho | 1 | 0 | 0
Zone 2 | Montana | 0 | 1 | 1
Zone 2 | Nevada | 0 | 0 | 1
Zone 2 | New Mexico | 0 | 1 | 4
Zone 2 | North Dakota | 4 | 8 | 4
Zone 2 | Oregon | 0 | 0 | 1
Zone 2 | South Dakota | 0 | 1 | 0
Zone 2 | Utah | 0 | 1 | 0
Zone 2 | Washington | 0 | 1 | 1
Zone 2 | Wyoming | 0 | 1 | 1
Zone 2 | | 5 | 14 | 13
| | 9 | 28 | 16
(19 rows)
GROUP BY with multiple GROUPING SETS, CUBE, and ROLLUP clauses
I think the utility comes when you specify different arguments to the cube and rollup clauses. That's even evident in whatever source you're quoting. Each of cube and rollup is just a shortcut for a longer list of grouping sets. In your example, the cube defines the following grouping sets:
- shipcountry, shipcity
- shipcountry
- shipcity
- (null)
Whereas the rollup specifies these sets:
- shipcountry, shipcity
- shipcountry
- (null)
When you specify both in the same group by clause, you're getting each set from the first paired with each set from the second (which is what the multiplicative effect from your pull quote implies). So you get (using the nomenclature "(x) + (y)" to mean "item x from the first set and item y from the second"):
- (1) + (1) → (shipcountry, shipcity) + (shipcountry, shipcity) → (shipcountry, shipcity)
- (1) + (2) → (shipcountry, shipcity) + (shipcountry) → (shipcountry, shipcity)
- (1) + (3) → (shipcountry, shipcity) + ((null)) → (shipcountry, shipcity)
- (2) + (1) → (shipcountry) + (shipcountry, shipcity) → (shipcountry, shipcity)
- (2) + (2) → (shipcountry) + (shipcountry) → (shipcountry)
- (2) + (3) → (shipcountry) + ((null)) → (shipcountry)
- (3) + (1) → (shipcity) + (shipcountry, shipcity) → (shipcountry, shipcity)
- (3) + (2) → (shipcity) + (shipcountry) → (shipcountry, shipcity)
- (3) + (3) → (shipcity) + ((null)) → (shipcity)
- (4) + (1) → ((null)) + (shipcountry, shipcity) → (shipcountry, shipcity)
- (4) + (2) → ((null)) + (shipcountry) → (shipcountry)
- (4) + (3) → ((null)) + ((null)) → ((null))
As you can see, there are a lot of duplicates. For example, (shipcountry, shipcity) shows up seven times in 1, 2, 3, 4, 7, 8, and 10.
If instead you'd specified different arguments to the rollup and cube, you'd get a wholly distinct set of grouping sets.
Lastly, remember what I said above: both rollup and cube are shortcuts for commonly used patterns of grouping sets. If you only want certain grouping sets, specify only those with a grouping sets clause!
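If you want to check the pairing yourself, here is a short Python sketch that forms the cross product of the two lists above and tallies the resulting grouping sets. The lists are copied from the breakdown above; merge is a hypothetical helper for the example, not something the database exposes.

```python
from collections import Counter
from itertools import product

ORDER = ("shipcountry", "shipcity")

# Grouping sets from the cube and the rollup, in the order listed above
cube = [("shipcountry", "shipcity"), ("shipcountry",), ("shipcity",), ()]
rollup = [("shipcountry", "shipcity"), ("shipcountry",), ()]

def merge(a, b):
    """Union of two grouping sets, kept in a fixed column order."""
    cols = set(a) | set(b)
    return tuple(c for c in ORDER if c in cols)

# Every set from the cube paired with every set from the rollup
combined = [merge(a, b) for a, b in product(cube, rollup)]
tally = Counter(combined)

print(len(combined))                         # 12
print(tally[("shipcountry", "shipcity")])    # 7
print(len(tally))                            # 4 distinct grouping sets
```

Only four distinct grouping sets come out of the twelve pairs, which is the duplication described above.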