Differencebetween Cube, Rollup and Groupby Operators

What is the difference between cube, rollup and groupBy operators?

These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words

table.groupBy($"foo", $"bar")

is equivalent to:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar

cube is equivalent to CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Lets say you have data like this:

val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show

// +---+---+
// | x| y|
// +---+---+
// |foo| 1|
// |foo| 2|
// |bar| 2|
// |bar| 2|
// +---+---+

and you compute cube(x, y) with count as an aggregation:

df.cube($"x", $"y").count.show

// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+

A similar function to cube is rollup which computes hierarchical subtotals from left to right:

df.rollup($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// | foo|null| 2| <- count where x is fixed to foo
// | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2
// | foo| 1| 1| ...
// | foo| 2| 1| ...
// |null|null| 4| <- count where no column is fixed
// | bar|null| 2| <- count where x is fixed to bar
// +----+----+-----+

Just for comparison lets see the result of plain groupBy:

df.groupBy($"x", $"y").count.show

// +---+---+-----+
// | x| y|count|
// +---+---+-----+
// |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

To summarize:

  • When using plain GROUP BY every row is included only once in its corresponding summary.
  • With GROUP BY CUBE(..) every row is included in summary of each combination of levels it represents, wildcards included. Logically, the shown above is equivalent to something like this (assuming we could use NULL placeholders):

    SELECT NULL, NULL, COUNT(*) FROM table
    UNION ALL
    SELECT x, NULL, COUNT(*) FROM table GROUP BY x
    UNION ALL
    SELECT NULL, y, COUNT(*) FROM table GROUP BY y
    UNION ALL
    SELECT x, y, COUNT(*) FROM table GROUP BY x, y
  • With GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically by filling colums from left to right.

    SELECT NULL, NULL, COUNT(*) FROM table
    UNION ALL
    SELECT x, NULL, COUNT(*) FROM table GROUP BY x
    UNION ALL
    SELECT x, y, COUNT(*) FROM table GROUP BY x, y

ROLLUP and CUBE come from data warehousing extensions so if you want to get a better understanding how this works you can also check documentation of your favorite RDMBS. For example PostgreSQL introduced both in 9.5 and these are relatively well documented.

When to use GROUPING SETS, CUBE and ROLLUP

Firstly, for those who haven't already read up on the subject:

  • Using GROUP BY with ROLLUP, CUBE, and GROUPING SETS

That being said, don't think about these grouping options as ways to get a result set. These are performance tools.

Let's take ROLLUP as a simple example.

I can use the following query to get the count of records for each value of GrpCol.

SELECT   GrpCol, count(*) AS cnt
FROM dbo.MyTable
GROUP BY GrpCol

And I can use the following query to summarily "roll up" the count of ALL records.

SELECT   NULL, count(*) AS cnt
FROM dbo.MyTable

And I could UNION ALL the above two queries to get the exact same results I might get if I had written the first query with the ROLLUP clause (that's why I put the NULL in there).

It might actually be more convenient for me to execute this as two different queries because then I have the grouped results separate from my totals. Why would I want my final total mixed right in to the rest of those results? The answer is that doing both together using the ROLLUP clause is more efficient. SQL Server will use an execution plan that calculates all of the aggregations together in one pass. Compare that to the UNION ALL example which would provide the exact same results but use a less efficient execution plan (two table scans instead of one).

Imagine an extreme example in which you are working on a data set so large that each scan of the data takes one whole hour. You have to provide totals on basically every possible dimension (way to slice) that data every day. Aha! I bet one of these grouping options is exactly what you need. If you save off the results of that one scan into a special schema layout, you will then be able to run reports for the rest of the day off the saved results.

So I'm basically saying that you're working on a data warehouse project. For the rest of us it mostly falls into the "neat thing to know" category.

Comparison of the queries

Drill down and roll up functions are inverse and allow you to add and remove granularity in axes like a zoom.

  • a roll up - less granularity in the target table (for example years instead of months)
  • a drill down - more granularity in the target table (months instead of years)
  • neither of these two - there is other data in the tables

A three-dimensional cube is given, where the dimensions are:

  • Geography
  • Time
  • Product

Values are sales volumes (SUM (Sales.numberSold)).

Let's call this source cube Z.

In the end, A, B, and C show only two dimensions:

  • Geography
  • Time

The Product is always shrunk into one dimension.

Cube A:

The Product is shrunk into one dimension by the slice for Product.ProductFamily = "video"

Z

  • rollup on Product (from ArticleName to ProductFamily)
  • slice for Product.ProductFamily = "video"
  • rollup on Time (from Day to Year)
  • slice for Time.Year = 2000
  • rollup on Geography (from BranchName to Country)
  • slice for Geography.Country = "Germany"
  • drill down Geography from Country to Region
  • drill down Time from Year to Month

A

Cube B:

The Product is shrunk into one dimension by rollup on Product (from ArticleName to All).

The B is formed from the Z with the same granularity of rollups and drills down.

The result of query B is neither of these two in comparison to the result of query A.

Z

  • rollup on Product (from ArticleName to All)
  • rollup on Time (from Day to Year)
  • slice for Time.Year = 2000
  • rollup on Geography (from BranchName to Country)
  • slice for Geography.Country = "Germany"
  • drill down Geography from Country to Region
  • drill down Time from Year to Month

B

Cube C:

The Product is shrunk into one dimension the same way as in Cube A.

So the only difference is in granularity.

The C is more detailed.

The result of query C is a drill down in comparison to the result of query A.

A

  • drill down Geography (from Region to City)

C

Sources:

Data Warehousing - OLAP on tutorialspoint

Online analytical processing on wikipedia

OLAP Operations in the Multidimensional Data Model on javatpoint

Z

                    .   .   .   .   .
. . . . .
+---+---+---+---+ .
audio 123 / 2 / 9 / / /|.
+---+---+---+---+ + .
video 321 / 3 / 6 / / /|/|.
+---+---+---+---+ + + .
video 123 / 5 / 2 / / /|/|/|.
+---+---+---+---+ + + + .
LA | 5 | 2 | | |/|/|/|.
+---+---+---+---+ + + +
NY | 3 | 8 | | |/|/|/:
+---+---+---+---+ + + :
| | | | |/|/:
+---+---+---+---+ + :
| | | | |/:
+---+---+---+---+ :
: : : : :
: : : : :
01-01-22
02-01-22

Edit 1:

The disadvantage of the sources is that they say what a given function is and not what it isn't. It is important to thoroughly understand what they do to determine what they can't.

Even if you only have to decide between the Roll-Up and the Drill Down it is necessary to understand the Slice in your example. The Slice is pretty weak, so it's a good idea to find out that it's a case of the Dice.

The Roll-Up and the Drill Down aggregate all values - the Group by clause.

The Slice (Dice) filters - the Where clause.

Things achieved by the Slice can't be achieved by the Roll-Up.

The Roll-Up and the Drill Down in dimension Product can scale on All (remove the dimension), Category, Family, Group, and Name.

In select A, we get rid of the Product dimension using the Slice and in query B using the Roll-Up to All.
The Slice in query A leaves only Sales.numberSold for "video". This can't be achieved with Roll up.

How does Impala Implements GroupBy Extension(CUBE, ROLLUP and GROUPING SETS) In a distributed way?

Impala introduced the group by modifiers in 7.2.2:

Added support for GROUP BY ROLLUP, CUBE and GROUPING SETS. The GROUP BY ROLLUP clause creates a group for each combination of column expressions. The CUBE clause creates groups for all possible combinations of columns. The GROUPING SETS just lets you list out the combinations of expressions that you want to GROUP BY.

This is explained in the documentation for GROUP BY starting with that version.

How does Postgres implement the CUBE-, ROLLUP- and GROUPING SETS operators

The implementation is based on processing sorted data. You can see result of EXPLAIN statement:

 postgres=# EXPLAIN SELECT a, b, sum(c) FROM foo GROUP BY ROLLUP(a,b);
┌────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════╡
│ GroupAggregate (cost=142.54..166.99 rows=405 width=12) │
│ Group Key: a, b │
│ Group Key: a │
│ Group Key: () │
│ -> Sort (cost=142.54..147.64 rows=2040 width=12) │
│ Sort Key: a, b │
│ -> Seq Scan on foo (cost=0.00..30.40 rows=2040 width=12) │
└────────────────────────────────────────────────────────────────────┘
(7 rows)

postgres=# EXPLAIN SELECT a, b, sum(c) FROM foo GROUP BY CUBE(a,b);
┌────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════╡
│ GroupAggregate (cost=142.54..302.48 rows=605 width=12) │
│ Group Key: a, b │
│ Group Key: a │
│ Group Key: () │
│ Sort Key: b │
│ Group Key: b │
│ -> Sort (cost=142.54..147.64 rows=2040 width=12) │
│ Sort Key: a, b │
│ -> Seq Scan on foo (cost=0.00..30.40 rows=2040 width=12) │
└────────────────────────────────────────────────────────────────────┘
(9 rows)

Data are sorted, and then continually aggregated.

Should I use GROUPING SETS, CUBE, or ROLLUP in Postgres

It looks like you want to ROLLUP yourdata using a GROUPING SET:

select case grouping(studentnr)
when 0 then studentnr
else count(distinct studentnr)|| ' students'
end studentnr
, count(distinct case careday when 'monday' then studentnr end) monday
, count(distinct case careday when 'tuesday' then studentnr end) teusday
, count(distinct case careday when 'wednesday' then studentnr end) wednesday
, count(distinct case careday when 'thursday' then studentnr end) thursday
, count(distinct case careday when 'friday' then studentnr end) friday
, durationid
from yourdata
group by rollup ((studentnr, durationid))

Which yields the desired results:

|  studentnr | monday | teusday | wednesday | thursday | friday | durationid |
|------------|--------|---------|-----------|----------|--------|------------|
| 10177 | 1 | 1 | 1 | 1 | 1 | 1507 |
| 717208 | 1 | 1 | 1 | 1 | 1 | 1507 |
| 722301 | 1 | 1 | 1 | 1 | 0 | 1507 |
| 3 students | 3 | 3 | 3 | 3 | 2 | (null) |

The second set of parenthesis in the ROLLUP indicates that studentnr and durationid should be summarized at the same level when doing the roll up.

With just one level of summarization, there's not much difference between ROLLUP and CUBE, however to use GROUPING SETS would require a slight change to the GROUP BY clause in order to get the lowest desired level of detail. All three of the following GROUP BY statements produce equivalent results:

 group by rollup ((studentnr, durationid))
group by cube ((studentnr, durationid))
group by grouping sets ((),(studentnr, durationid))


Related Topics



Leave a reply



Submit