Conditional Aggregation Performance


Short summary

  • Performance of the subqueries method depends on the data distribution.
  • Performance of conditional aggregation does not depend on the data distribution.

The subqueries method can be faster or slower than conditional aggregation; it depends on the data distribution.

Naturally, if the table has a suitable index, then the subqueries are likely to benefit from it, because the index would allow the engine to scan only the relevant part of the table instead of performing a full scan. A suitable index is unlikely to significantly benefit the conditional aggregation method, because it will scan the full index anyway. The only benefit would be if the index is narrower than the table, so the engine would have to read fewer pages into memory.
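
For example, if the LogTable used in the tests below had an index on the filter column, each subquery could range-scan only the qualifying rows (a hypothetical index; the test table in this answer deliberately has none):

-- A narrow index on the filtered column; the subqueries could seek/range-scan it,
-- while conditional aggregation would still scan the whole (narrower) index.
CREATE INDEX IX_LogTable_datesent ON LogTable (datesent);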

Knowing this, you can decide which method to choose.


First test

I made a larger test table, with 5M rows. There were no indexes on the table.
I measured the IO and CPU stats using SQL Sentry Plan Explorer. I used SQL Server 2014 SP1-CU7 (12.0.4459.0) Express 64-bit for these tests.
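
The table definition isn't shown in the original answer; a minimal sketch that builds a comparable 5M-row heap (hypothetical name and synthetic data) could look like this, with SET STATISTICS IO, TIME ON as an alternative to Plan Explorer for capturing reads and CPU:

-- Hypothetical reconstruction of the test table; the real definition is not shown.
CREATE TABLE LogTable (datesent datetime NOT NULL);

-- 5M rows with dates spread over roughly the last 20 years.
INSERT INTO LogTable (datesent)
SELECT TOP (5000000)
       DATEADD(day, -ABS(CHECKSUM(NEWID()) % 7300), GETDATE())
FROM sys.all_columns a
CROSS JOIN sys.all_columns b;

-- Capture IO and CPU stats per query without external tools.
SET STATISTICS IO, TIME ON;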

Indeed, your original queries behaved as you described, i.e. the subqueries were faster even though their reads were 3 times higher.

After a few tries on the table without an index, I rewrote your conditional aggregate and added variables to hold the values of the DATEADD expressions.

The query became significantly faster overall.

Then I replaced SUM with COUNT and it became a little bit faster again.

With these changes, conditional aggregation became pretty much as fast as the subqueries.

Warm the cache (CPU=375)

SELECT -- warm cache
    COUNT(*) AS all_cnt
FROM LogTable
OPTION (RECOMPILE);

Subqueries (CPU=1031)

SELECT -- subqueries
    (
        SELECT COUNT(*) FROM LogTable
    ) AS all_cnt,
    (
        SELECT COUNT(*) FROM LogTable WHERE datesent > DATEADD(year, -1, GETDATE())
    ) AS last_year_cnt,
    (
        SELECT COUNT(*) FROM LogTable WHERE datesent > DATEADD(year, -10, GETDATE())
    ) AS last_ten_year_cnt
OPTION (RECOMPILE);

Original conditional aggregation (CPU=1641)

SELECT -- conditional original
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > DATEADD(year, -1, GETDATE())
             THEN 1 ELSE 0 END) AS last_year_cnt,
    SUM(CASE WHEN datesent > DATEADD(year, -10, GETDATE())
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables (CPU=1078)

DECLARE @VarYear1 datetime = DATEADD(year,-1,GETDATE());
DECLARE @VarYear10 datetime = DATEADD(year,-10,GETDATE());

SELECT -- conditional variables
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > @VarYear1
             THEN 1 ELSE 0 END) AS last_year_cnt,
    SUM(CASE WHEN datesent > @VarYear10
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables and COUNT instead of SUM (CPU=1062)

SELECT -- conditional variable, count, not sum
    COUNT(*) AS all_cnt,
    COUNT(CASE WHEN datesent > @VarYear1
               THEN 1 ELSE NULL END) AS last_year_cnt,
    COUNT(CASE WHEN datesent > @VarYear10
               THEN 1 ELSE NULL END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

[Image: timing results for the first test]

Based on these results, my guess is that CASE invoked DATEADD for each row, while WHERE was smart enough to calculate it once. Plus, COUNT is a tiny bit more efficient than SUM.

In the end, conditional aggregation is only slightly slower than subqueries (1062 vs 1031), maybe because WHERE is a bit more efficient than CASE in itself; besides, WHERE filters out quite a few rows, so COUNT has to process fewer rows.


In practice I would use conditional aggregation, because I think the number of reads is more important. If your table is small enough to fit and stay in the buffer pool, then any query will be fast for the end user. But if the table is larger than the available memory, then I expect that reading from disk would slow the subqueries down significantly.


Second test

On the other hand, filtering the rows out as early as possible is also important.

Here is a slight variation of the test that demonstrates it. I set the threshold to GETDATE() + 100 years, to make sure that no rows satisfy the filter criteria.

Warm the cache (CPU=344)

SELECT -- warm cache
    COUNT(*) AS all_cnt
FROM LogTable
OPTION (RECOMPILE);

Subqueries (CPU=500)

SELECT -- subqueries
    (
        SELECT COUNT(*) FROM LogTable
    ) AS all_cnt,
    (
        SELECT COUNT(*) FROM LogTable WHERE datesent > DATEADD(year, 100, GETDATE())
    ) AS last_year_cnt
OPTION (RECOMPILE);

Original conditional aggregation (CPU=937)

SELECT -- conditional original
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > DATEADD(year, 100, GETDATE())
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables (CPU=750)

DECLARE @VarYear100 datetime = DATEADD(year,100,GETDATE());

SELECT -- conditional variables
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > @VarYear100
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables and COUNT instead of SUM (CPU=750)

SELECT -- conditional variable, count, not sum
    COUNT(*) AS all_cnt,
    COUNT(CASE WHEN datesent > @VarYear100
               THEN 1 ELSE NULL END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

[Image: timing results for the second test]

Below is the plan with subqueries. You can see that 0 rows went into the Stream Aggregate in the second subquery; all of them were filtered out at the Table Scan step.

[Image: execution plan for the subqueries query]

As a result, subqueries are again faster.

Third test

Here I changed the filtering criteria of the previous test: all > were replaced with <. As a result, the conditional COUNT counted all rows instead of none. Surprise, surprise! The conditional aggregation query took the same 750 ms, while the subqueries went up to 813 ms instead of 500.
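
For reference, the modified subquery variant would look like this (reconstructed from the description above; the post doesn't show the changed code):

SELECT -- subqueries, third test
    (
        SELECT COUNT(*) FROM LogTable
    ) AS all_cnt,
    (
        -- '<' instead of '>': now every row satisfies the filter
        SELECT COUNT(*) FROM LogTable WHERE datesent < DATEADD(year, 100, GETDATE())
    ) AS last_year_cnt
OPTION (RECOMPILE);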

[Image: timing results for the third test]

Here is the plan for subqueries:

[Image: execution plan for the third-test subqueries]

Could you give me an example, where conditional aggregation notably
outperforms the subquery solution?

Here it is. Performance of the subqueries method depends on the data distribution; performance of conditional aggregation does not.

The subqueries method can be faster or slower than conditional aggregation; it depends on the data distribution.

Knowing this, you can decide which method to choose.


Bonus details

If you hover the mouse over the Table Scan operator, you can see the Actual Data Size for the different variants.

  1. Simple COUNT(*):

[Image: Actual Data Size for simple COUNT(*)]

  2. Conditional aggregation:

[Image: Actual Data Size for conditional aggregation]

  3. Subquery in test 2:

[Image: Actual Data Size for the subquery in test 2]

  4. Subquery in test 3:

[Image: Actual Data Size for the subquery in test 3]

Now it becomes clear that the difference in performance is likely caused by the difference in the amount of data that flows through the plan.

In the case of the simple COUNT(*) there is no Output List (no column values are needed), and the data size is the smallest (43MB).

In the case of conditional aggregation this amount doesn't change between tests 2 and 3; it is always 72MB. The Output List has one column, datesent.

In the case of subqueries, this amount does change depending on the data distribution.

Performance-wise, should calculations/joins/conditional logic/aggregate functions, etc., be done in the source qualifier or in transformations?

There is no right answer to your question. It depends on the power of your DBMS vs. the power of your Informatica environment, and on how efficiently your DBMS performs each type of transformation compared to your Informatica environment.

The only way is to try it and see - on your specific environment.

Conditional aggregation in a window function: gaps and island problem

Use the LAG() and SUM() window functions to create groups in which the value of opinion has not changed, then group by user_id, question to get the minimum date of each group inside a CTE.

Then do a self join of the CTE:

WITH cte AS (
    SELECT user_id, question, grp, opinion, MIN(last_modified) AS [from]
    FROM (
        SELECT *, SUM(CASE WHEN opinion <> prev_opinion THEN 1 ELSE 0 END)
                  OVER (PARTITION BY user_id, question ORDER BY last_modified) AS grp
        FROM (
            SELECT *, LAG(opinion, 1, ~opinion)
                      OVER (PARTITION BY user_id, question ORDER BY last_modified) AS prev_opinion
            FROM tablename
        ) t
    ) t
    GROUP BY user_id, question, opinion, grp
)
SELECT c1.user_id, c1.question, c1.opinion, c1.[from], c2.[from] AS until
FROM cte c1
LEFT JOIN cte c2
    ON c2.user_id = c1.user_id AND c2.question = c1.question AND c2.grp = c1.grp + 1;

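To try this locally, here is a minimal setup with hypothetical sample data (the original table definition is not shown; the ~opinion default requires an integer column):

-- Hypothetical schema matching the column names used in the query above.
CREATE TABLE tablename (
    user_id       int,
    question      int,
    opinion       int,
    last_modified datetime
);

INSERT INTO tablename (user_id, question, opinion, last_modified) VALUES
(1, 10, 1, '20230101'),  -- first row always starts an island
(1, 10, 1, '20230105'),  -- same opinion: same island
(1, 10, 0, '20230201'),  -- opinion changed: new island
(1, 10, 1, '20230301');  -- changed back: third island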

SQL sum of two conditional aggregation

You probably won't see a significant performance hit from what you're doing now, as you already have all the data available; you're just repeating the case evaluation.

But you can't refer to the column aliases of the first two columns within the same level of the query.

If you can't do a simple count as @Zeki suggested, because you aren't sure whether there might be values other than zero and one (though this looks rather like a binary true/false equivalent, so there may well be a check constraint limiting you to those values), or if you're just interested in the more general case, you can use an inline view as @jarhl suggested:

select Year_CW,
    "Total_sampled(Checked)",
    "Total_unsampled(Not_Checked)",
    "Accepted",
    "Accepted with comments",
    "Request for rework",
    "Rejected",
    "Total_sampled(Checked)" + "Total_unsampled(Not_Checked)" as "Total_DS"
from (
    select Year_CW,
        sum(case when col = 0 then 1 else 0 end) as "Total_sampled(Checked)",
        sum(case when col = 1 then 1 else 0 end) as "Total_unsampled(Not_Checked)",
        sum(case when col = 0 and col2 = 'accepted' then 1 else 0 end) as "Accepted",
        sum(case when col = 0 and col2 = 'accepted with comments' then 1 else 0 end)
            as "Accepted with comments",
        sum(case when col = 0 and col2 = 'request for rework' then 1 else 0 end)
            as "Request for rework",
        sum(case when col = 0 and col2 = 'rejected' then 1 else 0 end) as "Rejected"
    from (
        select Year_CW, SAMPLED as col, APPROVAL as col2
        from View_TEST tv
    ) tv
    group by Year_CW
)
order by Year_CW desc;

The inner query gets the data and calculates the conditional aggregate values. The outer query just gets those values from the inner query, and also adds the Total_DS column to the result set by adding together the two values from the inner query.


You should generally avoid quoted identifiers, and if you really need them in your result set you should apply them at the last possible moment: use unquoted identifiers in the inner query, and give them quoted aliases in the outer query. And personally, if the point of a query is to count things, I prefer a conditional count over a conditional sum. I'm also not sure why you already have a subquery against your view, which just changes the column names and makes the main query slightly more obscure. So I might do this as:

select year_cw,
    total_sampled_checked as "Total_sampled(Checked)",
    total_unsampled_not_checked as "Total_unsampled(Not_Checked)",
    accepted as "Accepted",
    accepted_with_comments as "Accepted with comments",
    request_for_rework as "Request for rework",
    rejected as "Rejected",
    total_sampled_checked + total_unsampled_not_checked as "Total_DS"
from (
    select year_cw,
        count(case when sampled = 0 then 1 end) as total_sampled_checked,
        count(case when sampled = 1 then 1 end) as total_unsampled_not_checked,
        count(case when sampled = 0 and approval = 'accepted' then 1 end) as accepted,
        count(case when sampled = 0 and approval = 'accepted with comments' then 1 end)
            as accepted_with_comments,
        count(case when sampled = 0 and approval = 'request for rework' then 1 end)
            as request_for_rework,
        count(case when sampled = 0 and approval = 'rejected' then 1 end) as rejected
    from view_test
    group by year_cw
)
order by year_cw desc;

Note that in the case expression, then 1 can be then <anything that isn't null>, so you could do then sampled or whatever. I've left out the implicit else null. As count() ignores nulls, all the case expression has to do is evaluate to any non-null value for the rows you want to include in the count.
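
A quick illustration of that point (a throwaway query against the same view): all three expressions return the same number.

-- count() skips nulls, so any non-null THEN value works;
-- the sum-based form needs an explicit else 0.
select count(case when sampled = 0 then 1 end)       as count_of_1,
       count(case when sampled = 0 then sampled end) as count_of_col,
       sum(case when sampled = 0 then 1 else 0 end)  as sum_based
from view_test;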

One-to-many query and row_number() and conditional aggregation?

If the grid is what you really want to have, then you need to group the rows and check each status with conditional aggregation, and, if I'm not mistaken, you don't even need to make any modifications in C# for it:

SELECT s.StampId
     , s.Country
     , s.Year
     , MAX(CASE WHEN q.statusId = 1 AND q.MintUsedId = 1 THEN 1 ELSE 0 END) AS HaveMNH
     , MAX(CASE WHEN q.statusId = 1 AND q.MintUsedId = 2 THEN 1 ELSE 0 END) AS HaveMH
     , ....
FROM Stamp s
LEFT JOIN StamQuantities sq
    ON s.StampId = sq.StampId
LEFT JOIN Quantities q
    ON q.quantitiesId = sq.quantitiesId
GROUP BY s.StampId
       , s.Country
       , s.Year

Multiple Case When slows down query performance

Another option, which may be easier in your case, is to use APPLY. (OUTER APPLY is similar to LEFT JOIN and CROSS APPLY is similar to INNER JOIN).

You can use conditional aggregation within the subquery.

SELECT t1.ServiceWaittime,
       t1.Starttime,
       t1.Endtime,
       t1.prevEndTime,
       CONVERT(float, t1.Starttime - t1.prevEndTime) / 24.0 / 60.0 AS continuityDuration,
       t1.Duration,
       t1.MainCpseId,
       t1.Yield,
       t1.Scrap,
       t1.MachineYield,
       ISNULL((t1.MachineYield / NULLIF(t1.Duration, 0)), 0) AS Geschwindigkeit,
       t1.Te,
       t1.Tr,
       t1.CalendarWeek,
       t1.InterruptionTriggerAttributeId,
       t1.ReasonId,
       t1.InterruptionName,
       t1.ReasonName,
       el.Sys_Speed_Value, --statement1
       el.attr_103,        --statement2
       el.attr_292,        --statement3
       el.attr_293,        --statement4
       el.attr_294,        --statement5
       el.attr_8159,       --statement6
       el.attr_8175,       --statement7
       el.attr_8186,       --statement8
       el.attr_8208,       --statement9
       el.attr_8209        --statement10
FROM (
    SELECT t1.TimeType,
           t1.ServiceWaittime,
           t1.Starttime,
           t1.Endtime,
           LAG(t1.Endtime) OVER (PARTITION BY t1.MainCpseId ORDER BY t1.Starttime ASC) AS prevEndTime,
           t1.Duration,
           t1.MainCpseId,
           t1.Yield,
           t1.Scrap,
           t1.MachineYield,
           t1.Te,
           t1.Tr,
           t1.CalendarWeek,
           t1.InterruptionTriggerAttributeId,
           t1.ReasonId,
           t2.Name AS InterruptionName,
           t3.Name AS ReasonName
    FROM CpseProcessLogger t1
    LEFT JOIN AttributeType t2 ON t1.InterruptionTriggerAttributeId = t2.Id
    LEFT JOIN Reason t3 ON t1.ReasonId = t3.Id
    WHERE 1 = 1
      AND TimeType IN ('UNT')
      AND MainCpseId = 12
) t1
CROSS APPLY (
    SELECT ISNULL(AVG(CASE WHEN AttributeId = 4 THEN Value END), 0) AS Sys_Speed_Value,
           COUNT(CASE WHEN AttributeId = 103 THEN 1 END) AS attr_103,
           COUNT(CASE WHEN AttributeId = 292 THEN 1 END) AS attr_292,
           COUNT(CASE WHEN AttributeId = 293 THEN 1 END) AS attr_293,
           COUNT(CASE WHEN AttributeId = 294 THEN 1 END) AS attr_294,
           COUNT(CASE WHEN AttributeId = 8159 THEN 1 END) AS attr_8159,
           COUNT(CASE WHEN AttributeId = 8175 THEN 1 END) AS attr_8175,
           COUNT(CASE WHEN AttributeId = 8186 THEN 1 END) AS attr_8186,
           COUNT(CASE WHEN AttributeId = 8208 THEN 1 END) AS attr_8208,
           COUNT(CASE WHEN AttributeId = 8209 THEN 1 END) AS attr_8209
    FROM CpseEventLogger
    WHERE MainCpseId = 12
      AND Timestamp BETWEEN t1.Starttime AND t1.Endtime
) el
ORDER BY t1.Starttime ASC;

For this query to perform well, you want the following indexes:

CpseEventLogger (MainCpseId, Timestamp) INCLUDE (AttributeId, Value)
CpseProcessLogger (MainCpseId, TimeType, InterruptionTriggerAttributeId) INCLUDE (
ReasonId, .... othercolumns)
AttributeType (Id) INCLUDE (Name)
Reason (Id) INCLUDE (Name)

Or instead of a non-clustered index with INCLUDE columns, you could use a clustered index (all non-key columns are INCLUDE).

You may want to swap the positions of InterruptionTriggerAttributeId and ReasonId in the index.
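
As concrete DDL, the first suggestion might look like this (the index name is made up):

-- Covering index for the CROSS APPLY: seek on (MainCpseId, Timestamp),
-- with AttributeId and Value included to avoid key lookups.
CREATE NONCLUSTERED INDEX IX_CpseEventLogger_MainCpseId_Timestamp
ON CpseEventLogger (MainCpseId, Timestamp)
INCLUDE (AttributeId, Value);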

How to implement a conditional match stage in the aggregate pipeline

Use it like this:

const pipeline = [];
if (!isAdmin) pipeline.push({ $match: { isLive: true } });
pipeline.push({ $project: { _id: 1 } });

db.collection.aggregate(pipeline)

You may also use other Array methods to compose your pipeline.

Or, to keep a single pipeline, use a $match that is a no-op for admins:

{ $match: { $expr: { $or: [isAdmin, "$isLive"] } } }

Sub query vs joins performance

I would probably write this query using joins:

SELECT s.siteid,
       COALESCE(si.CountUniquePermissions, 0) AS CountUniquePermissions,
       COALESCE(si.CountNotModified30Days, 0) AS CountNotModified30Days
FROM sites s
LEFT JOIN (
    SELECT siteid,
           COUNT(CASE WHEN CountUniqueRoleAssignments > 0 THEN 1 END)
               AS CountUniquePermissions,
           COUNT(CASE WHEN Modified < DATEADD(day, -30, GETDATE()) THEN 1 END)
               AS CountNotModified30Days
    FROM ScannedItems
    GROUP BY siteid
) si ON si.siteid = s.siteid
ORDER BY s.siteid;

The above query has no WHERE or HAVING clauses, so I don't see any obvious way to tune it further using indexes. But it at least has the potential advantage over your current query that it doesn't involve N^2 behavior from correlated subqueries in the SELECT clause.
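
For contrast, the correlated pattern being avoided would look something like this (a hypothetical reconstruction; the original query isn't shown):

-- Each subquery runs once per row of sites: roughly N scans of ScannedItems.
SELECT s.siteid,
       (SELECT COUNT(*)
        FROM ScannedItems i
        WHERE i.siteid = s.siteid
          AND i.CountUniqueRoleAssignments > 0) AS CountUniquePermissions,
       (SELECT COUNT(*)
        FROM ScannedItems i
        WHERE i.siteid = s.siteid
          AND i.Modified < DATEADD(day, -30, GETDATE())) AS CountNotModified30Days
FROM sites s
ORDER BY s.siteid;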

Add a CASE on a SUM when using conditional aggregation

I would suggest a CTE or subquery:

with cte as (
<your query here>
)
select cte.*
from cte
where movement < 10; -- or whatever condition

Note: You might actually want the absolute value if you really mean -10 to 10 rather than 0 to 10:

where abs(movement) < 10

