String_Agg Not Behaving as Expected

Yes, this is a Bug (tm), present in all versions of SQL Server 2017 (as of writing). It's fixed in Azure SQL Server and 2019 RC1. Specifically, the part of the optimizer that performs common subexpression elimination (ensuring that expressions are not calculated more often than necessary) improperly treats all expressions of the form STRING_AGG(x, <separator>) as identical as long as x matches, no matter what <separator> is, and unifies them with the first such expression calculated in the query.

One workaround is to make sure x does not match by performing some sort of (near-)identity transformation on it. Since we're dealing with strings, concatenating an empty one will do:

SELECT y, STRING_AGG(z, '+') AS STRING_AGG_PLUS, STRING_AGG('' + z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP BY y;
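
For contrast, here is a sketch of the same query without the workaround. Going by the description above, an affected SQL Server 2017 build would be expected to reuse the first expression for both columns, so STRING_AGG_MINUS would come back joined with '+' instead of '-'. The exact behaviour on any given build is an assumption worth verifying yourself.

```sql
-- Same data as above, but both aggregates use the identical input expression z.
-- On a build with the bug, the second column may silently reuse the '+' result.
SELECT y,
       STRING_AGG(z, '+') AS STRING_AGG_PLUS,
       STRING_AGG(z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP BY y;
```

If both columns already differ on your instance, the fix is in place and the '' + z transformation is not needed.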

SQL Server STRING_AGG function sorting is not working as expected

Firstly, I do agree that the behaviour you're getting shouldn't be happening, however, Stack Overflow isn't for reporting bugs with applications. For SQL Server, that should be done in their Azure Feedback portal.

As for resolving the issue, removing the redundant DISTINCT from your COUNT causes the problem to disappear. To implement a DISTINCT (either in a SELECT DISTINCT or a COUNT(DISTINCT {expression})), SQL Server first needs to sort the results, as it can then easily remove any values that have the same sort position. As a result, that sort order is being applied to your STRING_AGG expressions, even though they have an explicit ORDER BY clause.

The reason I say your DISTINCT is redundant is because at that point in the query there will be no duplicate values of Manager for a given value of ClCode. This is because you already grouped on both Manager and ClCode in the subquery. If you run that query alone, you'll see that Manager doesn't have any duplicates:

WITH tbl AS
(SELECT Id,
ClCode,
Manager,
ChangeDate
FROM (VALUES (1, '000005', 'Cierra Vega', '2017-10-05'),
(2, '000005', 'Alden Cantrell', '2017-11-29'),
(3, '000005', 'Alden Cantrell', '2017-11-30'),
(4, '000005', 'Kierra Gentry', '2018-09-05'),
(5, '000005', 'Kierra Gentry', '2018-09-12'),
(6, '000005', 'Pierre Cox', '2018-11-06'),
(7, '000005', 'Thomas Crane', '2019-09-11'),
(8, '000005', 'Thomas Crane', '2019-10-01'),
(9, '000005', 'Miranda Shaffer', '2020-04-27'),
(10, '000360', 'Bradyn Kramer', '2017-10-06')) t (Id, ClCode, Manager, ChangeDate) )
SELECT x.ClCode,
x.[Manager],
MIN(x.ChangeDate) AS [MinChangeDate]
FROM tbl x
GROUP BY x.ClCode,
x.[Manager];

As such, the DISTINCT in the COUNT is just added overhead for the instance, as it's not required (SQL Server has already sorted the data for the GROUP BY, so why ask it to sort it again?). If you are using a DISTINCT in a query you've already aggregated, then you very likely don't need it.
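
A minimal sketch of what the outer query could look like once the DISTINCT is dropped. The original query is not reproduced here, so the CTE name grouped, the aliases ManagerCount and Managers, and the shortened sample data are all assumptions made for illustration.

```sql
WITH tbl AS (
    -- Shortened, illustrative subset of the sample rows above
    SELECT Id, ClCode, Manager, ChangeDate
    FROM (VALUES (1, '000005', 'Cierra Vega',    '2017-10-05'),
                 (2, '000005', 'Alden Cantrell', '2017-11-29'),
                 (3, '000005', 'Alden Cantrell', '2017-11-30'),
                 (4, '000005', 'Kierra Gentry',  '2018-09-05'),
                 (10, '000360', 'Bradyn Kramer', '2017-10-06')
         ) t (Id, ClCode, Manager, ChangeDate)
),
grouped AS (
    -- One row per (ClCode, Manager), so Manager is already unique per ClCode
    SELECT x.ClCode,
           x.Manager,
           MIN(x.ChangeDate) AS MinChangeDate
    FROM tbl x
    GROUP BY x.ClCode, x.Manager
)
SELECT g.ClCode,
       COUNT(g.Manager) AS ManagerCount,  -- plain COUNT, no DISTINCT needed
       STRING_AGG(g.Manager, ', ') WITHIN GROUP (ORDER BY g.MinChangeDate) AS Managers
FROM grouped g
GROUP BY g.ClCode;
```

Because the inner GROUP BY already guarantees one row per manager and client code, COUNT(g.Manager) and COUNT(DISTINCT g.Manager) would return the same number here.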

Why is my ORDER BY in STRING_AGG not always working?

This appears to be a bug in the optimizer.

The optimizer, having realized that the join is a self-join, is transforming it into a window aggregate. It can do this despite STRING_AGG not being available as a window aggregate. The rule is called GenGbApplySimple, and allows a self-join to be converted to a window aggregate. There is nothing specifically wrong with this so far.

The problem is that the aggregation is over the wrong value. It is aggregating the outer value rather than the inner one.

If you give the two references different aliases, then a careful examination of the query plan reveals the bug.

STRING_AGG([dbo].[HashTable].[Hash] as [HT1].[Hash],'')
WITHIN GROUP (ORDER BY [HT2].[Hash])

The other issue is that the aggregates normally used with that rule (e.g. MIN, MAX, AVG) don't have a WITHIN GROUP ordering to satisfy, so the replacement plan doesn't account for it. It seems likely that STRING_AGG was not intended to work with the GbApply rules, or that work would be needed to make it compatible (honouring the sort request).

As you can see below, the Sort only orders by the correlation column GroupIdentifier, not by the Hash column used in the WITHIN GROUP.

<OrderBy>
<OrderByColumn Ascending="1">
<ColumnReference
Database="[...]"
Schema="[dbo]"
Table="[HashTable]"
Alias="[HT1]"
Column="GroupIdentifier">
</ColumnReference>
</OrderByColumn>
</OrderBy>

If you are a sysadmin, you can turn this rule off for the query, by using the following undocumented OPTION.

OPTION (QUERYRULEOFF GenGbApplySimple)

As a workaround, one option to prevent this optimization from being applied is to use a grouped OUTER APPLY:

UPDATE HT1
SET GroupHashList = C.HashList
OUTPUT inserted.*
FROM HashTable AS HT1
OUTER APPLY
(
SELECT
HashList =
STRING_AGG(HT2.[Hash], ';')
WITHIN GROUP (ORDER BY HT2.[Hash] ASC)
FROM HashTable AS HT2
WHERE HT2.GroupIdentifier = HT1.GroupIdentifier
) C;

This gets you a pretty straightforward self-join with a Stream Aggregate.


I strongly suggest you file this as a bug with Microsoft.

You could also leave feedback, but that does not typically lead to a specific response.


As an aside, you should follow the aliasing rules suggested by Conor Cunningham when writing multi-table UPDATE statements:

The non-ANSI FROM clause (which you are using here) has specific binding behaviors that may or may not be what you expect. I will suggest you start by aliasing the 3 references to hashtable to be different and then make sure you are explicitly refering to the one you want. It may be (I am guessing) that it is binding to a different one than you think and providing you an undesired output as a result.

order by in string_agg does not seems to work

Your line_no column is varchar, so it is sorted as text rather than as a number:

'1' < '14' < '16' < '17' < '2'

Casting the varchar to int in the ORDER BY solves the problem.

SELECT recid,
       STRING_AGG(DefaultDimension, '-') WITHIN GROUP (ORDER BY CAST(line_no AS int) ASC) AS DefaultDimension,
       STRING_AGG(DefaultDimensionName, '-') WITHIN GROUP (ORDER BY CAST(line_no AS int) ASC) AS DefaultDimensionName
FROM #tmp
GROUP BY recid;
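
To see the difference in isolation, here is a small sketch using a throwaway temp table. The table name #demo, the column sizes, and the sample rows are assumptions made for illustration, not the original data.

```sql
-- Hypothetical sample: line numbers stored as text
CREATE TABLE #demo (recid int, line_no varchar(10), DefaultDimension varchar(20));

INSERT INTO #demo (recid, line_no, DefaultDimension)
VALUES (1, '1', 'A'), (1, '2', 'B'), (1, '14', 'C');

-- Text ordering: '1' < '14' < '2', so 'C' is placed before 'B'
SELECT recid,
       STRING_AGG(DefaultDimension, '-') WITHIN GROUP (ORDER BY line_no) AS TextOrder
FROM #demo
GROUP BY recid;

-- Numeric ordering: 1 < 2 < 14
SELECT recid,
       STRING_AGG(DefaultDimension, '-') WITHIN GROUP (ORDER BY CAST(line_no AS int)) AS NumericOrder
FROM #demo
GROUP BY recid;

DROP TABLE #demo;
```

If line_no could ever contain non-numeric text, CAST would raise a conversion error; TRY_CAST returns NULL for such rows instead, which may or may not be the ordering you want.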

T-SQL STRING_AGG problems dunno if bad writing or just not working

I don't think you want to group by InventoryId if that's what you're concatenating. Try this:

Edit: you also need to remove the columns that differ from row to row.

SELECT p.FirstName [Spelers Voornaam]
,p.LastName [Spelers Achternaam]
,pa.FamilyName [Familie's Groeps Naam]
,string_agg (i.InventoryId, ',') as [In Inventory]
FROM Player AS p
LEFT JOIN PlayerAvatar AS pa ON p.PlayerId = pa.PlayerId
LEFT JOIN Avatar AS Av ON pa.AvatarId = Av.AvatarId
LEFT JOIN Avatar AS a ON pa.AvatarId = a.AvatarId
LEFT JOIN Inventory as i on i.InventoryId = pa.InventoryId
LEFT JOIN Item as it on it.ItemId = i.ItemId
GROUP BY p.FirstName, p.LastName, pa.FamilyName;

Or you can aggregate those columns too.

SELECT p.FirstName [Spelers Voornaam]
,p.LastName [Spelers Achternaam]
,string_agg(pa.AvatarName,',') [Spelers Avatarnaam]
,pa.FamilyName [Familie's Groeps Naam]
,string_agg(Av.Type,',') [Avatar's Type]
,string_agg (i.InventoryId, ',') as [In Inventory]

FROM Player AS p
LEFT JOIN PlayerAvatar AS pa ON p.PlayerId = pa.PlayerId
LEFT JOIN Avatar AS Av ON pa.AvatarId = Av.AvatarId
LEFT JOIN Avatar AS a ON pa.AvatarId = a.AvatarId
LEFT JOIN Inventory as i on i.InventoryId = pa.InventoryId
LEFT JOIN Item as it on it.ItemId = i.ItemId
GROUP BY p.FirstName, p.LastName, pa.FamilyName;

STRING_AGG working on compatibility level 140

This behavior is documented at the very last line of the remarks section:

STRING_AGG is available in any compatibility level.

This means that if you are running a SQL Server version that supports STRING_AGG:

  • SQL Server 2017 (or higher)
  • Azure SQL Database
  • Azure Synapse Analytics (SQL DW)

the STRING_AGG built-in function will work regardless of the compatibility level that is set for the specific database you're working with.
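
A quick way to check this on your own instance is sketched below: read the current database's compatibility level, then call STRING_AGG on a couple of literal rows. The column alias joined and the sample values are illustrative only.

```sql
-- Show the compatibility level of the database you are connected to
SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();

-- STRING_AGG should run here whatever level the query above reports,
-- provided the server version itself supports the function
SELECT STRING_AGG(v, ',') AS joined
FROM (VALUES ('a'), ('b')) t (v);
```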

SQL Server - STRING_AGG separator as conditional expression

You are almost there. Just reverse the order and use stuff and you can eliminate the need for a cte and most of the string functions:

SELECT STUFF(
STRING_AGG(
(IIF(i % 2 = 0, ' OR ', ' AND '))+c
, '') WITHIN GROUP (ORDER BY i)
, 1, 5, '') AS r
FROM t;

Results: a OR b AND c OR d AND e

Since i % 2 equals 1 for the first row, you know the STRING_AGG result will always start with ' AND ': ' AND a OR b...'
Then all you do is remove the first 5 characters (the length of ' AND ') using STUFF, and you're home free.

I've also taken the liberty of replacing the CASE expression with the shorter IIF.
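
For a version you can run on its own, here is the same query with the table replaced by an inline CTE. The sample rows (i from 1 to 5, c from 'a' to 'e') are an assumption chosen to match the result shown above.

```sql
-- Hypothetical stand-in for table t
WITH t (i, c) AS (
    SELECT i, c
    FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')) v (i, c)
)
SELECT STUFF(
           STRING_AGG(IIF(i % 2 = 0, ' OR ', ' AND ') + c, '')
               WITHIN GROUP (ORDER BY i),  -- builds ' AND a OR b AND c OR d AND e'
           1, 5, '') AS r                  -- strips the leading ' AND '
FROM t;
-- r: a OR b AND c OR d AND e
```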

Update

Well, in case the first separator is not known in advance, I couldn't come up with a single-query solution, but I still think I found a simpler solution than the one you've posted. It splits my initial solution into a CTE containing the STRING_AGG and a SELECT from it that applies the STUFF, determining the length of the leading delimiter by repeating the condition on the first value of i:

WITH CTE AS
(
SELECT MIN(i) As firstI,
STRING_AGG(
(IIF(i % 2 = 0, ' OR ', ' AND '))+c
, '') WITHIN GROUP (ORDER BY i)
AS r
FROM t
)

SELECT STUFF(r, 1, IIF(firstI % 2 = 0, 4, 5), '') AS r
FROM CTE;
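
As a sanity check, here is a sketch of the same approach against data whose first i is even, so the leading delimiter is ' OR ' (4 characters) rather than ' AND ' (5). The sample rows are assumptions made for illustration.

```sql
-- Hypothetical stand-in for table t, starting at an even i
WITH t (i, c) AS (
    SELECT i, c
    FROM (VALUES (2, 'b'), (3, 'c'), (4, 'd')) v (i, c)
),
CTE AS (
    SELECT MIN(i) AS firstI,
           STRING_AGG(IIF(i % 2 = 0, ' OR ', ' AND ') + c, '')
               WITHIN GROUP (ORDER BY i) AS r   -- builds ' OR b AND c OR d'
    FROM t
)
SELECT STUFF(r, 1, IIF(firstI % 2 = 0, 4, 5), '') AS r  -- strips 4 characters here
FROM CTE;
-- r: b AND c OR d
```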

Use String_AGG to query with condition in SQL?

I finally found the most accurate answer to my question. Thanks all!
The best solution is:

ALTER PROC FindPolicyByService
@ServiceCode varchar(200)
AS
BEGIN
SELECT p.ID AS PolicyID,
p.Code AS PolicyCode,
p.Name AS PolicyName,
STRING_AGG(s.Code, ',') AS ServiceCode
FROM dbo.DXBusinessPolicy_Policy AS p
JOIN dbo.DXBusinessPolicy_PolicyService AS ps ON p.ID = ps.PolicyID
JOIN dbo.DXBusinessPolicy_Service AS s ON ps.ServiceID = s.ID
WHERE p.ID IN
(
SELECT subps.PolicyID
FROM dbo.DXBusinessPolicy_PolicyService AS subps
JOIN dbo.DXBusinessPolicy_Service AS subs ON subps.ServiceID = subs.ID
WHERE subs.Code = @ServiceCode
)
GROUP BY p.ID, p.Code, p.Name
END

Or the other solution:

ALTER PROC FindPolicyByService
@ServiceCode varchar(200)
AS
BEGIN
SELECT DISTINCT policy.ID AS PolicyID,
policy.Code AS PolicyCode,
policy.Name AS PolicyName,
(SELECT STRING_AGG(tempService.Code, ',') FROM dbo.DXBusinessPolicy_Policy tempPolicy
JOIN dbo.DXBusinessPolicy_PolicyService tempPolicyService
ON tempPolicy.ID = tempPolicyService.PolicyID
JOIN dbo.DXBusinessPolicy_Service tempService
ON tempPolicyService.ServiceID = tempService.ID
-- correlate the subquery to the outer row with fully qualified columns
WHERE tempPolicyService.PolicyID = policyservice.PolicyID) AS ServiceCode
FROM dbo.DXBusinessPolicy_Policy policy
JOIN dbo.DXBusinessPolicy_PolicyService policyservice
ON policy.ID = policyservice.PolicyID
JOIN dbo.DXBusinessPolicy_Service service
ON policyservice.ServiceID = service.ID AND service.Code = @ServiceCode
GROUP BY
policy.ID,
policyservice.PolicyID,
policy.Code,
policy.Name
END

Both gave me the same result I expected:

PolicyCode      PolicyName             Services
COMBO.2103001   [Giá nền] T9/2020 #1   INT,IPTV
INT.2103001     Chính sách 2           INT

SQL how to prevent duplicates in STRING_AGG when joining multiple tables

One approach in this sort of situation is to STRING_AGG each of the constituent tables (or sets of tables) first, and then LEFT JOIN those derived tables onto the main table. This sidesteps the row multiplication that can occur during consecutive LEFT JOINs.

In your case, try something like this:

SELECT
shows.*,
show_genre_names.genre_names,
show_actors.actor_ids,
show_actors.actor_names
FROM
shows
LEFT JOIN
( -- one row per show_id
SELECT
sg.show_id,
STRING_AGG(g.name, ', ') AS genre_names
FROM
show_genres sg
JOIN genres g ON g.id = sg.genre_id
GROUP BY
sg.show_id
) show_genre_names
ON shows.id = show_genre_names.show_id
LEFT JOIN
( -- one row per show_id
SELECT
sc.show_id,
STRING_AGG(a.id, ', ') AS actor_ids,
STRING_AGG(a.name, ', ') AS actor_names
FROM
show_characters sc
JOIN actors a ON a.id = sc.actor_id
GROUP BY
sc.show_id
) show_actors
ON shows.id = show_actors.show_id
WHERE
shows.id = 1390
;

You can solve this in other ways, too, but understanding this technique will be helpful in your SQL journey.
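
To see the multiplication problem that the derived tables avoid, here is a small sketch that joins two child sets directly before aggregating. The inline CTEs and their sample rows are assumptions made for illustration, and they are deliberately simpler than the real schema.

```sql
-- Hypothetical data: one show with two genres and two actors
WITH shows (id) AS (
    SELECT 1
),
show_genres (show_id, name) AS (
    SELECT show_id, name
    FROM (VALUES (1, 'Drama'), (1, 'Comedy')) v (show_id, name)
),
show_actors (show_id, name) AS (
    SELECT show_id, name
    FROM (VALUES (1, 'Ann'), (1, 'Bob')) v (show_id, name)
)
-- 2 genres x 2 actors = 4 joined rows, so every name is aggregated twice
SELECT s.id,
       STRING_AGG(g.name, ', ') AS genre_names,
       STRING_AGG(a.name, ', ') AS actor_names
FROM shows s
LEFT JOIN show_genres g ON g.show_id = s.id
LEFT JOIN show_actors a ON a.show_id = s.id
GROUP BY s.id;
```

Aggregating each child set in its own derived table first, as in the query above, keeps one row per show on each side of the join, so no value is repeated.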


