Aggregate Strings in Azure SQL Data Warehouse

Update Jan 2019: STRING_SPLIT is now available in SQL Data Warehouse, as per here.

Update Jul 2019: STRING_AGG is now available in SQL Data Warehouse, as per here.
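
With STRING_AGG available, the aggregation described below can now be done natively. As a minimal sketch (assuming a table dbo.SourceTable(ID, Name) like the one used in the test data further down this post):

SELECT
    ID,
    STRING_AGG(Name, ', ') WITHIN GROUP (ORDER BY Name) AS FullName
FROM dbo.SourceTable
GROUP BY ID;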

Azure SQL Data Warehouse is an MPP system designed to hold billions of records and terabytes of data, so as a first port of call you should strongly consider whether it's the right place to do string manipulation at all. It's also fairly expensive, depending on what DWU you are running at. Have a read through this anti-patterns article:

Azure SQL Data Warehouse Workload Patterns and Anti-Patterns
https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/

If you really need to do this in your warehouse then I did do a recent example specifically for Azure SQL Data Warehouse using a procedural approach:

Does Azure SQL Data Warehouse have a way to split strings?

Update Jul 2019 - STRING_SPLIT is now available in Azure SQL Data Warehouse as per here. So in my example below, the code would be more like this:

DECLARE @delimiter CHAR(1) = '-';

CREATE TABLE dbo.guids_split
WITH
(
    DISTRIBUTION = HASH(xguid),
    HEAP
)
AS
SELECT *
FROM dbo.guids g
CROSS APPLY STRING_SPLIT ( xguid, @delimiter );
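
Note that STRING_SPLIT returns a single column called value, so the SELECT * above yields rn, xguid and value. Unlike the manual split further down, it does not return the position of each part, so the order of the split values is not guaranteed.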

Azure SQL Data Warehouse has a reduced T-SQL surface area compared with normal SQL Server or Azure SQL Database. It does not have any of the fancy tricks such as STRING_SPLIT, table-valued functions, CLR or XML; even cursors are not allowed. In fact, of all the techniques in one of the go-to articles on this topic (pre-SQL 2016), 'Split strings the right way - or the next best way', the only one you can use is the numbers table.

Therefore we need something a bit more manual, while avoiding loops of any kind. I used the above article for inspiration, adapted its test data script, and took this approach:

-- Create one million guids
IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers
IF OBJECT_ID('dbo.guids_split') IS NOT NULL DROP TABLE dbo.guids_split
IF OBJECT_ID('dbo.guids') IS NOT NULL DROP TABLE dbo.guids
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp
GO

CREATE TABLE dbo.Numbers (
    Number INT NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN, -- TODO: try distributing on Number?
    CLUSTERED INDEX ( Number )
)
GO

DECLARE @UpperLimit INT = 1000000;

;WITH n AS
(
    SELECT
        x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    CROSS JOIN sys.all_objects AS s3
)
SELECT x
INTO #tmp
FROM n
WHERE x BETWEEN 1 AND @UpperLimit
GO

INSERT INTO dbo.Numbers ( Number )
SELECT x
FROM #tmp
GO

CREATE TABLE dbo.guids (
    rn INT IDENTITY,
    xguid CHAR(36) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(xguid),
    CLUSTERED COLUMNSTORE INDEX
)
GO

INSERT INTO dbo.guids ( xguid )
SELECT NEWID() xguid
FROM dbo.Numbers
GO -- use GO 10, GO 100 etc to scale up to 10 million, 100 million rows

ALTER INDEX ALL ON dbo.guids REBUILD
GO

-- Create the stats
CREATE STATISTICS _st_numbers_number ON dbo.numbers (number);
CREATE STATISTICS _st_guids_rn ON dbo.guids (rn);
CREATE STATISTICS _st_guids_xguid ON dbo.guids (xguid);
GO
-- multi-col stat?

-- NB: the guid has a fixed length, so we don't have to use VARCHAR(MAX)
DECLARE @delimiter VARCHAR(1) = '-';

CREATE TABLE dbo.guids_split
WITH
(
    DISTRIBUTION = HASH(xguid),
    HEAP
)
AS
SELECT
    s.rn,
    n.Number AS n,
    s.originalid AS xguid,
    LTRIM( RTRIM( SUBSTRING( s.xguid, n.Number + 1, CHARINDEX( @delimiter, s.xguid, n.Number + 1 ) - n.Number - 1 ) ) ) AS split_value
FROM (
    SELECT
        rn,
        xguid AS originalid,
        -- wrap the guid in a leading and trailing delimiter so every part has a delimiter on both sides
        CAST( CAST( @delimiter AS VARCHAR(38) ) + CAST( xguid AS VARCHAR(38) ) + CAST( @delimiter AS VARCHAR(38) ) AS VARCHAR(38) ) AS xguid
    FROM dbo.guids
) s
CROSS JOIN dbo.Numbers n
WHERE n.Number < LEN( s.xguid )
AND SUBSTRING( s.xguid, n.Number, 1 ) = @delimiter;
GO
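
To see how this works, take a shortened hypothetical value 'AB-CD': the inner query wraps it as '-AB-CD-', the join against the numbers table finds the delimiters at positions 1 and 4, and each SUBSTRING then extracts the text between that delimiter and the next one found by CHARINDEX ('AB' from position 2, 'CD' from position 5). The trailing delimiter guarantees CHARINDEX always finds a closing delimiter for the last part.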

/*
SELECT TOP 10 * FROM dbo.guids ORDER BY rn;

SELECT *
FROM dbo.guids_split
WHERE rn IN ( SELECT TOP 10 rn FROM dbo.guids ORDER BY rn )
ORDER BY 1, 2;
GO
*/

The script is now tested on ADW and worked satisfactorily over 100 million records. It ran in under 4 minutes at only DWU 400 (at least once I had added the stats and removed the VARCHAR(MAX)). The guids are, however, a slightly artificial example, as the data is uniform in size and there are always only five parts to split.

Getting good performance out of Azure SQL Data Warehouse is largely about minimising data movement via a good hash distribution key. Therefore please post some realistic sample data.
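
As a hypothetical illustration (the table and column names here are made up), distributing two large tables on the key they are joined on lets the join run without any shuffle moves:

CREATE TABLE dbo.FactSales (
    CustomerKey INT NOT NULL,
    Amount DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey), -- same hash key as the table it is joined to
    CLUSTERED COLUMNSTORE INDEX
);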

The other alternative is Azure Data Lake Analytics (ADLA). ADLA supports federated queries to "query data where it lives", so you could query the original table using U-SQL, split it using the native .NET method, and output a file which could easily be imported using Polybase. Let me know if you need more help with this approach and I'll do up an example.
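
For reference, the Polybase side might look something like this minimal sketch; the external data source, file format and output path below are all assumptions for illustration:

-- Hypothetical Polybase import of the split output file
-- (assumes an external data source and a delimited file format have already been created)
CREATE EXTERNAL TABLE dbo.guids_split_ext (
    rn INT,
    split_value VARCHAR(8)
)
WITH (
    LOCATION = '/output/guids_split/', -- assumed output folder
    DATA_SOURCE = MyAzureStorage,      -- assumed external data source
    FILE_FORMAT = CsvFileFormat        -- assumed delimited file format
);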

The SQLCat team have since published this article on workload patterns and anti-patterns with SQL Data Warehouse; this type of string processing might be considered an example of an anti-pattern. Please read this article:

https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/

Optimal way to concatenate/aggregate strings

SOLUTION

The definition of optimal can vary, but here's how to concatenate strings from different rows using regular Transact SQL, which should work fine in Azure.

;WITH Partitioned AS
(
    SELECT
        ID,
        Name,
        ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Name) AS NameNumber,
        COUNT(*) OVER (PARTITION BY ID) AS NameCount
    FROM dbo.SourceTable
),
Concatenated AS
(
    -- anchor: the first name in each group
    -- NB cast to an explicit length; a bare "nvarchar" defaults to nvarchar(30) and silently truncates
    SELECT
        ID,
        CAST(Name AS nvarchar(4000)) AS FullName,
        Name,
        NameNumber,
        NameCount
    FROM Partitioned
    WHERE NameNumber = 1

    UNION ALL

    -- recursive step: append the next name in the group
    SELECT
        P.ID,
        CAST(C.FullName + ', ' + P.Name AS nvarchar(4000)),
        P.Name,
        P.NameNumber,
        P.NameCount
    FROM Partitioned AS P
    INNER JOIN Concatenated AS C
        ON P.ID = C.ID
        AND P.NameNumber = C.NameNumber + 1
)
SELECT
    ID,
    FullName
FROM Concatenated
WHERE NameNumber = NameCount

EXPLANATION

The approach boils down to three steps:

  1. Number the rows using ROW_NUMBER() OVER with PARTITION BY, grouping and ordering them as needed for the concatenation. The result is the Partitioned CTE. We keep a count of the rows in each partition to filter the results later.

  2. Using a recursive CTE (Concatenated), iterate through the row numbers (the NameNumber column), appending Name values to the FullName column.

  3. Filter out all results but the ones with the highest NameNumber.

Please keep in mind that, in order to make this query predictable, one has to define both the grouping (for example, in your scenario rows with the same ID are concatenated) and the sorting (I assumed that you simply sort the strings alphabetically before concatenation).
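
One caveat: recursive CTEs are limited to 100 levels of recursion by default, so if any ID can have more than about 100 names, add a MAXRECURSION hint to the final statement:

SELECT
    ID,
    FullName
FROM Concatenated
WHERE NameNumber = NameCount
OPTION (MAXRECURSION 0); -- 0 removes the default 100-level limit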

I've quickly tested the solution on SQL Server 2012 with the following data:

INSERT dbo.SourceTable (ID, Name)
VALUES
(1, 'Matt'),
(1, 'Rocks'),
(2, 'Stylus'),
(3, 'Foo'),
(3, 'Bar'),
(3, 'Baz')
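
For completeness, the test assumes a table shaped something like this:

CREATE TABLE dbo.SourceTable (
    ID INT NOT NULL,
    Name NVARCHAR(100) NOT NULL
);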

The query result:

ID          FullName
----------- ------------------------------
2           Stylus
3           Bar, Baz, Foo
1           Matt, Rocks

Cast binary column to string in Azure SQL Data Warehouse

Using CONVERT should solve that problem:

CONVERT(VARCHAR(32), HASHBYTES('MD5', 'abc123'), 2)

This works because CONVERT lets you specify the style parameter, which is needed when converting a varbinary value. It is described here:
https://technet.microsoft.com/pl-pl/library/ms187928(v=sql.105).aspx

Here is the Remarks part of that documentation about binary conversion with CONVERT:

Binary Styles

When expression is binary(n), varbinary(n), char(n), or varchar(n), style can be one of the values shown in the following table. Style values that are not listed in the table return an error.

0 (default)

Translates ASCII characters to binary bytes or binary bytes to ASCII characters. Each character or byte is converted 1:1. If the data_type is a binary type, the characters 0x are added to the left of the result.

1, 2

If the data_type is a binary type, the expression must be a character expression. The expression must be composed of an even number of hexadecimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, a, b, c, d, e, f). If the style is set to 1, the characters 0x must be the first two characters in the expression. If the expression contains an odd number of characters, or if any of the characters are invalid, an error is raised. If the length of the converted expression is greater than the length of the data_type, the result will be right-truncated. Fixed-length data_types that are larger than the converted result will have zeros added to the right of the result.

If the data_type is a character type, the expression must be a binary expression. Each binary byte is converted into two hexadecimal characters. If the length of the converted expression is greater than the data_type length, it will be right-truncated. If the data_type is a fixed-size character type and the length of the converted result is less than the length of the data_type, spaces are added to the right of the converted expression to maintain an even number of hexadecimal digits. The characters 0x will be added to the left of the converted result for style 1.
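
As a quick illustration of the difference between styles 1 and 2 when converting binary to character data (MD5 produces 16 bytes, i.e. 32 hex characters; style 1 adds the 0x prefix, hence the two extra characters):

SELECT
    CONVERT(VARCHAR(34), HASHBYTES('MD5', 'abc123'), 1) AS with_0x_prefix, -- '0x...'
    CONVERT(VARCHAR(32), HASHBYTES('MD5', 'abc123'), 2) AS hex_only;       -- no prefix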

Azure Data Factory / Aggregate as comma-separated string

Faced the same issue recently. I ended up sinking the table, then creating a new data flow that queried that table using SQL GROUP BY and STRING_AGG().

MAY 2020 update:

Microsoft released a new feature, the collect() transformation function: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions#collect
It can be used in an Aggregate step to produce an array of the values you need, followed by a Derived Column step that casts the array to a string with toString().


