How to Scale Pivoting in BigQuery

How to scale Pivoting in BigQuery?

I tried the approach below for up to 6,000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit on the number of columns in a table.

STEP 1 - Aggregate plays by user / artist

SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays 
FROM [mydataset.stats] GROUP BY 1, 2

STEP 2 – Normalize uid and aid so they are consecutive numbers 1, 2, 3, …

We need this for at least two reasons: a) to make the dynamically generated SQL later as compact as possible and b) to have more usable/friendly column names

Combined with the first step, it will be:

SELECT u.uid AS uid, a.aid AS aid, plays 
FROM (
SELECT userGUID, artistGUID, COUNT(1) AS plays
FROM [mydataset.stats]
GROUP BY 1, 2
) AS s
JOIN (
SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID

Let's write the output to a table: mydataset.aggs
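If you prefer Standard SQL, a minimal sketch of the same query written as a CREATE TABLE ... AS statement (so the result is materialized directly, assuming the same mydataset.stats source and column names as above) could look like this:

CREATE TABLE mydataset.aggs AS
-- Standard SQL version of the Legacy SQL query above
SELECT
  u.uid AS uid,
  a.aid AS aid,
  s.plays
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays
  FROM mydataset.stats
  GROUP BY 1, 2
) AS s
JOIN (
  -- ROW_NUMBER() OVER() numbers the distinct users 1, 2, 3, ...
  SELECT userGUID, ROW_NUMBER() OVER() AS uid
  FROM mydataset.stats
  GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
  -- same for distinct artists
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid
  FROM mydataset.stats
  GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID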

STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time.
In my particular example, by experimenting, I found that the basic approach works well for between 2,000 and 3,000 features.
To be on the safe side, I decided to use 2,000 features at a time.

The script below dynamically generates a query that is then run to create the partitioned tables.

SELECT 'SELECT uid,' + 
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)

The above query produces yet another query, like the one below:

SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid

This should be run and the result written to mydataset.pivot_1_2000.

Executing STEP 3 two more times (adjusting the HAVING aid > NNNN and aid < NNNN range) gives us two more tables, mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
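For example, the generator for the second partition differs only in the HAVING range; its output query is then run and the result written to mydataset.pivot_2001_4000:

SELECT 'SELECT uid,' + 
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)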

As you can see, mydataset.pivot_1_2000 has the expected schema but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.

STEP 4 – Merge all partitioned pivot tables into a final pivot table with all features represented as columns in one table

Same as in the steps above: first we generate the query and then run it.
Initially we will “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, then join that result with mydataset.pivot_4001_6000.

SELECT 'SELECT x.uid uid,' + 
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
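The generated string will look something like this (abbreviated; the full column list runs from a1 through a4000):

SELECT x.uid uid, a1, a2, a3, . . . , a3999, a4000
FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid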

The output string from above should be run and the result written to mydataset.pivot_1_4000.

Then we repeat STEP 4 as below:

SELECT 'SELECT x.uid uid,' + 
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)

The result is written to mydataset.pivot_1_6000.

The resulting table has the following schema:

uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int 

NOTE:

a. I tried this approach only up to 6,000 features, and it worked as expected

b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes

c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables are relatively small (30-40 MB), and so are the billed bytes. For “before 2016” projects everything is billed as tier 1, but after October 2016 this can be an issue.

For more information, see Timing in High-Compute queries

d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea.

BigQuery Pivot Data Rows Columns

There is no nice way of doing this in BigQuery, but you can do it following the idea below.

Step 1

Run the query below:

SELECT 'SELECT CUST_createdMonth, ' + 
GROUP_CONCAT_UNQUOTED(
'EXACT_COUNT_DISTINCT(IF(Transaction_Month = "' + Transaction_Month + '", ConsumerId, NULL)) as [m_' + REPLACE(Transaction_Month, '/', '_') + ']'
)
+ ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
FROM (
SELECT Transaction_Month
FROM yourTable
GROUP BY Transaction_Month
ORDER BY Transaction_Month
)

As a result, you will get a string like the one below (formatted here for readability's sake):

SELECT
CUST_createdMonth,
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/01/2015", ConsumerId, NULL)) AS [m_01_01_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/02/2015", ConsumerId, NULL)) AS [m_01_02_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/03/2015", ConsumerId, NULL)) AS [m_01_03_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/04/2015", ConsumerId, NULL)) AS [m_01_04_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/05/2015", ConsumerId, NULL)) AS [m_01_05_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/06/2015", ConsumerId, NULL)) AS [m_01_06_2015]
FROM yourTable
GROUP BY
CUST_createdMonth
ORDER BY
CUST_createdMonth

Step 2

Just run above composed query

The result will be like below:

CUST_createdMonth   m_01_01_2015   m_01_02_2015   m_01_03_2015   m_01_04_2015   m_01_05_2015   m_01_06_2015
01/01/2015          2              1              0              0              0              0
01/02/2015          0              3              1              0              0              0
01/03/2015          0              0              2              1              0              1
01/04/2015          0              0              0              2              1              0

Note

Step 1 is helpful if you have many months to pivot, which would otherwise be too much manual work.

In this case, Step 1 helps you generate your query.

You can see more about pivoting in my other posts.

How to scale Pivoting in BigQuery?

Please note: there is a limit of 10K columns per table, so you are limited to 10K pivoted columns.

You can also see the posts below as simplified examples (if the one above is too complex/verbose):

How to transpose rows to columns with large amount of the data in BigQuery/SQL?

How to create dummy variable columns for thousands of categories in Google BigQuery?

Pivot Repeated fields in BigQuery

Transpose rows into columns in BigQuery (Pivot implementation)

BigQuery does not yet support pivoting functions

You can still do this in BigQuery using the approach below

But first, in addition to the two columns in the input data, you must have one more column that specifies which group of input rows needs to be combined into one output row

So, I assume your input table (yourTable) looks like below

id  Key            Value
1   channel_title  Mahendra Guru
1   youtube_id     ugEGMG4-MdA
1   channel_id     UCiDKcjKocimAO1tV
1   examId         72975611-4a5e-11e5
1   postId         1189e340-b08f

2   channel_title  Ab Live
2   youtube_id     3TNbtTwLY0U
2   channel_id     UCODeKM_D6JLf8jJt
2   examId         72975611-4a5e-11e5
2   postId         0c3e6590-afeb

So, first you should run the query below:

SELECT 'SELECT id, ' + 
GROUP_CONCAT_UNQUOTED(
'MAX(IF(key = "' + key + '", value, NULL)) as [' + key + ']'
)
+ ' FROM yourTable GROUP BY id ORDER BY id'
FROM (
SELECT key
FROM yourTable
GROUP BY key
ORDER BY key
)

The result of the above query will be a string that (when formatted) looks like the one below:

SELECT 
id,
MAX(IF(key = "channel_id", value, NULL)) AS [channel_id],
MAX(IF(key = "channel_title", value, NULL)) AS [channel_title],
MAX(IF(key = "examId", value, NULL)) AS [examId],
MAX(IF(key = "postId", value, NULL)) AS [postId],
MAX(IF(key = "youtube_id", value, NULL)) AS [youtube_id]
FROM yourTable
GROUP BY id
ORDER BY id

You should now copy the above result (note: you don't really need to format it; I did that for presentation only) and run it as a normal query.

The result will be as you would expect:

id  channel_id         channel_title  examId              postId         youtube_id
1   UCiDKcjKocimAO1tV  Mahendra Guru  72975611-4a5e-11e5  1189e340-b08f  ugEGMG4-MdA
2   UCODeKM_D6JLf8jJt  Ab Live        72975611-4a5e-11e5  0c3e6590-afeb  3TNbtTwLY0U

Please note: you can skip Step 1 if you can construct the proper query (as in Step 2) yourself and the number of fields is small and constant, or if it is a one-time deal. Step 1 is just a helper step that builds the query for you, so you can create it quickly at any time!

If you are interested - you can see more about pivoting in my other posts.

How to scale Pivoting in BigQuery?

Please note: there is a limit of 10K columns per table, so you are limited to 10K pivoted columns.

You can also see the posts below as simplified examples (if the one above is too complex/verbose):

How to transpose rows to columns with large amount of the data in BigQuery/SQL?

How to create dummy variable columns for thousands of categories in Google BigQuery?

Pivot Repeated fields in BigQuery

Pivot data in BigQuery SQL?


Q. Is this possible in the same BigQuery call? Or should I create a new table and pivot it from there?

In general, you can use that “complicated query” as a subquery, with the extra logic applied on top of your current result.
So it is definitely doable. But the code can quickly become hard to manage, so you may consider writing the result into a new table and then pivoting from there.
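As a minimal sketch (with hypothetical column names organization, metric and value standing in for your actual schema), the subquery form would look something like this:

SELECT
  organization,
  MAX(IF(metric = 'm1', value, NULL)) AS m1,
  MAX(IF(metric = 'm2', value, NULL)) AS m2
FROM (
  -- your existing "complicated query" goes here as the subquery
  SELECT organization, metric, value FROM yourTable
)
GROUP BY organization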

If you stick with the pivot direction (the way you described in your question), check the link below for a detailed intro on how you can implement a pivot within BigQuery.

How to scale Pivoting in BigQuery?

Please note: there is a limit of 10K columns per table, so you are limited to 10K pivoted columns.

You can also see the posts below as simplified examples (if the one above is too complex/verbose):

How to transpose rows to columns with large amount of the data in BigQuery/SQL?

How to create dummy variable columns for thousands of categories in Google BigQuery?

Pivot Repeated fields in BigQuery

Q. Or should I just do the reshaping in Python?

If the above will not work for you, pivoting on the client is always an option, but then you should consider client-side limitations.

Hope this helped!

SQL Pivot in BigQuery

Consider the example below

select * from your_table
pivot (count(distinct UserID) for Status in ('opened', 'clicked'))

If applied to the sample data in your question:

with your_table as (
select '01' UserID, 'Campaign#1' CampaignName, 'opened' Status union all
select '01', 'Campaign#1', 'clicked' union all
select '01', 'Campaign#2', 'opened' union all
select '02', 'Campaign#1', 'opened' union all
select '02', 'Campaign#2', 'opened'
)
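Putting the two together, the full test query is:

with your_table as (
select '01' UserID, 'Campaign#1' CampaignName, 'opened' Status union all
select '01', 'Campaign#1', 'clicked' union all
select '01', 'Campaign#2', 'opened' union all
select '02', 'Campaign#1', 'opened' union all
select '02', 'Campaign#2', 'opened'
)
select * from your_table
pivot (count(distinct UserID) for Status in ('opened', 'clicked'))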

the output is

CampaignName  opened  clicked
Campaign#1    2       1
Campaign#2    2       0

Big Query Transpose

There is no nice way of doing this in BigQuery as of yet, but you can do it following the idea below.

Step 1

Run the query below:

SELECT 'SELECT [group], ' + 
GROUP_CONCAT_UNQUOTED(
'SUM(IF([date] = "' + [date] + '", value, NULL)) as [d_' + REPLACE([date], '/', '_') + ']'
)
+ ' FROM YourTable GROUP BY [group] ORDER BY [group]'
FROM (
SELECT [date] FROM YourTable GROUP BY [date] ORDER BY [date]
)

As a result, you will get a string like the one below (formatted here for readability's sake):

SELECT 
[group],
SUM(IF([date] = "date1", value, NULL)) AS [d_date1],
SUM(IF([date] = "date2", value, NULL)) AS [d_date2]
FROM YourTable
GROUP BY [group]
ORDER BY [group]

Step 2

Just run above composed query

The result will be like below:

group   d_date1  d_date2
group1  15       30

Note 1: Step 1 is helpful if you have many groups to pivot, which would otherwise be too much manual work. In this case, Step 1 helps you generate your query.

Note 2: these steps are easily implemented in any client of your choice, or you can just run them in the BigQuery Web UI.

You can see more about pivoting in my other posts.

How to scale Pivoting in BigQuery?

Please note: there is a limit of 10K columns per table, so you are limited to 10K pivoted columns.

You can also see the posts below as simplified examples (if the one above is too complex/verbose):

How to transpose rows to columns with large amount of the data in BigQuery/SQL?

How to create dummy variable columns for thousands of categories in Google BigQuery?

Pivot Repeated fields in BigQuery

How to transpose rows to columns with large amount of the data in BigQuery/SQL?


STEP #1

In the query below, replace yourTable with the real name of your table and execute/run it

SELECT 'SELECT CustomerID, ' + 
GROUP_CONCAT_UNQUOTED(
'MAX(IF(Feature = "' + STRING(Feature) + '", Value, NULL))'
)
+ ' FROM yourTable GROUP BY CustomerID'
FROM (SELECT Feature FROM yourTable GROUP BY Feature)

As a result you will get a string to be used in the next step!
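For illustration, with hypothetical feature names f1 and f2, that generated string (formatted for readability) would look like this:

SELECT CustomerID,
MAX(IF(Feature = "f1", Value, NULL)),
MAX(IF(Feature = "f2", Value, NULL))
FROM yourTable GROUP BY CustomerID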

STEP #2

Take the string you got from Step 1 and just execute it as a query.

The output is the pivot you asked about in your question.

Transpose rows into columns in BigQuery using standard sql

for BigQuery Standard SQL



#standardSQL
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid

You can play/test with the dummy data below from your question:

#standardSQL
WITH `mytable` AS (
SELECT 'A' AS uniqueid, 'Collection' AS page_flag, 1 AS order_of_pages UNION ALL
SELECT 'A', 'Product', 2 UNION ALL
SELECT 'A', 'Product', 3 UNION ALL
SELECT 'A', 'Login', 4 UNION ALL
SELECT 'A', 'Delivery', 5 UNION ALL
SELECT 'B', 'Clearance', 1 UNION ALL
SELECT 'B', 'Search', 2 UNION ALL
SELECT 'B', 'Product', 3 UNION ALL
SELECT 'C', 'Search', 1 UNION ALL
SELECT 'C', 'Collection', 2 UNION ALL
SELECT 'C', 'Product', 3
)
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid

result is

uniqueid  p1          p2          p3       p4     p5
A         Collection  Product     Product  Login  Delivery
B         Clearance   Search      Product  null   null
C         Search      Collection  Product  null   null

Depending on your needs, you can also consider the approach below (not a pivot, though)

#standardSQL
SELECT uniqueid,
STRING_AGG(page_flag, '>' ORDER BY order_of_pages) AS journey
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid

If run with the same dummy data as above, the result is

uniqueid  journey
A         Collection>Product>Product>Login>Delivery
B         Clearance>Search>Product
C         Search>Collection>Product

How to write the result of a BigQuery script to a table?

You can use the CREATE TABLE AS SELECT statement:

DECLARE my_dates STRING;

-- Build the PIVOT IN list as a single string, e.g. ("202101", "202102"),
-- from the table that holds the month values to pivot on
-- (assumes month is a STRING; adjust the formatting if it is a DATE)
SET my_dates = (
  SELECT CONCAT('(', STRING_AGG(DISTINCT FORMAT('"%t"', month), ', '), ')')
  FROM my_dataset.my_date_able
);

EXECUTE IMMEDIATE FORMAT("""
CREATE TABLE `<yourproject>.<yourdataset>.<target_table_name>` AS
SELECT *
FROM (
  SELECT x, y, month, SUM(things) AS num_things
  FROM my_dataset.my_data
  GROUP BY 1, 2, 3
)
PIVOT (
  SUM(num_things) AS s
  FOR month IN %s
)
""", my_dates);

BigQuery will infer the schema from the source table, create the table at the desired location, and fill it with the results from the query.
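For illustration, assuming hypothetical month values "202101" and "202102", the my_dates string would be ("202101", "202102") and the statement actually executed would be:

CREATE TABLE `<yourproject>.<yourdataset>.<target_table_name>` AS
SELECT *
FROM (
  SELECT x, y, month, SUM(things) AS num_things
  FROM my_dataset.my_data
  GROUP BY 1, 2, 3
)
PIVOT (
  SUM(num_things) AS s
  FOR month IN ("202101", "202102")
)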


