Reshape from Wide to Long in Big Query (Standard Sql)

Reshape from wide to long in big query (standard SQL)

Below is for BigQuery Standard SQL and does not require repeating selects depends on number of columns. It will pick as many as you have and transform them into metrics and values

#standardSQL
SELECT DATE, country,
metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT DATE, country,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM `project.dataset.yourtable` t,
UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

You can test / play with above using dummy data as in your question

#standardSQL
WITH `project.dataset.yourtable` AS (
SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT DATE, country,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM `project.dataset.yourtable` t,
UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

result is as expected

DATE        country metric  value    
20170928 CH A 3000.3
20170928 CH B 121.0
20170928 CH C 13.0
20170928 CH D 3200.0
20170929 CH A 2800.31
20170929 CH B 137.0
20170929 CH C 23.0
20170929 CH D 1614.31

BigQuery reshape table with structs from wide to long

Below is for BigQuery Standard SQL

#standardSQL
WITH wide_stats AS (
SELECT 'joe' name, 'bills' team,
STRUCT(STRUCT(7 AS fga) AS o, STRUCT(8 AS fga) AS d) AS t1,
STRUCT(STRUCT(3 AS fga) AS o, STRUCT(9 AS fga) AS d) AS t2,
STRUCT(3 AS pts, 9 AS ast, 5 AS reb) AS t3, 7 tov, 3 blk UNION ALL
SELECT 'nick' name, 'jets' team,
STRUCT(STRUCT(12 AS fga) AS o, STRUCT(13 AS fga) AS d) AS t1,
STRUCT(STRUCT(15 AS fga) AS o, STRUCT(22 AS fga) AS d) AS t2,
STRUCT(13 AS pts, 5 AS ast, 15 AS reb) AS t3, 75 tov, 23 blk
), flat_stats AS (
SELECT name, team,
t1.o.fga AS t1_o_fga,
t1.d.fga AS t1_d_fga,
t2.o.fga AS t2_o_fga,
t2.d.fga AS t2_d_fga,
t3.pts AS t3_pts,
t3.ast AS t3_ast,
t3.reb AS t3_reb,
tov, blk
FROM wide_stats
)
SELECT name, team, metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT name, team,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM flat_stats,
UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(flat_stats), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('name', 'team')

with output

Row name    team    metric      value    
1 joe bills t1_o_fga 7.0
2 joe bills t1_d_fga 8.0
3 joe bills t2_o_fga 3.0
4 joe bills t2_d_fga 9.0
5 joe bills t3_pts 3.0
6 joe bills t3_ast 9.0
7 joe bills t3_reb 5.0
8 joe bills tov 7.0
9 joe bills blk 3.0
10 nick jets t1_o_fga 12.0
11 nick jets t1_d_fga 13.0
12 nick jets t2_o_fga 15.0
13 nick jets t2_d_fga 22.0
14 nick jets t3_pts 13.0
15 nick jets t3_ast 5.0
16 nick jets t3_reb 15.0
17 nick jets tov 75.0
18 nick jets blk 23.0

If for some reason you have problem with assembling flat_stats temp table manually - you can do a small trick like below

Step 1 - Just run below query in legacy mode with destination table [project:dataset.flat_stats]

#legacySQL
SELECT *
FROM [project:dataset.wide_stats]

"Surprisingly", this will create table [project:dataset.flat_stats] with below structure

Row name    team    t1_o_fga    t1_d_fga    t2_o_fga    t2_d_fga    t3_pts  t3_ast  t3_reb  tov blk  
1 joe bills 7 8 3 9 3 9 5 7 3
2 nick jets 12 13 15 22 13 5 15 75 23

Step 2 - After that you can simply run below (now in Standard SQL)

#standardSQL
SELECT name, team, metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT name, team,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM `project.dataset.flat_stats` flat_stats,
UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(flat_stats), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('name', 'team')

Transpose rows into columns in BigQuery using standard sql

for BigQuery Standard SQL



#standardSQL
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid

You can play/test with below dummy data from your question

#standardSQL
WITH `mytable` AS (
SELECT 'A' AS uniqueid, 'Collection' AS page_flag, 1 AS order_of_pages UNION ALL
SELECT 'A', 'Product', 2 UNION ALL
SELECT 'A', 'Product', 3 UNION ALL
SELECT 'A', 'Login', 4 UNION ALL
SELECT 'A', 'Delivery', 5 UNION ALL
SELECT 'B', 'Clearance', 1 UNION ALL
SELECT 'B', 'Search', 2 UNION ALL
SELECT 'B', 'Product', 3 UNION ALL
SELECT 'C', 'Search', 1 UNION ALL
SELECT 'C', 'Collection', 2 UNION ALL
SELECT 'C', 'Product', 3
)
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid

result is

uniqueid    p1          p2          p3      p4      p5   
A Collection Product Product Login Delivery
B Clearance Search Product null null
C Search Collection Product null null

Depends on your needs you can also consider below approach (not pivot though)

#standardSQL
SELECT uniqueid,
STRING_AGG(page_flag, '>' ORDER BY order_of_pages) AS journey
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid

if to run with same dummy data as above - result is

uniqueid    journey  
A Collection>Product>Product>Login>Delivery
B Clearance>Search>Product
C Search>Collection>Product

Mysql, reshape data from long / tall to wide

Cross-tabs or pivot tables is the answer. From there you can SELECT FROM ... INSERT INTO ... or create a VIEW from the single SELECT.

Something like:

SELECT country, 
MAX( IF( key='President', value, NULL ) ) AS President,
MAX( IF( key='Currency', value, NULL ) ) AS Currency,
...

FROM table
GROUP BY country;

Transform long rows to wide, filling all cells

In order to get the current location for each business_id for any given year you need two things:

  1. A parameterized query to select the year, implemented as a SQL language function.
  2. A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.

The function then looks like this:

CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
WITH last_move AS (
SELECT business_id, MAX(year_move) AS yr
FROM business_moves
WHERE year_move <= $1
GROUP BY business_id)
SELECT lm.business_id, $1::int AS yr, longitude, latitude
FROM business_moves bm, last_move lm
WHERE bm.business_id = lm.business_id
AND bm.year_move = lm.yr;
$$ LANGUAGE sql;

The sub-query selects only the most recent moves for every business location. The main query then adds the longitude and latitude columns and put the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need to have a record in this table that gives the establishment and initial location of each business_id or it will not show up until after it has moved somewhere else.

Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.

If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.

Transpose rows into columns in BigQuery (Pivot implementation)

BigQuery does not support yet pivoting functions

You still can do this in BigQuery using below approach

But first, in addition to two columns in input data you must have one more column that would specify groups of rows in input that needs to be combined into one row in output

So, I assume your input table (yourTable) looks like below

**id**  **Key**                  **Value**
1 channel_title Mahendra Guru
1 youtube_id ugEGMG4-MdA
1 channel_id UCiDKcjKocimAO1tV
1 examId 72975611-4a5e-11e5
1 postId 1189e340-b08f

2 channel_title Ab Live
2 youtube_id 3TNbtTwLY0U
2 channel_id UCODeKM_D6JLf8jJt
2 examId 72975611-4a5e-11e5
2 postId 0c3e6590-afeb

So, first you should run below query

SELECT 'SELECT id, ' + 
GROUP_CONCAT_UNQUOTED(
'MAX(IF(key = "' + key + '", value, NULL)) as [' + key + ']'
)
+ ' FROM yourTable GROUP BY id ORDER BY id'
FROM (
SELECT key
FROM yourTable
GROUP BY key
ORDER BY key
)

Result of above query will be string that (if to format) will look like below

SELECT 
id,
MAX(IF(key = "channel_id", value, NULL)) AS [channel_id],
MAX(IF(key = "channel_title", value, NULL)) AS [channel_title],
MAX(IF(key = "examId", value, NULL)) AS [examId],
MAX(IF(key = "postId", value, NULL)) AS [postId],
MAX(IF(key = "youtube_id", value, NULL)) AS [youtube_id]
FROM yourTable
GROUP BY id
ORDER BY id

you should now copy above result (note: you don't really need to format it - i did it for presenting only) and run it as normal query

Result will be as you would expected

id  channel_id          channel_title   examId              postId          youtube_id   
1 UCiDKcjKocimAO1tV Mahendra Guru 72975611-4a5e-11e5 1189e340-b08f ugEGMG4-MdA
2 UCODeKM_D6JLf8jJt Ab Live 72975611-4a5e-11e5 0c3e6590-afeb 3TNbtTwLY0U

Please note: you can skip Step 1 if you can construct proper query (as in step 2) by yourself and number of fields small and constant or if it is one time deal. But Step 1 just helper step that makes it for you, so you can create it fast any time!

If you are interested - you can see more about pivoting in my other posts.

How to scale Pivoting in BigQuery?

Please note – there is a limitation of 10K columns per table - so you are limited with 10K organizations.

You can also see below as simplified examples (if above one is too complex/verbose):

How to transpose rows to columns with large amount of the data in BigQuery/SQL?

How to create dummy variable columns for thousands of categories in Google BigQuery?

Pivot Repeated fields in BigQuery

How to convert columns to rows in Google BigQuery

How do I reverse pivot this so I can achieve the output I'm looking for.

Assuming you have table `data`

col1    col2    col3
---- ---- ----
null null null
null null 1
null 1 1
1 null 1
1 null 1
1 null 1

And you need reverse pivot it to

column      count_of_errors  
-------- ---------------
cnt_col1 3
cnt_col2 1
cnt_col3 5

Below is for BigQuery Standard SQL and does exactly this



#standardSQL
WITH `data` AS (
SELECT NULL AS col1, NULL AS col2, NULL AS col3 UNION ALL
SELECT NULL, NULL, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT NULL, 1, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT 1, NULL, 1
)
SELECT r.* FROM (
SELECT
[
STRUCT<column STRING, count_of_errors INT64>
('cnt_col1', SUM(col1)),
('cnt_col2', SUM(col2)),
('cnt_col3', SUM(col3))
] AS row
FROM `data`
), UNNEST(row) AS r

It is simple enough and friendly for adjusting to any number of columns you potentially have in your initial `data` table - you just need to add respective number of ('cnt_colN', SUM(colN)), - which can be done manually or you can just write simple script to generate those lines (or whole query)

Pivot data in BigQuery SQL?

Q. Is this possible in the same BigQuery call? Or should I create a new
table and pivot it from there?

In general, you can use that “complicated query” as a subquery for extra logic to be applied to your current result.
So, it is definitely doable. But code can quickly become un-manageable or hard to manage – so you can consider writing this result into new table and then pivot it from there

If you stuck with direction of doing pivot (the way you described in your question) - check below link to see detailed intro on how you can implement pivot within BigQuery.

How to scale Pivoting in BigQuery?

Please note – there is a limitation of 10K columns per table - so you are limited with 10K organizations.

You can also see below as simplified examples (if above one is too complex/verbose):

How to transpose rows to columns with large amount of the data in BigQuery/SQL?

How to create dummy variable columns for thousands of categories in Google BigQuery?

Pivot Repeated fields in BigQuery

Q. Or should I just do the reshaping in Python?

If above will not work for you – pivoting on client is always an option but now you should consider client side limitations

Hope this helped!

Bigquery SQL: convert array to columns

The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.

Cross-joining to turn array fields into long-format tables

I think the first thing you're going to want to do is get the values out of the arrays and into rows.

Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:

WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)

SELECT * FROM long_format

UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.

This will yields results like

 name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1

Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:

WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)

SELECT * FROM long_format

This is more compact but may be confusing if you haven't seen it before.

Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length.

Pivot tables in BigQuery

The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:

WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)

SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)

This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).

Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.

The query yields results like:

name    vals_0  vals_1  vals_2  vals_3  vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1

This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.

Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).

Dynamically building/executing the pivot

And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.

What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!

(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)

This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.

Recall that || can be used as a string concatenation operator.

EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)

SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""

Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (pivoted values of offset):

(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)

This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).

REALLY dynamically executing the pivot

It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.

Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with

(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)

But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.

Yet all is not lost. This will work just fine:

SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)

... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.

If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.


As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.



Related Topics



Leave a reply



Submit