How to Parse XML Tags in BigQuery Standard SQL

Is there a way to parse XML tags in BigQuery Standard SQL?

Here is the documentation on how to use JavaScript UDFs in BigQuery, as Elliot mentioned:

https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions

I imagine the UDF might look something like this:

CREATE TEMPORARY FUNCTION XML(x STRING)
RETURNS STRING
LANGUAGE js AS """
var data = fromXML(x);
return data.title;
"""
OPTIONS(
library="gs://<BUCKET_NAME>/from-xml.min.js"
);
SELECT XML(a) FROM UNNEST(["<title>Title of Page</title>"]) as a

Here, from-xml.min.js comes from this library and has been uploaded to your GCS bucket.
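
If uploading a library is not an option, a small regex-based function may be enough for the simple, un-nested case shown above. The following is only a sketch: extractTag is a hypothetical name, and it is not a real XML parser (it ignores CDATA, entities, self-closing tags, and nested tags of the same name). The function body could be adapted into a JavaScript UDF.

```javascript
// Hypothetical helper: returns the text of the first <tag>...</tag> pair, or null.
// Not a real XML parser: CDATA, entities, and same-name nesting are not handled.
function extractTag(xml, tag) {
  // Escape regex metacharacters so the tag name is matched literally.
  const safeTag = tag.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Allow optional attributes after the tag name; [\s\S] also matches newlines.
  const pattern = new RegExp(
    '<' + safeTag + '(?:\\s[^>]*)?>([\\s\\S]*?)</' + safeTag + '>'
  );
  const match = pattern.exec(xml);
  return match ? match[1] : null;
}

console.log(extractTag('<title>Title of Page</title>', 'title')); // prints "Title of Page"
```

For anything beyond flat, well-behaved input, a proper parser library such as the one used above is the safer choice.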

How to Parse simple data in BigQuery

The example below is for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, '1/2' list UNION ALL
SELECT 2, '1/3' UNION ALL
SELECT 3, '10/20' UNION ALL
SELECT 4, '15/' UNION ALL
SELECT 5, '12/31'
)
SELECT id,
SPLIT(list, '/')[SAFE_OFFSET(0)] AS first_element,
SPLIT(list, '/')[SAFE_OFFSET(1)] AS second_element
FROM `project.dataset.table`
-- ORDER BY id

with the result below:

Row  id  first_element  second_element
1    1   1              2
2    2   1              3
3    3   10             20
4    4   15
5    5   12             31
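
The reason for SAFE_OFFSET rather than OFFSET is that an out-of-range index returns NULL instead of raising an error. The JavaScript sketch below mimics that behaviour outside BigQuery; splitSafeOffset is a made-up name used only for illustration.

```javascript
// Mimics SPLIT(list, delimiter)[SAFE_OFFSET(index)]:
// an out-of-range index yields null instead of an error.
function splitSafeOffset(list, delimiter, index) {
  const parts = list.split(delimiter);
  return index >= 0 && index < parts.length ? parts[index] : null;
}

console.log(splitSafeOffset('10/20', '/', 1)); // prints "20"
console.log(splitSafeOffset('15', '/', 1));    // prints null (no second element)
```

Note that '15/' still has a second element, the empty string, which is why row 4 above shows a blank rather than a NULL.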

BigQuery get columns from JSON file keys

Below is for BigQuery Standard SQL

#standardSQL 
SELECT
JSON_EXTRACT_SCALAR(line, '$.id') id,
TRIM(SPLIT(aud_kv, ':')[OFFSET(0)], '"') audiences,
TRIM(SPLIT(seg_kv, ':')[OFFSET(0)], '"') segments
FROM `project.dataset.table`,
UNNEST(SPLIT(TRIM(JSON_EXTRACT(line, '$.key1.key2.audiences'),'{}'))) aud_kv,
UNNEST(SPLIT(TRIM(JSON_EXTRACT(line, '$.key1.key2.segments'),'{}'))) seg_kv

Applied to the sample data from your question, the output is:

Row  id       audiences  segments
1    abcdefg  aud1       seg1
2    abcdefg  aud1       seg2
3    abcdefg  aud1       seg3
4    abcdefg  aud1       seg4
5    abcdefg  aud2       seg1
6    abcdefg  aud2       seg2
7    abcdefg  aud2       seg3
8    abcdefg  aud2       seg4
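
The query above splits the raw JSON text on commas and colons, which can break if a value itself contains either character. As an alternative sketch, the key names can be read from the parsed object, for example inside a JavaScript UDF returning ARRAY<STRING>. The helper name keysAtPath and the sample line below are assumptions made for illustration, not taken from the question.

```javascript
// Hypothetical helper: returns the key names of the object found at a dotted path,
// or an empty array if the path does not lead to an object.
function keysAtPath(line, path) {
  let node = JSON.parse(line);
  for (const key of path.split('.')) {
    if (node === null || typeof node !== 'object') return [];
    node = node[key];
  }
  return node !== null && typeof node === 'object' ? Object.keys(node) : [];
}

// Made-up sample shaped like the paths used in the query above.
const line = '{"id":"abcdefg","key1":{"key2":{' +
  '"audiences":{"aud1":{"a":1,"b":2},"aud2":{}},' +
  '"segments":{"seg1":1,"seg2":2}}}}';

console.log(keysAtPath(line, 'key1.key2.audiences')); // prints [ 'aud1', 'aud2' ]
```

The two resulting arrays would then be unnested and cross joined in SQL, just as aud_kv and seg_kv are in the query above.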

SQL conditional aggregation?

The SQL standard offers listagg() to aggregate strings, so the query looks something like this:

select name,
listagg(case when virtual = 1 then message end, ',') within group (order by message)
from t
group by name;

However, most databases have different names (and syntax) for string aggregation, such as string_agg() or group_concat().

EDIT:

In BigQuery, the syntax would be:

select name,
string_agg(case when virtual = 1 then message end, ',')
from t
group by name;

That said, I would recommend array_agg() rather than string_agg().
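
To make clear what the conditional aggregate computes, here is a rough JavaScript imitation of the string_agg() query. The CASE expression has no ELSE, so non-matching rows become NULL, and the aggregate skips NULLs. The function name and sample rows are invented for illustration; unlike SQL without ORDER BY, this sketch keeps input order.

```javascript
// Mimics: SELECT name, STRING_AGG(CASE WHEN virtual = 1 THEN message END, ',')
//         FROM t GROUP BY name
function conditionalStringAgg(rows) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row.name)) groups.set(row.name, []);
    // Rows failing the condition contribute NULL, which the aggregate ignores.
    if (row.virtual === 1 && row.message !== null) {
      groups.get(row.name).push(row.message);
    }
  }
  // A group with no non-NULL inputs aggregates to NULL, not to an empty string.
  return [...groups].map(([name, msgs]) => ({
    name,
    messages: msgs.length > 0 ? msgs.join(',') : null,
  }));
}

// Made-up rows for the table t.
const rows = [
  { name: 'a', virtual: 1, message: 'x' },
  { name: 'a', virtual: 0, message: 'y' },
  { name: 'a', virtual: 1, message: 'z' },
  { name: 'b', virtual: 0, message: 'w' },
];
console.log(conditionalStringAgg(rows));
```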

How to read multiple levels of JSON data in BigQuery using JSON_EXTRACT or JSON_EXTRACT_SCALAR

The example below is for BigQuery Standard SQL:

#standardSQL
CREATE TEMP FUNCTION jsonparse(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(input).map(x=>JSON.stringify(x));
""";
WITH `project.lz.json_file` AS (
SELECT '''{
"Combos": [ {
"Id": "1111",
"Type": 0,
"Description": "ABCD",
"ComboDuration": {
"StartDate": "2009-10-26T08:00:00",
"EndDate": "2009-10-29T08:00:00"
} }, {
"Id": "2222",
"Type": 1,
"Description": "XYZ",
"ComboDuration": {
"StartDate": "2019-10-26T08:00:00",
"EndDate": "2019-10-29T08:00:00"
} }, {
"Id": "39933",
"Type": 3,
"Description": "General",
"ComboDuration": {
"StartDate": "2019-10-26T08:00:00",
"EndDate": "2019-10-29T08:00:00"
} }, {
"Id": "39934",
"Type": 2,
"Description": "ABCDXYZ",
"ComboDuration": {
"StartDate": "2019-10-26T08:00:00",
"EndDate": "2019-10-29T08:00:00"
} }]} ''' AS conv_column
)
SELECT
JSON_EXTRACT_SCALAR(combo, '$.Id') AS Id,
JSON_EXTRACT_SCALAR(combo, '$.Type') AS Type,
JSON_EXTRACT_SCALAR(combo, '$.Description') AS Description,
JSON_EXTRACT_SCALAR(combo, '$.ComboDuration.StartDate') AS StartDate,
JSON_EXTRACT_SCALAR(combo, '$.ComboDuration.EndDate') AS EndDate
FROM `project.lz.json_file`,
UNNEST(jsonparse(JSON_EXTRACT(conv_column, '$.Combos'))) combo

with output:

Row  Id     Type  Description  StartDate            EndDate
1    1111   0     ABCD         2009-10-26T08:00:00  2009-10-29T08:00:00
2    2222   1     XYZ          2019-10-26T08:00:00  2019-10-29T08:00:00
3    39933  3     General      2019-10-26T08:00:00  2019-10-29T08:00:00
4    39934  2     ABCDXYZ      2019-10-26T08:00:00  2019-10-29T08:00:00
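
The jsonparse UDF is what turns one JSON array into several rows: it parses the array and re-serializes each element as its own JSON string, so that UNNEST yields one string per combo. The snippet below runs the same function body outside BigQuery on a shortened, made-up input.

```javascript
// Same body as the jsonparse UDF above, wrapped in a plain function.
function jsonparse(input) {
  return JSON.parse(input).map(x => JSON.stringify(x));
}

// Shortened sample standing in for JSON_EXTRACT(conv_column, '$.Combos').
const combos = jsonparse('[{"Id":"1111","Type":0},{"Id":"2222","Type":1}]');

console.log(combos.length); // prints 2
console.log(combos[0]);     // prints {"Id":"1111","Type":0}
```

Each element is again a JSON string, which is why JSON_EXTRACT_SCALAR can be applied to combo in the SELECT list.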

Convert HTML characters to unicode in BigQuery

The following general technique works:

  • Split the text into individual characters, treating each HTML entity such as &#128540; as a single character
  • Keep track of each character's position with OFFSET
  • Rejoin all characters in order, using BigQuery STRING functions to replace each HTML entity with its Unicode character

SELECT
id,
ANY_VALUE(text) AS original,
STRING_AGG(
COALESCE(
-- Support hex codepoints
CODE_POINTS_TO_STRING(
[CAST(CONCAT('0x', REGEXP_EXTRACT(char, r'(?:&#x)(\w+)(?:;)')) AS INT64)]
),
-- Support decimal codepoints
CODE_POINTS_TO_STRING(
[CAST(CONCAT('0x', FORMAT('%x', CAST(REGEXP_EXTRACT(char, r'(?:&#)(\d+)(?:;)') AS INT64))) AS INT64)]
),
-- Fall back to the character itself
char
),
'' ORDER BY char_position) AS text
FROM UNNEST([
STRUCT(1 AS id, 'Hello World &#128540;' AS text),
STRUCT(2 AS id, 'Yes &#128540; It works great &#128540;'),
STRUCT(3 AS id, '&#8212;' AS text),
STRUCT(4 AS id, '&#x2014;' AS text)
])
CROSS JOIN
-- Extract all characters individually except for HTML entity characters
UNNEST(REGEXP_EXTRACT_ALL(text, r'(&#\w+;|.)')) char WITH OFFSET AS char_position
GROUP BY id
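
The same technique can be sketched in JavaScript, which might serve as the body of a JavaScript UDF if the UDF runtime supports String.fromCodePoint (an assumption worth checking). The function name decodeNumericEntities is hypothetical, and only numeric entities are handled; named entities such as &amp;amp; are left untouched.

```javascript
// Hypothetical helper: decodes decimal (&#NNN;) and hex (&#xHHH;) HTML entities.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, (entity, code) => {
    const codePoint = code[0] === 'x'
      ? parseInt(code.slice(1), 16) // hex form, e.g. &#x2014;
      : parseInt(code, 10);         // decimal form, e.g. &#8212;
    // Leave the entity unchanged if it is not a valid Unicode code point.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : entity;
  });
}

console.log(decodeNumericEntities('&#8212; and &#x2014;'));
```

Because the replacement happens in place, there is no need to track character positions and rejoin, which the pure SQL version has to do with OFFSET and STRING_AGG.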

Best way to unnest and select column if table has repeated record column which itself contains many repeated record column

Below is for BigQuery Standard SQL

#standardSQL
SELECT
ANY_VALUE(sku),
SUM((SELECT SUM(cost) FROM f.unit)),
SUM((SELECT SUM(fee) FROM f.product))
FROM nonpii_air_ticketed.test,
UNNEST(fan) f

