How to Parse XML Tags in BigQuery Standard SQL

Is there a way to parse XML tags in BigQuery Standard SQL?

Here is the documentation on how to use JavaScript UDFs in BigQuery, as Elliot mentioned.

https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions

I imagine the UDF might look something like this:

CREATE TEMPORARY FUNCTION XML(x STRING)
RETURNS STRING
  LANGUAGE js AS """
  var data = fromXML(x);
  return data.title;
"""
OPTIONS(
library="gs://<BUCKET_NAME>/from-xml.min.js"
);
SELECT XML(a) FROM UNNEST(["<title>Title of Page</title>"]) as a

Here from-xml.min.js comes from the from-xml JavaScript library and has been uploaded to a Cloud Storage bucket in your own GCS account.
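
For very simple, non-nested tags, a regular expression can sometimes stand in for the UDF. The query below is only a sketch of that alternative (it is not the library-based approach above, and it will not cope with attributes, nesting, or repeated tags); the sample string is the same one used in the UDF example.

```sql
#standardSQL
-- Sketch: pull the text between <title> and </title> with a non-greedy match.
-- Assumes the tag has no attributes and appears at most once per string.
SELECT
  a AS xml,
  REGEXP_EXTRACT(a, r'<title>(.*?)</title>') AS title
FROM UNNEST(['<title>Title of Page</title>']) AS a
```

REGEXP_EXTRACT returns NULL when the pattern does not match, so rows without a title tag come back as NULL rather than raising an error.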

How to Parse simple data in BigQuery

The example below is for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, '1/2' list UNION ALL
  SELECT 2, '1/3' UNION ALL
  SELECT 3, '10/20' UNION ALL
  SELECT 4, '15/' UNION ALL
  SELECT 5, '12/31' 
)
SELECT id, 
  SPLIT(list, '/')[SAFE_OFFSET(0)] AS first_element,
  SPLIT(list, '/')[SAFE_OFFSET(1)] AS second_element
FROM `project.dataset.table`
-- ORDER BY id

The result is:

Row id  first_element   second_element   
1   1   1               2    
2   2   1               3    
3   3   10              20   
4   4   15       
5   5   12              31
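
The query relies on SAFE_OFFSET returning NULL when the requested position does not exist, whereas OFFSET would raise an error. A minimal sketch of the edge cases follows, using hypothetical input values; SAFE_CAST is added here as an assumption about what you might want next (numeric output), and is not part of the original answer.

```sql
#standardSQL
-- Hypothetical inputs: a normal pair, a trailing slash, and no slash at all.
SELECT
  list,
  -- '' for '15/' (empty second part), NULL for '15' (no second part)
  SPLIT(list, '/')[SAFE_OFFSET(1)] AS second_element,
  -- SAFE_CAST turns '' or non-numeric text into NULL instead of failing
  SAFE_CAST(SPLIT(list, '/')[SAFE_OFFSET(1)] AS INT64) AS second_as_int
FROM UNNEST(['1/2', '15/', '15']) AS list
```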

BigQuery get columns from JSON file keys

The query below is for BigQuery Standard SQL:

#standardSQL 
SELECT 
  JSON_EXTRACT_SCALAR(line, '$.id') id,
  TRIM(SPLIT(aud_kv, ':')[OFFSET(0)], '"') audiences,
  TRIM(SPLIT(seg_kv, ':')[OFFSET(0)], '"') segments
FROM `project.dataset.table`,
UNNEST(SPLIT(TRIM(JSON_EXTRACT(line, '$.key1.key2.audiences'),'{}'))) aud_kv,
UNNEST(SPLIT(TRIM(JSON_EXTRACT(line, '$.key1.key2.segments'),'{}'))) seg_kv

Applied to the sample data from your question, the output is:

Row id      audiences   segments     
1   abcdefg aud1        seg1     
2   abcdefg aud1        seg2     
3   abcdefg aud1        seg3     
4   abcdefg aud1        seg4     
5   abcdefg aud2        seg1     
6   abcdefg aud2        seg2     
7   abcdefg aud2        seg3     
8   abcdefg aud2        seg4

SQL conditional aggregation?

The SQL standard (ANSI SQL, not to be confused with BigQuery Standard SQL) offers listagg() to aggregate strings. So this looks something like:

select name,
       listagg(case when virtual = 1 then message end, ',') within group (order by message)
from t
group by name;

However, most databases have different names (and syntax) for string aggregation, such as string_agg() or group_concat().

EDIT:

In BigQuery the syntax would be:

select name,
       string_agg(case when virtual = 1 then message end, ',')
from t
group by name;

That said, I would recommend array_agg() rather than string_agg().
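
A sketch of the array_agg() variant, assuming the same columns as above; the rows in the CTE are hypothetical sample data. IGNORE NULLS is included because the CASE expression yields NULL for non-matching rows, and BigQuery does not allow NULL elements in an array that is returned as a query result.

```sql
with t as (
  -- Hypothetical sample rows for illustration only
  select 'a' as name, 1 as virtual, 'm2' as message union all
  select 'a', 1, 'm1' union all
  select 'a', 0, 'm3' union all
  select 'b', 0, 'm4'
)
select name,
       -- Collect only messages where virtual = 1, dropping the NULLs
       array_agg(case when virtual = 1 then message end ignore nulls order by message) as messages
from t
group by name;
```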

How to read multiple levels of JSON data in BigQuery using JSON_EXTRACT or JSON_EXTRACT_SCALAR

The example below is for BigQuery Standard SQL:

#standardSQL
CREATE TEMP FUNCTION jsonparse(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  return JSON.parse(input).map(x=>JSON.stringify(x));
"""; 
WITH `project.lz.json_file` AS (
  SELECT '''{
  "Combos": [  {
    "Id": "1111",
    "Type": 0,
    "Description": "ABCD",
    "ComboDuration": {
      "StartDate": "2009-10-26T08:00:00",
      "EndDate": "2009-10-29T08:00:00"
    }  },  {
    "Id": "2222",
    "Type": 1,
    "Description": "XYZ",
    "ComboDuration": {
      "StartDate": "2019-10-26T08:00:00",
      "EndDate": "2019-10-29T08:00:00"
    }  },  {
    "Id": "39933",
    "Type": 3,
    "Description": "General",
    "ComboDuration": {
      "StartDate": "2019-10-26T08:00:00",
      "EndDate": "2019-10-29T08:00:00"
    }  },  {
    "Id": "39934",
    "Type": 2,
    "Description": "ABCDXYZ",
    "ComboDuration": {
      "StartDate": "2019-10-26T08:00:00",
      "EndDate": "2019-10-29T08:00:00"
    }  }]}  ''' AS conv_column
)
SELECT
  JSON_EXTRACT_SCALAR(combo, '$.Id') AS Id,
  JSON_EXTRACT_SCALAR(combo, '$.Type') AS Type,
  JSON_EXTRACT_SCALAR(combo, '$.Description') AS Description,
  JSON_EXTRACT_SCALAR(combo, '$.ComboDuration.StartDate') AS StartDate,
  JSON_EXTRACT_SCALAR(combo, '$.ComboDuration.EndDate') AS EndDate
FROM `project.lz.json_file`,
UNNEST(jsonparse(JSON_EXTRACT(conv_column, '$.Combos'))) combo

with this output:

Row Id      Type    Description StartDate           EndDate  
1   1111    0       ABCD        2009-10-26T08:00:00 2009-10-29T08:00:00  
2   2222    1       XYZ         2019-10-26T08:00:00 2019-10-29T08:00:00  
3   39933   3       General     2019-10-26T08:00:00 2019-10-29T08:00:00  
4   39934   2       ABCDXYZ     2019-10-26T08:00:00 2019-10-29T08:00:00
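
BigQuery also provides a built-in JSON_EXTRACT_ARRAY function, which returns the elements of a JSON array as an ARRAY<STRING>. Assuming it is available in your environment, the JavaScript UDF may not be needed at all. The sketch below uses a shortened, hypothetical version of the sample document.

```sql
#standardSQL
WITH `project.lz.json_file` AS (
  -- Shortened sample with the same shape as the document above
  SELECT '''{"Combos": [
    {"Id": "1111", "Type": 0, "ComboDuration": {"StartDate": "2009-10-26T08:00:00"}},
    {"Id": "2222", "Type": 1, "ComboDuration": {"StartDate": "2019-10-26T08:00:00"}}
  ]}''' AS conv_column
)
SELECT
  JSON_EXTRACT_SCALAR(combo, '$.Id') AS Id,
  JSON_EXTRACT_SCALAR(combo, '$.Type') AS Type,
  JSON_EXTRACT_SCALAR(combo, '$.ComboDuration.StartDate') AS StartDate
FROM `project.lz.json_file`,
-- Each array element comes back as a JSON-formatted string
UNNEST(JSON_EXTRACT_ARRAY(conv_column, '$.Combos')) AS combo
```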

Convert HTML characters to unicode in BigQuery

The following general technique works:

1. Split the text into individual characters, treating an HTML entity such as &#x1F61C; as a single character.
2. Keep track of each character's position with OFFSET.
3. Rejoin all characters in order, using BigQuery STRING functions to replace each HTML entity with its Unicode character.

SELECT
  id,
  ANY_VALUE(text) AS original,
  STRING_AGG(
    COALESCE(
      -- Support hex codepoints
      CODE_POINTS_TO_STRING(
        [CAST(CONCAT('0x', REGEXP_EXTRACT(char, r'(?:&#x)(\w+)(?:;)')) AS INT64)]
      ),
      -- Support decimal codepoints
      CODE_POINTS_TO_STRING(
        [CAST(CONCAT('0x', FORMAT('%x', CAST(REGEXP_EXTRACT(char, r'(?:&#)(\d+)(?:;)') AS INT64))) AS INT64)]
      ),
      -- Fall back to the character itself
      char
    ),
  '' ORDER BY char_position) AS text
FROM UNNEST([
  STRUCT(1 AS id, 'Hello World &#x1F61C;' AS text),
  STRUCT(2 AS id, 'Yes &#x1F61C; It works great &#x1F61C;'),
  STRUCT(3 AS id, '&#8212;' AS text),
  STRUCT(4 AS id, '&#x2014;' AS text)
])
CROSS JOIN
  -- Extract all characters individually except for HTML entity characters
  UNNEST(REGEXP_EXTRACT_ALL(text, r'(&#\w+;|.)')) char WITH OFFSET AS char_position
GROUP BY id
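
The conversion step in the query above rests on two building blocks: casting a '0x'-prefixed hex string to INT64, and turning a code point into a character with CODE_POINTS_TO_STRING. A minimal sketch of both in isolation, using the em dash (U+2014, decimal 8212) as an illustrative value:

```sql
#standardSQL
SELECT
  -- BigQuery can cast a '0x'-prefixed hex string to an integer
  CAST('0x2014' AS INT64) AS hex_as_int,
  -- Hex route: entity text -> integer code point -> character
  CODE_POINTS_TO_STRING([CAST('0x2014' AS INT64)]) AS from_hex,
  -- Decimal route: the code point can be passed directly
  CODE_POINTS_TO_STRING([8212]) AS from_decimal
```

Both from_hex and from_decimal should contain the same single em dash character, which is why the main query can treat hex and decimal entities as two branches of one COALESCE.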

Best way to unnest and select column if table has repeated record column which itself contains many repeated record column