BigQuery SQL: Average, Geometric Mean, Remove Outliers, Median

How to calculate the geometric mean, taking into account the weight for each item in the sample using BigQuery?

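The formula in your question is equivalent to the following one (the weighted geometric mean written with logarithms):

exp( sum(w_i * ln(x_i)) / sum(w_i) )

where x_i is the value of each item (division) and w_i is its weight (subs). It can easily be coded as in the example below: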

select exp(sum(mass.subs * ln(mass.division)) / sum(mass.subs ))
from data

Applied to the sample data in your question:

with data as (
  SELECT STRUCT(
    cast(JSON_EXTRACT_SCALAR(mass, '$.subs_sum') as float64) AS subs,
    cast(JSON_EXTRACT_SCALAR(mass, '$.division') as float64) AS division
  ) as mass
  FROM UNNEST ([
    '{"subs_sum": "188292","division": "0.7708596151869399"}',
    '{"subs_sum": "1182","division": "0.8344408128719736"}',
    '{"subs_sum": "142559","division": "0.9539818702339475"}',
    '{"subs_sum": "14047","division": "0.7836811141666864"}',
    '{"subs_sum": "70344","division": "0.7724158684628387"}',
    '{"subs_sum": "101516","division": "0.8676896770665041"}',
    '{"subs_sum": "12459","division": "0.8029440607145902"}',
    '{"subs_sum": "26070","division": "0.9793106723267602"}',
    '{"subs_sum": "151959","division": "0.839048212451375"}',
    '{"subs_sum": "5234","division": "0.684263034290403"}'
  ]) mass
)
select exp(sum(mass.subs * ln(mass.division)) / sum(mass.subs))
from data

The output is a single row with one value, the weighted geometric mean of division:

[image: query result]
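
Since LN is only defined for positive values, rows with a zero or negative value need to be excluded (or handled separately) before aggregating. Below is a minimal sketch of that, which also computes the weighted arithmetic mean next to the weighted geometric mean so the two can be compared. The three-row sample is hypothetical (made-up values, flat columns instead of the mass struct); only the column names mirror the example above.

```sql
-- Hypothetical sample data, for illustration only
WITH data AS (
  SELECT * FROM UNNEST([
    STRUCT(10.0 AS subs, 0.5 AS division),
    STRUCT(30.0, 0.8),
    STRUCT(60.0, 0.9)
  ])
)
SELECT
  -- weighted arithmetic mean: sum(w * x) / sum(w)
  SUM(subs * division) / SUM(subs) AS weighted_avg,
  -- weighted geometric mean: exp(sum(w * ln(x)) / sum(w))
  EXP(SUM(subs * LN(division)) / SUM(subs)) AS weighted_geo_mean
FROM data
-- drop rows where the logarithm is undefined, so that both the
-- numerator and the denominator are built from the same rows
WHERE division > 0;
```

Filtering in the WHERE clause, rather than nulling out only the logarithm, keeps the weights of the excluded rows out of the denominator as well. For positive values, the weighted geometric mean is never larger than the weighted arithmetic mean.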

BigQuery - Moving median calculation

Here is an approach that might work:

CREATE TEMP FUNCTION MEDIAN(arr ANY TYPE) AS ((
  SELECT
    IF(
      MOD(ARRAY_LENGTH(arr), 2) = 0,
      -- even length: average the two middle elements
      (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
      -- odd length: take the middle element
      arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
    )
  -- sort the input by value first; the window below collects it in Month order
  FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));

SELECT
  Company,
  Month,
  MEDIAN(
    -- the current month plus up to 11 preceding months, per company
    ARRAY_AGG(Sales) OVER (PARTITION BY Company ORDER BY Month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW)
  ) AS trailing_median
FROM (
  SELECT 'Adidas' AS Company, '2018-09' AS Month, 100 AS Sales UNION ALL
  SELECT 'Adidas', '2018-08', 95 UNION ALL
  SELECT 'Adidas', '2018-07', 120 UNION ALL
  SELECT 'Adidas', '2018-06', 155
);

The results are:

+---------+---------+-----------------+
| Company | Month   | trailing_median |
+---------+---------+-----------------+
| Adidas  | 2018-06 |           155.0 |
| Adidas  | 2018-07 |           137.5 |
| Adidas  | 2018-08 |           120.0 |
| Adidas  | 2018-09 |           110.0 |
+---------+---------+-----------------+
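
If you only need one median per company rather than a trailing one, the built-in PERCENTILE_CONT may be enough and no UDF is needed. Here is a sketch on the same sample rows. As far as I know, PERCENTILE_CONT does not accept ORDER BY or a window frame in its OVER clause, which is why it is not used for the moving median above.

```sql
SELECT DISTINCT
  Company,
  -- 0.5 = 50th percentile, linearly interpolated between the two middle values
  PERCENTILE_CONT(Sales, 0.5) OVER (PARTITION BY Company) AS median_sales
FROM (
  SELECT 'Adidas' AS Company, '2018-09' AS Month, 100 AS Sales UNION ALL
  SELECT 'Adidas', '2018-08', 95 UNION ALL
  SELECT 'Adidas', '2018-07', 120 UNION ALL
  SELECT 'Adidas', '2018-06', 155
);
```

PERCENTILE_CONT is an analytic function, so it returns the same value on every row of the partition; SELECT DISTINCT collapses that to one row per company. For these four rows the interpolated median is 110, the same value as the 2018-09 row above, whose window covers all four months.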

Calculating time_diff between rows with same id in Google BigQuery

There are a lot of nulls in the bikeid column. You are seeing nulls because ASC order returns nulls first.
There are a few options you can choose from:
• You can change your ORDER BY clause to DESC on bikeid:
SELECT bikeid,
       DATE_DIFF(date(start_time), date(prev_start_time), day) AS Tempo,
       OrderCount
FROM ( SELECT bikeid,
              start_time,
              ROW_NUMBER() OVER(PARTITION BY bikeid ORDER BY start_time ASC) OrderCount,
              LAG(start_time) OVER(PARTITION BY bikeid ORDER BY start_time ASC) prev_start_time
       FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
     )
ORDER BY bikeid desc, start_time
• You can remove rows with a null bikeid by adding the WHERE clause "where bikeid is not null":
SELECT bikeid,
       DATE_DIFF(date(start_time), date(prev_start_time), day) AS Tempo,
       OrderCount
FROM ( SELECT bikeid,
              start_time,
              ROW_NUMBER() OVER(PARTITION BY bikeid ORDER BY start_time ASC) OrderCount,
              LAG(start_time) OVER(PARTITION BY bikeid ORDER BY start_time ASC) prev_start_time
       FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
       where bikeid is not null
     )
ORDER BY OrderCount desc, bikeid desc, start_time
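
A further option, if the null rows should stay in the result, is to keep the ascending order and push the nulls to the end with NULLS LAST. The sketch below uses a hypothetical inline sample instead of the public table (the ids and timestamps are made up) and assumes start_time is a TIMESTAMP, so TIMESTAMP_DIFF can report the gap in hours rather than whole days.

```sql
-- Hypothetical sample rows, for illustration only
WITH trips AS (
  SELECT * FROM UNNEST([
    STRUCT('A' AS bikeid, TIMESTAMP '2020-01-01 08:00:00' AS start_time),
    STRUCT('A', TIMESTAMP '2020-01-03 09:30:00'),
    STRUCT(CAST(NULL AS STRING), TIMESTAMP '2020-01-02 10:00:00'),
    STRUCT('B', TIMESTAMP '2020-01-05 12:00:00')
  ])
)
SELECT
  bikeid,
  start_time,
  -- whole hours between this trip and the previous trip of the same bike;
  -- NULL for the first trip of each bike, because LAG has no earlier row
  TIMESTAMP_DIFF(
    start_time,
    LAG(start_time) OVER (PARTITION BY bikeid ORDER BY start_time),
    HOUR
  ) AS hours_since_prev
FROM trips
-- ascending order, but rows with a null bikeid are listed last
ORDER BY bikeid NULLS LAST, start_time;
```

Note that a NULL in the difference column is expected on the first row of every bike, even after the null bikeid rows are moved or removed.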
