Use Google BigQuery to Build Histogram Graph

See the 2019 update, with #standardSQL --Fh


The subquery idea works, as does "CASE WHEN" and then doing a group by:

SELECT COUNT(field1), bucket
FROM (
  SELECT field1,
    CASE WHEN age >= 0 AND age < 10 THEN 1
         WHEN age >= 10 AND age < 20 THEN 2
         WHEN age >= 20 AND age < 30 THEN 3
         ...
         ELSE -1 END AS bucket
  FROM table1)
GROUP BY bucket

Alternatively, if the buckets are regular, you could just divide and cast to an integer:

SELECT COUNT(field1), bucket
FROM (
  SELECT field1, INTEGER(age / 10) AS bucket
  FROM table1)
GROUP BY bucket
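
To sanity-check what the divide-and-cast query should return, the same bucketing can be sketched in plain Python. This is only an illustration: the sample ages and the `bucket_counts` helper are made up, and it assumes non-negative ages, where floor division and truncation agree. (If you port the query to standard SQL, `DIV(age, 10)` is probably the closest equivalent, since casting a float to INT64 there rounds instead of truncating.)

```python
from collections import Counter


def bucket_counts(ages, width=10):
    """Count values per fixed-width bucket, mirroring GROUP BY INTEGER(age / 10).

    Assumes non-negative ages, where floor division matches truncation.
    """
    return Counter(age // width for age in ages)


# Hypothetical sample ages, not taken from the question.
ages = [3, 7, 12, 18, 19, 25, 41]
counts = bucket_counts(ages)

for bucket in sorted(counts):
    # Bucket 0 covers ages 0-9, bucket 1 covers 10-19, and so on.
    print(f"{bucket * 10}-{bucket * 10 + 9}: {counts[bucket]}")
```

Empty buckets (30-39 here) simply do not appear, just as the GROUP BY returns no row for them.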

Get a histogram of a query in BigQuery

I think you want two levels of aggregation:

SELECT total_transaction, COUNT(*)
FROM (
  SELECT customer_no, COUNT(*) AS total_transaction
  FROM [bi-dwhdev-01:source.daily_order]
  WHERE DATE(order_time) >= '2018-04-01' AND DATE(order_time) <= '2018-04-10'
  GROUP BY customer_no
) c
GROUP BY total_transaction
ORDER BY total_transaction DESC;
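
The two levels of aggregation may be easier to see outside SQL. Here is a minimal Python sketch of the same idea, using a made-up list of customer numbers (one entry per order) and a hypothetical `histogram_of_counts` helper:

```python
from collections import Counter


def histogram_of_counts(customer_ids):
    """Count orders per customer, then count customers per order total."""
    per_customer = Counter(customer_ids)   # inner query: GROUP BY customer_no
    return Counter(per_customer.values())  # outer query: GROUP BY total_transaction


# Hypothetical data: one customer_no per order row.
orders = ["a", "b", "a", "c", "a", "b", "d"]
hist = histogram_of_counts(orders)

# Same ordering as ORDER BY total_transaction DESC.
for total in sorted(hist, reverse=True):
    print(total, hist[total])
```

Here one customer placed 3 orders, one placed 2, and two placed 1 each, which is the shape of result the query returns.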

BigQuery: Count elements in APPROX_QUANTILES

The title of the question says "Count elements in APPROX_QUANTILES", so that is what I'm going to answer. If your ultimate goal is to build a histogram, please see this question.

To count the number of elements in each bucket, we can do something like:

WITH data AS ( 
SELECT *, ActualElapsedTime datapoint
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = "2018-01-01"
AND Origin = 'SFO' AND Dest = 'JFK'
)
, quantiles AS (
SELECT *, IFNULL(LEAD(bucket_start) OVER(ORDER BY bucket_i) , 0100000) bucket_end
FROM UNNEST((
SELECT APPROX_QUANTILES(datapoint, 10)
FROM data
)) bucket_start WITH OFFSET bucket_i
)

SELECT COUNT(*) count, bucket_i
, ANY_VALUE(STRUCT(bucket_start, bucket_end)) b, MIN(datapoint) min, MAX(datapoint) max
FROM data
JOIN quantiles
ON data.datapoint >= bucket_start AND data.datapoint < bucket_end
GROUP BY bucket_i
ORDER BY bucket_i

Sample Image

Visualized, we get something like:

Sample Image

Which tells us:

  • Don't use APPROX_QUANTILES to build a histogram, because each bucket will end up with about the same number of elements. That's the goal of a quantile.
  • APPROX_QUANTILES is very "APPROX". As you can see, the quantiles didn't all end up with the same number of elements.
  • It takes between ~305 and ~357 minutes to fly from SFO to JFK.
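
The first point can be illustrated with a small Python sketch. The durations below are invented, not the flight data from the query, and both helpers are hypothetical; the quantile version uses exact ranks rather than BigQuery's approximation. It compares equal-width buckets, which show the shape of the distribution, with quantile buckets, which flatten it by construction:

```python
def equal_width_counts(values, n_buckets):
    """Histogram counts over n_buckets equal-width bins (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum sits exactly on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return counts


def quantile_counts(values, n_buckets):
    """Counts per quantile bucket, assigning each value by its rank."""
    counts = [0] * n_buckets
    for rank, _ in enumerate(sorted(values)):
        counts[rank * n_buckets // len(values)] += 1
    return counts


# Hypothetical durations in minutes, skewed with a long right tail.
durations = [300, 305, 310, 312, 315, 318, 320, 322,
             325, 330, 335, 340, 350, 380, 420, 480]

print(equal_width_counts(durations, 4))  # [12, 2, 1, 1]
print(quantile_counts(durations, 4))     # [4, 4, 4, 4]
```

The equal-width counts reveal the skew, while the quantile counts are identical in every bucket and say nothing about the shape.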

Calculating and displaying customer lifetime value histogram with BigQuery and Data Studio

I was able to reproduce something similar to what you describe, but it's not straightforward, so I'll try to detail everything. The main idea is to have two data sources from the same table: one contains customer_id and product_id so that we can filter on it, while the other contains customer_id and the already calculated amount_bucket field. We can then join them (blend data) on customer_id and filter by product_id without changing the amount_bucket calculations.

I used the following script to create some data in BigQuery:

CREATE OR REPLACE TABLE data_studio.histogram
(
customer_id STRING,
product_id STRING,
amount INT64
);

INSERT INTO data_studio.histogram (customer_id, product_id, amount)
VALUES ('John', 'Game', 60),
('John', 'TV', 800),
('John', 'Console', 300),
('Paul', 'Sofa', 1200),
('George', 'TV', 750),
('Ringo', 'Movie', 20),
('Ringo', 'Console', 250)
;

Then I connect directly to the BigQuery table and get the following fields. Data source is called histogram:

Sample Image

We add our second data source (BigQuery) using a custom query:

SELECT
customer_id,
CASE
WHEN SUM(amount) < 500 THEN '0-500'
WHEN SUM(amount) < 1000 THEN '500-1000'
WHEN SUM(amount) < 1500 THEN '1000-1500'
ELSE '1500+'
END
AS amount_bucket
FROM
data_studio.histogram
GROUP BY
customer_id

With only the latter we could already do a basic histogram with the following configuration:

Sample Image

Dimension is amount_bucket, metric is Record count. I made a bucket_order custom field to sort it, because lexicographically '1000-1500' comes before '500-1000':

CASE 
WHEN amount_bucket = '0-500' THEN 0
WHEN amount_bucket = '500-1000' THEN 1
WHEN amount_bucket = '1000-1500' THEN 2
ELSE 3
END
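
To see why the extra sort field is needed, here is a short Python sketch of the same ordering problem. The `bucket_order` function below is a hypothetical stand-in for the custom field: it sorts by the numeric lower bound of each label instead of hard-coding the positions.

```python
buckets = ['0-500', '500-1000', '1000-1500', '1500+']

# Plain string sorting compares character by character,
# so '1000-1500' sorts before '500-1000' because '1' < '5'.
lexicographic = sorted(buckets)


def bucket_order(label):
    """Sort key: the numeric lower bound of a label like '500-1000' or '1500+'."""
    return int(label.rstrip('+').split('-')[0])


numeric = sorted(buckets, key=bucket_order)

print(lexicographic)  # ['0-500', '1000-1500', '1500+', '500-1000']
print(numeric)        # ['0-500', '500-1000', '1000-1500', '1500+']
```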

Now we add the product_id filter on top and a new chart with the following configuration:

Sample Image

Note that the metric is CTD (Count Distinct) of customer_id, and the blended data source is implemented as:

Sample Image

Here is an example where I filter by TV, so only George and John appear, but their other products are still counted in the total amount calculation:

Sample Image

I hope it works for you.

Create ranges based on the data

Below is for BigQuery Standard SQL

#standardSQL
WITH price_ranges AS (
SELECT '0-10' price_range UNION ALL
SELECT '11-20' UNION ALL
SELECT '21-30' UNION ALL
SELECT '30-40' UNION ALL
SELECT '40-50'
)
SELECT price_range, COUNT(1) number_sold
FROM `project.dataset.table`
JOIN price_ranges
ON CAST(price_sold AS INT64)
BETWEEN CAST(SPLIT(price_range, '-')[OFFSET(0)] AS INT64)
AND CAST(SPLIT(price_range, '-')[OFFSET(1)] AS INT64)
GROUP BY price_range
-- ORDER BY price_range

Applied to the sample data from your question, the result is:

Row  price_range  number_sold
1    0-10         1
2    11-20        2
3    30-40        1
4    40-50        2
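
One thing to watch with this approach: BETWEEN is inclusive on both ends, and the labels '21-30', '30-40' and '40-50' share their boundary values. The Python sketch below mirrors the join condition to show the effect. The `matching_ranges` helper is hypothetical, and whether it matters depends on whether your data contains prices of exactly 30 or 40.

```python
ranges = ['0-10', '11-20', '21-30', '30-40', '40-50']


def matching_ranges(price):
    """Return every label whose inclusive bounds contain price, like SQL BETWEEN."""
    matches = []
    for label in ranges:
        lo, hi = (int(part) for part in label.split('-'))
        if lo <= price <= hi:
            matches.append(label)
    return matches


print(matching_ranges(25))  # ['21-30']
print(matching_ranges(30))  # ['21-30', '30-40']
print(matching_ranges(40))  # ['30-40', '40-50']
```

A row that matches two ranges is counted once in each, so non-overlapping labels such as '31-40' and '41-50' would avoid the double count.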

