Stratified Random Sampling with BigQuery

Stratified random sampling with BigQuery?

With #standardSQL, let's define our table and some stats over it:

WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total
  FROM (
    SELECT category, COUNT(*) c
    FROM table
    GROUP BY 1
    HAVING c > 1000000)
)

In this setup:

  • subreddit will be our category
  • we only want subreddits with more than 1000000 comments

So, if we want 1% of each category in our sample:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c), 2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND() < 1/100
)
GROUP BY 2


Or let's say we want ~80,000 samples, chosen proportionally across all categories:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c), 2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND() < 80000/total
)
GROUP BY 2


Now, if you want to get the ~same number of samples from each group (let's say, 20,000):

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c), 2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND() < 20000/c
)
GROUP BY 2


If you want exactly 20,000 elements from each category:

SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c, 2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
)


If you want exactly 2% of each group:

SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c), 2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off < 0.02*c
GROUP BY 2



If this last approach is what you want, you might notice it failing when you actually want to get the data out. An early LIMIT similar to the largest group size will make sure we don't sort more data than needed:

SELECT sample.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off < 0.02*c

Taking a Random Sample From Each Group in BigQuery

Something like the query below should work:

SELECT recordID, groupID
FROM (
  SELECT
    recordID, groupID,
    RAND() AS rnd, ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY rnd) AS pos
  FROM yourTable
)
WHERE pos <= 100
ORDER BY groupID, recordID

Also check RAND() here if you want to improve randomness

Stratified random sample to match a different table in BigQuery

I think your rand() comparison is off:

WITH table AS (
  SELECT a.*
  FROM `another_table` a
),
table_stats AS (
  SELECT cc.*, SUM(c) OVER () AS total
  FROM (SELECT cat1, cat2, COUNT(*) AS c
        FROM table
        GROUP BY cat1, cat2
        HAVING c > 1000000
       ) cc
)
SELECT COUNT(*) AS num_samples, cat1, cat2, ROUND(100*COUNT(*)/MAX(c), 2) percentage
FROM (SELECT id, cat1, cat2, c
      FROM (SELECT t.*, COUNT(*) OVER () AS t_total,
                   COUNT(*) OVER (PARTITION BY cat1, cat2) AS tcc_total
            FROM `fh-bigquery.reddit_comments.2018_09` t
           ) t JOIN
           table_stats b
           USING (cat1, cat2)
      WHERE RAND() < (1000.0 / t.t_total) * (c / total) / (tcc_total / t_total)
     ) t
GROUP BY 2, 3;

Note that you need the total size of the second table to get the sample size (approximately) correct.
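
As a rough sketch of that point (the table name is a placeholder, not from the answer above), the second table's total can be computed inline instead of being hard-coded:

-- Sketch only: `second_table` stands in for the table being sampled;
-- the threshold divides the target sample size by that table's own row count.
SELECT *
FROM `project.dataset.second_table`
WHERE RAND() < 1000 / (SELECT COUNT(*) FROM `project.dataset.second_table`)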

This is also random. If you really want a stratified sample, then you should do an nth sample on an ordered set. If that is of interest to you, then ask a new question, with appropriate sample data, desired results, and explanation.
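
For reference, a minimal sketch of an nth sample on an ordered set (all names here are hypothetical, not from the question): number the rows within each stratum in a fixed order and keep every 100th one.

-- Hypothetical sketch: systematic (nth) sampling per stratum.
SELECT * EXCEPT(rn)
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER(PARTITION BY cat1, cat2 ORDER BY id) AS rn
  FROM `project.dataset.some_table` t
)
WHERE MOD(rn, 100) = 1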

Efficient way to do stratified sampling in BigQuery

Consider the approach below:

with samples as (
select 1 DimA, 115623 Total, 3077 Sample, 3 DimA_value_perc union all
select 2, 108203, 3943, 4 union all
select 3, 153477, 6802, 4 union all
select 4, 232252, 12426, 5 union all
select 5, 223004, 14052, 6 union all
select 6, 242386, 17589, 7 union all
select 7, 121519, 9783, 8 union all
select 8, 371342, 34026, 9 union all
select 9, 147683, 15400, 10 union all
select 10, 281101, 32775, 12 union all
select 11, 93380, 12075, 13 union all
select 12, 181293, 25675, 14 union all
select 13, 122206, 19344, 16 union all
select 14, 140559, 25141, 18 union all
select 15, 95576, 19498, 20 union all
select 16, 94319, 21969, 23 union all
select 17, 108282, 30054, 28 union all
select 18, 94920, 33228, 35 union all
select 19, 82764, 39700, 48 union all
select 20, 28417, 23442, 82
)
select a.* except(pos) from (
  select *, row_number() over(partition by DimA order by rand()) pos
  from tableA
) a
join samples
using(DimA)
where pos <= Sample

Random Sampling in Google BigQuery

For stratified sampling, check https://stackoverflow.com/a/52901452/132438


Good job finding it :). I requested the function recently, but it hasn't made it to documentation yet.

I would say the advantage of RAND() is that the results will vary, while HASH() will keep giving you the same results for the same values (not guaranteed over time, but you get the idea).
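
To illustrate the consistency point, here is a sketch of a repeatable sample of the same Shakespeare table using FARM_FINGERPRINT, the Standard SQL hash discussed further down this page (the bucket arithmetic here is an assumption, not code from the original answer):

#standardSQL
-- Sketch: rows whose hashed word falls into the first 10 of 164656 buckets
-- are selected every run, so repeated executions return the same sample.
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE MOD(ABS(FARM_FINGERPRINT(word)), 164656) < 10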

In case you want the variability that RAND() brings while still getting consistent results, you can seed it with an integer (in legacy SQL), as in RAND(3).

Notice though that the example you pasted is doing a full sort of the random values - for sufficiently big inputs this approach won't scale.

A scalable approach, to get around 10 random rows:

SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656

(where 10 is the approximate number of results I want to get, and 164656 the number of rows that table has)



standardSQL update:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/164656

or even:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/(SELECT COUNT(*) FROM `publicdata.samples.shakespeare`)

Efficient sampling of a fixed number of rows in BigQuery

I compared the execution times of the two queries using BigQuery Standard SQL with the natality sample dataset (137,826,763 rows), drawing a sample of size n from the source_year column. The queries are executed without using cached results.

Query1:

SELECT source_year
FROM `bigquery-public-data.samples.natality`
WHERE RAND() < n/137826763

Query2:

SELECT source_year, rand() AS r
FROM `bigquery-public-data.samples.natality`
ORDER BY r
LIMIT n

Result:

n          Query1    Query2
1000       ~2.5s     ~2.5s
10000      ~3s       ~3s
100000     ~3s       ~4s
1000000    ~4.5s     ~15s

For n <= 10^5 the difference is ~1 s, and for n >= 10^6 the execution times differ significantly. The cause seems to be that when LIMIT is added to the query, the ORDER BY runs on multiple workers. See the original answer provided by Mikhail Berlyant.

I thought your proposal to combine both queries could be a possible solution. Therefore I compared the execution time for the combined query:

New Query:

SELECT source_year, rand() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < 2*n/137826763)
ORDER BY r
LIMIT n

Result:

n          Query1    New Query
1000       ~2.5s     ~3s
10000      ~3s       ~3s
100000     ~3s       ~3s
1000000    ~4.5s     ~6s

The execution times in this case vary by <= 1.5 s for n <= 10^6. It is a good idea to select n+some_rows rows in the subquery instead of 2n rows, where some_rows is a constant number large enough to get more than n rows.

Regarding what you said about "not guaranteed to work", I understand that you are worried that the new query doesn't retrieve exactly n rows. In this case, if some_rows is large enough, the subquery will almost always return more than n rows, and the outer query will then return exactly n rows.
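
A hedged sketch of that n + some_rows variant (some_rows is an assumed constant, here 10000; n stays a placeholder as in the queries above):

SELECT source_year, RAND() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < (n + 10000)/137826763)
ORDER BY r
LIMIT n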

To summarize, the combined query is not as fast as Query1, but it gets exactly n rows and is faster than Query2. So it could be a solution for uniformly random samples. I want to point out that if ORDER BY is not specified, the BigQuery output is non-deterministic, which means you might receive a different result each time you execute the query. If you try to execute the following query several times without using cached results, you will get different results.

SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5

Therefore, depending on how random you want the samples to be, this may be a better solution.

Set random seed in BigQuery

Maybe just the order of rows is different? Try to sort them.

How to do repeatable sampling in BigQuery Standard SQL?

Standard SQL would re-write the query thus:

#standardSQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8

Specifically here are the changes:

  • a period (not colon) to separate the Google Cloud project from table name.
  • backticks (not square brackets) to escape hyphen in the table name.
  • MOD function (not %).
  • FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.

How to do sampling in sql query to get dataframe with pandas

A simple random sample can be performed using the following syntax:

select * from mydata where rand()>0.9

This gives each row in the table a 10% chance of being selected. It doesn't guarantee a certain sample size or that every bin is represented (that would require a stratified sample). Here's a fiddle of this approach:

http://sqlfiddle.com/#!9/21d1ee/2

On average, random sampling will provide a distribution the same as that of the underlying data, so it meets your requirement. However, if you want to 'force' the sample to be more representative or force it to be a certain size, we need to look at something a little more advanced.
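
For instance, a sketch of forcing up to 100 rows per bin with a window function, written in BigQuery Standard SQL like the rest of this page (the table and bin column names are placeholders):

-- Sketch: number rows randomly within each bin and keep the first 100.
SELECT * EXCEPT(pos)
FROM (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY bin ORDER BY RAND()) AS pos
  FROM mydata
)
WHERE pos <= 100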


