SQL Random Sample with Groups

SQL random sample with groups

You want a stratified sample. I would recommend sorting the data by course code and taking every nth row (a systematic sample). Here is one method that works best if you have a large population size:

select d.*
from (select d.*,
             row_number() over (order by coursecode, newid()) as seqnum,
             count(*) over () as cnt
      from degree d
     ) d
where seqnum % (cnt / 500) = 1;
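
The every-nth-row idea above can be tried end to end with a small harness. This is only a sketch under assumptions: it uses SQLite through Python's sqlite3 module (SQLite 3.25 or later, for window functions) as a stand-in engine, random() in place of newid(), and made-up data with a hypothetical student_id column. Note that the step cnt / 500 is an integer division and has to be at least 2; with a smaller population the filter matches nothing or fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE degree (student_id INTEGER, coursecode TEXT)")
# Hypothetical data: 10,000 rows spread evenly over four course codes.
conn.executemany(
    "INSERT INTO degree VALUES (?, ?)",
    [(i, "C%d" % (i % 4)) for i in range(10000)],
)

SAMPLE_SIZE = 500
rows = conn.execute(
    """
    SELECT student_id, coursecode
    FROM (SELECT d.*,
                 row_number() OVER (ORDER BY coursecode, random()) AS seqnum,
                 count(*) OVER () AS cnt
          FROM degree d)
    WHERE seqnum % (cnt / ?) = 1   -- keep rows 1, 21, 41, ... when the step is 20
    """,
    (SAMPLE_SIZE,),
).fetchall()

# Count how many sampled rows each course code contributed.
per_course = {}
for _, code in rows:
    per_course[code] = per_course.get(code, 0) + 1
```

Because the rows are ordered by course code before numbering, each course contributes in proportion to its size, while the random tie-breaker decides which rows inside a course are picked.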

EDIT:

You can also calculate the population size for each group "on the fly":

select d.*
from (select d.*,
             row_number() over (partition by coursecode order by newid()) as seqnum,
             count(*) over () as cnt,
             count(*) over (partition by coursecode) as cc_cnt
      from degree d
     ) d
where seqnum <= 500 * (cc_cnt * 1.0 / cnt);
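
To see the proportional allocation at work, here is a hedged sketch with the same assumptions as before: SQLite via Python's sqlite3 as a stand-in engine, random() instead of newid(), and invented group sizes. It compares with <= so that a group whose share is exactly 50 rows keeps all 50.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE degree (student_id INTEGER PRIMARY KEY, coursecode TEXT)")
# Hypothetical, deliberately unbalanced population: 6000 / 3000 / 1000 rows.
sizes = {"A": 6000, "B": 3000, "C": 1000}
conn.executemany(
    "INSERT INTO degree (coursecode) VALUES (?)",
    [(code,) for code, n in sizes.items() for _ in range(n)],
)

counts = dict(
    conn.execute(
        """
        SELECT coursecode, count(*)
        FROM (SELECT d.*,
                     row_number() OVER (PARTITION BY coursecode
                                        ORDER BY random()) AS seqnum,
                     count(*) OVER () AS cnt,
                     count(*) OVER (PARTITION BY coursecode) AS cc_cnt
              FROM degree d)
        -- each group keeps its share of the 500-row target
        WHERE seqnum <= 500 * (cc_cnt * 1.0 / cnt)
        GROUP BY coursecode
        """
    ).fetchall()
)
```

With these sizes the groups hold 60%, 30% and 10% of the population, so the sample should split into 300, 150 and 50 rows.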

Select random row for each group

select distinct on (id) id, attribute
from like_this
order by id, random()

If you only need the attribute column:

select distinct on (id) attribute
from like_this
order by id, random()

Note that you still need to order by id first, because it is the distinct on column: the distinct on expressions must match the leftmost order by expressions.

If you only want the distinct attributes:

select distinct attribute
from (
    select distinct on (id) attribute
    from like_this
    order by id, random()
) s
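
distinct on is a PostgreSQL extension. On engines without it, one random row per id can be sketched with row_number() instead. This is a swapped-in technique rather than the query above. The harness again assumes SQLite through Python's sqlite3, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE like_this (id INTEGER, attribute TEXT)")
# Hypothetical rows: id 1 has three attributes, id 2 has two, id 3 has one.
conn.executemany(
    "INSERT INTO like_this VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"), (2, "x"), (2, "y"), (3, "z")],
)

picked = conn.execute(
    """
    SELECT id, attribute
    FROM (SELECT id, attribute,
                 -- number the rows of each id in random order
                 row_number() OVER (PARTITION BY id ORDER BY random()) AS rn
          FROM like_this)
    WHERE rn = 1          -- keep the first row of each id
    ORDER BY id
    """
).fetchall()
```

Each run returns exactly one row per id, and which attribute comes back for ids 1 and 2 varies between runs.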

SQL. Get samples of data in group query

The following is for BigQuery Standard SQL:

#standardSQL
SELECT domain, names_count,
  samples[OFFSET(0)] AS sample_name_1,
  samples[SAFE_OFFSET(1)] AS sample_name_2,
  samples[SAFE_OFFSET(2)] AS sample_name_3,
  samples[SAFE_OFFSET(3)] AS sample_name_4,
  samples[SAFE_OFFSET(4)] AS sample_name_5
FROM (
  SELECT domain, 
    COUNT(name) names_count,
    ARRAY_AGG(name ORDER BY RAND() LIMIT 5) samples
  FROM `project.dataset.table`
  GROUP BY domain
)

SQL random sampling into equal groups

What you have looks fine to me. You don't need a subquery, by the way. This will do just fine:

select *, ntile(4) over (order by random())

Snowflake doesn't guarantee that the query will reproduce the same result set, even if you provide a random seed, so make sure to dump any intermediate result set into a temp table if you plan on reusing it.
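
The advice to persist the assignment can be illustrated as follows. This is a sketch only: it runs on SQLite through Python's sqlite3 rather than Snowflake, and the subjects table and its subject_id column are hypothetical. The random split is computed once into a temp table, and every later read sees the same groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (subject_id INTEGER)")
# Hypothetical population of 100 rows.
conn.executemany("INSERT INTO subjects VALUES (?)", [(i,) for i in range(100)])

# Compute the random split once and store it.
conn.execute(
    """
    CREATE TEMP TABLE assignment AS
    SELECT subject_id, ntile(4) OVER (ORDER BY random()) AS grp
    FROM subjects
    """
)

read_sql = "SELECT subject_id, grp FROM assignment ORDER BY subject_id"
first_read = conn.execute(read_sql).fetchall()
second_read = conn.execute(read_sql).fetchall()

group_sizes = dict(
    conn.execute("SELECT grp, count(*) FROM assignment GROUP BY grp").fetchall()
)
```

ntile(4) splits the 100 randomly ordered rows into four buckets of 25, and reading from the stored table avoids re-running random().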

Select a random row from each group SQL Server

select top 1 with ties id,code,age 
from
table
order by row_number() over (partition by id order by rand())

Update: as explained in "Return rows in random order", you have to use NEWID(), since RAND() is fixed for the duration of the SELECT on MS SQL Server and therefore returns the same value for every row.

select top 1 with ties id,code,age 
from
table
order by row_number() over (partition by id order by NEWID())

SQL - 5% random sample by group

You need to be able to count each group and then pull the data out in a random order. Fortunately, we can do this with a CTE-style query. Although a CTE isn't strictly needed, it helps break the solution down into small steps, rather than lots of sub-selects and the like.

I assume you've already got a column that groups the data, and that the value in this column is the same for all items in the group. If so, something like this might work (columns and table names to be changed to suit your situation):

WITH randomID AS (
    -- First assign a random ID to all rows. This will give us a random order.
    SELECT *, NEWID() as random FROM sourceTable
),
countGroups AS (
    -- Now we add row numbers for each group. So each group will start at 1. We order 
    -- by the random column we generated in the previous expression, so you should get
    -- different results in each execution
    SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt FROM randomID
)
-- Now we get the data
SELECT * 
    FROM countGroups c1
    WHERE rowcnt <= (
        SELECT MAX(rowcnt) / 20 FROM countGroups c2 WHERE c1.groupcolumn = c2.groupcolumn
    )

The two CTE expressions allow you to randomly order and then count each group. The final select should then be fairly straightforward: for each group, find out how many rows there are in it, and only return 5% of them (total_row_count_in_group / 20).
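
A quick way to check the 5% logic is to run it on toy data. The sketch below assumes SQLite via Python's sqlite3 as a stand-in for SQL Server, random() in place of NEWID(), a random column renamed to rnd, and invented group sizes. It also shows one consequence of the integer division: a group with fewer than 20 rows contributes nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourceTable (item_id INTEGER, groupcolumn TEXT)")
# Hypothetical groups of 400, 100 and 10 rows.
sizes = {"big": 400, "medium": 100, "tiny": 10}
conn.executemany(
    "INSERT INTO sourceTable VALUES (?, ?)",
    [(i, g) for g, n in sizes.items() for i in range(n)],
)

sampled = dict(
    conn.execute(
        """
        WITH randomID AS (
            SELECT *, random() AS rnd FROM sourceTable
        ),
        countGroups AS (
            SELECT *, row_number() OVER (PARTITION BY groupcolumn
                                         ORDER BY rnd) AS rowcnt
            FROM randomID
        )
        SELECT groupcolumn, count(*)
        FROM countGroups c1
        WHERE rowcnt <= (
            -- group size / 20, using integer division
            SELECT max(rowcnt) / 20 FROM countGroups c2
            WHERE c1.groupcolumn = c2.groupcolumn
        )
        GROUP BY groupcolumn
        """
    ).fetchall()
)
```

Here 400 / 20 gives 20 rows and 100 / 20 gives 5, while 10 / 20 truncates to 0, so the tiny group is missing from the result. If small groups must be represented, the threshold would need rounding up or a minimum of one row.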

Generate a random number for each group and assign it to all rows in the group

If the random number can be sequential, you can use dense_rank():

select t.*, dense_rank() over (order by id) as group_num
from t;

Or for a bit more randomness:

select t.*,
       dense_rank() over (order by farm_fingerprint(cast(id as string)), id) as group_num
from t;

Alternatively, a separate calculation by id might be simplest:

select *
from t join
     (select id,
             dense_rank() over (order by rand()) as group_num
      from t
      group by id
     ) tt
     using (id)
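
The last variant can be checked with a short harness. As a sketch under assumptions, it runs on SQLite through Python's sqlite3 with random() in place of rand(), and the payload column and sample ids are made up. Each id should receive exactly one group number, and the numbers should form a shuffled 1..n.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, payload INTEGER)")
# Hypothetical data: four ids with three rows each.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(gid, n) for gid in ("a", "b", "c", "d") for n in range(3)],
)

rows = conn.execute(
    """
    SELECT t.id, t.payload, tt.group_num
    FROM t JOIN
         (SELECT id,
                 -- one random rank per id, computed after grouping
                 dense_rank() OVER (ORDER BY random()) AS group_num
          FROM t
          GROUP BY id
         ) tt
         USING (id)
    """
).fetchall()

# Collect the group numbers seen for each id.
nums_by_id = {}
for gid, _, num in rows:
    nums_by_id.setdefault(gid, set()).add(num)
```

Grouping by id before ranking is what guarantees that all rows of an id share the same number, since the random value is drawn once per id rather than once per row.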
