SQL random sample with groups
You want a stratified sample. I would recommend doing this by sorting the data by course code and taking every nth row. Here is one method that works best if you have a large population size:
select d.*
from (select d.*,
             row_number() over (order by coursecode, newid()) as seqnum,
             count(*) over () as cnt
      from degree d
     ) d
-- cnt / 500 is integer division, so this needs cnt >= 1000 to return rows
where seqnum % (cnt / 500) = 1;
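To see what the every-nth-row filter does, here is a minimal Python sketch of the same idea outside the database. The function name `systematic_sample`, the `student` field, and the seven made-up course codes are assumptions for illustration only; the random tiebreak stands in for `newid()`.

```python
import random

def systematic_sample(rows, target, key, seed=None):
    """Sort by stratum key with a random tiebreak, then keep every nth row."""
    rng = random.Random(seed)
    # Equivalent in spirit to: row_number() over (order by coursecode, newid())
    ordered = sorted(rows, key=lambda r: (key(r), rng.random()))
    step = len(ordered) // target  # integer division, like cnt / 500
    if step < 2:
        # step == 0 would divide by zero; step == 1 makes "seqnum % 1 = 1" never true
        raise ValueError("population must be at least twice the target size")
    return [r for seqnum, r in enumerate(ordered, start=1) if seqnum % step == 1]

# Hypothetical data: 10,000 rows spread over 7 course codes
rows = [{"coursecode": f"C{i % 7}", "student": i} for i in range(10_000)]
sample = systematic_sample(rows, target=500, key=lambda r: r["coursecode"], seed=1)
print(len(sample))  # 500
```

Because the rows are sorted by course code before every 20th one is taken, each course ends up represented roughly in proportion to its size.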
EDIT:
You can also calculate the population size for each group "on the fly":
select d.*
from (select d.*,
             row_number() over (partition by coursecode order by newid()) as seqnum,
             count(*) over () as cnt,
             count(*) over (partition by coursecode) as cc_cnt
      from degree d
     ) d
where seqnum < 500 * (cc_cnt * 1.0 / cnt);
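As a rough check of how many rows the per-group condition keeps, the following Python sketch counts the sequence numbers that pass `seqnum < 500 * (cc_cnt * 1.0 / cnt)`. The group names and sizes are made up, and Python floats stand in for the SQL decimal arithmetic.

```python
def rows_kept(group_sizes, target=500):
    """Count, per group, the seqnum values satisfying seqnum < target * cc_cnt / cnt."""
    cnt = sum(group_sizes.values())
    kept = {}
    for group, cc_cnt in group_sizes.items():
        limit = target * (cc_cnt * 1.0 / cnt)
        # seqnum runs from 1 to cc_cnt inside each partition
        kept[group] = sum(1 for seqnum in range(1, cc_cnt + 1) if seqnum < limit)
    return kept

# Hypothetical group sizes adding up to 10,000 rows
sizes = {"MATH": 5510, "PHYS": 3105, "CHEM": 1385}
kept = rows_kept(sizes)
print(kept, sum(kept.values()))  # {'MATH': 275, 'PHYS': 155, 'CHEM': 69} 499
```

Each group's quota is rounded down, so the total can land a little under 500. If you need exactly 500, you would have to distribute the remainder yourself.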
Select random row for each group
select distinct on (id) id, attribute
from like_this
order by id, random()
If you only need the attribute column:
select distinct on (id) attribute
from like_this
order by id, random()
Notice that you still need to order by id first, as it is the column of the distinct on.
If you only want the distinct attributes:
select distinct attribute
from (
select distinct on (id) attribute
from like_this
order by id, random()
) s
SQL. Get samples of data in group query
Below is for BigQuery Standard SQL
#standardSQL
SELECT domain, names_count,
samples[OFFSET(0)] AS sample_name_1,
samples[SAFE_OFFSET(1)] AS sample_name_2,
samples[SAFE_OFFSET(2)] AS sample_name_3,
samples[SAFE_OFFSET(3)] AS sample_name_4,
samples[SAFE_OFFSET(4)] AS sample_name_5
FROM (
SELECT domain,
COUNT(name) names_count,
ARRAY_AGG(name ORDER BY RAND() LIMIT 5) samples
FROM `project.dataset.table`
GROUP BY domain
)
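For readers who want to check the result shape, this is a small Python sketch of what the query computes: a count per domain plus up to five randomly chosen names. The helper name `samples_per_group` and the sample data are hypothetical; `random.sample` plays the role of `ORDER BY RAND() LIMIT 5`.

```python
import random
from collections import defaultdict

def samples_per_group(pairs, k=5, seed=None):
    """Return {domain: (names_count, up to k random names)}."""
    rng = random.Random(seed)
    names = defaultdict(list)
    for domain, name in pairs:
        names[domain].append(name)
    return {
        domain: (len(members), rng.sample(members, min(k, len(members))))
        for domain, members in names.items()
    }

# Hypothetical rows: one domain with 8 names, one with a single name
pairs = [("a.com", f"user{i}") for i in range(8)] + [("b.org", "solo")]
result = samples_per_group(pairs, seed=0)
print(result["b.org"])  # (1, ['solo'])
```

A domain with fewer than five names simply yields a shorter list, which is why the query reads the later positions with SAFE_OFFSET.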
SQL random sampling into equal groups
What you have looks fine to me. You don't need a subquery, by the way. This will do just fine:
select *, ntile(4) over (order by random())
Snowflake doesn't guarantee the query will reproduce the same result set even if you provide a random seed, so make sure to dump any intermediate result set into a temp table if you plan on re-using it.
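To make the bucket sizes concrete, here is a plain Python sketch of splitting a randomly ordered set into four near-equal groups, following the usual NTILE rule that the first groups absorb any remainder. The function name `random_ntile` and its `seed` parameter are assumptions for this sketch and say nothing about Snowflake's own behaviour.

```python
import random

def random_ntile(items, buckets, seed=None):
    """Shuffle items, then number them into `buckets` near-equal groups (1-based)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)  # stands in for ORDER BY random()
    base, extra = divmod(len(shuffled), buckets)
    assignment = {}
    start = 0
    for tile in range(1, buckets + 1):
        # The first `extra` tiles get one more row than the rest
        size = base + (1 if tile <= extra else 0)
        for item in shuffled[start:start + size]:
            assignment[item] = tile
        start += size
    return assignment

groups = random_ntile(range(10), buckets=4, seed=42)
sizes = [list(groups.values()).count(t) for t in range(1, 5)]
print(sizes)  # [3, 3, 2, 2]
```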
Select a random row from each group SQL Server
select top 1 with ties id,code,age
from
table
order by row_number() over (partition by id order by rand())
Update: as per "Return rows in random order", you have to use NEWID(), since RAND() returns the same value for every row for the duration of the SELECT on MS SQL Server.
select top 1 with ties id,code,age
from
table
order by row_number() over (partition by id order by NEWID())
SQL - 5% random sample by group
You need to be able to count each group and then coerce the data out in a random order. Fortunately, we can do this with a CTE-style query. Although a CTE isn't strictly needed, it will help break the solution down into little bits, rather than lots of sub-selects and the like.
I assume you've already got a column that groups the data, and that the value in this column is the same for all items in the group. If so, something like this might work (columns and table names to be changed to suit your situation):
WITH randomID AS (
-- First assign a random ID to all rows. This will give us a random order.
SELECT *, NEWID() as random FROM sourceTable
),
countGroups AS (
-- Now we add row numbers for each group. So each group will start at 1. We order
-- by the random column we generated in the previous expression, so you should get
-- different results in each execution
SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt FROM randomID
)
-- Now we get the data
SELECT *
FROM countGroups c1
WHERE rowcnt <= (
SELECT MAX(rowcnt) / 20 FROM countGroups c2 WHERE c1.groupcolumn = c2.groupcolumn
)
The two CTE expressions allow you to randomly order and then count each group. The final select should then be fairly straightforward: for each group, find out how many rows there are in it, and only return 5% of them (total_row_count_in_group / 20). Note that this is integer division, so a group with fewer than 20 rows returns no rows at all.
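The per-group arithmetic can be sketched in Python as follows. This is only an illustration of the logic, with a hypothetical function name and made-up groups A, B and C; the shuffle stands in for ordering by NEWID().

```python
import random
from collections import defaultdict

def sample_fraction_per_group(rows, group_key, divisor=20, seed=None):
    """Keep the first len(group) // divisor rows of each randomly ordered group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    sample = []
    for members in groups.values():
        rng.shuffle(members)             # random order within the group
        quota = len(members) // divisor  # MAX(rowcnt) / 20 with integer division
        sample.extend(members[:quota])   # rowcnt <= quota
    return sample

# Hypothetical groups of 200, 45 and 12 rows
rows = ([("A", i) for i in range(200)]
        + [("B", i) for i in range(45)]
        + [("C", i) for i in range(12)])
picked = sample_fraction_per_group(rows, group_key=lambda r: r[0], seed=7)
counts = {g: sum(1 for r in picked if r[0] == g) for g in "ABC"}
print(counts)  # {'A': 10, 'B': 2, 'C': 0}
```

Group C has only 12 rows, so its quota rounds down to zero. If small groups should still contribute at least one row, the quota would need a floor of 1.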
Generate a random number for each group and assign it to all rows in the group
If the random number can be sequential, you can use dense_rank():
select t.*, dense_rank() over (order by id) as group_num
from t;
Or for a bit more randomness:
select t.*,
dense_rank() over (order by farm_fingerprint(cast(id as string)), id) as group_num
from t;
Alternatively, a separate calculation by id might be simplest:
select *
from t join
(select id,
dense_rank() over (order by rand()) as group_num
from t
group by id
) tt
using (id)
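The idea behind that last query, one random number per distinct id that is then shared by every row with that id, looks roughly like this in Python. The function name `random_group_numbers` and the example ids are assumptions for illustration.

```python
import random

def random_group_numbers(ids, seed=None):
    """Map each distinct id to a number 1..n assigned in random order."""
    rng = random.Random(seed)
    distinct = sorted(set(ids))  # like GROUP BY id
    rng.shuffle(distinct)        # like ORDER BY rand()
    return {id_: n for n, id_ in enumerate(distinct, start=1)}

ids = [3, 1, 3, 2, 1]  # hypothetical id column with repeats
mapping = random_group_numbers(ids, seed=5)
# Joining back: every row picks up the number assigned to its id
labelled = [(id_, mapping[id_]) for id_ in ids]
print(labelled)
```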