Simple Random Samples from a SQL Database

There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

With absolutely no assumptions about the table, I think your O(n lg n) solution is the best. With a good optimizer or a slightly different technique, though, the query you list could do a bit better: O(m*n), where m is the number of random rows desired, since it wouldn't necessarily have to sort the whole large array; it could just search for the smallest value m times. For the sort of numbers you posted, however, m is bigger than lg n anyway, so this would not be a gain.
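As a rough illustration of the point about not sorting the whole table, here is a minimal Python sketch. It swaps in a bounded heap (`heapq.nsmallest`) rather than the repeated-minimum search described above, so the cost is roughly O(n log m). The function name `sample_by_random_key` and the in-memory list standing in for the table are assumptions made purely for illustration.

```python
import heapq
import random


def sample_by_random_key(rows, m, seed=None):
    """Pick m rows by tagging each with a random key and keeping the m smallest keys."""
    rng = random.Random(seed)
    # Tag every row with a random key, much as ORDER BY RAND() does.
    # The index i breaks ties so the rows themselves are never compared.
    keyed = ((rng.random(), i, row) for i, row in enumerate(rows))
    # nsmallest tracks only m candidates at a time instead of sorting all n rows.
    return [row for _, _, row in heapq.nsmallest(m, keyed)]


table = list(range(1, 1001))  # hypothetical stand-in for a table with n = 1000 rows
picked = sample_by_random_key(table, 10, seed=42)
print(picked)
```

Every row still has to be visited once, so this only avoids the full sort, not the full scan.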

Three assumptions we might try out:

  1. there is a unique, indexed, primary key in the table

  2. the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)

  3. the unique primary key is an integer that ranges from 1 to n with no gaps

With only assumptions 1 and 2, I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necessarily a fast O(n). If we can additionally assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy additional property to work with. With a random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.

Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have MySQL or anything in front of me right now, so in rough pseudocode this would look something like:


create table RandomKeys (RandomKey int)
create table RandomKeysAttempt (RandomKey int)

-- generate m random keys between 1 and n
-- (floor keeps each key an integer in the range 1..n)
for i = 1 to m
    insert RandomKeysAttempt select floor(rand()*n) + 1

-- eliminate duplicates
insert RandomKeys select distinct RandomKey from RandomKeysAttempt

-- as long as we don't have enough, keep generating new keys;
-- with luck (and m much less than n), this won't be necessary
while (select count(*) from RandomKeys) < m
    NextAttempt = floor(rand()*n) + 1
    if not exists (select * from RandomKeys where RandomKey = NextAttempt)
        insert RandomKeys select NextAttempt

-- get our random rows
select *
from RandomKeys r
join table t on r.RandomKey = t.UniqueKey

If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
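A minimal sketch of that procedural approach, assuming Python with its built-in `sqlite3` module and all three assumptions above (integer keys 1 to n with no gaps). The helper name `random_rows`, the `items` table, and its `payload` column are hypothetical; `random.sample` is used because it draws without replacement, which removes the duplicate-elimination loop.

```python
import random
import sqlite3


def random_rows(conn, table, key_column, n, m, seed=None):
    """Return m rows whose integer keys are drawn without replacement from 1..n."""
    rng = random.Random(seed)
    # random.sample never repeats a value, so no retry loop is needed.
    keys = rng.sample(range(1, n + 1), m)
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS RandomKeys (RandomKey INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM RandomKeys")
    conn.executemany(
        "INSERT INTO RandomKeys (RandomKey) VALUES (?)", [(k,) for k in keys]
    )
    # Identifiers cannot be bound as parameters, so table and key_column
    # must be trusted names, never user input.
    query = (
        f"SELECT t.* FROM RandomKeys r "
        f"JOIN {table} t ON r.RandomKey = t.{key_column}"
    )
    return conn.execute(query).fetchall()


# Demo on a hypothetical table with keys 1..1000 and no gaps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (UniqueKey INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)
sample = random_rows(conn, "items", "UniqueKey", n=1000, m=5, seed=1)
print(sample)
```

The database only performs m indexed lookups here; all of the looping and random number generation happens outside SQL.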

Select a random sample of results from a query result

SELECT *
FROM (
    SELECT *
    FROM mytable
    ORDER BY dbms_random.value
)
WHERE rownum <= 1000

Select n random rows from SQL Server table

select top 10 percent * from [yourtable] order by newid()

In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.

select  * from [yourtable] where [yourPk] in 
(select top 10 percent [yourPk] from [yourtable] order by newid())

The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.
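The same "pick keys first, then fetch rows" idea can be tried out with a small Python sketch. It assumes SQLite through the standard `sqlite3` module rather than SQL Server, so `ORDER BY RANDOM() LIMIT ?` stands in for `top 10 percent ... order by newid()`, and the 10 percent count is computed by hand. The `payload` column and the 200-row demo table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (yourPk INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [(i, "x" * 100) for i in range(1, 201)],
)

# SQLite has no TOP ... PERCENT, so work out the row count for 10 percent first.
(total,) = conn.execute("SELECT COUNT(*) FROM yourtable").fetchone()
k = total // 10

# The inner query shuffles only the narrow key column;
# the outer query then fetches the wide rows for the chosen keys.
rows = conn.execute(
    """
    SELECT * FROM yourtable
    WHERE yourPk IN (
        SELECT yourPk FROM yourtable ORDER BY RANDOM() LIMIT ?
    )
    """,
    (k,),
).fetchall()
print(len(rows))
```

Whether this actually beats the single-query form depends on the engine and on how wide the rows are, so it is worth measuring on real data.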

How to randomly select rows in SQL?

SELECT TOP 5 Id, Name FROM customerNames
ORDER BY NEWID()

That said, everybody seems to come to this page for the more general answer to your question:

Selecting a random row in SQL

Select a random row with MySQL:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

Select a random row with PostgreSQL:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1

Select a random row with Microsoft SQL Server:

SELECT TOP 1 column FROM table
ORDER BY NEWID()

Select a random row with IBM DB2:

SELECT column, RAND() as IDX 
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY

Select a random record with Oracle:

SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1

Select a random row with SQLite:

SELECT column FROM table 
ORDER BY RANDOM() LIMIT 1

Simple random sampling while pulling data from a warehouse (Oracle engine) using PROC SQL in SAS

None of the posted answers or comments worked for my case. They might have on a smaller table, but we have 87 million rows.

I wanted an answer that uses SAS, so here is what I did, and it works. Thanks all!

libname dwh path username pwd;
proc sql;
create table sample as
(select
, ranuni(any arbitrary seed)
from dwh.
);
quit;

Random data sampling with oracle sql, data generation

The solution depends on what you need. If you want all rows from the first initial set(s) and random additional rows from the last one, then use:

with params(size_, sample_) as (select 4, 6 from dual)
select val
from (
select mod(level - 1, size_) + 1 val, sample_,
case when level <= size_ * floor(sample_ / size_) then 0
else dbms_random.value()
end rand
from params
connect by level <= size_ * ceil(sample_ / size_)
order by rand)
where rownum <= sample_

But if you allow the possibility of a result like (1, 1, 2, 2, 3, 3), where some values may not appear in the output at all (here 4), then use this:

with params(size_, sample_) as (select 4, 6 from dual)
select val
from (
select mod(level - 1, size_) + 1 val, sample_, dbms_random.value() rand
from params
connect by level <= size_ * ceil(sample_ / size_)
order by rand)
where rownum <= sample_

How does it work? We build the set (1, 2, 3, 4) as many times as the division sample / size requires, rounded up. Then we assign random values. In the first case I assign 0 to the first set(s), so they are certain to be in the output, and random values to the last set. In the second case random values are assigned to all rows.
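To make the mechanics easier to follow outside Oracle, here is a small Python sketch of the same logic. It is a re-expression under assumptions, not the original query: the function name `sample_values` and the `keep_full_sets` flag (True for the first query, False for the second) are invented for illustration, and `random.Random` stands in for `dbms_random.value()`.

```python
import math
import random


def sample_values(size, sample, keep_full_sets=True, seed=None):
    """Mimic the two queries: repeat 1..size, attach sort keys, keep `sample` rows."""
    rng = random.Random(seed)
    full_sets = sample // size              # floor(sample_ / size_)
    total_sets = math.ceil(sample / size)   # ceil(sample_ / size_)
    rows = []
    for level in range(1, size * total_sets + 1):
        val = (level - 1) % size + 1        # mod(level - 1, size_) + 1
        if keep_full_sets and level <= size * full_sets:
            key = 0.0                       # complete sets always sort first
        else:
            key = rng.random()              # random key for the remaining rows
        rows.append((key, val))
    rows.sort(key=lambda r: r[0])           # order by rand
    return [val for _, val in rows[:sample]]  # where rownum <= sample_


first = sample_values(4, 6, keep_full_sets=True, seed=0)
second = sample_values(4, 6, keep_full_sets=False, seed=0)
print(first)
print(second)
```

With size 4 and sample 6, the first variant always contains every value from 1 to 4 plus two distinct extras, while the second may leave some values out entirely.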


