Select a random sample of results from a query result
SELECT *
FROM (
SELECT *
FROM mytable
ORDER BY
dbms_random.value
)
WHERE rownum <= 1000
Select random sample of N rows from Oracle SQL query result
You can sample your data by ordering randomly and then fetching first N rows.
DBMS_RANDOM.RANDOM
select round((to_date('2019-12-31') - date_birth) / 365, 0) as age
From personal_info a
where exists ( select person_id b from credit_info where credit_type = 'C' and a.person_id = b.person_id )
Order by DBMS_RANDOM.RANDOM
Fetch first 250 Rows
Edit: for oracle 11g and prior
Select * from (
select round((to_date('2019-12-31') - date_birth) / 365, 0) as age
From personal_info a
where exists ( select person_id b from credit_info where credit_type = 'C' and a.person_id = b.person_id )
Order by DBMS_RANDOM.RANDOM
)
Where rownum< 250
Simple Random Samples from a Sql database
There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
I think with absolutely no assumptions about the table that your O(n lg n) solution is the best. Though actually with a good optimizer or a slightly different technique the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necesssarily have to sort the whole large array, it could just search for the smallest m times. But for the sort of numbers you posted, m is bigger than lg n anyway.
Three asumptions we might try out:
there is a unique, indexed, primary key in the table
the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)
the unique primary key is an integer that ranges from 1 to n with no gaps
With only assumptions 1 and 2 I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necesarily a fast O(n). If we can ADDITIONALLY assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.
Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have mysql or anything in front of me right now, so in slightly pseudocode this would look something like:
create table RandomKeys (RandomKey int)
create table RandomKeysAttempt (RandomKey int)
-- generate m random keys between 1 and n
for i = 1 to m
insert RandomKeysAttempt select rand()*n + 1
-- eliminate duplicates
insert RandomKeys select distinct RandomKey from RandomKeysAttempt
-- as long as we don't have enough, keep generating new keys,
-- with luck (and m much less than n), this won't be necessary
while count(RandomKeys) < m
NextAttempt = rand()*n + 1
if not exists (select * from RandomKeys where RandomKey = NextAttempt)
insert RandomKeys select NextAttempt
-- get our random rows
select *
from RandomKeys r
join table t ON r.RandomKey = t.UniqueKey
If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
How to randomly select rows in SQL?
SELECT TOP 5 Id, Name FROM customerNames
ORDER BY NEWID()
That said, everybody seems to come to this page for the more general answer to your question:
Selecting a random row in SQL
Select a random row with MySQL:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
Select a random row with PostgreSQL:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Select a random row with Microsoft SQL Server:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
Select a random row with IBM DB2
SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY
Select a random record with Oracle:
SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
Select a random row with sqlite:
SELECT column FROM table
ORDER BY RANDOM() LIMIT 1
Generate random sample from huge table, with conditions
Add a column to your table and populate it with random numbers.
ALTER TABLE `table` ADD COLUMN rando FLOAT DEFAULT NULL;
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then do
SELECT *
FROM `table`
WHERE rando > RAND() * 0.9
AND condition = 0
ORDER BY rando
LIMIT 5000
Do it again for condition = 1
and Bob's your uncle. It will pull rows in random order starting from a random row.
A couple of notes:
- 0.9 is there to improve the chances you'll actually get 5000 rows and not some lesser number.
- You may have to add
LIMIT 1000
to the UPDATE statement and run it a whole bunch of times to populate the completerando
column: trying to update all the rows in a big table can generate a huge transaction and swamp your server for a long time. - If you need to generate another random sample, run the UPDATE or UPDATEs again.
Select random sampling from sqlserver quickly
If you can use a pseudo-random sampling and you're on SQL Server 2005/2008, then take a look at TABLESAMPLE. For instance, an example from SQL Server 2008 / AdventureWorks 2008 which works based on rows:
USE AdventureWorks2008;
GO
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (100 ROWS)
WHERE EmailPromotion = 2;
The catch is that TABLESAMPLE isn't exactly random as it generates a given number of rows from each physical page. You may not get back exactly 5000 rows unless you limit with TOP as well. If you're on SQL Server 2000, you're going to have to either generate a temporary table which match the primary key or you're going to have to do it using a method using NEWID().
Select n random rows from SQL Server table
select top 10 percent * from [yourtable] order by newid()
In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.
select * from [yourtable] where [yourPk] in
(select top 10 percent [yourPk] from [yourtable] order by newid())
The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.
Related Topics
Why Does Using an Underscore Character in a Like Filter Give Me All the Results
SQL Primary Key: Integer VS Varchar
Finding the Next Available Id in MySQL
How to Compare Time in SQL Server
Sort Null Values to the End of a Table
Detect Overlapping Date Ranges from the Same Table
Given an Rgb Value What Would Be the Best Way to Find the Closest Match in the Database
How to Find Rows in One Table That Have No Corresponding Row in Another Table
Extracting Hours from a Datetime (SQL Server 2005)
Get Month Name from Date in Oracle
Name Database Design Notation You Prefer and Why
Column Name or Number of Supplied Values Does Not Match Table Definition
Postgres Not Using Index When Index Scan Is Much Better Option
SQL - Does the Order of Where Conditions Matter
How to Get the Number of Days Between 2 Dates in Oracle 11G