Best way to select random rows PostgreSQL

Fast ways

Given your specifications (plus additional info in the comments),

  • You have a numeric ID column (integers) with few (or moderately few) gaps.
  • Obviously no or few write operations.
  • Your ID column has to be indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate to replace the full count will do just fine, available at almost no cost:

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM pg_class
WHERE oid = 'big'::regclass; -- your table name

Detailed explanation:

  • Fast way to discover the row count of a table in PostgreSQL

As long as ct isn't much smaller than id_span, the query will outperform other approaches.

WITH params AS (
   SELECT 1       AS min_id   -- minimum id <= current min id
        , 5100000 AS id_span  -- rounded up. (max_id - min_id + buffer)
   )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p
        , generate_series(1, 1100) g  -- 1000 + buffer
   GROUP  BY 1                        -- trim duplicates
   ) r
JOIN   big USING (id)
LIMIT  1000;                          -- trim surplus
  • Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.

  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).

  • Join the ids to the big table. This should be very fast with the index in place.

  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
TABLE  random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps, so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached, or we have to start with a buffer so large that it defeats the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with the same table with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (            -- get current estimate from system
      SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
      FROM   pg_class
      WHERE  oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus          -- hint for query planner
         ) r (id)
      JOIN   big USING (id)       -- eliminate misses

      UNION                       -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick       -- just to make it recursive
         LIMIT  _limit            -- hint for query planner
         ) r (id)
      JOIN   big USING (id)       -- eliminate misses
      )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

Generic function

We can make this generic to work for any table with a unique integer column (typically the PK): Pass the table as polymorphic type and (optionally) the name of the PK column and use EXECUTE:

CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
                                         , _id text = 'id'
                                         , _limit int = 1000
                                         , _gaps real = 1.03)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   -- safe syntax with schema & quotes where needed
   _tbl text := pg_typeof(_tbl_type)::text;
   _estimate int := (SELECT (reltuples / relpages
                          * (pg_relation_size(oid) / 8192))::bigint
                     FROM   pg_class  -- get current estimate from system
                     WHERE  oid = _tbl::regclass);
BEGIN
   RETURN QUERY EXECUTE format(
   $$
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   generate_series(1, $2) g
         LIMIT  $2                -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)    -- eliminate misses

      UNION                       -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   random_pick       -- just to make it recursive
         LIMIT  $3                -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)    -- eliminate misses
      )
   TABLE  random_pick
   LIMIT  $3;
   $$
 , _tbl, _id
   )
   USING _estimate              -- $1
       , (_limit * _gaps)::int  -- $2 ("surplus")
       , _limit                 -- $3
   ;
END
$func$;

Call with defaults (important!):

SELECT * FROM f_random_sample(null::big);  --!

Or more specifically:

SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);

About the same performance as the static version.

Related:

  • Refactor a PL/pgSQL function to return the output of various SELECT queries - chapter "Various complete table types"
  • Return SETOF rows from PostgreSQL function
  • Format specifier for integer variables in format() for EXECUTE?
  • INSERT with dynamic table name in trigger function

This is safe against SQL injection. See:

  • Table name as a PostgreSQL function parameter
  • SQL injection in Postgres functions vs prepared queries

Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a MATERIALIZED VIEW. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
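A minimal sketch, assuming the f_random_sample() function from above; the view name is arbitrary:

CREATE MATERIALIZED VIEW random_sample AS
SELECT * FROM f_random_sample();  -- or any of the queries above

REFRESH MATERIALIZED VIEW random_sample;  -- re-roll the selection when desired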

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single
argument which is the fraction of the table to sample, expressed as a
percentage between 0 and 100. This argument can be any real-valued expression.

It's very fast, but the result is not exactly random. The manual again:

The SYSTEM method is significantly faster than the BERNOULLI method
when small sampling percentages are specified, but it may return a
less-random sample of the table as a result of clustering effects.

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
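If you want the same sample back on repeated calls (as long as the table hasn't changed), add the optional REPEATABLE clause with a seed of your choice; the 42 here is arbitrary:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0) REPEATABLE (42);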

Related:

  • Fast way to discover the row count of a table in PostgreSQL

Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
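The module ships with standard PostgreSQL as a contrib extension; enable it once per database before the query above will work:

CREATE EXTENSION tsm_system_rows;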

See Evan's answer for details.

But that's still not exactly random.

quick random row selection in Postgres

You might want to experiment with OFFSET, as in

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Here N is the number of rows in mytable. You may need to first do a SELECT count(*) to determine the value of N.
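If a separate count is inconvenient, a scalar subquery can supply N inline; a minimal sketch (the count(*) still scans the whole table, so for big tables consider the reltuples estimate shown earlier):

SELECT myid
FROM   mytable
OFFSET floor(random() * (SELECT count(*) FROM mytable))
LIMIT  1;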

Update (by Antony Hatchkins)

You must use floor here:

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Consider a table of 2 rows: random()*N generates 0 <= x < 2, and, for example, SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows because OFFSET implicitly rounds to the nearest integer.
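To reproduce the effect with a throwaway table (hypothetical names and values):

CREATE TEMP TABLE mytable (myid int);
INSERT INTO mytable VALUES (1), (2);

SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;         -- 0 rows: 1.7 rounds to 2, skipping both
SELECT myid FROM mytable OFFSET floor(1.7) LIMIT 1;  -- 1 row: skips one row, returns myid = 2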

best way to select a single random row from table

This works regardless of the table size (though it sorts all matching rows, so it slows down on large result sets):

SELECT * FROM "User"
WHERE "nickname" LIKE 'test%'
ORDER BY random()
LIMIT 1

Selecting Random rows from a table in PostgreSQL

To exclude duplicate rows you can use SELECT DISTINCT ON (prod.prod_id) in a subquery:

SELECT *
FROM  (SELECT DISTINCT ON (prod.prod_id)
              prod.prod_id,
              prod.prod_subcateg_id,
              prod.prod_name,
              prod.prod_desc
       FROM   product prod
       WHERE  prod.prod_subcateg_id = 4
       AND    prod.prod_datetill IS NULL) AS product
ORDER BY random()
LIMIT 5;

Would it be possible to select random rows with a little preference for a specific column?

First, I want to explain how we can select random records from a table. In PostgreSQL, we can use the random() function in the ORDER BY clause. Example:

select * from test_table
order by random()
limit 1;

I am using LIMIT 1 to select only one record. But with this method, query performance is very bad for large tables (over 100 million rows).

The second way: you can select records manually using random() if the table has an id field. This approach performs much better.
Let's first write our own randomizing function so it is easy to use in our queries.

CREATE OR REPLACE FUNCTION random_between(low integer, high integer)
  RETURNS integer
  LANGUAGE plpgsql
  STRICT
AS $function$
BEGIN
   RETURN floor(random() * (high - low + 1) + low);
END;
$function$;

This function returns a random integer between the two input values (inclusive). Then we can write a query using this function. Example:

select * from test_table 
where id = (select random_between(min(id), max(id)) from test_table);

I tested this query on a table with 150 million rows and it performs best: duration 12 ms. If you need many rows rather than just one, you can write WHERE id > instead of WHERE id =, as sketched below. Note that if the random value happens to fall into a gap in the id space, WHERE id = matches no row.
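A sketch of the many-rows variant (hypothetical limit of 20). The rows come back adjacent in id order, so only the starting point is random, and fewer rows may be returned near the maximum id:

SELECT *
FROM   test_table
WHERE  id > (SELECT random_between(min(id), max(id)) FROM test_table)
ORDER  BY id
LIMIT  20;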

Now, about your little preference: I don't know your detailed business logic or the conditions you want to apply when randomizing, so I can only write some sample queries to illustrate the mechanism. PostgreSQL has no built-in function for randomizing data with preferences; we must write this logic manually. I created a sample table for testing our queries.

CREATE TABLE test_table (
   id          serial4 NOT NULL,
   is_created  bool NULL,
   action_date date NULL,
   CONSTRAINT test_table_pkey PRIMARY KEY (id)
);
CREATE INDEX test_table_id_idx ON test_table USING btree (id);

For example, I want to give more preference to rows whose action_date is closest to today. Sample query:

SELECT id,
       is_created,
       action_date,
       extract(day from (now() - action_date)) AS dif_days
FROM   test.test_table
WHERE  id > (SELECT random_between(min(id), max(id)) FROM test.test_table)
AND    extract(day from (now() - action_date)) = random_between(0, 6)
LIMIT  1;

In this query, extract(day from (now() - action_date)) AS dif_days returns the difference in days between action_date and today. In the WHERE clause, I first select rows whose id is greater than the resulting random value. Then, with extract(day from (now() - action_date)) = random_between(0, 6), I keep only rows whose action_date is at most 6 days ago (it could equally be 4 or 2 days ago; the maximum is 6 days ago).

You can write many such preference queries (for example, giving more weight via boolean fields: closed or open, etc.).

PostgreSQL select 20 random records

SELECT column FROM table
ORDER BY random()
LIMIT 20;

Select random rows according to a given criteria PostgreSQL

Based on the EXPLAIN plan, your table is large. About 2 rows per page. Either it is very bloated, or the rows themselves are very wide.

The key to getting good performance is probably to get it to use an index-only scan, by creating an index which contains all 4 columns referenced in your query. The column tested for equality should come first. After that, you have to choose between your two range-or-inequality queried columns ("last_active" or "rating"), based on whichever you think will be more selective. Then you add the other range-or-inequality column and the id column to the end, so that an index-only scan can be used. So maybe create an index on app_user (country, last_active, rating, id), as sketched below. That will probably be good enough.
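A sketch of that suggested covering index (the index name is arbitrary):

CREATE INDEX app_user_country_active_rating_idx
ON app_user (country, last_active, rating, id);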

You could also try a GiST index on those same columns. This has the theoretical advantage that the two range-or-inequality restrictions can be used together in defining which index pages to look at. But in practice, GiST indexes have very high overhead, and this overhead would likely exceed the theoretical benefit.

If the above aren't good enough, you could try partitioning. But how exactly you do that should be based on a holistic view of your application, not just one query.


