Best way to select random rows PostgreSQL

Fast ways

Given your specifications (plus additional info in the comments),

  • You have a numeric ID column (integers) with few (or moderately few) gaps.
  • Obviously no or few write operations.
  • Your ID column has to be indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate to replace the full count will do just fine, available at almost no cost:

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM pg_class
WHERE oid = 'big'::regclass; -- your table name

Detailed explanation:

  • Fast way to discover the row count of a table in PostgreSQL

As long as ct isn't much smaller than id_span, the query will outperform other approaches.

WITH params AS (
   SELECT 1       AS min_id   -- minimum id <= current min id
        , 5100000 AS id_span  -- rounded up. (max_id - min_id + buffer)
   )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p
        , generate_series(1, 1100) g  -- 1000 + buffer
   GROUP  BY 1                        -- trim duplicates
   ) r
JOIN   big USING (id)
LIMIT  1000;                          -- trim surplus
  • Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.

  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).

  • Join the ids to the big table. This should be very fast with the index in place.

  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
TABLE  random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps, so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached, or we have to start with a buffer so large that it defeats the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with the same table with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (            -- get current estimate from system
      SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
      FROM   pg_class
      WHERE  oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus          -- hint for query planner
         ) r (id)
      JOIN   big USING (id)       -- eliminate misses

      UNION                       -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick       -- just to make it recursive
         LIMIT  _limit            -- hint for query planner
         ) r (id)
      JOIN   big USING (id)       -- eliminate misses
      )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

Generic function

We can make this generic to work for any table with a unique integer column (typically the PK): Pass the table as polymorphic type and (optionally) the name of the PK column and use EXECUTE:

CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
                                         , _id text = 'id'
                                         , _limit int = 1000
                                         , _gaps real = 1.03)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   -- safe syntax with schema & quotes where needed
   _tbl text := pg_typeof(_tbl_type)::text;
   _estimate int := (SELECT (reltuples / relpages
                          * (pg_relation_size(oid) / 8192))::bigint
                     FROM   pg_class  -- get current estimate from system
                     WHERE  oid = _tbl::regclass);
BEGIN
   RETURN QUERY EXECUTE format(
   $$
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   generate_series(1, $2) g
         LIMIT  $2                -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)    -- eliminate misses

      UNION                       -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   random_pick       -- just to make it recursive
         LIMIT  $3                -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)    -- eliminate misses
      )
   TABLE  random_pick
   LIMIT  $3;
   $$
 , _tbl, _id
   )
   USING _estimate              -- $1
       , (_limit * _gaps)::int  -- $2 ("surplus")
       , _limit                 -- $3
   ;
END
$func$;

Call with defaults (important!):

SELECT * FROM f_random_sample(null::big);  --!

Or more specifically:

SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);

About the same performance as the static version.

Related:

  • Refactor a PL/pgSQL function to return the output of various SELECT queries - chapter "Various complete table types"
  • Return SETOF rows from PostgreSQL function
  • Format specifier for integer variables in format() for EXECUTE?
  • INSERT with dynamic table name in trigger function

This is safe against SQL injection. See:

  • Table name as a PostgreSQL function parameter
  • SQL injection in Postgres functions vs prepared queries

Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a MATERIALIZED VIEW. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
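A minimal sketch, assuming the f_random_sample() function from above; the view name is arbitrary:

CREATE MATERIALIZED VIEW random_sample AS
SELECT * FROM f_random_sample();  -- or any of the queries above

REFRESH MATERIALIZED VIEW random_sample;  -- re-roll the selection when desired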

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single
argument which is the fraction of the table to sample, expressed as a
percentage between 0 and 100. This argument can be any real-valued expression.

It's very fast, but the result is not exactly random. The manual again:

The SYSTEM method is significantly faster than the BERNOULLI method
when small sampling percentages are specified, but it may return a
less-random sample of the table as a result of clustering effects.

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
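If you want the same sample back on repeated calls (as long as the table hasn't changed), add the optional REPEATABLE clause with a seed of your choice; the 42 here is arbitrary:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0) REPEATABLE (42);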

Related:

  • Fast way to discover the row count of a table in PostgreSQL

Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
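The module ships with standard PostgreSQL as a contrib extension; enable it once per database before the query above will work:

CREATE EXTENSION tsm_system_rows;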

See Evan's answer for details.

But that's still not exactly random.

quick random row selection in Postgres

You might want to experiment with OFFSET, as in

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Here N is the number of rows in mytable. You may need to first do a SELECT count(*) to determine the value of N.
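If a separate count is inconvenient, a scalar subquery can supply N inline; a minimal sketch (the count(*) still scans the whole table, so for big tables consider the reltuples estimate shown earlier):

SELECT myid
FROM   mytable
OFFSET floor(random() * (SELECT count(*) FROM mytable))
LIMIT  1;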

Update (by Antony Hatchkins)

You must use floor here:

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Consider a table of 2 rows: random()*N generates 0 <= x < 2, and, for example, SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows because OFFSET implicitly rounds to the nearest integer.
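To reproduce the effect with a throwaway table (hypothetical names and values):

CREATE TEMP TABLE mytable (myid int);
INSERT INTO mytable VALUES (1), (2);

SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;         -- 0 rows: 1.7 rounds to 2, skipping both
SELECT myid FROM mytable OFFSET floor(1.7) LIMIT 1;  -- 1 row: skips one row, returns myid = 2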

best way to select a single random row from table

This works regardless of the table size (though it sorts all matching rows, so it slows down on large result sets):

SELECT * FROM "User"
WHERE "nickname" LIKE 'test%'
ORDER BY random()
LIMIT 1

Selecting Random rows from a table in PostgreSQL

To exclude duplicate rows you can use SELECT DISTINCT ON (prod.prod_id) in a subquery:

SELECT *
FROM  (SELECT DISTINCT ON (prod.prod_id)
              prod.prod_id,
              prod.prod_subcateg_id,
              prod.prod_name,
              prod.prod_desc
       FROM   product prod
       WHERE  prod.prod_subcateg_id = 4
       AND    prod.prod_datetill IS NULL) AS product
ORDER BY random()
LIMIT 5;

Would it be possible to select random rows with a little preference for a specific column?

First, I want to explain how we can select random records from a table. In PostgreSQL, we can use the random() function in the ORDER BY clause. Example:

select * from test_table
order by random()
limit 1;

I am using LIMIT 1 to select only one record. But with this method, query performance is very bad for large tables (over 100 million rows).

The second way: you can select records manually using random() if the table has an id field. This approach performs much better.
Let's first write our own randomizing function so it is easy to use in our queries.

CREATE OR REPLACE FUNCTION random_between(low integer, high integer)
  RETURNS integer
  LANGUAGE plpgsql
  STRICT
AS $function$
BEGIN
   RETURN floor(random() * (high - low + 1) + low);
END;
$function$;

This function returns a random integer between the two input values (inclusive). Then we can write a query using this function. Example:

select * from test_table 
where id = (select random_between(min(id), max(id)) from test_table);

I tested this query on a table with 150 million rows and it performs best: duration 12 ms. If you need many rows rather than just one, you can write WHERE id > instead of WHERE id =, as sketched below. Note that if the random value happens to fall into a gap in the id space, WHERE id = matches no row.
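A sketch of the many-rows variant (hypothetical limit of 20). The rows come back adjacent in id order, so only the starting point is random, and fewer rows may be returned near the maximum id:

SELECT *
FROM   test_table
WHERE  id > (SELECT random_between(min(id), max(id)) FROM test_table)
ORDER  BY id
LIMIT  20;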

Now, about your little preference: I don't know your detailed business logic or the conditions you want to apply when randomizing, so I can only write some sample queries to illustrate the mechanism. PostgreSQL has no built-in function for randomizing data with preferences; we must write this logic manually. I created a sample table for testing our queries.

CREATE TABLE test_table (
   id          serial4 NOT NULL,
   is_created  bool NULL,
   action_date date NULL,
   CONSTRAINT test_table_pkey PRIMARY KEY (id)
);
CREATE INDEX test_table_id_idx ON test_table USING btree (id);

For example, I want to give more preference to rows whose action_date is closest to today. Sample query:

SELECT id,
       is_created,
       action_date,
       extract(day from (now() - action_date)) AS dif_days
FROM   test.test_table
WHERE  id > (SELECT random_between(min(id), max(id)) FROM test.test_table)
AND    extract(day from (now() - action_date)) = random_between(0, 6)
LIMIT  1;

In this query, extract(day from (now() - action_date)) AS dif_days returns the difference in days between action_date and today. In the WHERE clause, I first select rows whose id is greater than the resulting random value. Then, with extract(day from (now() - action_date)) = random_between(0, 6), I keep only rows whose action_date is at most 6 days ago (it could equally be 4 or 2 days ago; the maximum is 6 days ago).

You can write many such preference queries (for example, giving more weight via boolean fields: closed or open, etc.).

PostgreSQL select 20 random records

SELECT column FROM table
ORDER BY random()
LIMIT 20;

Select random rows according to a given criteria PostgreSQL

Based on the EXPLAIN plan, your table is large. About 2 rows per page. Either it is very bloated, or the rows themselves are very wide.

The key to getting good performance is probably to get it to use an index-only scan, by creating an index which contains all 4 columns referenced in your query. The column tested for equality should come first. After that, you have to choose between your two range-or-inequality queried columns ("last_active" or "rating"), based on whichever you think will be more selective. Then you add the other range-or-inequality column and the id column to the end, so that an index-only scan can be used. So maybe create an index on app_user (country, last_active, rating, id), as sketched below. That will probably be good enough.
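A sketch of that suggested covering index (the index name is arbitrary):

CREATE INDEX app_user_country_active_rating_idx
ON app_user (country, last_active, rating, id);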

You could also try a GiST index on those same columns. This has the theoretical advantage that the two range-or-inequality restrictions can be used together in defining which index pages to look at. But in practice, GiST indexes have very high overhead, and this overhead would likely exceed the theoretical benefit.

If the above aren't good enough, you could try partitioning. But how exactly you do that should be based on a holistic view of your application, not just one query.


