Select random row from a PostgreSQL table with weighted row probabilities
This should do the trick:
WITH CTE AS (
SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
SELECT id, SUM(percent) OVER (ORDER BY id) S, R
FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;
The sub-query Q
gives the following result:
1 50
2 85
3 100
We then simply generate a random number in range [0, 100) and pick the first row that is at or beyond that number (the WHERE
clause). We use common table expression (WITH
) to ensure the random number is calculated only once.
BTW, the SELECT SUM(percent) FROM YOUR_TABLE
allows you to have any weights in percent
- they don't strictly need to be percentages (i.e. add-up to 100).
[SQL Fiddle]
Random row selection with weighted filters in SQL/PostgreSQL
Ok. Managed to solve it. Basically, used the strategy already outlined in the question and a little help from here -- I had already seen this post before, but I was (and still am) trying to solve in a more elegant way -- something like this but for multiple rows --, not needing to create the "bounds" by hand.
Let's try step-by-step:
Since the filters, with the weights, come from outside the schema, let's create a CTE:
WITH filters (type, id, weight) AS (
SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
)
Now, let's filter the rows, ignoring the weight (for now), so later we don't need to work with the whole table:
WITH filtered_questions AS (
SELECT
q.id,
s.id subject_id,
a.id area_id,
i.id institution_id
FROM
public.questions q
INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
INNER JOIN public.subjects s ON s.id = sq.subject_id
INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
INNER JOIN public.institutions i ON i.id = iq.institution_id
INNER JOIN public.areas_questions aq ON aq.question_id = q.id
INNER JOIN public.areas a ON a.id = aq.area_id
WHERE
subject_id IN (SELECT id from filters where type = 'subject')
and institution_id IN (SELECT id from filters where type = 'institution')
and area_id IN (SELECT id from filters where type = 'area')
)
The same question can be selected by multiple filters, increasing the chance of it being selected. We must update the weights to solve this.
WITH filtered_questions_weights_sum AS (
SELECT
q.id,
SUM(filters.weight) weight_sum
FROM filtered_questions q
INNER JOIN filters
ON (filters.type = 'subject' AND q.subject_id IN(filters.id))
OR (filters.type = 'area' AND q.area_id IN(filters.id))
OR (filters.type = 'institution' AND q.institution_id IN(filters.id))
GROUP BY q.id
)
Generating the bounds, like exposed here.
WITH cumulative_prob AS (
SELECT
id,
SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
FROM filtered_questions_weights_sum
),
cumulative_bounds AS (
SELECT
id,
COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM cumulative_prob
)
Generating the random series. Had to re-normalize (random() * (SELECT SUM(weight_sum)
) because the weights were updated in a previous step. 10 is the number of rows that we want to return.
WITH random_series AS (
SELECT generate_series (1,10),random() * (SELECT SUM(weight_sum) FROM filtered_questions_weights_sum) AS R
)
And finally:
SELECT
id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
And we get the following distribution:
id lower_cum_bound upper_cum_bound r
------------------------------------ --------------- --------------- -------------------
380f46e9-f373-4b89-a863-05f484e6b3b6 0 2.0 0.41090718149207534
42bcb088-fc19-4272-8c49-e77999edd01c 2.0 3.9 3.4483200465794654
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9 5.9 5.159445870062337
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9 5.9 5.524481557868421
972d0296-acc3-4b44-b67d-928049d5e9c2 5.9 7.8 6.842470594821498
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6 13.5 12.207371663767844
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6 13.5 12.674184153741226
c935e3de-f1b6-4399-b5eb-ed3a9194eb7b 15.5 17.5 17.16804686235264
e5061aeb-53b7-4247-8404-87508c5ac723 21.4 23.4 22.622627633158118
f8c37700-0c3a-457e-8882-7c65269482ea 25.4 27.3 26.841821723571048
Putting it all together:
WITH filters (type, id, weight) AS (
SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
)
,
filtered_questions AS
(
SELECT
q.id,
SUM(filters.weight) weight_sum
FROM
public.questions q
INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
INNER JOIN public.subjects s ON s.id = sq.subject_id
INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
INNER JOIN public.institutions i ON i.id = iq.institution_id
INNER JOIN public.activity_areas_questions aq ON aq.question_id = q.id
INNER JOIN public.activity_areas a ON a.id = aq.activity_area_id
INNER JOIN filters
ON (filters.type = 'subject' AND s.id IN(filters.id))
OR (filters.type = 'area' AND a.id IN(filters.id))
OR (filters.type = 'institution' AND i.id IN(filters.id))
WHERE
s.id IN (SELECT id from filters where type = 'subject')
and i.id IN (SELECT id from filters where type = 'institution')
and a.id IN (SELECT id from filters where type = 'area')
GROUP BY q.id
)
,
cumulative_prob AS (
SELECT
id,
SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
FROM filtered_questions
)
,
cumulative_bounds AS (
SELECT
id,
COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM cumulative_prob
)
,
random_series AS
(
SELECT generate_series (1,14),random() * (SELECT SUM(weight_sum) FROM filtered_questions) AS R
)
SELECT id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
Weighted Random Selection
Given the small input dataset, it's straightforward to do this in pure SQL. Use CTEs to build lower & upper bound columns for each row in each of the common_FOO_name tables, then use generate_series() to generate sets of random numbers. Join everything together, and use the random value between the bounds as the WHERE clause.
with first_names_weighted as (
select first_name,
sum(ratio) over (order by first_name) - ratio as lower_bound,
sum(ratio) over (order by first_name) as upper_bound
from common_first_name
),
last_names_weighted as (
select last_name,
sum(ratio) over (order by last_name) - ratio as lower_bound,
sum(ratio) over (order by last_name) as upper_bound
from common_last_name
),
randoms as (
select random() * (select sum(ratio) from common_first_name) as f_random,
random() * (select sum(ratio) from common_last_name) as l_random
from generate_series(1, 32)
)
select r, first_name, last_name
from randoms r
cross join first_names_weighted f
cross join last_names_weighted l
where f.lower_bound <= r.f_random and r.f_random <= f.upper_bound
and l.lower_bound <= r.l_random and r.l_random <= l.upper_bound;
Change the value passed to generate_series()
to control how many names to generate. If it's important that it be a function, you can just use a LANGAUGE SQL function definition to parameterize that number:
https://www.db-fiddle.com/f/mmGQRhCP2W1yfhZTm1yXu5/3
How to dynamically perform a weighted random row selection in PostgreSQL?
Here is a solution that works as follows:
- first compute the weight of each student
- sum the weight of all students and multiply if by a random seed
- then pick the first student above that target, random, weight
Query:
with
student_with_weight as (
select
id,
(
extract(epoch from (now() - last_played_datetime)) / 60 * 0.75
+ total_play_duration * 0.5
+ total_points_earned * 0.25
) / 1.5 weight
from student
),
random_weight as (
select random() * (select sum(weight) weight from student_with_weight ) weight
)
select id
from
student_with_weight s
inner join random_weight r on s.weight >= r.weight
order by id
limit 1;
Infinite scroll algorithm for random items with different weight ( probability to show to the user )
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
- log(R) / W,
where—
- W is the record's weight greater than 0 (itself its own column), and
- R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM()
for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND()
with T-SQL's NEWID()
).
quick random row in PostgreSQL: why time (floor(random()*N)) + (select * from a where id = const) 100 times less then select where id = random?
This is because random()
is defined as volatile, so Postgres evaluates it for each and every row again - effectively going through all rows.
If you want to prevent that, "hide" it behind a (otherwise useless) subselect:
SELECT *
FROM a
where id = (select trunc(random ()*1000)::integer);
Related Topics
How to Save the Result of a SQL Query into a Variable in Vba
Postgresql Selecting Most Recent Entry for a Given Id
Oracle - Update Join - Non Key-Preserved Table
How to Bulk Insert Only New Rows in Postresql
Why Is Postgresql Not Using My Indexes on a Small Table
How to See Cakephp's SQL Dump in the Controller
Difference Between Select Unique and Select Distinct
SQL Server Management Studio 2008 Runas User on Different Domain Over Vpn
SQL Query for Parent Child Relationship
SQL Server 2008 Iif Statement Does Not Seem Enabled
How to Return Default Value from SQL Query
Why am I Able to Call the Class Method as If It Were an Instance Method Here
How to Create Daylight Savings Time Start and End Function in SQL Server
Role of Selectivity in Index Scan/Seek
SQL Sort Order with Null Values Last
How to Exclude a Column from Select Query
SQL Same Unit Between Two Tables Needs Order Numbers in 1 Cell
How to Get the Actual Stored Procedure Line Number from an Error Message