How Can I Do a "Greatest-N-Per-Group" Query in Django

How can I do a greatest-n-per-group query in Django?

You can take a look at these similar discussions:

Django Query That Get Most Recent Objects From Different Categories

“greatest-n-per-group” query in Django 2.0?

If you are using PostgreSQL:

Purchases.objects.filter(.....).order_by(
    'customer', '-field_of_interest'
).distinct('customer')

UPDATE: Window expressions are not allowed in filter(), so the following method does not work. Please refer to this answer for an up-to-date solution.

or with a Window expression:

from django.db.models import F, Max, Window

Purchases.objects.filter(.....).annotate(
    my_max=Window(
        expression=Max('field_of_interest'),
        partition_by=F('customer')
    )
).filter(my_max=F('field_of_interest'))

but the latter can yield multiple rows per customer if they share the same field_of_interest.
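The difference between the two approaches can be sketched in plain Python with hypothetical data (alice has a tied field_of_interest), mimicking what the Max-based filter and the RowNumber-based filter each keep:

```python
from itertools import groupby

# Hypothetical sample data: (customer, field_of_interest) pairs.
purchases = [
    ("alice", 10), ("alice", 10), ("alice", 7),
    ("bob", 5), ("bob", 3),
]

# Max-based filter: keep every row equal to its group's maximum (ties survive).
max_per_customer = {}
for customer, value in purchases:
    max_per_customer[customer] = max(max_per_customer.get(customer, value), value)
max_filtered = [(c, v) for c, v in purchases if v == max_per_customer[c]]

# Row-number-based filter: sort by customer, then value descending,
# and keep only the first row per customer (exactly one row per group).
ordered = sorted(purchases, key=lambda p: (p[0], -p[1]))
row_filtered = [next(group) for _, group in groupby(ordered, key=lambda p: p[0])]

print(max_filtered)  # alice appears twice because of the tie on 10
print(row_filtered)  # exactly one row per customer
```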

Another Window expression, this time with a single row per customer:

from django.db.models import F, Window
from django.db.models.functions import RowNumber

Purchases.objects.filter(.....).annotate(
    row_number=Window(
        expression=RowNumber(),
        partition_by=F('customer'),
        order_by=F('field_of_interest').desc()
    )
).filter(row_number=1)

Get top n records for each group with Django queryset

Another SQL query with similar output would use a window function that annotates each row with its row number within its particular group name; you would then filter for row numbers lower than or equal to 2 in an outer query.

At the moment of writing, Django does not support filtering based on a window function result, so you need to calculate the row numbers in the first query and filter the Person objects in a second query.

The following code is based on a similar question, but it limits the number of rows returned per group_name.

from django.db.models import F, Window
from django.db.models.functions import RowNumber

person_ids = {
    pk
    for pk, row_no_in_group in Person.objects.annotate(
        row_no_in_group=Window(
            expression=RowNumber(),
            partition_by=[F('group_name')],
            order_by=['group_name', F('age').desc(), 'person']
        )
    ).values_list('id', 'row_no_in_group')
    if row_no_in_group <= 2
}
filtered_persons = Person.objects.filter(id__in=person_ids)

For the following state of the Person table:

>>> Person.objects.order_by('group_name', '-age', 'person').values_list('group_name', 'age', 'person')
<QuerySet [(1, 19, 'Brian'), (1, 17, 'Brett'), (1, 14, 'Teresa'), (1, 13, 'Sydney'), (2, 20, 'Daniel'), (2, 18, 'Maureen'), (2, 14, 'Vincent'), (2, 12, 'Carlos'), (2, 11, 'Kathleen'), (2, 11, 'Sandra')]>

the queries above return:

>>> filtered_persons.order_by('group_name', '-age', 'person').values_list('group_name', 'age', 'person')
<QuerySet [(1, 19, 'Brian'), (1, 17, 'Brett'), (2, 20, 'Daniel'), (2, 18, 'Maureen')]>
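The window logic can be reproduced in plain Python against the sample rows above (an in-memory sketch of what RowNumber partitioned by group_name does, not the SQL Django actually emits):

```python
from itertools import groupby

# The sample Person rows shown above: (group_name, age, person).
people = [
    (1, 19, 'Brian'), (1, 17, 'Brett'), (1, 14, 'Teresa'), (1, 13, 'Sydney'),
    (2, 20, 'Daniel'), (2, 18, 'Maureen'), (2, 14, 'Vincent'), (2, 12, 'Carlos'),
    (2, 11, 'Kathleen'), (2, 11, 'Sandra'),
]

# Number the rows within each group_name (ordered by age descending,
# then name), and keep only those numbered 1 or 2.
ordered = sorted(people, key=lambda p: (p[0], -p[1], p[2]))
top_two = []
for _, group in groupby(ordered, key=lambda p: p[0]):
    for row_no, row in enumerate(group, start=1):
        if row_no <= 2:
            top_two.append(row)

print(top_two)
# [(1, 19, 'Brian'), (1, 17, 'Brett'), (2, 20, 'Daniel'), (2, 18, 'Maureen')]
```

This matches the queryset output above: the two oldest people from each group.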

limit n results per group by - Django queryset

As @exprator said, but with a small change:

categories = ForumCategory.objects.filter(class_id=class_id)
result_dict = {}
for cat in categories:
    forum_post = ForumPost.objects.filter(category__pk=cat.pk).order_by('-created')[:3]
    result_dict[cat.name] = forum_post

Then the queryset for each category can be accessed from the corresponding category name.
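The per-category slicing can be illustrated in plain Python with hypothetical data (note that the Django loop above issues one query per category):

```python
# Hypothetical posts: (category, created) pairs, newest first within each category.
posts = [
    ("general", "2023-05-04"), ("general", "2023-05-03"), ("general", "2023-05-02"),
    ("general", "2023-05-01"), ("help", "2023-05-02"), ("help", "2023-05-01"),
]

# Build {category_name: newest three posts}, mirroring result_dict above.
result = {}
for category, created in posts:
    result.setdefault(category, [])
    if len(result[category]) < 3:
        result[category].append(created)

print(result)
```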

How to get the latest n records per group in the Django ORM

The first thing to try in this situation is a select_related to fetch the phone_number for each dude.

dudes = Dude.objects.select_related('phone_number', 'phone_number__business').all()
for dude in dudes:
    do_the_thing()

(Note the select_related on the other object you're querying too, phone_number__business.)

In this case, if you have lots of phone_numbers per Dude, this might perform worse than your original query, as it will grab every Dude's phone_number.

Unfortunately, as the comments have suggested, there's no ORM way to limit a select_related. You'll need to write some SQL. You can get a head start by turning up DB logging to observe what SQL Django generated for the select_related query, and then running your own custom SQL query.

Django + PostgreSQL best way to improve performance of slow summary aggregation?

The solution below reduces query time from 5 minutes on my sample down to 20 seconds.

from collections import Counter

items = Item.objects.filter(...)
{
    "System_1": Counter(
        items
        .filter(record__system='System_1')
        .order_by('id', '-record__created')
        .values_list('record__status', flat=True)
        .distinct('id')
    ),
    "System_2": Counter(
        items
        .filter(record__system='System_2')
        .order_by('id', '-record__created')
        .values_list('record__status', flat=True)
        .distinct('id')
    ),
}

The resulting PostgreSQL query is given below; let me know if this can be improved:

SELECT DISTINCT ON ("api_item"."id") "api_record"."status"
FROM "api_item"
INNER JOIN "api_record_items" ON ("api_item"."id" = "api_record_items"."item_id")
INNER JOIN "api_record" ON ("api_record_items"."record_id" = "api_record"."id")
WHERE "api_record"."system" = 'System_1'
ORDER BY "api_item"."id" ASC, "api_record"."created" DESC

I don't like that I need to pull all the values back from the DB to count them; however, I've been unable to get aggregations to work with the distinct required to ensure only one record is counted per item.
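The DISTINCT ON plus Counter step can be sketched in plain Python with hypothetical rows, showing why only the latest record's status is counted per item:

```python
from collections import Counter

# Hypothetical rows: (item_id, created, status), as the queryset would
# return them -- ordered by id ascending, then created descending.
rows = [
    (1, "2023-05-02", "ok"), (1, "2023-05-01", "failed"),
    (2, "2023-05-03", "failed"), (2, "2023-05-01", "ok"),
    (3, "2023-04-30", "ok"),
]

# DISTINCT ON (id) keeps only the first row per item_id, i.e. the
# latest record; Counter then tallies one status per item.
seen = set()
latest_statuses = []
for item_id, _created, status in rows:
    if item_id not in seen:
        seen.add(item_id)
        latest_statuses.append(status)

print(Counter(latest_statuses))
```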

In case anyone can improve on this, the query plan is:

Unique  (cost=563327.33..1064477.67 rows=1010100 width=21) (actual time=16194.180..22301.422 rows=1010100 loops=1)
  ->  Gather Merge  (cost=563327.33..1053946.33 rows=4212534 width=21) (actual time=16194.179..21165.116 rows=22200000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=562327.31..566715.36 rows=1755222 width=21) (actual time=16140.068..17729.869 rows=7400000 loops=3)
              Sort Key: api_item.id, api_record.created DESC
              Sort Method: external merge  Disk: 247800kB
              Worker 0:  Sort Method: external merge  Disk: 246880kB
              Worker 1:  Sort Method: external merge  Disk: 244080kB
              ->  Parallel Hash Join  (cost=22117.30..308287.51 rows=1755222 width=21) (actual time=5719.348..8826.655 rows=7400000 loops=3)
                    Hash Cond: (api_record_items.item_id = api_item.id)
                    ->  Parallel Hash Join  (cost=2584.61..261932.34 rows=1755222 width=21) (actual time=17.042..3748.984 rows=7400000 loops=3)
                          Hash Cond: (api_record_items.record_id = api_record.id)
                          ->  Parallel Seq Scan on api_record_items  (cost=0.00..234956.18 rows=9291718 width=16) (actual time=0.472..1557.711 rows=7433333 loops=3)
                          ->  Parallel Hash  (cost=2402.11..2402.11 rows=14600 width=21) (actual time=16.479..16.480 rows=10012 loops=3)
                                Buckets: 32768  Batches: 1  Memory Usage: 1952kB
                                ->  Parallel Seq Scan on api_record  (cost=0.00..2402.11 rows=14600 width=21) (actual time=6.063..13.094 rows=10012 loops=3)
                                      Filter: ((system)::text = 'system_1'::text)
                                      Rows Removed by Filter: 33340
                    ->  Parallel Hash  (cost=12626.75..12626.75 rows=420875 width=8) (actual time=180.939..180.940 rows=336700 loops=3)
                          Buckets: 131072  Batches: 16  Memory Usage: 3520kB
                          ->  Parallel Seq Scan on api_item  (cost=0.00..12626.75 rows=420875 width=8) (actual time=0.022..95.567 rows=336700 loops=3)
Planning Time: 0.492 ms
Execution Time: 22401.646 ms

Django orm get latest for each group

This should work on Django 1.2+ and MySQL:

from django.db.models import F, Max

Score.objects.annotate(
    max_date=Max('student__score__date')
).filter(
    date=F('max_date')
)

