How can I do a greatest-n-per-group query in Django?
You can take a look at these similar discussions:
Django Query That Get Most Recent Objects From Different Categories
“greatest-n-per-group” query in Django 2.0?
If you are using PostgreSQL:
Purchases.objects.filter(.....).order_by(
'customer', '-field_of_interest'
).distinct('customer')
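Under the hood, that queryset compiles to a PostgreSQL DISTINCT ON query, roughly like the following (a sketch; the table and column names assume a default Purchases model with a customer foreign key):

```sql
SELECT DISTINCT ON ("customer_id") *
FROM "app_purchases"
ORDER BY "customer_id", "field_of_interest" DESC;
```

DISTINCT ON keeps the first row per customer_id, so the descending order on field_of_interest is what makes it the greatest row.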
UPDATE: Window expressions are not allowed in filter(), so the following methods do not work. Please refer to this answer for an up-to-date solution.
Or with a Window expression:
from django.db.models import F, Max, Window

Purchases.objects.filter(.....).annotate(
    my_max=Window(
        expression=Max('field_of_interest'),
        partition_by=F('customer')
    )
).filter(my_max=F('field_of_interest'))
but the latter can yield multiple rows per customer if they share the same field_of_interest.
Another Window expression, yielding a single row per customer:
from django.db.models import F, Window
from django.db.models.functions import RowNumber

Purchases.objects.filter(.....).annotate(
    row_number=Window(
        expression=RowNumber(),
        partition_by=F('customer'),
        order_by=F('field_of_interest').desc()
    )
).filter(row_number=1)
Get top n records for each group with Django queryset
Another SQL query with similar output would use a window function that annotates each row with its row number within its group_name, and would then filter for row numbers lower than or equal to 2 in an outer query.
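A sketch of that SQL (the people_person table name is an assumption based on a default people app):

```sql
SELECT id, group_name, age, person
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY group_name
               ORDER BY age DESC, person
           ) AS row_no
    FROM people_person
) ranked
WHERE row_no <= 2;
```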
At the time of writing, Django does not support filtering based on a window function result, so you need to calculate the row numbers in the first query and filter the Person rows in the second query.
The following code is based on a similar question, but it limits the number of rows returned per group_name.
from django.db.models import F, Window
from django.db.models.functions import RowNumber

person_ids = {
    pk
    for pk, row_no_in_group in Person.objects.annotate(
        row_no_in_group=Window(
            expression=RowNumber(),
            partition_by=[F('group_name')],
            order_by=['group_name', F('age').desc(), 'person']
        )
    ).values_list('id', 'row_no_in_group')
    if row_no_in_group <= 2
}
filtered_persons = Person.objects.filter(id__in=person_ids)
For the following state of the Person table:
>>> Person.objects.order_by('group_name', '-age', 'person').values_list('group_name', 'age', 'person')
<QuerySet [(1, 19, 'Brian'), (1, 17, 'Brett'), (1, 14, 'Teresa'), (1, 13, 'Sydney'), (2, 20, 'Daniel'), (2, 18, 'Maureen'), (2, 14, 'Vincent'), (2, 12, 'Carlos'), (2, 11, 'Kathleen'), (2, 11, 'Sandra')]>
the queries above return:
>>> filtered_persons.order_by('group_name', '-age', 'person').values_list('group_name', 'age', 'person')
<QuerySet [(1, 19, 'Brian'), (1, 17, 'Brett'), (2, 20, 'Daniel'), (2, 18, 'Maureen')]>
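For illustration, the same top-2-per-group selection can be sketched in plain Python over the sample rows above (the window-function approach does this in the database; this is just the equivalent logic):

```python
from itertools import groupby, islice

# Sample rows mirroring the Person table above: (group_name, age, person),
# already sorted by group, then age descending, then name.
rows = [
    (1, 19, 'Brian'), (1, 17, 'Brett'), (1, 14, 'Teresa'), (1, 13, 'Sydney'),
    (2, 20, 'Daniel'), (2, 18, 'Maureen'), (2, 14, 'Vincent'),
    (2, 12, 'Carlos'), (2, 11, 'Kathleen'), (2, 11, 'Sandra'),
]

# Keep at most 2 rows per group_name -- the in-Python equivalent of
# filtering on ROW_NUMBER() <= 2.
top2 = [row for _, grp in groupby(rows, key=lambda r: r[0])
        for row in islice(grp, 2)]

print(top2)
# [(1, 19, 'Brian'), (1, 17, 'Brett'), (2, 20, 'Daniel'), (2, 18, 'Maureen')]
```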
limit n results per group by - Django queryset
As @exprator said, but with a small change:
categories = ForumCategory.objects.filter(class_id=class_id)
result_dict = {}
for cat in categories:
    forum_post = ForumPost.objects.filter(category__pk=cat.pk).order_by('-created')[:3]
    result_dict[cat.name] = forum_post
Then the queryset for each category can be accessed from the corresponding category name.
How to get the latest n records per group in the Django ORM
The first thing to try in this situation is a select_related to fetch the phone_number for each dude.
dudes = Dude.objects.select_related('phone_number', 'phone_number__business').all()
for dude in dudes:
    do_the_thing()
(note the select_related on the other object you're querying too, phone_number.business)
In this case, if you have lots of phone_numbers per Dude, then this might perform worse than your original query, as it will grab every Dude.phone_number.
Unfortunately, as the comments have suggested, there's no ORM way to limit the select_related. You'll need to write some SQL. You can get a head start by observing what SQL Django generated for the select_related query by turning up DB logging, and then running your own custom SQL query.
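One common shape for that hand-written SQL is a LATERAL join that caps the number of phone numbers fetched per dude. This is only a sketch: the table names (app_dude, app_phonenumber), the dude_id foreign key, and the created ordering column are assumptions, not taken from the question:

```sql
SELECT d.*, p.*
FROM app_dude d
LEFT JOIN LATERAL (
    SELECT *
    FROM app_phonenumber pn
    WHERE pn.dude_id = d.id      -- assumed FK column
    ORDER BY pn.created DESC     -- assumed ordering column
    LIMIT 3
) p ON true;
```

You could run this through Dude.objects.raw(...) or a plain cursor, depending on how much of the model layer you still need.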
Django + PostgreSQL best way to improve performance of slow summary aggregation?
The solution below reduces query time from 5 minutes on my sample down to 20 s.
from collections import Counter

items = Item.objects.filter(...)
{
    "System_1": Counter(
        items.filter(record__system='System_1')
             .order_by('id', '-record__created')
             .values_list('record__status', flat=True)
             .distinct('id')
    ),
    "System_2": Counter(
        items.filter(record__system='System_2')
             .order_by('id', '-record__created')
             .values_list('record__status', flat=True)
             .distinct('id')
    ),
}
The resulting PostgreSQL query is given below; let me know if this can be improved:
SELECT DISTINCT ON ("api_item"."id") "api_record"."status"
FROM "api_item"
INNER JOIN "api_record_items" ON ("api_item"."id" = "api_record_items"."item_id")
INNER JOIN "api_record" ON ("api_record_items"."record_id" = "api_record"."id")
WHERE "api_record"."system" = 'System_1'
ORDER BY "api_item"."id" ASC, "api_record"."created" DESC
I don't like that I need to pull all the values back from the DB to count them; however, I've been unable to get aggregations to work with the DISTINCT required to ensure only one record is counted per item.
In case anyone can improve this, the query plan is:
Unique  (cost=563327.33..1064477.67 rows=1010100 width=21) (actual time=16194.180..22301.422 rows=1010100 loops=1)
  ->  Gather Merge  (cost=563327.33..1053946.33 rows=4212534 width=21) (actual time=16194.179..21165.116 rows=22200000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=562327.31..566715.36 rows=1755222 width=21) (actual time=16140.068..17729.869 rows=7400000 loops=3)
              Sort Key: api_item.id, api_record.created DESC
              Sort Method: external merge  Disk: 247800kB
              Worker 0:  Sort Method: external merge  Disk: 246880kB
              Worker 1:  Sort Method: external merge  Disk: 244080kB
              ->  Parallel Hash Join  (cost=22117.30..308287.51 rows=1755222 width=21) (actual time=5719.348..8826.655 rows=7400000 loops=3)
                    Hash Cond: (api_record_items.item_id = api_item.id)
                    ->  Parallel Hash Join  (cost=2584.61..261932.34 rows=1755222 width=21) (actual time=17.042..3748.984 rows=7400000 loops=3)
                          Hash Cond: (api_record_items.record_id = api_record.id)
                          ->  Parallel Seq Scan on api_record_items  (cost=0.00..234956.18 rows=9291718 width=16) (actual time=0.472..1557.711 rows=7433333 loops=3)
                          ->  Parallel Hash  (cost=2402.11..2402.11 rows=14600 width=21) (actual time=16.479..16.480 rows=10012 loops=3)
                                Buckets: 32768  Batches: 1  Memory Usage: 1952kB
                                ->  Parallel Seq Scan on api_record  (cost=0.00..2402.11 rows=14600 width=21) (actual time=6.063..13.094 rows=10012 loops=3)
                                      Filter: ((system)::text = 'system_1'::text)
                                      Rows Removed by Filter: 33340
                    ->  Parallel Hash  (cost=12626.75..12626.75 rows=420875 width=8) (actual time=180.939..180.940 rows=336700 loops=3)
                          Buckets: 131072  Batches: 16  Memory Usage: 3520kB
                          ->  Parallel Seq Scan on api_item  (cost=0.00..12626.75 rows=420875 width=8) (actual time=0.022..95.567 rows=336700 loops=3)
Planning Time: 0.492 ms
Execution Time: 22401.646 ms
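One way to avoid pulling every status value back into Python just to count it is to wrap the DISTINCT ON query in a subquery and let PostgreSQL do the counting with GROUP BY. This is a sketch based on the generated SQL above, not tested against this schema:

```sql
SELECT status, COUNT(*)
FROM (
    SELECT DISTINCT ON ("api_item"."id") "api_record"."status" AS status
    FROM "api_item"
    INNER JOIN "api_record_items" ON ("api_item"."id" = "api_record_items"."item_id")
    INNER JOIN "api_record" ON ("api_record_items"."record_id" = "api_record"."id")
    WHERE "api_record"."system" = 'System_1'
    ORDER BY "api_item"."id" ASC, "api_record"."created" DESC
) sub
GROUP BY status;
```

This returns one (status, count) row per status instead of one row per item, which could replace the Python-side Counter.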
Django ORM: get latest for each group
This should work on Django 1.2+ and MySQL:
from django.db.models import F, Max

Score.objects.annotate(
    max_date=Max('student__score__date')
).filter(
    date=F('max_date')
)
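The idea behind that annotate/filter pair (keep the rows whose date equals their group's maximum) can be illustrated in plain Python; the sample data here is made up:

```python
# Plain-Python illustration of "keep rows whose date equals the group max".
scores = [
    ('alice', '2024-01-01', 10),
    ('alice', '2024-02-01', 12),
    ('bob', '2024-01-15', 7),
]

# First pass: compute the max date per student (the Max(...) annotation).
max_date = {}
for student, date, _score in scores:
    if student not in max_date or date > max_date[student]:
        max_date[student] = date

# Second pass: keep rows matching their group's max
# (the filter(date=F('max_date')) step).
latest = [row for row in scores if row[1] == max_date[row[0]]]

print(latest)
# [('alice', '2024-02-01', 12), ('bob', '2024-01-15', 7)]
```

As with the ORM version, ties on the max date would produce more than one row per group.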