Get Top N Records For Each Group of Grouped Results

Get top n records for each group of grouped results

Here is one way to do this, using UNION ALL. This works with two groups; if you have more than two groups, you need to add a query for each group and specify its group number:

(
select *
from mytable
where `group` = 1
order by age desc
LIMIT 2
)
UNION ALL
(
select *
from mytable
where `group` = 2
order by age desc
LIMIT 2
)
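When there are more than two groups, the per-group branches can be generated rather than typed out. The following is a minimal sketch, not the original author's code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the helper name top_n_union_all and the sample rows are made up for illustration, and each branch is wrapped in a derived table because SQLite does not accept bare parenthesised SELECTs in a compound query.

```python
import sqlite3

def top_n_union_all(conn, group_ids, n):
    """Return the n oldest rows of each listed group using one UNION ALL query."""
    if not group_ids:
        return []
    # One ORDER BY ... LIMIT branch per group (hypothetical helper, SQLite syntax).
    branch = (
        'SELECT * FROM (SELECT person, "group", age FROM mytable '
        'WHERE "group" = ? ORDER BY age DESC, person LIMIT ?)'
    )
    sql = " UNION ALL ".join([branch] * len(group_ids))
    # Each branch consumes two parameters: the group id and the limit.
    params = [value for gid in group_ids for value in (gid, n)]
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE mytable (person TEXT, "group" INTEGER, age INTEGER)')
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("Bob", 1, 32), ("Jill", 1, 34), ("Shawn", 1, 42),
     ("Jake", 2, 29), ("Paul", 2, 36), ("Laura", 2, 39)],
)
rows = top_n_union_all(conn, [1, 2], 2)
print(rows)
```

The query grows by one branch per group, so this only stays practical for a small, known list of groups.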

There are a variety of ways to do this; see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too; it generates a row number for each record. Using an example from the link above, this returns only those records with a row number less than or equal to 2:

select person, `group`, age
from
(
select person, `group`, age,
(@num:=if(@group = `group`, @num +1, if(@group := `group`, 1, 1))) row_number
from test t
CROSS JOIN (select @num:=0, @group:=null) c
order by `Group`, Age desc, person
) as x
where x.row_number <= 2;
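To make the counter logic easier to follow, here is a rough Python equivalent of what the @num and @group variables do as the ordered rows are scanned. It is only an illustration of the idea, with made-up sample rows and a hypothetical function name, not a replacement for the query.

```python
def number_rows(rows):
    """Mimic the @num/@group counters over rows of (person, group, age)."""
    # Same ordering as the SQL: group ascending, age descending, then person.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2], r[0]))
    num, current_group = 0, None
    numbered = []
    for person, group, age in ordered:
        # Restart at 1 when the group changes, otherwise keep counting.
        num = num + 1 if group == current_group else 1
        current_group = group
        numbered.append((person, group, age, num))
    return numbered

rows = [("Bob", 1, 32), ("Jill", 1, 34), ("Shawn", 1, 42),
        ("Jake", 2, 29), ("Paul", 2, 36), ("Laura", 2, 39)]
# Equivalent of the outer "where x.row_number <= 2".
top_two = [r for r in number_rows(rows) if r[3] <= 2]
print(top_two)
```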

Get top 1 row of each group

;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect two entries per day, this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
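The difference between the two functions shows up only when rows tie on DateCreated. The sketch below is an assumption-laden illustration: it runs the same CTE in SQLite (via Python's sqlite3, which needs SQLite 3.25 or later for window functions) rather than SQL Server, and the ID and Status columns and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [(1, 1, "S1", "2011-07-29"),
     (2, 1, "S2", "2011-08-02"),
     (3, 1, "S3", "2011-08-02"),   # same date as row 2: a tie
     (4, 2, "S1", "2011-07-28")],
)

def latest_ids(rank_function):
    """IDs ranked first per document; rank_function is a fixed, trusted name."""
    sql = f"""
        WITH cte AS (
            SELECT ID,
                   {rank_function}() OVER (
                       PARTITION BY DocumentID ORDER BY DateCreated DESC
                   ) AS rn
            FROM DocumentStatusLogs
        )
        SELECT ID FROM cte WHERE rn = 1 ORDER BY ID
    """
    return [row[0] for row in conn.execute(sql)]

row_number_ids = latest_ids("ROW_NUMBER")   # one of the tied rows, plus row 4
dense_rank_ids = latest_ids("DENSE_RANK")   # both tied rows, plus row 4
print(row_number_ids, dense_rank_ids)
```

Which of the tied rows ROW_NUMBER keeps is not defined; add a tiebreaker column to the ORDER BY if that matters.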

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or drop this status history table.

Get top N rows of each group in MySQL

If you want n rows per group, use row_number(). If you then want them interleaved, use order by:

select t.*
from (select t.*,
row_number() over (partition by type order by name) as seqnum
from t
) t
where seqnum <= 2
order by seqnum, type;

This assumes that "top" is alphabetically by name. If you have another definition, use that for the order by for row_number().
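To see what "interleaved" means here, the same query can be run against a tiny table. This is a sketch only: it uses SQLite through Python's sqlite3 (window functions need SQLite 3.25 or later) instead of MySQL 8, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (type TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", "x1"), ("a", "x2"), ("a", "x3"),
     ("b", "y1"), ("b", "y2"),
     ("c", "z1")],
)

interleaved = conn.execute("""
    SELECT type, name
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY type ORDER BY name) AS seqnum
          FROM t) AS ranked
    WHERE seqnum <= 2
    ORDER BY seqnum, type
""").fetchall()
print(interleaved)
```

Ordering by seqnum first returns every group's first row, then every group's second row, instead of all rows of one group followed by the next.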

Get top n records with grouped table

If you are using MySQL 8.0 or higher, you can use row_number() over (partition by ...):

select t1.created_at, t1.user_id from (
select row_number() over (partition by user_id order by created_at desc) rn, created_at, user_id
from orders) t1 where t1.rn <=2

For MySQL versions 5.7 and below, use user variables instead:

SELECT t1.created_at, t1.user_id 
FROM (SELECT
@row_number:=CASE
WHEN @varId = user_id
THEN
@row_number + 1
ELSE
1
END AS rn,
@varId:=user_id user_id,
created_at
FROM
orders,
(SELECT @varId:=0,@row_number:=0) as t
ORDER BY
user_id asc, created_at desc) t1
WHERE t1.rn <= 2

Get records with max value for each group of grouped SQL results

There's a super-simple way to do this in mysql:

select * 
from (select * from mytable order by `Group`, age desc, Person) x
group by `Group`

This works because in mysql you're allowed to not aggregate non-group-by columns, in which case mysql just returns the first row. The solution is to first order the data such that for each group the row you want is first, then group by the columns you want the value for.

You avoid complicated subqueries that try to find the max() etc., and also the problem of returning multiple rows when more than one row has the same maximum value (as the other answers would do).

Note: This is a mysql-only solution. All other databases I know will throw an SQL syntax error with the message "non aggregated columns are not listed in the group by clause" or similar. Because this solution uses undocumented behavior, the more cautious may want to include a test to assert that it remains working should a future version of MySQL change this behavior.

Version 5.7 update:

Since version 5.7, the sql-mode setting includes ONLY_FULL_GROUP_BY by default, so to make this work you must not have this option (edit the option file for the server to remove this setting).

Optimized way to get top n records of each group

Your approach is fine, but your query is not. In particular, MySQL does not guarantee the order of evaluation of expressions in a SELECT, so you should not assign a variable in one expression and use it in another.

Fortunately, you can combine the assignments into a single expression:

SELECT b.*
FROM (SELECT b.sub_cat_id, b.title, b.created_date,
(@rn := IF(@sc = b.sub_cat_id, @rn + 1,
if(@sc := b.sub_cat_id, 1, 1)
)
) as rn
FROM blog b CROSS JOIN
(SELECT @sc := -1, @rn := 0) params
WHERE b.type = 'BLOG' AND
b.sub_cat_id IN (1, 2, 8) AND
b.created_date <= NOW() -- is this really needed?
ORDER BY b.sub_cat_id DESC, b.created_date DESC
) b
WHERE rn <= 6;

For this query, you want indexes. I think this will work: blog(type, sub_cat_id, created_date). Unfortunately, the query will still require sorting the data. In more recent versions of MySQL, I think you need to do the sorting in a subquery and then the rn assignment afterwards.

I do wonder if this formulation could be made to be more effective:

select b.*
from blog b
where b.type = 'BLOG' and
b.sub_cat_id in (1, 2, 8) and
b.created_date >= (select b2.created_date
from blog b2
where b2.type = b.type and
b2.sub_cat_id = b.sub_cat_id
order by b2.created_date desc
limit 1 offset 5
);

For this, you want an index on blog(type, sub_cat_id, created_date).
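One thing to watch with the threshold formulation: when a category has fewer rows than the offset reaches, the subquery returns NULL, the comparison is never true, and that category disappears from the result. Rows that tie on the threshold date can also push a category over the limit. The sketch below illustrates the first point using SQLite through Python's sqlite3 as a stand-in for MySQL; the table layout, sample data, function name, and the COALESCE fallback are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog (type TEXT, sub_cat_id INTEGER, created_date TEXT)")
conn.executemany(
    "INSERT INTO blog VALUES ('BLOG', ?, ?)",
    [(1, "2020-01-01"), (1, "2020-01-02"), (1, "2020-01-03"), (1, "2020-01-04"),
     (2, "2020-01-01"), (2, "2020-01-02")],
)

def newest_per_category(n, keep_small_groups=False):
    """Newest n rows per category via a correlated LIMIT 1 OFFSET n-1 threshold."""
    # Date of the n-th newest row in the same category, or NULL if there is none.
    threshold = """(SELECT b2.created_date
                    FROM blog b2
                    WHERE b2.type = b.type AND b2.sub_cat_id = b.sub_cat_id
                    ORDER BY b2.created_date DESC
                    LIMIT 1 OFFSET ?)"""
    if keep_small_groups:
        # Assumed workaround: fall back to a value that sorts before any date.
        threshold = f"COALESCE({threshold}, '')"
    sql = f"""SELECT b.sub_cat_id, b.created_date
              FROM blog b
              WHERE b.type = 'BLOG' AND b.created_date >= {threshold}
              ORDER BY b.sub_cat_id, b.created_date DESC"""
    return conn.execute(sql, (n - 1,)).fetchall()

strict = newest_per_category(3)                           # category 2 is dropped
relaxed = newest_per_category(3, keep_small_groups=True)  # category 2 is kept
print(strict)
print(relaxed)
```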

Get top n records for each group of grouped results with Bigquery (standard SQL)

This is row_number():

select t.*
from (select t.*,
row_number() over (partition by `group` order by age desc) as seqnum
from t
) t
where seqnum <= 2;

row_number() is an ANSI standard window function. It is available in most databases. In general, I would suggest that you look more for solutions using Postgres rather than MySQL for solving problems in BQ (if you can't find a BQ resource itself).

Pandas get topmost n records within each group

Did you try:

df.groupby('id').head(2)

Output generated:

       id  value
id
1  0    1      1
   1    1      2
2  3    2      1
   4    2      2
3  7    3      1
4  8    4      1

(Keep in mind that you might need to order/sort before, depending on your data)

EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

to remove the MultiIndex and flatten the results:

    id  value
0    1      1
1    1      2
2    2      1
3    2      2
4    3      1
5    4      1

