Select Top N Rows for Each Group

SELECT TOP 20 rows for each group

The easiest way would be to use the row_number() window function to number the rows for each city according to their visitnumber descending and use that as a filter. This query should work in any SQL Server version from 2005 onwards.

select * 
from (
select *, r = row_number() over (partition by City order by VisitNumber desc)
from your_table
) a
where r <= 20
and City in ('Washington', 'New York', 'Los Angeles')

This would select the top 20 items for each city specified in the where clause.

Get top 1 row of each group

;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.

Selecting the first N rows of each group ordered by date

As well as the row_number solution, another option is CROSS APPLY(SELECT TOP:

SELECT m.masterid,
d.detailid,
m.numbers,
d.date_time,
d.value
FROM masters AS m
CROSS APPLY (
SELECT TOP (3) *
FROM details AS d
WHERE d.date_time >= '2020-01-01'
AND m.masterid = d.masterid
) AS d
WHERE m.tags LIKE '%Tag2%'
ORDER BY m.masterid DESC,
d.date_time;

This may be faster or slower than row_number, mostly depending on cardinalities (quantity of rows) and indexing.

If indexing is good and it's a small number of rows it will usually be faster. If the inner table needs sorting or you are anyway selecting most rows then use row_number.

Selecting top N rows for each group in dataframe

Using the dplyrpackage in the tidyverse you can do this:

library(tidyverse)

df <- tribble(
~Index, ~Country
, 4.1, "USA"
, 2.1, "USA"
, 5.2, "USA"
, 1.1, "Singapore"
, 6.2, "Singapore"
, 8.1, "Germany"
, 4.5, "Italy"
, 7.1, "Italy"
, 2.3, "Italy"
, 5.9, "Italy"
, 8.8, "Russia"
)

df %>% # take the dataframe
group_by(Country) %>% # group it by the grouping variable
slice(1:3) # and pick rows 1 to 3 per group

Output:

   Index Country  
<dbl> <chr>
1 8.1 Germany
2 4.5 Italy
3 7.1 Italy
4 2.3 Italy
5 8.8 Russia
6 1.1 Singapore
7 6.2 Singapore
8 4.1 USA
9 2.1 USA
10 5.2 USA

Get top n records for each group of grouped results

Here is one way to do this, using UNION ALL (See SQL Fiddle with Demo). This works with two groups, if you have more than two groups, then you would need to specify the group number and add queries for each group:

(
select *
from mytable
where `group` = 1
order by age desc
LIMIT 2
)
UNION ALL
(
select *
from mytable
where `group` = 2
order by age desc
LIMIT 2
)

There are a variety of ways to do this, see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too, it generates a row number for each record. Using an example from the link above this will return only those records with a row number of less than or equal to 2:

select person, `group`, age
from
(
select person, `group`, age,
(@num:=if(@group = `group`, @num +1, if(@group := `group`, 1, 1))) row_number
from test t
CROSS JOIN (select @num:=0, @group:=null) c
order by `Group`, Age desc, person
) as x
where x.row_number <= 2;

See Demo

Get top N rows of each group in MySQL

If you want n rows per group, use row_number(). If you then want them interleaved, use order by:

select t.*
from (select t.*,
row_number() over (partition by type order by name) as seqnum
from t
) t
where seqnum <= 2
order by seqnum, type;

This assumes that "top" is alphabetically by name. If you have another definition, use that for the order by for row_number().

Sorting columns and selecting top n rows in each group pandas dataframe

There are 2 solutions:

1.sort_values and aggregate head:

df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)

mainid pidx pidy score
8 2 x w 12
4 1 a e 8
2 1 c a 7
10 2 y x 6
1 1 a c 5
7 2 z y 5
6 2 y z 3
3 1 c b 2
5 2 x y 1

2.set_index and aggregate nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index() 
print (df)
pidx mainid pidy score
0 a 1 e 8
1 a 1 c 5
2 c 1 a 7
3 c 1 b 2
4 x 2 w 12
5 x 2 y 1
6 y 2 x 6
7 y 2 z 3
8 z 2 y 5

Timings:

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
'pidx': np.random.randint(10000, size=N),
'pidy': np.random.choice(L2, N),
'score':np.random.randint(1000, size=N)})
#print (df)

def epat(df):
grouped = df.groupby('pidx')
new_df = pd.DataFrame([], columns = df.columns)
for key, values in grouped:
new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)
return (new_df)

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop

Get the top N rows by row count in GROUP BY

here is one way , to get top 10 rows in each group:

select * from(
select *, row_number() over (partition by recordtype order by cnt desc) rn
from (
SELECT recordtype, createdby, COUNT(*) cnt
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
)t
)t where rn <= 10


Related Topics



Leave a reply



Submit