Selecting Top N Rows for Each Group in a Table

Get top 1 row of each group


;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.

Selecting the first N rows of each group ordered by date

As well as the row_number solution, another option is CROSS APPLY(SELECT TOP:

SELECT m.masterid,
d.detailid,
m.numbers,
d.date_time,
d.value
FROM masters AS m
CROSS APPLY (
SELECT TOP (3) *
FROM details AS d
WHERE d.date_time >= '2020-01-01'
AND m.masterid = d.masterid
) AS d
WHERE m.tags LIKE '%Tag2%'
ORDER BY m.masterid DESC,
d.date_time;

This may be faster or slower than row_number, mostly depending on cardinalities (quantity of rows) and indexing.

If indexing is good and it's a small number of rows it will usually be faster. If the inner table needs sorting or you are anyway selecting most rows then use row_number.

Get the top N rows by row count in GROUP BY

here is one way , to get top 10 rows in each group:

select * from(
select *, row_number() over (partition by recordtype order by cnt desc) rn
from (
SELECT recordtype, createdby, COUNT(*) cnt
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
)t
)t where rn <= 10

Get top N rows of each group in MySQL

If you want n rows per group, use row_number(). If you then want them interleaved, use order by:

select t.*
from (select t.*,
row_number() over (partition by type order by name) as seqnum
from t
) t
where seqnum <= 2
order by seqnum, type;

This assumes that "top" is alphabetically by name. If you have another definition, use that for the order by for row_number().

Select top N rows for each group

If there were a constant number per group, you could do:

select i.*
from items as i inner join
groups as g
on i.group_id = g.id
where i.id in (select top 2 i2.id
from items i2
where i2.group_id = i.group_id
order by i2.score desc
);

Instead, you will need to enumerate the values and this is expensive in MS Access:

select i.*
from (select i.*,
(select count(*)
from items i2
where i2.group_id = i.group_id and
(i2.score < i.score or
i2.score = i.score and i2.id <= i2.id
)
) as seqnum
from items as i
) as i inner join
groups as g
on i.group_id = g.id
where i.seqnum <= g.top_count;

This logic implements the equivalent of row_number(), which is the right way to solve this problem (if your database supports it).

Selecting top N rows for each group based on value in column

A solution with base R:

# df is split according to y, then we keep only the top "z" value (after ordering x) 
# and rbind everything back together:
do.call(rbind,
lapply(split(df, df$y),
function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3

EDIT:

A much more direct way (still in base R) provided in comment by @mt1022:

df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3

Sorting columns and selecting top n rows in each group pandas dataframe

There are 2 solutions:

1.sort_values and aggregate head:

df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)

mainid pidx pidy score
8 2 x w 12
4 1 a e 8
2 1 c a 7
10 2 y x 6
1 1 a c 5
7 2 z y 5
6 2 y z 3
3 1 c b 2
5 2 x y 1

2.set_index and aggregate nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index() 
print (df)
pidx mainid pidy score
0 a 1 e 8
1 a 1 c 5
2 c 1 a 7
3 c 1 b 2
4 x 2 w 12
5 x 2 y 1
6 y 2 x 6
7 y 2 z 3
8 z 2 y 5

Timings:

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
'pidx': np.random.randint(10000, size=N),
'pidy': np.random.choice(L2, N),
'score':np.random.randint(1000, size=N)})
#print (df)

def epat(df):
grouped = df.groupby('pidx')
new_df = pd.DataFrame([], columns = df.columns)
for key, values in grouped:
new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)
return (new_df)

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop

SELECT TOP 20 rows for each group

The easiest way would be to use the row_number() window function to number the rows for each city according to their visitnumber descending and use that as a filter. This query should work in any SQL Server version from 2005 onwards.

select * 
from (
select *, r = row_number() over (partition by City order by VisitNumber desc)
from your_table
) a
where r <= 20
and City in ('Washington', 'New York', 'Los Angeles')

This would select the top 20 items for each city specified in the where clause.



Related Topics



Leave a reply



Submit