Get top 1 row of each group
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead
As for normalised or not, it depends if you want to:
- maintain status in 2 places
- preserve status history
- ...
As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.
Selecting the first N rows of each group ordered by date
As well as the row_number
solution, another option is CROSS APPLY(SELECT TOP
:
SELECT m.masterid,
d.detailid,
m.numbers,
d.date_time,
d.value
FROM masters AS m
CROSS APPLY (
SELECT TOP (3) *
FROM details AS d
WHERE d.date_time >= '2020-01-01'
AND m.masterid = d.masterid
) AS d
WHERE m.tags LIKE '%Tag2%'
ORDER BY m.masterid DESC,
d.date_time;
This may be faster or slower than row_number
, mostly depending on cardinalities (quantity of rows) and indexing.
If indexing is good and it's a small number of rows it will usually be faster. If the inner table needs sorting or you are anyway selecting most rows then use row_number
.
Get the top N rows by row count in GROUP BY
here is one way , to get top 10 rows in each group:
select * from(
select *, row_number() over (partition by recordtype order by cnt desc) rn
from (
SELECT recordtype, createdby, COUNT(*) cnt
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
)t
)t where rn <= 10
Get top N rows of each group in MySQL
If you want n
rows per group, use row_number()
. If you then want them interleaved, use order by
:
select t.*
from (select t.*,
row_number() over (partition by type order by name) as seqnum
from t
) t
where seqnum <= 2
order by seqnum, type;
This assumes that "top" is alphabetically by name
. If you have another definition, use that for the order by
for row_number()
.
Select top N rows for each group
If there were a constant number per group, you could do:
select i.*
from items as i inner join
groups as g
on i.group_id = g.id
where i.id in (select top 2 i2.id
from items i2
where i2.group_id = i.group_id
order by i2.score desc
);
Instead, you will need to enumerate the values and this is expensive in MS Access:
select i.*
from (select i.*,
(select count(*)
from items i2
where i2.group_id = i.group_id and
(i2.score < i.score or
i2.score = i.score and i2.id <= i2.id
)
) as seqnum
from items as i
) as i inner join
groups as g
on i.group_id = g.id
where i.seqnum <= g.top_count;
This logic implements the equivalent of row_number()
, which is the right way to solve this problem (if your database supports it).
Selecting top N rows for each group based on value in column
A solution with base R:
# df is split according to y, then we keep only the top "z" value (after ordering x)
# and rbind everything back together:
do.call(rbind,
lapply(split(df, df$y),
function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3
EDIT:
A much more direct way (still in base R
) provided in comment by @mt1022:
df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3
Sorting columns and selecting top n rows in each group pandas dataframe
There are 2 solutions:
1.sort_values
and aggregate head
:
df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)
mainid pidx pidy score
8 2 x w 12
4 1 a e 8
2 1 c a 7
10 2 y x 6
1 1 a c 5
7 2 z y 5
6 2 y z 3
3 1 c b 2
5 2 x y 1
2.set_index
and aggregate nlargest
:
df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
pidx mainid pidy score
0 a 1 e 8
1 a 1 c 5
2 c 1 a 7
3 c 1 b 2
4 x 2 w 12
5 x 2 y 1
6 y 2 x 6
7 y 2 z 3
8 z 2 y 5
Timings:
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
'pidx': np.random.randint(10000, size=N),
'pidy': np.random.choice(L2, N),
'score':np.random.randint(1000, size=N)})
#print (df)
def epat(df):
grouped = df.groupby('pidx')
new_df = pd.DataFrame([], columns = df.columns)
for key, values in grouped:
new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)
return (new_df)
print (epat(df))
In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop
In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop
In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
SELECT TOP 20 rows for each group
The easiest way would be to use the row_number()
window function to number the rows for each city according to their visitnumber descending and use that as a filter. This query should work in any SQL Server version from 2005 onwards.
select *
from (
select *, r = row_number() over (partition by City order by VisitNumber desc)
from your_table
) a
where r <= 20
and City in ('Washington', 'New York', 'Los Angeles')
This would select the top 20 items for each city specified in the where clause.
Related Topics
SQL Server Trigger Insert Values from New Row into Another Table
Restrict an SQL Server Connection to a Specific Ip Address
How to Flush Output from Pl/SQL in Oracle
Search for "Whole Word Match" with SQL Server Like Pattern
How to Export Image Field to File
How to Get a SQL Row_Number Equivalent for a Spark Rdd
SQL Server 2008 Paging Methods
Sqlite Select with Condition on Date
How Exactly Does Using or in a MySQL Statement Differ With/Without Parentheses
Custom Function with Check Constraint SQL Server 2008
SQL Speed Up Performance of Insert
Sql, How to Concatenate Results
Why Is My T-SQL Left Join Not Working
Anonymous Table or Varray Type in Oracle
SQL Query: Simulating an "And" Over Several Rows Instead of Sub-Querying