How to Select the First N Rows of Each Group

You can do the counting using a correlated subquery:

SELECT b.BookId, a.AuthorId, a.AuthorName, b.Title
FROM Author a JOIN
     Book b
     ON a.AuthorId = b.AuthorId
WHERE (SELECT COUNT(*)
       FROM Book b2
       WHERE b2.BookId <= b.BookId AND b2.AuthorId = b.AuthorId
      ) <= 2;

For a small database this should be fine. A composite index on Book(AuthorId, BookId) will help the query.
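
A minimal sketch of that index, assuming the table and column names above:

-- Composite index so the correlated subquery can count an author's
-- earlier books with an index seek rather than a table scan.
CREATE INDEX ix_book_author_book ON Book (AuthorId, BookId);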

Selecting the first N rows of each group ordered by date

As well as the row_number solution, another option is CROSS APPLY (SELECT TOP ...):

SELECT m.masterid,
       d.detailid,
       m.numbers,
       d.date_time,
       d.value
FROM masters AS m
CROSS APPLY (
    SELECT TOP (3) *
    FROM details AS d
    WHERE d.date_time >= '2020-01-01'
      AND m.masterid = d.masterid
) AS d
WHERE m.tags LIKE '%Tag2%'
ORDER BY m.masterid DESC,
         d.date_time;

This may be faster or slower than row_number, mostly depending on cardinalities (row counts) and indexing.

If indexing is good and only a small number of rows is needed per group, CROSS APPLY will usually be faster. If the inner table needs sorting, or you are selecting most rows anyway, use row_number instead.
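
For comparison, a rough row_number version of the same query might look like the following. This is a sketch against the same masters/details tables; it orders each group by date_time, which is an assumption on my part, since the TOP (3) above has no inner ORDER BY:

SELECT masterid, detailid, numbers, date_time, value
FROM (
    SELECT m.masterid,
           d.detailid,
           m.numbers,
           d.date_time,
           d.value,
           -- number the details per master, oldest first
           ROW_NUMBER() OVER (PARTITION BY m.masterid
                              ORDER BY d.date_time) AS rn
    FROM masters AS m
    JOIN details AS d
      ON m.masterid = d.masterid
    WHERE d.date_time >= '2020-01-01'
      AND m.tags LIKE '%Tag2%'
) AS x
WHERE rn <= 3
ORDER BY masterid DESC, date_time;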

Get top 1 row of each group

;WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DocumentID
                              ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, this will arbitrarily pick one of them. To get both entries for a day, use DENSE_RANK instead.
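
A sketch of the DENSE_RANK variant; only the window function changes, so rows that tie on DateCreated share rank 1 and both come back:

;WITH cte AS
(
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY DocumentID
                              ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1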

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.
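
If you did go the denormalised route, a trigger along these lines could keep the parent current. This is only a rough sketch: the Documents parent table and the Status columns are assumptions, not part of the original schema.

CREATE TRIGGER trg_DocumentStatusLogs_SyncStatus
ON DocumentStatusLogs
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy the newest inserted status per document up to the
    -- (assumed) parent table.
    UPDATE doc
    SET doc.Status = i.Status
    FROM Documents AS doc
    JOIN (
        SELECT DocumentID, Status,
               ROW_NUMBER() OVER (PARTITION BY DocumentID
                                  ORDER BY DateCreated DESC) AS rn
        FROM inserted
    ) AS i
      ON i.DocumentID = doc.DocumentID
     AND i.rn = 1;
END;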

Selecting top N rows for each group in dataframe

Using the dplyr package in the tidyverse, you can do this:

library(tidyverse)

df <- tribble(
  ~Index, ~Country,
  4.1,    "USA",
  2.1,    "USA",
  5.2,    "USA",
  1.1,    "Singapore",
  6.2,    "Singapore",
  8.1,    "Germany",
  4.5,    "Italy",
  7.1,    "Italy",
  2.3,    "Italy",
  5.9,    "Italy",
  8.8,    "Russia"
)

df %>%                  # take the dataframe
  group_by(Country) %>% # group it by the grouping variable
  slice(1:3)            # and pick rows 1 to 3 per group

Output:

   Index Country  
   <dbl> <chr>    
 1   8.1 Germany  
 2   4.5 Italy    
 3   7.1 Italy    
 4   2.3 Italy    
 5   8.8 Russia   
 6   1.1 Singapore
 7   6.2 Singapore
 8   4.1 USA      
 9   2.1 USA      
10   5.2 USA      
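
Note that slice(1:3) takes the first three rows in whatever order they happen to sit. If you instead want the top 3 by Index, arrange within each group first, or use slice_max() (available in dplyr 1.0+); both variants below are sketches assuming the same df:

df %>%
  group_by(Country) %>%
  arrange(desc(Index), .by_group = TRUE) %>% # order within each group
  slice(1:3)                                 # then take the first 3

# equivalent shortcut in recent dplyr
df %>%
  group_by(Country) %>%
  slice_max(Index, n = 3)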

Get top n records for each group of grouped results

Here is one way to do this, using UNION ALL. This works for two groups; if you have more than two, you would need to add a query for each additional group:

(
  SELECT *
  FROM mytable
  WHERE `group` = 1
  ORDER BY age DESC
  LIMIT 2
)
UNION ALL
(
  SELECT *
  FROM mytable
  WHERE `group` = 2
  ORDER BY age DESC
  LIMIT 2
)

There are a variety of ways to do this, see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too; it generates a row number for each record. Using an example from the link above, this will return only those records with a row number less than or equal to 2:

SELECT person, `group`, age
FROM
(
  SELECT person, `group`, age,
         (@num := IF(@group = `group`, @num + 1,
                     IF(@group := `group`, 1, 1))) AS row_number
  FROM test t
  CROSS JOIN (SELECT @num := 0, @group := NULL) c
  ORDER BY `group`, age DESC, person
) AS x
WHERE x.row_number <= 2;

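On MySQL 8.0+ the user-variable trick is no longer necessary; the same result can be written with a window function:

SELECT person, `group`, age
FROM (
  SELECT person, `group`, age,
         ROW_NUMBER() OVER (PARTITION BY `group`
                            ORDER BY age DESC) AS rn
  FROM test
) AS x
WHERE rn <= 2;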

How to select the first n rows of each group in specified columns (after a join)?

Have you tried

tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to),
        x.time, by = .EACHI]  # or head(x.time, 2L) to get first 2 rows etc.

?

You'll need to rename the duplicate columns by yourself until that's taken care of internally, as explained here.
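
Spelled out, the first-2-rows variant mentioned in the comment above would be:

tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to),
        head(x.time, 2L), by = .EACHI]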

Select first row in each GROUP BY group?

On databases that support CTE and windowing functions:

WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC) AS rank
    FROM PURCHASES p
)
SELECT *
FROM summary
WHERE rank = 1

Supported by any database, but you need to add logic to break ties:

SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
             MAX(total) AS max_total
      FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer = x.customer
                            AND y.max_total = x.total
GROUP BY x.customer, x.total

data.table - select first n rows within group

As an alternative:

dt[, .SD[1:3], cyl]
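
To run this as-is, dt is presumably something like mtcars converted to a data.table; that setup, which is an assumption not stated here, would be:

library(data.table)
dt <- as.data.table(mtcars)  # 'cyl' has three groups: 4, 6 and 8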

When you look at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD   = dt[, .SD[1:3], cyl],
               I    = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a 
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a 

However, data.table is specifically designed for large datasets. So, running this comparison again:

# creating a 30 million row dataset
largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD   = largeDT[, .SD[1:3], cyl],
               I    = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a 

Now the .I method is clearly the fastest one.


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

However, with a somewhat smaller (but still sizeable) dataset, the picture changes:

largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

mysql select the first n rows per group

You can simplify your query and use the LEAST function:

SELECT USERid, LEAST(7, COUNT(*))
FROM table
GROUP BY USERid

From the question in your comment:

SELECT SUM(countByUser)
FROM (SELECT LEAST(7, COUNT(*)) AS countByUser
      FROM table
      GROUP BY USERid) c



