How to Select the First N Rows of Each Group

You can do the counting using a correlated subquery:

SELECT b.BookId, a.AuthorId, a.AuthorName, b.Title
FROM Author a JOIN
     Book b
     ON a.AuthorId = b.AuthorId
WHERE (SELECT COUNT(*)
       FROM Book b2
       WHERE b2.BookId <= b.BookId AND b2.AuthorId = b.AuthorId
      ) <= 2;

For a small database this should be fine. A composite index on Book(AuthorId, BookId) will help the query.
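
A minimal sketch of that index, assuming the table and column names above:

-- Composite index so the correlated subquery can count an author's
-- earlier books with an index seek rather than a table scan.
CREATE INDEX ix_book_author_book ON Book (AuthorId, BookId);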

Selecting the first N rows of each group ordered by date

As well as the row_number solution, another option is CROSS APPLY (SELECT TOP ...):

SELECT m.masterid,
       d.detailid,
       m.numbers,
       d.date_time,
       d.value
FROM masters AS m
CROSS APPLY (
    SELECT TOP (3) *
    FROM details AS d
    WHERE d.date_time >= '2020-01-01'
      AND m.masterid = d.masterid
) AS d
WHERE m.tags LIKE '%Tag2%'
ORDER BY m.masterid DESC,
         d.date_time;

This may be faster or slower than row_number, mostly depending on cardinalities (row counts) and indexing.

If indexing is good and only a small number of rows is needed per group, CROSS APPLY will usually be faster. If the inner table needs sorting, or you are selecting most rows anyway, use row_number instead.
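
For comparison, a rough row_number version of the same query might look like the following. This is a sketch against the same masters/details tables; it orders each group by date_time, which is an assumption on my part, since the TOP (3) above has no inner ORDER BY:

SELECT masterid, detailid, numbers, date_time, value
FROM (
    SELECT m.masterid,
           d.detailid,
           m.numbers,
           d.date_time,
           d.value,
           -- number the details per master, oldest first
           ROW_NUMBER() OVER (PARTITION BY m.masterid
                              ORDER BY d.date_time) AS rn
    FROM masters AS m
    JOIN details AS d
      ON m.masterid = d.masterid
    WHERE d.date_time >= '2020-01-01'
      AND m.tags LIKE '%Tag2%'
) AS x
WHERE rn <= 3
ORDER BY masterid DESC, date_time;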

Get top 1 row of each group

;WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DocumentID
                              ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, this will arbitrarily pick one of them. To get both entries for a day, use DENSE_RANK instead.
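
A sketch of the DENSE_RANK variant; only the window function changes, so rows that tie on DateCreated share rank 1 and both come back:

;WITH cte AS
(
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY DocumentID
                              ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1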

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.
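
If you did go the denormalised route, a trigger along these lines could keep the parent current. This is only a rough sketch: the Documents parent table and the Status columns are assumptions, not part of the original schema.

CREATE TRIGGER trg_DocumentStatusLogs_SyncStatus
ON DocumentStatusLogs
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy the newest inserted status per document up to the
    -- (assumed) parent table.
    UPDATE doc
    SET doc.Status = i.Status
    FROM Documents AS doc
    JOIN (
        SELECT DocumentID, Status,
               ROW_NUMBER() OVER (PARTITION BY DocumentID
                                  ORDER BY DateCreated DESC) AS rn
        FROM inserted
    ) AS i
      ON i.DocumentID = doc.DocumentID
     AND i.rn = 1;
END;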

Selecting top N rows for each group in dataframe

Using the dplyr package in the tidyverse, you can do this:

library(tidyverse)

df <- tribble(
  ~Index, ~Country,
  4.1,    "USA",
  2.1,    "USA",
  5.2,    "USA",
  1.1,    "Singapore",
  6.2,    "Singapore",
  8.1,    "Germany",
  4.5,    "Italy",
  7.1,    "Italy",
  2.3,    "Italy",
  5.9,    "Italy",
  8.8,    "Russia"
)

df %>%                  # take the dataframe
  group_by(Country) %>% # group it by the grouping variable
  slice(1:3)            # and pick rows 1 to 3 per group

Output:

   Index Country  
   <dbl> <chr>    
 1   8.1 Germany  
 2   4.5 Italy    
 3   7.1 Italy    
 4   2.3 Italy    
 5   8.8 Russia   
 6   1.1 Singapore
 7   6.2 Singapore
 8   4.1 USA      
 9   2.1 USA      
10   5.2 USA      
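
Note that slice(1:3) takes the first three rows in whatever order they happen to sit. If you instead want the top 3 by Index, arrange within each group first, or use slice_max() (available in dplyr 1.0+); both variants below are sketches assuming the same df:

df %>%
  group_by(Country) %>%
  arrange(desc(Index), .by_group = TRUE) %>% # order within each group
  slice(1:3)                                 # then take the first 3

# equivalent shortcut in recent dplyr
df %>%
  group_by(Country) %>%
  slice_max(Index, n = 3)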

Get top n records for each group of grouped results

Here is one way to do this, using UNION ALL. This works for two groups; if you have more than two, you would need to add a query for each additional group:

(
  SELECT *
  FROM mytable
  WHERE `group` = 1
  ORDER BY age DESC
  LIMIT 2
)
UNION ALL
(
  SELECT *
  FROM mytable
  WHERE `group` = 2
  ORDER BY age DESC
  LIMIT 2
)

There are a variety of ways to do this, see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too; it generates a row number for each record. Using an example from the link above, this will return only those records with a row number less than or equal to 2:

SELECT person, `group`, age
FROM
(
  SELECT person, `group`, age,
         (@num := IF(@group = `group`, @num + 1,
                     IF(@group := `group`, 1, 1))) AS row_number
  FROM test t
  CROSS JOIN (SELECT @num := 0, @group := NULL) c
  ORDER BY `group`, age DESC, person
) AS x
WHERE x.row_number <= 2;

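On MySQL 8.0+ the user-variable trick is no longer necessary; the same result can be written with a window function:

SELECT person, `group`, age
FROM (
  SELECT person, `group`, age,
         ROW_NUMBER() OVER (PARTITION BY `group`
                            ORDER BY age DESC) AS rn
  FROM test
) AS x
WHERE rn <= 2;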

How to select the first n rows of each group in specified columns (after a join)?

Have you tried

tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to),
        x.time, by = .EACHI]  # or head(x.time, 2L) to get first 2 rows etc.

?

You'll need to rename the duplicate columns by yourself until that's taken care of internally, as explained here.
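
Spelled out, the first-2-rows variant mentioned in the comment above would be:

tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to),
        head(x.time, 2L), by = .EACHI]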

Select first row in each GROUP BY group?

On databases that support CTE and windowing functions:

WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC) AS rank
    FROM PURCHASES p
)
SELECT *
FROM summary
WHERE rank = 1

Supported by any database, but you need to add logic to break ties:

SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
             MAX(total) AS max_total
      FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer = x.customer
                            AND y.max_total = x.total
GROUP BY x.customer, x.total

data.table - select first n rows within group

As an alternative:

dt[, .SD[1:3], cyl]
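
To run this as-is, dt is presumably something like mtcars converted to a data.table; that setup, which is an assumption not stated here, would be:

library(data.table)
dt <- as.data.table(mtcars)  # 'cyl' has three groups: 4, 6 and 8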

When you look at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD   = dt[, .SD[1:3], cyl],
               I    = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a 
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a 

However, data.table is specifically designed for large datasets. So, running this comparison again:

# creating a 30 million row dataset
largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD   = largeDT[, .SD[1:3], cyl],
               I    = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a 

Now the .I method is clearly the fastest one.


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

However, with a somewhat smaller (but still sizeable) dataset, the picture changes:

largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

mysql select the first n rows per group

You can simplify your query and use the LEAST function:

SELECT USERid, LEAST(7, COUNT(*))
FROM table
GROUP BY USERid

From the question in your comment:

SELECT SUM(countByUser)
FROM (SELECT LEAST(7, COUNT(*)) AS countByUser
      FROM table
      GROUP BY USERid) c



