Data.Table - Select First N Rows Within Group

As an alternative:

dt[, .SD[1:3], cyl]

When you look at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD = dt[, .SD[1:3], cyl],
               I = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a

However, data.table is specifically designed for large datasets. So, running this comparison again:

# creating a 30-million-row dataset
largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD = largeDT[, .SD[1:3], cyl],
               I = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a

Now the .I method is clearly the fastest one.
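One way to picture the gap, as a plain-Python sketch of the two strategies (illustrative only, not data.table internals): the .SD approach materializes a full subset per group, while the .I approach only collects row indices per group and performs a single subset at the end.

```python
rows = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 5), ("b", 6)]

def first_n_per_group_subsets(rows, n):
    """'.SD-style': build a full sub-list per group, then slice each one."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)   # group on first field
    out = []
    for grp in groups.values():
        out.extend(grp[:n])
    return out

def first_n_per_group_indices(rows, n):
    """'.I-style': collect row indices per group, then subset once."""
    counts, idx = {}, []
    for i, row in enumerate(rows):
        c = counts.get(row[0], 0)
        if c < n:
            idx.append(i)        # keep only the index, not a copy of the row
        counts[row[0]] = c + 1
    return [rows[i] for i in idx]

print(first_n_per_group_indices(rows, 3))
```

Both return the same rows; the index-based variant simply defers all row copying to one final subset.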


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a

However, with a somewhat smaller (but still quite big) dataset, the odds change:

largeDT2 <- dt[, .SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a

data.table - select all rows except the first (in each group)

If you use tail() and set n = -1, it will return all but the first row (see ?tail). You can use this in your command as follows:

mtcars[order(cyl, mpg), tail(.SD, -1), by = .(cyl)]
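The same "everything but the first row of each group" idea, sketched in plain Python with a hypothetical helper (for illustration only; the data.table call above is the actual answer):

```python
def all_but_first_per_group(rows, key):
    """Keep every row except the first one seen in each group."""
    seen = set()
    out = []
    for row in rows:
        k = key(row)
        if k in seen:
            out.append(row)
        else:
            seen.add(k)   # skip the first occurrence of each group
    return out

# made-up (cyl, mpg) pairs, already ordered within each group
cars = [("6cyl", 21.0), ("6cyl", 21.4), ("4cyl", 22.8), ("4cyl", 24.4)]
print(all_but_first_per_group(cars, key=lambda r: r[0]))
```

Like tail(.SD, -1), this drops exactly one row (the first) per group and keeps the rest in order.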

data.table - keep first row per group OR based on condition

Try this.

Using mpg >= 50, we should get one row per carb:

x[rowid(carb) == 1 | mpg >= 50, ]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 160.0   110  3.90  2.62 16.46     0     1     4     4
# 2:  22.8     4 108.0    93  3.85  2.32 18.61     1     1     4     1
# 3:  18.7     8 360.0   175  3.15  3.44 17.02     0     0     3     2
# 4:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
# 5:  19.7     6 145.0   175  3.62  2.77 15.50     0     1     5     6
# 6:  15.0     8 301.0   335  3.54  3.57 14.60     0     1     5     8

Using mpg >= 30 (since all(mpg > 10)), we should get all of the above plus a few more:

x[rowid(carb) == 1 | mpg >= 30, ]
#       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#  1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#  2:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#  3:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#  4:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3
#  5:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#  6:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#  7:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#  8:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#  9:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
# 10:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8

An alternative, in case you need more grouping variables:

x[, .SD[seq_len(.N) == 1L | mpg >= 30], by = carb]

though I've been informed that rowid(...) is more efficient than seq_len(.N).
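The filter rowid(carb) == 1 | mpg >= 50 keeps a row when it is the first of its group or it meets the condition. A minimal plain-Python sketch of the same keep-first-or-condition logic (illustrative helper with made-up data, unrelated to data.table internals):

```python
def keep_first_or(rows, key, cond):
    """Keep a row when it is the first of its group OR satisfies cond."""
    counts = {}
    out = []
    for row in rows:
        k = key(row)
        counts[k] = counts.get(k, 0) + 1
        if counts[k] == 1 or cond(row):   # rowid(k) == 1 | cond(row)
            out.append(row)
    return out

cars = [{"carb": 4, "mpg": 21.0}, {"carb": 1, "mpg": 22.8},
        {"carb": 1, "mpg": 32.4}, {"carb": 4, "mpg": 15.0}]
print(keep_first_or(cars, key=lambda r: r["carb"],
                    cond=lambda r: r["mpg"] >= 30))
```

The running per-group counter plays the role of rowid(): only its first hit per group passes unconditionally.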

How to select the first n rows of each group in specified columns (after a join)?

Have you tried

tmp_dt1[tmp_dt2, on=.(grp, time>=time_from, time<=time_to), 
x.time, by=.EACHI] # or head(x.time, 2L) to get first 2 rows etc.

?

You'll need to rename the duplicate columns by yourself until that's taken care of internally, as explained here.

data.table select first row of each group limited to n-1 columns?

As commented by @christoph, .SD doesn't include the group columns (I believe for efficiency purposes, so that duplicated group values aren't stored); you can verify this by doing:

unique(DT[, .(name = names(.SD)), by=c('x','v')]$name)
# [1] "y" "a" "b"

unique(DT[, .(name = names(.SD)), by=c('x','v','a')]$name)
# [1] "y" "b"

So if you group by all columns, .SD has nothing in it. For your specific case, you can just use unique() and pass the grouping variables to the by parameter, which will drop duplicates based on those columns:

unique(DT, by=c('x','v'))

#    x v y a b
# 1: b 1 1 1 9
# 2: a 2 1 4 6
# 3: a 1 6 6 4
# 4: c 1 1 7 3
# 5: c 2 3 8 2

unique(DT, by=c('x','v','y','a','b'))

#    x v y a b
# 1: b 1 1 1 9
# 2: b 1 3 2 8
# 3: b 1 6 3 7
# 4: a 2 1 4 6
# 5: a 2 3 5 5
# 6: a 1 6 6 4
# 7: c 1 1 7 3
# 8: c 2 3 8 2
# 9: c 2 6 9 1

How to extract the first n rows per group?

Yep, just use .SD and index it as needed.

  DT[, .SD[1:2], by=date]

         date age   name
1: 2000-01-01   3 Andrew
2: 2000-01-01   4    Ben
3: 2000-01-02   6   Adam
4: 2000-01-02   7    Bob

Edited as per @eddi's suggestion.

@eddi's suggestion is spot on:

Use this instead, for speed:

  DT[DT[, .I[1:2], by = date]$V1]

# using a slightly larger data set
> microbenchmark(SDstyle = DT[, .SD[1:2], by = date],
+                IStyle = DT[DT[, .I[1:2], by = date]$V1], times = 200L)
Unit: milliseconds
    expr       min        lq    median        uq      max neval
 SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200
  IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200

Get top 1 row of each group

;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
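The ROW_NUMBER vs DENSE_RANK distinction in a nutshell: ROW_NUMBER gives tied rows distinct, arbitrarily ordered numbers, while DENSE_RANK gives them the same rank. A plain-Python sketch of both over one already-sorted partition (hypothetical dates, not tied to the schema above):

```python
def row_number(values):
    """ROW_NUMBER: ties get distinct, arbitrary numbers 1, 2, 3, ..."""
    return list(range(1, len(values) + 1))

def dense_rank(values):
    """DENSE_RANK: ties share a rank; the next distinct value gets rank + 1."""
    ranks, rank = [], 0
    prev = object()
    for v in values:          # assumes values already sorted (e.g. DESC)
        if v != prev:
            rank += 1
            prev = v
        ranks.append(rank)
    return ranks

dates = ["2024-03-02", "2024-03-02", "2024-03-01"]  # sorted DESC, with a tie
print(row_number(dates))  # [1, 2, 3] -> rn = 1 keeps only one of the tied rows
print(dense_rank(dates))  # [1, 1, 2] -> rank = 1 keeps both tied rows
```

This is why filtering on rank = 1 with DENSE_RANK returns every row tied for the latest date.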

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or drop this status history table.

How is it possible that selecting first after GroupBy on a DataTable does not return one value for every group?

If you think it's just weird side effects of viewing the data in a watch, which can happen with LINQ statements, then split it out into

var groups = dt.AsEnumerable().GroupBy(i => i.Field<int>("id")).ToList();
var firstOfGroups = groups.Select(i => i.First()).ToList();

and then look at groups and firstOfGroups in the debugger. Temporarily evaluating items with .ToList() can help a lot with viewing things in the debugger.

data.table: Select n specific rows before & after other rows meeting a condition

You are very close. This should do it:

row_numbers <- DT[exclude == 0 | y == 5, .I[{
  x <- rep(which(y == 5), each = length(idx)) + idx
  x[x > 0 & x <= .N]
}], by = grp]$V1
DT[row_numbers]

Top N rows by group using python datatable

Starting from datatable version 0.8.0, this can be achieved by combining grouping, sorting and filtering:

from datatable import *
DT = Frame(id2=[1, 2, 1, 2, 1, 2],
           id4=[1, 1, 1, 1, 1, 1],
           v3=[1, 3, 2, 3, 3, 3])

DT[:2, :, by(f.id2, f.id4), sort(-f.v3)]

which produces

     id2  id4  v3
 --  ---  ---  --
  0    1    1   3
  1    1    1   2
  2    2    1   3
  3    2    1   3

[4 rows x 3 columns]

Explanation:

  • by(f.id2, f.id4) groups the data by columns "id2" and "id4";
  • the sort(-f.v3) command tells datatable to sort the records by column "v3" in descending order. In the presence of by() this operator will be applied within each group;
  • the first :2 selects the top 2 rows, again within each group;
  • the second : selects all columns. If needed, this could have been a list of columns or expressions, allowing you to perform some operation(s) on the first 2 rows of each group.
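The whole pipeline (group by two keys, sort descending within groups, take the top 2 rows) can be mimicked with the Python standard library, as a rough sketch of what the datatable expression does:

```python
from collections import defaultdict

# same made-up data as the Frame above
rows = [{"id2": 1, "id4": 1, "v3": 1}, {"id2": 2, "id4": 1, "v3": 3},
        {"id2": 1, "id4": 1, "v3": 2}, {"id2": 2, "id4": 1, "v3": 3},
        {"id2": 1, "id4": 1, "v3": 3}, {"id2": 2, "id4": 1, "v3": 3}]

groups = defaultdict(list)
for r in rows:                                         # by(f.id2, f.id4)
    groups[(r["id2"], r["id4"])].append(r)

top2 = []
for key in sorted(groups):
    grp = sorted(groups[key], key=lambda r: -r["v3"])  # sort(-f.v3) per group
    top2.extend(grp[:2])                               # the :2 row slice

print([(r["id2"], r["v3"]) for r in top2])
# -> [(1, 3), (1, 2), (2, 3), (2, 3)]
```

The result matches the four rows the datatable expression produces above.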

