Data.Table - Select First N Rows Within Group

As an alternative:

dt[, .SD[1:3], cyl]

When you look at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD = dt[, .SD[1:3], cyl],
               I = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a

However, data.table is specifically designed for large datasets. So, running this comparison again:

# creating a 30-million-row dataset
largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD = largeDT[, .SD[1:3], cyl],
               I = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a

Now the .I method is clearly the fastest one.
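One way to picture the gap, as a plain-Python sketch of the two strategies (illustrative only, not data.table internals): the .SD approach materializes a full subset per group, while the .I approach only collects row indices per group and performs a single subset at the end.

```python
rows = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 5), ("b", 6)]

def first_n_per_group_subsets(rows, n):
    """'.SD-style': build a full sub-list per group, then slice each one."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)   # group on first field
    out = []
    for grp in groups.values():
        out.extend(grp[:n])
    return out

def first_n_per_group_indices(rows, n):
    """'.I-style': collect row indices per group, then subset once."""
    counts, idx = {}, []
    for i, row in enumerate(rows):
        c = counts.get(row[0], 0)
        if c < n:
            idx.append(i)        # keep only the index, not a copy of the row
        counts[row[0]] = c + 1
    return [rows[i] for i in idx]

print(first_n_per_group_indices(rows, 3))
```

Both return the same rows; the index-based variant simply defers all row copying to one final subset.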


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a

However, with a somewhat smaller (but still quite big) dataset, the odds change:

largeDT2 <- dt[, .SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a

data.table - select all rows except the first (in each group)

If you use tail() and set n = -1, it will return all but the first row (see ?tail). You can use this in your command as follows:

mtcars[order(cyl, mpg), tail(.SD, -1), by = .(cyl)]
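The same "everything but the first row of each group" idea, sketched in plain Python with a hypothetical helper (for illustration only; the data.table call above is the actual answer):

```python
def all_but_first_per_group(rows, key):
    """Keep every row except the first one seen in each group."""
    seen = set()
    out = []
    for row in rows:
        k = key(row)
        if k in seen:
            out.append(row)
        else:
            seen.add(k)   # skip the first occurrence of each group
    return out

# made-up (cyl, mpg) pairs, already ordered within each group
cars = [("6cyl", 21.0), ("6cyl", 21.4), ("4cyl", 22.8), ("4cyl", 24.4)]
print(all_but_first_per_group(cars, key=lambda r: r[0]))
```

Like tail(.SD, -1), this drops exactly one row (the first) per group and keeps the rest in order.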

data.table - keep first row per group OR based on condition

Try this.

Using mpg >= 50, we should get one row per carb:

x[rowid(carb) == 1 | mpg >= 50, ]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 160.0   110  3.90  2.62 16.46     0     1     4     4
# 2:  22.8     4 108.0    93  3.85  2.32 18.61     1     1     4     1
# 3:  18.7     8 360.0   175  3.15  3.44 17.02     0     0     3     2
# 4:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
# 5:  19.7     6 145.0   175  3.62  2.77 15.50     0     1     5     6
# 6:  15.0     8 301.0   335  3.54  3.57 14.60     0     1     5     8

Using mpg >= 30 (since all(mpg > 10)), we should get all of the above plus a few more:

x[rowid(carb) == 1 | mpg >= 30, ]
#       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#  1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#  2:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#  3:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#  4:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3
#  5:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#  6:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#  7:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#  8:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#  9:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
# 10:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8

An alternative, in case you need more grouping variables:

x[, .SD[seq_len(.N) == 1L | mpg >= 30], by = carb]

though I've been informed that rowid(...) is more efficient than seq_len(.N).
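The filter rowid(carb) == 1 | mpg >= 50 keeps a row when it is the first of its group or it meets the condition. A minimal plain-Python sketch of the same keep-first-or-condition logic (illustrative helper with made-up data, unrelated to data.table internals):

```python
def keep_first_or(rows, key, cond):
    """Keep a row when it is the first of its group OR satisfies cond."""
    counts = {}
    out = []
    for row in rows:
        k = key(row)
        counts[k] = counts.get(k, 0) + 1
        if counts[k] == 1 or cond(row):   # rowid(k) == 1 | cond(row)
            out.append(row)
    return out

cars = [{"carb": 4, "mpg": 21.0}, {"carb": 1, "mpg": 22.8},
        {"carb": 1, "mpg": 32.4}, {"carb": 4, "mpg": 15.0}]
print(keep_first_or(cars, key=lambda r: r["carb"],
                    cond=lambda r: r["mpg"] >= 30))
```

The running per-group counter plays the role of rowid(): only its first hit per group passes unconditionally.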

How to select the first n rows of each group in specified columns (after a join)?

Have you tried

tmp_dt1[tmp_dt2, on=.(grp, time>=time_from, time<=time_to), 
x.time, by=.EACHI] # or head(x.time, 2L) to get first 2 rows etc.

?

You'll need to rename the duplicate columns by yourself until that's taken care of internally, as explained here.

data.table select first row of each group limited to n-1 columns?

As commented by @christoph, .SD doesn't include the group columns (I believe for efficiency purposes, so that duplicated group values aren't stored); you can verify this by doing:

unique(DT[, .(name = names(.SD)), by=c('x','v')]$name)
# [1] "y" "a" "b"

unique(DT[, .(name = names(.SD)), by=c('x','v','a')]$name)
# [1] "y" "b"

So if you group by all columns, .SD has nothing in it. For your specific case, you can just use unique() and pass the grouping variables to the by parameter, which will drop duplicates based on those columns:

unique(DT, by=c('x','v'))

#    x v y a b
# 1: b 1 1 1 9
# 2: a 2 1 4 6
# 3: a 1 6 6 4
# 4: c 1 1 7 3
# 5: c 2 3 8 2

unique(DT, by=c('x','v','y','a','b'))

#    x v y a b
# 1: b 1 1 1 9
# 2: b 1 3 2 8
# 3: b 1 6 3 7
# 4: a 2 1 4 6
# 5: a 2 3 5 5
# 6: a 1 6 6 4
# 7: c 1 1 7 3
# 8: c 2 3 8 2
# 9: c 2 6 9 1

How to extract the first n rows per group?

Yep, just use .SD and index it as needed.

  DT[, .SD[1:2], by=date]

         date age   name
1: 2000-01-01   3 Andrew
2: 2000-01-01   4    Ben
3: 2000-01-02   6   Adam
4: 2000-01-02   7    Bob

Edited as per @eddi's suggestion.

@eddi's suggestion is spot on:

Use this instead, for speed:

  DT[DT[, .I[1:2], by = date]$V1]

# using a slightly larger data set
> microbenchmark(SDstyle = DT[, .SD[1:2], by = date],
+                IStyle = DT[DT[, .I[1:2], by = date]$V1], times = 200L)
Unit: milliseconds
    expr       min        lq    median        uq      max neval
 SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200
  IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200

Get top 1 row of each group

;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
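The ROW_NUMBER vs DENSE_RANK distinction in a nutshell: ROW_NUMBER gives tied rows distinct, arbitrarily ordered numbers, while DENSE_RANK gives them the same rank. A plain-Python sketch of both over one already-sorted partition (hypothetical dates, not tied to the schema above):

```python
def row_number(values):
    """ROW_NUMBER: ties get distinct, arbitrary numbers 1, 2, 3, ..."""
    return list(range(1, len(values) + 1))

def dense_rank(values):
    """DENSE_RANK: ties share a rank; the next distinct value gets rank + 1."""
    ranks, rank = [], 0
    prev = object()
    for v in values:          # assumes values already sorted (e.g. DESC)
        if v != prev:
            rank += 1
            prev = v
        ranks.append(rank)
    return ranks

dates = ["2024-03-02", "2024-03-02", "2024-03-01"]  # sorted DESC, with a tie
print(row_number(dates))  # [1, 2, 3] -> rn = 1 keeps only one of the tied rows
print(dense_rank(dates))  # [1, 1, 2] -> rank = 1 keeps both tied rows
```

This is why filtering on rank = 1 with DENSE_RANK returns every row tied for the latest date.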

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or drop this status history table.

How is it possible that selecting first after GroupBy on a DataTable does not return one value for every group?

If you think it's just weird side effects of viewing the data in a watch, which can happen with LINQ statements, then split it out into

var groups = dt.AsEnumerable().GroupBy(i => i.Field<int>("id")).ToList();
var firstOfGroups = groups.Select(i => i.First()).ToList();

and then look at groups and firstOfGroups in the debugger. Temporarily evaluating items with .ToList() can help a lot with viewing things in the debugger.

data.table: Select n specific rows before & after other rows meeting a condition

You are very close. This should do it:

row_numbers <- DT[exclude == 0 | y == 5, .I[{
  x <- rep(which(y == 5), each = length(idx)) + idx
  x[x > 0 & x <= .N]
}], by = grp]$V1
DT[row_numbers]

Top N rows by group using python datatable

Starting from datatable version 0.8.0, this can be achieved by combining grouping, sorting and filtering:

from datatable import *
DT = Frame(id2=[1, 2, 1, 2, 1, 2],
           id4=[1, 1, 1, 1, 1, 1],
           v3=[1, 3, 2, 3, 3, 3])

DT[:2, :, by(f.id2, f.id4), sort(-f.v3)]

which produces

     id2  id4  v3
 --  ---  ---  --
  0    1    1   3
  1    1    1   2
  2    2    1   3
  3    2    1   3

[4 rows x 3 columns]

Explanation:

  • by(f.id2, f.id4) groups the data by columns "id2" and "id4";
  • the sort(-f.v3) command tells datatable to sort the records by column "v3" in descending order. In the presence of by() this operator will be applied within each group;
  • the first :2 selects the top 2 rows, again within each group;
  • the second : selects all columns. If needed, this could have been a list of columns or expressions, allowing you to perform some operation(s) on the first 2 rows of each group.
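The whole pipeline (group by two keys, sort descending within groups, take the top 2 rows) can be mimicked with the Python standard library, as a rough sketch of what the datatable expression does:

```python
from collections import defaultdict

# same made-up data as the Frame above
rows = [{"id2": 1, "id4": 1, "v3": 1}, {"id2": 2, "id4": 1, "v3": 3},
        {"id2": 1, "id4": 1, "v3": 2}, {"id2": 2, "id4": 1, "v3": 3},
        {"id2": 1, "id4": 1, "v3": 3}, {"id2": 2, "id4": 1, "v3": 3}]

groups = defaultdict(list)
for r in rows:                                         # by(f.id2, f.id4)
    groups[(r["id2"], r["id4"])].append(r)

top2 = []
for key in sorted(groups):
    grp = sorted(groups[key], key=lambda r: -r["v3"])  # sort(-f.v3) per group
    top2.extend(grp[:2])                               # the :2 row slice

print([(r["id2"], r["v3"]) for r in top2])
# -> [(1, 3), (1, 2), (2, 3), (2, 3)]
```

The result matches the four rows the datatable expression produces above.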

