Keep First Row by Multiple Columns in an R Data.Table

data.table - keep first row per group OR based on condition

Try this.

Using mpg >= 50 (a condition no row of mtcars satisfies), we get exactly one row per carb:

x[ rowid(carb) == 1 | mpg >= 50,]
#       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 160.0   110  3.90  2.62 16.46     0     1     4     4
# 2:  22.8     4 108.0    93  3.85  2.32 18.61     1     1     4     1
# 3:  18.7     8 360.0   175  3.15  3.44 17.02     0     0     3     2
# 4:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
# 5:  19.7     6 145.0   175  3.62  2.77 15.50     0     1     5     6
# 6:  15.0     8 301.0   335  3.54  3.57 14.60     0     1     5     8

Using mpg >= 30, which some rows do satisfy, we get all of the above plus a few more:

x[ rowid(carb) == 1 | mpg >= 30,]
#        mpg   cyl  disp    hp  drat     wt  qsec    vs    am  gear  carb
#      <num> <num> <num> <num> <num>  <num> <num> <num> <num> <num> <num>
#  1:  21.0     6 160.0   110  3.90  2.620 16.46     0     1     4     4
#  2:  22.8     4 108.0    93  3.85  2.320 18.61     1     1     4     1
#  3:  18.7     8 360.0   175  3.15  3.440 17.02     0     0     3     2
#  4:  16.4     8 275.8   180  3.07  4.070 17.40     0     0     3     3
#  5:  32.4     4  78.7    66  4.08  2.200 19.47     1     1     4     1
#  6:  30.4     4  75.7    52  4.93  1.615 18.52     1     1     4     2
#  7:  33.9     4  71.1    65  4.22  1.835 19.90     1     1     4     1
#  8:  30.4     4  95.1   113  3.77  1.513 16.90     1     1     5     2
#  9:  19.7     6 145.0   175  3.62  2.770 15.50     0     1     5     6
# 10:  15.0     8 301.0   335  3.54  3.570 14.60     0     1     5     8

An alternative, in case you need more grouping variables:

x[, .SD[seq_len(.N) == 1L | mpg >= 30], by = carb]

though I've been informed that rowid(...) is more efficient than seq_len(.N).

Keep first row by multiple columns in an R data.table

data.table provides S3 methods for unique, duplicated and anyDuplicated

unique(dt, by = c('x','y'))

will give you what you want.
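Since the answer names duplicated and anyDuplicated as well, here is a minimal sketch showing all three S3 methods on a small hypothetical table (the column names x, y, z and the values are made up for illustration):

```r
library(data.table)

# Toy table with a repeated (x, y) combination in rows 1 and 2
dt <- data.table(x = c(1, 1, 2, 2),
                 y = c("a", "a", "b", "c"),
                 z = 1:4)

unique(dt, by = c("x", "y"))        # keeps the first row per (x, y): 3 rows
duplicated(dt, by = c("x", "y"))    # FALSE TRUE FALSE FALSE
anyDuplicated(dt, by = c("x", "y")) # 2: index of the first duplicated row
```

As with the base R generics, duplicated() flags later occurrences and anyDuplicated() returns the index of the first duplicate (or 0 if there is none).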

data.table select first row of each group limited to n-1 columns?

As commented by @christoph, .SD doesn't include the grouping columns (presumably for efficiency, so duplicated group values aren't stored). You can verify this as follows:

unique(DT[, .(name = names(.SD)), by=c('x','v')]$name)
# [1] "y" "a" "b"

unique(DT[, .(name = names(.SD)), by=c('x','v','a')]$name)
# [1] "y" "b"

So if you group by all columns, .SD is empty. For your specific case, you can simply use unique() and pass the grouping variables to its by parameter, which drops duplicates based on those columns:

unique(DT, by=c('x','v'))

#     x v y a b
#  1: b 1 1 1 9
#  2: a 2 1 4 6
#  3: a 1 6 6 4
#  4: c 1 1 7 3
#  5: c 2 3 8 2

unique(DT, by=c('x','v','y','a','b'))

#     x v y a b
#  1: b 1 1 1 9
#  2: b 1 3 2 8
#  3: b 1 6 3 7
#  4: a 2 1 4 6
#  5: a 2 3 5 5
#  6: a 1 6 6 4
#  7: c 1 1 7 3
#  8: c 2 3 8 2
#  9: c 2 6 9 1

data.table - select first n rows within group

As an alternative:

dt[, .SD[1:3], cyl]

On the example dataset, the head method is on par with the .I method from @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD   = dt[, .SD[1:3], cyl],
               I    = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10 a
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10  b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10 a

However, data.table is specifically designed for large datasets. So, running this comparison again:

# create a 30-million-row dataset
largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]
# run the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD   = largeDT[, .SD[1:3], cyl],
               I    = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10  b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10  b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10 a

Now the .I method is clearly the fastest one.


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10  b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10  b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10 a

However with a somewhat smaller dataset (but still quite big), the odds change:

largeDT2 <- dt[, .SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10  b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10  b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10 a

Group-by in data.table with choosing first element in multiple columns

Suppose you want the first element of C1, C2 and C3 respectively: use head() on .SD and specify the column names with .SDcols.

cols <- c("C1", "C2", "C3")
DT[, c(head(.SD, 1), list(AVG_C3=mean(C3), Freq=.N)), by=C4, .SDcols = cols]

   C4 C1 C2 C3 AVG_C3 Freq
1:  A  1 10  1      2    3
2:  B  2 11  2      2    3

Calling a row on a data.table with multiple column key

rowKey <- list(1,4)

dt[rowKey]

#    a b         c
# 1: 1 4 0.4778884

.(1, 4) means list(1, 4), so if you want the same behaviour, define rowKey the same way.
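The answer above assumes a data.table keyed on two columns; here is a minimal reconstruction of that setup (the column names a, b, c and the values are made up for illustration):

```r
library(data.table)

# Toy table keyed on columns a and b
dt <- data.table(a = c(1, 2, 3),
                 b = c(4, 5, 6),
                 c = c(0.48, 0.31, 0.72))
setkey(dt, a, b)

dt[.(1, 4)]          # keyed lookup with the .() shorthand
rowKey <- list(1, 4)
dt[rowKey]           # identical lookup via a plain list
```

Both calls perform the same binary-search join on the key columns and return the single matching row.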

Data Tables: Creating New Column By Examining Multiple Columns On Multiple Rows

This approach handles two levels of recursion, combining dplyr joins with data.table.

dt <- structure(list(id = c("11a", "11b", "11c", "12a", "12b"),
                     prevId = c(NA, "11a", "11b", NA, "12a")),
                row.names = c(NA, -5L),
                class = c("data.table", "data.frame"))

data.table(
  left_join(x = dt,
            y = dt[, .(prevId)],
            by = c("id" = "prevId")) %>%
    left_join(y = dt[, .(id, prevId)],
              by = c("prevId" = "id"))
)[, .(id, prevId,
      originatorId = ifelse(is.na(prevId.y),
                            ifelse(is.na(prevId), id, prevId),
                            prevId.y))]

#     id prevId originatorId
# 1: 11a   <NA>          11a
# 2: 11b    11a          11a
# 3: 11c    11b          11a
# 4: 12a   <NA>          12a
# 5: 12b    12a          12a

Expanded the example to incorporate the comment by @Michael. It scales reasonably well and lets you adjust the number of recursive steps by adding further joins to the pipe. It saves the joined data.table after each iteration, so the matching steps are easy to follow. Finally, the results of each join are combined; the resulting table should give a good overview of the chain of ids in the data.

library(dplyr)
left_join(x = dt,
          y = dt[, .(prevId)],
          by = c("id" = "prevId")) %>%
  data.table(.) %>% { . ->> dt.join.1 } %>%
  left_join(x = .,
            y = dt[, .(Second.id = id, Second.prevId = prevId)],
            by = c("prevId" = "Second.id")) %>%
  data.table(.) %>% { . ->> dt.join.2 }

dt.join.final.data <- rbindlist(list(dt.join.1, dt.join.2),
                                fill = TRUE,
                                idcol = "id",
                                use.names = TRUE)

The resulting data.table then looks like this (the first id column is the idcol added by rbindlist):

> dt.join.final.data
     id  id prevId Second.prevId
 1:   1 11a   <NA>          <NA>
 2:   1 11b    11a          <NA>
 3:   1 11c    11b          <NA>
 4:   1 12a   <NA>          <NA>
 5:   1 12b    12a          <NA>
 6:   2 11a   <NA>          <NA>
 7:   2 11b    11a          <NA>
 8:   2 11c    11b           11a
 9:   2 12a   <NA>          <NA>
10:   2 12b    12a          <NA>

r data.table - select all rows except first (in each group)

If you use tail() with n = -1, it returns all but the first row (see ?tail). You can use this in your command as follows:

mtcars[order(cyl, mpg), tail(.SD, -1), by = .(cyl)]
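To see the negative-n behaviour in isolation, here is a minimal sketch on a toy table (the column names g and v are made up for illustration):

```r
library(data.table)

tail(1:5, -1)   # drops the first element: 2 3 4 5

# Toy table: drop the first row of each group
dt <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)
dt[, tail(.SD, -1), by = g]   # keeps v = 2 (group a) and v = 4, 5 (group b)
```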

Data.table row averages by multiple column groups

An option would be to use split.default

DT[, lapply(split.default(.SD, as.integer(gl(length(vec), 3, length(vec)))),
            rowMeans), .SDcols = vec]
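To make the gl() trick concrete, here is a small sketch with hypothetical columns: six measurement columns that form two blocks of three. gl(length(vec), 3, length(vec)) labels consecutive columns in blocks of three (1 1 1 2 2 2), split.default() splits .SD column-wise by those labels, and rowMeans() averages within each block:

```r
library(data.table)

# Hypothetical table: columns a1-a3 and b1-b3 form two blocks of three
DT  <- data.table(a1 = 1, a2 = 2, a3 = 3, b1 = 4, b2 = 5, b3 = 6)
vec <- names(DT)

# Block labels for the columns: 1 1 1 2 2 2
grp <- as.integer(gl(length(vec), 3, length(vec)))

# Row mean per block: mean(1,2,3) = 2 and mean(4,5,6) = 5
DT[, lapply(split.default(.SD, grp), rowMeans), .SDcols = vec]
```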

data.table, filter = median per group and keep two lowest

Group by 'MainCat' and get the row index (.I) of rows whose 'Value' is at least the group median; extract the index ($V1) and subset the data; then order by 'MainCat' and 'Value' and take the first two rows per 'MainCat' with head():

library(data.table)
df2[df2[, .I[Value >= median(Value, na.rm = TRUE)], .(MainCat)]$V1
    ][order(MainCat, Value), head(.SD, 2), MainCat]

Output:

   MainCat SubCat Value
    <char> <char> <num>
1:       A   ZZZZ    80
2:       A     XX    90
3:       B     YY    60
4:       B    ZZZ   150

