Keep First Row by Multiple Columns in an R Data.Table

data.table - keep first row per group OR based on condition

Try this.

Using mpg >= 50 (a condition no row of mtcars satisfies), we get exactly one row per carb:

x[ rowid(carb) == 1 | mpg >= 50,]
#       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 160.0   110  3.90  2.62 16.46     0     1     4     4
# 2:  22.8     4 108.0    93  3.85  2.32 18.61     1     1     4     1
# 3:  18.7     8 360.0   175  3.15  3.44 17.02     0     0     3     2
# 4:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
# 5:  19.7     6 145.0   175  3.62  2.77 15.50     0     1     5     6
# 6:  15.0     8 301.0   335  3.54  3.57 14.60     0     1     5     8

Using mpg >= 30, which some rows do satisfy, we get all of the above plus a few more:

x[ rowid(carb) == 1 | mpg >= 30,]
#        mpg   cyl  disp    hp  drat     wt  qsec    vs    am  gear  carb
#      <num> <num> <num> <num> <num>  <num> <num> <num> <num> <num> <num>
#  1:  21.0     6 160.0   110  3.90  2.620 16.46     0     1     4     4
#  2:  22.8     4 108.0    93  3.85  2.320 18.61     1     1     4     1
#  3:  18.7     8 360.0   175  3.15  3.440 17.02     0     0     3     2
#  4:  16.4     8 275.8   180  3.07  4.070 17.40     0     0     3     3
#  5:  32.4     4  78.7    66  4.08  2.200 19.47     1     1     4     1
#  6:  30.4     4  75.7    52  4.93  1.615 18.52     1     1     4     2
#  7:  33.9     4  71.1    65  4.22  1.835 19.90     1     1     4     1
#  8:  30.4     4  95.1   113  3.77  1.513 16.90     1     1     5     2
#  9:  19.7     6 145.0   175  3.62  2.770 15.50     0     1     5     6
# 10:  15.0     8 301.0   335  3.54  3.570 14.60     0     1     5     8

An alternative, in case you need more grouping variables:

x[, .SD[seq_len(.N) == 1L | mpg >= 30], by = carb]

though I've been informed that rowid(...) is more efficient than seq_len(.N).

Keep first row by multiple columns in an R data.table

data.table provides S3 methods for unique, duplicated and anyDuplicated

unique(dt, by = c('x','y'))

will give you what you want.
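Since the answer names duplicated and anyDuplicated as well, here is a minimal sketch showing all three S3 methods on a small hypothetical table (the column names x, y, z and the values are made up for illustration):

```r
library(data.table)

# Toy table with a repeated (x, y) combination in rows 1 and 2
dt <- data.table(x = c(1, 1, 2, 2),
                 y = c("a", "a", "b", "c"),
                 z = 1:4)

unique(dt, by = c("x", "y"))        # keeps the first row per (x, y): 3 rows
duplicated(dt, by = c("x", "y"))    # FALSE TRUE FALSE FALSE
anyDuplicated(dt, by = c("x", "y")) # 2: index of the first duplicated row
```

As with the base R generics, duplicated() flags later occurrences and anyDuplicated() returns the index of the first duplicate (or 0 if there is none).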

data.table select first row of each group limited to n-1 columns?

As commented by @christoph, .SD doesn't include the grouping columns (presumably for efficiency, so duplicated group values aren't stored). You can verify this as follows:

unique(DT[, .(name = names(.SD)), by=c('x','v')]$name)
# [1] "y" "a" "b"

unique(DT[, .(name = names(.SD)), by=c('x','v','a')]$name)
# [1] "y" "b"

So if you group by all columns, .SD is empty. For your specific case, you can simply use unique() and pass the grouping variables to its by parameter, which drops duplicates based on those columns:

unique(DT, by=c('x','v'))

#     x v y a b
#  1: b 1 1 1 9
#  2: a 2 1 4 6
#  3: a 1 6 6 4
#  4: c 1 1 7 3
#  5: c 2 3 8 2

unique(DT, by=c('x','v','y','a','b'))

#     x v y a b
#  1: b 1 1 1 9
#  2: b 1 3 2 8
#  3: b 1 6 3 7
#  4: a 2 1 4 6
#  5: a 2 3 5 5
#  6: a 1 6 6 4
#  7: c 1 1 7 3
#  8: c 2 3 8 2
#  9: c 2 6 9 1

data.table - select first n rows within group

As an alternative:

dt[, .SD[1:3], cyl]

On the example dataset, the head method is on par with the .I method from @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD   = dt[, .SD[1:3], cyl],
               I    = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10 a
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10  b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10 a

However, data.table is specifically designed for large datasets. So, running this comparison again:

# create a 30-million-row dataset
largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]
# run the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD   = largeDT[, .SD[1:3], cyl],
               I    = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10  b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10  b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10 a

Now the .I method is clearly the fastest one.


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10  b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10  b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10 a

However with a somewhat smaller dataset (but still quite big), the odds change:

largeDT2 <- dt[, .SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10  b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10  b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10 a

Group-by in data.table with choosing first element in multiple columns

Suppose you want the first element of C1, C2 and C3 respectively: use head() on .SD and specify the column names with .SDcols.

cols <- c("C1", "C2", "C3")
DT[, c(head(.SD, 1), list(AVG_C3=mean(C3), Freq=.N)), by=C4, .SDcols = cols]

   C4 C1 C2 C3 AVG_C3 Freq
1:  A  1 10  1      2    3
2:  B  2 11  2      2    3

Calling a row on a data.table with multiple column key

rowKey <- list(1,4)

dt[rowKey]

#    a b         c
# 1: 1 4 0.4778884

.(1, 4) means list(1, 4), so if you want the same behaviour, define rowKey the same way.
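The answer above assumes a data.table keyed on two columns; here is a minimal reconstruction of that setup (the column names a, b, c and the values are made up for illustration):

```r
library(data.table)

# Toy table keyed on columns a and b
dt <- data.table(a = c(1, 2, 3),
                 b = c(4, 5, 6),
                 c = c(0.48, 0.31, 0.72))
setkey(dt, a, b)

dt[.(1, 4)]          # keyed lookup with the .() shorthand
rowKey <- list(1, 4)
dt[rowKey]           # identical lookup via a plain list
```

Both calls perform the same binary-search join on the key columns and return the single matching row.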

Data Tables: Creating New Column By Examining Multiple Columns On Multiple Rows

This approach handles two levels of recursion, combining dplyr joins with data.table.

dt <- structure(list(id = c("11a", "11b", "11c", "12a", "12b"),
                     prevId = c(NA, "11a", "11b", NA, "12a")),
                row.names = c(NA, -5L),
                class = c("data.table", "data.frame"))

data.table(
  left_join(x = dt,
            y = dt[, .(prevId)],
            by = c("id" = "prevId")) %>%
    left_join(y = dt[, .(id, prevId)],
              by = c("prevId" = "id"))
)[, .(id, prevId,
      originatorId = ifelse(is.na(prevId.y),
                            ifelse(is.na(prevId), id, prevId),
                            prevId.y))]

#     id prevId originatorId
# 1: 11a   <NA>          11a
# 2: 11b    11a          11a
# 3: 11c    11b          11a
# 4: 12a   <NA>          12a
# 5: 12b    12a          12a

Expanded the example to incorporate the comment by @Michael. It scales reasonably well and lets you adjust the number of recursive steps by adding further joins to the pipe. It saves the joined data.table after each iteration, so the matching steps are easy to follow. Finally, the results of each join are combined; the resulting table should give a good overview of the chain of ids in the data.

library(dplyr)
left_join(x = dt,
          y = dt[, .(prevId)],
          by = c("id" = "prevId")) %>%
  data.table(.) %>% { . ->> dt.join.1 } %>%
  left_join(x = .,
            y = dt[, .(Second.id = id, Second.prevId = prevId)],
            by = c("prevId" = "Second.id")) %>%
  data.table(.) %>% { . ->> dt.join.2 }

dt.join.final.data <- rbindlist(list(dt.join.1, dt.join.2),
                                fill = TRUE,
                                idcol = "id",
                                use.names = TRUE)

The resulting data.table then looks like this (the first id column is the idcol added by rbindlist):

> dt.join.final.data
     id  id prevId Second.prevId
 1:   1 11a   <NA>          <NA>
 2:   1 11b    11a          <NA>
 3:   1 11c    11b          <NA>
 4:   1 12a   <NA>          <NA>
 5:   1 12b    12a          <NA>
 6:   2 11a   <NA>          <NA>
 7:   2 11b    11a          <NA>
 8:   2 11c    11b           11a
 9:   2 12a   <NA>          <NA>
10:   2 12b    12a          <NA>

r data.table - select all rows except first (in each group)

If you use tail() with n = -1, it returns all but the first row (see ?tail). You can use this in your command as follows:

mtcars[order(cyl, mpg), tail(.SD, -1), by = .(cyl)]
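To see the negative-n behaviour in isolation, here is a minimal sketch on a toy table (the column names g and v are made up for illustration):

```r
library(data.table)

tail(1:5, -1)   # drops the first element: 2 3 4 5

# Toy table: drop the first row of each group
dt <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)
dt[, tail(.SD, -1), by = g]   # keeps v = 2 (group a) and v = 4, 5 (group b)
```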

Data.table row averages by multiple column groups

An option would be to use split.default

DT[, lapply(split.default(.SD, as.integer(gl(length(vec), 3, length(vec)))),
            rowMeans), .SDcols = vec]
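To make the gl() trick concrete, here is a small sketch with hypothetical columns: six measurement columns that form two blocks of three. gl(length(vec), 3, length(vec)) labels consecutive columns in blocks of three (1 1 1 2 2 2), split.default() splits .SD column-wise by those labels, and rowMeans() averages within each block:

```r
library(data.table)

# Hypothetical table: columns a1-a3 and b1-b3 form two blocks of three
DT  <- data.table(a1 = 1, a2 = 2, a3 = 3, b1 = 4, b2 = 5, b3 = 6)
vec <- names(DT)

# Block labels for the columns: 1 1 1 2 2 2
grp <- as.integer(gl(length(vec), 3, length(vec)))

# Row mean per block: mean(1,2,3) = 2 and mean(4,5,6) = 5
DT[, lapply(split.default(.SD, grp), rowMeans), .SDcols = vec]
```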

data.table, filter = median per group and keep two lowest

Group by 'MainCat' and get the row index (.I) of rows whose 'Value' is at least the group median; extract the index ($V1) and subset the data; then order by 'MainCat' and 'Value' and take the first two rows per 'MainCat' with head():

library(data.table)
df2[df2[, .I[Value >= median(Value, na.rm = TRUE)], .(MainCat)]$V1
    ][order(MainCat, Value), head(.SD, 2), MainCat]

Output:

   MainCat SubCat Value
    <char> <char> <num>
1:       A   ZZZZ    80
2:       A     XX    90
3:       B     YY    60
4:       B    ZZZ   150

