What Is the Fastest Way to Get a Vector of Sorted Unique Values from a data.table

What is the fastest way to get a vector of sorted unique values from a data.table?

For benchmarking, a larger data.table with 1,000,000 rows is created:

n <- 1e6
set.seed(1234) # to reproduce the data
salesdt <- data.table(company = sample(company, n, TRUE),
                      item = sample(item, n, TRUE),
                      sales = sample(sales, n, TRUE))
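The vectors company, item, and sales come from the original question and are not shown here. Hypothetical stand-ins that make the snippet above reproducible (the five company letters match the results quoted further below) could be:

```r
library(data.table)

# Hypothetical stand-ins for the input vectors of the original question;
# the five company letters match the results quoted further below
company <- c("A", "S", "W", "L", "T")
item <- paste0("item", 1:9)
sales <- seq(100, 10000, by = 100)
```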

For the sake of completeness, the variants

# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[,sort(company)])

are benchmarked as well, although it seems obvious that sorting the unique values should be faster than the other way around.

In addition, two other sort options from this answer are included:

# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]

Edit: Following up on Frank's comment, I've included his suggestion:

# 8
salesdt[,logical(1), keyby = company]$company
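The trick works because keyby sorts the table by the grouping column, while logical(1) is just a cheap dummy value computed once per group; extracting $company then yields the sorted unique values. A minimal sketch:

```r
library(data.table)

dt <- data.table(company = c("T", "A", "W", "A", "T"))
# keyby sorts the grouping values; logical(1) is a cheap dummy j,
# so $company returns one sorted, unique value per group
res <- dt[, logical(1), keyby = company]$company
res
# [1] "A" "T" "W"
```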

Benchmarking, no key set

Benchmarking is done with the help of the microbenchmark package:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[order(company), unique(company)],
  unique(sort(salesdt$company)),
  unique(salesdt[, sort(company)]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[, logical(1), keyby = company]$company
)

The timings are displayed with

ggplot2::autoplot(timings)

Please note the reverse order in the chart (#1 at bottom, #8 at top).

[Benchmark chart: timings, no key set]

As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest, which confirms Frank's comment.

A bit of a surprise to me was variant #3. Despite data.table's fast radix sort, it is less efficient than #1 and #2. It seems to sort first and then extract the unique values.

Benchmarking, data.table keyed by company

Motivated by this observation, I repeated the benchmark with the data.table keyed by company.

setkeyv(salesdt, "company")

The timings show (please note the change in scale of the time axis) that #4 and #5 have been accelerated dramatically by keying. They are even faster than #3. Note that the timings for variant #8 are included in the next section.

[Benchmark chart: timings, data.table keyed by company]

Benchmarking, keyed with a bit of tuning

Variant #3 still includes order(company), which isn't necessary if the table is already keyed by company. So, I removed the now-unnecessary calls to order and sort from #3 and #5:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[, unique(company)],
  unique(salesdt$company),
  unique(salesdt[, company]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[, logical(1), keyby = company]$company
)

The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastest.

[Benchmark chart: timings, keyed with a bit of tuning]

Caveat: The benchmarking is based on the original data, which only includes 5 different letters as company names. It is likely that the results will look different with a larger number of distinct company names. The results have been obtained with data.table v1.9.7.

How to sort(unique(x)) or faster method on many variables with data.table?

I think that sort(unique(x)) is probably the fastest idiomatic way to go, though the link in the OP shows benchmarks favoring another approach that might be worth looking into if speed is critical.

For the OP's case of making an ordered factor with alphanumerically ordered levels, though, we don't need to explicitly specify the levels. From ?factor:

The default [for the levels parameter] is the unique set of values taken by as.character(x), sorted into increasing order of x.

Also, we can reduce code repetition using lapply and .SD:

cols = c("V1", "V2")
DT[, (cols) := lapply(.SD, factor, ordered=TRUE), .SDcols=cols]
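A quick check on a small hypothetical table confirms that the default factor levels come out sorted, so no explicit levels argument is needed:

```r
library(data.table)

# Small hypothetical table with unsorted character columns
DT <- data.table(V1 = c("b", "a", "c", "a"), V2 = c("z", "x", "z", "y"))
cols = c("V1", "V2")
DT[, (cols) := lapply(.SD, factor, ordered = TRUE), .SDcols = cols]

# The default levels are the sorted unique values
levels(DT$V1)
# [1] "a" "b" "c"
```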

Unique sorted rows single column from R data.table

The OP didn't specify in which particular order the company names should show up in the select. The answers given by Symbolix are perfect if you want the company names ordered alphabetically (there is no need to use keyby = for this purpose).

You can also order the company names by the number of rows (most important first)

salesdt[, .N, by = company][order(-N), company]
[1] "W" "T" "A" "S" "L"

or by total sales volume (again, most important first)

salesdt[, sum(sales), by = company][order(-V1), company]
[1] "W" "T" "A" "L" "S"

In both cases, this is done in two steps: (1) compute the metric (either count or sum) per company using by =, and (2) order the result as desired but return only the company names.
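With a small hypothetical table, the two-step pattern looks like this:

```r
library(data.table)

# Small hypothetical table to illustrate the two-step pattern
mini <- data.table(company = c("W", "A", "W", "T", "W", "A"),
                   sales   = c(10, 5, 20, 30, 1, 2))

# step 1: aggregate per company; step 2: order by the metric, return names
byCount <- mini[, .N, by = company][order(-N), company]
byCount
# [1] "W" "A" "T"

bySales <- mini[, sum(sales), by = company][order(-V1), company]
bySales
# [1] "W" "T" "A"
```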

Finding number of unique values (quickly) across a large data.table

How about this - an alternative to unique using an extra data.table within an anonymous function:

review2[, {
  uid <- data.table(user_id)
  rev_user <- uid[, .N, by = user_id][, .N]

  list(
    review.num = .N,
    review.users = rev_user,
    review.stars = mean(stars),
    review.votes.funny = sum(votes.funny),
    review.votes.useful = sum(votes.useful),
    review.votes.cool = sum(votes.cool)
  )
}, by = business_id]
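If only the distinct-user count is needed, data.table's uniqueN() counts distinct values per group directly and can replace the nested data.table. A minimal sketch, using a hypothetical stand-in for review2 (the real table comes from the original question):

```r
library(data.table)

# Hypothetical stand-in for the review2 table of the question
review2 <- data.table(business_id = c("b1", "b1", "b2"),
                      user_id     = c("u1", "u1", "u2"),
                      stars       = c(4, 5, 3))

# uniqueN() counts the distinct user_ids per business in one pass
res <- review2[, .(review.num   = .N,
                   review.users = uniqueN(user_id)), by = business_id]
res
```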

Extracting unique rows from a data table in R

Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key in order to determine the columns by which the unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).

As of data.table 1.9.8+, unique.data.table method uses all columns by default which is consistent with the unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) into unique (replacing DT in the call to key with the name of the data.table).

Hence, old behavior would be something like

library(data.table) # v1.9.7 or earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3

While for data.table v1.9.8+, just

b <- data.table(a) 
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys you want to use them

Or without a copy

setDT(a)
dim(unique(a))
## [1] 8 3

Modifying a data.table to only include unique values of one variable

We can get the values of the column named in 'var' with get:

dt[,unique(get(var)), by=var_list2] 
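get() evaluates the column whose name is stored as a string, which is what lets the function below work with column names passed as arguments. A small sketch with a hypothetical choice of var:

```r
library(data.table)

dt <- as.data.table(mtcars)
var <- "gear"  # hypothetical choice of column name

# get(var) resolves to the 'gear' column inside j,
# so this returns the unique gear values per cyl group
res <- dt[, unique(get(var)), by = cyl]
res
```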

If we are passing 'data' and the other arguments to the function, there is no need to specify them outside as well (that just defeats the purpose of the function). Full code:

fun_plot_gam <- function(data, var, var_list) {

  # get the vector of column names that are not in 'var'
  var_list2 <- setdiff(var_list, var)

  # convert to data.table, subset the columns
  dt <- as.data.table(data)[, var_list, with = FALSE]
  # get the `unique` elements of the column specified in 'var',
  # grouped by the 'var_list2' columns
  dt1 <- dt[, unique(get(var)), by = var_list2]
  setnames(dt1, ncol(dt1), var)
  dt1

}

# hypothetical inputs inferred from the output below:
# var <- "gear"; var_list <- c("cyl", "hp", "gear")
fun_plot_gam(mtcars, var, var_list)
# cyl hp gear
# 1: 6 110 4
# 2: 6 110 3
# 3: 4 93 4
# 4: 8 175 3
# 5: 6 105 3
# 6: 8 245 3
# 7: 4 62 4
# 8: 4 95 4
# 9: 6 123 4
#10: 8 180 3
#11: 8 205 3
#12: 8 215 3
#13: 8 230 3
#14: 4 66 4
#15: 4 52 4
#16: 4 65 4
#17: 4 97 3
#18: 8 150 3
#19: 4 91 5
#20: 4 113 5
#21: 8 264 5
#22: 6 175 5
#23: 8 335 5
#24: 4 109 4

