What is the fastest way to get a vector of sorted unique values from a data.table?
For benchmarking, a larger data.table is created with 1,000,000 rows:
library(data.table)
n <- 1e6
set.seed(1234) # to reproduce the data
# 'company', 'item' and 'sales' are the small sample vectors from the
# original data (their definitions are not shown in this excerpt)
salesdt <- data.table(company = sample(company, n, TRUE),
                      item    = sample(item, n, TRUE),
                      sales   = sample(sales, n, TRUE))
For the sake of completeness, the variants
# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[, sort(company)])
will also be benchmarked, although it seems obvious that sorting unique values should be faster than the other way around.
In addition, two other sort options from this answer are included:
# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]
Edit: Following from Frank's comment, I've included his suggestion:
# 8
salesdt[,logical(1), keyby = company]$company
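This trick works because keyby groups by the column and sorts the result, returning one row per group; logical(1) is just a cheap throwaway column. A minimal sketch on invented toy data:

```r
library(data.table)

# toy data: duplicated, unsorted company names
dt <- data.table(company = c("B", "A", "C", "A", "B"))

# keyby groups by company AND sorts the groups, so one row comes back
# per unique company, already in order; logical(1) is a dummy column
sorted_unique <- dt[, logical(1), keyby = company]$company
sorted_unique
# [1] "A" "B" "C"
```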
Benchmarking, no key set
Benchmarking is done with the help of the microbenchmark package:
timings <- microbenchmark::microbenchmark(
sort(salesdt[, unique(company)]),
sort(unique(salesdt$company)),
salesdt[order(company), unique(company)],
unique(sort(salesdt$company)),
unique(salesdt[,sort(company)]),
salesdt[, .N, by = company][order(-N), company],
salesdt[, sum(sales), by = company][order(-V1), company],
salesdt[,logical(1), keyby = company]$company
)
The timings are displayed with
ggplot2::autoplot(timings)
Please, note the reverse order in the chart (#1 at bottom, #8 at top).
As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest, which confirms Frank's comment.
A bit of a surprise to me was variant #3. Despite data.table's fast radix sort, it is less efficient than #1 and #2. It seems to sort first and then extract the unique values.
Benchmarking, data.table keyed by company
Motivated by this observation, I repeated the benchmark with the data.table keyed by company.
setkeyv(salesdt, "company")
The timings show (please note the change in scale of the time axis) that #4 and #5 have been accelerated dramatically by keying. They are even faster than #3. Note that timings for variant #8 are included in the next section.
Benchmarking, keyed with a bit of tuning
Variant #3 still includes order(company), which isn't necessary if the table is already keyed by company. So I removed the unnecessary calls to order and sort from #3 and #5:
timings <- microbenchmark::microbenchmark(
sort(salesdt[, unique(company)]),
sort(unique(salesdt$company)),
salesdt[, unique(company)],
unique(salesdt$company),
unique(salesdt[, company]),
salesdt[, .N, by = company][order(-N), company],
salesdt[, sum(sales), by = company][order(-V1), company],
salesdt[,logical(1), keyby = company]$company
)
The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastest.
Caveat: The benchmarking is based on the original data, which includes only 5 different letters as company names. It is likely that the results will look different with a larger number of distinct company names. The results were obtained with data.table v1.9.7.
How to sort(unique(x)) or faster method on many variables with data.table?
I think that sort(unique(x))
is probably the fastest idiomatic way to go, though the link in the OP shows benchmarks favoring another approach that might be worth looking into if speed is critical.
For the OP's case of making an ordered factor with alphanumerically ordered levels, though, we don't need to explicitly specify the levels. From ?factor
:
The default [for the levels parameter] is the unique set of values taken by as.character(x), sorted into increasing order of x.
Also, we can reduce code repetition using lapply and .SD:
cols = c("V1", "V2")
DT[, (cols) := lapply(.SD, factor, ordered=TRUE), .SDcols=cols]
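Applied to a small invented data.table, the idiom converts both columns in place, and the factor levels come out alphabetically sorted without being specified:

```r
library(data.table)

DT <- data.table(V1 = c("b", "a", "b"), V2 = c("y", "x", "x"))

cols <- c("V1", "V2")
# convert the listed columns to ordered factors by reference;
# factor() sorts the levels alphabetically by default
DT[, (cols) := lapply(.SD, factor, ordered = TRUE), .SDcols = cols]

levels(DT$V1)
# [1] "a" "b"
```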
Unique sorted rows single column from R data.table
The OP didn't specify in which particular order he wants the company names to show up in the select. The answers given by Symbolix are perfect if you want the company names ordered alphabetically (no need to use keyby =
for this purpose).
You can also order the company names by the number of rows (most important first)
salesdt[, .N, by = company][order(-N), company]
[1] "W" "T" "A" "S" "L"
or by total sales volume (again, most important first)
salesdt[, sum(sales), by = company][order(-V1), company]
[1] "W" "T" "A" "L" "S"
In both cases, this is done in two steps: (1) compute the metric (either count or sum) per company using by =, (2) order the result as desired but return only the company names.
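A sketch of the two-step chain on invented toy data, first by row count, then by total sales volume:

```r
library(data.table)

salesdt <- data.table(company = c("A", "B", "A", "C", "B", "A"),
                      sales   = c(10, 50, 20, 5, 30, 10))

# step 1: count rows per company; step 2: sort descending, return names
by_rows <- salesdt[, .N, by = company][order(-N), company]
by_rows
# [1] "A" "B" "C"

# same pattern with the sum of sales per company
by_sales <- salesdt[, sum(sales), by = company][order(-V1), company]
by_sales
# [1] "B" "A" "C"
```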
Finding number of unique values (quickly) across a large data.table
How about this - an alternative to unique using an extra data.table within an anonymous function:
review2[, {
    uid <- data.table(user_id)
    rev_user <- uid[, .N, by = user_id][, .N]
    list(
        review.num = .N,
        review.users = rev_user,
        review.stars = mean(stars),
        review.votes.funny = sum(votes.funny),
        review.votes.useful = sum(votes.useful),
        review.votes.cool = sum(votes.cool)
    )
}, by = business_id]
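In data.table v1.9.6+, uniqueN() counts distinct values directly, which avoids building the inner data.table. A sketch on invented toy data, with column names following the answer above:

```r
library(data.table)

review2 <- data.table(business_id = c("b1", "b1", "b2"),
                      user_id     = c("u1", "u1", "u2"),
                      stars       = c(3, 5, 4))

# uniqueN() counts distinct user_ids per business in one call
res <- review2[, .(review.num   = .N,
                   review.users = uniqueN(user_id),
                   review.stars = mean(stars)),
               by = business_id]
res
#    business_id review.num review.users review.stars
# 1:          b1          2            1            4
# 2:          b2          1            1            4
```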
Extracting unique rows from a data table in R
Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key to determine the columns by which unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).
As of data.table v1.9.8, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) into unique (replacing DT in the call to key with the name of your data.table).
Hence, the old behavior would be something like
library(data.table) # v1.9.7-
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3
While for data.table v1.9.8+, just
b <- data.table(a)
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys you want to use them
Or without a copy
setDT(a)
dim(unique(a))
## [1] 8 3
Modifying a data.table to only include unique values of one variable
We can get the value of 'var' with get:
dt[, unique(get(var)), by = var_list2]
Full code below. If we are passing 'data' and the other arguments into the function, there is no need to specify them outside as well (that just defeats the purpose of the function):
fun_plot_gam <- function(data,var,var_list){
# get the vector of column names that are not in the 'var'
var_list2 <- setdiff(var_list, var)
# convert to data.table, subset the columns
dt <- as.data.table(data)[, var_list, with=FALSE]
# get the `unique` elements of column specified in 'var'
# grouped by 'var_list2' columns
dt1 <- dt[, unique(get(var)), by=var_list2]
setnames(dt1, ncol(dt1), var)
dt1
}
# e.g. (the exact values aren't shown in the question;
# these are inferred from the output below)
var <- "gear"
var_list <- c("cyl", "hp", "gear")
fun_plot_gam(mtcars, var, var_list)
# cyl hp gear
# 1: 6 110 4
# 2: 6 110 3
# 3: 4 93 4
# 4: 8 175 3
# 5: 6 105 3
# 6: 8 245 3
# 7: 4 62 4
# 8: 4 95 4
# 9: 6 123 4
#10: 8 180 3
#11: 8 205 3
#12: 8 215 3
#13: 8 230 3
#14: 4 66 4
#15: 4 52 4
#16: 4 65 4
#17: 4 97 3
#18: 8 150 3
#19: 4 91 5
#20: 4 113 5
#21: 8 264 5
#22: 6 175 5
#23: 8 335 5
#24: 4 109 4