Sort a data.table fast by Ascending/Descending order
Update June 5 2014:
The current development version of data.table v1.9.3 has two new functions implemented, namely: setorder
and setorderv
, which does exactly what you require. These functions reorder the data.table
by reference with the option to choose either ascending or descending order on each column to order by. Check out ?setorder
for more info.
In addition, DT[order(.)]
is also by default optimised to use data.table
's internal fast order instead of base:::order
. This, unlike setorder
, will make an entire copy of the data, and is therefore less memory efficient, but will still be orders of magnitude faster than operating using base's order.
Benchmarks:
Here's an illustration on the speed differences using setorder
, data.table's internal fast order and with base:::order
:
require(data.table) ## 1.9.3
set.seed(1L)
DT <- data.table(Year = sample(1950:2000, 3e6, TRUE),
memberID = sample(paste0("V", 1:1e4), 3e6, TRUE),
month = sample(12, 3e6, TRUE))
## using base:::order
system.time(ans1 <- DT[base:::order(Year, memberID, -month)])
# user system elapsed
# 76.909 0.262 81.266
## optimised to use data.table's fast order
system.time(ans2 <- DT[order(Year, memberID, -month)])
# user system elapsed
# 0.985 0.030 1.027
## reorders by reference
system.time(setorder(DT, Year, memberID, -month))
# user system elapsed
# 0.585 0.013 0.600
## or alternatively
## setorderv(DT, c("Year", "memberID", "month"), c(1,1,-1))
## are they equal?
identical(ans2, DT) # [1] TRUE
identical(ans1, ans2) # [1] TRUE
On this data, benchmarks indicate that data.table's order is about ~79x faster than base:::order
and setorder
is ~135x faster than base:::order
here.
data.table
always sorts/orders in C-locale. If you should require to order in another locale, only then do you need to resort to using DT[base:::order(.)]
.
All these new optimisations and functions together constitute FR #2405. bit64::integer64 support also has been added.
NOTE: Please refer to the history/revisions for earlier answer and updates.
R data.table - How specify by variable in descending order?
We can first order
the data and then assign the flag as required.
library(data.table)
dt <- mtcars
setDT(dt)
dt[, temp := 1]
dt1 <- dt[order(cyl,-mpg)]
dt1[, first_cyl := seq_len(.N) == which.max(temp), cyl]
# mpg cyl disp hp drat wt qsec vs am gear carb temp first_cyl
# 1: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1 TRUE
# 2: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 1 FALSE
# 3: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1 FALSE
# 4: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 1 FALSE
# 5: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 1 FALSE
# 6: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1 FALSE
# 7: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1 FALSE
# 8: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1 FALSE
# 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 1 FALSE
#10: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1 FALSE
#11: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1 FALSE
#12: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1 TRUE
#13: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1 FALSE
#....
#....
This assigns TRUE
to the first row where temp = 1
for each cyl
.
Sort rows in data.table in decreasing order on string key `order(-x,v)` gives error on data.table 1.9.4 or earlier
Update
data.table v1.9.6+ now supports OP's original attempt and the following answer is no longer necessary.
You can use DT[order(-rank(x), y)]
.
x y v
1: c 1 7
2: c 3 8
3: c 6 9
4: b 1 1
5: b 3 2
6: b 6 3
7: a 1 4
8: a 3 5
9: a 6 6
Doing absolute descending sort of data.table through function?
Try this:
sort.field = "score"
sortme <- function(dt, sort.field) dt[order(-abs(dt[[sort.field]]))]
sortme(DT, sort.field)
# id score
#1: b 5
#2: z -2
#3: a 1
What you did after the first function definition was to pass a string to the abs
function, hence the ERROR get non-numeric argument to maths function as it sees string
relabel clusters (column) in descending order in data table in R
[Updated] to reflect clarifications to the original post:
dt <- data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(1, 1, 2, 2)
) %>% group_by(cluster) %>% #create the centroids variable
mutate(centroid = mean(population)) %>% ungroup()
# implicitly rank the centroids, assigning increasing integers to decreasing vals
#assign the result as the cluster
dt %>% mutate("cluster" = frankv(centroid, ties.method = "dense"))
# A tibble: 4 x 4
city population cluster centroid
<chr> <dbl> <int> <dbl>
1 NYC 10 2 9.5
2 LA 9 2 9.5
3 Hawaii 1 1 1.5
4 Essex 2 1 1.5
[Previous]
You can reverse code any variable using the formula: max(x)+min(x) - x
dt <- data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(1, 1, 2, 2)
)
dt %>% mutate_at("cluster", ~max(.)+1-.)
city population cluster
1 NYC 10 2
2 LA 9 2
3 Hawaii 1 1
4 Essex 2 1
or
another solution is to use case_when
if you have more than 2 categories, and you don't want to "reverse code" them, per se:
dt %>% mutate("cluster" = case_when(cluster == 2 ~ 1, cluster == 1 ~ 2))
city population cluster
1 NYC 10 2
2 LA 9 2
3 Hawaii 1 1
4 Essex 2 1
Sort data.frame or data.table using vector of column names
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder
:
setorder(DF, C, A)
The advantage of using setorder
/setorderv
is that the data is reordered by reference and thus very fast and memory efficient. Both functions work on data.table
's as wel as on data.frame
's.
If you want to combine ascending and descending ordering, you can use the order
-parameter of setorderv
:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder
you can achieve the same with:
setorder(DF, C, -A)
Related Topics
Removing Display of Row Names from Data Frame
What Are the Double Colons (::) in R
Inst and Extdata Folders in R Packaging
Multiple Graphs in One Canvas Using Ggplot2
Unique() for More Than One Variable
Displaying a Greater Than or Equal Sign
Subset Based on Variable Column Name
Get All Diagonal Vectors from Matrix
Dplyr::Group_By_ with Character String Input of Several Variable Names
Plotting with Ggplot2: "Error: Discrete Value Supplied to Continuous Scale" on Categorical Y-Axis
Download a File from Https Using Download.File()
Same Function Over Multiple Data Frames in R
How to Insert an Image into the Navbar on a Shiny Navbarpage()
Extract a Column from a Data.Table as a Vector, by Position