Sort a Data.Table Fast by Ascending/Descending Order

Sort a data.table fast by Ascending/Descending order


Update June 5 2014:

The current development version of data.table v1.9.3 has two new functions implemented, namely: setorder and setorderv, which does exactly what you require. These functions reorder the data.table by reference with the option to choose either ascending or descending order on each column to order by. Check out ?setorder for more info.

In addition, DT[order(.)] is also by default optimised to use data.table's internal fast order instead of base:::order. This, unlike setorder, will make an entire copy of the data, and is therefore less memory efficient, but will still be orders of magnitude faster than operating using base's order.

Benchmarks:

Here's an illustration on the speed differences using setorder, data.table's internal fast order and with base:::order:

require(data.table) ## 1.9.3
set.seed(1L)
DT <- data.table(Year = sample(1950:2000, 3e6, TRUE),
memberID = sample(paste0("V", 1:1e4), 3e6, TRUE),
month = sample(12, 3e6, TRUE))

## using base:::order
system.time(ans1 <- DT[base:::order(Year, memberID, -month)])
# user system elapsed
# 76.909 0.262 81.266

## optimised to use data.table's fast order
system.time(ans2 <- DT[order(Year, memberID, -month)])
# user system elapsed
# 0.985 0.030 1.027

## reorders by reference
system.time(setorder(DT, Year, memberID, -month))
# user system elapsed
# 0.585 0.013 0.600

## or alternatively
## setorderv(DT, c("Year", "memberID", "month"), c(1,1,-1))

## are they equal?
identical(ans2, DT) # [1] TRUE
identical(ans1, ans2) # [1] TRUE

On this data, benchmarks indicate that data.table's order is about ~79x faster than base:::order and setorder is ~135x faster than base:::order here.

data.table always sorts/orders in C-locale. If you should require to order in another locale, only then do you need to resort to using DT[base:::order(.)].

All these new optimisations and functions together constitute FR #2405. bit64::integer64 support also has been added.


NOTE: Please refer to the history/revisions for earlier answer and updates.

R data.table - How specify by variable in descending order?

We can first order the data and then assign the flag as required.

library(data.table)

dt <- mtcars
setDT(dt)
dt[, temp := 1]
dt1 <- dt[order(cyl,-mpg)]
dt1[, first_cyl := seq_len(.N) == which.max(temp), cyl]

# mpg cyl disp hp drat wt qsec vs am gear carb temp first_cyl
# 1: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1 TRUE
# 2: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 1 FALSE
# 3: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1 FALSE
# 4: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 1 FALSE
# 5: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 1 FALSE
# 6: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1 FALSE
# 7: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1 FALSE
# 8: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1 FALSE
# 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 1 FALSE
#10: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1 FALSE
#11: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1 FALSE
#12: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1 TRUE
#13: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1 FALSE
#....
#....

This assigns TRUE to the first row where temp = 1 for each cyl.

Sort rows in data.table in decreasing order on string key `order(-x,v)` gives error on data.table 1.9.4 or earlier


Update

data.table v1.9.6+ now supports OP's original attempt and the following answer is no longer necessary.


You can use DT[order(-rank(x), y)].

   x y v
1: c 1 7
2: c 3 8
3: c 6 9
4: b 1 1
5: b 3 2
6: b 6 3
7: a 1 4
8: a 3 5
9: a 6 6

Doing absolute descending sort of data.table through function?

Try this:

sort.field = "score"

sortme <- function(dt, sort.field) dt[order(-abs(dt[[sort.field]]))]

sortme(DT, sort.field)
# id score
#1: b 5
#2: z -2
#3: a 1

What you did after the first function definition was to pass a string to the abs function, hence the ERROR get non-numeric argument to maths function as it sees string

relabel clusters (column) in descending order in data table in R


[Updated] to reflect clarifications to the original post:

dt <- data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(1, 1, 2, 2)
) %>% group_by(cluster) %>% #create the centroids variable
mutate(centroid = mean(population)) %>% ungroup()

# implicitly rank the centroids, assigning increasing integers to decreasing vals
#assign the result as the cluster
dt %>% mutate("cluster" = frankv(centroid, ties.method = "dense"))

# A tibble: 4 x 4
city population cluster centroid
<chr> <dbl> <int> <dbl>
1 NYC 10 2 9.5
2 LA 9 2 9.5
3 Hawaii 1 1 1.5
4 Essex 2 1 1.5

[Previous]

You can reverse code any variable using the formula: max(x)+min(x) - x

dt <- data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(1, 1, 2, 2)
)

dt %>% mutate_at("cluster", ~max(.)+1-.)

city population cluster
1 NYC 10 2
2 LA 9 2
3 Hawaii 1 1
4 Essex 2 1

or

another solution is to use case_when if you have more than 2 categories, and you don't want to "reverse code" them, per se:

dt %>% mutate("cluster" = case_when(cluster == 2 ~ 1, cluster == 1 ~ 2))

city population cluster
1 NYC 10 2
2 LA 9 2
3 Hawaii 1 1
4 Essex 2 1

Sort data.frame or data.table using vector of column names

With data.table:

setorderv(DF, sortby)

which gives:

> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3

For completeness, with setorder:

setorder(DF, C, A)

The advantage of using setorder/setorderv is that the data is reordered by reference and thus very fast and memory efficient. Both functions work on data.table's as wel as on data.frame's.


If you want to combine ascending and descending ordering, you can use the order-parameter of setorderv:

setorderv(DF, sortby, order = c(1L, -1L))

which subsequently gives:

> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3

With setorder you can achieve the same with:

setorder(DF, C, -A)


Related Topics



Leave a reply



Submit