How to Emulate SQLs Rank Functions in R

How to emulate SQLs rank functions in R?

The data.table package, especially starting with version 1.8.1, offers much of the functionality of partition in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER should be obvious by the end.

Here's an example: Load the latest version of data.table from R-Forge:

install.packages("data.table",
  repos= c("http://R-Forge.R-project.org", getOption("repos")))

library(data.table)

Create some example data:

set.seed(10)

DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

> DT
    ID group       value info
 1:  1     1  0.01874617    a
 2:  2     1 -0.18425254    b
 3:  3     1 -1.37133055    b
 4:  4     2 -0.59916772    a
 5:  5     2  0.29454513    b
 6:  6     2  0.38979430    a
 7:  7     3 -1.20807618    b
 8:  8     3 -0.36367602    a
 9:  9     3 -1.62667268    c
10: 10     4 -0.25647839    d
11: 11     4  1.10177950    c
12: 12     4  0.75578151    d

Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):

> DT[,valRank:=rank(-value),by="group"]
    ID group       value info valRank
 1:  1     1  0.01874617    a       1
 2:  2     1 -0.18425254    b       2
 3:  3     1 -1.37133055    b       3
 4:  4     2 -0.59916772    a       3
 5:  5     2  0.29454513    b       2
 6:  6     2  0.38979430    a       1
 7:  7     3 -1.20807618    b       2
 8:  8     3 -0.36367602    a       1
 9:  9     3 -1.62667268    c       3
10: 10     4 -0.25647839    d       3
11: 11     4  1.10177950    c       1
12: 12     4  0.75578151    d       2

For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):

DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]

R> DT
    ID group       value info valRank infoRank infoRankDense
 1:  1     1  0.01874617    a       1        1             1
 2:  2     1 -0.18425254    b       2        2             2
 3:  3     1 -1.37133055    b       3        2             2
 4:  4     2 -0.59916772    a       3        1             1
 5:  5     2  0.29454513    b       2        3             2
 6:  6     2  0.38979430    a       1        1             1
 7:  7     3 -1.20807618    b       2        2             2
 8:  8     3 -0.36367602    a       1        1             1
 9:  9     3 -1.62667268    c       3        3             3
10: 10     4 -0.25647839    d       3        2             2
11: 11     4  1.10177950    c       1        1             1
12: 12     4  0.75578151    d       2        2             2

p.s. Hi Matthew Dowle.

LEAD and LAG

For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:

set.seed(10)

DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)

> DT
    ID group       value info idRank
 1:  4     1 -0.36367602    b      1
 2:  5     1 -1.62667268    b      2
 3:  7     1 -1.20807618    b      3
 4:  1     2  1.10177950    a      1
 5:  2     2  0.75578151    a      2
 6: 12     2 -0.25647839    b      3
 7:  3     3  0.74139013    c      1
 8:  6     3  0.98744470    b      2
 9:  9     3 -0.23823356    a      3
10:  8     4 -0.19515038    c      1
11: 10     4  0.08934727    c      2
12: 11     4 -0.95494386    c      3

Then to get the values of the previous 1 record, use the group and idRank variables and subtract 1 from the idRank and use the multi = 'last' argument. To get the value from the record two entries above, subtract 2.

DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]

    ID group       value info idRank        prev      prev2
 1:  4     1 -0.36367602    b      1          NA         NA
 2:  5     1 -1.62667268    b      2 -0.36367602         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760
 4:  1     2  1.10177950    a      1          NA         NA
 5:  2     2  0.75578151    a      2  1.10177950         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795
 7:  3     3  0.74139013    c      1          NA         NA
 8:  6     3  0.98744470    b      2  0.74139013         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901
10:  8     4 -0.19515038    c      1          NA         NA
11: 10     4  0.08934727    c      2 -0.19515038         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504

For LEAD, add the appropriate offset to the idRank variable and switch to multi = 'first':

DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]

    ID group       value info idRank        prev      prev2         nex       nex2
 1:  4     1 -0.36367602    b      1          NA         NA -1.62667268 -1.2080762
 2:  5     1 -1.62667268    b      2 -0.36367602         NA -1.20807618         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760          NA         NA
 4:  1     2  1.10177950    a      1          NA         NA  0.75578151 -0.2564784
 5:  2     2  0.75578151    a      2  1.10177950         NA -0.25647839         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795          NA         NA
 7:  3     3  0.74139013    c      1          NA         NA  0.98744470 -0.2382336
 8:  6     3  0.98744470    b      2  0.74139013         NA -0.23823356         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901          NA         NA
10:  8     4 -0.19515038    c      1          NA         NA  0.08934727 -0.9549439
11: 10     4  0.08934727    c      2 -0.19515038         NA -0.95494386         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504          NA         NA

How to get relative rankings of numeric elements in a list or vector in R?

I think you are looking for dplyr::dense_rank():

# Example 1
dplyr::dense_rank(c(1, 1, 1, 3, 1, 4, 1))
#> [1] 1 1 1 2 1 3 1

# Example 2
dplyr::dense_rank(c(4, 1, 1, 1, 3, 5, 1))
#> [1] 3 1 1 1 2 4 1

# Example in code
dplyr::dense_rank(c(1, 1, 1, 2, 1, 3, 1))
#> [1] 1 1 1 2 1 3 1

How to use Spotfire Dense Rank function in R and How to Implement it R?

Please Try dense_rank function from dplyr package
MainData$MeasureRank <- dense_rank(MainData$Number_Measure)

Rank doesn't start at 1 in R

You need to use dense_rank.

test <- data.frame(column1 = c(5,5,5,6,6,7,7,7,8))
test$rank <- dplyr::dense_rank(test$column1)

Working of window ranking function

test %>% rename(input = column1) %>% 
  mutate(row_num_output = row_number(input),
                rank_output = min_rank(input),
                dense_rank_output = dense_rank(input))

Output to give a better understanding for your input

Sample Image

ranking dataframe using two columns in R

You can use data.table::frank or dplyr::min_rank:

`data.table::frank`

dt$Rank <- frank(dt, B, A, ties.method = "min")
dt
  A B Rank
1 1 1    1
2 2 1    2
3 2 1    2
4 4 4    5
5 5 3    4

`dplyr::min_rank`

mutate(dt, Rank = min_rank(paste(B,A)))
  A B Rank
1 1 1    1
2 2 1    2
3 2 1    2
4 4 4    5
5 5 3    4

Data

dt <- data.frame(A = c(1,2,2,4,5), B = c(1,1,1,4,3))

Calculate rank based on several columns, with a precedence rule

frank and frankv in data.table "accepts vectors, lists, data.frames or data.tables as input", which can be useful here.

First, frankv. It has a cols argument where columns to be ranked can be specified in a character vector - convenient if there are many column names which need to be generated programmatically. It also has a neat order argument.

library(data.table)
setDT(df)
df[ , Var34 := Var3 + Var4]
cols = c("Var1", "Var2", "Var34")
df[ , r := frankv(.SD, cols, order = -1L, ties.method = "dense")]
df[ , Var34 := NULL]

#     Var1 Var2 Var3 Var4  r
#  1:    0    0    0    0 12
#  2:    1    0    0    0  6
#  3:    0    1    0    0  9
#  4:    1    1    0    0  3
#  5:    0    0    1    0 11
#  6:    1    0    1    0  5
#  7:    0    1    1    0  8
#  8:    1    1    1    0  2
#  9:    0    0    0    1 11
# 10:    1    0    0    1  5
# 11:    0    1    0    1  8
# 12:    1    1    0    1  2
# 13:    0    0    1    1 10
# 14:    1    0    1    1  4
# 15:    0    1    1    1  7
# 16:    1    1    1    1  1

frank is handy for interactive use:

df[ , r := frank(.SD, -Var1, -Var2, -Var34, ties.method = "dense")]

Related answers: How to emulate SQLs rank functions in R?; Rank based on several variables

How to get consecutive rank for multiple variables

Well, an easy way would be to convert to factor and then integer

df[] <- lapply(df, function(x) as.integer(factor(x)))
df
#   var x y z
#G1   1 1 2 1
#G2   2 1 3 2
#G3   3 2 1 3
#G4   4 3 1 2
#G5   5 3 2 4

Rank function gives incorrect Ranking

Perhaps you just want the ranking by dcount. If so:

select distinct a.pattern, convert(numeric(18,0),
       coalesce(b.DCount, 0) as DCount,
       dense_rank() over (order by b.[DCount] desc) as [Rank]  
from a cross join b;

Simulate rank in mysql (without rank) with two conditions

SQL DEMO

  SELECT player, wins, diff, dense_rank, rank, dense_val, prev_wins, prev_diff
  FROM
     (
         SELECT player, 
                wins,                 
                diff, 
                @dense_rank := IF(wins = @prev_wins and diff = @prev_diff, @dense_rank, @dense_rank + @dense_val ) AS dense_rank,                
                @dense_val  := IF(wins = @prev_wins and diff = @prev_diff, @dense_val + 1 , 1) as dense_val,
                @rank := @rank + 1 as rank,
                @prev_wins := wins as prev_wins,
                @prev_diff := diff as prev_diff

         FROM tmpPoradi,(SELECT @dense_rank := 0, @dense_val := 1, @rank := 0, @prev_wins := 0, @prev_diff := 0) r 
         ORDER BY wins DESC,diff DESC
     )  rt
  ORDER BY rank

OUTPUT

Sample Image

How to Emulate SQLs Rank Functions in R