# Rank a Vector Based on Order and Replace Ties with Their Average

## Replacing tied ranks by their average

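We can just use `rank` from base R. The default `ties.method` is `"average"`, so tied values receive the mean of the ranks they span. Negating the input ranks the values in descending order: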

```r
x$freq.Freq <- rank(-x$freq.Freq)
x$freq.Freq
#[1] 1.0 2.5 2.5 4.0 6.0 6.0 6.0 8.0 9.0
```

## How to get ranks with no gaps when there are ties among values?

I can think of a quick function to do this. It's not optimal with a for loop but it works:)

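```r
x = c(1, 1, 2, 3, 4, 5, 8, 8)
foo <- function(x) {
    su = sort(unique(x))
    # Write into a copy so that ranks already assigned are never
    # mistaken for original values (e.g. x = c(0.5, 1))
    r = x
    for (i in seq_along(su)) r[x == su[i]] = i
    return(r)
}
foo(x)
# [1] 1 1 2 3 4 5 6 6
```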
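For comparison, the same gap-free ("dense") ranking can be sketched without an explicit assignment loop. The sketch below is in Python, and `dense_rank` is a made-up helper name for this illustration, not a library function: it maps each distinct value to its 1-based position among the sorted distinct values.

```python
def dense_rank(values):
    """Rank values 1, 2, 3, ... with no gaps; tied values share a rank.

    `dense_rank` is a hypothetical helper name, not a library function.
    """
    # 1-based position of each distinct value among the sorted distinct values.
    position = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [position[v] for v in values]


print(dense_rank([1, 1, 2, 3, 4, 5, 8, 8]))  # [1, 1, 2, 3, 4, 5, 6, 6]
print(dense_rank([0.5, 1]))                  # [1, 2]
```

Because the lookup table is built from the original values only, a rank that has already been handed out can never be confused with an input value.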

## rank and order in R

```r
set.seed(1)
x <- sample(1:50, 30)
x
# [1] 14 19 28 43 10 41 42 29 27  3  9  7 44 15 48 18 25 33 13 34 47 39 49  4 30 46  1 40 20  8
rank(x)
# [1]  9 12 16 25  7 23 24 17 15  2  6  4 26 10 29 11 14 19  8 20 28 21 30  3 18 27  1 22 13  5
order(x)
# [1] 27 10 24 12 30 11  5 19  1 14 16  2 29 17  9  3  8 25 18 20 22 28  6  7  4 13 26 21 15 23
```

`rank` returns a vector with the "rank" of each value: the number in the first position, `14`, is the 9th lowest, so the first element of `rank(x)` is `9`. `order` returns the indices that would put the initial vector `x` in order.

The 27th value of `x` is the lowest, so `27` is the first element of `order(x)` - and if you look at `rank(x)`, the 27th element is `1`.

```r
x[order(x)]
# [1]  1  3  4  7  8  9 10 13 14 15 18 19 20 25 27 28 29 30 33 34 39 40 41 42 43 44 46 47 48 49
```
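The relationship between the two functions can also be checked on a small example. The following is a Python sketch, not R: `order` and `rank_distinct` are illustrative names written to mimic R's 1-based results, and `rank_distinct` assumes that all values are distinct.

```python
def order(values):
    """1-based indices that would sort `values`, mimicking R's order()."""
    return [i + 1 for i in sorted(range(len(values)), key=values.__getitem__)]


def rank_distinct(values):
    """1-based ranks, assuming all values are distinct (no tie handling)."""
    ranks = [0] * len(values)
    for position, index in enumerate(order(values), start=1):
        ranks[index - 1] = position
    return ranks


x = [14, 19, 28, 43, 10]
print(order(x))          # [5, 1, 2, 3, 4]
print(rank_distinct(x))  # [2, 3, 4, 5, 1]
# Applying order() twice reproduces the ranks when there are no ties.
print(order(order(x)) == rank_distinct(x))  # True
```

The 5th value is the lowest, so `5` comes first in the ordering and the 5th rank is `1`, which is the same inverse relationship described above for the 27th value.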

## Efficient method to calculate the rank vector of a list in Python

Using scipy, the function you are looking for is `scipy.stats.rankdata`:

```python
In [13]: import scipy.stats as ss

In [19]: ss.rankdata([3, 1, 4, 15, 92])
Out[19]: array([ 2.,  1.,  3.,  4.,  5.])

In [20]: ss.rankdata([1, 2, 3, 3, 3, 4, 5])
Out[20]: array([ 1.,  2.,  4.,  4.,  4.,  6.,  7.])
```

The ranks start at 1, rather than 0 (as in your example), but then again, that's the way `R`'s `rank` function works as well.

Here is a pure-Python equivalent of `scipy`'s `rankdata` function:

```python
def rank_simple(vector):
    # Indices that would sort the vector (0-based, like an argsort)
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a):
    n = len(a)
    ivec = rank_simple(a)
    svec = [a[rank] for rank in ivec]
    sumranks = 0
    dupcount = 0
    newarray = [0] * n
    for i in range(n):  # range replaces Python 2's xrange
        sumranks += i
        dupcount += 1
        # End of a run of equal values: assign the run's average rank
        if i == n - 1 or svec[i] != svec[i + 1]:
            averank = sumranks / float(dupcount) + 1
            for j in range(i - dupcount + 1, i + 1):
                newarray[ivec[j]] = averank
            sumranks = 0
            dupcount = 0
    return newarray

print(rankdata([3, 1, 4, 15, 92]))
# [2.0, 1.0, 3.0, 4.0, 5.0]
print(rankdata([1, 2, 3, 3, 3, 4, 5]))
# [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]
```

## Create a mean rank for a rank-frequency data.frame in R

Sure, just group by frequency and take the mean of the row numbers within each group:

```r
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dt <- data.frame(frequency = c(64, 58, 54, 32, 29, 29, 25, 17, 17, 15, 12, 12, 10))

dt %>%
  arrange(desc(frequency)) %>%
  mutate(rank = row_number()) %>%
  group_by(frequency) %>%
  mutate(mean_rank = mean(rank)) %>%
  ungroup()
#> # A tibble: 13 × 3
#>    frequency  rank mean_rank
#>        <dbl> <int>     <dbl>
#>  1        64     1       1
#>  2        58     2       2
#>  3        54     3       3
#>  4        32     4       4
#>  5        29     5       5.5
#>  6        29     6       5.5
#>  7        25     7       7
#>  8        17     8       8.5
#>  9        17     9       8.5
#> 10        15    10      10
#> 11        12    11      11.5
#> 12        12    12      11.5
#> 13        10    13      13
```

## R: Rank-function with two variables and ties.method random

Since `order(order(x))` gives the same result as `rank(x)` when there are no ties (see "Why does order(order(x)) equal rank(x) in R?"), you could add a random third key to break ties and just do

``order(order(y, z, runif(length(y))))``

to get the rank values.
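The same idea can be sketched in Python: sort the indices by the triple (first key, second key, random number), then invert the ordering to obtain ranks. The function name `rank_two_keys_random` and the `rng` parameter are assumptions made for this illustration; the sample `y` and `z` values are the ones that appear in the output further down.

```python
import random


def rank_two_keys_random(y, z, rng=random):
    """1-based ranks by (y, z); rows tied on both keys are ordered at random."""
    n = len(y)
    tie_breaker = [rng.random() for _ in range(n)]
    # Counterpart of order(y, z, runif(n)): indices sorted by the key triple.
    ordering = sorted(range(n), key=lambda i: (y[i], z[i], tie_breaker[i]))
    # Counterpart of the outer order(): invert the ordering to get ranks.
    ranks = [0] * n
    for position, index in enumerate(ordering, start=1):
        ranks[index] = position
    return ranks


y = [1, 4, 5, 5, 2, 8, 8, 1, 3, 3]
z = [0.2, 0.8, 0.5, 0.4, 0.2, 0.1, 0.1, 0.7, 0.3, 0.3]
print(rank_two_keys_random(y, z))
```

Rows that differ on `y` or `z` always get the same rank; only the two tied pairs (ranks 4 and 5, and ranks 9 and 10) can swap between runs.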

Here's a more involved approach that allows you to use methods from `ties.method`. It requires `dplyr`:

```r
library(dplyr)

# df is not defined here; the output below shows it has columns x, y and z
rank2 <- function(df, key1, key2, ties.method) {
  average <- function(x) mean(x)
  random <- function(x) sample(x, length(x))
  df$r <- order(order(df[[key1]], df[[key2]]))
  # group_by_() is deprecated in current dplyr;
  # group_by(df, across(all_of(c(key1, key2)))) is the modern equivalent
  group_by_(df, key1, key2) %>% mutate(rr = get(ties.method)(r))
}

rank2(df, "y", "z", "average")
# Source: local data frame [10 x 5]
# Groups: y, z [8]
#        x     y     z     r    rr
#    <dbl> <dbl> <dbl> <int> <dbl>
# 1      1     1   0.2     1   1.0
# 2      2     4   0.8     6   6.0
# 3      3     5   0.5     8   8.0
# 4      4     5   0.4     7   7.0
# 5      5     2   0.2     3   3.0
# 6      6     8   0.1     9   9.5
# 7      7     8   0.1    10   9.5
# 8      8     1   0.7     2   2.0
# 9      9     3   0.3     4   4.5
# 10    10     3   0.3     5   4.5
```

## Create ranking for vector of double

One way to do so would be using a `multimap`.

• Place the items in a multimap mapping your objects to `size_t`s (the initial values are unimportant). You can do this in one line (use the constructor that takes iterators).

• Loop (either plainly or using whatever from `algorithm`) and assign 0, 1, ... as the values.

• Loop over the distinct keys. For each distinct key, call `equal_range` for the key, and set its values to the average (again, you can use stuff from `algorithm` for this).

The overall complexity should be Theta(n log(n)), where n is the length of the vector.
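The three steps above can be sketched in Python. Python has no direct counterpart of `std::multimap`, so this sketch assumes a sorted list of (value, original index) pairs as a stand-in, and uses `itertools.groupby` where the C++ version would call `equal_range`. The name `average_ranks` is made up for the example.

```python
from itertools import groupby


def average_ranks(values):
    """0-based ranks with ties replaced by their average, in O(n log n)."""
    # Step 1: sorted (value, original index) pairs stand in for the multimap.
    entries = sorted((v, i) for i, v in enumerate(values))
    ranks = [0.0] * len(values)
    position = 0
    # Step 3: each run of equal keys corresponds to one equal_range call.
    for _, run in groupby(entries, key=lambda entry: entry[0]):
        indices = [i for _, i in run]
        # Step 2 would number this run position .. position + len(indices) - 1;
        # the mean of those consecutive numbers is the shared rank.
        mean_position = position + (len(indices) - 1) / 2
        for i in indices:
            ranks[i] = mean_position
        position += len(indices)
    return ranks


print(average_ranks([3.0, 1.0, 3.0, 2.0]))  # [2.5, 0.0, 2.5, 1.0]
```

The sort dominates the cost; the grouping pass touches each element once, which matches the stated complexity.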

## replace subset of vector values with subset average

This is my attempt. I first calculate the average rank, then split the subjects of the same rank into rows.

```r
library(tidyverse)
options(stringsAsFactors = FALSE)

subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")
rank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(rank, subj)

df %>%
  group_by(subj) %>%
  summarise(rank = mean(rank)) %>%
  rowwise() %>%
  do(tibble(subj = unlist(strsplit(.$subj, ",")), rank = .$rank)) %>%
  ungroup()
```

Output:

```
# A tibble: 9 × 2
   subj  rank
* <chr> <dbl>
1     A   1.0
2     B   2.0
3     C   4.0
4     D   4.0
5     E   4.0
6     F   6.0
7     G   7.5
8     H   7.5
9     I   9.0
```

Another approach:

```r
m <- aggregate(rank ~ subj, data = df, mean)
m <- apply(m, 1, function(x) data.frame(subj = unlist(strsplit(x[1], ",")), rank = x[2]))
m <- do.call(rbind, m)
rownames(m) <- NULL
m
```

Output:

```
  subj rank
1    A  1.0
2    B  2.0
3    C  4.0
4    D  4.0
5    E  4.0
6    F  6.0
7    G  7.5
8    H  7.5
9    I  9.0
```
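For readers working outside R, the same two-step idea (average the ranks per group, then split each group into one row per subject) might look like this in Python. The function name `split_mean_ranks` is an assumption made for this sketch; the data are the `subj` and `rank` vectors from above.

```python
from collections import defaultdict

subj = ["A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I"]
rank = [1, 2, 3, 4, 5, 6, 7, 8, 9]


def split_mean_ranks(subjects, ranks):
    """Average the ranks per subject group, then emit one row per subject."""
    grouped = defaultdict(list)
    for group, r in zip(subjects, ranks):
        grouped[group].append(r)
    rows = []
    for group, group_ranks in grouped.items():  # dicts keep insertion order
        mean_rank = sum(group_ranks) / len(group_ranks)
        rows.extend((name, mean_rank) for name in group.split(","))
    return rows


for name, mean_rank in split_mean_ranks(subj, rank):
    print(name, mean_rank)
```

This prints one line per subject, with C, D and E sharing 4.0 and G and H sharing 7.5, matching the R output.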