Rank a Vector Based on Order and Replace Ties with Their Average

Replacing tied ranks with their average

We can just use rank from base R; its default ties.method is "average". Negating the values ranks them from highest to lowest:

x$freq.Freq <- rank(-x$freq.Freq)
x$freq.Freq
#[1] 1.0 2.5 2.5 4.0 6.0 6.0 6.0 8.0 9.0

How to get ranks with no gaps when there are ties among values?

I can think of a quick function to do this. It's not optimal with a for loop but it works:)

x=c(1,1,2,3,4,5,8,8)

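foo <- function(x) {
  su <- sort(unique(x))
  # Write ranks into a separate vector: overwriting x in place can turn a
  # value into a rank that a later iteration matches again, e.g. foo(c(0, 1))
  r <- integer(length(x))
  for (i in seq_along(su)) r[x == su[i]] <- i
  r
}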

foo(x)

[1] 1 1 2 3 4 5 6 6
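
The same gap-free ("dense") ranking can be sketched in Python without an explicit loop over the distinct values. This is only an illustration of the idea; the function name dense_rank is a label chosen here, not something from the original answer.

```python
def dense_rank(values):
    """Return 1-based ranks with no gaps; equal values share a rank."""
    # Map each distinct value to its position among the sorted distinct values.
    position = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    # Look up every element in its original order.
    return [position[v] for v in values]


x = [1, 1, 2, 3, 4, 5, 8, 8]
print(dense_rank(x))  # [1, 1, 2, 3, 4, 5, 6, 6]
```

Because the ranks are looked up in a separate mapping rather than written back into the input, values that happen to equal a rank number cannot be matched twice.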

rank and order in R

set.seed(1)
x <- sample(1:50, 30)
x
# [1] 14 19 28 43 10 41 42 29 27 3 9 7 44 15 48 18 25 33 13 34 47 39 49 4 30 46 1 40 20 8
rank(x)
# [1] 9 12 16 25 7 23 24 17 15 2 6 4 26 10 29 11 14 19 8 20 28 21 30 3 18 27 1 22 13 5
order(x)
# [1] 27 10 24 12 30 11 5 19 1 14 16 2 29 17 9 3 8 25 18 20 22 28 6 7 4 13 26 21 15 23

rank returns a vector with the "rank" of each value: the number in the first position (14) is the 9th lowest. order returns the indices that would put the initial vector x in order.

The 27th value of x is the lowest, so 27 is the first element of order(x) - and if you look at rank(x), the 27th element is 1.

x[order(x)]
# [1] 1 3 4 7 8 9 10 13 14 15 18 19 20 25 27 28 29 30 33 34 39 40 41 42 43 44 46 47 48 49
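
To make the relationship concrete, here is a small Python sketch of the two functions using the same vector. The helper names order and rank simply mirror the R functions; they use 1-based indices to match R, and this rank assumes all values are distinct.

```python
def order(values):
    """1-based indices that would sort `values` (like R's order)."""
    return [i + 1 for i in sorted(range(len(values)), key=values.__getitem__)]


def rank(values):
    """1-based rank of each element, assuming all values are distinct."""
    ranks = [0] * len(values)
    # The element at original position `idx` is the `pos`-th smallest.
    for pos, idx in enumerate(order(values), start=1):
        ranks[idx - 1] = pos
    return ranks


x = [14, 19, 28, 43, 10, 41, 42, 29, 27, 3, 9, 7, 44, 15, 48,
     18, 25, 33, 13, 34, 47, 39, 49, 4, 30, 46, 1, 40, 20, 8]

print(order(x)[0])   # 27: the 27th value is the lowest
print(rank(x)[26])   # 1: the 27th value has rank 1
print(rank(x)[0])    # 9: the first value is the 9th lowest
print([x[i - 1] for i in order(x)] == sorted(x))  # True
print(order(order(x)) == rank(x))                 # True
```

In other words, rank is the inverse permutation of order, which is why applying order twice recovers the ranks when there are no ties.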

Efficient method to calculate the rank vector of a list in Python

Using scipy, the function you are looking for is scipy.stats.rankdata:

In [13]: import scipy.stats as ss
In [19]: ss.rankdata([3, 1, 4, 15, 92])
Out[19]: array([ 2., 1., 3., 4., 5.])

In [20]: ss.rankdata([1, 2, 3, 3, 3, 4, 5])
Out[20]: array([ 1., 2., 4., 4., 4., 6., 7.])

The ranks start at 1, rather than 0 (as in your example), but then again, that's the way R's rank function works as well.

Here is a pure-python equivalent of scipy's rankdata function:

def rank_simple(vector):
    # Indices that would sort the vector (0-based)
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a):
    n = len(a)
    ivec = rank_simple(a)
    svec = [a[rank] for rank in ivec]
    sumranks = 0
    dupcount = 0
    newarray = [0] * n
    for i in range(n):  # range replaces Python 2's xrange
        sumranks += i
        dupcount += 1
        # End of a run of equal values: give every member the average rank
        if i == n - 1 or svec[i] != svec[i + 1]:
            averank = sumranks / float(dupcount) + 1
            for j in range(i - dupcount + 1, i + 1):
                newarray[ivec[j]] = averank
            sumranks = 0
            dupcount = 0
    return newarray

print(rankdata([3, 1, 4, 15, 92]))
# [2.0, 1.0, 3.0, 4.0, 5.0]
print(rankdata([1, 2, 3, 3, 3, 4, 5]))
# [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]

Create a mean rank for a rank-frequency data.frame in R

Sure, just group by frequency:

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

dt <- data.frame(frequency = c(64, 58, 54, 32, 29, 29, 25, 17, 17, 15, 12, 12, 10))
dt %>%
  arrange(desc(frequency)) %>%
  mutate(rank = row_number()) %>%
  group_by(frequency) %>%
  mutate(mean_rank = mean(rank)) %>%
  ungroup()
#> # A tibble: 13 × 3
#> frequency rank mean_rank
#> <dbl> <int> <dbl>
#> 1 64 1 1
#> 2 58 2 2
#> 3 54 3 3
#> 4 32 4 4
#> 5 29 5 5.5
#> 6 29 6 5.5
#> 7 25 7 7
#> 8 17 8 8.5
#> 9 17 9 8.5
#> 10 15 10 10
#> 11 12 11 11.5
#> 12 12 12 11.5
#> 13 10 13 13
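
For comparison, the same group-then-average idea can be sketched in plain Python. The function name mean_rank_desc is made up for this illustration; the frequencies are the ones from the data frame above.

```python
from collections import defaultdict


def mean_rank_desc(frequencies):
    """Rank frequencies from highest to lowest, averaging ranks over ties."""
    # Sort positions by descending frequency and number them 1, 2, 3, ...
    ordered = sorted(range(len(frequencies)), key=lambda i: -frequencies[i])
    ranks_by_freq = defaultdict(list)
    for row_number, i in enumerate(ordered, start=1):
        ranks_by_freq[frequencies[i]].append(row_number)
    # Every element gets the mean row number of its frequency group.
    return [sum(ranks_by_freq[f]) / len(ranks_by_freq[f]) for f in frequencies]


freq = [64, 58, 54, 32, 29, 29, 25, 17, 17, 15, 12, 12, 10]
print(mean_rank_desc(freq))
# [1.0, 2.0, 3.0, 4.0, 5.5, 5.5, 7.0, 8.5, 8.5, 10.0, 11.5, 11.5, 13.0]
```

The result is returned in the order of the input, so it does not depend on the input already being sorted.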

R: Rank-function with two variables and ties.method random

Since order(order(x)) gives the same result as rank(x) when there are no ties (see Why does order(order(x)) equal rank(x) in R?), you could add a random tie-breaker as the last sort key and just do

order(order(y, z, runif(length(y))))

to get the rank values.
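
As a rough Python sketch of the same trick, a random number per element can be appended as the last sort key. The function name rank_two_keys_random and the optional rng argument are choices made for this example; the y and z values are the ones that appear in the example output in this answer.

```python
import random


def rank_two_keys_random(y, z, rng=random):
    """1-based ranks by (y, z); ties on both keys are broken at random."""
    n = len(y)
    # One random number per element plays the role of runif(length(y)).
    tiebreak = [rng.random() for _ in range(n)]
    ordered = sorted(range(n), key=lambda i: (y[i], z[i], tiebreak[i]))
    ranks = [0] * n
    for position, i in enumerate(ordered, start=1):
        ranks[i] = position
    return ranks


y = [1, 4, 5, 5, 2, 8, 8, 1, 3, 3]
z = [0.2, 0.8, 0.5, 0.4, 0.2, 0.1, 0.1, 0.7, 0.3, 0.3]
print(rank_two_keys_random(y, z, random.Random(0)))
```

Rows that are unique in (y, z) always get the same rank; only the tied pairs (rows 6 and 7, rows 9 and 10) can swap between runs.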


Here's a more involved approach that allows you to use methods from ties.method. It requires dplyr:

library(dplyr)
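rank2 <- function(df, key1, key2, ties.method) {
  average <- function(x) mean(x)
  # Permute by index: sample(x, length(x)) would draw from 1:x when x has length 1
  random <- function(x) x[sample.int(length(x))]
  df$r <- order(order(df[[key1]], df[[key2]]))
  # group_by_() is deprecated; across(all_of()) groups by column names given as strings
  group_by(df, across(all_of(c(key1, key2)))) %>% mutate(rr = get(ties.method)(r))
}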

rank2(df, "y", "z", "average")
# Source: local data frame [10 x 5]
# Groups: y, z [8]
# x y z r rr
# <dbl> <dbl> <dbl> <int> <dbl>
# 1 1 1 0.2 1 1.0
# 2 2 4 0.8 6 6.0
# 3 3 5 0.5 8 8.0
# 4 4 5 0.4 7 7.0
# 5 5 2 0.2 3 3.0
# 6 6 8 0.1 9 9.5
# 7 7 8 0.1 10 9.5
# 8 8 1 0.7 2 2.0
# 9 9 3 0.3 4 4.5
# 10 10 3 0.3 5 4.5

Create ranking for vector of double

One way to do so would be using a multimap.

  • Place the items in a multimap mapping your objects to size_ts (the initial values are unimportant). You can do this with one line (use the ctor that takes iterators).

  • Loop (either plainly or using whatever from algorithm) and assign 0, 1, ... as the values.

  • Loop over the distinct keys. For each distinct key, call equal_range for the key, and set its values to the average (again, you can use stuff from algorithm for this).

The overall complexity should be Theta(n log(n)), where n is the length of the vector.
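
The steps above can be sketched in Python, where sorting (value, index) pairs stands in for the multimap and each run of equal values stands in for one equal_range call. This is an illustration of the idea rather than a translation of any particular C++ code; the name average_ranks is chosen here, and the ranks are zero-based as in the description.

```python
from itertools import groupby


def average_ranks(values):
    """Zero-based ranks; tied values receive the average of their positions."""
    # Sorting (value, original index) pairs stands in for filling the multimap.
    pairs = sorted((v, i) for i, v in enumerate(values))
    ranks = [0.0] * len(values)
    position = 0
    # Each run of equal values corresponds to one equal_range call.
    for _, run in groupby(pairs, key=lambda p: p[0]):
        indices = [i for _, i in run]
        # Positions position .. position + k - 1 average to position + (k - 1) / 2.
        mean = position + (len(indices) - 1) / 2
        for i in indices:
            ranks[i] = mean
        position += len(indices)
    return ranks


print(average_ranks([3.5, 1.0, 3.5, 2.0]))  # [2.5, 0.0, 2.5, 1.0]
```

The sort dominates the cost, so this sketch also runs in O(n log n) time.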

replace subset of vector values with subset average

This is my attempt. I first calculate the average rank, then split the subjects of the same rank into rows.

library(tidyverse)
options(stringsAsFactors = FALSE)
subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")
rank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(rank, subj)

df %>%
  group_by(subj) %>%
  summarise(rank = mean(rank)) %>%
  rowwise() %>%
  do(tibble(subj = unlist(strsplit(.$subj, ",")), rank = .$rank)) %>%
  ungroup()

Output:

# A tibble: 9 × 2
subj rank
* <chr> <dbl>
1 A 1.0
2 B 2.0
3 C 4.0
4 D 4.0
5 E 4.0
6 F 6.0
7 G 7.5
8 H 7.5
9 I 9.0

Another approach:

m <- aggregate(rank~subj, data=df, mean)
m <- apply(m, 1, function(x) data.frame(subj = unlist(strsplit(x[1], ",")), rank = x[2]))
m <- do.call(rbind, m)
rownames(m) <- NULL
m

Output:

subj rank
1 A 1.0
2 B 2.0
3 C 4.0
4 D 4.0
5 E 4.0
6 F 6.0
7 G 7.5
8 H 7.5
9 I 9.0
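
The same two steps (average the ranks within each tied group, then split the group into one row per subject) can be sketched in plain Python. The function name average_rank_per_subject is invented for this example; the data are the subj and rank vectors from above.

```python
from collections import defaultdict


def average_rank_per_subject(ranks, subjects):
    """Average the ranks of each tied group, then list one row per subject."""
    grouped = defaultdict(list)
    for r, s in zip(ranks, subjects):
        grouped[s].append(r)
    rows = []
    # Dicts keep insertion order, so groups come out in first-seen order.
    for group, group_ranks in grouped.items():
        mean = sum(group_ranks) / len(group_ranks)
        for name in group.split(","):
            rows.append((name, mean))
    return rows


subj = ["A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I"]
ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for name, mean in average_rank_per_subject(ranks, subj):
    print(name, mean)
```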

