Fastest Way to Find Second (Third...) Highest/Lowest Value in Vector or Column

Rfast has a function called nth (an nth-element selection) that does exactly what you ask.

Furthermore, the methods discussed above that are based on partial sort do not support finding the k smallest values.

Update (28/FEB/21): the kit package offers a faster implementation (topn); see https://stackoverflow.com/a/66367996/4729755 and https://stackoverflow.com/a/53146559/4729755.

Disclaimer: An issue appears to occur when dealing with integers. It can be bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)) and will be addressed in the next update of Rfast.

Rfast::nth(x, 5, descending = TRUE)

Will return the 5th largest element of x, while

Rfast::nth(x, 5, descending = FALSE)

Will return the 5th smallest element of x.

Benchmarks against the most popular answers follow.

For 10 thousand numbers:

N = 10000
x = rnorm(N)

maxN <- function(x, N = 2) {
  len <- length(x)
  if (N > len) {
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = TRUE),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = TRUE)[5]])

Unit: microseconds
 expr      min       lq      mean   median        uq       max neval
Rfast  160.364  179.607  202.8024  194.575  210.1830   351.517   100
 maxN  396.419  423.360  559.2707  446.452  487.0775  4949.452   100
order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148   100

For 1 million numbers:

N = 1e6
x = rnorm(N)

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = TRUE),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = TRUE)[5]])

Unit: milliseconds
 expr      min        lq      mean   median        uq       max neval
Rfast  89.7722  93.63674  114.9893 104.6325  120.5767  204.8839   100
 maxN 150.2822 207.03922  235.3037 241.7604  259.7476  336.7051   100
order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129   100
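
The benchmarked expressions are only comparable if they return the same value. The following is a minimal base-R sketch of that sanity check, not part of the original answer. It assumes a small vector and a fixed seed, and leaves Rfast::nth out so that it runs without extra packages.

```r
# maxN as defined in the answer above
maxN <- function(x, N = 2) {
  len <- length(x)
  if (N > len) {
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}

set.seed(42)        # assumption: fixed seed, only for reproducibility
x <- rnorm(1000)    # assumption: smaller vector than in the benchmarks

by_partial <- maxN(x, 5)                        # partial sort
by_order   <- x[order(x, decreasing = TRUE)[5]] # full ordering, then index
by_sort    <- sort(x, decreasing = TRUE)[5]     # full sort, then index

# All three approaches should agree on the 5th largest value
stopifnot(by_partial == by_order, by_order == by_sort)
```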

Fastest way to find *the index* of the second (third...) highest/lowest value in vector or column

The Rfast library implements an nth-element function (nth) with an option to return the index.

UPDATE (28/FEB/21): the kit package offers a faster implementation (topn), as shown in the simulations below.

x <- runif(1e+6)

n <- 2

which_nth_highest_richie <- function(x, n)
{
  for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
  which(x == max(x))
}

which_nth_highest_joris <- function(x, n)
{
  ux <- unique(x)
  nux <- length(ux)
  which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}

microbenchmark::microbenchmark(
  topn = kit::topn(x, n, decreasing = TRUE)[n],
  Rfast = Rfast::nth(x, n, descending = TRUE, index.return = TRUE),
  order = order(x, decreasing = TRUE)[n],
  richie = which_nth_highest_richie(x, n),
  joris = which_nth_highest_joris(x, n))

Unit: milliseconds
  expr      min       lq      mean    median       uq      max neval
  topn 3.741101   3.7917  4.517201  4.060752 5.108901 7.403901   100
 Rfast  15.8121  16.7586  20.64204  17.73010  20.7083  47.6832   100
 order 110.5416 113.4774 120.45807 116.84005 121.2291 164.5618   100
richie  22.7846  24.1552  39.35303  27.10075  42.0132  179.289   100
 joris 131.7838 140.4611 158.20704 156.61610 165.1735 243.9258   100

topn is the clear winner at finding the index of the 2nd largest value among 1 million numbers.

Further simulations were run to estimate the running time of finding the nth largest number for varying n.
The variable x was repopulated for each n, but its size was always 1 million numbers.

Running time of finding the index of the nth largest element among 1 million numbers.

As shown, topn is the best option for finding the nth largest element and its index, provided that n is not too big. In the plot we can observe that topn becomes slower than Rfast's nth for larger n.
It is worth noting that topn has not been implemented for n > 1000 and will throw an error in such cases.
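
The varying-n simulation described above could be set up roughly as follows. This is only a sketch under stated assumptions, not the original simulation code: it times base R's order instead of topn or Rfast::nth so that it runs without extra packages, and the n values and the reduced vector size are illustrative choices.

```r
# Sketch of a varying-n timing loop, base R only.
# Assumptions: illustrative n values, 1e5 elements instead of 1e6.
set.seed(1)
ns   <- c(2, 10, 100)
size <- 1e5

timings <- vapply(ns, function(n) {
  x <- runif(size)                          # x is repopulated for each n
  system.time(
    idx <- order(x, decreasing = TRUE)[n]   # index of the n-th largest value
  )[["elapsed"]]
}, numeric(1))

names(timings) <- ns
print(timings)                              # elapsed seconds per n
```

To compare packages, the timed expression could be swapped for the kit::topn or Rfast::nth calls from the benchmark above.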

R subtract min from max, second lowest from second highest etc

Following @r2evans' sage advice and putting it into the form of a function:

maxtomin <- function(vec, N = NULL) {
  if(is.null(N)) {N <- floor(length(vec)/2)}
  vec <- sort(vec)
  vecfin <- rev(vec) - vec
  return(head(vecfin, N))
}
maxtomin(mtcars$mpg)
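
To make the pairing easier to follow, here is a small worked example of the same function. The input vector is a hypothetical one chosen for illustration.

```r
# maxtomin as defined above
maxtomin <- function(vec, N = NULL) {
  if (is.null(N)) {N <- floor(length(vec) / 2)}
  vec <- sort(vec)
  vecfin <- rev(vec) - vec   # largest - smallest, 2nd largest - 2nd smallest, ...
  return(head(vecfin, N))
}

v <- c(5, 1, 9, 3)           # hypothetical input; sorted: 1 3 5 9
res <- maxtomin(v)           # pairs: 9 - 1 and 5 - 3
print(res)
```

With the default N, only the first half of the differences is returned, so each pair is counted once.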

How to get the second smallest/largest element in a list

You can use sort, and then an index, to find the n-th smallest element:

sort(test)[n]

For the second smallest element, use n=2:

sort(test)[2]
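
A short illustration of the same idea, assuming test is a plain numeric vector (the values are hypothetical). Passing decreasing = TRUE gives the n-th largest element instead.

```r
test <- c(4, 8, 1, 6)                                # hypothetical data

second_smallest <- sort(test)[2]                     # sorted ascending: 1 4 6 8
second_largest  <- sort(test, decreasing = TRUE)[2]  # sorted descending: 8 6 4 1

print(c(second_smallest, second_largest))
```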

apply which.max to second, third, etc. highest value

Try using order

> order(x, decreasing =TRUE)
[1] 6 5 2 1 3 4
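
The original x is not shown. As a sketch, the hypothetical vector below was chosen because it reproduces the ordering printed above. The k-th entry of that ordering plays the role of which.max for the k-th highest value.

```r
x <- c(3, 4, 2, 1, 5, 6)            # hypothetical vector, not the original data

ord <- order(x, decreasing = TRUE)  # positions of x from highest to lowest value
print(ord)

second_idx <- ord[2]                # position of the second highest value
third_idx  <- ord[3]                # position of the third highest value
print(x[c(second_idx, third_idx)])
```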

Find 10% highest and lowest values, trim all columns in R

Not tested, but something like this should work:

library(data.table)
#set to data.table object
dt <- as.data.table(data)

#select only rows between 0.1 and 0.9 quantiles
dt <- dt[res >= quantile(res, 0.1) & res <= quantile(res, 0.9)]
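
The same quantile filter can be tried in base R on toy data. This is a minimal sketch: the data frame below is hypothetical, and only the column name res is taken from the answer above.

```r
# Hypothetical toy data with a numeric column named res
data <- data.frame(id = 1:20, res = as.numeric(1:20))

lo <- quantile(data$res, 0.1)   # 10% quantile
hi <- quantile(data$res, 0.9)   # 90% quantile

# Keep only rows whose res lies between the two quantiles
trimmed <- data[data$res >= lo & data$res <= hi, ]
print(range(trimmed$res))
```

To trim on several columns, the same condition would have to be applied to each of them.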

How to get the highest value in a column depending on three other columns?

You could do:

library(dplyr)

df %>%
  group_by(hhid) %>%
  mutate(highest_education = max(education[relation %in% c(1, 2)])) %>%
  ungroup()
#> # A tibble: 10 × 5
#>     hhid    id relation education highest_education
#>    <dbl> <dbl>    <dbl>     <dbl>             <dbl>
#>  1     1     1        1         3                 3
#>  2     1     2        2         2                 3
#>  3     1     3        3         5                 3
#>  4     2     4        1         4                 4
#>  5     2     5        2         2                 4
#>  6     3     6        1         1                 2
#>  7     3     7        2         2                 2
#>  8     4     8        1         1                 3
#>  9     4     9        2         3                 3
#> 10     4    10        3         4                 3
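
For readers without the input data, here is a base-R cross-check of the same logic. It is a sketch that assumes df can be rebuilt from the first four columns of the printed output above. The approach (tapply plus a lookup) is a swapped-in alternative, not the dplyr method from the answer.

```r
# df rebuilt from the printed output above (assumption)
df <- data.frame(
  hhid      = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
  id        = 1:10,
  relation  = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  education = c(3, 2, 5, 4, 2, 1, 2, 1, 3, 4)
)

# Ignore members whose relation is not 1 or 2 by mapping them to -Inf,
# then take the maximum education per household
eligible <- ifelse(df$relation %in% c(1, 2), df$education, -Inf)
hh_max   <- tapply(eligible, df$hhid, max)

# Broadcast the per-household maximum back to every row
df$highest_education <- as.numeric(hh_max[as.character(df$hhid)])
print(df)
```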

second (or third) maximum value of a dataframe column using 'summarise'

We can use an arrange/slice/spread approach to get this:

library(dplyr)
library(tidyr)
df1 %>%
  group_by(group1) %>%
  arrange(desc(value)) %>%
  slice(seq_len(3)) %>%
  mutate(Max = paste0("max_", row_number())) %>%
  select(-group2) %>%
  spread(Max, value)
# A tibble: 2 x 4
# Groups: group1 [2]
#   group1 max_1 max_2 max_3
# * <fctr> <dbl> <dbl> <dbl>
# 1 a       0.84  0.69  0.41
# 2 b       0.89  0.72  0.54

data

df1 <- data.frame(group1,group2,value)
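
The vectors behind df1 are not shown, so the sketch below uses hypothetical data to illustrate the same top-3-per-group idea in base R. The top three values per group are taken from the printed output above; the fourth value in each group is invented for illustration, and group2 is omitted.

```r
# Hypothetical data: top-3 values match the printed output, the rest is made up
df1 <- data.frame(
  group1 = rep(c("a", "b"), each = 4),
  value  = c(0.41, 0.84, 0.10, 0.69, 0.54, 0.20, 0.89, 0.72)
)

# For each group, sort descending and keep the three largest values
top3 <- t(sapply(split(df1$value, df1$group1),
                 function(v) sort(v, decreasing = TRUE)[1:3]))
colnames(top3) <- paste0("max_", 1:3)
print(top3)
```

The result is a matrix with one row per group, which plays the role of the wide table produced by spread.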

