Extract Row Corresponding to Minimum Value of a Variable by Group

Slightly more elegant:

library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

State Company Employees
1: AK D 24
2: RI E 19

Slightly less elegant than using .SD, but a bit faster (for data with many groups):

DT[DT[ , .I[which.min(Employees)], by = State]$V1]

If your data set has several identical minimum values within a group and you'd like to subset all of them, replace the expression which.min(Employees) with Employees == min(Employees).

See also Subset rows corresponding to max value by group using data.table.

Only keep the minimum value of each group

With .SD:

dataz[,.SD[value==min(value)],by=.(group)]
group value
<char> <num>
1: ZAS 0.39590814
2: Car 0.42591138
3: EEE 0.07049145
4: EEff 0.34670793
5: 2133 0.05702904
6: EETTE 0.31071582

Select rows with min value by group

Building on DWin's solution, tapply can be avoided by using ave.

df[ df$v1 == ave(df$v1, df$f, FUN=min), ]

This gives another speed-up, as the timings below show, although the gain also depends on the number of levels. I mention it because ave is far too often forgotten about, even though it is one of the more powerful functions in R.

f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)

> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
user system elapsed
0.05 0.00 0.05

> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
user system elapsed
0.25 0.03 0.29

> system.time(lapply(split(df, df$f), FUN = function(x) {
+ vec <- which(x[3] == min(x[3]))
+ return(x[vec, ])
+ })
+ .... [TRUNCATED]
user system elapsed
0.56 0.00 0.58

> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
user system elapsed
0.17 0.00 0.19

> system.time( ddply(df, .var = "f", .fun = function(x) {
+ return(subset(x, v1 %in% min(v1)))
+ }
+ )
+ )
user system elapsed
0.28 0.00 0.28

Extract row corresponding to maximum value by group for multiple variables

max and which.max are two different functions: max returns the largest value in a vector, whereas which.max returns the position of that value in the vector.

x <- 4:1

max(x)
#[1] 4
which.max(x)
#[1] 1

Here which.max returns 1 because 4 is present at the 1st position in the vector x.

So if you need max values in multiple columns, you should use max and not which.max.

library(data.table)
setDT(dt)
variables = colnames(dt[, 2:10])

dt[, lapply(.SD, max), by = ID, .SDcols = variables]

# ID a b c d e f g h i
# 1: 1 1 1 1 1 1 1 1 1 1
# 2: 2 1 1 1 0 0 1 1 0 1
# 3: 3 1 1 1 0 1 1 1 1 1
# 4: 4 1 1 1 0 0 1 1 0 0
# 5: 5 1 1 1 1 1 1 1 0 0
# 6: 6 1 1 1 1 1 1 1 0 1
# 7: 7 1 1 1 1 1 0 1 0 0
# 8: 8 1 1 1 1 0 1 1 1 1
# 9: 9 1 1 1 0 1 1 1 0 0
#10: 10 1 1 1 1 1 1 1 1 1

Pandas GroupBy and select rows with the minimum value in a specific column

I feel like you're overthinking this. Just use groupby and idxmin:

df.loc[df.groupby('A').B.idxmin()]

A B C
2 1 2 10
4 2 4 4

df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)

A B C
0 1 2 10
1 2 4 4
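
One caveat: idxmin returns only the first index label per group, so tied minimum rows are dropped. A minimal sketch of both behaviours follows. The frame is hypothetical (it is not the data from the question, only built to reproduce the two rows shown above plus one tie), and `first_min` / `all_min` are names made up for illustration. Comparing against `transform("min")` is a different technique from the idxmin one-liner, shown here as one possible way to keep ties.

```python
import pandas as pd

# Hypothetical data: group 2 has two rows tied at the minimum B (index 4 and 6).
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2],
    "B": [4, 5, 2, 7, 4, 6, 4],
    "C": [3, 4, 10, 2, 4, 6, 9],
})

# One row per group: idxmin gives the index label of the first minimum of B.
first_min = df.loc[df.groupby("A")["B"].idxmin()]

# All tied rows: transform("min") broadcasts each group's minimum back to
# every row, so the comparison keeps every row that equals its group minimum.
all_min = df[df["B"] == df.groupby("A")["B"].transform("min")]

print(first_min)
print(all_min)
```

If you only ever need one row per group, the idxmin version is the simpler choice; the transform version is worth it when ties matter.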

How to extract the row with min or max values?

You can include your which.max call as the first argument to your subsetting call:

df[which.max(df$Temp),]

In case of duplicated value in variable, keep row with lowest value based on other variable

We can use slice_min after grouping by 'ID'

library(dplyr)
df %>%
  group_by(ID) %>%
  slice_min(tti) %>%
  ungroup()

Output:

# A tibble: 3 x 2
# ID tti
# <int> <dbl>
#1 9 2.7
#2 12 1.2
#3 118 1.4

Or with collapse

library(collapse)
df %>%
  fgroup_by(ID) %>%
  fsummarise(tti = fmin(tti))

# ID tti
#1 9 2.7
#2 12 1.2
#3 118 1.4

Or another option is roworder (which is faster than arrange from dplyr) with funique

roworder(df, ID, tti) %>%
  funique(cols = 1)
# ID tti
#1 9 2.7
#2 12 1.2
#3 118 1.4
