## Extract row corresponding to minimum value of a variable by group

Slightly more elegant:

```r
library(data.table)

DT[ , .SD[which.min(Employees)], by = State]
#    State Company Employees
# 1:    AK       D        24
# 2:    RI       E        19
```

Slightly less elegant than using `.SD`, but a bit faster (for data with many groups):

`DT[DT[ , .I[which.min(Employees)], by = State]$V1]`

Also, just replace the expression `which.min(Employees)` with `Employees == min(Employees)` if your data set has multiple identical min values and you'd like to subset all of them.
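To make the difference between the two expressions concrete, here is a minimal sketch on a made-up table in which one group has a tied minimum. The data and the name `DT_ties` are illustrative assumptions, not taken from the original question.

```r
library(data.table)

# Hypothetical data: state "AK" has two companies tied for the minimum.
DT_ties <- data.table(
  State     = c("AK", "AK", "AK", "RI", "RI"),
  Company   = c("A", "B", "C", "D", "E"),
  Employees = c(24, 24, 30, 19, 40)
)

# which.min() returns only the first minimum, so one row per group.
first_min <- DT_ties[, .SD[which.min(Employees)], by = State]

# Comparing against min() keeps every tied row.
all_min <- DT_ties[, .SD[Employees == min(Employees)], by = State]

# The .I variant with the same ties-preserving condition.
all_min_I <- DT_ties[DT_ties[, .I[Employees == min(Employees)], by = State]$V1]

print(first_min)
print(all_min)
```

With this input, `first_min` should contain one row per state, while `all_min` and `all_min_I` should both keep companies A and B for "AK".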

See also Subset rows corresponding to max value by group using data.table.

## Only keep the minimum value of each group

With `.SD`:

```r
dataz[, .SD[value == min(value)], by = .(group)]
#     group      value
#    <char>      <num>
# 1:    ZAS 0.39590814
# 2:    Car 0.42591138
# 3:    EEE 0.07049145
# 4:   EEff 0.34670793
# 5:   2133 0.05702904
# 6:  EETTE 0.31071582
```

## Select rows with min value by group

Building on DWin's solution, `tapply` can be avoided by using `ave`:

`df[ df$v1 == ave(df$v1, df$f, FUN=min), ]`

This gives another speed-up, as shown below. Mind you, this is also dependent on the number of levels. I give this as I notice that `ave` is far too often forgotten about, although it is one of the more powerful functions in R.

```r
f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)
library(plyr)  # provides ddply(), used in the last timing

> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
   user  system elapsed
   0.05    0.00    0.05

> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
   user  system elapsed
   0.25    0.03    0.29

> system.time(lapply(split(df, df$f), FUN = function(x) {
+     vec <- which(x[3] == min(x[3]))
+     return(x[vec, ])
+ })
+ .... [TRUNCATED]
   user  system elapsed
   0.56    0.00    0.58

> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
   user  system elapsed
   0.17    0.00    0.19

> system.time( ddply(df, .var = "f", .fun = function(x) {
+     return(subset(x, v1 %in% min(v1)))
+ }
+ )
+ )
   user  system elapsed
   0.28    0.00    0.28
```
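As a side note on correctness rather than speed, the first two timed expressions are not strictly equivalent. The sketch below uses made-up data (the name `toy` and its values are assumptions for illustration) to show what `ave` returns and how the `==` comparison differs from the `%in%` comparison when one group's minimum also occurs in another group.

```r
# Hypothetical toy data: group "b" contains the value 1, which is the
# minimum of group "a" but not of group "b".
toy <- data.frame(
  f  = c("a", "a", "b", "b", "b"),
  v1 = c(1, 3, 0, 1, 2)
)

# ave() repeats each group's minimum for every row of that group.
group_min <- ave(toy$v1, toy$f, FUN = min)
print(group_min)

# Row-wise comparison against the row's own group minimum.
by_ave <- toy[toy$v1 == group_min, ]

# %in% compares against the pooled set of all group minima instead.
by_in <- toy[toy$v1 %in% tapply(toy$v1, toy$f, min), ]

print(by_ave)
print(by_in)
```

Here `by_in` should pick up an extra row from group "b". With continuous values such as those from `rnorm`, this kind of coincidence is unlikely, so the benchmark results are probably unaffected.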

## Extract row corresponding to maximum value by group for multiple variables

`max` and `which.max` are two different functions doing different things. `max` gives the maximum value in a vector, whereas `which.max` gives the position of the (first) maximum value in the vector.

```r
x <- 4:1
max(x)
#[1] 4
which.max(x)
#[1] 1
```

Here `which.max` returns 1 because 4 is present at the 1st position in the vector `x`.

So if you need `max` values in multiple columns, you should use `max` and not `which.max`.

```r
library(data.table)

setDT(dt)  # convert to data.table by reference

# names of the columns to summarise (columns 2 to 10)
variables = colnames(dt[, 2:10])

# column-wise maximum within each ID
dt[, lapply(.SD, max), by = ID, .SDcols = variables]
#     ID a b c d e f g h i
#  1:  1 1 1 1 1 1 1 1 1 1
#  2:  2 1 1 1 0 0 1 1 0 1
#  3:  3 1 1 1 0 1 1 1 1 1
#  4:  4 1 1 1 0 0 1 1 0 0
#  5:  5 1 1 1 1 1 1 1 0 0
#  6:  6 1 1 1 1 1 1 1 0 1
#  7:  7 1 1 1 1 1 0 1 0 0
#  8:  8 1 1 1 1 0 1 1 1 1
#  9:  9 1 1 1 0 1 1 1 0 0
# 10: 10 1 1 1 1 1 1 1 1 1
```
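Since the `dt` used above is not shown, here is a small self-contained sketch of the same idea on made-up data. The name `dt_small` and its columns are assumptions for illustration. It contrasts the column-wise `max` with the position returned by `which.max`.

```r
library(data.table)

# Hypothetical data: two IDs, two measurement columns.
dt_small <- data.table(
  ID = c(1, 1, 2, 2),
  a  = c(0, 1, 0, 0),
  b  = c(1, 0, 1, 1)
)

# Column-wise maximum per ID; each column is summarised independently,
# so the values in one output row may come from different input rows.
col_max <- dt_small[, lapply(.SD, max), by = ID, .SDcols = c("a", "b")]
print(col_max)

# which.max() gives a position instead: the first row within each group
# where `a` is largest.
pos_max <- dt_small[, .(pos = which.max(a)), by = ID]
print(pos_max)
```

For ID 1 in this toy table, the maximum of `a` and the maximum of `b` sit in different rows, which is why a single `which.max` position cannot stand in for per-column maxima.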

## Pandas GroupBy and select rows with the minimum value in a specific column

I feel like you're overthinking this. Just use `groupby` and `idxmin`:

```python
df.loc[df.groupby('A').B.idxmin()]
#    A  B   C
# 2  1  2  10
# 4  2  4   4
```

To also reset the index of the result:

```python
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
#    A  B   C
# 0  1  2  10
# 1  2  4   4
```

## How to extract the row with min or max values?

You can include your `which.max` call as the first argument to your subsetting call:

`df[which.max(df$Temp),]`
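A minimal sketch of the same pattern for both the maximum and the minimum, using a made-up data frame (the name `temps` and its values are assumptions standing in for the `df` of the question):

```r
# Hypothetical data standing in for the `df` of the question.
temps <- data.frame(
  Day  = 1:5,
  Temp = c(67, 72, 74, 62, 74)
)

# Row with the highest temperature (only the first one if there are ties).
max_row <- temps[which.max(temps$Temp), ]

# Row with the lowest temperature.
min_row <- temps[which.min(temps$Temp), ]

# All rows tied for the maximum.
all_max <- temps[temps$Temp == max(temps$Temp), ]

print(max_row)
print(min_row)
print(all_max)
```

Note that `which.max` and `which.min` return a single position, so a logical comparison against `max` or `min` is needed to keep tied rows.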

## In case of duplicated value in variable, keep row with lowest value based on other variable

We can use `slice_min` after grouping by 'ID':

```r
library(dplyr)

df %>%
  group_by(ID) %>%
  slice_min(tti) %>%
  ungroup()
```

Output:

```r
# A tibble: 3 x 2
#      ID   tti
#   <int> <dbl>
# 1     9   2.7
# 2    12   1.2
# 3   118   1.4
```

Or with `collapse`:

```r
library(collapse)

df %>%
  fgroup_by(ID) %>%
  fsummarise(tti = fmin(tti))
#    ID tti
# 1   9 2.7
# 2  12 1.2
# 3 118 1.4
```

Or another option is `roworder` (which is faster than `arrange` from `dplyr`) with `funique`:

```r
roworder(df, ID, tti) %>%
  funique(cols = 1)
#    ID tti
# 1   9 2.7
# 2  12 1.2
# 3 118 1.4
```
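For comparison, the same sort-then-keep-first idea can be sketched in base R with `order` and `duplicated` instead of `collapse`. The input below is made up (the name `dups` and its values are assumptions, chosen so the result matches the output shown above), since the original `df` is not shown.

```r
# Hypothetical input, chosen so the result matches the output shown above.
dups <- data.frame(
  ID  = c(9L, 9L, 12L, 12L, 118L),
  tti = c(3.5, 2.7, 1.2, 4.0, 1.4)
)

# Sort by ID, then by tti, so the smallest tti comes first within each ID.
sorted <- dups[order(dups$ID, dups$tti), ]

# Keep the first row of each ID, i.e. the one with the lowest tti.
lowest <- sorted[!duplicated(sorted$ID), ]
rownames(lowest) <- NULL

print(lowest)
```

Like the `roworder` plus `funique` approach, this keeps exactly one row per `ID` even when the lowest `tti` is tied.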

