How to Select the Rows With Maximum Values in Each Group With Dplyr

How to select the rows with maximum values in each group with dplyr?

Try this:

result <- df %>%
  group_by(A, B) %>%
  filter(value == max(value)) %>%
  arrange(A, B, C)

Seems to work:

library(plyr)  # ddply() is used here only to verify the dplyr result

identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value), ])
)
#[1] TRUE

As pointed out in the comments, slice may be preferred here, as per @RoyalITS' answer below, if you strictly want only one row per group. The answer above will return multiple rows whenever several rows share the maximum value; a sketch of the slice-based idea is shown below.
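The referenced answer is not reproduced here, but a minimal sketch of the slice-based idea, assuming the same df with grouping columns A, B and a value column, looks like this:

library(dplyr)

df %>%
  group_by(A, B) %>%
  slice(which.max(value)) %>%   # which.max() returns only the first maximum position, so ties yield one row
  ungroup()

On dplyr >= 1.0.0, slice_max(value, n = 1, with_ties = FALSE) is an equivalent, more explicit spelling.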

Select the row with the maximum value in each group based on multiple columns in R dplyr

We can compute the rowwise max of the 'count' columns with pmax, then, grouped by 'col1', filter the rows where that 'Max' value equals the group maximum.

library(dplyr)
df1 %>%
  mutate(Max = pmax(count_col1, count_col2)) %>%
  group_by(col1) %>%
  filter(Max == max(Max)) %>%
  ungroup() %>%
  select(-Max)

Output:

# A tibble: 3 × 4
  col1   col2   count_col1 count_col2
  <chr>  <chr>       <dbl>      <dbl>
1 apple  aple            1          4
2 banana banan           4          1
3 banana bananb          4          1

We may also use slice_max:

library(purrr)
df1 %>%
  group_by(col1) %>%
  slice_max(invoke(pmax, across(starts_with("count")))) %>%
  ungroup()
# A tibble: 3 × 4
  col1   col2   count_col1 count_col2
  <chr>  <chr>       <dbl>      <dbl>
1 apple  aple            1          4
2 banana banan           4          1
3 banana bananb          4          1

Select the row with the maximum value in each group

Here's a data.table solution:

require(data.table) ## 1.9.2
group <- as.data.table(group)

If you want to keep all the entries corresponding to max values of pt within each group:

group[group[, .I[pt == max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2

If you'd like just the first max value of pt:

group[group[, .I[which.max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2

In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.

How to find the maximum value within each group and then recode all other values in the group as zero?

You can try this

df %>%
  group_by(Id) %>%
  # TRUE only at the position of the first maximum; multiplying by value zeroes out the other rows
  mutate(maxByGroup = (which.max(value) == seq_along(value)) * value) %>%
  ungroup()

which gives

      Id value maxByGroup
   <dbl> <dbl>      <dbl>
 1     1   500        500
 2     1   500          0
 3     1   500          0
 4     2   250        250
 5     2   250          0
 6     2   250          0
 7     3   300        300
 8     3   300          0
 9     3   300          0
10     4   400        400
11     4   400          0
12     4   400          0

dplyr: max value in a group, excluding the value in each row?

You could try:

df %>%
  group_by(g) %>%
  arrange(desc(x)) %>%   # after sorting, x[2] within each group is the second-largest value
  mutate(max = ifelse(x == max(x), x[2], max(x)))

Which gives:

#Source: local data frame [6 x 3]
#Groups: g
#
# g x max
#1 A 7 3
#2 A 3 7
#3 B 9 5
#4 B 5 9
#5 B 2 9
#6 C 4 NA

Benchmark

I benchmarked the solutions posted so far:

library(dplyr)
library(data.table)   # for setDT() in the 'arun' expression
library(microbenchmark)

df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                 x = sample(1:10, 10e5, replace = TRUE))

mbm <- microbenchmark(
  steven = df %>%
    group_by(g) %>%
    arrange(desc(x)) %>%
    mutate(max = ifelse(x == max(x), x[2], max(x))),
  eric = df %>%
    group_by(g) %>%
    mutate(x_max = max(x),
           x_max2 = sort(x, decreasing = TRUE)[2],
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
    select(-x_max2),
  arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N - 1L), x[.N - 1L]), by = g],
  times = 50
)

@Arun's data.table solution is the fastest:

# Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval cld
#  steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50   b
#    eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
#    arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50   a


Select max value across group - R/Dplyr solution

You can assign max(value) to the first row of each ID, leaving NA in the remaining rows.

library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(max_value = ifelse(row_number() == 1, max(value), NA_integer_)) -> result

result

#       ID value max_value
#    <int> <int>     <int>
#  1     1    10        10
#  2     1     2        NA
#  3     1     3        NA
#  4     1     4        NA
#  5     2     1         3
#  6     2     2        NA
#  7     2     3        NA
#  8     3     4         6
#  9     3     5        NA
# 10     3     6        NA

Select rows above and below max group value dplyr

I think your intended output is incorrect: ARG's "max" (absolute value!) is on 2020-03-23 (and -24), yet you show four rows before it and too few rows after it.

Try this:

dat %>%
  group_by(Country) %>%
  mutate(most = row_number() == which.max(abs(MobDecline))) %>%
  filter(zoo::rollapply(most, width = 7, FUN = any, fill = FALSE))
# # A tibble: 14 x 4
# # Groups: Country [2]
# Country Date MobDecline most
# <chr> <date> <dbl> <lgl>
# 1 ARG 2020-03-20 -70.3 FALSE
# 2 ARG 2020-03-21 -71.7 FALSE
# 3 ARG 2020-03-22 -75.3 FALSE
# 4 ARG 2020-03-23 -84 TRUE
# 5 ARG 2020-03-24 -84 FALSE
# 6 ARG 2020-03-25 -75.7 FALSE
# 7 ARG 2020-03-26 -76 FALSE
# 8 AUS 2020-03-30 -43.3 FALSE
# 9 AUS 2020-03-31 -45.3 FALSE
# 10 AUS 2020-04-01 -45.7 FALSE
# 11 AUS 2020-04-02 -47.7 TRUE
# 12 AUS 2020-04-03 -45.7 FALSE
# 13 AUS 2020-04-04 -46 FALSE
# 14 AUS 2020-04-05 -47.3 FALSE

(The most column can be removed; it is kept here for demonstration.)

The use of zoo::rollapply is much shorter and more flexible than an approach based on repeated lead and/or lag (which is otherwise one way to do this); a sketch of that alternative follows for comparison.
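A rough sketch of that lead/lag version, assuming the same dat with Country and MobDecline columns as above, shows why it is more verbose:

library(dplyr)

dat %>%
  group_by(Country) %>%
  mutate(most = row_number() == which.max(abs(MobDecline))) %>%
  # keep the flagged row plus the 3 rows before and after it
  filter(most |
           lag(most, 1, default = FALSE)  | lag(most, 2, default = FALSE)  | lag(most, 3, default = FALSE) |
           lead(most, 1, default = FALSE) | lead(most, 2, default = FALSE) | lead(most, 3, default = FALSE))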

Now, this is using which.max(abs(...)), which both assumes the max absolute value (you did say max, after all) and returns at most one entry per group, even when the maximum is tied. If you want the +/- 3 rows around every tied maximum (so one more ARG row is included here), then we can try ==, but exact equality on doubles will at times fail (R FAQ 7.31), so I'll introduce a "tolerance":

tol <- 1e-8  # small tolerance for the floating-point comparison (example value)

dat %>%
  group_by(Country) %>%
  # values are negative, so the largest absolute decline is the group minimum
  mutate(most = MobDecline <= (min(MobDecline) + tol)) %>%
  filter(zoo::rollapply(most, width = 7, FUN = any, fill = FALSE))
# # A tibble: 15 x 4
# # Groups: Country [2]
# Country Date MobDecline most
# <chr> <date> <dbl> <lgl>
# 1 ARG 2020-03-20 -70.3 FALSE
# 2 ARG 2020-03-21 -71.7 FALSE
# 3 ARG 2020-03-22 -75.3 FALSE
# 4 ARG 2020-03-23 -84 TRUE
# 5 ARG 2020-03-24 -84 TRUE
# 6 ARG 2020-03-25 -75.7 FALSE
# 7 ARG 2020-03-26 -76 FALSE
# 8 ARG 2020-03-27 -74.3 FALSE
# 9 AUS 2020-03-30 -43.3 FALSE
# 10 AUS 2020-03-31 -45.3 FALSE
# 11 AUS 2020-04-01 -45.7 FALSE
# 12 AUS 2020-04-02 -47.7 TRUE
# 13 AUS 2020-04-03 -45.7 FALSE
# 14 AUS 2020-04-04 -46 FALSE
# 15 AUS 2020-04-05 -47.3 FALSE

dplyr filter out groups in which the max value (per group) is below the top-3 max-values (per group)

Another possible solution:

library(dplyr)

df %>%
  group_by(id) %>%
  summarise(m = max(volume)) %>%
  slice_max(m, n = 3)

#> # A tibble: 3 × 2
#>      id      m
#>   <dbl>  <dbl>
#> 1     2 0.0788
#> 2     6 0.0284
#> 3     3 0.0233

To get the entire group for each of the 3 max-values:

library(tidyverse)

df %>%
  group_by(id) %>%
  summarise(m = max(volume)) %>%
  slice_max(m, n = 3) %>%
  group_split(id) %>%
  map(~ inner_join(df, .x, by = "id"))

#> [[1]]
#> # A tibble: 10 × 4
#> id year volume m
#> <dbl> <chr> <dbl> <dbl>
#> 1 2 2017 0.0788 0.0788
#> 2 2 2018 0.0788 0.0788
#> 3 2 2019 0.0773 0.0788
#> 4 2 2020 0.0766 0.0788
#> 5 2 2021 0.0755 0.0788
#> 6 2 2022 0.0745 0.0788
#> 7 2 2023 0.0748 0.0788
#> 8 2 2024 0.0741 0.0788
#> 9 2 2025 0.0717 0.0788
#> 10 2 2026 0.0681 0.0788
#>
#> [[2]]
#> # A tibble: 10 × 4
#> id year volume m
#> <dbl> <chr> <dbl> <dbl>
#> 1 3 2017 0.0233 0.0233
#> 2 3 2018 0.0230 0.0233
#> 3 3 2019 0.0224 0.0233
#> 4 3 2020 0.0220 0.0233
#> 5 3 2021 0.0214 0.0233
#> 6 3 2022 0.0209 0.0233
#> 7 3 2023 0.0208 0.0233
#> 8 3 2024 0.0204 0.0233
#> 9 3 2025 0.0193 0.0233
#> 10 3 2026 0.0180 0.0233
#>
#> [[3]]
#> # A tibble: 10 × 4
#> id year volume m
#> <dbl> <chr> <dbl> <dbl>
#> 1 6 2017 0.0284 0.0284
#> 2 6 2018 0.0284 0.0284
#> 3 6 2019 0.0278 0.0284
#> 4 6 2020 0.0275 0.0284
#> 5 6 2021 0.0270 0.0284
#> 6 6 2022 0.0265 0.0284
#> 7 6 2023 0.0266 0.0284
#> 8 6 2024 0.0262 0.0284
#> 9 6 2025 0.0251 0.0284
#> 10 6 2026 0.0234 0.0284

R find min and max for each group based on other row

With tidyverse you can try the following approach. First, pivot your data into long form, targeting the year columns. Then group_by both group and name (which contains the year), keep only the subgroups that contain a value of "x", and retain the rows where condition is 1. Finally, group_by just group and summarise to get the min and max years. Note that you may wish to convert the year values to numeric after removing the "x" rows by filtering on condition; a sketch of that conversion follows the output.

library(tidyverse)

df1 %>%
  pivot_longer(cols = -c(group, condition)) %>%
  group_by(group, name) %>%
  filter(any(value == "x"), condition == 1) %>%
  group_by(group) %>%
  summarise(min = min(value),
            max = max(value))

Output

# A tibble: 3 x 3
  group min   max
  <chr> <chr> <chr>
1 a     2010  2013
2 b     2011  2015
3 c     2010  2014
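If numeric years are needed downstream, a minimal follow-up sketch (assuming the summarised result above is stored in a hypothetical res object) converts the character columns afterwards:

library(dplyr)

res %>%
  mutate(across(c(min, max), as.numeric))   # min/max were summarised from character year values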

data.table: Select row with maximum value by group with several grouping variables

You can compare value with the max value within each A and B group, extract the resulting logical vector, and use it to subset the data.table.

library(data.table)

setDT(mydf)
mydf[mydf[, value == max(value), .(A, B)]$V1, ]
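For readability, an equivalent sketch using .SD (often a bit slower on large data), assuming the same mydf with grouping columns A, B and a value column:

library(data.table)

setDT(mydf)
mydf[, .SD[value == max(value)], by = .(A, B)]   # keeps every row tied for the group maximum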

