In R, How to Add a Max by Group

In R, how do I add a max by group?

Try

# This is how you create your data.frame
group<-c("A","A","A","A","A","B","B","C","C","C")
replicate<-c(1,2,3,4,5,1,2,1,2,3)
x<-data.frame(group,replicate) # here you don't need c()

# Here's my solution
Max <- tapply(x$replicate, x$group,max)
data.frame(x, max.per.group=rep(Max, table(x$group)))
group replicate max.per.group
1 A 1 5
2 A 2 5
3 A 3 5
4 A 4 5
5 A 5 5
6 B 1 2
7 B 2 2
8 C 1 3
9 C 2 3
10 C 3 3

Select the row with the maximum value in each group

Here's a data.table solution:

require(data.table) ## 1.9.2
group <- as.data.table(group)

If you want to keep all the entries corresponding to max values of pt within each group:

group[group[, .I[pt == max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2

If you'd like just the first max value of pt:

group[group[, .I[which.max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2

In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.

Extract the maximum value within each group in a dataframe

There are many possibilities to do this in R. Here are some of them:

df <- read.table(header = TRUE, text = 'Gene   Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')

# aggregate
aggregate(df$Value, by = list(df$Gene), max)
aggregate(Value ~ Gene, data = df, max)

# tapply
tapply(df$Value, df$Gene, max)

# split + lapply
lapply(split(df, df$Gene), function(y) max(y$Value))

# plyr
require(plyr)
ddply(df, .(Gene), summarise, Value = max(Value))

# dplyr
require(dplyr)
df %>% group_by(Gene) %>% summarise(Value = max(Value))

# data.table
require(data.table)
dt <- data.table(df)
dt[ , max(Value), by = Gene]

# doBy
require(doBy)
summaryBy(Value~Gene, data = df, FUN = max)

# sqldf
require(sqldf)
sqldf("select Gene, max(Value) as Value from df group by Gene", drv = 'SQLite')

# ave
df[as.logical(ave(df$Value, df$Gene, FUN = function(x) x == max(x))),]

Using group by to get the value corresponding to the max value of another column

Up front, "never" (okay, almost never) use df$ within a dplyr pipe. In this case, df$value[which.max(df$age)] is referencing the original data each time, not the grouped data. Inside each group in this dataset, value is length 3 whereas df$value is length 9.

The only times I feel it is appropriate to use df$ (referencing the original value of the current dataset) inside a pipe is when it is required to look at pre-pipeline data, in absence of any grouping, reordering, or new variables created outside of the currently-saved (pre-pipeline) version of df.

dplyr

library(dplyr)
df %>%
group_by(groups) %>%
mutate(new_value = value[which.max(age)]) %>%
ungroup()
# # A tibble: 9 x 4
# groups age value new_value
# <dbl> <dbl> <dbl> <dbl>
# 1 1 12 1 3
# 2 1 23 2 3
# 3 1 34 3 3
# 4 2 13 4 6
# 5 2 24 5 6
# 6 2 35 6 6
# 7 3 13 7 9
# 8 3 25 8 9
# 9 3 36 9 9

data.table

library(data.table)
DT <- as.data.table(df)
DT[, new_value := value[which.max(age)], by = .(groups)]

base R

df$new_value <- ave(seq_len(nrow(df)), df$groups,
FUN = function(i) df$value[i][which.max(df$age[i])])
df
# groups age value new_value
# 1 1 12 1 3
# 2 1 23 2 3
# 3 1 34 3 3
# 4 2 13 4 6
# 5 2 24 5 6
# 6 2 35 6 6
# 7 3 13 7 9
# 8 3 25 8 9
# 9 3 36 9 9

The base R approach seems to be the least-elegant-looking solution. I believe that ave is the best approach, but it has many limitations, first being that it only works on one value-object (value) in the absence of others (we need to know age).

R output BOTH maximum and minimum value by group in dataframe

You can use range to get max and min value and use it in summarise to get different rows for each Name.

library(dplyr)

df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")

# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97

If you have large dataset using data.table might be faster.

library(data.table)
setDT(df)[, .(Value = range(Value)), Name]

How to find the maximum value within each group and then recode all other values in the group as zero?

You can try this

df %>%
group_by(Id) %>%
mutate(maxByGroup = (which.max(value) == seq_along(value)) * value) %>%
ungroup()

which gives

      Id value maxByGroup
<dbl> <dbl> <dbl>
1 1 500 500
2 1 500 0
3 1 500 0
4 2 250 250
5 2 250 0
6 2 250 0
7 3 300 300
8 3 300 0
9 3 300 0
10 4 400 400
11 4 400 0
12 4 400 0

Max by Group with Condition for a data.table

Assuming that for each 'group' we need to get the max of 'value' where the 'cond' is "Y", after grouping by 'group', subset the 'value' with the logical condition (cond == 'Y') and get the max value

dt[, max := max(value[cond == 'Y']), by = group]
dt
# group cond value max
#1: a N 2 1
#2: a Y 1 1
#3: a N 3 1
#4: b Y 4 4
#5: b Y 2 4
#6: b N 5 4

Assign max value of group to all rows in that group

An option would be to group by 'Student_ID', mutate the 'Credits' with max of 'Credits' and filter the rows with 'Appraisal.Type' as "Grade supervisor"

library(dplyr)
df1 %>%
group_by(Student_ID) %>%
dplyr::mutate(Credits = max(Credits)) %>%
ungroup %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 2 x 5
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10

If we also need 'Course_code' to be included in the grouping

df2 %>%
group_by(Student_ID, Course_code) %>%
dplyr::mutate(Credits = max(Credits)) %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 3 x 5
# Groups: Student_ID, Course_code [3]
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
#3 Grade supervisor 0 12 ABC1294 23

NOTE: I case, plyr package is also loaded, there can be some masking of functions esp summarise/mutate which is also found in plyr. To prevent it, either do this on a fresh session without loading plyr or explicitly specify dplyr::mutate

data

df1 <- structure(list(Appraisal.Type = c("Final result", "Final result", 
"Grade supervisor", "Grade supervisor"), Resits = c(0L, 0L, 0L,
0L), Credits = c(6.5, 6.5, 0, 0), Course_code = c("ABC1000",
"ABC1003", "ABC1000", "ABC1003"), Student_ID = c(10L, 10L, 10L,
10L)), class = "data.frame", row.names = c(NA, -4L))



df2 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor", "Final result", "Grade supervisor"
), Resits = c(0L, 0L, 0L, 0L, 0L, 0L), Credits = c(6.5, 6.5,
0, 0, 12, 0), Course_code = c("ABC1000", "ABC1003", "ABC1000",
"ABC1003", "ABC1294", "ABC1294"), Student_ID = c(10L, 10L, 10L,
10L, 23L, 23L)), class = "data.frame", row.names = c(NA, -6L))

Get Max value by category as a new column in R

You can just use normal group_by and mutate

library(tidyverse)

df <- structure(list(A = c("a", "a", "a", "a", "a", "b", "b", "b"),
B = c("b", "c", "d", "e", "f", "c", "d", "e"), N = c(1L,
2L, 3L, 4L, 5L, 5L, 4L, 3L), New.Col = c("f", "f", "f", "f",
"f", "c", "c", "c")), class = "data.frame", row.names = c(NA,
-8L))

df %>%
group_by(A) %>%
mutate(new_col = B[which.max(N)]) %>%
ungroup()

Select the row with the maximum value in each group based on multiple columns in R dplyr

We may get rowwise max of the 'count' columns with pmax, grouped by 'col1', filter the rows where the max value of 'Max' column is.

library(dplyr)
df1 %>%
mutate(Max = pmax(count_col1, count_col2) ) %>%
group_by(col1) %>%
filter(Max == max(Max)) %>%
ungroup %>%
select(-Max)

-output

# A tibble: 3 × 4
col1 col2 count_col1 count_col2
<chr> <chr> <dbl> <dbl>
1 apple aple 1 4
2 banana banan 4 1
3 banana bananb 4 1

We may also use slice_max

library(purrr)
df1 %>%
group_by(col1) %>%
slice_max(invoke(pmax, across(starts_with("count")))) %>%
ungroup
# A tibble: 3 × 4
col1 col2 count_col1 count_col2
<chr> <chr> <dbl> <dbl>
1 apple aple 1 4
2 banana banan 4 1
3 banana bananb 4 1


Related Topics



Leave a reply



Submit