Choose the Top Five Values from Each Group in R

Select the top N values by group

# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, pass each one to the `order` function below
# as its own argument: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# work on a copy of the data frame (all columns are kept)
x <- mtcars

# order it based on the group by variable
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the rank of each miles-per-gallon value within each cyl group
if ( find.maximum ){
	# note the negative sign (which changes the order of mpg)
	# *and* the `rev` function, which flips the order of the `tapply` result
	x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
	x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

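# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# ties.method = 'min' gives every tied value the lowest rank in the tie, so ties at the cutoff are all *included*.
# ties.method = 'max' gives them the highest rank in the tie, so here they would all be *excluded*.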
if ( find.maximum ){
	# note the negative sign (which changes the order of mpg)
	# *and* the `rev` function, which flips the order of the `tapply` result
	x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
	x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice that *both* rows with cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
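
The same per-group ranking can also be written more compactly with `ave`, which applies a function within each group and returns the results in the original row order. This is only a sketch of an alternative, not the approach used above; the names `n`, `ranks`, and `lowest` are introduced here for illustration.

```r
# number of values to keep per group
n <- 3

# rank mpg within each cyl group; ties.method = "min" keeps ties at the cutoff
ranks <- ave( mtcars$mpg , mtcars$cyl , FUN = function( v ) rank( v , ties.method = "min" ) )

# keep the n lowest mpg values per cyl (plus any ties at the cutoff)
lowest <- mtcars[ ranks <= n , ]

# sort for easier reading
lowest[ order( lowest$cyl , lowest$mpg ) , c( "mpg" , "cyl" ) ]
```

Because `ave` preserves row order, no `unlist`, `rev`, or pre-sorting step is needed. To find the highest values instead, rank `-v` rather than `v`.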

Choose the top five values from each group in R

Just an alternative to the answers already provided, in case you want to use top_n specifically, which you may prefer because of the way it handles ties.

library(dplyr)
library(hflights)

hflights %>%
  filter(DayOfWeek %in% c(6, 7)) %>%
  mutate(Season = case_when(
    Month %in% 3:5 ~ "Spring",
    Month %in% 9:11 ~ "Autumn",
    Month %in% 6:8 ~ "Summer",
    Month %in% c(12, 1, 2) ~ "Winter")) %>%
  filter(!is.na(Season)) %>%
  # Season comes first so that summarise() leaves the data grouped by Season
  group_by(Season, Dest) %>%
  summarise(flights = n()) %>%
  top_n(5, flights) %>%
  arrange(Season, desc(flights))

Your main issue was with group_by(Dest, Season), as I pointed out in the comments. summarise() removes the last layer of grouping, so it was leaving your data grouped by Dest and not Season.
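
A small sketch of that grouping behaviour, using mtcars rather than hflights so that it runs without extra data. The name `counts` is illustrative, and `.groups = "drop_last"` (available from dplyr 1.0.0) only spells out what summarise() does by default.

```r
library(dplyr)

counts <- mtcars %>%
  group_by(cyl, gear) %>%                        # two grouping layers
  summarise(cars = n(), .groups = "drop_last")   # drops the last layer (gear)

# only the first grouping variable is left
group_vars(counts)
```

Any per-group step that follows, such as top_n(), therefore works within the groups that remain, which is why the order of variables in group_by() matters.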

Your sorting with arrange() was redundant where it was; it should be done after using top_n.

As others have pointed out you should also be using %in% when comparing a value to more than one value rather than ==.
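
To see why `==` is the wrong tool here, the following sketch compares the two operators on a made-up vector (`month` is an illustrative name, not a column from hflights).

```r
month <- c(1, 2, 12, 12, 1, 2)

# `==` recycles the right-hand side and compares position by position
by_position <- month == c(12, 1, 2)

# `%in%` asks, for each element, whether it appears anywhere in the set
by_membership <- month %in% c(12, 1, 2)

by_position
by_membership
```

Every value in `month` is a winter month, yet `==` only reports a match where the value happens to line up with the same position in the recycled comparison vector.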

Getting the top values by group

From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."

d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups: grp [3]
# x grp
# <dbl> <fct>
# 1 0.994 1
# 2 0.957 1
# 3 0.955 1
# 4 0.940 1
# 5 0.900 1
# 6 0.963 2
# 7 0.902 2
# 8 0.895 2
# 9 0.858 2
# 10 0.799 2
# 11 0.985 3
# 12 0.893 3
# 13 0.886 3
# 14 0.815 3
# 15 0.812 3
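
slice_max() keeps ties by default, so it can return more than n rows per group. The sketch below uses a small made-up data frame (`scores` and its columns are illustrative) to show the effect of the `with_ties` argument.

```r
library(dplyr)

scores <- data.frame(
  grp = c("a", "a", "a", "b", "b", "b"),
  x   = c(5, 5, 1, 9, 2, 2)
)

# default: both rows tied for the maximum in group "a" are returned
with_ties <- scores %>% group_by(grp) %>% slice_max(order_by = x, n = 1)

# with_ties = FALSE: exactly one row per group
without_ties <- scores %>% group_by(grp) %>% slice_max(order_by = x, n = 1, with_ties = FALSE)

nrow(with_ties)
nrow(without_ties)
```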

Pre-dplyr 1.0.0 using top_n:

From ?top_n, about the wt argument:

"The variable to use for ordering [...] defaults to the last variable in the tbl".

The last variable in your data set is "grp", which is not the variable you wish to rank, and which is why your top_n attempt "returns the whole of d". Thus, if you wish to rank by "x" in your data set, you need to specify wt = x.

d %>%
  group_by(grp) %>%
  top_n(n = 5, wt = x)
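
A self-contained sketch of the two calls side by side, rebuilding `d` the same way as in the Data section of this answer. The names `implicit` and `explicit` are introduced here for illustration only.

```r
library(dplyr)

set.seed(123)
d <- data.frame(
  x = runif(90),
  grp = gl(3, 30))

# wt omitted: top_n() falls back to the last column, which is grp
implicit <- d %>% group_by(grp) %>% top_n(n = 5)

# wt given: rows are ranked by x within each group
explicit <- d %>% group_by(grp) %>% top_n(n = 5, wt = x)

# compare how many rows each call keeps
nrow(implicit)
nrow(explicit)
```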


Data:

set.seed(123)
d <- data.frame(
x = runif(90),
grp = gl(3, 30))

Selecting top N rows for each group based on value in column

A solution with base R:

# df is split according to y; within each group we order by x (descending),
# keep only the top z rows, and rbind everything back together:
do.call(rbind,
        lapply(split(df, df$y),
               function(df1) df1[order(df1$x, decreasing = TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3

EDIT:

A much more direct way (still in base R) provided in a comment by @mt1022. Note that it relies on df already being sorted by x in decreasing order within each y:

df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3
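
If the rows are not already sorted, the `ave` approach needs an explicit ordering step first. The sketch below assumes a data frame shaped like the one in the output above; the exact values, and the names `sorted` and `top`, are made up for illustration because the original `df` is not shown.

```r
# hypothetical input, deliberately unsorted within each group
df <- data.frame(
  x = c(1, 3, 2, 8, 7, 9, 11, 10),
  y = c("a", "a", "a", "b", "b", "c", "c", "c"),
  z = c(2, 2, 2, 1, 1, 3, 3, 3)
)

# sort by group, then by x in decreasing order
sorted <- df[order(df$y, -df$x), ]

# position of each row within its group, compared with that group's z
top <- sorted[ave(seq_len(nrow(sorted)), sorted$y, FUN = seq_along) <= sorted$z, ]
top
```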

Selecting top N values within a group in a column using R

Or using data.table (mydf from @jazzurro's post). Some options are:

library(data.table)
setDT(mydf)[order(yearmonth, -count), .SD[1:2], by = yearmonth]

Or

setDT(mydf)[mydf[order(yearmonth, -count), .I[1:2], by = yearmonth]$V1, ]

Or

setorder(setkey(setDT(mydf), yearmonth), yearmonth, -count)[
  , .SD[1:2], by = yearmonth]
# yearmonth name count
#1: 201310 Dovas 5
#2: 201310 Indulgd 2
#3: 201311 Dovas 29
#4: 201311 Justina 13
#5: 201312 sUPERED 7
#6: 201312 John Hansen 7
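
One thing to watch with `.SD[1:2]` is a group that has fewer rows than requested: indexing past the end of the group pads the result with NA rows. The sketch below uses a made-up table (`dt`, `grp`, and `val` are illustrative names) to compare it with `head(.SD, 2)`, which simply returns what is there.

```r
library(data.table)

dt <- data.table(grp = c("a", "a", "a", "b"), val = c(3, 9, 5, 4))

# head() returns at most 2 rows per group
safe <- dt[order(grp, -val), head(.SD, 2), by = grp]

# .SD[1:2] always returns 2 rows per group, padding group "b" with NA
padded <- dt[order(grp, -val), .SD[1:2], by = grp]

safe
padded
```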

Selecting top N rows for each group in dataframe

Using the dplyr package in the tidyverse you can do this:

library(tidyverse)

df <- tribble(
~Index, ~Country
, 4.1, "USA"
, 2.1, "USA"
, 5.2, "USA"
, 1.1, "Singapore"
, 6.2, "Singapore"
, 8.1, "Germany"
, 4.5, "Italy"
, 7.1, "Italy"
, 2.3, "Italy"
, 5.9, "Italy"
, 8.8, "Russia"
)

df %>%                  # take the dataframe
  group_by(Country) %>% # group it by the grouping variable
  slice(1:3)            # and pick the first 3 rows of each group, in their current order

Output:

   Index Country
   <dbl> <chr>
 1   8.1 Germany
 2   4.5 Italy
 3   7.1 Italy
 4   2.3 Italy
 5   8.8 Russia
 6   1.1 Singapore
 7   6.2 Singapore
 8   4.1 USA
 9   2.1 USA
10   5.2 USA
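
Note that slice(1:3) takes rows by position, so for Italy it returns 4.5, 7.1 and 2.3 rather than the three largest values. If the goal is the highest Index values per country, one option (assuming dplyr 1.0.0 or later) is slice_max(). The sketch below uses a subset of the data above; `countries` and `top3` are illustrative names.

```r
library(dplyr)

countries <- data.frame(
  Index   = c(4.5, 7.1, 2.3, 5.9, 8.1),
  Country = c("Italy", "Italy", "Italy", "Italy", "Germany")
)

# the 3 largest Index values per country, largest first
top3 <- countries %>%
  group_by(Country) %>%
  slice_max(order_by = Index, n = 3)

top3
```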

How to choose the most common value in a group related to other group in R?

Another dplyr strategy using count and slice:

library(dplyr)
DATA %>%
  group_by(ID) %>%
  count(VAR, CATEGORY) %>%
  slice(which.max(n)) %>%
  select(-n)
#      ID VAR   CATEGORY
#   <dbl> <chr> <chr>
# 1     1 A     ANE
# 2     2 C     BOA
# 3     3 E     CAT
# 4     4 F     DOG
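
Since DATA itself is not shown, here is a reproducible sketch of the same idea on made-up input (the values are hypothetical). It swaps slice(which.max(n)) for slice_max() with `with_ties = FALSE`, which also keeps one row per ID but requires dplyr 1.0.0 or later.

```r
library(dplyr)

# hypothetical input: one row per observation
DATA <- data.frame(
  ID       = c(1, 1, 1, 2, 2),
  VAR      = c("A", "A", "B", "C", "C"),
  CATEGORY = c("ANE", "ANE", "BOA", "BOA", "BOA")
)

most_common <- DATA %>%
  count(ID, VAR, CATEGORY) %>%                          # frequency of each combination
  group_by(ID) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>% # most frequent combination per ID
  select(-n)

most_common
```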

Select the row with the maximum value in each group

Here's a data.table solution:

require(data.table) ## 1.9.2
group <- as.data.table(group)

If you want to keep all the entries corresponding to max values of pt within each group:

group[group[, .I[pt == max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2

If you'd like just the first max value of pt:

group[group[, .I[which.max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2

In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.
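
When a group does contain a tied maximum, the two expressions differ. A minimal sketch with made-up data (`scores` is an illustrative name; the columns mirror those used above):

```r
library(data.table)

scores <- data.table(Subject = c(1, 1, 1, 2, 2), pt = c(5, 5, 2, 17, 3))

# every row that attains the group maximum (both tied rows for Subject 1)
all_max <- scores[scores[, .I[pt == max(pt)], by = Subject]$V1]

# only the first row that attains the group maximum
first_max <- scores[scores[, .I[which.max(pt)], by = Subject]$V1]

all_max
first_max
```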

How to select 'x' most recent values in each group in R?

We can use dplyr. We convert the 'Date' column to Date class with as.Date. After grouping by 'Player', we arrange the 'Date' column in descending order and use slice to get the 3 most recent values. If we don't want to change the class of 'Date', we can remove the mutate step and do the conversion within arrange, i.e. arrange(desc(as.Date(Date, '%m/%d/%Y'))).

library(dplyr)
df1 %>%
  mutate(Date=as.Date(Date, '%m/%d/%Y')) %>%
  group_by(Player) %>%
  arrange(desc(Date)) %>%
  slice(1:3)
# Player Date Result
#1 Jasper 2015-04-26 0
#2 Jasper 2015-04-18 5
#3 Jasper 2015-04-12 4
#4 Sam 2015-08-08 3
#5 Sam 2015-04-26 0
#6 Sam 2015-04-18 1
#7 Steve 2015-08-16 4
#8 Steve 2015-04-29 2
#9 Steve 2015-04-26 1

Or after we group by the 'Player', we can use top_n by specifying the 'n' and the 'wt' variable for ordering.

df1 %>%
  mutate(Date=as.Date(Date, '%m/%d/%Y')) %>%
  group_by(Player) %>%
  top_n(n = 3, Date)
# Player Date Result
#1 Sam 2015-04-18 1
#2 Sam 2015-04-26 0
#3 Sam 2015-08-08 3
#4 Steve 2015-04-26 1
#5 Steve 2015-04-29 2
#6 Steve 2015-08-16 4
#7 Jasper 2015-04-12 4
#8 Jasper 2015-04-18 5
#9 Jasper 2015-04-26 0
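
With dplyr 1.0.0 or later, slice_max() can stand in for top_n() here and also returns each group's rows from most recent to oldest. The sketch below uses a shortened, made-up version of the data (`games` and `recent` are illustrative names).

```r
library(dplyr)

games <- data.frame(
  Player = c("Sam", "Sam", "Sam", "Sam", "Steve", "Steve"),
  Date   = c("03/15/2015", "08/08/2015", "04/26/2015", "04/18/2015",
             "02/17/2015", "08/16/2015")
)

recent <- games %>%
  mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%  # parse before ranking
  group_by(Player) %>%
  slice_max(order_by = Date, n = 3)             # up to 3 most recent per player

recent
```

A player with fewer than three games, like Steve here, simply contributes fewer rows.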

Using data.table, we convert the 'data.frame' to 'data.table' (setDT(df1)). We order by 'Date' after converting it to Date class and, grouped by 'Player', use head to get the first 3 rows of each group.

library(data.table)
setDT(df1)[order(-as.IDate(Date, '%m/%d/%Y')),head(.SD, 3) , by = Player]
# Player Date Result
#1: Steve 08/16/2015 4
#2: Steve 04/29/2015 2
#3: Steve 04/26/2015 1
#4: Sam 08/08/2015 3
#5: Sam 04/26/2015 0
#6: Sam 04/18/2015 1
#7: Jasper 04/26/2015 0
#8: Jasper 04/18/2015 5
#9: Jasper 04/12/2015 4

data

df1 <- structure(list(Player = c("Sam", "Sam", "Sam", "Sam", "Sam", 
"Sam", "Sam", "Steve", "Steve", "Steve", "Steve", "Steve", "Steve",
"Steve", "Steve", "Steve", "Steve", "Steve", "Jasper", "Jasper",
"Jasper", "Jasper", "Jasper", "Jasper"), Date = c("03/15/2015",
"03/22/2015", "04/04/2015", "04/12/2015", "04/18/2015", "04/26/2015",
"08/08/2015", "02/17/2015", "02/21/2015", "03/04/2015", "03/11/2015",
"03/15/2015", "03/22/2015", "04/12/2015", "04/18/2015", "04/26/2015",
"04/29/2015", "08/16/2015", "03/15/2015", "03/22/2015", "04/04/2015",
"04/12/2015", "04/18/2015", "04/26/2015"), Result = c(1, 0, 2,
1, 1, 0, 3, 0, 0, 4, 2, 1, 0, 0, 2, 1, 2, 4, 3, 3.5, 4, 4, 5,
0)), .Names = c("Player", "Date", "Result"),
class = "data.frame", row.names = c(NA, -24L))

dplyr select top 10 values for each category

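library(dplyr)

# as_tibble() replaces the deprecated tbl_df()
data <- as_tibble(data) %>%
  group_by(dimension) %>%
  arrange(revenues, .by_group = TRUE) %>%
  # name the ranking column explicitly; otherwise top_n() uses the last column
  top_n(10, wt = revenues)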

