Select the top N values by group
# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, pass each one to the `order` function below
# as its own argument: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# make a working copy of the data frame
x <- mtcars
# order it by the group by variable, so the rows line up with the `tapply` result below
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# rank each miles-per-gallon value within its cyl group
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using 'min' *includes* all rows tied at the cutoff. using 'max' would *exclude* them
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* tied rows where cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
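The tapply/unlist approach above only lines up with the rows because x was sorted by the group by variable first. Here is a minimal sketch of an alternative, assuming the same mtcars data: `ave` returns the within-group ranks in the original row order, so the pre-sorting step is not needed. The name `lowest3` is mine, not from the answer above.

```r
# work on a copy of mtcars, left in its original row order
x <- mtcars

# ave() applies the function within each cyl group and returns the
# results aligned with the rows of x, whatever order they are in
x$ranks <- ave(
  x$mpg,
  x$cyl,
  FUN = function(v) rank(v, ties.method = "min")
)

# keep the three smallest mpg values per group (ties at the cutoff included)
lowest3 <- x[x$ranks <= 3, ]

# sort only for display
lowest3 <- lowest3[order(lowest3$cyl, lowest3$mpg), ]
lowest3[, c("cyl", "mpg", "ranks")]
```

To find the largest values instead, rank the negated column, i.e. `rank(-v, ties.method = "min")`.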
Choose the top five values from each group in r
Just an alternative to the answers already provided, in case you want to use top_n
specifically. You may want to because of the way it handles ties.
library(dplyr)
library(hflights)

hflights %>%
  filter(DayOfWeek %in% c(6, 7)) %>%
  mutate(Season = case_when(
    Month %in% 3:5 ~ "Spring",
    Month %in% 9:11 ~ "Autumn",
    Month %in% 6:8 ~ "Summer",
    Month %in% c(12, 1, 2) ~ "Winter")) %>%
  filter(!is.na(Season)) %>%
  group_by(Season, Dest) %>%
  summarise(flights = n()) %>%
  top_n(5, flights) %>%
  arrange(Season, desc(flights))
Your main issue was with group_by(Dest,Season) as I pointed out in the comments. summarise() removes the last layer of grouping, so it was leaving your data grouped by Dest and not Season.
Your sorting with arrange() was redundant and should be done after using top_n.
As others have pointed out, you should also be using %in% rather than == when comparing a value to more than one value.
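To see the grouping point for yourself, here is a small sketch on mtcars (not the hflights data from the question). It assumes dplyr 1.0.0 or later for the `.groups` argument; the names `by_two` and `remaining` are just for illustration.

```r
library(dplyr)

# group by two variables, then summarise
by_two <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(n = n(), .groups = "drop_last")  # "drop_last" is the default behaviour

# only the first grouping variable is left after summarise()
remaining <- group_vars(by_two)
remaining

# so a following top_n() works within cyl, not within am
by_two %>% top_n(1, n)
```

Swapping the order to group_by(am, cyl) would leave the data grouped by am instead, which is why the order inside group_by() matters here.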
Getting the top values by group
From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."
d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups: grp [3]
# x grp
# <dbl> <fct>
# 1 0.994 1
# 2 0.957 1
# 3 0.955 1
# 4 0.940 1
# 5 0.900 1
# 6 0.963 2
# 7 0.902 2
# 8 0.895 2
# 9 0.858 2
# 10 0.799 2
# 11 0.985 3
# 12 0.893 3
# 13 0.886 3
# 14 0.815 3
# 15 0.812 3
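One thing to be aware of: slice_max() keeps ties by default (with_ties = TRUE), so a group can return more than n rows. A minimal sketch on made-up data (the `scores` data frame is hypothetical, not the `d` used above):

```r
library(dplyr)

# hypothetical data with a tie for second place in group "a"
scores <- data.frame(
  grp = c("a", "a", "a", "a", "b", "b"),
  x   = c(9, 7, 7, 1, 4, 3)
)

# default: both rows tied at x == 7 are kept, so group "a" returns 3 rows
with_ties <- scores %>%
  group_by(grp) %>%
  slice_max(order_by = x, n = 2)

# with_ties = FALSE returns exactly n rows per group (when available)
no_ties <- scores %>%
  group_by(grp) %>%
  slice_max(order_by = x, n = 2, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```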
Pre-dplyr 1.0.0 using top_n:
From ?top_n, about the wt argument:
"The variable to use for ordering [...] defaults to the last variable in the tbl".
The last variable in your data set is "grp", which is not the variable you wish to rank, and which is why your top_n attempt "returns the whole of d". Thus, if you wish to rank by "x" in your data set, you need to specify wt = x.
d %>%
group_by(grp) %>%
top_n(n = 5, wt = x)
Data:
set.seed(123)
d <- data.frame(
x = runif(90),
grp = gl(3, 30))
Selecting top N rows for each group based on value in column
A solution with base R:
# df is split according to y, each piece is ordered by x (largest first),
# only its first z rows are kept, and everything is rbind-ed back together:
do.call(rbind,
lapply(split(df, df$y),
function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3
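EDIT:
A much more direct way (still in base R) provided in a comment by @mt1022:
# note: this keeps the first z rows of each group as they appear,
# so df must already be sorted by x (largest first) within each y
df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]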
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3
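The ave() one-liner counts rows within each group in whatever order they arrive. If your data is not sorted yet, a sketch like the following sorts first. The data frame here is made up for illustration (it is not the original poster's df), and the names `sorted`, `pos` and `top` are mine.

```r
# hypothetical data, deliberately unsorted; z holds how many rows to keep per group
df <- data.frame(
  x = c(2, 3, 1, 6, 8, 9, 11, 10),
  y = c("a", "a", "a", "b", "b", "c", "c", "c"),
  z = c(2, 2, 2, 1, 1, 3, 3, 3)
)

# sort by group, then by x with the largest values first
sorted <- df[order(df$y, -df$x), ]

# position of each row within its group (1, 2, 3, ...)
pos <- ave(seq_len(nrow(sorted)), sorted$y, FUN = seq_along)

# keep the first z rows of every group
top <- sorted[pos <= sorted$z, ]
top
```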
Selecting top N values within a group in a column using R
Or using data.table (mydf from @jazzurro's post). Some options are
library(data.table)
setDT(mydf)[order(yearmonth,-count), .SD[1:2], by=yearmonth]
Or
setDT(mydf)[mydf[order(yearmonth, -count), .I[1:2], by=yearmonth]$V1,]
Or
setorder(setkey(setDT(mydf), yearmonth), yearmonth, -count)[
,.SD[1:2], by=yearmonth]
# yearmonth name count
#1: 201310 Dovas 5
#2: 201310 Indulgd 2
#3: 201311 Dovas 29
#4: 201311 Justina 13
#5: 201312 sUPERED 7
#6: 201312 John Hansen 7
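A caveat with .SD[1:2]: if a group has fewer than two rows, the missing positions come back as rows of NA. Using head(.SD, 2) instead returns only the rows that exist. A small sketch, with a made-up mydf (not the data from @jazzurro's post):

```r
library(data.table)

# hypothetical data: the second yearmonth has only one row
mydf <- data.table(
  yearmonth = c(201310, 201310, 201310, 201311),
  name      = c("A", "B", "C", "D"),
  count     = c(5, 2, 1, 29)
)

# indexing past the end of a group pads the result with an NA row
padded <- mydf[order(yearmonth, -count), .SD[1:2], by = yearmonth]

# head() stops at the end of the group instead
trimmed <- mydf[order(yearmonth, -count), head(.SD, 2), by = yearmonth]

padded
trimmed
```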
Selecting top N rows for each group in dataframe
Using the dplyr package in the tidyverse you can do this:
library(tidyverse)
df <- tribble(
~Index, ~Country
, 4.1, "USA"
, 2.1, "USA"
, 5.2, "USA"
, 1.1, "Singapore"
, 6.2, "Singapore"
, 8.1, "Germany"
, 4.5, "Italy"
, 7.1, "Italy"
, 2.3, "Italy"
, 5.9, "Italy"
, 8.8, "Russia"
)
df %>% # take the dataframe
group_by(Country) %>% # group it by the grouping variable
slice(1:3) # and pick rows 1 to 3 per group
Output:
Index Country
<dbl> <chr>
1 8.1 Germany
2 4.5 Italy
3 7.1 Italy
4 2.3 Italy
5 8.8 Russia
6 1.1 Singapore
7 6.2 Singapore
8 4.1 USA
9 2.1 USA
10 5.2 USA
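Note that slice(1:3) picks rows by position, so the result above is the first three rows per country, not the three largest Index values. If you want the largest values, one option is to sort within groups before slicing. A sketch on a subset of the data above (the name `top3` is mine):

```r
library(dplyr)

# a subset of the example data, enough to show the difference
df <- data.frame(
  Index   = c(4.5, 7.1, 2.3, 5.9, 8.8),
  Country = c("Italy", "Italy", "Italy", "Italy", "Russia")
)

top3 <- df %>%
  group_by(Country) %>%
  arrange(desc(Index), .by_group = TRUE) %>%  # largest Index first, within each country
  slice(1:3) %>%                              # now rows 1 to 3 are the top three
  ungroup()

top3
```

Groups with fewer than three rows (Russia here) simply return what they have.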
How to choose the most common value in a group related to other group in R?
Another dplyr strategy using count and slice:
library(dplyr)
DATA %>%
group_by(ID) %>%
count(VAR, CATEGORY) %>%
slice(which.max(n)) %>%
select(-n)
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOA
3 3 E CAT
4 4 F DOG
Select the row with the maximum value in each group
Here's a data.table solution:
require(data.table) ## 1.9.2
group <- as.data.table(group)
If you want to keep all the entries corresponding to max values of pt within each group:
group[group[, .I[pt == max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
If you'd like just the first max value of pt:
group[group[, .I[which.max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.
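To show where the two versions do differ, here is a sketch with made-up data in which Subject 1 has its maximum pt twice (this is not the `group` data from the question):

```r
library(data.table)

# hypothetical data: Subject 1 reaches its maximum pt (5) in two rows
group <- data.table(
  Subject = c(1, 1, 1, 2, 2),
  pt      = c(5, 5, 2, 17, 3),
  Event   = c(1, 2, 3, 1, 2)
)

# keeps every row that ties for the maximum within its Subject
all_max <- group[group[, .I[pt == max(pt)], by = Subject]$V1]

# keeps only the first row that reaches the maximum within its Subject
first_max <- group[group[, .I[which.max(pt)], by = Subject]$V1]

all_max
first_max
```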
How to select 'x' most recent values in each group in R?
We can use dplyr. We convert the 'Date' column to Date class by using as.Date. After grouping by 'Player', we arrange the 'Date' column in descending order and use slice to get the most recent 3 values. If we don't want to change the class of 'Date', we can remove the mutate step and do the conversion within the arrange, i.e. arrange(desc(as.Date(Date, '%m/%d/%Y')))
library(dplyr)
df1 %>%
mutate(Date=as.Date(Date, '%m/%d/%Y')) %>%
group_by(Player) %>%
arrange(desc(Date)) %>%
slice(1:3)
# Player Date Result
#1 Jasper 2015-04-26 0
#2 Jasper 2015-04-18 5
#3 Jasper 2015-04-12 4
#4 Sam 2015-08-08 3
#5 Sam 2015-04-26 0
#6 Sam 2015-04-18 1
#7 Steve 2015-08-16 4
#8 Steve 2015-04-29 2
#9 Steve 2015-04-26 1
Or after we group by the 'Player', we can use top_n by specifying the 'n' and the 'wt' variable for ordering.
df1 %>%
mutate(Date=as.Date(Date, '%m/%d/%Y')) %>%
group_by(Player) %>%
top_n(n = 3, Date)
# Player Date Result
#1 Sam 2015-04-18 1
#2 Sam 2015-04-26 0
#3 Sam 2015-08-08 3
#4 Steve 2015-04-26 1
#5 Steve 2015-04-29 2
#6 Steve 2015-08-16 4
#7 Jasper 2015-04-12 4
#8 Jasper 2015-04-18 5
#9 Jasper 2015-04-26 0
Using data.table, we convert the 'data.frame' to 'data.table' (setDT(df1)). We order by 'Date' (latest first) after converting it to Date class and, grouped by 'Player', use head to get the first 3 rows of each group.
library(data.table)
setDT(df1)[order(-as.IDate(Date, '%m/%d/%Y')),head(.SD, 3) , by = Player]
# Player Date Result
#1: Steve 08/16/2015 4
#2: Steve 04/29/2015 2
#3: Steve 04/26/2015 1
#4: Sam 08/08/2015 3
#5: Sam 04/26/2015 0
#6: Sam 04/18/2015 1
#7: Jasper 04/26/2015 0
#8: Jasper 04/18/2015 5
#9: Jasper 04/12/2015 4
data
df1 <- structure(list(Player = c("Sam", "Sam", "Sam", "Sam", "Sam",
"Sam", "Sam", "Steve", "Steve", "Steve", "Steve", "Steve", "Steve",
"Steve", "Steve", "Steve", "Steve", "Steve", "Jasper", "Jasper",
"Jasper", "Jasper", "Jasper", "Jasper"), Date = c("03/15/2015",
"03/22/2015", "04/04/2015", "04/12/2015", "04/18/2015", "04/26/2015",
"08/08/2015", "02/17/2015", "02/21/2015", "03/04/2015", "03/11/2015",
"03/15/2015", "03/22/2015", "04/12/2015", "04/18/2015", "04/26/2015",
"04/29/2015", "08/16/2015", "03/15/2015", "03/22/2015", "04/04/2015",
"04/12/2015", "04/18/2015", "04/26/2015"), Result = c(1, 0, 2,
1, 1, 0, 3, 0, 0, 4, 2, 1, 0, 0, 2, 1, 2, 4, 3, 3.5, 4, 4, 5,
0)), .Names = c("Player", "Date", "Result"),
class = "data.frame", row.names = c(NA, -24L))
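The conversion to Date class in the answers above matters because the dates are stored as 'mm/dd/yyyy' text, and text sorts by month first. A base R sketch with made-up dates (not taken from df1) that shows the difference:

```r
# hypothetical dates in the same mm/dd/yyyy text format, spanning two years
dates <- c("12/31/2014", "01/05/2015", "08/08/2015")

# sorting the text puts December 2014 first, because "12" > "08" > "01"
as_text <- sort(dates, decreasing = TRUE)

# ordering by the converted Date values gives the true most-recent-first order
as_date <- dates[order(as.Date(dates, "%m/%d/%Y"), decreasing = TRUE)]

as_text
as_date
```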
dplyr select top 10 values for each category
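library(dplyr)
data <- as_tibble(data) %>%
  group_by(dimension) %>%
  arrange(revenues, .by_group = TRUE) %>%
  # name the ranking column; without wt, top_n() ranks by the last column
  top_n(10, wt = revenues)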