Dplyr - Group by and Select Top X %

dplyr - Group by and select TOP x %

Or another option with dplyr:

mtcars %>% select(gear, wt) %>% 
  group_by(gear) %>% 
  arrange(gear, desc(wt)) %>% 
  filter(wt > quantile(wt, .8))

Source: local data frame [7 x 2]
Groups: gear [3]

   gear    wt
  (dbl) (dbl)
1     3 5.424
2     3 5.345
3     3 5.250
4     4 3.440
5     4 3.440
6     4 3.190
7     5 3.570

How to extract the top x% of rows by group and number in R?

Here is a solution. It selects the top 30% of values within each group of name and then counts the rows that were selected in each group.

library(dplyr)

data %>%
  group_by(name) %>%
  arrange(name, value) %>%
  top_frac(0.30) %>%
  count(name)
#Selecting by value
## A tibble: 4 x 2
## Groups:   name [4]
#  name      n
#  <chr> <int>
#1 A       150
#2 B       300
#3 C         6
#4 D        30

You can check that these counts are in fact 30% of each group of name with

data %>% count(name) %>% mutate(n = n*0.3)
#  name   n
#1    A 150
#2    B 300
#3    C   6
#4    D  30

If you want the top 30% of values overall, regardless of which group they come from, drop the group_by() step:

data %>%
  arrange(name, value) %>%
  top_frac(0.30) %>%
  count(name)
#Selecting by value
#  name   n
#1    A  46
#2    B 420
#3    C  20
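
As of dplyr 1.0.0, top_frac() is superseded by slice_max() with a prop argument. Here is a minimal sketch of both variants (per group and overall), assuming a made-up data frame. The `data` built below is hypothetical and is not the asker's data.

```r
library(dplyr)

# Hypothetical data: 10 rows in group A, 20 rows in group B
set.seed(1)
data <- tibble(
  name  = rep(c("A", "B"), times = c(10, 20)),
  value = runif(30)
)

# Top 30% of values within each group of name
top_by_group <- data %>%
  group_by(name) %>%
  slice_max(value, prop = 0.3, with_ties = FALSE) %>%
  ungroup()

# Top 30% of values overall, ignoring the groups
top_overall <- data %>%
  slice_max(value, prop = 0.3, with_ties = FALSE)

# Rows selected per group in each variant
count(top_by_group, name)
count(top_overall, name)
```

with_ties = FALSE keeps the row count at the rounded-down fraction; the default (with_ties = TRUE) may return extra rows when values are tied.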

dplyr select top 10 values for each category

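data <- as_tibble(data) %>%   # tbl_df() is deprecated; as_tibble() replaces it
  group_by(dimension) %>%
  arrange(revenues, .by_group = TRUE) %>%
  top_n(10, revenues)         # name the ranking column; otherwise top_n() uses the last column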

How to select top N values and group the rest of the remaining ones

You can run some tidyverse operations directly on your original dataframe:

library(tidyverse)
dummy_dataframe %>%
  count(group) %>%
  mutate(id = if_else(row_number() < 5, 1L, 2L)) %>%
  group_by(id) %>%
  arrange(id, -n) %>%
  mutate(group = if_else(id == 2L, "others", group),
         n = if_else(group == "others", sum(n), n)) %>%
  ungroup() %>%
  distinct() %>%
  select(-id)

which gives:

# A tibble: 5 x 2
  group      n
  <chr>  <int>
1 A          3
2 C          2
3 D          2
4 B          1
5 others     3
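
Note that in the pipeline above, row_number() is evaluated before the rows are sorted by n, so which groups count as "top" depends on the order count() returns. A variation that sorts by frequency first might look like the sketch below. The contents of `dummy_dataframe` and the name `n_top` are assumptions made up for illustration.

```r
library(dplyr)

# Hypothetical input: one row per observation, seven distinct groups
dummy_dataframe <- data.frame(
  group = c("A", "A", "A", "C", "C", "D", "D", "B", "E", "F", "G"),
  stringsAsFactors = FALSE
)

n_top <- 4  # hypothetical: number of groups to keep by name

lumped <- dummy_dataframe %>%
  count(group, sort = TRUE) %>%                      # most frequent groups first
  mutate(group = if_else(row_number() <= n_top, group, "others")) %>%
  group_by(group) %>%
  summarise(n = sum(n)) %>%                          # collapse the "others" rows
  arrange(group == "others", desc(n))                # keep "others" last

lumped
```

When several groups are tied for the last kept position (here B, E, F and G all occur once), the one that comes first in count()'s ordering is kept, so the cut-off is somewhat arbitrary.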

Getting the top values by group

From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."

d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups:   grp [3]
#     x grp  
# <dbl> <fct>
#  1 0.994 1    
#  2 0.957 1    
#  3 0.955 1    
#  4 0.940 1    
#  5 0.900 1    
#  6 0.963 2    
#  7 0.902 2    
#  8 0.895 2    
#  9 0.858 2    
# 10 0.799 2    
# 11 0.985 3    
# 12 0.893 3    
# 13 0.886 3    
# 14 0.815 3    
# 15 0.812 3

Pre-dplyr 1.0.0 using top_n:

From ?top_n, about the wt argument:

"The variable to use for ordering [...] defaults to the last variable in the tbl".

The last variable in your data set is "grp", which is not the variable you wish to rank, and which is why your top_n attempt "returns the whole of d". Thus, if you wish to rank by "x" in your data set, you need to specify wt = x.

d %>%
  group_by(grp) %>%
  top_n(n = 5, wt = x)

Data:

set.seed(123)
d <- data.frame(
  x = runif(90),
  grp = gl(3, 30))

Top n rows of each group using dplyr -- with different number per group

This isn't really something that dplyr makes easy. I'd recommend merging in the data and then filtering.


tibble(feed=names(top_n_feed), topn=top_n_feed) %>% 
  inner_join(chickwts) %>% 
  group_by(feed) %>% 
  arrange(desc(weight), .by_group=TRUE) %>% 
  filter(row_number() <= topn) %>%
  select(-topn)
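
The snippet above relies on a named vector top_n_feed from the question. A self-contained sketch of the same join-then-filter idea, assuming a made-up lookup (the counts in `top_n_feed` below are hypothetical), could be:

```r
library(dplyr)

# Hypothetical lookup: how many top rows to keep for each feed
top_n_feed <- c(casein = 2, horsebean = 3)

result <- tibble(feed = names(top_n_feed), topn = unname(top_n_feed)) %>%
  # chickwts$feed is a factor; convert it so both join keys are character
  inner_join(mutate(chickwts, feed = as.character(feed)), by = "feed") %>%
  group_by(feed) %>%
  arrange(desc(weight), .by_group = TRUE) %>%
  filter(row_number() <= topn) %>%   # row_number() restarts within each feed
  ungroup() %>%
  select(-topn)

result
```

Feeds that are missing from the lookup are dropped by the inner join, so only the listed feeds appear in the result.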

Select the top N values by group

# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# work on a copy of the data frame (all columns are kept)
x <- mtcars

# order it based on the group by variable
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties.  using `max` would *exclude* all ties
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
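
A more compact base R sketch of the same idea uses ave(), which returns the per-group ranks in the original row order, so no pre-sorting or rev() is needed. This is an alternative to the tapply() approach above, not a reproduction of it.

```r
# rank mpg within each cyl group; ties share the lowest rank ('min')
x <- mtcars
x$ranks <- ave(x$mpg, x$cyl, FUN = function(v) rank(v, ties.method = "min"))

# keep the three smallest mpg values per cyl (tied rows are all kept)
result <- x[x$ranks <= 3, ]
result[order(result$cyl, result$ranks), c("cyl", "mpg", "ranks")]

# for the largest values instead, rank the negated column:
# x$ranks <- ave(-x$mpg, x$cyl, FUN = function(v) rank(v, ties.method = "min"))
```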

Select a number of top groups from data frame

dplyr has a group_indices function that can be used to assign a consecutive group number (since dplyr 1.0.0, cur_group_id() is the recommended way to do this inside mutate()). Then filter by that new number. In the example below, I keep the first 2 groups.

library(dplyr)

Top <- 2

sortedDf <- exampleDf %>%
  group_by(superchar) %>%
  arrange(desc(groupWeight)) %>%
  mutate(new_id = group_indices()) %>%
  filter(new_id <= Top) %>%
  select(-new_id)

sortedDf
## A tibble: 4 x 4
## Groups:   superchar [2]
#  subchar superchar cweight groupWeight
#  <fct>   <fct>       <dbl>       <dbl>
#1 18      age           0.8          70
#2 20      age           0.6          70
#3 male    gender        0.7          20
#4 female  gender        0.3          20
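
If the goal is to pick the groups with the highest groupWeight rather than the first groups by key, one alternative sketch is to rank the weights directly with dense_rank(). It assumes groupWeight is constant within a group and differs between groups. The `exampleDf` below is hypothetical: it is modelled on the output above, with a third group added.

```r
library(dplyr)

# Hypothetical data: the four rows shown above plus a lower-weight group
exampleDf <- tibble(
  subchar     = c("18", "20", "male", "female", "north", "south"),
  superchar   = c("age", "age", "gender", "gender", "region", "region"),
  cweight     = c(0.8, 0.6, 0.7, 0.3, 0.5, 0.5),
  groupWeight = c(70, 70, 20, 20, 10, 10)
)

Top <- 2

# dense_rank() gives every row of the heaviest group rank 1, the next group 2, ...
topGroups <- exampleDf %>%
  filter(dense_rank(desc(groupWeight)) <= Top)

topGroups
```

Groups that share the same groupWeight would receive the same rank here, so more than Top groups could be returned in that case.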

Selecting top n groups with dplyr then plotting other variables

You don't need inner_join(). I would just determine the top two exams in a separate statement and then filter on those.

top_exams <- count(ap, examName) %>% 
  top_n(2, n) %>% pull(examName)

ap %>% 
  filter(examName %in% top_exams) %>% 
  count(year, examName) %>% 
  ggplot(aes(x = year, y = n, group = examName)) +
  geom_line() +
  facet_wrap(~ examName)
