dplyr - Group by and select TOP x %
Another option with dplyr is to keep, within each group, only the rows above that group's 80th percentile:
mtcars %>% select(gear, wt) %>%
group_by(gear) %>%
arrange(gear, desc(wt)) %>%
filter(wt > quantile(wt, .8))
# Source: local data frame [7 x 2]
# Groups: gear [3]
#    gear    wt
#   (dbl) (dbl)
# 1     3 5.424
# 2     3 5.345
# 3     3 5.250
# 4     4 3.440
# 5     4 3.440
# 6     4 3.190
# 7     5 3.570
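The quantile filter above keeps rows strictly above each group's 80th percentile, which is not always the same set as "the top 20% of rows". As a rough sketch, assuming dplyr 1.0.0 or later (where slice_max() accepts a prop argument), the two approaches can be compared on mtcars. The object names gear_wt, by_quantile and by_prop are only illustrative.

```r
library(dplyr)

gear_wt <- select(mtcars, gear, wt)

# Approach from the answer above: rows strictly above the group's 80th percentile
by_quantile <- gear_wt %>%
  group_by(gear) %>%
  filter(wt > quantile(wt, 0.8)) %>%
  ungroup()

# Alternative: the heaviest 20% of rows in each group (row count is rounded down)
by_prop <- gear_wt %>%
  group_by(gear) %>%
  slice_max(wt, prop = 0.2) %>%
  ungroup()

count(by_quantile, gear)
count(by_prop, gear)
```

On mtcars the two results differ for gear == 4: the quantile filter keeps three rows, while slice_max(prop = 0.2) keeps two, because 20% of 12 rows rounds down to 2.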
How to extract the top x% of rows by group and number in R?
Here is a solution. It selects the top 30% of values within each group of name
and then counts the rows that were selected in each group.
library(dplyr)
data %>%
group_by(name) %>%
arrange(name, value) %>%
top_frac(0.30) %>%
count(name)
#Selecting by value
## A tibble: 4 x 2
## Groups: name [4]
# name n
# <chr> <int>
#1 A 150
#2 B 300
#3 C 6
#4 D 30
These counts are indeed 30% of the size of each group of name, as the following check shows:
data %>% count(name) %>% mutate(n = n*0.3)
# name n
#1 A 150
#2 B 300
#3 C 6
#4 D 30
If you want the top 30% of values overall, regardless of which group they come from, drop the group_by() step:
data %>%
arrange(name, value) %>%
top_frac(0.30) %>%
count(name)
#Selecting by value
# name n
#1 A 46
#2 B 420
#3 C 20
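In dplyr 1.0.0 and later, top_frac() is superseded by slice_max() with a prop argument. The grouped version of the selection could be sketched as below. The toy data frame is invented purely for illustration and is not the data used in the answer.

```r
library(dplyr)

# Hypothetical data: three groups of different sizes
set.seed(1)
toy <- tibble(
  name  = rep(c("A", "B", "C"), times = c(10, 20, 5)),
  value = runif(35)
)

# Keep the top 30% of rows by value within each group
top30 <- toy %>%
  group_by(name) %>%
  slice_max(value, prop = 0.30) %>%
  ungroup()

counts <- count(top30, name)
counts
```

The number of rows kept per group is 30% of the group size rounded down, so a group of 5 rows contributes a single row here.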
dplyr select top 10 values for each category
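# tbl_df() is deprecated; as_tibble() is the current equivalent
data <- as_tibble(data) %>%
group_by(dimension) %>%
arrange(revenues, .by_group = TRUE) %>%
# name the ranking column explicitly; otherwise top_n() uses the last column
top_n(10, revenues)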
How to select top N values and group the rest of the remaining ones
You can run some tidyverse operations directly on your original dataframe:
library(tidyverse)
dummy_dataframe %>%
count(group) %>%
mutate(id = if_else(row_number() < 5, 1L, 2L)) %>%
group_by(id) %>%
arrange(id, -n) %>%
mutate(group = if_else(id == 2L, "others", group),
n = if_else(group == "others", sum(n), n)) %>%
ungroup() %>%
distinct() %>%
select(-id)
which gives:
# A tibble: 5 x 2
group n
<chr> <int>
1 A 3
2 C 2
3 D 2
4 B 1
5 others 3
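Note that count(group) on its own returns the groups in alphabetical order, so row_number() < 5 picks the first four group names rather than the four most frequent ones. A minimal sketch that sorts by frequency first is shown below. The data frame and the top_groups cutoff are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: one row per observation
dummy <- tibble(group = c("A", "A", "A", "B", "C", "C", "D", "D", "E", "F", "G"))
top_groups <- 4

lumped <- dummy %>%
  count(group, sort = TRUE) %>%                       # most frequent groups first
  mutate(group = if_else(row_number() <= top_groups, group, "others")) %>%
  group_by(group) %>%
  summarise(n = sum(n)) %>%                           # collapses the "others" rows
  arrange(group == "others", desc(n))                 # keep "others" at the bottom

lumped
```

Ties at the cutoff are broken by the order in which count() returns the groups, so a different rule may be needed if that matters.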
Getting the top values by group
From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."
d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups: grp [3]
# x grp
# <dbl> <fct>
# 1 0.994 1
# 2 0.957 1
# 3 0.955 1
# 4 0.940 1
# 5 0.900 1
# 6 0.963 2
# 7 0.902 2
# 8 0.895 2
# 9 0.858 2
# 10 0.799 2
# 11 0.985 3
# 12 0.893 3
# 13 0.886 3
# 14 0.815 3
# 15 0.812 3
Pre-dplyr 1.0.0, using top_n:
From ?top_n, about the wt argument:
"The variable to use for ordering [...] defaults to the last variable in the tbl".
The last variable in your data set is "grp", which is not the variable you wish to rank, and which is why your top_n attempt "returns the whole of d". Thus, if you wish to rank by "x" in your data set, you need to specify wt = x.
d %>%
group_by(grp) %>%
top_n(n = 5, wt = x)
Data:
set.seed(123)
d <- data.frame(
x = runif(90),
grp = gl(3, 30))
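The effect of the default wt can be checked directly. The sketch below rebuilds d as above and compares the two calls. The names all_rows and top5 are only illustrative.

```r
library(dplyr)

set.seed(123)
d <- data.frame(
  x = runif(90),
  grp = gl(3, 30))

# No wt: top_n() ranks by the last column (grp), which is constant within
# each group, so every row ties for first place and all rows are kept
all_rows <- suppressMessages(
  d %>% group_by(grp) %>% top_n(n = 5)
)

# Explicit wt: rank by x, so five rows per group are kept
top5 <- d %>% group_by(grp) %>% top_n(n = 5, wt = x)

nrow(all_rows)
nrow(top5)
```

The first call returns all 90 rows of d, the second returns 15.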
Top n rows of each group using dplyr -- with different number per group
This isn't really something that dplyr makes easy. I'd recommend merging in the data and then filtering.
tibble(feed=names(top_n_feed), topn=top_n_feed) %>%
inner_join(chickwts) %>%
group_by(feed) %>%
arrange(desc(weight), .by_group=TRUE) %>%
filter(row_number() <= topn) %>%
select(-topn)
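The snippet above relies on a named vector top_n_feed that is not shown. A runnable sketch, assuming top_n_feed maps each feed name to the number of rows to keep (the values below are made up), might look like this:

```r
library(dplyr)

# Hypothetical named vector: how many rows to keep for each feed
top_n_feed <- c(casein = 2, horsebean = 3, soybean = 1)

# chickwts stores feed as a factor; convert it so both join keys are character
chicks <- mutate(chickwts, feed = as.character(feed))

heaviest <- tibble(feed = names(top_n_feed), topn = unname(top_n_feed)) %>%
  inner_join(chicks, by = "feed") %>%           # feeds not listed are dropped
  group_by(feed) %>%
  arrange(desc(weight), .by_group = TRUE) %>%
  filter(row_number() <= topn) %>%              # per-group cutoff from topn
  select(-topn) %>%
  ungroup()

count(heaviest, feed)
```

Because row_number() breaks ties arbitrarily, exactly topn rows are kept per feed even when weights are tied.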
Select the top N values by group
# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# work on a copy of the data frame (all columns are kept)
x <- mtcars
# order it by the group-by variable
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# figure out the rank of each miles-per-gallon value within each cyl group
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
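The same idea can be written more compactly with base R's ave(), which applies a function within groups and returns the result in the original row order, so no prior sorting is needed. This is only a sketch. The helper top_by_group() and its arguments are hypothetical and not part of the answer above.

```r
# Hypothetical helper: keep the n smallest (or largest) values of `value`
# within each level of `group`
top_by_group <- function(df, value, group, n = 3,
                         largest = FALSE, ties.method = "min") {
  # negate the values so that rank 1 is the largest when largest = TRUE
  v <- if (largest) -df[[value]] else df[[value]]
  # rank within each group; ranks come back aligned with the rows of df
  ranks <- ave(v, df[[group]],
               FUN = function(z) rank(z, ties.method = ties.method))
  df[ranks <= n, , drop = FALSE]
}

lowest_min <- top_by_group(mtcars, "mpg", "cyl", n = 3)
lowest_avg <- top_by_group(mtcars, "mpg", "cyl", n = 3, ties.method = "average")

table(lowest_min$cyl)
table(lowest_avg$cyl)
```

With ties.method = "min" the cyl == 4 group keeps four rows (the two tied values are both included), while "average" keeps only two, matching the behavior described in the comments above.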
Select a number of top groups from data frame
dplyr has a group_indices function that can be used to assign a consecutive group number. Then filter by that new number. In the example below, I will filter/keep the first 2 groups.
library(dplyr)
Top <- 2
sortedDf <- exampleDf %>%
group_by(superchar) %>%
arrange(desc(groupWeight)) %>%
mutate(new_id = group_indices()) %>%
filter(new_id <= Top) %>%
select(-new_id)
sortedDf
## A tibble: 4 x 4
## Groups: superchar [2]
# subchar superchar cweight groupWeight
# <fct> <fct> <dbl> <dbl>
#1 18 age 0.8 70
#2 20 age 0.6 70
#3 male gender 0.7 20
#4 female gender 0.3 20
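Group numbers assigned this way follow the sorted order of the grouping key, not the order produced by arrange(), and calling group_indices() without arguments is deprecated in dplyr 1.0.0 and later in favor of cur_group_id(). If "top" is meant as "largest groupWeight", one possible sketch ranks the groups with dense_rank() instead. The extra "region" rows below are invented so that the two interpretations give different answers.

```r
library(dplyr)

# Hypothetical data: the four rows shown above plus an invented third group
exampleDf <- tibble(
  subchar     = c("18", "20", "male", "female", "north", "south"),
  superchar   = c("age", "age", "gender", "gender", "region", "region"),
  cweight     = c(0.8, 0.6, 0.7, 0.3, 0.5, 0.5),
  groupWeight = c(70, 70, 20, 20, 40, 40)
)
Top <- 2

topByWeight <- exampleDf %>%
  mutate(weight_rank = dense_rank(desc(groupWeight))) %>%  # 1 = heaviest group
  filter(weight_rank <= Top) %>%
  arrange(weight_rank) %>%
  select(-weight_rank)

topByWeight
```

Here the groups kept are age and region, whereas numbering the groups by key order would keep age and gender. This sketch assumes groupWeight is constant within each superchar group.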
Selecting top n groups with dplyr then plotting other variables
You don't need inner_join(). I would just determine the top two exams in a separate statement and then filter on those.
top_exams <- count(ap, examName) %>%
top_n(2, n) %>% pull(examName)
ap %>%
filter(examName %in% top_exams) %>%
count(year, examName) %>%
ggplot(aes(x = year, y = n, group = examName)) +
geom_line() +
facet_wrap(~ examName)