Combine result from top_n with an Other category in dplyr
Instead of top_n
, this seems like a good case for the convenience function tally
. It uses summarise
, sum
and arrange
under the hood.
Then use factor
to create an "Other" category. Use the levels
argument to set "Other" as the last level. "Other" will then will be placed last in the table (and in any subsequent plot of the result).
If "Country" is factor
in your original data, you may wrap Country[1:3]
in as.character
.
group_by(df, Country) %>%
tally(Count, sort = TRUE) %>%
group_by(Country = factor(c(Country[1:3], rep("Other", n() - 3)),
levels = c(Country[1:3], "Other"))) %>%
tally(n)
# Country n
# (fctr) (int)
#1 AUS 6
#2 JPN 5
#3 USA 5
#4 Other 7
Create an Other category by summing in dplyr
You could do something like this, if I am understanding right.
Sort the Value
column, keep the ID for the top 10, and then name the rest "Other". Then just sum over that new ID.
This shows the top 10 plus an "other" field. From your expected output, you might want n = 9
instead.
library(dplyr)
n <- 10
df %>%
arrange(desc(Value)) %>%
mutate(ID2 = if_else(row_number() <= n, ID, "Other"),
SortOrder = if_else(row_number() <= n, row_number(), as.integer(n + 1))) %>%
group_by(ID2, SortOrder) %>%
summarize(Value = sum(Value), .groups = "drop") %>%
arrange(SortOrder) %>%
select(ID = ID2, Value)
# # A tibble: 11 x 2
# ID Value
# <chr> <dbl>
# 1 63366849 42084408
# 2 63475529 21879648
# 3 6774736 21321888
# 4 91274582 17393328
# 5 63445915 15215002
# 6 1097911044 9926442
# 7 1095812405 8971332
# 8 1097910557 5388376
# 9 37548920 5114854
# 10 1005153076 5093730
# 11 Other 148494118
Display top values and the sum of all other values in R
Assuming your data is already sorted, one way would be as follows. It is quite verbose, so I am curious to see other solution approaches.
df = read.table(text='NAME TOTAL
A 100
E 85
D 75
C 55
B 30
F 25
G 20
I 15
H 10',header=T,stringsAsFactors=F)
library(dplyr)
df %>%
mutate(NAME=ifelse(NAME %in% head(NAME,5),NAME,'other')) %>%
mutate(NAME=factor(NAME,levels=unique(NAME))) %>%
group_by(NAME) %>%
summarize(TOTAL=sum(TOTAL))
Output:
# A tibble: 6 x 2
NAME TOTAL
<fctr> <int>
1 A 100
2 E 85
3 D 75
4 C 55
5 B 30
6 other 70
Hope this helps!
Select the top N values by group
# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# create a simple data frame with only two columns
x <- mtcars
# order it based on
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
dplyr - Group by and select TOP x %
Or another option with dplyr:
mtcars %>% select(gear, wt) %>%
group_by(gear) %>%
arrange(gear, desc(wt)) %>%
filter(wt > quantile(wt, .8))
Source: local data frame [7 x 2]
Groups: gear [3]
gear wt
(dbl) (dbl)
1 3 5.424
2 3 5.345
3 3 5.250
4 4 3.440
5 4 3.440
6 4 3.190
7 5 3.570
Return original data set after top_n
Similar idea using pull
but with slightly different approach:
library(tidyverse)
df %>%
summarise_all(sum) %>% # Your method of selecting
gather(key, val) %>% # top three columns
top_n(3) %>% #
arrange(-val) %>% #
pull(key) %>% # pull 'key'
select(df, .) # select cols from df by `.`
# eggs bread juice
#1 4 4 3
#2 4 5 4
#3 7 2 6
#4 3 1 5
#5 5 10 2
And, developing idea from the previous question:
df[, '['(names(sort(colSums(df), T)), 1:3)]
Which gives the same result.
Passing `top_n` and `arrange` to ggplot (dplyr)
I believe what is tripping you up here is that top_n()
defaults to the last variable in the table, unless you tell it what variable to use for ordering. In the examples in our book, the last variable in the dataframe is tf_idf
so that is what is used for ordering. In the mtcars example, top_n()
is using the last column in the dataframe for ordering; that happens to be carb
.
You can always tell top_n()
what variable you want to use for ordering by passing it as an argument. For example, check out this similar workflow using the diamonds dataset.
library(tidyverse)
diamonds %>%
arrange(desc(price)) %>%
group_by(clarity) %>%
top_n(10, price) %>%
ungroup %>%
ggplot(aes(cut, price, fill = clarity)) +
geom_col(show.legend = FALSE, ) +
facet_wrap(~clarity, scales = "free") +
scale_x_discrete(drop=FALSE) +
coord_flip()
Created on 2018-05-17 by the reprex package (v0.2.0).
These example datasets are not perfect parallels because they don't have one row per combination of characteristics in the way that the tidy text data frames do. I am pretty sure the issue with top_n()
is the problem, though.
Creating an other field
Different package/different syntax version:
library(data.table)
dt = as.data.table(DF)
dt[order(-n), # your data is already sorted, so this does nothing for it
if (.BY[[1]]) .SD else list("Other", sum(n)),
by = 1:nrow(dt) <= 5][, !"nrow", with = F]
# Category n
#1: E 163051
#2: K 127133
#3: M 106680
#4: L 64868
#5: I 49701
#6: Other 217022
How to use top_n for conditional extraction
I'd probably just use head
for this, and filter directly on the data frame
df1 <- head(df[df$category == "Yes",], 500)
df2 <- head(df[df$category == "No",], 500)
# to combine
out <- rbind(df1, df2)
I'm guessing top_n
does something similar. I expect there is a nicer way with dplyr
but this should work :)
Related Topics
Combine Rows and Sum Their Values
Why Does ".." Work to Pass Column Names in a Character Vector Variable
Arrow() in Ggplot2 No Longer Supported
Is the Plyr Package for R Not Available for R Version 3.0.2
Assigning and Removing Objects in a Loop: Eval(Parse(Paste(
Fuzzyjoin Two Data Frames Using Data.Table
Enriching a Ggplot2 Plot with Multiple Geom_Segment in a Loop
Assign Names to Vector Entries Without Assigning the Vector a Variable Name
Pull Nth Day of Month in Xts in R
Why Doesn't "+" Operate on Characters in R
Na Matches Na, But Is Not Equal to Na. Why
Remove Zombie Processes Using Parallel Package
Object Not Found Error with Ggplot2
Subtract Values in One Dataframe from Another
Finding the Index of First Changes in the Elements of a Vector