Combine Result from Top_N with an "Other" Category in Dplyr

Combine result from top_n with an Other category in dplyr

Instead of top_n, this seems like a good case for the convenience function tally. It uses summarise, sum and arrange under the hood.

Then use factor to create an "Other" category. Use the levels argument to set "Other" as the last level. "Other" will then will be placed last in the table (and in any subsequent plot of the result).

If "Country" is factor in your original data, you may wrap Country[1:3] in as.character.

group_by(df, Country) %>%
tally(Count, sort = TRUE) %>%
group_by(Country = factor(c(Country[1:3], rep("Other", n() - 3)),
levels = c(Country[1:3], "Other"))) %>%
tally(n)

# Country n
# (fctr) (int)
#1 AUS 6
#2 JPN 5
#3 USA 5
#4 Other 7

Create an Other category by summing in dplyr

You could do something like this, if I am understanding right.

Sort the Value column, keep the ID for the top 10, and then name the rest "Other". Then just sum over that new ID.

This shows the top 10 plus an "other" field. From your expected output, you might want n = 9 instead.

library(dplyr)

n <- 10

df %>%
arrange(desc(Value)) %>%
mutate(ID2 = if_else(row_number() <= n, ID, "Other"),
SortOrder = if_else(row_number() <= n, row_number(), as.integer(n + 1))) %>%
group_by(ID2, SortOrder) %>%
summarize(Value = sum(Value), .groups = "drop") %>%
arrange(SortOrder) %>%
select(ID = ID2, Value)

# # A tibble: 11 x 2
# ID Value
# <chr> <dbl>
# 1 63366849 42084408
# 2 63475529 21879648
# 3 6774736 21321888
# 4 91274582 17393328
# 5 63445915 15215002
# 6 1097911044 9926442
# 7 1095812405 8971332
# 8 1097910557 5388376
# 9 37548920 5114854
# 10 1005153076 5093730
# 11 Other 148494118

Display top values and the sum of all other values in R

Assuming your data is already sorted, one way would be as follows. It is quite verbose, so I am curious to see other solution approaches.

df = read.table(text='NAME  TOTAL
A 100
E 85
D 75
C 55
B 30
F 25
G 20
I 15
H 10',header=T,stringsAsFactors=F)

library(dplyr)
df %>%
mutate(NAME=ifelse(NAME %in% head(NAME,5),NAME,'other')) %>%
mutate(NAME=factor(NAME,levels=unique(NAME))) %>%
group_by(NAME) %>%
summarize(TOTAL=sum(TOTAL))

Output:

# A tibble: 6 x 2
NAME TOTAL
<fctr> <int>
1 A 100
2 E 85
3 D 75
4 C 55
5 B 30
6 other 70

Hope this helps!

Select the top N values by group

# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# create a simple data frame with only two columns
x <- mtcars

# order it based on
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.

dplyr - Group by and select TOP x %

Or another option with dplyr:

mtcars %>% select(gear, wt) %>% 
group_by(gear) %>%
arrange(gear, desc(wt)) %>%
filter(wt > quantile(wt, .8))

Source: local data frame [7 x 2]
Groups: gear [3]

gear wt
(dbl) (dbl)
1 3 5.424
2 3 5.345
3 3 5.250
4 4 3.440
5 4 3.440
6 4 3.190
7 5 3.570

Return original data set after top_n

Similar idea using pull but with slightly different approach:

library(tidyverse)

df %>%
summarise_all(sum) %>% # Your method of selecting
gather(key, val) %>% # top three columns
top_n(3) %>% #
arrange(-val) %>% #
pull(key) %>% # pull 'key'
select(df, .) # select cols from df by `.`

# eggs bread juice
#1 4 4 3
#2 4 5 4
#3 7 2 6
#4 3 1 5
#5 5 10 2

And, developing idea from the previous question:

df[, '['(names(sort(colSums(df), T)), 1:3)]

Which gives the same result.

Passing `top_n` and `arrange` to ggplot (dplyr)

I believe what is tripping you up here is that top_n() defaults to the last variable in the table, unless you tell it what variable to use for ordering. In the examples in our book, the last variable in the dataframe is tf_idf so that is what is used for ordering. In the mtcars example, top_n() is using the last column in the dataframe for ordering; that happens to be carb.

You can always tell top_n() what variable you want to use for ordering by passing it as an argument. For example, check out this similar workflow using the diamonds dataset.

library(tidyverse)

diamonds %>%
arrange(desc(price)) %>%
group_by(clarity) %>%
top_n(10, price) %>%
ungroup %>%
ggplot(aes(cut, price, fill = clarity)) +
geom_col(show.legend = FALSE, ) +
facet_wrap(~clarity, scales = "free") +
scale_x_discrete(drop=FALSE) +
coord_flip()

Sample Image

Created on 2018-05-17 by the reprex package (v0.2.0).

These example datasets are not perfect parallels because they don't have one row per combination of characteristics in the way that the tidy text data frames do. I am pretty sure the issue with top_n() is the problem, though.

Creating an other field

Different package/different syntax version:

library(data.table)

dt = as.data.table(DF)

dt[order(-n), # your data is already sorted, so this does nothing for it
if (.BY[[1]]) .SD else list("Other", sum(n)),
by = 1:nrow(dt) <= 5][, !"nrow", with = F]
# Category n
#1: E 163051
#2: K 127133
#3: M 106680
#4: L 64868
#5: I 49701
#6: Other 217022

How to use top_n for conditional extraction

I'd probably just use head for this, and filter directly on the data frame

df1 <- head(df[df$category == "Yes",], 500)
df2 <- head(df[df$category == "No",], 500)

# to combine
out <- rbind(df1, df2)

I'm guessing top_n does something similar. I expect there is a nicer way with dplyr but this should work :)



Related Topics



Leave a reply



Submit