Getting the Top Values by Group

Getting the top values by group

From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."

d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups: grp [3]
# x grp
# <dbl> <fct>
# 1 0.994 1
# 2 0.957 1
# 3 0.955 1
# 4 0.940 1
# 5 0.900 1
# 6 0.963 2
# 7 0.902 2
# 8 0.895 2
# 9 0.858 2
# 10 0.799 2
# 11 0.985 3
# 12 0.893 3
# 13 0.886 3
# 14 0.815 3
# 15 0.812 3

Pre-dplyr 1.0.0 using top_n:

From ?top_n, about the wt argument:

The variable to use for ordering [...] defaults to the last variable in the tbl".

The last variable in your data set is "grp", which is not the variable you wish to rank, and which is why your top_n attempt "returns the whole of d". Thus, if you wish to rank by "x" in your data set, you need to specify wt = x.

d %>%
group_by(grp) %>%
top_n(n = 5, wt = x)


Data:

set.seed(123)
d <- data.frame(
x = runif(90),
grp = gl(3, 30))

Select the top N values by group

# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# create a simple data frame with only two columns
x <- mtcars

# order it based on
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.

How to select top N values and group the rest of the remaining ones

You can run some tidyverse operations directly on your original dataframe:

library(tidyverse)
dummy_dataframe %>%
count(group) %>%
mutate(id = if_else(row_number() < 5, 1L, 2L)) %>%
group_by(id) %>%
arrange(id, -n) %>%
mutate(group = if_else(id == 2L, "others", group),
n = if_else(group == "others", sum(n), n)) %>%
ungroup() %>%
distinct() %>%
select(-id)

which gives:

# A tibble: 5 x 2
group n
<chr> <int>
1 A 3
2 C 2
3 D 2
4 B 1
5 others 3

Pandas get topmost n records within each group

Did you try

df.groupby('id').head(2)

Output generated:

       id  value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1

(Keep in mind that you might need to order/sort before, depending on your data)

EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

to remove the MultiIndex and flatten the results:

    id  value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1

How to get the top values within each group?

Create a month using dt, then group by s_name and month, then apply a function to the groups, group each group by name and do a sum over qty, sort_values descending and only get the first two rows with head:

df.Time = pd.to_datetime(df.Time, format='%d/%m/%Y')
df['month'] = df.Time.dt.month

df_f = df.groupby(['s_name', 'month']).apply(
lambda df:
df.groupby('p_name').qty.sum()
.sort_values(ascending=False).head(2)
).reset_index()

df_f
# s_name month p_name qty
# 0 A 1 DEF 4
# 1 A 1 ABC 2
# 2 B 2 ABC 3
# 3 B 2 DEF 2
# 4 B 3 ABC 3
# 5 B 3 FGH 0

Find top N values within each group

print(
df.groupby('size').apply(
lambda x: x.groupby('id').sum().nlargest(2, columns='new_a')
).reset_index()[['size', 'id', 'new_a']]
)

Prints:

     size  id  new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12

Get top 1 row of each group

;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.

Selecting top N rows for each group in dataframe

Using the dplyrpackage in the tidyverse you can do this:

library(tidyverse)

df <- tribble(
~Index, ~Country
, 4.1, "USA"
, 2.1, "USA"
, 5.2, "USA"
, 1.1, "Singapore"
, 6.2, "Singapore"
, 8.1, "Germany"
, 4.5, "Italy"
, 7.1, "Italy"
, 2.3, "Italy"
, 5.9, "Italy"
, 8.8, "Russia"
)

df %>% # take the dataframe
group_by(Country) %>% # group it by the grouping variable
slice(1:3) # and pick rows 1 to 3 per group

Output:

   Index Country  
<dbl> <chr>
1 8.1 Germany
2 4.5 Italy
3 7.1 Italy
4 2.3 Italy
5 8.8 Russia
6 1.1 Singapore
7 6.2 Singapore
8 4.1 USA
9 2.1 USA
10 5.2 USA

How to extract the top x% of rows by group and number in R?

Here is a solution. It selects the top 30% values by groups of name and then counts the rows that were selected in each group.

library(dplyr)

data %>%
group_by(name) %>%
arrange(name, value) %>%
top_frac(0.30) %>%
count(name)
#Selecting by value
## A tibble: 4 x 2
## Groups: name [4]
# name n
# <chr> <int>
#1 A 150
#2 B 300
#3 C 6
#4 D 30

It is possible to see that these numbers are in fact 30% of each group of name with

data %>% count(name) %>% mutate(n = n*0.3)
# name n
#1 A 150
#2 B 300
#3 C 6
#4 D 30

If you want the top 30% values, without considering the group the top values come from, then the above must be changed to the following code.

data %>%
arrange(name, value) %>%
top_frac(0.30) %>%
count(name)
#Selecting by value
# name n
#1 A 46
#2 B 420
#3 C 20


Related Topics



Leave a reply



Submit