How to Select Top 5 Percent from Each Group

How do I select TOP 5 PERCENT from each group?

You could use a CTE (Common Table Expression) paired with the NTILE windowing function - this will slice up your data into as many slices as you need, e.g. in your case, into 20 slices (each 5%).

;WITH SlicedData AS
(
SELECT Category, Name, COUNT(Name) Total,
NTILE(20) OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) AS 'NTile'
FROM #TEMP
GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE NTile > 1

This groups your data by Category and Name, orders each category by COUNT(Name) (not sure if COUNT(Name) is really the thing you want to order by here), and then slices every partition into 20 pieces, each representing roughly 5% of that partition. The slice with NTile = 1 is the top 5% slice. The query above skips it with WHERE NTile > 1; to return only the top 5% of each group, use WHERE NTile = 1 instead.

See:

  • MSDN docs on NTILE
  • SQL Server 2005 ranking functions
  • SQL SERVER – 2005 – Sample Example of RANKING Functions – ROW_NUMBER, RANK, DENSE_RANK, NTILE

for more info
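
To make the slicing concrete, here is a minimal sketch in Python using the built-in sqlite3 module (window functions need SQLite 3.25 or newer). The items table, its columns and the sample scores are made up for illustration; only the NTILE(20) with PARTITION BY idea comes from the answer above. It keeps tile 1, i.e. the top slice of each category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: 40 rows in category A, 20 rows in category B.
conn.execute("CREATE TABLE items (category TEXT, name TEXT, score INTEGER)")
rows = [("A", f"a{i:02d}", i) for i in range(1, 41)]
rows += [("B", f"b{i:02d}", i) for i in range(1, 21)]
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)

query = """
WITH sliced AS (
    SELECT category, name, score,
           NTILE(20) OVER (PARTITION BY category ORDER BY score DESC) AS tile
    FROM items
)
SELECT category, name, score
FROM sliced
WHERE tile = 1              -- tile 1 is the top slice of each category
ORDER BY category, score DESC
"""
top_5_percent = conn.execute(query).fetchall()
print(top_5_percent)
```

NTILE hands out whole rows, so a category with fewer than 20 rows still puts one row in tile 1, which is more than 5% of that group.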

select top 5 group by and order by

Maybe you want something like this?

select top 5 CITY, QNT, EXP, RATE 
from (
select *, row_number() over (partition by CITY order by RATE desc) AS RN
from (
select CITY, QNT, EXP, (QNT-EXP)*100/EXP as RATE
from tbl_city

) X
) Y
where RN = 1
order by RATE desc

I didn't test this, but it should first keep only the highest-rate row for each city (RN = 1) and then take the top 5 of those rows, so that the same city is not duplicated.
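
The same two-step idea (best row per city first, then the overall top rows) can be tried out with a small sketch in Python's sqlite3 module (SQLite 3.25+ for window functions). The column name expected stands in for EXP, the sample figures are invented, and LIMIT 2 replaces TOP 5 to keep the output short. Multiplying by 100.0 is an assumption on my part to avoid integer division if the columns are integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: several rows per city.
conn.execute("CREATE TABLE tbl_city (city TEXT, qnt INTEGER, expected INTEGER)")
conn.executemany(
    "INSERT INTO tbl_city VALUES (?, ?, ?)",
    [("Rome", 120, 100), ("Rome", 150, 100),
     ("Oslo", 90, 100), ("Oslo", 130, 100),
     ("Lima", 200, 100)],
)

query = """
SELECT city, rate
FROM (
    SELECT city, rate,
           ROW_NUMBER() OVER (PARTITION BY city ORDER BY rate DESC) AS rn
    FROM (
        SELECT city, (qnt - expected) * 100.0 / expected AS rate
        FROM tbl_city
    ) x
) y
WHERE rn = 1          -- best row of each city only
ORDER BY rate DESC
LIMIT 2               -- then the overall top rows
"""
best_cities = conn.execute(query).fetchall()
print(best_cities)
```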

Postgresql : How do I select top n percent(%) entries from each group/category

To retrieve the rows based on the percentage of the number of rows in each group you can use two window functions: one to count the rows and one to give them a unique number.

select gp,
val
from (
select gp,
val,
count(*) over (partition by gp) as cnt,
row_number() over (partition by gp order by val desc) as rn
from temp
) t
where rn / cnt::numeric <= 0.75;

SQLFiddle example: http://sqlfiddle.com/#!15/94fdd/1
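
One detail worth checking: row_number() and count(*) both return integers, and dividing one integer by another truncates in PostgreSQL, so the ratio needs a decimal operand. The sketch below shows the difference using Python's sqlite3 module (SQLite 3.25+), which also truncates integer division. The temp_vals table and its data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: group a has 4 rows, group b has 8 rows.
conn.execute("CREATE TABLE temp_vals (gp TEXT, val INTEGER)")
rows = [("a", v) for v in range(1, 5)] + [("b", v) for v in range(1, 9)]
conn.executemany("INSERT INTO temp_vals VALUES (?, ?)", rows)

template = """
SELECT gp, COUNT(*)
FROM (
    SELECT gp, val,
           COUNT(*) OVER (PARTITION BY gp) AS cnt,
           ROW_NUMBER() OVER (PARTITION BY gp ORDER BY val DESC) AS rn
    FROM temp_vals
) t
WHERE {ratio} <= 0.75
GROUP BY gp
ORDER BY gp
"""
# Integer division: the ratio is 0 for every row except the last one.
integer_ratio = dict(conn.execute(template.format(ratio="rn / cnt")).fetchall())
# Decimal division: the ratio is the real fraction of the group.
decimal_ratio = dict(conn.execute(template.format(ratio="rn * 1.0 / cnt")).fetchall())
print(integer_ratio, decimal_ratio)
```

With integer division every row except the last of each group passes the filter, so group b returns 7 rows instead of the intended 6.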


Btw: using char is almost always a bad idea because it is a fixed-length data type that is padded to the defined length. I hope you only did that for setting up the example and don't use it in your real table.

SELECT TOP 10 of each group of a certain field with data across 2 tables

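Use APPLY. It is similar to JOIN, but the applied sub-select (TopCaptures) is executed once for every row in Sources, so you can get the top 10 captures per source. Note that TOP 10 without an ORDER BY returns an arbitrary 10 rows; add an ORDER BY inside the sub-select to define what "top" means.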

Variant A: Using a CTE:

; WITH Sources AS (
SELECT SourceId
FROM Source
WHERE Type = 1
AND State = 'TX'
)
SELECT *
FROM Sources
OUTER APPLY (
SELECT TOP 10 *
FROM Captures
WHERE Captures.SourceId = Sources.SourceId
) AS TopCaptures
;

Variant B: Using another Sub-Select

SELECT *
FROM (
SELECT SourceId
FROM Source
WHERE Type = 1
AND State = 'TX'
) AS Sources
OUTER APPLY (
SELECT TOP 10 *
FROM Captures
WHERE Captures.SourceId = Sources.SourceId
) AS TopCaptures
;

Edit: If you want INNER JOIN-like behaviour, use CROSS APPLY instead of OUTER APPLY. With CROSS APPLY, Sources rows that do not have at least one Capture are not returned.
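
The difference between the two APPLY flavours can be illustrated with a rough sketch in Python's sqlite3 module (SQLite 3.25+). SQLite has no APPLY, so this is not the query above: it emulates OUTER APPLY with a LEFT JOIN on a ROW_NUMBER() sub-select, and CROSS APPLY with an inner JOIN. The CaptureId column, the sample rows and the limit of 2 rows per source are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Source (SourceId INTEGER, Type INTEGER, State TEXT);
CREATE TABLE Captures (CaptureId INTEGER, SourceId INTEGER);
-- Source 1 has three captures, source 2 has none, source 3 is not in TX.
INSERT INTO Source VALUES (1, 1, 'TX'), (2, 1, 'TX'), (3, 1, 'CA');
INSERT INTO Captures VALUES (10, 1), (11, 1), (12, 1), (20, 3);
""")

template = """
SELECT s.SourceId, c.CaptureId
FROM Source AS s
{join} (
    SELECT SourceId, CaptureId,
           ROW_NUMBER() OVER (PARTITION BY SourceId
                              ORDER BY CaptureId DESC) AS rn
    FROM Captures
) AS c ON c.SourceId = s.SourceId AND c.rn <= 2
WHERE s.Type = 1 AND s.State = 'TX'
ORDER BY s.SourceId, c.CaptureId DESC
"""
# LEFT JOIN keeps sources without captures (like OUTER APPLY).
outer_rows = conn.execute(template.format(join="LEFT JOIN")).fetchall()
# Inner JOIN drops them (like CROSS APPLY).
cross_rows = conn.execute(template.format(join="JOIN")).fetchall()
print(outer_rows, cross_rows)
```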

select top 30 percent of the entries for each day

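You can use row_number() partitioned by date and compare it against 30% of each day's row count. Note that order by (select null) numbers the rows in arbitrary order; order by a real column instead if "top" should mean something specific.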

select date,receipt,total 
from (select *,
ceiling(tc * 30.00 / 100.00) as under30
from (select date,
receipt,
total,
row_number() over(partition by date order by (select null)) rn,
count(*) over(partition by date order by (select null)) tc
from sales) t
) t1
where rn <= under30


Output:

+------------+---------+-------+
| date | receipt | total |
+------------+---------+-------+
| 2018-04-21 | 325 | 600 |
+------------+---------+-------+
| 2018-04-21 | 326 | 800 |
+------------+---------+-------+
| 2018-04-26 | 330 | 600 |
+------------+---------+-------+
| 2018-04-26 | 331 | 1080 |
+------------+---------+-------+
| 2018-04-29 | 334 | 600 |
+------------+---------+-------+
| 2018-05-01 | 336 | 1500 |
+------------+---------+-------+

Note: If you want 30% of the total count across all days instead, change the count calculation in the above query as follows.

  count(*) over(order by (select null)) tc 
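
The per-day cutoff from the query above can be checked on a small example. This is a sketch in Python's sqlite3 module (SQLite 3.25+) with invented sales rows. Two things differ from the original query and are my assumptions: rows are numbered by receipt so the result is deterministic, and the integer expression (tc * 30 + 99) / 100 stands in for ceiling(tc * 30.00 / 100.00), since not every SQLite build ships a ceiling function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, receipt INTEGER, total INTEGER)")
# Hypothetical data: 5 receipts on the first day, 3 on the second.
rows = [("2018-04-21", 320 + i, 100 * i) for i in range(1, 6)]
rows += [("2018-04-26", 330 + i, 100 * i) for i in range(1, 4)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

query = """
SELECT date, receipt
FROM (
    SELECT date, receipt, total,
           ROW_NUMBER() OVER (PARTITION BY date ORDER BY receipt) AS rn,
           COUNT(*) OVER (PARTITION BY date) AS tc
    FROM sales
) t
WHERE rn <= (tc * 30 + 99) / 100   -- integer ceiling of 30% of tc
ORDER BY date, receipt
"""
first_30_percent = conn.execute(query).fetchall()
print(first_30_percent)
```

A day with 5 receipts keeps 2 rows (30% of 5 is 1.5, rounded up) and a day with 3 receipts keeps 1 row (0.9, rounded up).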

Get top n records for each group of grouped results

Here is one way to do this, using UNION ALL. This works with two groups; if you have more than two groups, you would need to specify the group number and add a query for each group:

(
select *
from mytable
where `group` = 1
order by age desc
LIMIT 2
)
UNION ALL
(
select *
from mytable
where `group` = 2
order by age desc
LIMIT 2
)

There are a variety of ways to do this, see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too; it generates a row number for each record. Using an example from the link above, this returns only those records with a row number less than or equal to 2:

select person, `group`, age
from
(
select person, `group`, age,
(@num:=if(@group = `group`, @num +1, if(@group := `group`, 1, 1))) row_number
from test t
CROSS JOIN (select @num:=0, @group:=null) c
order by `Group`, Age desc, person
) as x
where x.row_number <= 2;

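
On MySQL 8.0 or later, which supports window functions, the same row numbering can be written with ROW_NUMBER() instead of user variables. Here is a minimal sketch of that alternative, run through Python's sqlite3 module (SQLite 3.25+) so it is easy to try. The sample people are invented, and the column is called grp rather than group only to avoid quoting a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (person TEXT, grp INTEGER, age INTEGER)")
# Hypothetical data: three people in each of two groups.
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("Bob", 1, 32), ("Jill", 1, 34), ("Shawn", 1, 42),
     ("Jake", 2, 29), ("Paul", 2, 36), ("Laura", 2, 39)],
)

query = """
SELECT person, grp, age
FROM (
    SELECT person, grp, age,
           ROW_NUMBER() OVER (PARTITION BY grp
                              ORDER BY age DESC, person) AS rn
    FROM test
) x
WHERE rn <= 2            -- two oldest people per group
ORDER BY grp, age DESC
"""
top_two = conn.execute(query).fetchall()
print(top_two)
```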

How to extract the top x% of rows by group and number in R?

Here is a solution. It selects the top 30% values by groups of name and then counts the rows that were selected in each group.

library(dplyr)

data %>%
group_by(name) %>%
arrange(name, value) %>%
top_frac(0.30) %>%
count(name)
#Selecting by value
## A tibble: 4 x 2
## Groups: name [4]
# name n
# <chr> <int>
#1 A 150
#2 B 300
#3 C 6
#4 D 30

It is possible to see that these numbers are in fact 30% of each group of name with

data %>% count(name) %>% mutate(n = n*0.3)
# name n
#1 A 150
#2 B 300
#3 C 6
#4 D 30

If you want the top 30% values, without considering the group the top values come from, then the above must be changed to the following code.

data %>%
arrange(name, value) %>%
top_frac(0.30) %>%
count(name)
#Selecting by value
# name n
#1 A 46
#2 B 420
#3 C 20

