How to Select a Fixed Number of Rows for Each Group

How do I select a fixed number of rows for each group?

Use:

SELECT x.a,
       x.b,
       x.distance
FROM (SELECT t.a,
             t.b,
             t.distance,
             CASE
               WHEN @distance != t.distance THEN @rownum := 1
               ELSE @rownum := @rownum + 1
             END AS rank,
             @distance := t.distance
      FROM TABLE t
      JOIN (SELECT @rownum := 0, @distance := '') r
      ORDER BY t.distance -- important for resetting the rownum variable
     ) x
WHERE x.rank <= 2
ORDER BY x.distance, x.a
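
If you're on MySQL 8.0 or later, you can skip the user variables entirely with a window function. A minimal sketch, keeping the placeholder table name from above and assuming t.a as an arbitrary tie-break within each distance group:

SELECT a, b, distance
FROM (SELECT t.a,
             t.b,
             t.distance,
             ROW_NUMBER() OVER (PARTITION BY t.distance ORDER BY t.a) AS rn
      FROM TABLE t) x
WHERE x.rn <= 2
ORDER BY x.distance, x.a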

Randomly select a fixed number of rows in each group in a SQL Server table

You can use ROW_NUMBER() with NEWID() to generate a random order:

EDIT: I replaced CHECKSUM(NEWID()) with NEWID() since I can't prove which is faster, and NEWID() seems to be the most commonly used.

WITH CTE AS(
   SELECT *,
          RN = ROW_NUMBER() OVER(PARTITION BY id ORDER BY NEWID())
   FROM tbl
)
SELECT id, value1, value2
FROM CTE
WHERE RN <= 2

SQL Fiddle

The fiddle should show different results across runs.


If you're inserting this into another table, use this subquery version:

INSERT INTO yourNewTable(id, value1, value2)
SELECT id, value1, value2
FROM (
   SELECT *,
          RN = ROW_NUMBER() OVER(PARTITION BY id ORDER BY NEWID())
   FROM tbl
) t
WHERE RN <= 2

How to randomly select fixed number of rows (if greater) per group else select all rows in pandas?

You can choose to sample only when a group has more than n rows:

n = 2
(df.groupby('Group_Id')
   .apply(lambda x: x.sample(n) if len(x) > n else x)
   .reset_index(drop=True)
)

You can also shuffle the whole DataFrame and then use groupby().head():

df.sample(frac=1).groupby('Group_Id').head(2)

Output:

  Name  Group_Id
5  DEF         3
0  AAA         1
2  BDF         1
3  CCC         2
4  XYZ         2

Select number of rows for each group where two column values makes one group

What a hairball! This gets progressively harder as you move from most recent, to second most recent, to third most recent.

Let's put this together by getting the list of IDs we need. Then we can pull the items from the table by ID.

This relatively easy query gets you the IDs of your most recent items:

SELECT id FROM
  (SELECT max(id) id, fromitem, toitem
   FROM stuff
   WHERE viewed = 'true'
   GROUP BY fromitem, toitem
  ) a

Fiddle: http://sqlfiddle.com/#!2/f7045/27/0

Next, we need to get the ids of the second most recent items. To do this, we need a self-join style query. We need to do the same summary but on a virtual table that omits the most recent items.

select id from (
  select max(b.id) id, b.fromitem, b.toitem
  from stuff a
  join
  (select id, fromitem, toitem
   from stuff
   where viewed = 'true'
  ) b on (    a.fromitem = b.fromitem
          and a.toitem = b.toitem
          and b.id < a.id)
  where a.viewed = 'true'
  group by fromitem, toitem
) c

Fiddle: http://sqlfiddle.com/#!2/f7045/44/0

Finally, we need to get the ids of the third most recent items. Mercy! We need to join that query we just had, to the table again.

select id from
(
  select max(d.id) id, d.fromitem, d.toitem
  from stuff d
  join
  (
    select max(b.id) id, b.fromitem, b.toitem
    from stuff a
    join
    (
      select id, fromitem, toitem
      from stuff
      where viewed = 'true'
    ) b on (    a.fromitem = b.fromitem
            and a.toitem = b.toitem
            and b.id < a.id)
    where a.viewed = 'true'
    group by fromitem, toitem
  ) c on (    d.fromitem = c.fromitem
          and d.toitem = c.toitem
          and d.id < c.id)
  where d.viewed = 'true'
  group by d.fromitem, d.toitem
) e

Fiddle: http://sqlfiddle.com/#!2/f7045/45/0

So, now we take the union of all those ids, and use them to grab the right rows from the table, and we're done.

SELECT *
FROM STUFF
WHERE ID IN
(
  SELECT id FROM
    (SELECT max(id) id, fromitem, toitem
     FROM stuff
     WHERE viewed = 'true'
     GROUP BY fromitem, toitem
    ) a
  UNION
  select id from (
    select max(b.id) id, b.fromitem, b.toitem
    from stuff a
    join
    (select id, fromitem, toitem
     from stuff
     where viewed = 'true'
    ) b on (    a.fromitem = b.fromitem
            and a.toitem = b.toitem
            and b.id < a.id)
    where a.viewed = 'true'
    group by fromitem, toitem
  ) c
  UNION
  select id from
  (
    select max(d.id) id, d.fromitem, d.toitem
    from stuff d
    join
    (
      select max(b.id) id, b.fromitem, b.toitem
      from stuff a
      join
      (
        select id, fromitem, toitem
        from stuff
        where viewed = 'true'
      ) b on (    a.fromitem = b.fromitem
              and a.toitem = b.toitem
              and b.id < a.id)
      where a.viewed = 'true'
      group by fromitem, toitem
    ) c on (    d.fromitem = c.fromitem
            and d.toitem = c.toitem
            and d.id < c.id)
    where d.viewed = 'true'
    group by d.fromitem, d.toitem
  ) e
  UNION
  select id from stuff where viewed = 'false'
)
order by viewed desc, fromitem, toitem, id desc
order by viewed desc, fromitem, toitem, id desc

Tee hee. Too much SQL. Fiddle: http://sqlfiddle.com/#!2/f7045/47/0

And now we need to cope with your last requirement: your graph is unordered, that is, from=n to=m is the same as from=m to=n.

To do this we need a virtual table instead of the physical table. This will do the trick.

SELECT id,
       LEAST(fromitem, toitem) fromitem,
       GREATEST(fromitem, toitem) toitem,
       viewed,
       data
FROM stuff

Now we need to use this virtual table everywhere the physical table used to appear. Let's create a view to do this.

CREATE VIEW STUFF_UNORDERED
AS
SELECT id,
       LEAST(fromitem, toitem) fromitem,
       GREATEST(fromitem, toitem) toitem,
       viewed,
       data
FROM stuff;

So, our ultimate query is:

SELECT *
FROM stuff
WHERE ID IN
(
  SELECT id FROM
    (SELECT max(id) id, fromitem, toitem
     FROM STUFF_UNORDERED
     WHERE viewed = 'true'
     GROUP BY fromitem, toitem
    ) a
  UNION
  SELECT id FROM (
    SELECT max(b.id) id, b.fromitem, b.toitem
    FROM STUFF_UNORDERED a
    JOIN
    (SELECT id, fromitem, toitem
     FROM STUFF_UNORDERED
     WHERE viewed = 'true'
    ) b ON (    a.fromitem = b.fromitem
            AND a.toitem = b.toitem
            AND b.id < a.id)
    WHERE a.viewed = 'true'
    GROUP BY fromitem, toitem
  ) c
  UNION
  SELECT id FROM
  (
    SELECT max(d.id) id, d.fromitem, d.toitem
    FROM STUFF_UNORDERED d
    JOIN
    (
      SELECT max(b.id) id, b.fromitem, b.toitem
      FROM STUFF_UNORDERED a
      JOIN
      (
        SELECT id, fromitem, toitem
        FROM STUFF_UNORDERED
        WHERE viewed = 'true'
      ) b ON (    a.fromitem = b.fromitem
              AND a.toitem = b.toitem
              AND b.id < a.id)
      WHERE a.viewed = 'true'
      GROUP BY fromitem, toitem
    ) c ON (    d.fromitem = c.fromitem
            AND d.toitem = c.toitem
            AND d.id < c.id)
    WHERE d.viewed = 'true'
    GROUP BY d.fromitem, d.toitem
  ) e
  UNION
  SELECT id FROM STUFF_UNORDERED WHERE viewed = 'false'
)
ORDER BY viewed DESC,
         least(fromitem, toitem),
         greatest(fromitem, toitem),
         id DESC

Fiddle: http://sqlfiddle.com/#!2/8c154/4/0
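
On MySQL 8.0+, window functions collapse all of this into a single pass. A minimal sketch, assuming the same stuff table and the unordered-pair requirement: rank rows within each unordered (fromitem, toitem) pair, keep the three most recent viewed rows, and append the unviewed rows.

SELECT id, fromitem, toitem, viewed, data
FROM (SELECT id, fromitem, toitem, viewed, data,
             ROW_NUMBER() OVER (
                 PARTITION BY least(fromitem, toitem), greatest(fromitem, toitem)
                 ORDER BY id DESC) AS rn
      FROM stuff
      WHERE viewed = 'true') x
WHERE rn <= 3
UNION ALL
SELECT id, fromitem, toitem, viewed, data
FROM stuff
WHERE viewed = 'false'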

How to sample different number of rows from each group in DataFrame

Artificial data generation

Dataframe

Let's first generate some data to see how we can solve the problem:

import pandas as pd

# Define a DataFrame containing employee data
df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                   'Age': [27, 24, 22, 32, 15],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Noida'],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd', '10th']})

Sampling rule

# Number of rows that we want to be sampled from each category
samples_per_group_dict = {'Jai': 1,
                          'Princi': 2}

Problem solving


I can propose two solutions:

  1. Apply on groupby (one-liner)

     output = (df.groupby('Category')
                 .apply(lambda group: group.sample(samples_per_group_dict[group.name]))
                 .reset_index(drop=True))

  2. Looping over groups (more verbose)

     list_of_sampled_groups = []

     for name, group in df.groupby('Category'):
         n_rows_to_sample = samples_per_group_dict[name]
         sampled_group = group.sample(n_rows_to_sample)
         list_of_sampled_groups.append(sampled_group)

     output = pd.concat(list_of_sampled_groups).reset_index(drop=True)

Performance should be the same for both approaches. If performance matters, you can vectorize the calculation, though the exact optimization depends on the number of groups and the number of samples per group.
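
For example, a vectorized sketch using the df and samples_per_group_dict defined above: shuffle once, then keep the first n rows of each group via cumcount().

# Shuffle the whole frame once, then keep rows whose position within
# their group is below that group's sample count
shuffled = df.sample(frac=1)
mask = (shuffled.groupby('Category').cumcount()
        < shuffled['Category'].map(samples_per_group_dict))
output = shuffled[mask].reset_index(drop=True)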

Get top 1 row of each group

;WITH cte AS
(
   SELECT *,
          ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead, as sketched below.
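
For example, a DENSE_RANK variant (a sketch against the same assumed schema) that returns every row tied for the latest DateCreated per DocumentID:

;WITH cte AS
(
   SELECT *,
          DENSE_RANK() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS dr
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE dr = 1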

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or else drop this status history table.

How to select top N rows for each group in a Entity Framework GroupBy with EF 3.1

Update (EF Core 6.0):

EF Core 6.0 added support for translating GroupBy result set projection, so the original code for taking (key, items) now works as it should, i.e.

var query = context.Set<DbDocument>()
    .Where(e => partnerIds.Contains(e.SenderId))
    .GroupBy(e => e.SenderId)
    .Select(g => new
    {
        g.Key,
        Documents = g.OrderByDescending(e => e.InsertedDateTime).Take(10)
    });

However, flattening (via SelectMany) is still unsupported, so you have to use the workaround below if you need such a query shape.

Original (EF Core 3.0/3.1/5.0):

This is a common problem, unfortunately not supported by the EF Core 3.0/3.1/5.0 query translator, specifically for GroupBy.

The workaround is to do the grouping manually by correlating two subqueries - one for the keys and one for the corresponding data.

Applying it to your examples would be something like this.

If you need (key, items) pairs:

var query = context.Set<DbDocument>()
    .Where(t => partnerIds.Contains(t.SenderId))
    .Select(t => t.SenderId).Distinct() // <--
    .Select(key => new
    {
        Key = key,
        Documents = context.Set<DbDocument>()
            .Where(t => t.SenderId == key) // <--
            .OrderByDescending(t => t.InsertedDateTime).Take(10)
            .ToList() // <--
    });

If you need just flat result set containing top N items per key:

var query = context.Set<DbDocument>()
    .Where(t => partnerIds.Contains(t.SenderId))
    .Select(t => t.SenderId).Distinct() // <--
    .SelectMany(key => context.Set<DbDocument>()
        .Where(t => t.SenderId == key) // <--
        .OrderByDescending(t => t.InsertedDateTime).Take(10)
    );

Select number of rows from each category

Using Dynamic Top

In place of N you can use any number:

SELECT Id, Content, Category, createdAt
FROM (SELECT t.*,
             CASE
               WHEN @category != t.category THEN @rownum := 1
               ELSE @rownum := @rownum + 1
             END AS rank,
             @category := t.category AS var_category
      FROM Table1 t
      JOIN (SELECT @rownum := NULL, @category := '') r
      ORDER BY t.Category, t.createdAt DESC) x
WHERE x.rank <= N -- replace N with the desired number of rows per category


Output

ID  Content  Category  createdAt
6   test6    cat1      2018-03-26T18:23:17Z
2   test2    cat1      2018-03-26T18:22:46Z
5   test5    cat2      2018-03-26T18:23:13Z
4   test4    cat2      2018-03-26T18:23:11Z

Live Demo

http://sqlfiddle.com/#!9/00ca02/20

Get records with max value for each group of grouped SQL results

There's a super-simple way to do this in MySQL:

select * 
from (select * from mytable order by `Group`, age desc, Person) x
group by `Group`

This works because in MySQL you're allowed to omit non-group-by columns from aggregation, in which case MySQL just returns the first row. The solution is to first order the data such that for each group the row you want is first, then group by the columns you want the value for.

You avoid complicated subqueries that try to find the max() etc., and also the problem of returning multiple rows when more than one row shares the same maximum value (as the other answers would do).

Note: this is a MySQL-only solution. All other databases I know will throw an SQL syntax error with the message "non aggregated columns are not listed in the group by clause" or similar. Because this solution uses undocumented behavior, the more cautious may want to include a test to assert that it keeps working should a future version of MySQL change this behavior; one such sanity check is sketched below.
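
A sketch of such a sanity check, using the mytable columns from the query above: it counts rows where the trick's picked age differs from the true per-group maximum, so a non-zero result means the behavior has changed.

-- compare the trick's picked row against the true max(age) per group
select count(*) AS mismatches
from (select * from (select * from mytable order by `Group`, age desc, Person) x
      group by `Group`) picked
join (select `Group`, max(age) max_age from mytable group by `Group`) expected
  on picked.`Group` = expected.`Group`
where picked.age <> expected.max_age;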

Version 5.7 update:

Since version 5.7, the sql_mode setting includes ONLY_FULL_GROUP_BY by default, so to make this work you must remove this option (edit the option file for the server to remove the setting), for example as shown below.
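
For a session-level check and change (a sketch; the permanent fix belongs in the server's option file as noted above):

-- show the current modes, then strip ONLY_FULL_GROUP_BY for this session
SELECT @@sql_mode;
SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));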

SELECT fixed number of rows by evenly skipping rows

The mistake in your first attempt is that you can't mix the aggregate function count(*) with the un-aggregated selection of rows. You can fix this by using count() as a window-aggregate function instead (with 10,000 input rows, for example, ceil(count(*) OVER () / 500.0) is 20, so every 20th row is kept):

SELECT *
FROM (
   SELECT *,
          ((row_number() OVER (ORDER BY "time"))
           % ceil(count(*) OVER () / 500.0)::int) AS rn
   FROM data_raw
) sub
WHERE sub.rn = 0;

Detailed explanation here:

  • Best way to get result count before LIMIT was applied

@Alexander has a fix for your last attempt.


