Group by Every N Records in T-SQL

Group by every N records in T-SQL

WITH T AS (
    SELECT RANK() OVER (ORDER BY ID) AS Rank,
           P.Field1, P.Field2, P.Value1, ...
    FROM P
)
SELECT (Rank - 1) / 1000 AS GroupID, AVG(...)
FROM T
GROUP BY (Rank - 1) / 1000;

Something like that should get you started. If you can provide your actual schema I can update as appropriate.
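
As a concrete sketch of the same idea, assuming a hypothetical table P with a unique ID column and a numeric Value1 column, and buckets of 1000 rows:

WITH T AS (
    SELECT RANK() OVER (ORDER BY ID) AS Rank,
           P.Value1
    FROM P
)
SELECT (Rank - 1) / 1000 AS GroupID,
       AVG(Value1) AS AvgValue1   -- cast Value1 if it is an integer and you want a fractional average
FROM T
GROUP BY (Rank - 1) / 1000;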

Group by every N rows within a group

You can mark where each group starts using row_number() arithmetic, then take a cumulative sum of those start markers. With pairs, a new group starts on the odd-numbered rows of each batch:

select t.*,
       sum(is_start) over (order by batch, rowid) as groupid
from (select t.*,
             (case when row_number() over (partition by batch order by rowid) % 2 = 1
                   then 1 else 0
              end) as is_start
      from t
     ) t
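
To generalize this to groups of N within each batch rather than pairs, only the modulus test changes; a sketch for N = 3, using the same assumed columns (batch, rowid):

select t.*,
       sum(is_start) over (order by batch, rowid) as groupid
from (select t.*,
             -- rows 1, 4, 7, ... of each batch start a new group of three
             (case when (row_number() over (partition by batch order by rowid) - 1) % 3 = 0
                   then 1 else 0
              end) as is_start
      from t
     ) t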

Assigning new group for every n records in each category

OK, found my solution with the help of this:

;WITH FirstRank AS
(
    SELECT DENSE_RANK() OVER (ORDER BY Subj, Gender) AS rnk, *
    FROM @students
),
SecondRank AS
(
    SELECT (ROW_NUMBER() OVER (PARTITION BY rnk ORDER BY Gender) - 1) / 2 AS rn, *
    FROM FirstRank
)
SELECT DENSE_RANK() OVER (ORDER BY rnk, rn) AS GrpNo, Subj, Gender, StdId
FROM SecondRank
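
If you want to run the query above stand-alone, a hypothetical @students table variable along these lines works (column names taken from the query; types are assumptions):

DECLARE @students TABLE (StdId INT, Subj VARCHAR(10), Gender CHAR(1));

INSERT INTO @students (StdId, Subj, Gender)
VALUES (1, 'Math', 'F'), (2, 'Math', 'F'), (3, 'Math', 'F'),
       (4, 'Math', 'M'), (5, 'Math', 'M'),
       (6, 'Sci',  'F'), (7, 'Sci',  'M'), (8, 'Sci',  'M');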

SQL Server : create group of N rows each and give group number for each group

Use row_number() to enumerate the rows, arithmetic to assign the group and then take the minimum of the id:

SELECT t.*, MIN(ID) OVER (PARTITION BY grp) AS groupnumber
FROM (SELECT t.*,
             (ROW_NUMBER() OVER (ORDER BY ID) - 1) / 3 AS grp
      FROM [Table] t
     ) t
ORDER BY ID ASC;

It is possible to do this without a subquery, but the logic is rather messy:

select t.*,
       (case when row_number() over (order by id) % 3 = 0
             then lag(id, 2) over (order by id)
             when row_number() over (order by id) % 3 = 2
             then lag(id, 1) over (order by id)
             else id
        end) as groupnumber
from [Table] t
order by id;
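
As a quick check of the CASE logic, assuming ids 1 through 7 exist in the table, each block of three rows is labelled with its first id:

-- id          : 1  2  3  4  5  6  7
-- groupnumber : 1  1  1  4  4  4  7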

Selecting the first N rows of each group ordered by date

As well as the row_number() solution, another option is CROSS APPLY (SELECT TOP ...):

SELECT m.masterid,
       d.detailid,
       m.numbers,
       d.date_time,
       d.value
FROM masters AS m
CROSS APPLY (
    SELECT TOP (3) *
    FROM details AS d
    WHERE d.date_time >= '2020-01-01'
      AND m.masterid = d.masterid
    ORDER BY d.date_time
) AS d
WHERE m.tags LIKE '%Tag2%'
ORDER BY m.masterid DESC,
         d.date_time;

This may be faster or slower than row_number, mostly depending on cardinalities (quantity of rows) and indexing.

If indexing is good and it's a small number of rows, it will usually be faster. If the inner table needs sorting, or you are selecting most of the rows anyway, then use row_number().
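
If you go the CROSS APPLY route, the index that usually makes it fast is one that lets each per-master probe seek on masterid and read the TOP rows already ordered by date_time. A hypothetical definition, using only the columns referenced in the query above:

CREATE INDEX IX_details_masterid_date_time
    ON details (masterid, date_time)
    INCLUDE (detailid, value);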

grouping every N values

The first approach that comes to mind is to use row_number() to annotate the table, then group by blocks of 16 rows.

SELECT min(id) AS first_id, max(id) AS last_id, avg(rainfall) AS avg_this_16
FROM (
    SELECT id, rainfall, row_number() OVER (ORDER BY id) AS n
    FROM the_table
) x(id, rainfall, n)
GROUP BY (n - 1) / 16
ORDER BY (n - 1) / 16;

Note that this won't necessarily include 16 samples for the last group.

Alternately you can calculate a running average by using avg() as a window function:

SELECT id, avg(rainfall) OVER (ORDER BY id ROWS 15 PRECEDING)
FROM the_table;

... possibly annotating that with the row number and selecting the ones you want:

SELECT id AS greatest_id_in_group, avg_last_16_inclusive
FROM (
    SELECT id,
           avg(rainfall) OVER (ORDER BY id ROWS 15 PRECEDING) AS avg_last_16_inclusive,
           row_number() OVER (ORDER BY id) AS n
    FROM the_table
) x
WHERE n % 16 = 0;

This will disregard the last n<16 samples, not returning a row for them.

Note that I'm assuming the IDs aren't guaranteed to be contiguous. If they are gap-less, you can just group by id/16 and avoid the window function.
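
A sketch of that gap-free shortcut, assuming id starts at 1 and has no gaps:

SELECT min(id) AS first_id, max(id) AS last_id, avg(rainfall) AS avg_this_16
FROM the_table
GROUP BY (id - 1) / 16
ORDER BY (id - 1) / 16;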

Get top 1 row of each group

;WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect two entries per day, this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead:
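
That is the same query with the ranking function swapped; rows that tie on DateCreated within a DocumentID all come back with rn = 1:

;WITH cte AS
(
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1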

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or drop this status history table.
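
A minimal sketch of such a trigger, assuming a hypothetical parent table Documents(DocumentID, Status) and Status/DateCreated columns on DocumentStatusLogs (adjust names to your schema):

CREATE TRIGGER trg_DocumentStatusLogs_SyncStatus
ON DocumentStatusLogs
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- push the newest logged status per document into the parent row
    UPDATE d
    SET d.Status = x.Status
    FROM Documents d
    JOIN (SELECT i.DocumentID, i.Status,
                 ROW_NUMBER() OVER (PARTITION BY i.DocumentID ORDER BY i.DateCreated DESC) AS rn
          FROM inserted i) x
      ON x.DocumentID = d.DocumentID
     AND x.rn = 1;
END;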

GROUP BY every n records (Keeping first record's ID)

Below is for BigQuery Standard SQL (and actually works):

#standardSQL
WITH skus AS (
    SELECT sku, MIN(dt) AS start_date
    FROM `project.dataset.table`
    GROUP BY sku
)
SELECT
    MIN(dt) begin_date,
    sku,
    SUM(sales) sales,
    DIV(DATE_DIFF(dt, start_date, DAY) + 7, 7) week
FROM `project.dataset.table` t
JOIN skus s USING(sku)
GROUP BY sku, week

You can test and play with the above using the sample data from your question, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2019-02-13' dt, '123' sku, 10 sales UNION ALL
SELECT '2019-02-14', '123', 10 UNION ALL
SELECT '2019-02-15', '123', 10 UNION ALL
SELECT '2019-02-16', '123', 10 UNION ALL
SELECT '2019-02-17', '123', 10 UNION ALL
SELECT '2019-02-18', '123', 10 UNION ALL
SELECT '2019-02-19', '123', 10 UNION ALL
SELECT '2019-02-20', '123', 10 UNION ALL
SELECT '2019-02-21', '456', 10 UNION ALL
SELECT '2019-02-22', '456', 10
), skus AS (
SELECT sku, MIN(dt) AS start_date
FROM `project.dataset.table`
GROUP BY sku
)
SELECT
MIN(dt) begin_date,
sku,
SUM(sales) sales,
DIV(DATE_DIFF(dt, start_date, DAY) + 7, 7) week
FROM `project.dataset.table` t
JOIN skus s USING(sku)
GROUP BY sku, week
-- ORDER BY sku, week

with result:

Row  begin_date   sku  sales  week
1    2019-02-13   123  70     1
2    2019-02-20   123  10     2
3    2019-02-21   456  20     1
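
The week number comes from DIV(DATE_DIFF(dt, start_date, DAY) + 7, 7); walking it through for sku 123, whose start_date is 2019-02-13 in the sample data:

-- dt = 2019-02-13: DIV(0 + 7, 7) = 1  -> week 1
-- dt = 2019-02-19: DIV(6 + 7, 7) = 1  -> week 1
-- dt = 2019-02-20: DIV(7 + 7, 7) = 2  -> week 2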

T-SQL update top n rows for each group with n variable for each group (cross apply alternative)

You could try using a simple JOIN instead of a correlated subquery:

WITH cte AS
(
    SELECT d.id, d.col, d.dest, s.source
    FROM (SELECT *,
                 rn = ROW_NUMBER() OVER (PARTITION BY col ORDER BY id)
          FROM #desttable) d
    JOIN #sourcetable s
      ON d.col = s.col
     AND d.rn <= s.rownum
)
UPDATE cte
SET dest = source;

SELECT *
FROM #desttable;
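
To try it stand-alone, a hypothetical shape for the two temp tables (column names taken from the query; rownum in #sourcetable is how many rows of that col should be updated):

CREATE TABLE #desttable (id INT, col INT, dest VARCHAR(50));
CREATE TABLE #sourcetable (col INT, source VARCHAR(50), rownum INT);

INSERT INTO #desttable (id, col, dest)
VALUES (1, 1, NULL), (2, 1, NULL), (3, 1, NULL),
       (4, 2, NULL), (5, 2, NULL);

INSERT INTO #sourcetable (col, source, rownum)
VALUES (1, 'A', 2), (2, 'B', 1);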

You should post your real data sample, data structures and query plans. Otherwise we can only guess how to improve it.


