Group by and Aggregate Sequential Numeric Values

GROUP BY and aggregate sequential numeric values

There's much value in @a_horse_with_no_name's answer, both as a correct solution and, as I already said in a comment, as good material for learning how to use different kinds of window functions in PostgreSQL.

And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.

That is just what the above-mentioned answer provides, and that is precisely what I think can be done in a simpler way. Here's how:

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID
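
To see why subtracting ROW_NUMBER() isolates runs of consecutive years, here is a minimal pandas sketch with made-up sample data (the company, profession and years are hypothetical, not from the question): within one unbroken run both the year and its row number grow by 1, so their difference stays constant and can serve as the extra grouping key.

import pandas as pd

df = pd.DataFrame({'company': 'Acme', 'profession': 'welder',
                   'year': [1999, 2000, 2001, 2005, 2006]})

# rank(method='first') plays the role of ROW_NUMBER() within each partition
df['seqID'] = df['year'] - df.groupby(['company', 'profession'])['year'].rank(method='first')
print(df)
#   company profession  year   seqID
# 0    Acme     welder  1999  1998.0
# 1    Acme     welder  2000  1998.0
# 2    Acme     welder  2001  1998.0
# 3    Acme     welder  2005  2001.0
# 4    Acme     welder  2006  2001.0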

Groupby column and aggregate consecutive rows

Try groupby + aggregate, with groups created via a cumsum that increments wherever 'Class Number' changes:

new_df = (
    df.groupby(df['Class Number'].ne(df['Class Number'].shift()).cumsum())
      .aggregate({'Class Number': 'first',
                  'Start': 'first',
                  'End': 'last',
                  'Length': 'sum'})
      .reset_index(drop=True)
)

new_df:

  Class Number   Start     End  Length
0            1  58.063  58.585   0.522
1            2  58.585  60.159   1.574
2           2M  60.159  61.499   1.340
3            2  61.499  62.156   0.657

Or, if the 'Start' and 'End' values should be aggregated with 'min' and 'max' instead of 'first' and 'last':


new_df = (
    df.groupby(df['Class Number'].ne(df['Class Number'].shift()).cumsum())
      .aggregate({'Class Number': 'first',
                  'Start': 'min',
                  'End': 'max',
                  'Length': 'sum'})
      .reset_index(drop=True)
)

The results are the same as above in this case.
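
If it helps to see what the grouping key actually looks like, here is a small self-contained sketch with invented rows shaped like the question's data (the values are hypothetical):

import pandas as pd

df = pd.DataFrame({'Class Number': ['1', '2', '2', '2M', '2', '2'],
                   'Start':  [58.063, 58.585, 59.000, 60.159, 61.499, 61.800],
                   'End':    [58.585, 59.000, 60.159, 61.499, 61.800, 62.156],
                   'Length': [0.522, 0.415, 1.159, 1.340, 0.301, 0.356]})

# Each time 'Class Number' differs from the previous row the comparison is True,
# so the running sum ticks up and every run of equal consecutive values gets its own id.
key = df['Class Number'].ne(df['Class Number'].shift()).cumsum()
print(key.tolist())   # [1, 2, 2, 3, 4, 4]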

Aggregate numbers that are in sequence

You could do it without PL/SQL, using a query with some common table expressions (WITH clause). It would look like this:

with add_break as (
  select part_no,
         serial_no,
         serial_no - 1 - lag(serial_no, 1, 0) over (partition by part_no order by serial_no) brk
  from part_tab
),
add_group as (
  select add_break.*,
         sum(brk) over (partition by part_no order by serial_no) as grp
  from add_break
)
select part_no,
       case when min(serial_no) = max(serial_no) then to_char(min(serial_no))
            else to_char(min(serial_no)) || '-' || to_char(max(serial_no))
       end range
from add_group
group by part_no, grp
order by 1, 2

Output for your example data:

part_no | range
--------+------
A       | 1-3
A       | 5
A       | 7-10
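
The same break-then-running-sum idea can be sketched outside the database as well; the following pandas snippet (with part_tab rows made up to match the output above) is only an illustration of the logic, not part of the original answer:

import pandas as pd

df = pd.DataFrame({'part_no': ['A'] * 8,
                   'serial_no': [1, 2, 3, 5, 7, 8, 9, 10]})

# brk is 0 while serial numbers run consecutively and equals the gap size otherwise,
# so its running sum labels each unbroken run (same as the grp column in the query)
brk = df['serial_no'] - 1 - df.groupby('part_no')['serial_no'].shift().fillna(0)
df['grp'] = brk.groupby(df['part_no']).cumsum()

ranges = (df.groupby(['part_no', 'grp'])['serial_no']
            .agg(['min', 'max'])
            .apply(lambda r: str(r['min']) if r['min'] == r['max']
                   else f"{r['min']}-{r['max']}", axis=1))
print(ranges.tolist())   # ['1-3', '5', '7-10']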

Group by sequential data in R

Here is one dplyr option -

library(dplyr)

df %>%
  group_by(gene_name) %>%
  mutate(grp = gene_number - lag(gene_number, default = 0) > 2) %>%
  group_by(grp = cumsum(grp)) %>%
  filter(n() > 1) %>%
  ungroup

#  gene_name          gene_number   grp
#1 ENSMUSG00000000001        4732     1
#2 ENSMUSG00000000001        4733     1
#3 ENSMUSG00000000058        7603     2
#4 ENSMUSG00000000058        7604     2
#5 ENSMUSG00000000058        8246     3
#6 ENSMUSG00000000058        8248     3

For each gene_name, subtract the previous gene_number value from the current one and increment the group count when the difference is greater than 2. Rows are then dropped if their group contains only a single row.

data

df <- structure(list(gene_name = c("ENSMUSG00000000001", "ENSMUSG00000000001", 
"ENSMUSG00000000058", "ENSMUSG00000000058", "ENSMUSG00000000058",
"ENSMUSG00000000058", "ENSMUSG00000000058"), gene_number = c(4732L,
4733L, 7603L, 7604L, 8246L, 8248L, 9001L)),
class = "data.frame", row.names = c(NA, -7L))

GROUP BY consecutive rows where columns are equal or NULL

For a fixed set of three columns this could be a possible solution.

http://sqlfiddle.com/#!17/45dc7/137

Disclaimer: This will not work if the same values can occur in different columns. E.g. one row (42, NULL, "A42", NULL) and one row (23, "A42", NULL, NULL) will lead to unwanted results. The fix for that is to concatenate a column identifier with a unique delimiter to the string and to remove it after the operation with a string split.

WITH test_table as (
  SELECT *,
         array_remove(ARRAY[column1, column2, column3], null) as arr,                 -- A
         cardinality(array_remove(ARRAY[column1, column2, column3], null)) as arr_len
  FROM test_table
)
SELECT
  s.array_agg as aggregates,   -- G
  MAX(tt.column1) as column1,
  MAX(tt.column2) as column2,
  MAX(tt.column3) as column3
FROM (
  SELECT array_agg(id)         -- E
  FROM (
    SELECT DISTINCT ON (t1.id)
           t1.id,
           CASE WHEN t1.arr_len >= t2.arr_len THEN t1.arr ELSE t2.arr END as arr      -- C
    FROM test_table as t1
    JOIN test_table as t2      -- B
      ON t1.arr @> t2.arr AND COALESCE(t2.column1, t2.column2, t2.column3) IS NOT NULL
      OR t2.arr @> t1.arr AND COALESCE(t1.column1, t1.column2, t1.column3) IS NOT NULL
    ORDER BY t1.id, GREATEST(t1.arr_len, t2.arr_len) DESC                             -- D
  ) s
  GROUP BY arr

  UNION

  SELECT ARRAY[id]
  FROM test_table tt
  WHERE COALESCE(tt.column1, tt.column2, tt.column3) IS NULL                          -- F
) s
JOIN test_table tt ON tt.id = ANY (s.array_agg)
GROUP BY s.array_agg

A: Aggregate the column values into an array and remove the NULL values. The reason is that I check for subsets later, which would not work with NULLs. This is the point where you should add the column identifier mentioned in the disclaimer above.

B: CROSS JOIN the table against itself. Here I am checking whether one column aggregate is a subset of another. Rows with only NULL values are ignored (that is what the COALESCE function is for).

C: Take the longer of the two column arrays, either from the first or from the second table, depending on the id.

D: The ORDER BY on the greatest array length, together with DISTINCT ON, ensures that only the longest array is kept for each id.

E: At this point there are many ids sharing the same column array set. The array sets are used to aggregate the ids, i.e. the ids are put together here.

F: Add all rows whose columns are all NULL.

G: One last JOIN back against the table. The rows taken are those that are part of the id aggregation from (E). After that, the MAX value per column is computed within each group.

Edit: Fiddle for PostgreSQL 9.3 (array_length instead of the cardinality function), with an added test case (8, 'A2', 'A3', 'A8'):

http://sqlfiddle.com/#!15/8800d/2

GROUP BY consecutive dates delimited by gaps

create table t ("date" date, "value" int);
insert into t ("date", "value") values
('2011-10-31', 2),
('2011-11-01', 8),
('2011-11-02', 10),
('2012-09-13', 1),
('2012-09-14', 4),
('2012-09-15', 5),
('2012-09-16', 20),
('2012-10-30', 10);

Simpler and cheaper version:

select min("date"), max("date"), sum(value)
from (
select
"date", value,
"date" - (dense_rank() over(order by "date"))::int g
from t
) s
group by s.g
order by 1

My first try was more complex and expensive:

create temporary sequence s;
select min("date"), max("date"), sum(value)
from (
  select
    "date", value, d,
    case
      when lag("date", 1, null) over(order by s.d) is null and "date" is not null
        then nextval('s')
      when lag("date", 1, null) over(order by s.d) is not null and "date" is not null
        then lastval()
      else 0
    end g
  from t
  right join generate_series(
    (select min("date") from t)::date,
    (select max("date") from t)::date + 1,
    '1 day'
  ) s(d) on s.d::date = t."date"
) q
where g != 0
group by g
order by 1
;
drop sequence s;

The output:

    min     |    max     | sum 
------------+------------+-----
2011-10-31 | 2011-11-02 | 20
2012-09-13 | 2012-09-16 | 30
2012-10-30 | 2012-10-30 | 10
(3 rows)
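
For comparison, here is a small pandas sketch of the same date-minus-dense_rank trick, run against the sample rows above (this is just an illustration, not something the answer relies on):

import pandas as pd

t = pd.DataFrame({'date': pd.to_datetime(['2011-10-31', '2011-11-01', '2011-11-02',
                                          '2012-09-13', '2012-09-14', '2012-09-15',
                                          '2012-09-16', '2012-10-30']),
                  'value': [2, 8, 10, 1, 4, 5, 20, 10]})

# Consecutive dates advance in lockstep with their dense rank, so the difference is
# constant inside each gap-free island and jumps whenever a day is missing.
g = t['date'] - pd.to_timedelta(t['date'].rank(method='dense'), unit='D')
out = t.groupby(g).agg(min=('date', 'min'), max=('date', 'max'), sum=('value', 'sum'))
print(out.reset_index(drop=True))
#          min        max  sum
# 0 2011-10-31 2011-11-02   20
# 1 2012-09-13 2012-09-16   30
# 2 2012-10-30 2012-10-30   10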

Aggregate pandas dataframe by n consecutive lines

Do this with groupby + agg:

s = df.groupby(df.index//2).agg({'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'})

   Open  High  Low  Close
0     1    10    0      4
1     3     8    2      7
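
A quick self-contained run with invented OHLC rows (not the question's data) makes it clear that df.index // 2 simply pairs every two consecutive rows into one group:

import pandas as pd

df = pd.DataFrame({'Open':  [1, 2, 3, 5],
                   'High':  [10, 9, 8, 6],
                   'Low':   [0, 1, 2, 3],
                   'Close': [2, 4, 5, 7]})

# the integer-divided index is the grouping key: rows 0-1 -> group 0, rows 2-3 -> group 1
print((df.index // 2).tolist())   # [0, 0, 1, 1]
s = df.groupby(df.index // 2).agg({'Open': 'first', 'High': 'max',
                                   'Low': 'min', 'Close': 'last'})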

Group rows in a pandas data frame when the difference of consecutive rows is less than a value

You can do a named aggregation on groupby:

(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
   .agg(col1=('col1', 'first'),
        col2=('col2', 'sum'),
        col3=('col3', 'sum'),
        col4=('col1', 'last'))
)

Output:

   col1  col2  col3  col4
0     1     7    10     4
1     7     8    15     9
2    15     1    12    15

Update: without named aggregation you can do something like this:

groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
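
For reference, here are hypothetical input rows (not taken from the question) that reproduce the output above; they make the grouping rule concrete: a new group starts whenever col1 jumps by 3 or more relative to the previous row.

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 4, 7, 9, 15],
                   'col2': [2, 3, 2, 5, 3, 1],
                   'col3': [3, 3, 4, 7, 8, 12]})

# diff() is NaN on the first row, so ge(3) is False there and the first group id is 0
print(df.col1.diff().ge(3).cumsum().tolist())   # [0, 0, 0, 1, 1, 2]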

