GROUP BY and aggregate sequential numeric values
There's much value in @a_horse_with_no_name's answer, both as a correct solution and, as I already said in a comment, as good material for learning how to use different kinds of window functions in PostgreSQL.
And yet I cannot help feeling that the approach taken in that answer is a bit too much effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years into arrays. You've already got company and profession; now you only need something to distinguish years that belong to different sequences.
That is just what the above-mentioned answer provides, and that is precisely what I think can be done in a simpler way. Here's how:
WITH MarkedForGrouping AS (
SELECT
company,
profession,
year,
year - ROW_NUMBER() OVER (
PARTITION BY company, profession
ORDER BY year
) AS seqID
FROM atable
)
SELECT
company,
profession,
array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
company,
profession,
seqID
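The same trick translates directly to pandas. Here is a sketch (the sample data is my own): subtracting a per-group row number from the year yields a value that is constant for each run of consecutive years, and that value becomes the extra grouping key.

```python
import pandas as pd

# Hypothetical sample data, analogous to "atable" in the query above.
df = pd.DataFrame({
    "company":    ["ACME"] * 5,
    "profession": ["dev"] * 5,
    "year":       [2001, 2002, 2003, 2005, 2006],
})

# year - row number within (company, profession) is constant per run
# of consecutive years; cumcount() plays the role of ROW_NUMBER().
seq_id = (df["year"] - df.groupby(["company", "profession"]).cumcount()).rename("seq_id")

out = (df.groupby(["company", "profession", seq_id])["year"]
         .agg(list)
         .reset_index(name="years"))
print(out["years"].tolist())  # [[2001, 2002, 2003], [2005, 2006]]
```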
Groupby column and aggregate consecutive rows
Try groupby + aggregate, with the groups created via cumsum based on where the 'Class Number' values change:
new_df = (
df.groupby(df['Class Number'].ne(df['Class Number'].shift()).cumsum())
.aggregate({'Class Number': 'first',
'Start': 'first',
'End': 'last',
'Length': 'sum'})
.reset_index(drop=True)
)
new_df:
Class Number Start End Length
0 1 58.063 58.585 0.522
1 2 58.585 60.159 1.574
2 2M 60.159 61.499 1.340
3 2 61.499 62.156 0.657
Or, if the 'Start' and 'End' values should be aggregated with 'min' and 'max' instead of 'first' and 'last':
new_df = (
df.groupby(df['Class Number'].ne(df['Class Number'].shift()).cumsum())
.aggregate({'Class Number': 'first',
'Start': 'min',
'End': 'max',
'Length': 'sum'})
.reset_index(drop=True)
)
The results are the same as above in this case.
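The grouping key itself is worth isolating. A minimal sketch (the sample data is my own): comparing each 'Class Number' with its predecessor and taking a cumulative sum gives every run of equal consecutive values its own group id.

```python
import pandas as pd

# Hypothetical sequence of class labels with repeated runs.
s = pd.Series(["1", "1", "2", "2M", "2M", "2"], name="Class Number")

# True wherever the value differs from the previous row; the running
# sum of those booleans numbers each consecutive run.
group_id = s.ne(s.shift()).cumsum()
print(group_id.tolist())  # [1, 1, 2, 3, 3, 4]
```

Note that the trailing "2" gets group 4, not group 2 — only *consecutive* equal values are merged, which is exactly why this differs from a plain `groupby('Class Number')`.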
Aggregate numbers that are in sequence
You could do it without PL/SQL, using a query with some common table expressions (with clause). It would look like this:
with add_break as (
select part_no,
serial_no,
serial_no-1-lag(serial_no,1,0) over (partition by part_no order by serial_no) brk
from part_tab
),
add_group as (
select add_break.*,
sum(brk) over (partition by part_no order by serial_no) as grp
from add_break
)
select part_no,
case when min(serial_no) = max(serial_no) then to_char(min(serial_no))
else to_char(min(serial_no)) || '-' || to_char(max(serial_no))
end range
from add_group
group by part_no, grp
order by 1, 2
Output for your example data:
part_no | range
--------+------
A | 1-3
A | 5
A | 7-10
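The break-and-group logic above can also be sketched in plain Python (my own translation, not part of the answer): a serial number minus its position in the sorted list is constant for each consecutive run.

```python
from itertools import groupby

def serial_ranges(serials):
    """Collapse sorted serial numbers into 'lo-hi' or 'n' range strings."""
    out = []
    # (serial - index) is constant within each consecutive run,
    # mirroring the lag/sum window trick in the SQL above.
    for _, run in groupby(enumerate(sorted(serials)), key=lambda t: t[1] - t[0]):
        nums = [n for _, n in run]
        out.append(str(nums[0]) if len(nums) == 1 else f"{nums[0]}-{nums[-1]}")
    return out

print(serial_ranges([1, 2, 3, 5, 7, 8, 9, 10]))  # ['1-3', '5', '7-10']
```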
Group by sequential data in R
Here is one dplyr option -
library(dplyr)
df %>%
group_by(gene_name) %>%
mutate(grp = gene_number - lag(gene_number, default = 0) > 2) %>%
group_by(grp = cumsum(grp)) %>%
filter(n() > 1) %>%
ungroup
# gene_name gene_number grp
#  <chr>              <int> <int>
#1 ENSMUSG00000000001 4732 1
#2 ENSMUSG00000000001 4733 1
#3 ENSMUSG00000000058 7603 2
#4 ENSMUSG00000000058 7604 2
#5 ENSMUSG00000000058 8246 3
#6 ENSMUSG00000000058 8248 3
For each gene_name, subtract the previous gene_number value from the current one and increment the group count if the difference is greater than 2. Drop the rows of any group that contains only a single row.
data
df <- structure(list(gene_name = c("ENSMUSG00000000001", "ENSMUSG00000000001",
"ENSMUSG00000000058", "ENSMUSG00000000058", "ENSMUSG00000000058",
"ENSMUSG00000000058", "ENSMUSG00000000058"), gene_number = c(4732L,
4733L, 7603L, 7604L, 8246L, 8248L, 9001L)),
class = "data.frame", row.names = c(NA, -7L))
GROUP BY consecutive rows where columns are equal or NULL
For a fixed set of three columns, this could be a possible solution.
http://sqlfiddle.com/#!17/45dc7/137
Disclaimer: This will not work if the same values can occur in different columns. E.g. one row (42, NULL, "A42", NULL) and one row (23, "A42", NULL, NULL) will lead to unwanted results. The fix for that is to concatenate a column identifier with a unique delimiter onto each string and remove it after the operation by splitting the string.
WITH test_table as (
SELECT *,
array_remove(ARRAY[column1,column2,column3], null) as arr, -- A
cardinality(array_remove(ARRAY[column1,column2,column3], null))as arr_len
FROM test_table )
SELECT
s.array_agg as aggregates, -- G
MAX(tt.column1) as column1,
MAX(tt.column2) as column2,
MAX(tt.column3) as column3
FROM (
SELECT array_agg(id) FROM -- E
(SELECT DISTINCT ON (t1.id)
t1.id, CASE WHEN t1.arr_len >= t2.arr_len THEN t1.arr ELSE t2.arr END as arr -- C
FROM
test_table as t1
JOIN -- B
test_table as t2
ON t1.arr @> t2.arr AND COALESCE(t2.column1, t2.column2, t2.column3) IS NOT NULL
OR t2.arr @> t1.arr AND COALESCE(t1.column1, t1.column2, t1.column3) IS NOT NULL
ORDER BY t1.id, GREATEST(t1.arr_len, t2.arr_len) DESC -- D
) s
GROUP BY arr
UNION
SELECT
ARRAY[id]
FROM test_table tt
WHERE COALESCE(tt.column1, tt.column2, tt.column3) IS NULL) s -- F
JOIN test_table tt ON tt.id = ANY (s.array_agg)
GROUP BY s.array_agg
A: Aggregate the column values, removing the NULL values. The reason is that I check for subsets later, which will not work with NULLs. This is the point where you should add the column identifier mentioned in the disclaimer above.
B: Join the table against itself. Here I am checking whether one column aggregate is a subset of another. Rows with only NULL values are ignored (this is what the COALESCE function is for).
C: For each id, take the column array with the greater length, either from the first or from the second table.
D: With the ORDER BY on the greatest array length and the DISTINCT ON, it is ensured that only the longest array is kept for each id.
E: Now there are many ids with the same column array sets. The array sets are used to aggregate the ids. Here the ids are put together.
F: Add all rows that contain only NULL values.
G: One last JOIN against all columns. The rows taken are those that are part of the id aggregation from (E). After that, the MAX value is grouped per column.
Edit: Fiddle for PostgreSQL 9.3 (array_length instead of the cardinality function) and an added test case (8, 'A2', 'A3', 'A8'):
http://sqlfiddle.com/#!15/8800d/2
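The core subset-merging idea of steps A–E can be sketched in plain Python (data and names are my own, not from the answer): rows whose non-NULL value sets are subsets of another row's set get aggregated under the largest such set.

```python
from collections import defaultdict

# Hypothetical rows: id -> column values, with None standing in for NULL.
rows = {
    1: (42, None, "A42"),
    2: (42, None, None),
    3: (None, "B1", None),
}

# A: the non-NULL value set per row (arrays in the SQL above).
arr = {rid: frozenset(v for v in vals if v is not None)
       for rid, vals in rows.items()}

# B/C/D: for each row, pick the longest set it is in a subset relation with.
best = {rid: max((a for a in arr.values() if a >= s or s >= a), key=len)
        for rid, s in arr.items()}

# E: aggregate the ids that share the same winning set.
groups = defaultdict(list)
for rid, winner in best.items():
    groups[winner].append(rid)

print(sorted(sorted(ids) for ids in groups.values()))  # [[1, 2], [3]]
```

Row 2's set {42} is a subset of row 1's set {42, "A42"}, so the two ids end up in one group, while row 3 stays alone — the same behaviour the query produces with @> on arrays.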
GROUP BY consecutive dates delimited by gaps
create table t ("date" date, "value" int);
insert into t ("date", "value") values
('2011-10-31', 2),
('2011-11-01', 8),
('2011-11-02', 10),
('2012-09-13', 1),
('2012-09-14', 4),
('2012-09-15', 5),
('2012-09-16', 20),
('2012-10-30', 10);
Simpler and cheaper version:
select min("date"), max("date"), sum(value)
from (
select
"date", value,
"date" - (dense_rank() over(order by "date"))::int g
from t
) s
group by s.g
order by 1
My first try was more complex and expensive:
create temporary sequence s;
select min("date"), max("date"), sum(value)
from (
select
"date", value, d,
case
when lag("date", 1, null) over(order by s.d) is null and "date" is not null
then nextval('s')
when lag("date", 1, null) over(order by s.d) is not null and "date" is not null
then lastval()
else 0
end g
from
t
right join
generate_series(
(select min("date") from t)::date,
(select max("date") from t)::date + 1,
'1 day'
) s(d) on s.d::date = t."date"
) q
where g != 0
group by g
order by 1
;
drop sequence s;
The output:
min | max | sum
------------+------------+-----
2011-10-31 | 2011-11-02 | 20
2012-09-13 | 2012-09-16 | 30
2012-10-30 | 2012-10-30 | 10
(3 rows)
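The dense_rank trick in the simpler query also carries over to pandas. A sketch (my own, using the same sample data): subtracting a dense rank, expressed in days, from each date leaves a constant per gap-free run.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-10-31", "2011-11-01", "2011-11-02",
        "2012-09-13", "2012-09-14", "2012-09-15", "2012-09-16",
        "2012-10-30"]),
    "value": [2, 8, 10, 1, 4, 5, 20, 10],
})

# date - dense_rank (in days) is constant within each consecutive run.
g = df["date"] - pd.to_timedelta(df["date"].rank(method="dense"), unit="D")

out = (df.groupby(g)
         .agg(min_date=("date", "min"),
              max_date=("date", "max"),
              total=("value", "sum"))
         .reset_index(drop=True))
```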
Aggregate pandas dataframe by n consecutive lines
Do it with groupby + agg:
s = df.groupby(df.index//2).agg({'Open':'first','High':'max','Low':'min','Close':'last'})
Open High Low Close
0 1 10 0 4
1 3 8 2 7
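A self-contained version of the same idea (the sample OHLC data is my own): integer division of the index collapses every n consecutive rows into one bucket.

```python
import pandas as pd

# Hypothetical OHLC rows; every 2 consecutive rows become one bar.
df = pd.DataFrame({
    "Open":  [1, 2, 3, 5],
    "High":  [10, 9, 8, 7],
    "Low":   [0, 1, 2, 3],
    "Close": [2, 4, 5, 7],
})

n = 2
# index // n maps rows 0,1 -> 0 and rows 2,3 -> 1.
s = df.groupby(df.index // n).agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last"})
```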
group rows in a pandas data frame when the difference of consecutive rows are less than a value
You can do a named aggregation on groupby:
(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
.agg(col1=('col1','first'),
col2=('col2','sum'),
col3=('col3','sum'),
col4=('col1','last'))
)
Output:
col1 col2 col3 col4
0 1 7 10 4
1 7 8 15 9
2 15 1 12 15
Update: without named aggregation, you can do something like this:
groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
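A runnable version of the named-aggregation approach (the sample data is my own guess, chosen so the groups break where col1 jumps by 3 or more):

```python
import pandas as pd

# Hypothetical input: a new group starts wherever col1 jumps by >= 3.
df = pd.DataFrame({
    "col1": [1, 2, 4, 7, 9, 15],
    "col2": [2, 2, 3, 4, 4, 1],
    "col3": [3, 3, 4, 7, 8, 12],
})

out = (df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
         .agg(col1=("col1", "first"),
              col2=("col2", "sum"),
              col3=("col3", "sum"),
              col4=("col1", "last")))
```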