Combine consecutive date ranges
The strange bit you see with my use of the date '32121231' is just a very large date to handle your "no-end-date" scenario. I have assumed you won't really have many date ranges per employee, so I've used a simple recursive common table expression to combine the ranges.
To make it run faster, the starting anchor query keeps only those dates that will not link up to a prior range (per employee). The rest is just tree-walking the date ranges and growing the range. The final GROUP BY keeps only the largest date range built up per starting ANCHOR (employmentid, startdate) combination.
MS SQL Server 2008 Schema Setup:
create table Tbl (
employmentid int,
startdate datetime,
enddate datetime);
insert Tbl values
(5, '2007-12-03', '2011-08-26'),
(5, '2013-05-02', null),
(30, '2006-10-02', '2011-01-16'),
(30, '2011-01-17', '2012-08-12'),
(30, '2012-08-13', null),
(66, '2007-09-24', null);
/*
-- expected outcome
EmploymentId StartDate EndDate
5 2007-12-03 2011-08-26
5 2013-05-02 NULL
30 2006-10-02 NULL
66 2007-09-24 NULL
*/
Query 1:
;with cte as (
select a.employmentid, a.startdate, a.enddate
from Tbl a
left join Tbl b on a.employmentid=b.employmentid and a.startdate-1=b.enddate
where b.employmentid is null
union all
select a.employmentid, a.startdate, b.enddate
from cte a
join Tbl b on a.employmentid=b.employmentid and b.startdate-1=a.enddate
)
select employmentid,
startdate,
nullif(max(isnull(enddate,'32121231')),'32121231') enddate
from cte
group by employmentid, startdate
order by employmentid
Results:
| EMPLOYMENTID | STARTDATE | ENDDATE |
-----------------------------------------------------------------------------------
| 5 | December, 03 2007 00:00:00+0000 | August, 26 2011 00:00:00+0000 |
| 5 | May, 02 2013 00:00:00+0000 | (null) |
| 30 | October, 02 2006 00:00:00+0000 | (null) |
| 66 | September, 24 2007 00:00:00+0000 | (null) |
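The anchor-and-extend idea can also be sketched outside SQL. The following is a minimal Python illustration, not a translation of the query: it assumes each employee has at most one range per start date, uses None in place of NULL, and the helper name combine_ranges is made up for this example. Because it walks each chain once and keeps only the last end date, it does not need the large sentinel date that the GROUP BY relies on.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

# Same rows as the Tbl sample data; None plays the role of NULL (no end date).
rows = [
    (5, date(2007, 12, 3), date(2011, 8, 26)),
    (5, date(2013, 5, 2), None),
    (30, date(2006, 10, 2), date(2011, 1, 16)),
    (30, date(2011, 1, 17), date(2012, 8, 12)),
    (30, date(2012, 8, 13), None),
    (66, date(2007, 9, 24), None),
]

def combine_ranges(rows):
    # Index ranges by (employee, start) so a successor can be looked up directly.
    by_start = {(emp, start): end for emp, start, end in rows}
    ends = {(emp, end) for emp, _, end in rows if end is not None}
    merged = []
    for emp, start, end in rows:
        # Anchor test: skip ranges that continue a range ending the day before.
        if (emp, start - ONE_DAY) in ends:
            continue
        # Walk forward while another range starts the day after the current end.
        while end is not None and (emp, end + ONE_DAY) in by_start:
            end = by_start[(emp, end + ONE_DAY)]
        merged.append((emp, start, end))
    return sorted(merged, key=lambda r: (r[0], r[1]))

result = combine_ranges(rows)
for row in result:
    print(row)
```

With the sample data this yields the same four rows as the expected outcome listed in the schema comment.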
Merge consecutive and overlapping date ranges
- within customer: groupby("cust", as_index=False)
- look for overlapping dates in rows; cumsum() sums booleans to generate a group for overlapping dates
- finally, a simple case of min/max within groups
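import pandas as pd

df.groupby("cust", as_index=False).apply(
    # sort first, so shift() sees each customer's ranges in date order
    lambda d: d.sort_values(["start_date", "end_date"]).pipe(
        lambda s: s.groupby(
            # a new group starts where a row neither overlaps nor directly follows the previous row
            ["cust", (~(s["start_date"] <= (s["end_date"].shift() + pd.Timedelta(days=1)))).cumsum()],
            as_index=False,
        ).agg({"start_date": "min", "end_date": "max"})
    )
).reset_index(drop=True)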
| | cust | start_date | end_date |
|---|---|---|---|
| 0 | CUST123 | 2021-01-01 00:00:00 | 2021-01-31 00:00:00 |
| 1 | CUST123 | 2021-02-02 00:00:00 | 2021-02-28 00:00:00 |
| 2 | CUST456 | 2021-01-05 00:00:00 | 2021-01-31 00:00:00 |
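One caveat with the shift() comparison: each row is compared only with the row directly before it, so a short range nested inside an earlier, longer one can split a group. The sketch below is a plain-Python variant, not the pandas approach itself, that tracks the running maximum end date instead. The input rows and the function name merge_ranges are hypothetical, and it assumes every range has an end date.

```python
from datetime import date, timedelta
from itertools import groupby

# Hypothetical input rows: (cust, start_date, end_date)
rows = [
    ("CUST123", date(2021, 1, 10), date(2021, 1, 15)),  # nested inside the next row
    ("CUST123", date(2021, 1, 1), date(2021, 1, 31)),
    ("CUST123", date(2021, 1, 20), date(2021, 2, 5)),   # overlaps the long range
    ("CUST456", date(2021, 3, 1), date(2021, 3, 10)),
    ("CUST456", date(2021, 3, 11), date(2021, 3, 20)),  # starts the day after
    ("CUST456", date(2021, 4, 1), date(2021, 4, 2)),    # separate range
]

def merge_ranges(rows, gap=timedelta(days=1)):
    merged = []
    ordered = sorted(rows)  # by cust, then start_date, then end_date
    for cust, group in groupby(ordered, key=lambda r: r[0]):
        current_start = current_end = None
        for _, start, end in group:
            if current_start is not None and start <= current_end + gap:
                # Overlapping or consecutive: extend using the running maximum end.
                current_end = max(current_end, end)
            else:
                if current_start is not None:
                    merged.append((cust, current_start, current_end))
                current_start, current_end = start, end
        merged.append((cust, current_start, current_end))
    return merged

merged = merge_ranges(rows)
for row in merged:
    print(row)
```

Here the three CUST123 rows collapse into a single range from 2021-01-01 to 2021-02-05, which a previous-row-only comparison would split at the nested row.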
Merging records with consecutive dates in SQL
This is a CTE so it'll all need to be executed as one, but I'll explain as I go.
First I'll set the parameters for the date range we are interested in:
DECLARE @StartDate DateTime; SET @StartDate = '2011-11-01';
DECLARE @EndDate DateTime; SET @EndDate = '2011-11-30';
Then I'll turn them into a list of dates using a recursive CTE
WITH
ValidDates ( ValidDate ) AS
(
SELECT @StartDate
UNION ALL
SELECT DateAdd(day, 1, ValidDate)
FROM ValidDates
WHERE ValidDate < @EndDate
),
By joining that with the ranges in the original records I get a list of individual days of absence.
Using a combination of row_number and datediff I can group consecutive dates. This assumes that there are no duplicates.
DaysAbsent AS
(
SELECT
A.RecordID
, A.EmpID
, A.AbsCode
, DateDiff(Day, @StartDate, D.ValidDate)
- row_number()
over (partition by A.EmpID, A.AbsCode
order by D.ValidDate) AS DayGroup
, D.ValidDate AS AbsentDay
FROM
dbo.Absence A
INNER JOIN
ValidDates D
ON D.ValidDate >= DateFrom
and D.ValidDate <= DateTo
)
Now it's a simple select with min and max to turn it back into ranges.
SELECT
EmpID
, AbsCode
, MIN(AbsentDay) AS DateFrom
, MAX(AbsentDay) AS DateTo
FROM
DaysAbsent
GROUP BY
EmpID
, AbsCode
, DayGroup
The DayGroup isn't needed in the output but is needed for the grouping; otherwise non-consecutive groups will be collapsed into one.
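To see why the DateDiff minus row_number expression works as a grouping key, here is a small Python sketch of the same arithmetic. The absence days and the helper name to_ranges are invented for illustration, and it covers a single employee and absence code only.

```python
from datetime import date
from itertools import groupby

# Hypothetical absence days for one employee and one absence code,
# already expanded to single days as in the DaysAbsent step.
absent_days = [
    date(2011, 11, 1), date(2011, 11, 2), date(2011, 11, 3),
    date(2011, 11, 7), date(2011, 11, 8),
    date(2011, 11, 15),
]

def to_ranges(days, origin):
    ordered = sorted(set(days))  # the trick assumes no duplicate days
    # (day offset - row number) stays constant across a run of consecutive days
    # and jumps whenever a day is skipped.
    keyed = [((day - origin).days - row_number, day)
             for row_number, day in enumerate(ordered, start=1)]
    ranges = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        run_days = [day for _, day in run]
        ranges.append((run_days[0], run_days[-1]))
    return ranges

ranges = to_ranges(absent_days, origin=date(2011, 11, 1))
for date_from, date_to in ranges:
    print(date_from, date_to)
```

The keys for these six days are -1, -1, -1, 2, 2, 8, so the days fall into three groups, one per consecutive run.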
How to merge consecutive date into one date?
This is a type of gaps-and-islands problem. You can use lag() and a cumulative sum to identify the groupings. Then aggregate:
select col1, col2, min(col3), max(col4)
from (select t.*,
sum(case when prev_col4 = col3 then 0 else 1 end) over
(partition by col1, col2
order by col3
rows between unbounded preceding and current row
) as grp
from (select t.*,
lag(col4) over (partition by col1, col2 order by col3) as prev_col4
from t
) t
) t
group by col1, col2, grp;
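The intermediate columns are easier to follow on a tiny example. This Python sketch mirrors the three steps (lag, flag, running sum) for a single (col1, col2) partition; the rows use plain integers instead of dates purely for brevity, and merge_islands is a made-up name.

```python
from itertools import accumulate

# Hypothetical (col3, col4) pairs for one (col1, col2) partition.
rows = [(1, 3), (3, 5), (5, 8), (10, 12), (12, 15), (20, 21)]

def merge_islands(rows):
    ordered = sorted(rows)                                     # order by col3
    prev_col4 = [None] + [col4 for _, col4 in ordered[:-1]]    # lag(col4)
    # 0 when the row continues the previous one, 1 when it starts a new island
    flags = [0 if prev == col3 else 1
             for prev, (col3, _) in zip(prev_col4, ordered)]
    grp = list(accumulate(flags))                              # cumulative sum
    islands = {}
    for g, (col3, col4) in zip(grp, ordered):
        lo, hi = islands.get(g, (col3, col4))
        islands[g] = (min(lo, col3), max(hi, col4))            # min(col3), max(col4)
    return grp, [islands[g] for g in sorted(islands)]

grp, merged = merge_islands(rows)
print(grp)
print(merged)
```

Every row that starts a new island bumps the running sum by one, so all rows of one island share the same grp value.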
Joining together consecutive date validity intervals
This is a gaps-and-islands problem. There are various ways to approach it; this uses lead and lag analytic functions:
select distinct product,
case when start_date is null then lag(start_date)
over (partition by product order by rn) else start_date end as start_date,
case when end_date is null then lead(end_date)
over (partition by product order by rn) else end_date end as end_date
from (
select product, start_date, end_date, rn
from (
select t.product,
case when lag(end_date)
over (partition by product order by start_date) is null
or lag(end_date)
over (partition by product order by start_date) != start_date - 1
then start_date end as start_date,
case when lead(start_date)
over (partition by product order by start_date) is null
or lead(start_date)
over (partition by product order by start_date) != end_date + 1
then end_date end as end_date,
row_number() over (partition by product order by start_date) as rn
from t
)
where start_date is not null or end_date is not null
)
order by start_date, product;
PRODUCT START_DATE END_DATE
------- ---------- ---------
A 01-JUL-13 30-SEP-13
B 01-OCT-13 30-NOV-13
A 01-DEC-13 31-MAR-14
The innermost query looks at the preceding and following records for the product, and only retains the start and/or end date if the records are not contiguous:
select t.product,
case when lag(end_date)
over (partition by product order by start_date) is null
or lag(end_date)
over (partition by product order by start_date) != start_date - 1
then start_date end as start_date,
case when lead(start_date)
over (partition by product order by start_date) is null
or lead(start_date)
over (partition by product order by start_date) != end_date + 1
then end_date end as end_date
from t;
PRODUCT START_DATE END_DATE
------- ---------- ---------
A       01-JUL-13
A
A                  30-SEP-13
A       01-DEC-13
A
A
A                  31-MAR-14
B       01-OCT-13
B                  30-NOV-13
The next level of select removes those which are mid-period, where both dates were blanked by the inner query, which gives:
PRODUCT START_DATE END_DATE
------- ---------- ---------
A       01-JUL-13
A                  30-SEP-13
A       01-DEC-13
A                  31-MAR-14
B       01-OCT-13
B                  30-NOV-13
The outer query then collapses those adjacent pairs. I've used the easy route of creating duplicates and then eliminating them with distinct. You could do it other ways, such as putting both values into one row of each pair, leaving both values null in the other, and then eliminating the all-null rows with another layer of select, but I think distinct is OK here.
If your real-world use case has times, not just dates, then you'll need to adjust the comparison in the inner query: rather than +/- 1, use an interval of 1 second perhaps, or 1/86400 if you prefer, depending on the precision of your values.
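As a rough illustration of both points, the boundary-marking idea and the adjustable step, here is a Python sketch for a single product. The function name collapse, its step parameter and the sample timestamps are assumptions made for this example. It keeps a start only when nothing ends exactly one step before it, keeps an end only when nothing starts exactly one step after it, and then pairs the survivors in order.

```python
from datetime import datetime, timedelta

def collapse(intervals, step):
    """Merge intervals where the next start equals the previous end plus `step`."""
    ordered = sorted(intervals)
    starts, ends = [], []
    for i, (start, end) in enumerate(ordered):
        prev_end = ordered[i - 1][1] if i > 0 else None
        next_start = ordered[i + 1][0] if i + 1 < len(ordered) else None
        # Keep a start only when no interval ends exactly one step before it.
        if prev_end is None or prev_end != start - step:
            starts.append(start)
        # Keep an end only when no interval starts exactly one step after it.
        if next_start is None or next_start != end + step:
            ends.append(end)
    # Surviving starts and ends pair up in order, one pair per merged interval.
    return list(zip(starts, ends))

# Hypothetical validity intervals with one-second precision.
intervals = [
    (datetime(2013, 7, 1, 0, 0, 0), datetime(2013, 7, 1, 11, 59, 59)),
    (datetime(2013, 7, 1, 12, 0, 0), datetime(2013, 7, 1, 23, 59, 59)),
    (datetime(2013, 7, 2, 6, 0, 0), datetime(2013, 7, 2, 7, 0, 0)),
]

by_second = collapse(intervals, step=timedelta(seconds=1))
by_day = collapse(intervals, step=timedelta(days=1))
print(by_second)
print(by_day)
```

With a one-second step the first two intervals join into one; with a one-day step none of them are treated as contiguous, which is why the step has to match the precision of the stored values.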
Find non consecutive date ranges
You can detect gaps with LAG() and mark them. Then, it's easy to filter out the rows. For example:
select *
from (
select *,
case when dateadd(day, -1, start_date) >
lag(end_date) over(partition by client_id order by start_date)
then 1 else 0 end as i
from t
) x
where i = 1
Or simpler...
select *
from (
select *,
lag(end_date) over(partition by client_id order by start_date) as prev_end
from t
) x
where dateadd(day, -1, start_date) > prev_end
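The same check can be sketched in Python for one client. The rows and the name rows_after_gap are hypothetical. As in the SQL, the first row is never returned, because it has no previous end date to compare against.

```python
from datetime import date, timedelta

# Hypothetical rows for one client_id: (start_date, end_date)
rows = [
    (date(2020, 1, 1), date(2020, 1, 31)),
    (date(2020, 2, 1), date(2020, 2, 29)),   # starts the day after the previous end
    (date(2020, 3, 15), date(2020, 3, 31)),  # starts after a gap
    (date(2020, 3, 20), date(2020, 4, 10)),  # overlaps the previous row
]

def rows_after_gap(rows):
    ordered = sorted(rows)  # order by start_date
    flagged = []
    # Pair each row with its predecessor, which plays the role of lag(end_date).
    for (_, prev_end), (start, end) in zip(ordered, ordered[1:]):
        if start - timedelta(days=1) > prev_end:
            flagged.append((start, end))
    return flagged

gaps = rows_after_gap(rows)
for row in gaps:
    print(row)
```

Only the row starting on 2020-03-15 is reported, since it is the only one whose start is more than one day after the preceding end date.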