Merge Duplicate Temporal Records in Database

Merge duplicate temporal records in database

Try this if you can ensure that all the start and end dates are contiguous:

with t1 as  -- tag the first row of each continuous series with 1
(
    select a.*,
           case when a.Column1 = b.Column1 and a.Column2 = b.Column2
                then 0 else 1 end as tag
    from your_table a
    left join your_table b  -- b is the row immediately preceding a
      on a.EmployeeId = b.EmployeeId
     and dateadd(day, -1, a.StartDate) = b.EndDate
)
select t1.EmployeeId, t1.StartDate,
       case when min(t2.StartDate) is null
            -- last group: close it with the employee's final EndDate
            then (select max(EndDate) from your_table x where x.EmployeeId = t1.EmployeeId)
            else dateadd(day, -1, min(t2.StartDate)) end as EndDate,
       t1.Column1, t1.Column2
from (select * from t1 where tag = 1) as t1      -- group start rows: provide StartDate
left join (select * from t1 where tag = 1) as t2 -- later group starts: derive the new EndDate
  on t1.EmployeeId = t2.EmployeeId and t1.StartDate < t2.StartDate
group by t1.EmployeeId, t1.StartDate, t1.Column1, t1.Column2
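
For reference, a minimal sketch of the table shape the query above assumes; the column names are taken from the query, the types are a guess:

create table your_table (
    EmployeeId int,
    StartDate  date,
    EndDate    date,
    Column1    char(1),
    Column2    char(1)
);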

Merge consecutive duplicate records including time range

How about this?

create table test_table (EmployeeId int, StartDate date, EndDate date, Column1 char(1), Column2 char(1));

insert into test_table values
 (1000, '2009-05-01', '2010-04-30', 'X', 'Y')
,(1000, '2010-05-01', '2011-04-30', 'X', 'Y')
,(1000, '2011-05-01', '2012-04-30', 'X', 'X')
,(1000, '2012-05-01', '2013-04-30', 'X', 'Y')
,(1000, '2013-05-01', '2014-04-30', 'X', 'X')
,(1000, '2014-05-01', '2014-06-01', 'X', 'X');

SELECT EmployeeId, StartDate, EndDate, Column1, Column2
FROM
(
    SELECT EmployeeId, StartDate
         , MAX(EndDate) OVER (PARTITION BY EmployeeId, RN) AS EndDate  -- last EndDate in each group
         , Column1
         , Column2
         , DIFF
    FROM
    (
        SELECT t.*
             , SUM(DIFF) OVER (PARTITION BY EmployeeId ORDER BY StartDate) AS RN  -- running group number
        FROM
        (
            SELECT t.*
                 , CASE WHEN Column1 = LAG(Column1, 1) OVER (PARTITION BY EmployeeId ORDER BY StartDate)
                         AND Column2 = LAG(Column2, 1) OVER (PARTITION BY EmployeeId ORDER BY StartDate)
                        THEN 0 ELSE 1 END AS DIFF  -- 1 marks the first row of a new group
            FROM test_table t
        ) t
    ) grp  -- derived-table aliases added: SQL Server requires them
) merged
WHERE DIFF = 1;
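
For the sample data above, consecutive rows with identical Column1/Column2 values collapse, so the six input rows come back as four:

EmployeeId  StartDate   EndDate     Column1  Column2
1000        2009-05-01  2011-04-30  X        Y
1000        2011-05-01  2012-04-30  X        X
1000        2012-05-01  2013-04-30  X        Y
1000        2013-05-01  2014-06-01  X        X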

Why am I getting duplicate rows with these postgresql temporal db schema & queries?

You insert data into both tables, temporal.countries and history.countries, with the latter inheriting from the former. That is the wrong approach: you should insert only into history.countries, which carries the additional attributes. When you then query temporal.countries you see a single record, but without the valid from/to information.

Once you update a record you will get duplicates. There is no way around that with your current approach. But you don't really need inheritance to begin with. You can have two separate tables and then create a view public.countries that returns currently valid rows from temporal.countries:

create table temporal.countries (
    id   serial primary key,
    name varchar unique
);

create table history.countries (
    hid     serial primary key,
    country integer not null references temporal.countries,
    name    varchar,

    valid_from  timestamp not null,
    valid_to    timestamp not null default '9999-12-31',
    recorded_at timestamp not null default now(),

    constraint from_before_to check (valid_from < valid_to),

    -- no two history rows for the same country may have overlapping validity
    constraint overlapping_times exclude using gist (
        box(
            point( extract( epoch from valid_from), country ),
            point( extract( epoch from valid_to - interval '1 millisecond'), country )
        ) with &&
    )
);  -- note: no "inherits (temporal.countries)" - the two tables stay separate, as described above

Now create the view to return only currently valid countries:

create view public.countries as
select c.*
from temporal.countries c
join history.countries h on h.country = c.id
where localtimestamp between h.valid_from and h.valid_to;

And your three rules:

-- INSERT - insert the data in temporal.countries and the versioning metadata in history.countries
create rule countries_ins as on insert to public.countries do instead (
    insert into temporal.countries ( name )
    values ( new.name )
    returning temporal.countries.*;

    insert into history.countries ( country, name, valid_from )
    values ( currval('temporal.countries_id_seq'), new.name, now() )
);

-- UPDATE - close the latest history entry, save the new data in
-- a new history entry, and update the current table with the new data.
create rule countries_upd as on update to public.countries do instead (
    update history.countries
    set valid_to = now()
    where country = old.id and valid_to = '9999-12-31'; -- close only the currently open entry

    insert into history.countries ( country, name, valid_from )
    values ( old.id, new.name, now() );

    update temporal.countries
    set name = new.name
    where id = old.id
);

-- DELETE - close the current entry in the history, then remove the row
-- from the current table.

create rule countries_del as on delete to public.countries do instead (
    update history.countries
    set valid_to = localtimestamp
    where country = old.id and valid_to = '9999-12-31';

    -- the history keeps the old data, and once the entry is closed
    -- the view no longer shows this country
    delete from temporal.countries
    where temporal.countries.id = old.id
);
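
A quick smoke test of the three rules; 'Italy' and 'Italia' are just example data:

insert into public.countries ( name ) values ( 'Italy' );
update public.countries set name = 'Italia' where name = 'Italy';

-- the view shows a single row: the current version only
select * from public.countries;

-- the history shows both versions with their validity ranges
select country, name, valid_from, valid_to
from history.countries
order by valid_from;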

Why doesn't MERGE insert more than 277 records into a table configured as a temporal table with a non-clustered index on the history table?

Azure SQL Database sometimes builds an invalid execution plan for your merge-insert.

When it decides to maintain the column store history table using a single operator (a narrow plan) everything is fine. This trivially includes the case where the history table has no secondary indexes.

When it decides to maintain the history table using separate operators for the base table and secondary indexes (a wide plan) things go wrong when using the OUTPUT INTO option. The choice of plan is sensitive to cardinality estimates.

For example, the plan for OUTPUT only (not writing to a table variable) includes a table spool. The spool saves the rows before the Filter that removes any rows from the stream where ValidTo is null. The spool then replays the (unfiltered) rows for return to the client:

[Execution plan image: plan with spool]

When OUTPUT INTO is used, the same stream is used for both maintaining the secondary index on the history table and providing rows for the output table. This creates a problem because a pure insert results in no rows added to the history, so all rows are filtered out.

[Execution plan image: no spool]

A column store index is not required on the history table for this problem to manifest.
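
For reference, a minimal sketch of the kind of setup involved; every name below is illustrative rather than taken from the original report:

-- System-versioned temporal table with an explicitly named history table
CREATE TABLE dbo.Demo
(
    Id  int NOT NULL PRIMARY KEY CLUSTERED,
    Val int NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.DemoHistory));

-- A plain non-clustered secondary index on the history table is enough
-- to push the optimizer toward the wide (per-index) maintenance plan.
CREATE NONCLUSTERED INDEX IX_DemoHistory_Val ON dbo.DemoHistory (Val);

DECLARE @out table (Id int, ActionTaken nvarchar(10));

-- A pure insert through MERGE ... OUTPUT INTO: under the defective plan
-- shape, some of the source rows silently never reached the base table.
MERGE dbo.Demo AS t
USING (SELECT TOP (1000)
              ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Id
       FROM sys.all_columns) AS s
   ON t.Id = s.Id
WHEN NOT MATCHED THEN
    INSERT (Id, Val) VALUES (s.Id, s.Id)
OUTPUT inserted.Id, $action INTO @out;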

This is a product defect you should report directly to Microsoft Support by creating a support request in the Azure Portal.

Side note: rows arriving at the history table Index Insert are not actually inserted because the action column tells it not to. These details aren't exposed in showplan unfortunately. A possible fix would be to test the action as well as ValidTo in the filter.


The extra Filter does not appear on SQL Server 2019 CU16-GDR:

[Execution plan image: SQL Server 2019 plan]

This feels like a bug fix for implied nullability problems that has been applied to Azure SQL Database before the box product. If so, it is a little surprising it does not react to QO compatibility level hints.


Daniel C.:

Microsoft confirmed this as a bug and also rolled out a quick fix for one of our affected databases. I can confirm that this fix resolves my issue.

Is there a design pattern for merging duplicate database records?

As soon as you say "reversibility" I think Command Pattern.

The typical example is to support Undo style behaviour but I think this would be a good fit for auditability as well - especially as the individual "steps" (for want of a better word) are so small and easily represented (e.g. {Merged "Rocky I" -> "Rocky" }).

How would I get the Command pattern to actually work for your scenario?

Well, keeping this very much in the RDBMS arena rather than OO modelling, and assuming you've already got tables USER_FAVORITE and MOVIE, I'd add a new table USER_FAVORITE_MOVIE_MERGE_COMMAND with the following columns (a minimal DDL sketch follows the list):

  • id
  • date
  • user_id
  • old_favorite_movie_title
  • new_favorite_movie_title
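
A minimal DDL sketch; the types and sizes are assumptions, the column names follow the list above:

create table USER_FAVORITE_MOVIE_MERGE_COMMAND (
    id int identity(1,1) primary key,
    [date] datetime not null default getdate(),      -- when the merge command ran
    user_id int not null,
    old_favorite_movie_title varchar(200) not null,  -- e.g. 'Rocky I'
    new_favorite_movie_title varchar(200) not null   -- e.g. 'Rocky'
);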

So your nightly cleanup script (or whatever) runs over the USER_FAVORITE table looking for non-standard movie titles. Each time it finds one, it corrects it and records the pertinent facts in the USER_FAVORITE_MOVIE_MERGE_COMMAND table.

Your audit trail is right there, and if you ever need to reverse the cleanup job, "play back" the rows in reverse chronological order, replacing new with old.

Notice how you've got both reversibility and auditability, in the temporal sense (e.g. last night's batch run went weird at 2.12am, let's roll back all the work done after that) and in the per-user sense.
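
A sketch of that roll-back, assuming USER_FAVORITE has a movie_title column (a hypothetical name). If the same title could have been merged more than once per user, replay the command rows one at a time, newest first, instead of in one set-based pass:

declare @cutoff datetime = '20240101 02:12';  -- hypothetical: when the batch went wrong

update uf
set uf.movie_title = c.old_favorite_movie_title
from USER_FAVORITE as uf
join USER_FAVORITE_MOVIE_MERGE_COMMAND as c
  on c.user_id = uf.user_id
 and c.new_favorite_movie_title = uf.movie_title
where c.[date] >= @cutoff;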

Is this the sort of thing you're after?

Merge two records' date ranges if the dates are continuous and the key values are the same

For scenario 1 you could see this answer.

For scenario 2 there is also a similar answer you could reference.

But your question can be simplified like this:

with dates as
(
    select number, cast(ltrim(number * 10000 + 1231) as date) as dt  -- Dec 31 of each year
    from master..spt_values
    inner join
        (select min(year(startdate)) as s_year
              , max(year(enddate))   as e_year
         from Scenario_2) as y
        on number between y.s_year and y.e_year and type = 'P'  -- type 'P' rows supply numbers 0-2047
)
select s.item
     , case when year(dt) = year(startdate)
            then startdate
            else dateadd(year, -1, dateadd(day, 1, dt))  -- Jan 1 of dt's year; or cast(ltrim(year(dt)*10000+101) as date)
       end as startdate
     , case when year(dt) = year(enddate)
            then enddate
            else dt
       end as enddate
     , s.value
from Scenario_2 s
inner join dates d
    on d.number between year(s.startdate) and year(s.enddate)

Merging contacts in SQL table without creating duplicate entries
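
First delete the source person's subscriptions that the destination person already has (re-pointing those would create duplicates), then update whatever remains: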

--out with the bad
DELETE FROM MailingListSubscription
WHERE PersonId = @SourcePerson
  and ListID in (SELECT ListID FROM MailingListSubscription WHERE PersonID = @DestPerson);

--update the rest (good)
UPDATE MailingListSubscription
SET PersonId = @DestPerson
WHERE PersonId = @SourcePerson;

