Merge Identical Databases into One

It's never a trivial job to integrate databases when the records don't have unique primary keys across all databases. A few weeks ago I built such an integration script, for which I decided to use Entity Framework.

First the good news. With EF's DbContext API it's ridiculously easy to insert a complete object graph and have EF take care of all newly generated primary keys and foreign keys. The reason this is so easy is that when an object's state is changed to Added, all of the objects reachable from it become Added as well, and EF figures out the right order of the inserts. This is truly great! It let me build the core of the copy routine in a few hours, where doing it in T-SQL, for example, would have taken many days and would have been far more error-prone.
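
To make this concrete, here is a minimal sketch of such a copy step in C#. The ShopContext class and the Company/Customer/Order entities are hypothetical stand-ins for your own model; the point is only that adding the root marks the whole graph as Added:

using System.Data.Entity;   // EF6
using System.Linq;

static void CopyCompany(string sourceConnection, string targetConnection, int companyId)
{
    using (var source = new ShopContext(sourceConnection))
    using (var target = new ShopContext(targetConnection))
    {
        // Load the root with its child collections (a purely divergent graph, see below).
        // AsNoTracking keeps the source context from tracking the graph, so it
        // can be added to the target context.
        var company = source.Companies
            .AsNoTracking()
            .Include(c => c.Customers.Select(cu => cu.Orders))
            .Single(c => c.Id == companyId);

        // Adding the root marks every reachable object as Added; EF determines
        // the insert order and generates new primary and foreign keys.
        target.Companies.Add(company);
        target.SaveChanges();
    }
}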

Of course life isn't that easy. Now the bad news:

  1. This takes tons of machine resources. Of course I used a new context instance for each copy step, but I still had to run the program on a machine with a decent processor and a fair amount of memory. The exact specifications don't matter; the message is: test with the largest databases and see what kind of beast you need. If the memory consumption can't be handled by any machine at your disposal, you'll have to split the routine up into smaller chunks, but that will take more programming.

  2. The object graph that's changed to Added must be divergent. By this I mean that there should only be 1-n associations starting from the root. The reason is that EF really marks all reachable objects as Added. So if somewhere in the graph a few branches refer back to the same object (because there is an n-1 association), these "new" objects will be multiplied, because EF doesn't know their identity. An example of this could be Company -< Customer -< Order >- OrderType: when there are only 2 order types, inserting one root company with 10 customers with 10 orders each will create 100 order type records instead of 2.

    So the hard part is to find paths through your class structure that are as divergent as possible. This won't always be possible. If it isn't, you'll have to add the leaves of the converging paths first. In the example: first insert the order types. When a new company is inserted, you first load the existing order types into the context and then add the company. Now link the new orders to the existing order types. This can only be done if you can match objects by natural keys (in this example: the order type names), but usually this is possible; a sketch of this follows after this list.

  3. You must take care not to insert multiple copies of master data. Suppose the order types in the previous example are the same in all databases (although their primary keys may differ!). The order types from the source database should not be reinserted into the target database. Moreover, you must fix the references in the source data so they point to the correct records in the target database (again by matching on natural keys).
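
Here is a sketch of points 2 and 3 combined, again with the hypothetical Company/Order/OrderType model: load the existing master data (the order types) into the target context first, rewire the copied orders to those rows by natural key, and only then add the root.

using (var source = new ShopContext(sourceConnection))
using (var target = new ShopContext(targetConnection))
{
    // Master data that must not be duplicated: load the existing rows into
    // the target context, keyed by their natural key (the type name).
    var existingTypes = target.OrderTypes.ToDictionary(t => t.Name);

    var company = source.Companies
        .AsNoTracking()
        .Include(c => c.Customers.Select(cu => cu.Orders.Select(o => o.OrderType)))
        .Single(c => c.Id == companyId);

    // Rewire every order to the already tracked (Unchanged) order type, so that
    // only the divergent part of the graph is marked Added. If your model exposes
    // an explicit OrderTypeId foreign key property, set that as well.
    foreach (var order in company.Customers.SelectMany(cu => cu.Orders))
        order.OrderType = existingTypes[order.OrderType.Name];

    target.Companies.Add(company);
    target.SaveChanges();
}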

So although it wasn't trivial, it was doable and the job was done in a relatively short time. I'm sure that other alternatives (T-SQL, Integration Services, BIDS, if doable at all) would have taken more time or would have been more buggy. And the problem with bugs in this area is that they may only become apparent much later.

I later found out that the issues I describe under 2) are related to fetching the source objects with AsNoTracking. See this interesting post: Entity Framework 6 - use my getHashCode(). I used AsNoTracking because it performs better and it reduces memory consumption.
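
In other words: a tracked query performs identity resolution by primary key, so shared references come back as a single object instance, while AsNoTracking materializes a fresh instance for every row read, and it is those copies that get multiplied when the graph is added to another context. A small illustration with the hypothetical model from the sketches above:

// Tracked: two orders that share an order type reference the same instance;
// ReferenceEquals(tracked[0].OrderType, tracked[1].OrderType) is true when
// both rows point at the same OrderType row.
var tracked = source.Orders.Include(o => o.OrderType).Take(2).ToList();

// Untracked: identity resolution is skipped, so each order gets its own
// OrderType object even when the underlying row is the same.
var untracked = source.Orders.AsNoTracking().Include(o => o.OrderType).Take(2).ToList();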

How to merge data from two identical databases into one?

You say that both customers are using your application, so I assume that it's some kind of "shrink-wrap" software that is used by more customers than just these two, correct?

If yes, adding special columns to the tables or anything like that will probably cause pain in the future, because you would either have to maintain a special version for these two customers that can deal with the additional columns, or you would have to introduce these columns into your main codebase, which means that all your other customers would get them as well.

I can think of an easier way to do this without changing any of your tables or adding any columns.

In order for this to work, you need to find out the largest ID that exists in both databases together (no matter in which table or in which database it is).

This may require some copy & paste to get a lot of queries that look like this:

select max(id) as maxlocationid from locations
select max(id) as maxpersonid from persons
-- and so on... (one query for each table)

When you have found the largest ID after running the queries in both databases, take a number that's larger than that ID, and add it to all IDs in all tables in the second database.

It's very important that this number is larger than the largest ID that already exists in either database!

It's a bit difficult to explain, so here's an example:

Let's say that the largest ID in any table in both databases is 8000.

Then you run some SQL that adds 10000 to every ID in every table in the second database:

update Locations set Id = Id + 10000
update Persons set Id = Id + 10000, LocationId = LocationId + 10000
-- and so on, for each table

The queries are relatively simple, but this is the most work because you have to build a query like this manually for each table in the database, with the correct names of all the ID columns.

After running the query on the second database, the example data from your question will look like this:

Database 1: (exactly like before)

Locations:

Id    Name         Address   etc....
1     Location 1
2     Location 2

Persons:

Id    LocationId     Name     etc...
1     1              Alex
2     1              Peter
3     2              Lisa

Database 2:

Locations:

Id       Name         Address   etc....
10001    Location A
10002    Location B

Persons:

Id       LocationId     Name     etc...
10001    10001          Mark
10002    10002          Ashley
10003    10001          Ben

And that's it! Now you can import the data from one database into the other, without getting any primary key violations at all.

Merging Tables from 2 Different Databases into One with the Same Schema

You can pass :id + null, which evaluates to NULL for the id column, so that column will get an appropriate auto-generated value since it is defined as INTEGER PRIMARY KEY:

sql = "INSERT INTO albums VALUES (:id + null, :nr, :band, :song, :album, :duration)"
main_cursor = db_main.cursor()
# insert the rows one by one; the id column is auto-assigned
for row in output:
    main_cursor.execute(sql, row)

Or, with executemany() to avoid the for loop:

sql = "INSERT INTO albums  VALUES (:id + null, :nr, :band, :song, :album, :duration)"
main_cursor = db_main.cursor()
main_cursor.executemany(sql, output)

Merge Multiple Databases into a Single Database

For your first question: you have mentioned an identical schema and table structure. In that case, it's simply a matter of moving data from one DB (the smaller one) to the other (the larger one). For this you have to ensure that:

1) there is no duplicate data (at least in the PK fields)

2) the data is moved from one DB to the other; for SQL Server, refer to:

Transfer data from one database to another database

Merging databases: how to handle duplicate PKs

I have no first-hand experience with this, but it seems to me like you ought to be able to uniquely map PK -> New PK for each server. For instance, generate new PKs such that data from LA server has PK % 3 == 2, SF has PK % 3 == 1, and NY has PK % 3 == 0. And since, as I understood your question anyway, each server only stores FK relationships to its own data, you can update the FKs in identical fashion.

NewLA = OldLA*3 - 1
NewSF = OldSF*3 - 2
NewNY = OldNY*3

You can then merge those and have no duplicate PKs. This is essentially, as you already said, just generating new PKs, but structuring it this way allows you to trivially update your FKs (assuming, as I did, that the data on each server is isolated). Good luck.
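
A quick sketch of that mapping as hypothetical C# helpers, just to show that the three ranges land in disjoint residue classes and therefore cannot collide:

// LA ids satisfy id % 3 == 2, SF ids id % 3 == 1, NY ids id % 3 == 0.
static long NewLaId(long oldId) => oldId * 3 - 1;
static long NewSfId(long oldId) => oldId * 3 - 2;
static long NewNyId(long oldId) => oldId * 3;

// Example: old id 7 becomes 20 (LA), 19 (SF) and 21 (NY).
// Apply the same mapping to every FK column that references the remapped table.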

Merge two MySql databases except duplicate PKs

Use the --insert-ignore option to mysqldump. That will cause it to write INSERT IGNORE commands to the dump file, instead of ordinary INSERT statements. This causes duplicate keys to be skipped when inserting, instead of causing an error.

Merge databases with duplicate object id's

(The queries used in this method are for SQL Server. You will need the equivalent queries to do the same in MySQL.)

Database merge requires a thorough understanding of the data and the design of the database.

For example, the following solution can be used to merge these two databases:

1- With the following script, you can drop all the constraints in the second database in SQL Server:

use db2;
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'';

SELECT @sql = @sql + N'
ALTER TABLE ' + QUOTENAME(s.name) + N'.'
+ QUOTENAME(t.name) + N' DROP CONSTRAINT '
+ QUOTENAME(c.name) + ';'
FROM sys.objects AS c
INNER JOIN sys.tables AS t
ON c.parent_object_id = t.[object_id]
INNER JOIN sys.schemas AS s
ON t.[schema_id] = s.[schema_id]
WHERE c.[type] IN ('D','C','F','PK','UQ')
ORDER BY c.[type];

--PRINT @sql;
EXEC sys.sp_executesql @sql;

2- We start by merging the tables that do not have a foreign key into the first database. As we insert their rows into the first database's tables and new identity values are generated (I assume that your tables use identity columns), we update the related tables in the second database with those newly generated values. For example, look at the following script:

declare @I int = 0
declare @count_t1 int = (select count(*) from db2.dbo.table1)

DECLARE @LAST_object_id INT = 0;
DECLARE @OLD_object_id INT = 0;

WHILE @I < @count_t1
BEGIN
    SET @LAST_object_id = 0;
    SET @OLD_object_id = 0;

    -- Copy the @I-th row of db2.dbo.table1 into db1.dbo.table1;
    -- db1 generates a new identity value for it.
    INSERT INTO db1.dbo.table1 (name)
    SELECT name
    FROM db2.dbo.table1
    ORDER BY db2.dbo.table1.object_id
    OFFSET @I ROWS FETCH FIRST 1 ROWS ONLY

    -- Pick up the newly generated id in db1 and the old id in db2.
    SET @LAST_object_id = (SELECT TOP(1) object_id FROM db1.dbo.table1 ORDER BY db1.dbo.table1.object_id DESC)
    SET @OLD_object_id  = (SELECT object_id FROM db2.dbo.table1 ORDER BY db2.dbo.table1.object_id OFFSET @I ROWS FETCH FIRST 1 ROWS ONLY)

    -- Point the referencing rows in db2 at the new id, so the later
    -- merge steps carry the corrected foreign key values.
    UPDATE db2.dbo.table2
    SET object_id = @LAST_object_id
    WHERE object_id = @OLD_object_id

    SET @I = @I + 1
END

3- In this step, we merge into the first database the tables that do have foreign keys, knowing that their foreign key values have already been updated in step 2.

4- Repeat step 3 down through the database until you reach tables whose primary key is not a foreign key of any other table. Then merge those tables into the first database as well.

Remember: if the identity values in the first database's tables are lower than those in the second database's tables, the chance of errors with this method increases. So you may need to control how the identity grows in SQL Server with the following command:

DECLARE @MAX_0_object_id INT;
DECLARE @MAX_1_object_id INT;
SET @MAX_0_object_id = IDENT_CURRENT('db1.dbo.table1')
SET @MAX_1_object_id = IDENT_CURRENT('db2.dbo.table1')
IF @MAX_1_object_id > @MAX_0_object_id
BEGIN
DBCC CHECKIDENT ('db1.dbo.table1', RESEED, @MAX_1_object_id)
END

