SSIS with millions of rows to compare between source and target
If you do this with a Lookup, neither row set is stored completely in memory unless you use full cache mode. With full cache, the entire target data set is loaded into memory before the data flow starts, and you can get memory exceptions if you don't have enough available memory.
With no cache (or partial cache), a Lookup is a terrible idea because it queries the target data for every row in the source. With 12 million source rows, you will issue 12 million individual queries against the target before you are done. This is the worst-performing option.
A Merge Join is faster because both inputs are pre-sorted on the join key, so the matching is much faster. Neither dataset needs to be held entirely in memory; the rows flow freely without waiting for the whole dataset to be loaded.
Here is a comparison between Lookup and Merge Join.
The fastest option would be to load your target data directly to a staging table on the same server as your source data, and index that table on the joining key. Then you can do the comparison in SQL, joining on indexed columns, which would give you the fastest performance.
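As an illustrative sketch of that staging-table comparison (the table and column names here are hypothetical, not from the original answer):

```sql
-- Stage the target data on the same server as the source,
-- and index the join key before comparing
CREATE TABLE dbo.Target_Staging
(
    BusinessKey int         NOT NULL,
    SomeValue   varchar(50) NULL
);

CREATE CLUSTERED INDEX IX_Target_Staging_BusinessKey
    ON dbo.Target_Staging (BusinessKey);

-- New rows: present in the source, missing from the staged target
SELECT S.BusinessKey, S.SomeValue
FROM dbo.Source AS S
LEFT JOIN dbo.Target_Staging AS T
    ON T.BusinessKey = S.BusinessKey
WHERE T.BusinessKey IS NULL;

-- Changed rows: key matches but the payload differs
SELECT S.BusinessKey, S.SomeValue
FROM dbo.Source AS S
INNER JOIN dbo.Target_Staging AS T
    ON T.BusinessKey = S.BusinessKey
WHERE S.SomeValue <> T.SomeValue
   OR (S.SomeValue IS NULL AND T.SomeValue IS NOT NULL)
   OR (S.SomeValue IS NOT NULL AND T.SomeValue IS NULL);
```

With a clustered index on the staged join key, the database engine can do the matching with an efficient join rather than row-by-row lookups.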
How to run SSIS packages dynamically?
You can look into BiMLScript, which lets you create packages dynamically based on metadata.
Incremental load across two different servers with different source DBs
The general pattern I would take would be something like this.
My control flow will identify the databases on the server associated with our project (Connection Manager = Source)
Here I run a query against sys.databases, because you can then apply a filter such as AND D.name IN ('DB1', 'DB2', 'DB3');. On S1, that query would return two values; on S2, only one.
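A minimal sketch of that discovery query (the database names DB1 through DB3 are placeholders for your own):

```sql
-- Run against the Source connection manager; returns only the
-- project databases that actually exist on the current server
SELECT D.name
FROM sys.databases AS D
WHERE D.name IN ('DB1', 'DB2', 'DB3');
```

Capture the result set into an Object-typed SSIS variable via an Execute SQL Task (ResultSet = Full result set) so the ForEach loop can enumerate it.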
We'd use that list of databases as the source of a ForEach Loop enumerator to "shred" the results. For each value we identified in the original query (DB1, DB2), we update the InitialCatalog property of our Source connection manager. In the reference answers below I set the ConnectionString property, but you will only want to modify InitialCatalog. On each pass through the loop, the database being pointed to changes.
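One way to wire that up (the variable name User::CurrentDatabase is my own choice, not from the reference answers) is a property expression on the connection manager:

```
-- Source connection manager -> Properties -> Expressions
-- Property:   InitialCatalog
-- Expression:
@[User::CurrentDatabase]
```

The ForEach Loop's variable mapping assigns the current row of the shredded result set to User::CurrentDatabase, so each iteration repoints the connection at the next database.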
The Data Flow inside the ForEach enumerator is then simplified to only deal with the current database and not have to worry about whether this server has 3 source databases or 1.
Caveats
The source query and data types must be compatible across all the associated databases. The structure of a Data Flow is set at design-time and cannot be changed during run-time.
If the entities are consistent across the databases and it's just columns being called something different, create a view across each database to ensure the entity name is consistent and then you can avoid dynamic SQL.
You will need to provide the initial value of the Source connection string when the package begins. This can be accomplished via the /Set option on invocation (e.g. with dtexec).
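As a hedged example (the file path, package variable layout, and variable name are hypothetical), a dtexec invocation using /Set could look like this:

```
dtexec /File "C:\SSIS\IncrementalLoad.dtsx" ^
  /Set "\Package.Variables[User::SourceConnectionString].Properties[Value]";"Data Source=S1;Initial Catalog=DB1;Provider=SQLNCLI11.1;Integrated Security=SSPI;"
```

This seeds the Source connection string before the control flow runs; the ForEach loop then swaps the InitialCatalog on each iteration.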
Reference answers
Some relevant SSIS answers that explore these concepts
- Expression on Connection Manager https://stackoverflow.com/a/15556895/181965
- Shred recordset https://stackoverflow.com/a/13976990/181965
- DTEXEC and SET https://stackoverflow.com/a/8386338/181965
- SET still relevant https://stackoverflow.com/a/50535776/181965