SSIS Best Practice to Load N Tables from Source to Target Server

SSIS with millions of rows to compare between source and target

If you do this with a Lookup, neither row set will be stored completely in memory unless you use full cache mode. With full cache, all of the target data is loaded into memory before the data flow starts, and you could get memory exceptions if you don't have enough available memory for it.

A Lookup in no-cache or partial-cache mode is a terrible idea, because for every row in the source data you issue a query against the target data. With 12 million source rows, you are going to issue 12 million individual queries against the target before you are done. This is the worst-performing option.

A Merge Join is faster because both inputs are pre-sorted on the matching key, so the matching becomes a streaming operation. Neither dataset needs to be held entirely in memory; rows flow through without waiting for the whole dataset to load. Note that both inputs must genuinely be sorted, ideally by an ORDER BY in the source queries rather than a blocking Sort transformation.
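As a minimal sketch (the table and column names here are hypothetical), the two Merge Join inputs would each be fed by a source query ordered on the join key:

    -- Input 1: source rows, ordered on the join key so the Merge Join can stream.
    SELECT CustomerID, Name, ModifiedDate
    FROM dbo.Customer
    ORDER BY CustomerID;

    -- Input 2: target rows (run against the Target connection manager),
    -- same key, same sort order, same collation.
    SELECT CustomerID, Name, ModifiedDate
    FROM dbo.Customer
    ORDER BY CustomerID;

After adding the ORDER BY, open each source's Advanced Editor and set IsSorted = True on the output and SortKeyPosition = 1 on the key column, because SSIS does not infer sort order from the query text.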


The fastest option would be to load your target data into a staging table on the same server as your source data and index that table on the joining key. Then you can do the comparison in set-based SQL, joining on indexed columns, which gives you the best performance.
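A minimal sketch of that staging approach, assuming a hypothetical dbo.Customer table keyed on CustomerID and a pre-existing staging schema:

    -- 1. Land the target rows next to the source (populated by a simple data flow).
    CREATE TABLE staging.TargetCustomer
    (
        CustomerID    int           NOT NULL,
        Name          nvarchar(100) NULL,
        ModifiedDate  datetime      NULL
    );

    -- 2. Index the joining key so the comparison can seek rather than scan.
    CREATE CLUSTERED INDEX IX_TargetCustomer_CustomerID
        ON staging.TargetCustomer (CustomerID);

    -- 3. Do the comparison set-based: new and changed rows in a single pass.
    SELECT s.CustomerID,
           CASE WHEN t.CustomerID IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS ChangeType
    FROM dbo.Customer AS s
    LEFT JOIN staging.TargetCustomer AS t
        ON t.CustomerID = s.CustomerID
    WHERE t.CustomerID IS NULL
       OR t.ModifiedDate <> s.ModifiedDate;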

How to run SSIS packages dynamically?

You can look into BimlScript, which lets you generate packages dynamically from metadata.

Incremental load across two different servers with different source DBs

The general pattern I would take is something like this.

My control flow first identifies the databases on the server that are associated with our project (Connection Manager = Source).

[Image: control flow querying sys.databases for the project's databases]

Here I show a query against sys.databases because you can then apply criteria like AND D.name IN ('DB1', 'DB2', 'DB3');

On S1, that query would return 2 values; on S2, only 1.
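A sketch of that control-flow query; the ONLINE filter and the IN list are assumptions you would adapt to however your project's databases are identified:

    -- Enumerate this server's project databases.
    -- On S1 this returns DB1 and DB2; on S2 it returns only DB1.
    SELECT D.name
    FROM sys.databases AS D
    WHERE D.state_desc = 'ONLINE'
      AND D.name IN ('DB1', 'DB2', 'DB3');

In the package, this would run in an Execute SQL Task with ResultSet set to "Full result set", captured into an Object-typed variable that feeds the ForEach Loop.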

We'd use that list of databases as the source of a ForEach Loop enumerator to "shred" the results. For each value identified by the original query (DB1, DB2), we update the InitialCatalog property of our Source connection manager. In the reference answers below I set the ConnectionString property, but here you only want to modify InitialCatalog. So on each pass through the loop, the database the connection points to changes.

The Data Flow inside the ForEach enumerator is then simplified to only deal with the current database and not have to worry about whether this server has 3 source databases or 1.

[Image: Data Flow Task inside the ForEach Loop enumerator]

Caveats

The source query and data types must be compatible across all the associated databases. The structure of a Data Flow is fixed at design time and cannot be changed at run time.

If the entities are consistent across the databases and it's just that columns are named differently, create a view in each database to ensure the entity's shape is consistent; then you can avoid dynamic SQL.
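For example (the names here are hypothetical), a view in the database whose columns differ can re-expose them under the common contract:

    -- In a database where the key is named Cust_No rather than CustomerID,
    -- present the shape the shared data flow expects.
    CREATE VIEW dbo.vCustomerSource
    AS
    SELECT C.Cust_No   AS CustomerID,
           C.FullName  AS Name,
           C.ModifiedDate
    FROM dbo.Customer AS C;

The data flow then always selects from dbo.vCustomerSource, regardless of which database the loop is currently pointed at.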

You will need to provide the initial value of the Source connection string when the package begins. This can be accomplished via the /SET option when the package is invoked.
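A sketch of that invocation; the file path, connection manager name, and connection string value are all assumptions for illustration:

    REM Seed the Source connection manager's ConnectionString at startup.
    dtexec /FILE C:\SSIS\LoadProject.dtsx ^
           /SET \Package.Connections[Source].Properties[ConnectionString];"Data Source=S1;Initial Catalog=DB1;Integrated Security=SSPI;"

From there, the ForEach Loop overwrites InitialCatalog on each iteration, as described above.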

Reference answers

Some relevant SSIS answers that explore these concepts

  • Expression on Connection Manager https://stackoverflow.com/a/15556895/181965
  • Shred recordset https://stackoverflow.com/a/13976990/181965
  • DTEXEC and SET https://stackoverflow.com/a/8386338/181965
  • SET still relevant https://stackoverflow.com/a/50535776/181965

