SSIS - Performing a Lookup on Another Table to Get Related Column

Actually, this is a case for Lookup. It seems you want to do a lookup by name and return id. Pretty simple. Here's how I created an example of this:

  1. Drag a Data Flow Task onto the design surface. Double-click it to switch to it.
  2. Create a Connection Manager for the database.
  3. Drag onto the design surface:

    • an OLE DB Source
    • a Lookup Transform
    • an OLE DB Destination
  4. Connect the Source to the Lookup to the Destination. It's the "Lookup Match Output" we want going to the destination. See figure 1.
  5. Configure the source. My source table just had id and name columns.
  6. Configure the lookup:

    • General tab: use an OLE DB connection manager.
    • Connection tab: specify the same connection, but use the lookup table. My lookup table also had just id and name columns, with name made unique so that it works well as a lookup column.
    • Columns tab: map name to name, and select "id" as an output. Set the lookup operation to "add as new column", and name that column "lookupId". See figure 2.
    • Ignore the other two tabs.
  7. Configure the destination to take all three columns. See figure 3.

That's all. For each row from the source, the name column will be used to match the name column of the lookup table. Each match will contribute its id column as the new lookupId column. All three columns will proceed to the destination.
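
To make the row-by-row behavior concrete, here is a minimal Python sketch of what the Lookup Match Output produces. The sample rows and the function name `lookup_match_output` are made up for illustration. Unmatched rows are simply dropped here, which is an assumption of the sketch; in the package, what happens to them depends on how the Lookup's no-match handling is configured.

```python
# Hypothetical in-memory stand-ins for the source and lookup tables.
source_rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
    {"id": 3, "name": "gamma"},  # has no match in the lookup table
]
lookup_table = [
    {"id": 10, "name": "alpha"},
    {"id": 20, "name": "beta"},
]


def lookup_match_output(rows, reference):
    """Return only the matched rows, each with a new lookupId column."""
    # name is unique in the lookup table, so it can serve as the key.
    id_by_name = {ref["name"]: ref["id"] for ref in reference}
    matched = []
    for row in rows:
        if row["name"] in id_by_name:
            # Keep the source columns and add the looked-up id.
            matched.append({**row, "lookupId": id_by_name[row["name"]]})
    return matched


result = lookup_match_output(source_rows, lookup_table)
for row in result:
    print(row)
```

Each output row carries the source id, the name, and the lookupId, which are the three columns that go on to the destination.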

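Figure 1: the data flow, with the Source connected to the Lookup and the Lookup Match Output connected to the Destination (image not available).

Figure 2: the Columns tab of the Lookup (image not available).

Figure 3: the destination column mappings (image not available).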

SSIS Lookup of another table

Start with both tables as source components and do a Merge Join on employee number to match the rows from the two tables.

Then do the same thing again: match the joined records with a third source component to get the supervisor emails.
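
As a rough picture of the data shape after the two joins, here is a small Python sketch. All table contents and column names (`emp_no`, `supervisor_no`, `sup_no`, `supervisor_email`) are assumptions, as is the idea that the second join is keyed on a supervisor number. The helper uses a dictionary index rather than a real merge, so it only shows the chaining, not how the Merge Join component works internally.

```python
def inner_join(left, right, left_key, right_key):
    """Inner join two lists of dicts; matching right-hand columns are merged in."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Hypothetical sources; every column name here is an assumption.
employees = [{"emp_no": 1, "name": "Ann"}, {"emp_no": 2, "name": "Bob"}]
assignments = [{"emp_no": 1, "supervisor_no": 9}, {"emp_no": 2, "supervisor_no": 9}]
supervisors = [{"sup_no": 9, "supervisor_email": "sue@example.com"}]

# First join: match the two tables on employee number.
step1 = inner_join(employees, assignments, "emp_no", "emp_no")
# Second join: match the result against the third source to pick up the email.
step2 = inner_join(step1, supervisors, "supervisor_no", "sup_no")

for row in step2:
    print(row["name"], row["supervisor_email"])
```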

How to lookup data coming from two different datasets with SSIS

To answer your question: you could use the Merge Join transformation in SSIS to do that. Here is a blog post with a walkthrough on how to do that.

I would avoid doing that, though. Instead, land both files in two separate tables and then join them with a query. This will be much simpler to do and it will run faster.
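
Here is a minimal sketch of the "land it, then join it" approach. It uses Python's built-in `sqlite3` module as a stand-in for the real database, and the staging table names and columns are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_a (emp_no INTEGER, name TEXT);
    CREATE TABLE staging_b (emp_no INTEGER, email TEXT);
""")

# Step 1: land each dataset in its own table, with no matching logic yet.
conn.executemany("INSERT INTO staging_a VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO staging_b VALUES (?, ?)",
                 [(2, "bob@example.com"), (1, "ann@example.com")])

# Step 2: let the database engine do the join in a single query.
joined = conn.execute("""
    SELECT a.emp_no, a.name, b.email
    FROM staging_a AS a
    JOIN staging_b AS b ON b.emp_no = a.emp_no
    ORDER BY a.emp_no
""").fetchall()

print(joined)
```

The package then only needs two simple load data flows and one query, instead of sorts and a Merge Join inside the data flow.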

SSIS Lookup multi-columns to one table to retrieve surrogate key

Without a design change there's no better way. In T-SQL it would just be 21 joins, so not a great deal better.

But do you actually need a surrogate key for your calendar dimension? If the datatype of full_date is date, then that's only 3 bytes (vs 4 for a typical int surrogate), and it's already naturally sorted, stable, necessarily conformed across all possible sources (it will never be the case that '2022-07-30' means two different dates in two different applications*), and it is never implemented as a slowly changing dimension. Unknown and missing values can still be represented by picking date values that you know will never be in the domain (9999-12-31 = unknown, 9999-12-30 = missing, etc). Even Kimball exempts date dimensions from the surrogate key rule.

If you let the date value itself be the key, you don't need to do the lookup during loading at all.

The lookup is needed at the moment because fact data comes in from some source application with actual dates, but to populate the star schema you need to populate the fact table with the value of the surrogate. For example, suppose a source application provides fact data with a date value of 2022-07-30. In order to populate your data warehouse fact table you need to know the value of the date key in the calendar dimension associated with this date. So you have to go look it up in order to populate the fact table. Oh, it's the integer value 4213, cool, write that into the date_key column on the fact table.

But if the calendar dimension key is just the date, then you don't need to look up the key values when populating the fact table. The application is already sending a date, and the calendar dimension key is also a date. There's nothing to look up.
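
The difference between the two designs can be sketched in a few lines of Python. The dimension contents are hypothetical (only the 2022-07-30 to 4213 pairing comes from the example above), and the function names are made up for illustration.

```python
from datetime import date

# Hypothetical calendar dimension with an integer surrogate key.
calendar_dim = {date(2022, 7, 30): 4213, date(2022, 7, 31): 4214}

# Hypothetical fact rows as they arrive from the source application.
source_facts = [
    {"full_date": date(2022, 7, 30), "amount": 10.0},
    {"full_date": date(2022, 7, 31), "amount": 12.5},
]


def load_with_surrogate(facts, dim):
    """Each fact row needs a lookup to translate its date into date_key."""
    return [{"date_key": dim[f["full_date"]], "amount": f["amount"]} for f in facts]


def load_with_date_key(facts):
    """The date itself is the key, so the value is copied straight through."""
    return [{"date_key": f["full_date"], "amount": f["amount"]} for f in facts]


with_surrogate = load_with_surrogate(source_facts, calendar_dim)
with_date = load_with_date_key(source_facts)

print(with_surrogate[0])
print(with_date[0])
```

The second loader never touches the dimension, which is the whole point: the lookup step disappears from the load.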

* unless you are working across time zones, but then that complication applies to both implementations anyway, and you would be using datetimeoffset somewhere

SSIS Lookup Transform use Table or Query

On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?

It is best to specify which columns you want.

Using table in Drop down, would this be like doing Select * From T1?

Yes, it is a SELECT *.

or is SSIS clever enough to know I only need 2 columns?

Nope.
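
To see the difference in what comes back, here is a small sketch using Python's `sqlite3` module in place of the real server. Only `T1`, `ID`, and `Update` come from the question; the extra columns `Notes` and `Payload` are hypothetical. Note that `Update` is a reserved word, so it is quoted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Notes and Payload are hypothetical extra columns the lookup does not need.
conn.execute('CREATE TABLE T1 (ID INTEGER, "Update" TEXT, Notes TEXT, Payload TEXT)')
conn.execute("INSERT INTO T1 VALUES (1, '2022-07-30', 'n/a', 'large unused value')")


def column_names(sql):
    """Return the column names a query brings back."""
    return [col[0] for col in conn.execute(sql).description]


# Picking the table in the drop-down behaves like SELECT *.
table_mode = column_names("SELECT * FROM T1")
# Entering a query brings back only the columns that are listed.
query_mode = column_names('SELECT ID, "Update" FROM T1')

print(table_mode)
print(query_mode)
```

Every column that is selected has to be read and held for the lookup, so the narrower query is the cheaper one.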

Keep in mind that Lookups are good for pulling data from dimension tables where the row count and record set are small. If you are dealing with large amounts of unique data, then it will be better to perform a MERGE JOIN instead. The performance difference can be substantial. For example, when using a Lookup on 20K rows of data, you could experience run times in the tens of minutes. A MERGE JOIN, however, would run within seconds.

In no-cache and partial-cache modes, Lookups have the drawback of behaving like correlated sub-queries, in that they can fire off a query to the server for each row passing through. (In the default full-cache mode, the reference data is instead loaded into memory once, before the data flow starts.) With partial caching, SSIS stores the results in memory and checks that cache before going to the server for all subsequent rows passing through the Lookup. As a result, this is only effective if a large number of rows match a small cache set. In other words, Lookups are not optimal when there is a large number of distinct IDs to look up. At that point, caching data is almost pointless.
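
The effect of distinct keys on a check-the-cache-first lookup can be simulated in Python. This is only a sketch of the idea, not of SSIS internals; `CountingServer` and `cached_lookup` are hypothetical names, and the "server" is just a dictionary that counts how often it is asked.

```python
class CountingServer:
    """Hypothetical reference table that counts how many queries it receives."""

    def __init__(self, table):
        self.table = table
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return self.table.get(key)


def cached_lookup(keys, server):
    """Check an in-memory cache first; query the server only on a miss."""
    cache = {}
    results = []
    for key in keys:
        if key not in cache:
            cache[key] = server.fetch(key)
        results.append(cache[key])
    return results


reference = {i: f"value-{i}" for i in range(1000)}

# 1000 rows that share only 5 distinct keys: the cache absorbs almost everything.
few_distinct = CountingServer(reference)
cached_lookup([i % 5 for i in range(1000)], few_distinct)

# 1000 rows with 1000 distinct keys: every row still goes to the server.
many_distinct = CountingServer(reference)
cached_lookup(list(range(1000)), many_distinct)

print(few_distinct.queries, many_distinct.queries)
```

With few distinct keys the server sees a handful of queries; with all-distinct keys the cache never gets a hit and the query count equals the row count.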

This is where you would switch over to using a MERGE JOIN. Note: you will need to perform a SORT on both of the data flows before the MERGE JOIN because the MERGE JOIN component requires the incoming rows to be sorted.
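
Why the sort matters is easier to see in code. Below is a minimal Python sketch of a sort-merge inner join, written for illustration only; it is not how the SSIS component is implemented, and the sample data is made up. The algorithm walks both inputs once, in step, which only works if both are ordered by the join key.

```python
def merge_join(left, right, key):
    """Inner join of two lists of dicts that are ALREADY sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left-hand row with this key against that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


orders = [{"id": 3, "qty": 7}, {"id": 1, "qty": 2}, {"id": 2, "qty": 5}]
names = [{"id": 2, "name": "b"}, {"id": 3, "name": "c"}, {"id": 1, "name": "a"}]


def by_id(row):
    return row["id"]


# Sorting both inputs first gives the complete result.
joined = merge_join(sorted(orders, key=by_id), sorted(names, key=by_id), "id")
# Feeding unsorted inputs to the same algorithm loses matches.
unsorted_result = merge_join(orders, names, "id")

print(len(joined), len(unsorted_result))
```

On this sample data the sorted inputs produce all three matches, while the unsorted inputs produce only two.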

When handled incorrectly, a single poorly placed Lookup can bring an entire package to its knees - lookups can be huge performance bottlenecks. Though, handled correctly, a Lookup can simplify the design of the dataflow and speed development by removing the extra development required to MERGE JOIN data flows.

The bottom line to all of this is that you want the Lookup performing the fewest number of queries against the server.

SSIS Lookup with filtered reference table

One possibility is to first stream the distinct IDs to a permanent/temporary table in one data flow and then use it in your lookup (with a join) in a later data flow (you probably have to defer validation).
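
A sketch of that two-step idea, using Python's `sqlite3` module as a stand-in for the real database. The table names `reference` and `incoming_ids` and the sample IDs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reference (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE incoming_ids (id INTEGER PRIMARY KEY);
""")
# A large reference table, of which only a few rows are actually needed.
conn.executemany("INSERT INTO reference VALUES (?, ?)",
                 [(i, f"label-{i}") for i in range(1, 1001)])

# First data flow: stream the distinct IDs seen in the source into a table.
source_ids = [5, 7, 5, 42, 7]
conn.executemany("INSERT OR IGNORE INTO incoming_ids VALUES (?)",
                 [(i,) for i in source_ids])

# Later data flow: the lookup query joins to the staged IDs, so only the
# reference rows that can possibly match are read.
filtered_reference = conn.execute("""
    SELECT r.id, r.label
    FROM reference AS r
    JOIN incoming_ids AS s ON s.id = r.id
    ORDER BY r.id
""").fetchall()

print(filtered_reference)
```

The lookup then works against three rows instead of a thousand.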

In many of our ETL packages, we first stream the data into a Raw file, handling all the type conversions and everything else on the way there. Then, once all these conversions have succeeded, we create the new dimension rows, and after that the facts linking to the dimensions.


