In the previous part of this article, the concept of slowly changing dimensions was introduced and the built-in slowly changing dimension wizard was explained in detail. This part will show you three different alternatives to the wizard and how they improve performance for your packages.

Using the Dimension Merge Component

If your company’s policy allows third-party tools in your development environment, you might want to try out the dimension merge component. This component is a free open-source SSIS transformation that can be downloaded from Codeplex. It is developed by Todd McDermid, a SQL Server MVP. However, it is only available for SSIS 2005 and 2008, so it cannot be used in more recent development environments, such as SQL Server Data Tools 2010 and 2012.

MCSE Training – Resources (Intense)

Installation

The installers are available in 32- and 64-bit versions. If you want to use the component in BIDS, you need to install the 32-bit version on your development machine. If the server that will eventually run the SSIS packages is a 64-bit server, you’ll need to install the 64-bit component.

After the set-up is finished, the component needs to be added to the toolbox in BIDS. This can be done by following these easy steps:

  1. Add a data flow to the package.
  2. Open the data flow, right-click in the toolbox and select Choose items…

  3. In the Choose Items… dialog, go to the SSIS Data Flow Items tab and select Dimension Merge Slowly Changing Dimension from the list and click OK.

Implementing the slowly changing dimension

The dimension merge component is considerably more complex to implement than the slowly changing dimension wizard. There’s no easy wizard guiding you through the process, but rather a complex editor with multiple tabs.

In contrast with the wizard, the connection to the dimension must be configured as an input in the data flow. This gives us fine grained control on how the dimension is accessed. There are three possible inputs for the transformation:

  • Source System: this input provides the new and/or updated rows from the source system. This is the same input as used with the wizard.
  • Existing Dimension: this is the connection with the slowly changing dimension. When the package is executed, the dimension will be loaded into memory for maximizing performance.
  • Special Members: this is an optional input for specifying “special” members for the dimension, such as an Unknown member.

The component has a whole list of different outputs, with one more important than the other.

  • Unchanged: This output contains all rows that haven’t been changed. Typically, these rows are discarded.
  • New: All rows that did not find a match for the business key in the dimension go to this output.
  • Deleted: Rows that no longer exist in the source system are sent to this output. This is only possible if type 2 changes are not tracked. Since one of the objectives of a data warehouse is to retain history, this output is rarely used.
  • Expired SCD2 with SCD1 changes: These are the history rows for a certain business key. The IsCurrent column and the EndDate column are updated in this output.
  • New SCD2: When a type 2 change occurs, two rows are sent to the outputs. The old row is sent to the previous output, and the new current row is sent to this output.
  • SCD1 Updated: This output contains all rows with a type 1 change.
  • Invalid Input: Any row where problems are detected are sent to this output. This can be a NULL business key but also, for example, rows from the existing dimension with incorrect EndDates, causing an overlap between history rows. This output is important for troubleshooting the dimension merge component.
  • Auditing: one row with various auditing information is sent to this output.
  • Statistics: This output has rows with performance information directed to.

A detailed description on how to configure the component is beyond the scope of this article. Please refer to the product documentation for more details. It suffices to say that this component offers very detailed control over the handling of a slowly changing dimension and its type 2 changes. The component offers also suggestions on how to improve performance, such as sorting the inputs according to the business key.

The final data flow might look like this:

Although more complex in its use, the dimension merge component offers several advantages over the slowly changing dimension wizard:

  • Performance improvements—The documentation claims 100 times faster execution times.
  • Maintainability—You never lose changes by editing the transformation.
  • You can use any data connection you like for retrieving the dimension members.
  • It has fine-grained auditing, troubleshooting, and diagnostic options available.

Using the T-SQL MERGE Statement

SQL Server 2008 introduced the MERGE statement, which can update, delete, or insert a target based on the join with a source table. This statement can be used to implement a slowly changing dimension and is a pure T-SQL solution. In order to execute the MERGE statement, both source and target are preferably on the same SQL Server instance. For more information about MERGE, please refer to the product documentation or to this introductory article: The MERGE Statement in SQL Server 2008.

With the clever use of the output of the MERGE statement as a derived table in an INSERT statement, the implementation of a slowly changing dimension type 2 can be done in one long T-SQL statement.

INSERT INTO [dbo].[Customer]
	([ClientID]
	,[Name]
	,[E-mail]
	,[Phone]
	,[AddressLine]
	,[State]
	,[Country]
	,[BirthDate]
	,[Gender]
	,[IsCurrent]
	,[StartDate]
	,[EndDate])
SELECT
	 [ClientID]
	,[Name]
	,[E-mail]
	,[Phone]
	,[AddressLine]
	,[State]
	,[Country]
	,[BirthDate]
	,[Gender]
	,[IsCurrent]
	,[StartDate]
	,[EndDate]
FROM
	(
	MERGE INTO [dbo].[Customer] AS Target -- the target is the slowly changing dimension itself
	USING
	(
		SELECT
			 [ClientID]
			,[Name]
			,[E-mail]
			,[Phone]
			,[AddressLine]
			,[State]
			,[Country]
			,[BirthDate]
			,[Gender]
		FROM [dbo].[Customer_Source]
	) AS Source -- the source contains the new and changed records
		ON (Target.[ClientID] = Source.[ClientID]) -- Join on the business key of the dimension
	WHEN MATCHED AND (Target.[AddressLine] <> Source.[AddressLine] OR Target.[State] <> Source.[State] OR Target.[Country] <> Source.[Country])
	-- if a match is found between the target and source and one of the SCD type 2 columns has changed, the record must be closed off and become a historical record
	THEN UPDATE SET
		 EndDate = GETDATE() - 1 -- take the date of yesterday to retire the historical record
		,IsCurrent = 0
	WHEN NOT MATCHED -- no match is found, so the source record is a new record which must be inserted into the dimension
	THEN INSERT
		([ClientID]
		,[Name]
		,[E-mail]
		,[Phone]
		,[AddressLine]
		,[State]
		,[Country]
		,[BirthDate]
		,[Gender]
		,[IsCurrent]
		,[StartDate]
		,[EndDate])
	VALUES
		(Source.[ClientID]
		,Source.[Name]
		,Source.[E-mail]
		,Source.[Phone]
		,Source.[AddressLine]
		,Source.[State]
		,Source.[Country]
		,Source.[BirthDate]
		,Source.[Gender]
		,1 -- set the IsCurrent column to True
		,GETDATE() -- use the current datetime for the start time of the new record
		,NULL) -- use NULL to indicate that the record has not expired yet
		-- every source row affected by the MERGE statement is sent to the output. This will be used in the INSERT statement
		-- to insert the new current SCD type 2 records
	OUTPUT	 $action AS [Action] -- possible values are INSERT/UPDATE/DELETE
			,Source.[ClientID]
			,Source.[Name]
			,Source.[E-mail]
			,Source.[Phone]
			,Source.[AddressLine]
			,Source.[State]
			,Source.[Country]
			,Source.[BirthDate]
			,Source.[Gender]
			,1 AS IsCurrent -- set the IsCurrent column to True
			,GETDATE() AS StartDate -- use the current datetime for the start time of the new current record
			,NULL AS EndDate -- use NULL to indicate that the record has not expired yet
	) tmp
WHERE [Action] = 'UPDATE';

Although it looks very complex and convoluted, the concept is actually quite simple:

  1. Take the source rows and match them against the target dimension using the business key.
  2. When the row doesn’t find a match, it is a new row and is thus inserted into the dimension.
  3. When there is a match and one or more of the type 2 attributes have changed, update the current record to an historic record by setting the IsCurrent column and the EndDate column.
  4. Send all the source rows to the output.
  5. Insert the rows from the output with action equals to UPDATE into the dimension using a separate INSERT statement. These rows are the new current rows for the historical members.

The only downside of this method is that it doesn’t handle type 1 updates. There’s no way to distinguish type 1 updates from type 2 updates with the $action attribute, so these need to be handled with a separate MERGE statement:

MERGE INTO [dbo].[Customer] AS Target -- the target is the slowly changing dimension itself
USING
(
	SELECT
		 [ClientID]
		,[Name]
		,[E-mail]
		,[Phone]
		,[AddressLine]
		,[State]
		,[Country]
		,[BirthDate]
		,[Gender]
	FROM [dbo].[Customer_Source]
) AS Source -- the source contains the new and changed records
	ON (Target.[ClientID] = Source.[ClientID]) -- Join on the business key of the dimension
WHEN MATCHED AND (Target.[AddressLine] <> Source.[E-mail] OR Target.[Phone] <> Source.[Phone])
-- if a match is found between the target and source and one of the SCD type 2 columns has changed, the record must be closed off and become a historical record
THEN UPDATE SET
	 [Phone]	= Source.[Phone]
	,[E-mail]	= Source.[E-mail];

The biggest advantage of the MERGE statement is its performance. It is one of the fastest solutions available. If needed, checksums of the columns can be used to detect changes even more efficiently. However, because the statement runs inside a transaction, it is an all-or-nothing approach. If one row fails—for example, by trying to set a non-nullable column to NULL—the entire batch will fail. With SSIS you have more detailed error handling and logging on the row level, which is very hard to incorporate in T-SQL.

Using a Custom SSIS Package

If the previous options weren’t satisfying, we can always try it ourselves. With a custom developed SSIS package, we can have the best of all worlds: great performance, good maintainability, and decent error handling and logging. A custom SSIS package has full flexibility and it can be adapted to any change you want. It takes some time, though, to develop it the first time, but afterwards it can be used for other dimensions as well by using simple copy-paste.

Let’s walk through a possible package design. The control flow is pretty simple: a data flow followed by an Execute SQL Task.

The data flow does the heavy lifting: it reads the source, decides which rows are inserts and updates and writes the rows to the corresponding destinations. The Execute SQL Task will do the actual type 1 and type 2 updates in the dimension.

Let’s take a closer look at the data flow.

  1. The incoming rows are read from the source and sent to a lookup component. This lookup component has cached the current rows of the dimension and compares these with the incoming rows using the business key. When no match is found, the rows are inserted (step 4). If a match is found, the rows are possible updates (step 2).

    The following screenshot is an example of a SQL query to retrieve the current dimension rows:

    The source rows are matched against the dimension rows with the business key. Every column subject to change is retrieved from the dimension. They are suffixed with “_target” to distinguish them from the source columns. The surrogate key is also retrieved in order to facilitate easier updates in a later stage.

  2. The possible updates from the lookup match output are sent to a condition split. This transformation checks if columns have changed. If a type 2 column has changed, the row is sent to the “type 2” output. If a type 1 column has changed, the row is redirected to the “type 1” output. If none of the columns have changed, the row is sent to the default output—renamed to “unchanged rows”—which is discarded.
  3. The type 2 rows are duplicated with a multicast transformation. One output will be used to insert the new current rows, the other output will be used to update the formerly current row to an historical record.
  4. One output of step 3 is merged with the match output from the lookup component (the new dimension rows).
  5. The SCD attributes are set for the new dimension rows using a derived column. The start time of the package is used to set the StartDate.

  6. Finally the new rows are sent to the dimension table. In contrast with the slowly changing dimension wizard, this OLE DB destination uses the fast load option, which results in optimal performance. The generation of the surrogate keys for the new dimension members is handled by an IDENTITY constraint on the surrogate key column of the dimension.

  7. The other output of the multicast of step 3 is sent to another derived column, which sets the SCD attributes for the type 2 updates. As in step 5, the start date of the package is used to set the EndDate. A new column, Action, is added to the flow in order to distinguish the two different types of updates from each other.

  8. The type 1 output of the conditional split is sent to yet another derived column, which only adds the Action column.

  9. We’re almost there. The two different types of updates are merged together using a union all.
  10. In the final step, the updates are written to a staging table. This allows us to do a set-based update in the Execute SQL Task later on instead of doing row by row updates with an OLE DB command. As with the inserts, the fast load option is used.

When the data flow is finished, the Execute SQL Task is executed. This will issue two update statements against the dimension table: each one for the two different types of updates. The different types of updates are easily distinguished from each other by using the Action column. Because the surrogate key is stored in the staging table, a very efficient inner join is possible between the staging table and the dimension table.

-- close off historical records
UPDATE dim
SET	 EndDate	= stg.EndDate
	,IsCurrent	= stg.IsCurrent
FROM		[dbo].[Customer]			dim
INNER JOIN	[dbo].[Customer_Staging]	stg ON dim.Customer_Key = stg.Customer_Key -- join on surrogate key for performance
WHERE [Action] = 'SCD2';

-- update rows with SCD type 1 changes
UPDATE dim
SET	 [E-mail]	= stg.[E-mail]
	,[Phone]	= stg.[Phone]
FROM		[dbo].[Customer]			dim
INNER JOIN	[dbo].[Customer_Staging]	stg ON dim.Customer_Key = stg.Customer_Key -- join on surrogate key for performance
WHERE [Action] = 'SCD1';

This package closely resembles the logic behind the MERGE statement and it also performs very well. The data flow here is still simplified, as there is no error handling present. The package can, however, be expanded to suit every need.

Conclusion

This two-part article introduced the concepts of a slowly changing dimension and four different methods of implementing them, using integration services. Each method has its own strengths and weaknesses. The SCD wizard is easy to use, but can have potential poor performance and maintainability. The third-party Dimension Merge component is more robust and efficient than the wizard, but comes with a learning curve. The T-SQL MERGE statement is very powerful, but it’s quite a complicated statement with no additional error or event handling. A customized SSIS package can addresses all of those weaknesses, but it can take some development effort to get it right.

The following table gives an overview of the advantages and disadvantages of each method.

Subject

Wizard

Dimension Merge

T-SQL MERGE

Custom SSIS

Ease of implementation

+++

Performance / scalability

++

+++

++

Maintainability

+

+

+

Error handling on row level ++

+++

++

Auditing

+

+++

+

References