One of the more common requirements when designing a data warehouse is the implementation of slowly changing dimensions (SCD). This article will provide you with a short introduction to the concepts of slowly changing dimensions, as well as four different methods on how to implement these using integration services (SSIS). You will learn about the advantages and disadvantages of each method and which pitfalls to avoid. Integrations services is the data manipulation and data integration tool of the Microsoft BI stack and is quite commonly used to implement extract-transform-load processes, also known as ETL.
Introduction to Slowly Changing Dimensions
When using the dimensional modelling design technique for data warehouses, introduced by Ralph Kimball in the Data Warehouse Toolkit, you basically have two types of tables: fact tables and dimension tables. A fact table has a multi-part key and one or more numerical “measures” or “facts” for each combination of this key. Each part of the key references a dimension table. A dimension table is a table with typical textual attributes and it gives context to the facts. Let’s make this clearer with a simple example.
Suppose we are building a data warehouse for an online retailer and we want to analyze the items sold. A typical purchase goes as follows: A client logs into the website, selects one or more products, goes to the checkout, and pays with a credit card. This gives us the following information about the transaction:
The date on which the transaction occurred.
The customer who bought the items. Since the customer needs an account to purchase items, we have personal information about the customer which we can use in the analysis.
The products the customer bought. We know how many he/she bought and for which price.
The words in bold will become our dimensions, the words in italic will be our facts. This gives us the following model, which is called a star-schema because of its shape:
With this model, you can now easily answer questions like
How many items did we sell in the last quarter?
Which product is the most popular for the age group 25-30?
Which country buys the most items?
On which day sells product A the best?
Do the seasons have an influence on our sales figures?
The keys in the fact table who reference the primary keys of the dimensions, are called surrogate keys. They do not have any “business value,” they are just integer keys pointing to a certain dimension record. Using surrogate keys is a best practice in dimensional modelling, but a detailed treatment is beyond the scope of this article.
This is all very nice, but what are slowly changing dimensions? Well, typically we want to record some detail of history into our data warehouse. For example, let’s suppose we have a client, John Doe, who lives in New York. This would be the dimensional record with his information:
What if John moves to another state? If we just update his information in the dimension table, it would seem like John has always lived in Miami!
Since the facts from the fact table are connected with surrogate key #25, every purchase John ever did will be linked to his Miami address, while actually some of his purchases might be done from New York. This can lead to false assumptions in our analysis.
To avoid these types of issues and to have a correct historical analysis, the concept of slowly changing dimensions is introduced. Why slowly? Normally, a dimension record doesn’t change every day. A customer doesn’t relocate every week, but he/she might move a few times in a lifetime. There are a few types of slowly changing dimensions, but in order to keep this introduction short, only the two most popular are described.
SCD Type I
This is the easiest type of slowly changing dimension: it doesn’t keep a history at all! You just update the field with its new value and you’re done. This is useful, for example, when a customer changes his e-mail address. There’s no value in keeping the old one around and analyzing data over e-mail addresses probably won’t give any interesting insights.
Every dimension has some fields where it is not necessary to keep history. These are the SCD Type 1 attributes of the dimension.
SCD Type II
For other attributes, such as addresses, it is useful to keep track of history. In order to facilitate such a setup, a few extra columns are added to the dimension:
IsCurrent: this Boolean field keeps track of the current records. Historical records have the value false for this column.
StartDate: a date value indicating when the record became active. For the first record of a customer, this is when the customer profile was created on the site. For any subsequent record, this is the date following the date of the move.
EndDate: a date value indicating when the record was retired. This column is empty for current records.
Let’s walk over the example of John moving to Florida to see which changes are made to the dimension table:
1. The old record that has John living in New York is “retired.” It is marked as a historical record and an EndDate is assigned. This EndDate is typically the date on which the change was entered into the system.
2. A new record is inserted for John, with the same ClientID—called the business key—but with a different surrogate key. The StartDate is one day later than the EndDate of the previous record of John and the EndDate is empty.
When we analyze sales data with the surrogate keys on the rows, we get a correct breakdown of the quantities sold. The data behaves if there are two different customers: number 25 and 26, but in reality these are, of course, the same person.
Looking at surrogate keys doesn’t give us any useful insights. In fact, in well-designed models, the surrogate key shouldn’t be visible to the end user. Let’s do a historical analysis based upon location:
Because of the Type 2 slowly changing dimension, we get two different rows for John Doe: one for when he lived in New York, and one for his current location in Miami. This allows us, for example, to see if John’s relocation had any influence on his shopping behavior.
When we perform an analysis on John himself, we only get one row with the cumulative totals, because John is a single person with one client ID.
This concludes the introduction to slowly changing dimensions. The following sections will guide you through the implementation process in integration services.
Using the Slowly Changing Dimension Wizard
The slowly changing dimension wizard is a built-in data flow component of SSIS. In Microsoft’s official documentation, this wizard is proposed as the method for implementing slowly changing dimensions. The wizard is quite powerful, but the result can sometimes be inefficient for a large number of updates.
Let’s take a look at this component. You can find it in the SSIS toolbox when you are designing a data flow.
Note: The screenshots for this article are taken using Integration Services 2012 using Visual Studio 2012. Depending on your version, the layout can be different.
When you drag it onto the canvas of the data flow, you need to connect the source of your data—the rows with changed customer information in our example—to the wizard component.
When you double-click on the component, the wizard will automatically start.
You can skip the welcome page and go straight to the next step. There you’ll need to specify the connection manager to the database that stores the dimension table and the dimension itself. Furthermore, you also need to indicate which columns constitute the business key of your dimension. To be clear: This is not the surrogate key, but the columns with actual data values that can uniquely identify a dimension member. In our example, this is the ClientID.
In the next step you need to specify which columns are Type 1 SCD (changing), which are Type 2 SCD (historical) and which cannot change at all (fixed). In our example, address columns are historical, E-mail and phone are changing, and the rest of the columns are fixed. We will assume here that a customer will not change his name.
The following screen gives you some behavioral options. The first one lets you decide if the package should fail if a fixed attribute does change. The second one gives you the option to update “outdated” records if a SCD Type 1 attribute has changed. For example: John moves from New York to Miami. As a result of this, he also changes his phone number. Do you want only the record from Miami to be updated, or do you also want to update the New York record? This option gives you control over these two different possibilities.
Now we have to configure how to handle historical attributes. There are two possible configurations: you either use a column with a current indicator, or you use two columns with start and end dates.
If you pick the first option, you have to decide if the column will hold true or current for the current value and if the column will hold false or expired for expired records. When picking the second option, you need to specify the date columns and a variable to set the date values. For example, if you choose the package start time (variable System::StartTime), this date value will be used in the date columns. You might want to specify one of your own variables instead of a system variable, such as for example the extraction date of the source data.
A downside of the SCD wizard is that you cannot combine the two options: you cannot use a column indicator and start/end dates at the same time.
The last step gives you the choice of enabling inferred member support. Inferred members are beyond the scope of the current article, but will be treated in an upcoming article.
Finally you get an overview of all the choices you have made.
When the wizard is finished, it will automatically create a data flow with different output paths.
Let’s break this data flow down to its individual pieces:
After the data is read from the source, it flows to the slowly changing dimension component. This component acts as an advanced conditional split. It will check the incoming rows against the rules defined during the wizard and decide if a row is either an update or an entirely new row. If it is an update, the updated columns are compared against the list of historical and changing attributes established in the wizard. If a row has SCD Type 1 changes, it will be sent to the changing attribute updates output. If a row has SCD Type 2 changes, it will be sent to the historical attribute inserts output. If you have both types of updates in one row, it will be sent to only the historical output or to both outputs, depending on the choice you made for changing attributes in the wizard.
The changing attribute rows are sent to an OLE DB Command that will update the attributes in the destination table. A simple update statement updating every attribute marked as changing will be sent to the database engine. The components use the business key to identify the row that needs to be updated.
The historical attribute rows are sent to a derived column. This transformation will calculate the new value for EndDate using the start time of the package. Or if an IsCurrent column is used, it will set this column to False.
After the derived column the rows are sent to another OLE DB Command. This component will update the EndDate or the IsCurrent column of the to-be retired record.
Once the row is updated, it is sent to a UNION ALL component. Here the rows of the historical attributes are merged with the new rows. The historical attributes will be inserted as the new current rows of the Type 2 slowly changing dimension.
Following the merge of the two flows, the resulting flow is sent to a final derived column. Here the value for the StartDate column is calculated using the start time of the package or the IsCurrent column is set to True.
The final component in the data flow inserts the rows into the dimension table.
The biggest advantage of the slowly changing dimension wizard is its simplicity and easy-to-follow steps to set up a fairly complex data flow. However, its biggest disadvantage is its performance. OLE DB Commands, for example, are used to send the UPDATE statements to the database engine. This component does this, however, row by row. In other words, an individual transaction is started and closed for every single row that needs to be updated. When a large number of rows needs to be updated, this can result in substantial locking and logging, which badly affects the performance. If the number of changed rows is small, the performance hit can be neglected.
Another disadvantage is that only OLE DB connection managers can be used for this component. Usually this isn’t a big issue, as OLE DB providers are commonly used in SSIS.
The choice for the data access mode in the OLE DB Destination component of step 7 also badly affects performance. The table or view mode is chosen, which inserts rows individually into the database table. This leads to the same performance issues as for the OLE DB Command. A better choice would have been the table or view (fast load), which bulk inserts the rows into the destination and therefore has superior performance.
This setting can easily be changed, of course, but this leads us the final disadvantage: the maintainability. Every time there’s a change in the setup of the slowly changing dimension, you need to run the wizard again. At the end of the wizard, the data flow is regenerated, thereby losing any modifications made to the configuration of the components.
The slowly changing dimension wizard can be useful—thanks to its simple wizard—but it needs to be used with caution, especially when dealing with larger dimensions. This concludes the first part of the article on how to implement a slowly changing dimension in SSIS. Please stay tuned for Part 2, where different alternatives to the wizard are examined.