5 tips to make you a pro SSIS developer
The road to SSIS stardom is a long and slippery road. This article will give you 5 tips to get you started on becoming a pro SSIS developer. It is, by all means not a complete list, but in my opinion, five important tips one should not miss.
1. Pay attention to the formatting of the package
This is a personal pet peeve of mine and that’s why we will tackle it first. Paying attention to the formatting and layout of your SSIS package doesn’t really make it faster or more efficient, but it certainly makes it more readable and easier to understand.
Let’s not forget: you may be the one who is developing the package right now, but are you still going to be the developer in a year from now? Are you going to be the one to support it in the future and do maintenance? Chances are good that someone else will take a look at your package at some point in the future and you don’t want them to be scratching their head looking at a mess like this:
A quick solution to the messy layout is to use the Auto Layout feature of Visual Studio.
Using this feature results in the following layout:
It’s easier now to get a grip on the “flow” of the package and to find the starting points, but it is still very difficult to understand what the package is supposed to do without opening every single task. The example might seem extreme, but I’ve seen plenty of packages where the developer didn’t bother to rename the tasks.
Take the following steps to create an easy readable and understandable package:
* Rename every task and component and give it a meaningful name which describes the purpose of the object. Use naming conventions, such as the one proposed by Jamie Thomson: Suggested Best Practices and naming conventions (read also the best practices part). Naming conventions can be expanded to connection managers and variables/parameters.
* Group tasks, that logically belong together and possibly should fail or succeed together, use either sequence containers or group them by right-clicking the selection of tasks and choosing Group. Personally, I prefer sequence containers, as they are actual objects. Groups are merely a visual aid. You can change properties on a sequence container, for example transaction support.
* Comment your code as any programmer would. You can put comments in the T-SQL scripts or in the .NET code inside script tasks/comments, but you can also add annotations to your package. Simply right-click on the design canvas and choose Add Annotation. You can format the annotation by right-clicking it and choosing Set Text Annotation Font… One small annoyance is that annotations themselves cannot be copy-pasted; you need to copy the text itself. Tip: you can go to a new line in an annotation by pressing CTRL+Enter.
* Structure the tasks and components in a logical manner. This is usually top-to-bottom – especially for data flows – or left-to-right (at least in the Western world).
* This might seem a bit obsessive (and it is), but I try to give my packages a symmetrical look as it is easier on the eyes. Tasks and components usually have the same size and they are all outlined as if they were on a grid. Visual Studio has a toolbar that can help you with this: the Layout toolbar. With this toolbar, it only takes a matter of seconds to properly format your package.
Keeping these steps in mind, the package is transformed to the layout depicted below.
It is now clear what each task is supposed to do and the package has become less confusing.
2. Develop a framework
A whole book could be written about this topic, so I’ll try to keep it short. In order to develop an enterprise-ready efficient ETL solution, a framework for developing SSIS should be present. This framework includes the following parts:
* A logging and event handling framework.
* The use of templates.
* A deployment strategy.
* The use of source control.
Logging and event handling
The standard built-in logging provided by SSIS is adequate to find more information about the start and end times of the packages and to troubleshoot packages using logged warnings and errors. However, it still has it shortcomings: not every error is properly logged and it is very hard to retrieve crucial information, such as, how many rows are written to the destination. It is possible to retrieve the number of rows written using the standard logging, but you’d have to enable certain events in the logging which lead to a substantial amount of logging. Furthermore, the queries to extract the needed information are usually fairly complex.
To work around these limitations, you can implement your own custom logging mechanism. Using Execute SQL Tasks you can write information about the status of the package to a central logging database. There are a number of system variables which can give you essential information, such as the package name, the package/task start time, etc.
Typically, such a logging framework would do the following tasks:
* Log the start and end of the package using Execute SQL Tasks on the control flow.
* Log the start of a task using an OnPreExecute event handler.
* Log the end of a task using an OnPostExecute event handler.
* Log the numbers of rows transferred. You can use the rowcount transformation to store this number into a variable. Use the OnPostExecute event handler to write the info to the database.
* Log errors to the database using an OnError event handler.
For more information about this type of set-up, please refer to the following article: Custom Logging Using Event Handlers.
The introduction of the SSIS catalog in SQL Server 2012 makes such a framework largely redundant, but in order to get the maximum information out of the catalog, the logging level should be set to Verbose, which gives quite some overhead.
In order to promote re-use, you could develop templates for packages that are commonly developed. Typically, you would already include the logging components of the logging framework described in the previous section. These templates can be used in Visual Studio as the starting point for the development of new packages: Save a Package as a Package Template.
Deploying packages manually to a production server should, of course, be avoided. Therefore, deployment scripts should be created that allow someone with the appropriate permissions to deploy SSIS packages or projects to a server. The following two articles describe all the various possibilities: SSIS deployments and SSIS Deployments with SQL Server 2012. For automating deployments of packages developed in SQL Server 2005, 2008 or 2008R2, dtutil is recommended.
Aside from getting packages and project on a server, it is also important to put configurations in the picture. After deployment, values inside the package need to be altered to reflect the new environment. After all, you don’t want to update your development server when the package is running in production. My personal choice is to use SQL Server configuration tables, but XML configuration files are just as good.
There’s a great chance the package will be scheduled to be executed after its deployment, probably with SQL Server Agent. Make sure that the account executing the package has the necessary permissions to execute every single task inside the package. For SQL Server Agent, proxies can be used.
Unless you like losing all your hard work after a hard drive crash, you check-in your code into some sort of source control. This will also prevent your work from being overwritten by an over enthusiastic colleague. Due to the nature of the XML behind the SSIS package, it is very difficult to merge packages, so I advise that only one developer works at a single package at a time. Most source control software can check-in and check-out object, but my personal favorite is Team Foundation Server, due to its great integration with Visual Studio.
3. Use BIDSHelper
BIDSHelper is a free add-in for Visual Studio which can be downloaded from Codeplex and it is an absolute must-have. BIDSHelper is available for all versions of SSIS. It has several useful features: such as one-click deployment to a server (great for deploying packages to a test server), visually indicating if a variable or connection manager is configured by an expression or package configuration, allowing the use of relative paths for configuration, visualizing the package performance and so on.
With the introduction of SSIS 2012, some of the BIDSHelper functionality was incorporated into the main product. However, it still remains an invaluable tool, especially because it includes the BIML engine. This engine allows developers to dynamically create SSIS packages based on metadata. For more information about BIML: Stairway to BIML.
4. Avoid blocking components
The components used inside the data flow can be divided in two groups:
* Synchronous components. They are also called row transformations and each input row corresponds exactly with one output row. These components can reuse buffers and do not require data to be copied between buffers, making them the fastest components in SSIS. Examples are the derived column, the conditional split and the data conversion components.
* Asynchronous components. When a component has to acquire multiple buffers before it can process rows, or when multiple inputs have to be combined or when there is no one-to-one mapping between the input rows and the output rows, we are talking about asynchronous transformations.
The asynchronous components affect the performance of the SSIS package. They can be split up in two types: partially blocking and full blocking transformations. When datasets need to be combined, such as in the UNION ALL or the MERGE JOIN, we have a partially blocking transformation.
Some components however need to read the entire dataset before a single output row can be created. These are the blocking transformations. Examples are the SORT and AGGREGATE transformation. These transformations have a devastating impact on the performance. If you want to sort 10 million rows, you’d need to read them all into memory.
To avoid blocking transformations, it’s advised to do sorting and grouping in the source query with GROUP BY and ORDER BY clauses, if applicable. The database engine is much better suited to do these kinds of operations. If the source is not a relational database, it might be an option to write the data to a staging table and do the transformations there.
For more information about the types of transformations: Understanding Synchronous and Asynchronous Transformations.
5. Change the defaults every now and then
When you create a SSIS package, a lot of properties get default values. In 90% of the cases, these default values work just fine. The packages are fast, they do not crash and everyone is happy. However, in certain cases the default values are suboptimal and in rare cases, it is sometimes counterproductive, and a little tweaking of the package might be necessary.
One property you should always change is the protection level of the package, which decides how sensitive data in the package is protected. The default value is EncryptSensiviteWithUserKey, which encrypts the sensitive data using user credentials. This can make the package useless after deployment to a server, as most likely another credential is used to run the package. A better option is to set the protection level to DontSaveSensitive. This option doesn’t store any sensitive data at all in the package. The developer is forced to take care of the sensitive data using package configurations.
Another thing worth changing sometimes is the RetainSameConnection property of certain connection managers. This property tells the connection manager to create a single connection instead of multiple ones, which is useful when dealing with transactions or temporary tables.
Performance wise, it can be useful to fiddle with the fast load properties of the OLE DB Destination. For example, if you want more commits when transferring data, you can change the Maximum insert commit size. If you really want to make an impact on the data flow performance, you can change the buffer size properties: DefaultBufferMaxRows andDefaultBufferSize. Most of the times it is recommended to enlarge the buffers, so that more rows can be processed at the same time. However, in some cases you get better performance with smaller buffers, for example with a slow source. Carefully test the performance of your packages using different buffer sizes before finally deciding which one to take. For more information: Data Flow Performance Features.