A skilled database developer might find it difficult to accept that other tools can increase productivity. When Microsoft released SQL Server 2005 Integration Services (SSIS 2005), I did not think that any tool could possibly make me more productive than the C# and T-SQL code I was writing by hand. After some reconsideration (and subtle persuasion from peers), I discovered that SSIS 2005 contained many features that indeed reduced my development time - WITHOUT sacrificing flexibility. Microsoft added new functionality in SSIS 2008 to make a strong product even better. In this article, I’ll present 13 different examples that demonstrate the power of Integration Services.

What’s on the Menu?

It’s impossible to cover all of the functionality in SSIS 2008 in one article, but I’ll show thirteen examples to demonstrate different SSIS functionality:

  • Basic SSIS package configurations
  • Joining two flat files by a common column
  • Creating Excel output with multiple worksheets
  • The SSIS Lookup transformation
  • The Fuzzy Lookup and Fuzzy Grouping Transformations, for inexact matches
  • The new Data Profiling task in SSIS 2008, to analyze column value distribution
  • Retrieve a variable number of CSV files from an FTP server and process the rows inside each file
  • Using SSIS tasks to control OLAP databases
  • The Baker’s Dozen spotlight: Performing a random sampling
  • Using stored procedures with SSIS to build a data extract scenario
  • Using the enhanced scripting capability in SSIS 2008 to call Web services
  • Performing basic database maintenance tasks
  • Handling variable runtime lists of master data

The SSIS Project for this Article

The SSIS project that I built for this article (Figure 1) contains 13 pre-built SSIS packages. Each package contains several tasks/components that demonstrate the power of SSIS. These tasks (Execute SQL Task, Foreach Enumerator, Aggregate, Conditional Split) offer built-in functionality that can save me development time.

Figure 1: The overall SSIS solution.

I’m assuming that the reader has basic knowledge of the Business Intelligence Development Studio (BIDS), knows the difference between BIDS and the full-blown version of Visual Studio, and understands the difference between an SSIS control flow and an SSIS data flow. However, if you don’t, the download project on my website (www.KevinSGoff.net, in the Downloads area) for this article contains some additional notes for those who need some additional introductory information.

Tip 1: The First SSIS 2008 Package and Package Configurations

Scenario: I want to create a basic SSIS package that will create a new SQL Server database called DemoEmptyDB (from an existing SQL script). The package will drop the database if it already exists. Additionally, I want to configure the database server in an SSIS configuration file so that the package will create the new database on the server name that I configured.

SSIS features I’ll use in this package:

  • Execute SQL task
  • OLE DB Connection Manager
  • SSIS Package Configuration Wizard
  • A free XML editor from AgileBI

At first, this sounds like an extremely basic SSIS package. However, it does demonstrate the concept of configuring the database server so that I don’t have to change the package if I want to create a database on a different server.

The package starts with two separate Execute SQL tasks on the control flow (Figure 2). In the Execute SQL task editor, I’ve set three properties:

Figure 2: Control flow for Package 01.
  • The Connection property (to my local instance of SQL Server 2008, which I named DBServerConnection). Note that in the Connection Manager for the connection, I specified a server name but left the default database blank. This is because I want the tasks to drop the DemoEmptyDB if it exists and then recreate it.
  • The SQLSourceType to File Connection.
  • The FileConnection property, which I set to an existing SQL script file, (DropDemoEmptyDB.SQL for the first Execute SQL Task, and CreateDemoEmptyDB.SQL for the second one). These scripts reside in the \SQLCode subfolder underneath the main project folder.

Note that the connection manager at the bottom of the control flow in Figure 3 lists three connections: one for the SQL database and two for the SQL scripts. SSIS 2008 creates a separate connection manager for every data source, file, folder, email server, and/or FTP server that I specify in a package.

Figure 3: SSIS Package Configuration Wizard for Package 01.

In the DBServerConnection, I set the server name to my local instance of SQL Server 2008. However, I might want to run the package for a different server, without needing to open the package and modify the database server name. This is where SSIS Configurations can greatly help. I can use the SSIS Configurations wizard to take almost any element of an SSIS package and build and external XML configuration file. In this case, I’ll start small and merely configure the server name of DBServerConnection:

  • First, select the Package Configurations option from the main SSIS menu pull-down, click the Add button, and then specify the name of the XML configuration file (Package01_Config.dtsConfig, stored in the CONFIG project subfolder).
  • Second, the next screen shows all the package properties that are available for configuration. I’ve selected the Servername property for the DBServerConnection (Figure 2). After clicking Next and Finish, I’ve finished the steps for creating a package configuration file.

From this point on, when I want to run this package (from a .NET program, or a SQL Server Agent Job, etc.) for a different database server, I only need to open the XML configuration file and change the server name. Since the XML file is just a text file, I could use Notepad. However, there’s a great (and free) XML configuration editor from AgileBI. You can download it from http://agilebi.com/cs/files/folders/17/download.aspx. Figure 4 shows an example of opening the configuration file to change the value of the server name. If I change the server name, the package will use that configuration file to determine the server on which the empty demo database will be created.

Figure 4: Free SSIS Configuration Editor from AgileBI.

Tip 2: Joining CSV Files

Scenario: I want to take two external CSV files and join them together by a common column. After I join the files together, I can write out the “joined” file to a new output CSV file.

SSIS features I’ll use in this package:

  • Flat file source
  • Sort transformation
  • Merge Join transformation
  • Flat file destination

SSIS allows me to join two flat files (CSV files) without the need to load the data into temporary SQL tables. In this example, I’ll take a CSV file called RegionList.CSV (containing region codes and region names) and a CSV file called StateList.CSV (containing state names and the corresponding region code). I want to join the two files, create one data flow in the process, and send the output to a new CSV file that will contain state names, region codes, and the region names.

Figure 5 shows the entire data flow for the package. (The control flow merely contains a reference to the data flow.) To create this package, I did the following:

Figure 5: Data flow for Package 02.
  • Step 1: I created two Flat File Sources (using the Flat File Source task) that pointed to the RegionList.CSV and StateList.CSV files.
  • Step 2: I used two sort transformations and sorted both input pipelines on region code. I’ll explain in Step 5 why this step is necessary.
  • Step 3: I dragged in a Merge Join transformation and connected the two input pipelines (from the two sort transformations) into the Merge Join transformation.
  • Step 4: Inside the Merge Join editor, I visually joined the two pipelines on the region code (the common column), and specified the columns from both input pipelines (Figure 6). Essentially, this dialog box is the virtual equivalent of writing a SQL query that joins a column from two tables and selects columns from both tables.
  • Step 5: So why did I say (back in Step 2) that it was necessary to sort both pipelines on region code? The reason is because the Merge Join transformation expects that both input pipelines will be explicitly sorted on the column I use in the Merge Join. Even if the two original CSV files were sorted on region code to begin with, the Merge Join transformation still expects me to sort the two input pipelines on the column used for the sort.
  • Step 6: Finally, I dropped an instance of a Flat File Destination into the package and connected the data pipeline from the Merge Join transformation to the Flat File Destination. The Flat File Destination prompts me to define a connection manager (i.e., a description and an output filename), whether to overwrite the output file if it already exists, the text delimiter, and whether column names should appear in the first row of the output (usually a good idea).
Figure 6: Merge Join Transformation Editor for Package 02.

Note that in Step 6, I hard-wired the output file name. Later I’ll show an example where the output will be dynamic (i.e., unknown until runtime).

Tip 3: Creating Excel Output

Scenario: I want to read from the AdventureWorks purchase order table and then write out a single Excel spreadsheet with two worksheets (one summarized by vendor and one detailed by Purchase Order).

SSIS features I’ll use in this package:

  • OLE DB source
  • Multicast transformation
  • Aggregation transformation
  • Excel destination

Figure 7 shows the data flow for this rather interesting package. First I’ll start with an OLE DB source that points to AdventureWorks 2008 and the following SQL command, which returns detail PO Header orders:

Figure 7: Data flow for Package 03.
select Vendor.Name, POH.PurchaseOrderID,
       POH.OrderDate, POH.ShipMethodID,
       POH.Freight, POH.TaxAmt, POH.TotalDue
FROM Purchasing.PurchaseOrderHeader POH
    JOIN Purchasing.Vendor
      ON POH.VendorID= Vendor.BusinessEntityID
ORDER BY Name, OrderDate

The goal is to create an Excel output file with two worksheets: one that summarizes these orders by Vendor, and a second that contains the same level of detail as the query above. So how can I do this?

The answer involves creating multiple copies of the input pipeline using the Multicast transformation. Newcomers to SSIS find the Multicast transformation to be one of the more confusing tasks - not because the transformation has so many options, but because it essentially has NONE! The transformation allows developers to take a copy of a single pipeline and make as many copies of the data in the pipeline as necessary.

In some instances, the developer will perform different tasks on the copies of the pipeline and then join the data BACK; in other situations, the developer will write different tasks on the copies of the data and output each copy to an independent destination. In this situation, I’ll do the former. So the steps are as follows:

  • Step 1: As mentioned above, I dropped an instance of an OLE DB source into the data flow and used the SQL query against the PurchaseOrderHeader and Vendor tables.
  • Step 2: I dragged in the Multicast transformation and connected it to the data pipeline. I’ll use the MultiCast to create two copies of the data pipeline from the original SQL query.
  • Step 3: For the summarized version of the output, I’ll drag in an Aggregate transformation. This transformation allows developers to perform the equivalent of a “group by” on any data in the pipeline. Figure 8 shows that I’ve grouped the first copy of the pipeline by Vendor Name, and I’m summarizing Tax Amount, Freight, and Total Due.
  • Step 4: Next I dragged in an Excel destination task and created an Excel connection. The Excel connection manager requires me to create a definition for a new worksheet (by clicking the New button), which includes modifying the CREATE TABLE statement to specify the actual worksheet name. I’ve called the new Excel spreadsheet PurchaseOrderOutput.xls (in the ExcelFiles project subfolder) and I’ve called the first worksheet name VendorSummary.
  • Step 5: Now comes the “cool” part - I want to create a second copy of the pipeline that will take the non-summarized data from the original pipeline and add it as a second worksheet to the spreadsheet I created in the previous step. This is actually quite easy - all I need to do is drag a second instance of an Excel destination into the data flow and connect the pipeline from the Multicast transformation to the second instance of the Excel destination. In the second Excel destination, I can select the Excel connection manager I created in Step 4, and then click NEW to create a second worksheet. Just like in the previous step, I have to modify the CREATE TABLE statement to specify the new worksheet name, which I called OrderDetails.
Figure 8: Aggregation Transformation Editor for Package 03.

Note: If you are using SSIS 2008 64-bit, you must turn off 64-bit mode in the SSIS project if you wish to run this package interactively inside of BIDS.

Tip 4: Lookup Task

Scenario: I have a large input CSV file where I need to perform lookups into existing database tables. If the lookup succeeds, I want to take values from the lookup “hit” and place them into the data pipeline for subsequent tasks.

SSIS features I’ll use in this package:

  • Data Conversion transformation
  • Lookup transformation

Perhaps the most common and popular SSIS data flow transformation is the Lookup. The Lookup transformation is actually quite simple: I can take data from any input pipeline (where the data originated in another database table, or a flat file, or a spreadsheet, etc.), and use the lookup transformation to see if one of the columns from the pipeline (or a composite list of columns) exists in a SQL Server table. There are, of course, a few rules:

Rule #1: The destination lookup table must be a SQL Server database table.

Rule #2: The lookup column(s) from the input source pipeline must be the exact same data type as the corresponding lookup columns in the destination lookup table.

Rule #3: By default, the lookup task expects to succeed. So if you attempt one thousand lookups and even one of the thousand fails, the entire lookup component will fail. Therefore, a developer must define some level of error handling in the lookup for instances where a lookup fails. The Lookup transformation in SSIS 2008 has four possible options for handling lookups with no matching entries:

  • Ignore failure (if you expect that some lookups might fail, such as an incoming list of products where some are new and some are existing).
  • Redirect rows to Error Output (if you expect that all lookups should succeed, and wish to redirect any lookup failures to an error output).
  • Fail component (while this is the default behavior, it should be avoided for production work).
  • Redirect rows to no match output (new in SSIS 2008, this is similar to the approach in SSIS 2005 of ignoring a failure and checking in a conditional split if a specific lookup output column contains a NULL value. The advantage is that this provides a split pipeline without the need for a conditional split.)

In this example (Figure 9), I went through the following steps:

Figure 9: Data flow for Package 04.

Step 1: I opened the CSV file InputProductNames.csv using a flat file source. The input file contains a product number that I’ll use to lookup into the Product table in AdventureWorks.

Step 2: Because the product number in the AdventureWorks Product table is stored as an nVarchar, I need to convert the product number in the input pipeline to the same data type (Unicode string). I drag in a Data Conversion Transformation and I convert the product number in the pipeline to a Unicode string.

Step 3: Next I dropped in a Lookup Transformation. The Lookup Transformation has several page options (General, Connection, and Columns).

The first page (General) allows me to set the Cache and Connection type options (I’ve kept the defaults) and how to handle lookup failures (I chose Redirect rows to no match output).

The second page (Connection) allows me to specify the “target” table in the lookup. I can select either an entire table, or a SQL query representing a table. I’ve chosen a simple query that retrieves just the ProductNumber, ProductID, and ProductName from the Product table. Why didn’t I just simply point to the regular Product table? Because it’s generally more efficient to only bring back the necessary column for the lookup.

The third page (Figure 10) allows me to specify the actual lookup criteria. Recall that I converted the input product column to a Unicode string (and I called the new column ConvertedProductNumber). In Figure 10, I perform the equivalent of a join into the destination (the derived table from the previous query). Note that at the bottom of the screen, I’m retrieving the ProductID and ProductName from the lookup destination and placing them into the pipeline. This allows me to continue the data flow with the results of the lookup as part of the pipeline.

Figure 10: Lookup Transformation options.

Step 4: At this point, I can drag two “split” pipelines out of the lookup transformation - those where the lookup matched and those with no match.

I’m going to show the lookup transformation again in this article - but even then, there’s an entire “other story” to the lookup transformation that I’ll cover in Part II of this article.

Tip 5: Fuzzy Lookup and Fuzzy Grouping Transformations

Scenario: I have an input file of names where names are duplicated but with incorrect spellings (for example, row 1 is Kevin Goff, row 2 is Kevin Golf). I want to identify the duplicates and synthesize the input into a smaller number of unique rows. Then I want to perform a lookup of the input rows into a database - where some of the names may exist in a database table with an exact match or an inexact match.

SSIS features I’ll use in this package:

  • Fuzzy Grouping transformation
  • Fuzzy Lookup transformation
  • Conditional split
  • Union transformation

When some developers hear the words “fuzzy lookup”, it almost sounds like a magic bullet. While the fuzzy transformations are great tools, they won’t solve ALL the problems of misspellings. However, both transformations will still help greatly in managing input files with spelling errors.

First, Figure 11 shows an example of a Fuzzy Grouping transformation. Suppose I have an input file with the following names and addresses (the full CSV file is FuzzyGroupingInput.CSV, in the CSVFiles project subfolder):

Figure 11: Fuzzy Grouping output for Package 05.
Figure 12: Fuzzy Lookup results for Package 05.
Kevin,Goff, 111 Main Street, Harrisburg, 17111
Kevin S.,Goff, 111 Main St., Harrisburg, 17111
Kevin,Goff, 111 Main St., Harrisburg, 17111
Kevin,Goff, 67 Albert Avenue, Phoenix, 87644
James,Goff, 643 Wilson Lane, Detroit, 93234
Jenny,Wilson, 561 Hamilton St, New York, 30091
Jennifer,Wilson,561 Hamilton St, New York, 30091

In the listing above, the “core” name of “Kevin Goff” appears four different times: three times (with slight variations) for the same zip code and slightly different spellings for street address, and one time for a Kevin Goff in a different state. Then I see the name James Goff. Then I see two variations for Jenny Wilson.

Common sense dictates that I am looking at four separate individuals. But how do I automate “common sense”? This is where the Fuzzy Grouping Transformation can help. Essentially, the Fuzzy Grouping will provide a similarity score (where I provide a minimum similarity threshold) for close matches with respect to a base, or canonical row. The transformation also provides key number pairs that allow developers to “link” the canonical row with other close matches - thus a way to group inexact matches. Here are the actual steps:

Step 1: I drag an instance of the Fuzzy Grouping transformation into the data flow and connect it to an existing pipeline (which could contain data from any source). The transformation contains three pages of options: Connection Manager, Columns, and Advanced.

Step 2: In the Connection Manager page, the developer specifies the SQL Server database where the transformation will temporarily store data. (So even though the source data could be a CSV file, the transformation still needs SQL database storage internally to perform the work.)

Step 3: In the Columns page, I define how I want the transformation to match the columns. I’ve defined a minimum similarity of .25 for all columns except the zip code. (For the zip code, I’m requiring an exact match.)

Step 4: In the Advanced page, I define an overall similarity threshold (the score for the row as a whole) along with the names of the columns that the transformation will add to the pipeline (a “key in” and “key out” key pair combination to group similar rows, and the overall similarity score). I’ve kept all the defaults except for the similarity threshold, which I’ve set to .35.

Step 5: Again, Figure 11 shows the results of the Fuzzy Grouping transformation. To interpret the results, note that the first three instances of Kevin Goff (in zip code 11111) all have a Key_Out column value of 2, but separate Key_In column values. The row with the same value for both columns (value of 2) represents the canonical row. Note that the two rows with different values have similar scores of .876 and .719, representing the overall similarity score with respect to the canonical row (which will always have a perfect score of 1). All three rows have a “clean” version of First and Last Name that comes from the canonical row.

Why did I toss a Multicast at the end? This is a popular debugging technique in SSIS if I want to merely see the results of the pipeline after a certain transformation. All I have to do is drop a MultiCast into the pipeline and then add a Data Viewer in between.

OK, so that’s the Fuzzy Grouping transformation - what about the Fuzzy Lookup transformation? Suppose I have the following input file of Product names and new list Prices, and I need to update the prices in the AdventureWorks product table based on the product names in this list.

LL Road Seat Assembly, 146.67
ML Touring Seat Assembly, 54.98
"Sport-100 Helmet, Blue" ,161.85
"Long-Sleeve Logo Jersy, XL" ,38.49

Initially, this looks like something I could accomplish with the lookup task from earlier. But there’s one small problem: the lookup for the last product will fail because the word Jersey is spelled as “Jersy”.

The approach to this problem represents a specific design pattern in SSIS when dealing with lookups on inexact matches. Because a “fuzzy” lookup is more resource-intensive than a regular lookup, the SSIS package will perform a regular (exact) lookup first - and only perform a fuzzy lookup for those lookups that failed. So here are the steps:

Step 1: Bring in a Flat File Source to read the CSV file FuzzyGroupingInput.CSV, and then use a Data Conversion task to convert the incoming Product Name to a Unicode string and the incoming ListPrice to currency. This is because the underlying columns in the SQL Server Product table use the same data types, respectively.

Step 2: Use a regular Lookup task to see if the incoming (converted) product name exists in the AdventureWorks product table. In the general page, set the No matching entries output to Redirect Rows to No Matching Output. In the columns page, set the Lookup Column/Output alias to be the ProductID (as OutputProductID). This will place the ProductID (from the resulting lookup) into the data flow pipeline.

Step 3: If the lookup fails (no matching output), then I can drop in an instance of the Fuzzy Lookup into the data flow. In the reference table page of the fuzzy lookup, I specify the SQL Server table that will be the fuzzy lookup destination. (Note: The destination of the fuzzy lookup must always be a database table.) In the columns page of the fuzzy lookup, I specify essentially the same lookup information at the top of the page (that I want to lookup into the Product table’s Name column based on the same incoming and converted product name that failed the first lookup). At the bottom of the columns page, I will specify the ProductID as the output column for the pipeline, with an alias of FuzzyOutputProductID. This is essentially the same process that I took in the lookup task (of specifying an output column), except that SSIS does not allow me to name the output column alias the same as I did in Step 2. So instead of calling the output alias OutputProductID, I’ve called it FuzzyOutputProductID.

Step 4: Finally, in the Advanced page of the fuzzy lookup, I specify a minimum similarity threshold of .30.

After SSIS performs the fuzzy lookup, the transformation places two new columns into the pipeline: a similar score and a confidence score. These scores represent how similar the input column is to the actual database row with the closest match, and the confidence level SSIS has that the row represents the “actual” row.

After performing the fuzzy lookup, I need to evaluate the score. If the score for both the similarity level and the confidence level exceeds a certain amount (e.g. 70%, 75%, etc.), I want to put those fuzzy matches back in the pipeline, along with the FuzzyOutputProductID. The fuzzy lookup transformation adds two new columns into the pipeline: _Similarity and _Confidence. So I drop an instance of a Conditional Split transformation, and I add a condition called “Successful Fuzzy Lookup” with a condition syntax as follows (note that SSIS uses the && logical operator for a logical AND).

_Similarity >= .7 && _Confidence >= .7

Step 5: Finally, I want to use a UNION transformation to append the Successful Fuzzy Lookup transformations with the Successful Lookups from Step 2. In the UNION transformation, I need to map each of the columns from the pipeline from Step 2 with the pipeline from Step 6. All of the columns from the two pipelines have the same name, except for the last column (OutputProductID and FuzzyOutputProductID). Fortunately, the UNION transformation allows developers to map two different columns names into a common column name moving forward.

Finally, I need to look at the OLE DB command to perform an update. This will require three steps: First, in the OLE DB editor, I set the connection on the Connection Manager tab. Second, I specify a SQL command in the SqlCommand property of the Component properties tab. Note that in the SQL command, I need to write the syntax as follows:

_Update <table> set ColumnA = ? ,
       ColumnB = ?
       where ColumnPK = ?

Third, in the Column Mappings page, I map (thus the name) each of the three parameter question mark placeholder above to the columns from the pipeline.

Tip 6: Data Profiling Task

Scenario: I receive the results of a survey application as an Excel table. I want to produce some basic analysis of each survey question and the % of people who replied to each possible response. For instance, I want to see that 62% of people responded with answer ‘A’ on question 1, 18% responded with answer ‘B’ on question 1, etc.

SSIS feature I’ll use in this package:

  • Data Profiling task (new in SSIS 2008)

Many database developers need to write queries to “get a handle” on the values in certain database tables. I know that when I work with a database for the first time, I might want to examine how values are distributed, what percentage of rows are NULL for a particular column, etc. This is known as writing profile queries.

SSIS 2008 provides a new task called the Data Profiling task. This task can reduce the amount of work necessary to generate the output for these profile-type queries. I can provide the name of a database table to the new task, and the task offers many different profile options. I’ll focus on one, the Column Value distribution. If you want information on the others, SSIS Expert Jamie Thomson has written some great information on his development blog (just Google on “Jamie Thomson” and “Data Profiling task”).

One of my personal favorites is the Column Value distribution. Periodically I need to query survey and questionnaire tables and produce results based on the shakeout of values (e.g. 65% of respondents answered “A” to question 1, 15% answered “B”, etc.).

In this example, I’ll take a CSV file (SurveyQuestions.csv, from the project CSVFile folder) that contains 10 survey questions and 20 surveys. The data looks something like this:

SurveyID, Q01, Q02, Q03, Q04 (up to Q10)
1, 5, 1, 5, 4
2, 4, 2, 4, 4
Up to surveyID 20

To generate a result (to an XML file) that shows the breakdown of values for each of the ten questions, I’ll perform the following steps (the overall flow of the package is in Figure 13):

Figure 13: Control flow for Package 06.

Step 1: Use an Execute SQL task in the control flow to create a table (SurveyTable) in SQL Server. The table will hold the results of the surveys in the CSV file.

Step 2: Create a simple data flow that opens the CSV file with a flat file source connection and inserts the data into the SQL SurveyTable using an OLE DB destination. (The Data Profiling task I’ll use in the next step will only read database tables, so I can’t use the CSV file directly.)

Step 3: Drop an instance of the Data Profiling task into to the control flow. The Data Profiling task contains three page options: General, Profile Requests, and Expressions.

In the General page, set the DestinationType property to FileConnection, and set the Destination filename to SurveyXMLProfileResults.XML.

In the Profile Requests page (Figure 14), add a Profile type of Column Value Distribution, create a connection manager to the database containing the survey entries, select the Table (SurveyResults), and select all columns (“*”).

Figure 14. Data Profile Request options page.

After executing the package, SSIS will generate the output file SurveyXMLProfileResults.XML. I could certainly develop a report in SQL Server Reporting Services against the XML file. But if I’m merely looking for a quick top-line analysis of the file, SQL Server 2008 provides a great utility called DataProfileViewer.EXE, to read the results of a Data Profile task (Figure 15).

Figure 15. Data Profile Viewer for viewing results of Data Profiling Task in Package 06.

Tip 7: Reading Data from FTP and Performing Database Inserts/Updates

Scenario: I have a nightly routine where I read CSV data from an FTP server - I might pull down one file or ten files. I want to open each file, one by one, read the contents, and either insert or update existing tables based on the input rows. At the end, I want to send an email to a manager with the number of files processed, the number of rows inserted, and the number of rows updated. As the email requirements exceed the capability of the built-in SSIS Send mail task, I will have to write a mail handling task via an SSIS script in C#.

SSIS features I’ll use in this package:

  • FTP task
  • Foreach task and Foreach File enumerator
  • Script task
  • Row Count transformation
  • OLE DB destination
  • OLE Command transformation
  • Precedence Constraint options
  • Script task
  • DelayValidation, for situations where I don’t know a filename until runtime

Whether the year is 1995 or 2005, or even now (2010), FTP (File Transfer Protocol) remains a tried-and-true method for sending/receiving files between computers over TCP/IP.

Figures 16-19 inclusive cover a rather lengthy package for a fairly common scenario: receiving and processing multiple files from an FTP source. Figure 16 shows the list of package variables, Figure 17 and Figure 18 show the control flow and data flow, and Figure 19 shows the script editor interface for one of the required package scripts.

Figure 16. Variables for Package 7.
Figure 17: Control flow for Package 07.
Figure 18: Data flow for Package 07.
Figure 19: Script Task Editor for defining script variables (Package 07).

I’ll jump right to the steps:

Step 1: Because this package must process many pieces of information that aren’t known until runtime, I’ll start by creating package-wide variables (in the SSIS…Variables menu drop down). Figure 16 shows the list of variables, data types, and default values (where applicable).

Step 2: In the control flow, I want to create two local folders (a location to save files from an FTP server, and a location to place the files after processing them). I’ll drop two instances of the File System task. This task allows me to create folders, delete folders, copy and move files, etc. In both instances, I’ll set the Operation property to Create directory and I’ll use the package variables FTPLocalWorkArea and FTPLocalProcessed for the values of the two folders in the two File System tasks.

Step 3: Next, I’ll drop an instance of the FTP task into the control flow. The FTP task requires an FTP connection manager, where I’ve used the two variables for FTPServer and FTPAccount for the FTP connection manager. The interface is such that I almost have to initially place static values in the interface - and then go to the property sheet for the FTP connection manager, bring up the Expressions property, and map the properties ServerName and ServerUserName to the two variables listed above.

Step 4: In the FTP task itself, I’ve set the operation to Receive Files, the Remote path (on the FTP server to /SSIS2008Test/Download/*.csv), the IsLocalPathVariable property to True, and the LocalVariable property to the package variable FTPLocalWorkArea.

Note that in Figure 17, I’ve placed the first three tasks in a Sequence Container. This allows me to treat all three tasks as a single operation (though each task will run in sequence), and only proceed to the next step if every step in the first three tasks runs successfully. (In package 10, I’ll show how I can also apply transaction isolation levels to a sequence container.)

Step 5: Next, I'll drop an instance of the For Each Loop Container into the control flow. Why? Well, suppose I ran this package right now - if there were five CSV files (waiting to be processed) on the FTP server, I’d pull down five files into the folder specified by the value in the FTPLocalWorkArea folder. Of course, tomorrow it could be ten files. So I want to scan through the files in the FTPLocalWorkArea folder. To do this, I must drop an instance of the For Each Loop Container into the control flow. The For Each Loop Container is a highly versatile task that allows developers to iterate through different object types. In this situation, I can define the task’s Enumerator property as a Foreach File enumerator. Also, I need to set the scan folder to C:\FTPCurrencyfiles\WorkArea and the File Mask to “*.CSV”. (Later I’ll configure these.)

(In the final package of this article, I’ll demonstrate another valuable enumerator, the ADO enumerator.)

Step 6: The next step is arguably one of the most important parts of this package (yes, “all” parts are important). The Data Flow task will need to open every incoming CSV file, one at a time, and read individual rows. The challenge is that I can’t provide a static file name for the opening step of the data flow (since the task will be processing many files with different names). I can address this by mapping my package variable ExchangeRateFile in the third page of the ForeEach Loop editor. This step will effectively populate my package variable with the name of each CSV file as SSIS processes it (thus the concept of a “ForEach” loop). I’m not finished with the processing of “dynamically” specifying the name of the input file, but this was an important first step.

In the Data Flow task, I drop an instance of a Flat File Source. The interface initially requires me to specify a filename. So my design-time approach is to point to “one” of the files that will be processed (so that SSIS can “discover” the columns that will be common for every input file encountered at runtime). However, after completing that task, I open the full-blown property sheet for the connection manager and I set the Expression for the Connection String property to the same package variable (ExchangeRateFile) that I worked with in the previous step. Additionally, I also set the DelayValidation property to TRUE - this is a common property to set (to TRUE) when a specific SSIS component should not be validated until runtime.

Step 7: Because I’ve covered the process of data conversions and lookups, I’ll cover the next few steps quickly. I need to take each incoming instance of From Currency, To Currency Code, and Exchange Rate Date and perform a lookup into the Adventure Works Sales.CurrencyRate table. Since the incoming data is in a CSV file and the destination table lookup columns are Unicode and Date columns, I need to perform the necessary data conversions. Also, if the lookup succeeds, I want to use a conditional split to determine if either of the incoming rates (ConvertedAverageRate or ConvertedEndOfDayRate) differ from the rates in the database table (OutputAverageRate and OutputEndOfDayRate, the two columns I defined as output columns in the Lookup transformation). If the values are the same for any row, I don’t want to perform an update (there would be no reason to!)

Step 8: Before performing an insert (for a new From/To Currency and Exchange Rate Date) or an update (for an existing one), I want to capture the # of instances. I will place two Row Count transformations into the pipeline and map the row count to the two package variables RowsAdded and RowsModified). Note: SSIS will overwrite the values in these variables when the Row Count transformation re-fires for every file; therefore, I’ll need to accumulate these variables back in the control flow with a simple SSIS script (in Step #12).

Step 9: Back in the control flow (now that I’ve designed the entire data flow), I need to add a simple File System task to move the file that SSIS just processed (stored in the variable ExchangeRateFile) to the folder whose name is in the variable FTPLocalProcessed.

Step 10: The next step may come as a surprise to some people (it was certainly a surprise to me). A single run of this package might process 10 files, with 5 new rows in each file and 6 modified rows: so the grand total number of new rows and modified rows would be 50 and 60, respectively. As I mentioned in Step 10, the SSIS Row Count transformation will not accumulate variable values - so each time the transformation fires, it will overwrite the value of a row count variable and not retain the older value. My only recourse is to write a simple SSIS script to capture the value of the two variables (RowsAdded and RowsModified) in the middle of each ForEach iteration, and accumulate the values into accumulator variables TotalRowsAdded and TotalRowsModified. (I’ve continued this on the next step.)

Step 11: I drop an instance of an SSIS script task into the control flow. In the Script editor (Figure 19), I need to tell SSIS which variable the script will read and which variables the script will write. (A minor enhancement in SSIS 2008 is that I can select the variables from a list: in SSIS 2005, I had to type in the variables). I can also define whether the target language for the SSIS script will be Visual Basic 2008 or Visual C# 2008. (SSIS 2005 forced developers to write scripts using Visual Basic.) After I specify the script language and the variables that the script will use, I can click the Edit Script button in Figure 19. SSIS will launch Visual Studio Tools for Applications 2.0, which represents the new script engine and script environment in SSIS 2008. The script editor loads a shell of a Main function and I can enter the entire SSIS script to accumulate the variables (Listing 1). Note the general pattern: I set/read the values of the accumulator variables by referencing them in the Dts.Variables collection.

Step 12: This is a long package, but I’m almost done! I’ve already shown a few common SSIS patterns that I can reuse for future packages. There’s just one more task to go: sending an email at the end. SSIS provides a Send Mail task that is very helpful for sending emails to Exchange Server and other mail server recipients. However, it is one of the few SSIS Control Flow tasks that often falls short as many times as it shines. In this instance, I need to send a notification email to a Gmail recipient. The Gmail SMTP server has some authentication requirements that the Send mail task cannot accomplish. (Another instance where the SSIS Send Mail task fails is when the SMTP server requires a custom port number.) Fortunately, I can write another SSIS script that will use .NET to create an SMTP object manually. Listing 2 shows an example where I create new SMTP and MailMessage objects, and set the necessary network credentials for Gmail.

Tip 8: OLAP Processing Tasks

Scenario: I have an OLAP database cube (built using Microsoft Analysis Services). The cube utilizes a staging area as its data source. Any time the import jobs encounter new master rows (for example, new products, new customers) and new sales data, I want to process these as new dimensions/facts in our OLAP cube.

SSIS feature I’ll use in this package:

  • Analysis Services Processing task

When database developers create Microsoft OLAP database cubes using Analysis Services, they often ask about different ways to update a cube when new dimensions or facts (measures) come in.

For instance, suppose I build an OLAP cube for the first time with two years of historical data. When new data comes in (daily, weekly, etc.), I certainly do not want to launch the SSAS project and rebuild the cube, just for one day/week of new data. Instead, I’d like to incrementally update (or “trickle-feed”) the cube.

Fortunately, SSIS provides a very special task for this very issue: the Analysis Services Processing task. Assuming that I’ve built an OLAP database, and that OLAP database has a data source defined (for example, a data warehouse, a staging area, and so on), I can drop an instance of an SSAS processing task and define what parts of the OLAP database I wish to process, and how to process them (Figure 20).

Figure 20: Updating OLAP dimensions.

There are many possible options I can set in this interface: here are three fairly common scenarios:

Process Update on dimensions: suppose I have five new customer accounts and three customer accounts with important attribute (column) changes, all in my data source/staging area. A process update on that specific dimension will update the OLAP dimension with the new/changed accounts. (Note: A Process Index on the dimension may be needed on a regular basis.)

Process Incremental on fact tables: In the example above, if I have two years worth of data in my OLAP fact tables and I need to incrementally “trickle-feed” a day or week’s worth of new data, I can specify Process Incremental for that specific fact table. (Note: It may be necessary to also specify a T-SQL query to extract just those rows that are new.)

Process Full on fact tables or the database as a whole: if I am doing a full restatement or full load of the database (and any existing data is to be wiped out), I would chose Process Full.

A developer might run such a package conditionally, based on whether a load of new customers, new orders/transactions actually yielded any data. (If a job yielded no new data, then there probably would be no need to run one of these SSAS processing tasks.)

In a future Baker’s Dozen article, I plan to walk through the steps for building an OLAP database.

Tip 9: Baker’s Dozen Spotlight: A Complex Random Sampling Package

Scenario: I want to retrieve a random sampling of 30% of orders from AdventureWorks. From the sampling, I want to determine the average order amount. Any sampling orders with a dollar amount exceeding the average will go to an output called BIGORDERS.CSV, and all other sampling orders will go to an output called SMALLORDERS.CSV. Finally, I want to send both files via email with information on the two files in the email body.

SSIS features I’ll use in this package:

  • Random Sampling task
  • Script component (in the data flow)

Sometimes SSIS developers joke that they need to apply some “elbow grease” to accomplish a certain SSIS task. (This phenomenon certainly isn’t restricted to SSIS!) This package (Figure 21) illustrates how I need to “bend” SSIS - the good news is that SSIS is indeed bendable. The tricky part of this package is performing an aggregation (the average dollar amount from a random sampling) and “jamming” the average back into the pipeline as well as into an SSIS variable. So I’ll dive right into the tasks:

Figure 21: Data flow for Package 09.

Step 1: In the data flow, I’ll start with a basic OLE DB source that reads Adventureworks for Orders by Vendor (Listing 3). Note a few things. First, I’ve included a tinyint column called ColJoin and I’ve defaulted the value to 1: I’ll use this later when I need to aggregate orders by average amount and then jam the average back into the pipeline, through a Merge Join transformation (down in Step 6). Second, I need to make sure the pipeline data is explicitly sorted by ColJoin (for the Merge Join to work): So I went to the OLE DB Source Advanced Editor and clicked on the Input and Output properties tab. In this tab, I set the IsSorted property to true, then expanded the Output columns and set the SortKey position for ColJoin to 1. (Alternatively, I could have used a Sort transformation.)

Step 2: Next, I drag in a Percentage Sampling transformation and configure the transformation to create a pipeline based on 25% of the rows. The transformation will split the pipeline into two pipelines: one for the 25% random sampling and another for the unused 75%. Note that I could have used the TABLESAMPLE clause back in the SQL query but chose to show it here for demonstration purposes.

Step 3: Next, I want to determine the average order amount from the 25% random sampling and then do two things: jam that average back into the pipeline, and also place the average into an SSIS variable (AverageOrder). This is where I need to apply the “elbow grease”: SSIS doesn’t provide a built-in manner for me to accomplish it, but doesn’t stand in my way. In plain English, I’ll use a Multicast (in Step 4) to create two copies of the pipeline. In Pipeline copy A, I’ll run an aggregation task (also in Step 4) to average the sampling of orders and group by ColJoin (and recall, ColJoin has a value of 1 for every row). The aggregation transformation will output a single row pipeline containing the average order amount and the single ColJoin value of 1. Then I’ll create a data flow script component (in Step 5) to “capture” the single aggregation average into the SSIS variable AverageOrder. Finally, I’ll use a Merge Join (in Step 6) to join the single row back with PipeLine Copy B from the Multicast (thus “jamming” the single value back into the pipeline).

Step 4: So to begin my little odyssey, I’ll drag in a Multicast and will eventually create two data pipelines from the 25% random sampling.

I’ll take one copy from the Multicast and will drag in an Aggregation transformation. The Aggregation transformation allows me to implement the equivalent of a GROUP BY on data in the pipeline. The transformation lists all columns in the pipeline: I select TotalDue and define the Operation as Average (as opposed to Sum), and then I select ColJoin and define the Operation as Group. Since every row contains the same value for ColJoin (value of 1), the Aggregation task will create a one-row, two-column pipeline.

Step 5: Next, I’ll drag in a script transformation to capture the value of the sampling average order from the one-row pipeline into a variable. In the script transformation, I need to perform three tasks. First, I must define the variable AverageOrder as a ReadWrite variable, and I need to define the AverageTotalDue from the pipeline as an Input Column. Second, I need to enter in a script (Listing 4) that will capture the value from the pipeline input row (using a hook into the SSIS pipeline function Input0_ProcessInputRow, and place into a local temporary variable. In the same script, I’ll tap into the PostExecute script function to read the local temporary variable into the actual SSIS variable. The reason for the two functions (and the local temporary variable) is because neither function will recognize both the input row and the SSIS variable. Finally, in the Input and Output properties tab, I go to the Aggregate Output, define IsSorted as True, and then select the ColJoin column inside the Output columns and set the KeyPosition to 1. (Once again, I need to do this to eventually join the ColJoin back into the main pipeline, as the Merge Join requires an explicit internal sort.)

Step 6: And to end the odyssey, I’ll drag in a Merge Join transformation so that I can force (I’ve grown tired of using the word “jam”) the average sampling order value back into the pipeline. This is why I created the single value ColJoin column to begin with, so that I could join the average order value back into the pipeline. In the Merge Join editor, I join all the columns from the second copy (PipeLine Copy B) of the Multicast with the single row copy (PipeLine Copy A), on the ColJoin column.

Step 7: For this point on, it’s “business as usual” to finish off the package. I drop in a Conditional Split transformation and I split the pipeline two ways (Less than Sampling Average and Greater than Sampling Average), where the Condition is the TotalDue column either being less than or greater than/equal to the AverageTotalDue (which is now in the pipeline).

Step 8: For the two split pipelines, I use two Row Count transformations to capture the number of rows into the SSIS variables NumberAboveAverage and NumberBelowAverage. Finally in the data flow, I use two Flat File Destination components to output the two pipelines into the two CSV files LargeOrders.csv and SmallOrders.CSV. An important note on the two output files: the Vendor Names (all the way back in the original pipeline result) sometimes have commas, which can wreck havoc on a CSV file (since every column is delimited by a comma). I can address this issue in the Flat File connection Manager Editor (for both output files) by specifying a double quote in the Text Qualifier.

Step 9: Finally, back in the control flow, I can send an email with both files. Once again, I’ll use a script task back in the control flow to send the email, given the limitations of the built-in Send Mail task. As I already covered this back in Tip 7, I won’t repeat all the specifics. The only thing to remember is that I can create an attachment object in the script code (see below), and then create a message body that references the variables for the order counts and the sampling order average.

Attachment myAttachment = new
// then create the MailMesage object, and
// add any attachment objects
int NumberBelowAverage =
// repeat this to create local variables
// for NumberAboveAverage and AverageOrder
string EmailMessage =
   "# Rows Below Avg: " +
       NumberBelowAverage.ToString() + "\r\n" +
   "# Rows Above Avg: " +
       NumberAbove.ToString() + "\r\n" +
   "\r\n" +
   "$ Average from Sampling: " +

Tip 10: Using Stored Procedures with SSIS

Scenario: I want to read all of the orders that have been processed since the date of our last extract (using a stored procedure). For all the orders read, I want to identify the unique list of vendors and write out a separate CSV file for each vendor, summarizing the orders by ship method (using a 2nd stored procedure). Finally, I want to write out an extract log entry so that the next run of the package will only retrieve orders processed since this most recent run.

SSIS features I’ll use in this package:

  • Execute SQL task
  • Sequence Container task and transaction isolation levels
  • Microsoft Distributed Transaction Coordinator
  • Foreach task with a Foreach ADO enumerator

Note: This package requires several stored procedures and UDFs that I’ve included in the folder C:\SQL2008IntegrationServicesExamples\SQLScripts. Another package (Package10_a_) will run the scripts to create the necessary stored procedures and UDFs in the DemoEmptyDB database that I initially created back in the first package. The procedures and functions are as follows:

  • dbo.GetLatestExtractDate(): a scalar UDF that will read the OrderExtractLog table and return the Max(ExtractDate). If the maximum extract date is NULL (i.e., the OrderExtractLog table is empty), the UDF will return an extremely old date that would precede the earliest transaction in the database. Note that the UDF returns a string and not a datetime, which I’ll explain at the end.
  • dbo.GetOrdersSinceLastExtract: a stored procedure that receives a datetime as a parameter, and retrieves all orders from the OrderDemo table with a LastUpdateDate that is more recent than the datetime parameter. (Essentially, this procedure retrieves all orders that have been added/modified since the last extract.)
  • dbo.GetCurrentServerDate: a scalar UDF that returns a string representation of the SQL Server GetDate() function.
  • dbo.InsertOrderExtractLog: a stored procedure that inserts a new row into the OrderExtractLog, based on a date parameter.

Some people initially think that SSIS replaces database stored procedures. As it turns out, nothing could be further from the truth - I can incorporate stored procedures into SSIS packages for a basic but functional incremental extract scenario.

Step 1: In the control flow, I’ll create two string variables: LastExtractDate and NewExtractDate. I’ll explain later why I created these variables as string variables and not DateTime variables.

Step 2: Next, I’ll drag in a Sequence Container from the toolbox. In the Sequence Container properties, I’ll set the Transaction Option to Required and the Isolation Level to Serializable. (I covered the SQL Server transaction isolation levels in the September/October 2008 issue of CODE Magazine).

Note that I also must turn on the Microsoft Distributed Transaction Coordinator Service (DTC) in Windows Services.

Step 3: Now I’ll place a series of execute SQL tasks inside the Sequence Container. SSIS will run these tasks inside the sequence container as a single transaction. For the first Execute SQL task, I’ll call the UDF that determines the last extract date and I’ll return it into the SSIS variable LastExtractDate. Specifically, I’ll place the syntax below into the SQL statement property, I’ll set the ResultSet property to Single Row, and I’ll go to the Result Set tab to map the name of the query result set column (LastExtractDate) to the SSIS variable of the same name.

select dbo.GetLatestExtractDate ()
             as LastExtractDate

Step 4: In this step I’ll add a Data Flow task where I’ll retrieve all orders with a modified date more recent than the date of the last extract. In Figure 23, I drop an instance of an OLE DB source into the data flow, and inside I call the stored procedure dbo.GetOrdersSinceLastExtract. Note that I need to specify a parameter token placeholder (the question mark) in the expression, because the SQL expression does not directly recognize SSIS variables. I can resolve the question mark token in the Parameters section, where I reference the actual stored procedure parameter name (@ExtractDate) with the SSIS variable. This is a general issue in SSIS: I cannot call stored procedures and directly reference SSIS variables or input pipeline columns, and must resolve them in the Parameters tab.

Figure 22: Control flow for Package 10.
Figure 23: Data flow and executing stored procedures for Package 10.

Step 5: The remainder of the data flow task will capture the number of extracted rows into the SSIS variable, NumRowsExtracted, and will write out a basic CSV file.

Step 6: Back in the control flow, I’ll add another Execute SQL task, and I’ll call the scalar UDF dbo.GetCurrentServerDate(), which I’ll map to the variable NewExtractDate. Note that I also set an option in the precedence constraint into this task so that the task will only fire if the @NumRowsExtracted variable has a value greater than zero. (I only want this task and the next task to run if the data flow extracted rows.)

Step 7: Finally, I want to insert the newest extract date and the number of extracted rows back into the Extract Log. So I’ll call one last instance of the Execute SQL task. I’ll set the SQL statement to the following code (below), I’ll set the Result Set property to None, and in the Parameter mapping I’ll map the two stored procedure parameters (@ExtractDate and @NumberRowsExtracted) to the two SSIS variables (NewExtractDate and NumRowsExtracted).

exec InsertOrderExtractLog ? , ?

OK, so why did I treat the datetime variables as strings? Simple - the SSIS datetime variable doesn’t handle milliseconds properly when passed between SSIS and SQL Server. I can resolve this by casting a SQL datetime as a varchar and then using the datetime style of 21 (ODBC canonical milliseconds):

SELECT convert( varchar, getdate(),21)

Tip 11: SSIS 2008 with Web Services

Scenario: I want to call a web service from within an SSIS package and bring the results into the SSIS data pipeline.

SSIS feature I’ll use in this package:

  • Script task

Consuming XML web services in SSIS 2008 is much easier than it was in SSIS 2005. This is due to Microsoft’s rewrite of their scripting engine. The SSIS 2008 scripting engine is powered by Visual Studio 2008 Tools for Applications. I could devote an entire Baker’s Dozen article to examples that are now possible in SSIS 2008. For now, I’ll show an example where I want to call a public web service (www.webservicex.net/CurrencyConvertor.asmx) to obtain current exchange rates between the U.S. dollar and other currencies (Figure 24). Here are the steps:

Figure 24: Calling a web service in the data flow for Package 11.

Step 1: In the Data Flow task, I’ll query the AdventureWorks Sales.Currency table for collection of currency codes (AUD, CAD, EUR, GBP, JPY, MXN, etc.).

Step 2: I’ll drop an instance of a script component transformation into the data flow. In the script component editor, I’ll define the Input columns as the FromCurrencyCode (which will be the U.S. Dollar for every row) and the ToCurrencyCode. In the output columns area, I’ll add a new decimal column ConversionRate for the pipeline. (The script will call the web service and populate the conversion rate for each FromCurrencyCode/ToCurrencyCode instance.)

Step 3: In the actual script (which I’ll write in C#), I’ll open Project Explorer, right-click on web references, and use the interface to discover the web service www.webservicex.net/CurrencyConvertor.asmx. I’ll call the web service reference ConvertorWebReference. SSIS creates a new internal namespace that I’ll reference in the next step.

Step 4: At the very top of the script, I’ll need to reference the namespace that SSIS created. To determine the actual namespace name, I need to open up the Class View (click the main View dropdown and select Class View). The Class View will list the full name, which you can reference as a using statement in the top of your script code (see the top of Listing 5).

Step 5: In the script code (Listing 5), I create local variables from the input row (in the hook for the function Input0_ProcessInputRow), create an object instance of the web reference, call the web service, and return the value of the conversion rate to the output column (ConversionRate) back in Step 2.

Tip 12: Scheduling a Backup Package with SQL Server Agent

Scenario: I want to schedule a backup and then run it on a schedule.

SSIS features I’ll use in this package:

  • Backup task
  • Update statistics

Figure 25 shows this very basic package where I can drag several control flow tasks from the Database Maintenance Plan tasks category of the toolbox. These tasks include not only backing up a database and updating database statistics, but also rebuilding/reorganizing indexes, shrinking a database, and several others.

Figure 25: Control flow for Package 12.

To schedule this package (or any package), I can do the following:

Step 1: Open up SQL Server Agent (in SQL Server Management Studio) and right-click to create a new job.

Step 2: Provide a job owner for the job to run unattended. This is often a system account, special admin account, depending on the decision of the I.T. manager or DBA.

Step 3: Go to the Step tab, click NEW to add a new step, and select SQL Server Integration Services for the step type (note that Agent provides many different step types). At the bottom of the screen in the General area, specify the location of the package.

In the schedule tab, define when the package/job should be executed (for example, every night, Sunday night at 4 AM, etc.).

Tip 13: Dealing with a Variable Number of Data-Driven Iterations

Scenario: Suppose I have a product category table with a different category manager email address for each category. I want to retrieve orders for each category, generate a summarized output file, and email the output file to the category manager. If I add any new categories and managers to the product category table, the package should pick up these new entries and continue to work without modifying the package.

SSIS features I’ll use in this package:

  • Execute SQL task with a result set
  • Object variables
  • Foreach ADO enumerator
  • SQL query with parameter markers

I nearly made this tip the “Baker’s Dozen spotlight” for one reason. Just like developers often view examples that deviate wildly from best/recommended practices as “fingernails on a chalkboard,” one of the common SSIS examples that makes me want to scream is defining enumerations in a conditional split. For instance, suppose I read a set of rows into the data flow pipeline and need to split the input pipeline five ways based on customer status. What happens if my client adds a sixth customer status? I’d need to modify the package to handle it.

Or take another example - I once saw an SSIS package that needed to read transactions and organize them into four category managers (and send an email to each category manager). The package hard-coded the four categories into the conditional split transformation. Yes, this “works” - but there is a better way that will continue to work if new categories and category managers are added to the database.

The solution (using AdventureWorks) involves a small level of abstraction (Figure 26): I’ll use the Execute SQL task to read all of the category rows from the Category table into an object variable (which is similar to an in-memory array). Then I’ll iterate through each item in the object/array using an ADO enumerator (and place the value of each item into a variable). In the data flow, I’ll read all the orders where the category equals the value in the variable, and write out the results to a new CSV file that contains the name/value. Finally, I’ll send an email to the category manager that was part of each category in the original object variable. This approach is bit more complicated but requires less maintenance. So I’ll dive right into the steps:

Figure 26: Control flow for Package 13.

Step 1: I’ll create some variables: an object variable called ProductCategoryRecord, an Int32 variable called ProductCategoryID, a string variable called ProductCategoryName, and a string variable called ManagerEmailAddress.

Step 2: I’ll drag in an instance of an Execute SQL task into the control flow. The query will simply bring forward the ProductCategoryID, ProductCategoryName, and ManagerEmailAddress from the Production.ProductCategory table.

Step 3: I need to configure the Execute SQL task to read multiple rows into the ProductCategoryRecord object. So I’ll set the ResultSet property to Full Result Set. Additionally, in the Result Set tab, I’ll set the variable to the ProductCategoryRecord object, and the Result Name to 0. (The reason for the zero is because SQL result set indexers are zero-based.)

Step 4: Next, I’ll drop an instance of the Foreach loop into the control flow. In the Foreach loop, I’ll define the enumerator as the Foreach ADO enumerator, and I’ll define the object source variable as the very same ProductCategoryRecord that I just populated. Finally, since the object variable array actually holds three columns (ProductCategoryID, ProductCategoryName, and ManagerEmailAddress), I need to map them (in the Foreach loop Variables Mapping page) to the three SSIS variables of the same name. The index will be 0, 1, and 2, and I MUST reference the variables in the same order as I referenced them in the initial SQL query!

Step 5: At this point, the package is pretty much “business as usual.” The only other major topic to mention is that in the data flow, when I retrieve orders for each individual category ID, my SQL query in the OLE DB source (Figure 27) cannot reference the ProductCategoryID variable (for the current product category in the loop): once again, I must use a question mark token placeholder and then reference the variable in the parameters page.

Figure 27: Data flow for Package 13.

Step 6: The remainder of the package will utilize the ProductCategoryName and EmailAddress variables to dynamically define the flat file connection manager connection string and email recipients.

My Next Article

Well, after nearly two years, I made it back. And I owe a ton of gratitude to CODE Magazine’s Editor-in-Chief, Rod Paddock, for allowing me to come back. Some have asked what happened to me - well, I became a father and had to shift my priorities for a period of time. But I’m back.

This is the first in a series of articles I plan to write on using the tools in the Microsoft SQL Server Business Intelligence area (often known as the “BI stack”). In my next article, I’ll show some more advanced uses of SSIS, will cover some Transact-SQL in SQL 2008, and will walk through creating some packages for a custom project. (Essentially, over the next year, I plan to build out a modest-sized data warehouse and BI project.) Some may read this article and ask, “Why didn’t he cover SSIS logging?....Why didn’t he cover some of the more advanced Lookup functions?...etc.) Stay tuned!

With each passing year, I talk to more and more .NET developers who tell me that they want/need to learn more about the SQL BI tools. I hope readers will find this article and future Baker’s Dozen articles to be useful.