Retrieving data into a file is good, but using it in an SSIS package is even better. The next example demonstrates how you would use the XML Task to retrieve this same zip code data and use it in a Data Flow.
Retrieving Data Using the Web Service Task and XML Source Component

In this example, you'll configure the data retrieved from the Web Service Task to be read through the XML Source in the Data Flow. Don't worry if the Data Flow is a little confusing at this time. You'll see much more about it in the next chapter.
1. Set up a project and package in the directory c:\ProSSIS\tasks\websvc, or download the complete package from www.wrox.com/go/prossis2014.

2. Drop a Web Service Task onto the Control Flow design surface and configure the task to use the GetInfoByZipCode method on the USZip web service, as shown in the preceding section.

3. Go to the Output tab and set the OutputType to store the results of the web service method to a file of your choosing, such as C:\ProSSIS\Tasks\WebSVC\Output.xml.

4. Drop a Data Flow Task onto the Control Flow design surface and connect the Web Service Task to the Data Flow.

5. In the Data Flow, drop an XML Source component on the design surface.
If the XML source contained schema information, you could select the Use Inline Schema option (the Data Access Mode should be set to "XML file location") and you'd be done. However, you've seen the data we are getting from the web service, and no schema is provided. Therefore, you need to generate an XML Schema Definition (XSD) file so that SSIS can predict and validate data types and lengths.

Note: Here's a little trick that will save you some time. To demonstrate the Web Service Task initially, you set the XML output to go to a file. This was not by accident. Having a concrete file gives you a basis to create an XSD, and you can do it right from the design-time XML Source component. Just provide the path to the physical XML file you downloaded earlier and click the Generate XSD button. Now you should have an XSD file that looks similar to this:
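(The exact schema that the Generate XSD button produces depends on the data in the file; the listing below is an illustrative reconstruction rather than the designer's exact output, with element names taken from the web service result.)

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" id="NewDataSet">
  <xs:element name="NewDataSet">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Table" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="CITY" type="xs:string" minOccurs="0" />
              <xs:element name="STATE" type="xs:string" minOccurs="0" />
              <xs:element name="ZIP" type="xs:unsignedInt" minOccurs="0" />
              <xs:element name="AREA_CODE" type="xs:unsignedShort" minOccurs="0" />
              <xs:element name="TIME_ZONE" type="xs:string" minOccurs="0" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>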
Notice that the XSD generator is not perfect. It can only predict a data type based on what it sees in the data. Not to give the generator anthropomorphic qualities, but the ZIP and AREA_CODE data elements “look” like numeric values to the generator. You should always examine the XSD that is created and edit it accordingly. Change the sequence element lines for ZIP and AREA_CODE to look like this:
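(Using the element names from the sketch above, the corrected lines simply declare both columns as strings.)

<xs:element name="ZIP" type="xs:string" minOccurs="0" />
<xs:element name="AREA_CODE" type="xs:string" minOccurs="0" />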
Now if you refresh the XML Source and select the Columns tab, as shown in Figure 3-16, you should be able to see the columns extracted from the physical XML file.
Figure 3-16
6. To complete the package, add a Flat File Destination to dump the data into a comma-separated (CSV) file.

7. Connect the output pipeline of the XML Source to the Flat File Destination.

8. Click the New button next to the Connection Manager dropdown box to create a new Flat File Connection Manager. Place the file somewhere in the C:\ProSSIS\Tasks\WebSVC directory and call it whatever you'd like, as shown in Figure 3-17.

9. Click OK to go back to the Flat File Destination and then click the Mappings tab to confirm that the columns map appropriately (straight arrows between the left and right). If you save and run the package, it will download the XML data to the output file and then export the columns and rows to the flat file.
This is hardly a robust example, but it demonstrates that the Web Service Task makes retrieving data from a web service a very simple point-and-click task. However, the Web Service Task can retrieve only the results of a web service call. You may find that you need to prepare, extract, or validate your XML files before running them through your ETL processes. This is where the XML Task comes in.
Figure 3-17
XML Task

The XML Task is used when you need to validate, modify, extract, or even create files in an XML format. Earlier we used a Web Service Task to retrieve data in an XML-formatted web service response. In terms of validating this type of XML result, the WSDL that you copy down locally is your contract with the web service, which will break if the XML contents of the results change. In other situations, you may be provided with XML data from a third-party source outside of a contractual relationship. In these cases, it is a good practice to validate the XML file against the schema definition before processing the file. This provides an opportunity to handle the issue programmatically.

If you look at the task in Figure 3-18, the editor looks simple. There are only two tabs: one for General configuration and the obligatory Expressions tab. In this example, the OperationType is set to the Diff operation. This option is one of the more involved operations and requires two XML sources, one as the Input and the other as the Second Operand. The available properties change based on the selection you make for the OperationType property. The options are as follows:
Figure 3-18

➤➤ Validate: This option allows for the schema validation of an XML file against Document Type Definition (DTD) or XML Schema Definition (XSD) binding control documents. You can use this option to ensure that a provided XML file adheres to your expected document format.
➤➤ XSLT: Extensible Stylesheet Language Transformations (XSLT) is an XML-based language that enables transformation of XML data. You might use this operation at the end of an ETL process to take the resulting data and transform it to meet a presentation format.

➤➤ XPATH: This option uses the XML Path Language and allows the extraction of sections or specific nodes from the structure of the XML document. You might use this option to extract data from the XML document prior to using the content. For example, you might want to pull out only the orders for a specific customer from an XML file.

➤➤ Merge: This option allows for the merging of two XML documents with the same structure. You might use this option to combine the results of two extracts from disparate systems into one document.

➤➤ Diff: This option uses difference algorithms to compare two XML documents to produce a third document called an XML Diffgram that contains the differences between them. Use this option with another XML Task using the Patch option to produce a smaller subset of data to insert into your data store. An example use of this task is extracting only the prices that have changed from a new price sheet in XML format.

➤➤ Patch: This option applies the results of a Diff operation to an XML document to create a new XML document.
As you might expect, you can configure the task to use either a file source or a variable. The option to input the XML directly is also available, but it’s not as practical. The best way to get an idea of how this task can be used is to look at a few examples.
Validating an XML File

First up is a basic use case that demonstrates how to validate the internal schema format of an XML file. To be clear on what the XML Task does for you here, the validation is not about whether the XML file is well formed but about whether it contains the proper internal elements. If an XML file is malformed, then simply attempting to load the XML file in the task will generate an error. However, if a node defined in the XSD contract is missing, the XML Task Validation option will inform you that the XML file provided doesn't meet the conditions of the XSD validation.

For this example, we'll borrow the information from the XML and XSD files in the Web Service Task example. Recall that we had an XSD that validated a string node for City, State, Zip, Area_Code, and Time_Zone. (See the Web Service Task example to view the XSD format.) You can download this complete example at www.wrox.com/go/prossis2014. We'll use three files to exercise this task. The first is a valid XML file named MyGetZipsData.xml that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<NewDataSet>
  <Table>
    <CITY>Saint Augustine</CITY>
    <STATE>FL</STATE>
    <ZIP>32084</ZIP>
    <AREA_CODE>904</AREA_CODE>
    <TIME_ZONE>E</TIME_ZONE>
  </Table>
</NewDataSet>
The second file is an invalid XML file named MyGetZipsData_Bad.xml. This file has an improperly named node that doesn't match the XSD specification (reconstructed here with the ZIP element renamed, purely for illustration):

<?xml version="1.0" encoding="utf-8"?>
<NewDataSet>
  <Table>
    <CITY>Saint Augustine</CITY>
    <STATE>FL</STATE>
    <ZIPCODE>32084</ZIPCODE>
    <AREA_CODE>904</AREA_CODE>
    <TIME_ZONE>E</TIME_ZONE>
  </Table>
</NewDataSet>
The last file is a malformed XML file named MyGetZipsData_ReallyBad.xml. This file has an empty node and mismatched closing tags, so it is not valid XML (reconstructed here for illustration):
<?xml version="1.0" encoding="utf-8"?>
<NewDataSet>
  <Table/>
    <CITY>Saint Augustine</CITY>
    <STATE>FL</STATE>
    <ZIP>32084</ZIP>
    <AREA_CODE>904</AREA_CODE>
    <TIME_ZONE>E</TIME_ZONE>
  </Table>
</NewDataSet>
For this example, follow these steps:

1. Create a package and add a new XML Task to the Control Flow surface.

2. Select the OperationType of Validate, set the Input Source Type to a new file connection, and browse to select the MyGetZipsData.xml file.

3. Expand the OperationResult property in the Output section to configure an additional text file to capture the results of the validation. The result values are only true or false, so you only need a simple text file to see how this works. Typically, you store the result in a variable, so you can test the results to determine the next action to take after validation.

4. Set the OverwriteDestination property to True to allow the result to be overwritten in each run of the task.

5. In the Second Operand, you'll need to create another file connection to the XSD file. This will be used for validation of the schema.

6. Create another file connection using an existing file that points to this XSD file.

7. Finally, set the validation type to XSD, as we are using an XSD file to validate the XML. The editor at this point should look like Figure 3-19.
Figure 3-19
This completes the happy path use case. If you execute this task, it should execute successfully, and the results file should contain the value of true to indicate that the XML file contains the correct schema as defined by the XSD file. Now on to the true test:
8. Change the source to a new connection for the MyGetZipsData_Bad.xml file.

9. Execute the task again. This time, although the task completes successfully, the result file contains the value of false to indicate a bad schema. This is really the whole point of the Validation option.

10. Finally, change the source to create a new connection to the poorly formatted XML file MyGetZipsData_ReallyBad.xml to see what happens. In this case, the task actually fails, even though the Validation option's FailOnValidationFail property is set to False. This is because the validation didn't fail; the loading of the XML file failed. The error message indicates the problem accurately:

[XML Task] Error: An error occurred with the following error message: "The 'NewDataSet' start tag on line 2 does not match the end tag of 'Table'. Line 9, position 5.".
[XML Task] Error: An error occurred with the following error message: "Root element is missing.".
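If you capture the OperationResult in a string variable instead of a file (for example, a variable named User::ValidationResult, a name used here purely for illustration), you can evaluate it in precedence constraint expressions to route the Control Flow: use @[User::ValidationResult] == "true" on the path that continues processing and @[User::ValidationResult] == "false" on the path that handles a failed validation.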
Just be aware of the difference between validating the schema and validating the XML file itself when designing your package Control Flows, and set up accordingly: you need to handle both the outright failure of the task and its successful completion with a failure result. This is just one example demonstrating how you can use the XML Task in SSIS development. There are several other uses for this task that are legitimate and useful for preparing data to feed into your SSIS ETL package Data Flows. The next section turns to another set of data preparation tasks, which we have separated into their own category, as they deal specifically with retrieval and preparation of RDBMS data.
RDBMS Server Tasks

These tasks could also be considered data preparation tasks, as they are responsible for bringing data sources into the ETL processes, but we have separated the Bulk Insert Task and the Execute SQL Task into this category because of the expectation that you will be working with data from relational database management systems (RDBMS) like SQL Server, Oracle, and DB2. The exception to this is the Bulk Insert Task, which works only with SQL Server because it is a wrapper for the SQL Server bulk-copy process.
Bulk Insert Task

The Bulk Insert Task enables you to insert data from a text or flat file into a SQL Server database table in the same high-octane manner as using a BULK INSERT statement or the bcp.exe command-line tool. In fact, the task is basically just a wizard to store the information needed to create and execute a bulk copy command at runtime (similar to BCP from a command line). If you aren't familiar with using BCP, you can research the topic in detail in Books Online. The downside of the Bulk Insert Task is its strict data format requirement, and it precludes working with the data in a Data Flow within the same action. This can be seen as a disadvantage in that it does not allow any
transformations to occur to the data in flight, but not all ETL processes are efficiently modified in the Data Flow. In high-volume extracts, you may be better served by loading the initial extract into a staging table and then extracting data in discrete chunks for processing within specific Data Flows. The Bulk Insert Task has no ability to transform data, and this trade-off in functionality gives you the fastest way to load data from a text file into a SQL Server database. When you add a Bulk Insert Task to your Control Flow, follow these steps:
1. Open the Bulk Insert Task Editor to configure it. As in most tasks, use the General tab to name and describe the task. Make sure you name it something that describes its unit of work, like Prepare Staging. This will be helpful later when you deploy the package and troubleshoot problems. The next tab, Connection, is the most important. This tab enables you to specify the source and destination for the data.

2. Select the destination from the Connection dropdown in the Destination Connection group.

3. Specify a destination table from the next dropdown, below the destination connection.

4. While you're specifying connections, go to the bottom to specify the source connection's filename in the File dropdown. Both the source and the destination connections use the Connection Manager. If you haven't already created the shared connections, you'll be prompted to create them in either case by selecting <New connection...>.

Note: Both the source and the optional format file must be relative to the destination SQL Server, because the operation occurs there when a Bulk Insert Task is used. If you are using a network file location, use the UNC path (\\MachineName\ShareName\FileName.csv) to the source or format file.

5. After you specify the connections, you need to provide file specifications for the format of the file you're importing. If you created the file using the BCP utility, you can use the -f option to create a format file as well. The Bulk Insert Task can then use the BCP format file to determine how the file is formatted, or you can select the column and row delimiters in the Format property of the task. The two options are:

➤➤ Use File: This uses the BCP format (.fmt) file.

➤➤ Specify: This enables you to select the file delimiters. The available delimiters are New Line ({CR}{LF}), Carriage Return ({CR}), Line Feed ({LF}), Semicolon (;), Comma (,), Tab, or Vertical Bar (|). Note that the defaults are for the row to be {CR}{LF} delimited, and the column tab-delimited.

6. In the Options tab of the Bulk Insert Task Editor, you can use some lesser-known options:

➤➤ Code page: You can specify the code page for the source file. You will rarely want to change the code page from RAW, which is the default. Using RAW is the fastest data-loading option because no code page conversion takes place.

➤➤ OEM: You should use this when copying from one SQL Server to another.
➤➤ ACP: This converts non-Unicode data to the ANSI code page of the SQL Server you are loading the data into, or you can specify a specific code page mapping.

➤➤ DataFileType: Specifies the type of the source file. Options here include char, native, widechar, and widenative. Generally, files you receive will be the default option, char, but in some cases, you may see a file with a native format. A file (myzips_native.txt) in native format was created from SQL Server by using the bcp.exe program with the -n (native) switch and is supplied with the download from www.wrox.com/go/prossis2014. You'll see how to import it later in an example.

You can also use the Options tab to specify the first and last row to copy if you want only a sampling of the rows. This is commonly used to set the first row to two (2) when you want to skip a header row. The BatchSize option indicates how many records will be written to SQL Server before committing the batch. A BatchSize of 0 (the default) means that all the records will be written to SQL Server in a single batch. If you have more than 100,000 records, then you may want to adjust this setting to 50,000 or another number based on how many you want to commit at a time. The adjustment may vary based on the width of your file.

The Options dropdown contains five options that you can enable/disable:

➤➤ Check Constraints: This option checks table and column constraints before committing the record. It is the only option enabled by default.

➤➤ Keep Nulls: By selecting this option, the Bulk Insert Task will replace any empty columns in the source file with NULLs in SQL Server.

➤➤ Enable Identity Insert: Enable this option if your destination table has an identity column into which you're inserting. Otherwise, you will receive an error.

➤➤ Table Lock: This option creates a SQL Server lock on the target table, preventing inserts and updates other than the records you are inserting. This option speeds up your process but may cause a production outage, as others are blocked from modifying the table. If you check this option, SSIS will not have to compete for locks to insert massive amounts of data into the target table. Set this option only if you're certain that no other process will be competing with your task for table access.

➤➤ Fire Triggers: By default, the Bulk Insert Task ignores triggers for maximum speed. When you check this option, the task will no longer ignore triggers and will instead fire the insert triggers for the table into which you're inserting.
There are a few other options you can set in the Options tab. The SortedData option specifies the column you wish to sort by while inserting the data; it defaults to false, meaning no sort is performed. If you need to set this option, type the name of the column you wish to sort by. The MaxErrors option specifies how many errors are acceptable before the task is stopped with an error. Each row that does not insert is considered an error; by default, if a single row has a problem, the entire task fails.
Note: The Bulk Insert Task does not log error-causing rows. If you want bad records to be written to an error file or table, it's better to use the Data Flow Task.
Using the Bulk Insert Task

Take time out briefly to exercise the Bulk Insert Task with a typical data load by following these steps:
1. Create a new package called BulkInsertTask.dtsx. If you haven't already downloaded the code files for this chapter from www.wrox.com/go/prossis2014, do so. Then extract the file for this chapter named myzips.csv.

2. Create a table in the AdventureWorksDW database using SQL Management Studio or the tool of your choice to store postal code information (code file Ch03SQL.txt):

CREATE TABLE PROSSIS_ZIPCODE (
    ZipCode CHAR(5),
    State CHAR(2),
    ZipName VARCHAR(16)
)

3. Back in your new package, drag the Bulk Insert Task onto the Control Flow design pane. Notice that the task has a red icon on it, indicating that it hasn't been configured yet.

4. Double-click the task to open the editor. In the General tab, provide the name Load Zip Codes for the Name option and Loads zip codes from a flat file for the description.

5. Click the Connection tab. From the Connection dropdown box, select <New connection...>. This will open the Configure OLE DB Connection Manager dialog.

6. Now, you're going to create a connection to the AdventureWorksDW database that can be reused throughout this chapter. Click New to add a new Connection Manager. For the Server Name option, select the server that contains your AdventureWorksDW database. For the database, select the AdventureWorksDW database.

7. Click OK to go back to the previous screen, and click OK again to return to the Bulk Insert Task Editor. You'll now see that the Connection Manager you just created has been selected in the Connection dropdown box.

8. Now you need to define the destination. For the DestinationTable option, select the [dbo].[PROSSIS_ZIPCODE] table. For the first attempt, you'll import a comma-delimited version of the zip codes. This simulates importing a file that would have been dumped out of another SQL Server (with the same table name) using this bcp command:

bcp AdventureWorksDW.dbo.prossis_zipcode out c:\ProSSIS\tasks\bulkInsertTask\myzips.csv -c -t, -T

9. Leave the remaining options set to the defaults. The RowDelimiter property will be {CR}{LF} (a carriage return) and the ColumnDelimiter property should be set to Comma {,}.
10. For the File option, again select <New connection...> to create a new Connection Manager. This will open the File Connection Manager Editor.

11. For the Usage Type, select Existing File. Then point to myZips.csv for the File option. Click OK to return to the editor. Your final Bulk Insert Task Editor screen should look similar to Figure 3-20.
Figure 3-20
If you open the myzips.csv file, you'll notice there is no header row with the column names before the data. If you had a column header and needed to skip it, you would go to the Options tab and change the FirstRow option to 2. This would start the import process on the second row, instead of the first, which is the default.

12. You should be able to run the package now. When it executes, the table will be populated with all the postal codes from the import file. You can verify this by selecting all the rows from the PROSSIS_ZIPCODE table.
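Behind the scenes, the task issues the equivalent of a BULK INSERT statement against the destination server. A rough T-SQL equivalent of what this example does would look something like the following (a sketch only; the task builds the actual command for you at runtime):

BULK INSERT dbo.PROSSIS_ZIPCODE
FROM 'C:\ProSSIS\Tasks\BulkInsertTask\myzips.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    CHECK_CONSTRAINTS
);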
As you can see, the Bulk Insert Task is a useful tool to load staging files quickly, but you may need to further process the file. One reason is because this task provides no opportunity to divert the data into a transformation workflow to examine the quality of the data. Another reason is that you have to import character-based data to avoid raising errors during the loading process. The Bulk Insert Task handles errors in an all-or-nothing manner. If a single row fails to insert, then your task may fail (based on your setting for the maximum number of allowed errors). These problems can be easily solved by using a Data Flow Task if the data is unreliable.
Execute SQL Task

The Execute SQL Task is one of the most widely used tasks in SSIS for interacting with an RDBMS data source. The Execute SQL Task is used for all sorts of things, including truncating a staging data table prior to importing, retrieving row counts to determine the next step in a workflow, or calling stored procedures to perform business logic against sets of staged data. This task is also used to retrieve information from a database repository. The Execute SQL Task is also found in the legacy DTS product, but the SSIS version provides a better configuration editor and methods to map stored procedure parameters to read back result and output values.

This section introduces you to all the possible ways to configure this task by working through the different ways you can use it. You'll work through how to execute parameterized SQL statements or execute batches of SQL statements, how to capture single-row and multiple-row results, and how to execute stored procedures.
Executing a Parameterized SQL Statement

The task can execute a SQL command in two basic ways: by executing inline SQL statements or by executing stored procedures. Either way, the task may also need to accept return values, either in parameters or in a result set. You can get an idea of how the task can be configured to do these combinations in the General tab of the Execute SQL Task Editor, shown in Figure 3-21. Here, the Execute SQL Task is set to perform an Update operation on the DimProduct table using an inline SQL statement with a variable-based parameter. This is the easiest use of the Execute SQL Task because you don't need to configure the Result Set tab properties.
Figure 3-21
Notice in Figure 3-21 that the General tab contains the core properties of the task. Here the task is configured to point to an OLE DB connection. The other options for the ConnectionType include ODBC, ADO, ADO.NET, SQLMOBILE, and even EXCEL connections. The catch to all this connection flexibility is that the Execute SQL Task behaves differently depending upon the underlying data provider. For example, the SQLStatement property in Figure 3-21 shows a directly inputted T-SQL statement with a question mark in the statement. The full statement is here:

UPDATE DimProduct
SET Color = 'Red'
WHERE ProductKey = ?
This ?, which indicates that a parameter is required, is classic ODBC parameter marking and is used in most of the other providers — with the exception of the ADO.NET provider, which uses named parameters. This matters, because in the task, you need to configure the parameters to the SQL statement in the Parameter Mapping tab, as shown in Figure 3-22.
Figure 3-22
Here the parameter mapping collection maps the first parameter [ordinal position of zero (0)] to a user variable. When mapping parameters to connections and underlying providers, use the following table to set up this tab in the Task Editor:
If Using Connection of Type    Parameter Marker to Use    Parameter Name to Use
ADO                            ?                          Param1, Param2, ...
ADO.NET                        @<parameter name>          @<parameter name>
ODBC                           ?                          1, 2, 3 (note ordinal starts at 1)
OLE DB and EXCEL               ?                          0, 1, 2, 3 (note ordinal starts at 0)
Because we are using an OLE DB provider here, the parameter marker is ?, and the parameter is named by its zero-based ordinal position. The other mapping you need to do here is for the data type of the parameter. These data types also vary according to your underlying provider. SSIS is very specific about how you map data types, so you may need to experiment or check Books Online for the mapping equivalents for your parameters and provider. We'll cover many of the common issues in this regard throughout this section. For this initial example, the user variable supplying the ProductKey value is mapped to the matching OLE DB integer data type (LONG). At this point, the Execute SQL Task with this simple update statement could be executed, and the Color column would be updated for the product whose key is supplied by the variable.

A variation of this example would be a case in which the statement can be dynamically generated at runtime and simply fired into the Connection Manager. The SQLSourceType property on the General tab allows for three different types of SQL statement resolution: direct input (as we did), via a variable, or from a file connection. Another way to build the SQL statement is to use the Build Query action button. This brings up a Query-By-Example (QBE) tool that helps you build a query by clicking the tables and establishing the relationships. The variable-based option is also straightforward. Typically, you define a variable that is resolved from an expression. Setting the SQLSourceType property in the Execute SQL Task to Variable enables you to select the variable that will resolve to the SQL statement that you want the task to execute. The other option, using a file connection, warrants a little more discussion.
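For comparison, if the same update were issued through an ADO.NET connection, the statement would use a named parameter instead of the ? marker. The parameter name is whatever you define in the Parameter Mapping tab; @ProductKey below is only an illustrative choice:

UPDATE DimProduct
SET Color = 'Red'
WHERE ProductKey = @ProductKey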
Executing a Batch of SQL Statements

If you use the File Connection option of the Execute SQL Task's SQLSourceType property, typically you are doing so to execute a batch of SQL statements. All you need to do is have the file that contains the batch of SQL statements available to the SSIS package during runtime, and set up a File Connection to point to it. Make sure that your SQL batch follows a few rules. Some of these rules are typical SQL rules, like using a GO command between statements, but others are specific to the SSIS Execute SQL Task. Use these rules as a guide for executing a batch of SQL statements (a sample batch follows the list):

➤➤ Use GO statements between each distinct command. Note that some providers allow you to use the semicolon (;) as a command delimiter.

➤➤ If there are multiple parameterized statements in the batch, all parameters must match in type and order.

➤➤ Only one statement can return a result, and it must be the first statement.
➤➤ If the batch returns a result, then the columns must match the same number and properly named result columns for the Execute SQL Task. If the two don't match and you have subsequent UPDATE or DELETE statements in the batch, these will execute even though the results don't bind, and an error results. The batch is sent to SQL Server to execute and behaves the same way.
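For example, a batch file that follows these rules might contain something like the following, shown here against the PROSSIS_ZIPCODE table created in the Bulk Insert Task example earlier in this chapter (illustrative only):

SELECT COUNT(*) AS StagedRows FROM dbo.PROSSIS_ZIPCODE;
GO
UPDATE dbo.PROSSIS_ZIPCODE SET ZipName = UPPER(ZipName);
GO
DELETE FROM dbo.PROSSIS_ZIPCODE WHERE ZipCode = '00000';
GO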
Returning results is something that we haven’t explored in the Execute SQL Task, so let’s look at some examples that do this in SSIS.
Capturing Singleton Results

On the General tab of the Execute SQL Task, you can set up the task to capture the type of result that you expect to have returned by configuring the ResultSet property. This property can be set to return nothing (None), a singleton result set, a multi-line result, or an XML-formatted string. Any setting other than None requires configuration of the Result Set tab on the editor. In the Result Set tab, you are defining the binding of returned values into a finite set of SSIS variables. For most data type bindings, this is not an issue. You select the SSIS variable data type that most closely matches that of your provider. The issues that arise from this activity are caused by invalid casting that occurs as data in the Tabular Data Stream (TDS) from the underlying provider collides with the variable data types to which it is being assigned. This casting happens internally within the Execute SQL Task, and you don't have control over it as you would in a Script Task.

Before you assume that it is just a simple data type assignment issue, you need to understand that SSIS is the lowest common denominator when it comes to being able to bind to data types from all the possible data providers. For example, SSIS doesn't have a currency or decimal data type. The only thing close is the double data type, which is the type that must be used for real, numeric, currency, decimal, float, and other similar data types.

The next example sets up a simple inline SQL statement that returns a single row (or singleton result) to show both the normal cases and the exception cases for configuring the Execute SQL Task and handling these binding issues. First, we'll use a simple T-SQL statement against the AdventureWorks database that looks like this (code file Ch03SQL.txt):

SELECT TOP 1
    CarrierTrackingNumber,
    LineTotal,
    OrderQty,
    UnitPrice
FROM Sales.SalesOrderDetail
We’ve chosen this odd result set because of the multiple data types in the SalesOrderDetail table. These data types provide an opportunity to highlight some of the solutions to difficulties with mapping these data types in the Execute SQL Task that we’ve been helping folks with since the first release of SSIS. To capture these columns from this table, you need to create some variables in the package. Then these variables will be mapped one-for-one to the result columns. Some of the mappings are simple. The CarrierTrackingNumber can be easily mapped to a string variable data type with either nvarchar or varchar data types in the Execute SQL Task. The OrderQty field, which is using the smallint SQL Server data type, needs to be mapped to an int16 SSIS data type. Failure to map the data type correctly will result in an error like this:
[Execute SQL Task] Error: An error occurred while assigning a value to variable "OrderQty": "The type of the value being assigned to variable "User::OrderQty" differs from the current variable type. Variables may not change type during execution. Variable types are strict, except for variables of type Object."
The other two values, for the SQL Server UnitPrice (money) and LineTotal (numeric) columns, are more difficult. The closest equivalent variable data type in SSIS is a double data type. Now the parameters can simply be mapped in the Execute SQL Task Result Set tab, as shown in Figure 3-23. The Result Name property maps to the column name in your SQL statement or its ordinal position (starting at 0).
Figure 3-23
Just use the Add and Remove buttons to put the result elements in the order that they should be returned, name them according to the provider requirements, and get the right data types, and you’ll be fine. If these are in the incorrect order, or if the data types can’t be cast by the Execute SQL Task from the TDS into the corresponding variable data type, you will get a binding error. This should give you a general guide to using the Execute SQL Task for capturing singleton results.
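If the money and numeric columns refuse to bind to your double variables, one common workaround (alluded to again in the output parameter discussion later in this section) is to cast the troublesome columns to float in the statement itself; a sketch:

SELECT TOP 1
    CarrierTrackingNumber,
    CAST(LineTotal AS FLOAT) AS LineTotal,
    OrderQty,
    CAST(UnitPrice AS FLOAT) AS UnitPrice
FROM Sales.SalesOrderDetail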
Multi-Row Results

Typically, you capture multi-row results from a database as a recordset or an XML file (particularly between SQL Server Data Sources) to use in another Script Task for analysis or decision-making purposes, to provide an enumerator in a Foreach or Looping Task, or to feed into a Data Flow Task
for processing. Set up the SQLSourceType and SQLStatement properties to call either an inline SQL statement or a stored procedure. In either case, you would set the ResultSet property in the General tab to Full Result Set, and the Result Set tab is set up to capture the results. The only difference from capturing a singleton result is that you need to capture the entire result into a variable, rather than map each column. The data type you should use to capture the results varies according to what you are capturing. The XML file can be captured in either a string or an object data type. The recordset can only be captured in a variable with the object data type. An example of the Execute SQL Task configured to create an object data type to store the results of a selection of rows from the Sales.SalesOrderDetail table is shown in Figure 3-24. Note that the Result Set tab shows the capturing of these rows with the required zero-ordinal position.
Figure 3-24
Once the recordset is stored as a variable, you can do things like “shred” the recordset. The term shredding means iterating through the recordset one row at a time in a Foreach Loop operation. For each iteration, you can capture the variables from, and perform an operation on, each row. Figure 3-25 shows how the Foreach Loop Container would look using the variable-based recordset. This container is covered in detail in Chapter 6. Another way to use the variable-based recordset is to use it to feed a data transform. To do this, just create a Source Script Transform in a Data Flow and add to it the columns that you want to realize from the stored recordset and pass in the recordset variable. Then add code (code file Ch03SQL.txt) similar to the following to turn the column data from the recordset into the output stream (to save time and space, only two columns are being realized in the recordset):
Figure 3-25
C#

public override void CreateNewOutputRows()
{
    System.Data.OleDb.OleDbDataAdapter oleDA =
        new System.Data.OleDb.OleDbDataAdapter();
    System.Data.DataTable dT = new System.Data.DataTable();
    oleDA.Fill(dT, Variables.RecordSetResult);
    foreach (DataRow dr in dT.Rows)
    {
        Output0Buffer.AddRow();
        // by Name
        Output0Buffer.CarrierTrackingNumber =
            dr["CarrierTrackingNumber"].ToString();
        // by Ordinal
        Output0Buffer.UnitPrice = System.Convert.ToDecimal(dr[6]);
    }
}

VB

Public Overrides Sub CreateNewOutputRows()
    Dim oleDA As New System.Data.OleDb.OleDbDataAdapter()
    Dim dT As New System.Data.DataTable()
    Dim row As System.Data.DataRow
    oleDA.Fill(dT, Variables.RecordSetResult)
    For Each row In dT.Rows
        Output0Buffer.AddRow()
        Output0Buffer.CarrierTrackingNumber = _
            row("CarrierTrackingNumber").ToString()
        Output0Buffer.UnitPrice = System.Convert.ToDecimal(row(6))
    Next
End Sub
The XML version of capturing the result in a string is even easier. You don’t need to use the Script Component to turn the XML string back into a source of data. Instead, use the out-of-the-box component called the XML Source in the Data Flow. It can accept a variable as the source of the data. (Review the example demonstrating how to do this in the “Web Service Task” section of this chapter.) You can see that the Execute SQL Task is really quite useful at executing inline SQL statements and retrieving results, so now take a look at how you can use stored procedures as well in this task.
Executing a Stored Procedure

Another way to interact with an RDBMS is to execute stored procedures that can perform operations on a data source to return values, output parameters, or results. Set up the SSIS Execute SQL Task to execute stored procedures by providing the call to the proc name in the General tab's SQLStatement property. The catch is the same as before. Because the Execute SQL Task sits on top of several different data providers, you need to pay attention to the way each provider handles the stored procedure call. The following table provides a reference to how you should code the SQLStatement property in the Execute SQL Task:

If Using Connection Type    And IsQueryStoredProcedure    Code the SQL Statement Property Like This
OLE DB and EXCEL            N/A                           EXEC usp_StoredProc ?, ?
ODBC                        N/A                           {call usp_StoredProc (?, ?)}
ADO                         false                         EXEC usp_StoredProc ?, ?
ADO                         true                          usp_StoredProc
ADO.NET                     false                         EXEC usp_StoredProc @Parm1, @Parm2
ADO.NET                     true                          usp_StoredProc @Parm1, @Parm2
Returning to the idea of the earlier example, in which you used an inline SQL statement with a single parameter marker, create a T-SQL stored procedure that performs a similar parameterized update, this time setting the modified date on a Person.Address row (code file Ch03SQL.txt):

CREATE PROCEDURE usp_UpdatePersonAddressModifyDate(
    @MODIFIED_DATE DATETIME
)
AS
BEGIN
    UPDATE Person.Address
    SET ModifiedDate = @MODIFIED_DATE
    WHERE AddressId = 1
END
In the online downloads for this chapter, we’ve created a package that demonstrates how to call this procedure using both the OLE DB and the ADO.NET Connection Managers. In the General tab (shown in Figure 3-26), the SQLStatement property is set up as prescribed earlier in the guide, with the ? parameter markers for the one input parameter. Note also that the IsQueryStoredProcedure property is not enabled. You can’t set this property for the OLE DB provider. However, this property would be enabled in the ADO.NET version of the Execute SQL Task to execute this same procedure. If you set the IsQueryStoredProcedure for the ADO.NET version to true, the SQLStatement property would also need to change. Remove the execute command and the parameter markers to look like this: Usp_UpdatePersonAddressModifyDate. In this mode, the Execute SQL Task will actually build the complete execution statement using the parameter listing that you’d provide in the Parameter Mapping tab of the Task Editor.
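Following the table in the previous section, the OLE DB form of the SQLStatement property here would be along the lines of:

EXEC usp_UpdatePersonAddressModifyDate ?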
Figure 3-26
The Parameter Mapping tab of the Task Editor varies according to the underlying provider set on the Execute SQL Task, as shown in Figure 3-27. For brevity, this figure just shows an OLE DB connection with parameters. With ADO.NET connections though, the parameter names follow the same rules you used when applying parameters to inline SQL statements earlier in this chapter by changing the Parameter Name option to @MODIFIED_DATE, for example.
Figure 3-27
Retrieving Output Parameters from a Stored Procedure

Mapping input parameters for SQL statements is one thing, but there are some issues to consider when handling output parameters from stored procedures. The main thing to remember is that all retrieved output or return parameters have to be pushed into variables to have any downstream use. The variable types are defined within SSIS, and you have the same issues that we covered in the section "Capturing Singleton Results" for this task. In short, you have to be able to choose the correct variables when you bind the resulting provider output parameters to the SSIS variables, so that you can get a successful type conversion.

As an example, we'll duplicate the same type of SQL query we used earlier with the inline SQL statement to capture a singleton result, but here you'll use a stored procedure object instead. Put the following stored procedure in the AdventureWorks database (code file Ch03SQL.txt):

CREATE PROCEDURE usp_GetTop1SalesOrderDetail (
    @CARRIER_TRACKING_NUMBER nvarchar(25) OUTPUT,
    @LINE_TOTAL numeric(38,6) OUTPUT,
    @ORDER_QTY smallint OUTPUT,
    @UNIT_PRICE money OUTPUT
)
AS
BEGIN
    SELECT TOP 1
        @CARRIER_TRACKING_NUMBER = CarrierTrackingNumber,
        @LINE_TOTAL = LineTotal,
        @ORDER_QTY = OrderQty,
        @UNIT_PRICE = UnitPrice
    FROM Sales.SalesOrderDetail
END
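With an OLE DB connection, the SQLStatement property for this procedure would then take a form along these lines, with a question mark marked as OUTPUT for each output parameter:

EXEC usp_GetTop1SalesOrderDetail ? OUTPUT, ? OUTPUT, ? OUTPUT, ? OUTPUT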
In this contrived example, the stored procedure will provide four different output parameters that you can use to learn how to set up the output parameter bindings. (Integer values are consistent and easy to map across almost all providers, so there is no need to demonstrate that in this example.) One difference between returning singleton output parameters and a singleton row is that in the General tab of the Execute SQL Task, the ResultSet property is set to None, as no row should be returned to capture. Instead, the parameters in the Parameter Mapping tab will be set to the Direction of Output and the Data Types mapped based on the provider. To get the defined SQL Server data type parameters to match the SSIS variables, you need to set up the parameters with these mappings:

Parameter Name                SQL Server Data Type    SSIS Data Type
@CARRIER_TRACKING_NUMBER      nvarchar                string
@LINE_TOTAL                   numeric                 double
@ORDER_QTY                    smallint                int16
@UNIT_PRICE                   money                   double
You might assume that you would still have an issue with this binding, because, if you recall, you attempted to return a single rowset from an inline SQL statement with these same data types and ended up with all types of binding and casting errors. You had to change your inline statement to cast these values to get them to bind. You don't have to do this when binding to parameters, because this casting occurs outside of the Tabular Data Stream. When binding parameters (as opposed to columns in a data stream), the numeric data type will bind directly to the double, so you won't get the error that you would get if the same data were being bound from a rowset. We're not quite sure why this is the case, but fortunately stored procedures don't have to be altered in order to use them in SSIS because of output parameter binding issues.

The remaining task to complete the parameter setup is to provide the correct placeholder for each parameter. Figure 3-28 is an example of the completed parameter setup for the procedure in OLE DB.

At this point, you have looked at every scenario concerning binding to parameters and result sets. Stored procedures can also return multi-row results, but there is really no difference in how you handle these rows from a stored procedure and an inline SQL statement. We covered multi-row scenarios earlier in this section on the Execute SQL Task. Now we will move away from tasks in the RDBMS world and look at tasks that control external processes such as other packages or applications in the operating system.
Figure 3-28
Workflow Tasks

So far, we've been focused on tasks that occur within the immediate realm of ETL processing. You've looked at tasks for creating control structures, preparing data, and performing RDBMS operations. This section looks at controlling other processes and applications in the operating system. Here we sidestep a bit from typical ETL definitions into things that are more enterprise application integration (EAI) oriented. SSIS packages can be organized to execute other packages, to call external programs that zip up files or send e-mail alerts, and even to put messages directly into application queues for processing.
Execute Package Task

The Execute Package Task enables you to build SSIS solutions called parent packages that execute other packages called child packages. You'll find that this capability is an indispensable part of your SSIS development as your packages begin to grow. Separating packages into discrete functional workflows enables shorter development and testing cycles and facilitates best development practices. Though the Execute Package Task has been around since the legacy DTS, several improvements have simplified the task:

➤➤ The child packages can be run as either in-process or out-of-process executables. In the Package tab of the Execute Package Task Editor is the ExecuteOutOfProcess property; left at the default value of false, the child package executes in the parent package's process and memory
space; setting it to true runs the child in its own process. A big difference in this release of the task compared to its 2008 or 2005 predecessor is that you can execute packages within a project, which makes migrating the code from development to QA much easier.
The task enables you to easily map parameters in the parent package to the child packages now too.
The majority of configurable properties are in the Package tab of the Execute Package Task Editor. The first option provides the location of the child package. The ReferenceType option can be either External or Project References. This means you can point to a package inside your current project or outside the project to a SQL Server or file system. The best (easiest) option is to refer to a package in a project, as this option will easily “repoint” the reference as you migrate to production. If you point to an External Reference, you’ll need to create a Connection Manager that won’t automatically repoint as you migrate your packages from development to production. The configured tab will look like Figure 3-29.
Figure 3-29
Next, go to the Parameter Bindings tab to pass parameters into the child package. First, select any parameters in the child package from its dropdown box, and then map them to a parameter or variable in the parent package. Parameters will only work here with Project Referenced packages. You can see an example of this in Figure 3-30, or download the entire example from www.wrox.com/go/prossis2014.
Figure 3-30
Execute Process Task

The Execute Process Task will execute a Windows or console application inside of the Control Flow. You'll find great uses for this task to run command-line-based programs and utilities prior to performing other ETL tasks. The most common example would have to be unzipping packed or encrypted data files with a command-line tool. You can store any errors resulting from the execution of the task into a variable that can be read later and logged. In addition, any output from the command file can also be written to a variable for logging purposes. Figure 3-31 shows a sample of using the Execute Process Task to expand a compressed customers.zip file.

The Process tab in the Execute Process Task Editor contains most of the important configuration items for this task:

➤➤ RequireFullFileName property: Tells the task whether it needs the full path to execute the command. If the file is not found at the full path or in the PATH environment variables of the machine, the task will fail. Typically, a full path is used only if you want to explicitly identify the executable you want to run. However, if the file exists in the System32 directory, you wouldn't normally have to type the full path to the file because this path is automatically known to a typical Windows system.
➤➤ Executable property: Identifies the path and filename for the executable you want to run. Be careful not to provide any parameters or optional switches in this property that would be passed to the executable. Use the Arguments property to set these types of options separately. For example, Figure 3-31 shows that the task will execute expand.exe and pass in the cabinet from which you want to extract and where you'd like it to be extracted.

➤➤ WorkingDirectory option: Contains the path from which the executable or command file will work.

➤➤ StandardInputVariable parameter: This is the variable you want to pass into the process as an argument. Use this property if you want to dynamically provide a parameter to the executable based on a variable.

➤➤ StandardOutputVariable parameter: You can also capture the result of the execution by setting the property StandardOutputVariable to a variable.

➤➤ StandardErrorVariable property: Any errors that occurred from the execution can be captured in the variable you provide in this property.
Figure 3-31
These variable values can be sent back to a scripting component for logging or used in a precedence constraint that checks the length of the variables to determine whether you should go to the next task. This provides the logical functionality of looping back and trying again if the result of the execution of the expand.exe program was a sharing violation or another similar error.
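For example, assuming the error output is captured in a string variable named User::ExpandError (an illustrative name), a precedence constraint expression such as LEN(@[User::ExpandError]) == 0 would let the flow continue only when the utility produced no error text.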
Other options in the Process tab include:

➤➤ FailTaskIfReturnCodeIsNotSuccessValue property: Another option for validating the task.

➤➤ SuccessValue option: The Execute Process Task will fail if the exit code passed from the program is different from the value provided in the SuccessValue option. The default value of 0 indicates that the task was successful in executing the process.

➤➤ Timeout/TerminateProcessAfterTimeOut properties: The Timeout property determines the number of seconds that must elapse before the program is considered a runaway process. A value of 0, which is the default, means the process can run for an infinite amount of time. This property is used in conjunction with the TerminateProcessAfterTimeOut property, which, if set to true, terminates the process after the timeout has been exceeded.

➤➤ WindowStyle option: This can set the executable to be run minimized, maximized, hidden, or normal. If this is set to any option other than hidden, users will be able to see any windows that potentially pop up and may interact with them during runtime. Typically, these are set to hidden once a package is fully tested.
With the Execute Process Task, you can continue to use command-line or out-of-process executables to organize work for ETL tasks. Now it's time to take a look at how SSIS can interact and integrate with your enterprise messaging bus.
Message Queue Task

The Message Queue Task enables you to send or receive messages from Microsoft Message Queuing (MSMQ) right out of the box. For integration with other messaging systems like IBM's MQ Series or Tibco's Rendezvous, you need to either code to a library within a Script Task, create a custom component, or execute T-SQL statements against a SQL Server Service Broker queue. Messaging architectures are created to ensure reliable communication between two disparate subsystems. A message can be a string, a file, or a variable. The main benefit of using this task is the capability to make packages communicate with each other at runtime. You can use this to scale out your packages, having multiple packages executing in parallel, with each loading a subset of the data, and then checking in with the parent package after they reach certain checkpoints. You can also use this task for enterprise-level information integration to do things like deliver dashboard-level information using XML files to an enterprise bus or distribute report content files across your network. Satellite offices or any other subscriber to those services could pull content from the queue for application-level processing.

The task is straightforward. In the General tab, shown in Figure 3-32, you specify the MSMQ Connection Manager under the MSMQConnection property. Then, you specify whether you want to send or receive a message under the Message option. In this tab, you can also specify whether you want to use the legacy Windows 2000 version of MSMQ; this option is set to false by default. The bulk of the configuration is under the Send or Receive tab (the one you see varies according to the Message option you selected in the General tab). If you're on the Receive tab, you can configure the task to remove the message from the queue after it has been read. You can also set the timeout properties here, to control whether the task will produce an error if it experiences a timeout.
Figure 3-32
Regardless of whether you’re sending or receiving messages, you can select the type of the message under the MessageType option. You can either send or receive a string message, a variable, or a data file. Additionally, if you’re receiving a message, you can immediately store the message you receive in a package variable by setting String Message to Variable and then specifying a variable in the Variable option.
Send Mail Task

The Send Mail Task provides a configurable SSIS task for sending e-mail messages via SMTP. In legacy DTS packages, you had to send messages out through MAPI, which meant installing Outlook on the server on which the package was running. That is no longer a requirement. Most of the configuration options are set in the Mail tab (shown in Figure 3-33) of the Send Mail Task Editor. The SmtpConnection property is where you either create a new or select an existing SMTP Connection Manager. Most of the configuration options will depend upon your specific SMTP connection. One option of special interest is the MessageSourceType property, which specifies whether the message source will be provided from a file or a variable or be directly inputted into the MessageSource property. Typically, the best practice is to use a variable-based approach to set the MessageSource property.
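For example, the variable feeding MessageSource can be built with an expression so that the e-mail body reflects the current run; a minimal sketch using built-in system variables:

"Package " + @[System::PackageName] + ", which started at " + (DT_WSTR, 30) @[System::StartTime] + ", has completed."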
Figure 3-33
WMI Data Reader Task

Windows Management Instrumentation (WMI) is one of the best-kept secrets in Windows. WMI enables you to manage Windows servers and workstations through a scripting interface similar to running a T-SQL query. The WMI Data Reader Task enables you to interface with this environment by writing WQL queries (the query language for WMI) against the server or workstation (to look at the Application event log, for example). The output of this query can be written to a file or variable for later consumption. Following are some applications for which you could use the WMI Data Reader Task:

➤➤ Read the event log looking for a given error.

➤➤ Query the list of applications that are running.

➤➤ Query to see how much RAM is available at package execution for debugging.

➤➤ Determine the amount of free space on a hard drive.
To get started, you first need to set up a WMI Connection Manager in the Connection Manager Editor. Connection requirements vary, but Figure 3-34 shows an example of a WMI connection for a typical standalone workstation.
Figure 3-34
Notice here that the Use Windows Authentication option has been set. WMI typically requires a higher level of security authorization because you are able to query OS-level data. With a WMI connection, you can configure the WMI Data Reader Task Editor using the WMI Options tab shown in Figure 3-35.
Figure 3-35
➤➤ WmiConnection/WqlQuerySourceType: First, you set the WmiConnection, and then determine whether the WMI query will be directly inputted, retrieved from a variable, or retrieved from a file, and set the WqlQuerySourceType accordingly.
➤➤ WqlQuerySource: Specifies the source for the query that you wish to run against the connection. This may be a variable name, a text filename, or the hardcoded query itself.
➤➤ OutputType: Specifies whether you want the output of the query to retrieve just the values from the query or the column names along with the values.
➤➤ OverwriteDestination: Specifies whether you want the destination to be overwritten each time the task is run, or whether you want it to just append to any configured destination. If you save the output to an object variable, you can use the same technique of shredding a recordset that you learned earlier in the Execute SQL Task.
WQL queries look like SQL queries, and for all practical purposes they are, with the difference that you are retrieving data sets from the operating system. For example, the following query selects the free space, the name, and a few other metrics about the C: drive (see code file Ch03SQL.txt):
SELECT FreeSpace, DeviceId, Size, SystemName, Description
FROM Win32_LogicalDisk
WHERE DeviceID = 'C:'
The output of this type of query would look like this in a table:
Description, Local Fixed Disk
DeviceID, C:
FreeSpace, 32110985216
Size, 60003381248
SystemName, BKNIGHT
The following example of a WQL query selects information written to the Application event log after a certain date about the SQL Server and SSIS services (code file Ch03SQL.txt):
SELECT * FROM Win32_NTLogEvent
WHERE LogFile = 'Application'
AND (SourceName = 'SQLISService' OR SourceName = 'SQLISPackage')
AND TimeGenerated > '20050117'
The results would look like this:
0
BKNIGHT
12289
1073819649
3
System.String[]
Application
3738
SQLISPackage
20050430174924.000000-240
20050430174924.000000-240
Information
BKNIGHT\Brian Knight
0
Typically, the WMI Data Reader Task is used in SQL Server administration packages to gather operational-type data from the SQL Server environments. However, the WMI Event Watcher Task has some interesting uses for ETL processes that you’ll look at next.
WMI Event Watcher Task The WMI Event Watcher Task empowers SSIS to wait for and respond to certain WMI events that occur in the operating system. The task operates in much the same way as the WMI Data Reader Task. The following are some of the useful things you can do with this task:
➤➤ Watch a directory for a certain file to be written.
➤➤ Wait for a given service to start.
➤➤ Wait for the memory of a server to reach a certain level before executing the rest of the package or before transferring files to the server.
➤➤ Watch for the CPU to be free.
To illustrate the last example of polling to determine when the CPU is less than 50 percent utilized, you could have the WMI Event Watcher Task look for an event with this WQL code:
SELECT * FROM __InstanceModificationEvent WITHIN 2
WHERE TargetInstance ISA 'Win32_Processor'
AND TargetInstance.LoadPercentage < 50
The next section looks at a direct application of this WMI Event Watcher Task to give you a better idea of how to configure it and what it can do.
Polling a Directory for the Delivery of a File One very practical use of the WMI Event Watcher for ETL processing is to provide a buffer between the time when an SSIS job starts and the time when a file is actually delivered to a folder location. If there is a window of variability in file delivery and an SSIS package starts on a one-time schedule, then it is possible to miss processing the file for the day. By using a WMI Event Watcher, you can set up your SSIS packages to poll a folder location for a set period of time until a file is detected. If you have this type of challenge, a better solution may be a ForEach Loop Container scheduled to run periodically; you'll learn more about that in Chapter 6.
To set up a task to perform this automated action, open the WMI Options tab of the WMI Event Watcher Task Editor (see Figure 3-36). Notice that this task is configured quite differently from the WMI Data Reader Task. The WMI Event Watcher Task provides properties such as the AfterEvent option, which specifies whether the task should succeed, fail, or keep querying if the condition is met.
You also need to provide a length of time after which the WMI Event Watcher stops watching by setting the Timeout property. The timeout value is in seconds, and the default of zero (0) indicates that there is no timeout. Outside of your development activities, be very careful about leaving this setting at zero (0): the WMI Event Watcher could leave your SSIS package running indefinitely.
Figure 3-36
You can also configure what happens when a timeout occurs under the ActionAtTimeout and AfterTimeout settings. The NumberOfEvents option configures the number of events to watch for. You can use this setting to look for more than one file before you start processing. The WqlQuerySource for the File Watcher Configuration for this WMI Task would look like this code:
SELECT * FROM __InstanceCreationEvent WITHIN 10
WHERE TargetInstance ISA "CIM_DirectoryContainsFile"
AND TargetInstance.GroupComponent = "Win32_Directory.Name=\"c:\\\\ProSSIS\""
If you run this task with no files in the C:\ProSSIS\ directory, the task will remain yellow as the watcher continuously waits for an event to be raised. If you copy a file into the directory, the task will turn green and complete successfully. This is a great addition that is less resource-intensive than the legacy DTS version of iterating in a For loop until the file is found. As you can see, there are some major improvements in the capabilities to control workflow in SSIS.
SMO Administration Tasks The last section of this chapter is reserved for a set of tasks that are convenient for copying or moving schema and data-level information. The SQL Server Management Objects (SMO) model allows developers to interact with DBA functions programmatically. DBAs use these tasks, although rarely, to synchronize systems; because they aren't used as often, they're covered only at a high level. These tasks can do the following:
➤➤ Move or copy entire databases. This can be accomplished by detaching the database and moving the files (faster) or by moving the schema and content (slower).
➤➤ Transfer error messages from one server to another.
➤➤ Move or copy selected or entire SQL Agent jobs.
➤➤ Move or copy server-level or database-level logins.
➤➤ Move or copy objects such as tables, views, stored procedures, functions, defaults, user-defined data types, partition functions, partition schemes, schemas (or roles), SQL assemblies, user-defined aggregates, user-defined types, and XML schemas. These objects can be copied over by selecting all, by individually selecting each desired object type, or even by selecting individual objects themselves.
➤➤ Move or copy master stored procedures between two servers.
Transfer Database Task The Transfer Database Task has, as you would expect, a source and destination connection and a database property. The other properties address how the transfer should take place. Figure 3-37 is an example of the Transfer Database Task filled out to copy a development database on the same server as a QA instance.
Figure 3-37
Notice that the destination and source are set to the same server. For this copy to work, the DestinationDatabaseFiles property has to be set to new mdf and ldf filenames. The property is set by default to the SourceDatabaseFiles property. To set the new destination database filenames, click the ellipsis, and then change the Destination File or Destination Folder properties.
You can set the Method property to DatabaseOnline or DatabaseOffline. If the option is set to DatabaseOffline, the database is detached, copied over, and then reattached on both systems. This is a much faster process than DatabaseOnline, but it comes at the cost of making the database inaccessible during the transfer. The Action property controls whether the task should copy or move the source database. The Method property controls whether the database should be copied while the source database is kept online, using SQL Server Management Objects (SMO), or by detaching the database, moving the files, and then reattaching the database.
The DestinationOverwrite property controls whether the creation of the destination database should be allowed to overwrite an existing database. This includes deleting the database in the destination if it is found. This is useful in cases where you want to copy a database from production into a quality-control or production test environment, and the new database should replace any existing similar database. The last property, ReattachSourceDatabase, specifies what action should be taken upon failure of the copy. Use this property if you have a package running on a schedule that takes a production database offline to copy it, and you need to guarantee that the database goes back online even if the copy fails.
What's really great about the Transfer Database Task is that the logins, roles, object permissions, and even the data can be transferred as well. In some instances, this task may be too big a hammer; you may find it more advantageous to transfer only specific sets of objects from one database to another. The next five tasks give you that capability.
Transfer Error Messages Task If you are using custom error messages in the sys.messages table, you need to remember to copy these over when you move a database from one server to another. In the past, you needed to code a cursor-based script to fire the sp_addmessage system stored procedure to move these messages around — and you needed to remember to do it. Now you can create a package that moves your database with the Transfer Database Task and add this Transfer Error Messages Task to move the messages as well. One thing you’ll find in this task that you’ll see in the rest of the SMO administration tasks is the opportunity to select the specific things that you want to transfer. The properties ErrorMessagesList and ErrorMessageLanguagesList in the Messages tab are examples of this selective-type UI. If you click the ellipsis, you’ll get a dialog in which you can select specific messages to transfer. Generally, unless you are performing a one-off update, you should set the TransferAllErrorMessages property to true, and then set the IfObjectExists property to skip messages that already exist in the destination database.
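To appreciate what the task automates, the following T-SQL is a rough sketch of the kind of call you would otherwise have to script yourself for each custom message; the message number and text here are invented for illustration:
-- Manually recreating one custom error message on the destination server
EXEC sp_addmessage
    @msgnum = 60001,                -- custom message IDs start above 50000
    @severity = 16,
    @msgtext = N'Nightly load failed for customer %s.',
    @with_log = 'true',             -- also write occurrences to the Windows Application log
    @replace = 'replace';           -- overwrite the message if it already exists
The Transfer Error Messages Task effectively does this for every message you select, so you no longer need to maintain that script by hand.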
Transfer Logins Task The Transfer Logins Task (shown in Figure 3-38) focuses only on the security aspects of your databases. With this task you can transfer the logins from one database and have them corrected at the destination.
Figure 3-38
Of course, you’ll have your obligatory source and destination connection properties in this editor. You also have the option to move logins from all databases or selected databases, or you can select individual logins to transfer. Make this choice in the LoginsToTransfer property; the default is SelectedLogins. The partner properties to LoginsToTransfer are LoginsList and DatabasesList. One will be activated based on your choice of logins to transfer. Two last properties to cover relate to what you want the transfer logins process to do if it encounters an existing login in the destination. If you want the login to be replaced, set the IfObjectExists property to Overwrite. Other options are to fail the task or to skip that login. The long-awaited option to resolve unmatched user security IDs is found in the property CopySids, and can be true or false.
Transfer Master Stored Procedures Task This task is used to transfer stored procedures that are stored in the master database. If you need to transfer stored procedures from your own databases, use the Transfer SQL Server Objects Task instead. To use this task, set the source and destination connections, and then set the property TransferAllStoredProcedures to true or false. If you set this property to false, you'll be able to select individual master stored procedures to transfer. The remaining property, IfObjectExists, enables you to select what action should take place if a transferring object already exists in the destination. The options are Overwrite, FailTask, and Skip.
Transfer Jobs Task The Transfer Jobs Task (shown in Figure 3-39) aids you in transferring any of the existing SQL Server Agent jobs between SQL Server instances. Just like the other SMO tasks, you can either transfer all jobs to synchronize two instances or use the task to selectively pick which jobs you want to move to another instance. You can also select in the IfObjectExists property how the task should react if the job is already there. One important option is the EnableJobsAtDestination property, which turns on the jobs after they've been transferred. This property is false by default, meaning the jobs are transferred but will not run until they are enabled.
Figure 3-39
Transfer SQL Server Objects Task The Transfer SQL Server Objects Task is the most flexible of the Transfer tasks, as it is capable of transferring all types of database objects. To use this task, set the properties to connect to a source and destination database; some properties may be hidden until you expand the Connection category. This task exists for those instances when selective object copying is needed, which is why it is not called the Transfer Database Task.
You specifically have to set the CopyData property to true to get bulk transfers of data. The CopyAllObjects property means that only the tables, views, stored procedures, defaults, rules, and UDFs will be transferred. If you want the table indexes, triggers, primary keys, foreign keys, full-text indexes, or extended properties, you have to select these individually. By expanding the ObjectsToCopy category, you expose properties that allow individual selection of tables, views, and other programmable objects. The security options give you some of the same capabilities as the Transfer Database Task. You can transfer database users, roles, logins, and object-level permissions by selecting true for these properties.
The power of this task lies in its flexibility, as it can be customized and used in packages to move only specific items, for example, during the promotion of objects from one environment to another, or to be less discriminate and copy all tables, views, and other database objects, with or without the data.
Summary This chapter attempted to stick with the everyday nuts-and-bolts uses of the SSIS tasks. Throughout the chapter, you looked at each task, learned how to configure it, and looked at an example of the task in action. In fact, you saw a number of examples that demonstrated how to use these tasks in real-world ETL and EAI applications. In Chapter 6, you'll circle back to look at the Control Flow again to explore containers, which enable you to loop through tasks. The next chapter covers the Data Flow Task, diving deeper into configuring the Data Flow and all the transformations that are available within it.
4
The Data Flow
What's in This Chapter?
➤➤ Learn about the SSIS Data Flow architecture
➤➤ Reading data out of sources
➤➤ Loading data into destinations
➤➤ Transforming data with common transformations
Wrox.com Code Downloads for this Chapter
You can find the wrox.com code downloads for this chapter at www.wrox.com/go/prossis2014 on the Download Code tab.
In the last chapter you were introduced to the Control Flow tab through tasks. In this chapter, you’ll continue along those lines with an exploration of the Data Flow tab, which is where you will spend most of your time as an SSIS developer. The Data Flow Task is where the bulk of your data heavy lifting occurs in SSIS. This chapter walks you through the transformations in the Data Flow Task, demonstrating how they can help you move and clean your data. You’ll notice a few components (the CDC ones) aren’t covered in this chapter. Those needed more coverage than this chapter had room for and are covered in Chapter 11.
Understanding the Data Flow The SSIS Data Flow is implemented as a logical pipeline, where data flows from one or more sources, through whatever transformations are needed to cleanse and reshape it for its new purpose, and into one or more destinations. The Data Flow does its work primarily in memory, which gives SSIS its strength, allowing the Data Flow to perform faster than any ELT
type environment (in most cases) where the data is first loaded into a staging environment and then cleansed with a SQL statement.
One of the toughest concepts to understand for a new SSIS developer is the difference between the Control Flow and the Data Flow tabs. Chapter 2 explains this further, but just to restate a piece of that concept, the Control Flow tab controls the workflow of the package and the order in which each task will execute. Each task in the Control Flow has a user interface to configure the task, with the exception of the Data Flow Task. The Data Flow Task is configured in the Data Flow tab. Once you drag a Data Flow Task onto the Control Flow tab and double-click it to configure it, you're immediately taken to the Data Flow tab.
The Data Flow is made up of three components that are discussed in this chapter: sources, transformations (also known as transforms), and destinations. These three components make up the fundamentals of ETL. Sources extract data out of flat files, OLE DB databases, and other locations; transformations process the data once it has been pulled out; and destinations write the data to its final location.
Much of this ETL processing is done in memory, which is what gives SSIS its speed. It is much faster to apply business rules to your data in memory using a transformation than to have to constantly update a staging table. Because of this, though, your SSIS server will potentially need a large amount of memory, depending on the size of the file you are processing.
Data flows out of a source in memory buffers that are 10 megabytes in size or 10,000 rows (whichever comes first) by default. As the first transformation is working on those 10,000 rows, the next buffer of 10,000 rows is being processed at the source. This architecture limits the consumption of memory by SSIS and, in most cases, means that if you had 5 transforms dragged over, 50,000 rows will be worked on at the same time in memory. This can change only if you have asynchronous components like the Aggregate or Sort Transforms, which cause a full block of the pipeline.
Data Viewers Data viewers are a very important feature in SSIS for debugging your Data Flow pipeline. They enable you to view data at points in time at runtime. If you place a data viewer before and after the Aggregate Transformation, for example, you can see the data flowing into the transformation at runtime and what it looks like after the transformation happens. Once you deploy your package and run it on the server as a job or with the service, the data viewers do not show because they are only a debug feature within SQL Server Data Tools (SSDT). To place a data viewer in your pipeline, right-click one of the paths (red or blue arrows leaving a transformation or source) and select Enable Data Viewer. Once you run the package, you’ll see the data viewers open and populate with data when the package gets to that path in the pipeline that it’s attached to. The package will not proceed until you click the green play button (>). You can also copy the data into a viewer like Excel or Notepad for further investigation by clicking Copy Data. The data viewer displays up to 10,000 rows by default, so you may have to click the > button multiple times in order to go through all the data.
After adding more and more data viewers, you may want to remove them eventually to speed up your development execution. You can remove them by right-clicking the path that has the data viewer and selecting Disable Data Viewer.
Sources A source in the SSIS Data Flow is where you specify the location of your source data. Most sources will point to a Connection Manager in SSIS. By pointing to a Connection Manager, you can reuse connections throughout your package, because you need only change the connection in one place.
Source Assistant and Destination Assistant The Source Assistant and Destination Assistant are two components designed to remove the complexity of configuring a source or a destination in the Data Flow. The components determine which drivers you have installed and show you only the applicable ones, and they simplify the selection of a valid Connection Manager for the database platform you wish to connect to. In the Source Assistant or Destination Assistant (the Source Assistant is shown in Figure 4-1), only the data providers that you have installed are actually shown. Once you select how you want to connect, you'll see a list of Connection Managers on the right that you can use to connect to your selected source. You can also create a new Connection Manager from the same area on the right. If you uncheck the “Show only installed source types” option, you'll see other providers like DB2 or Oracle for which you may not have the right software installed.
Figure 4-1
OLE DB Source The OLE DB Source is the most common type of source, and it can point to any OLE DB–compliant Data Source such as SQL Server, Oracle, or DB2. To configure the OLE DB Source, double-click the source once you have added it to the design pane in the Data Flow tab. In the Connection Manager page of the OLE DB Source Editor (see Figure 4-2), select the Connection Manager of your OLE DB Source from the OLE DB Connection Manager dropdown box. You can also add a new Connection Manager in the editor by clicking the New button.
Figure 4-2
The “Data access mode” option specifies how you wish to retrieve the data. Your options here are Table/View or SQL Command, or you can pull either from a package variable. Once you select the data access mode, you select the table or view, or you type a query. For multiple reasons that will be explained momentarily, it is a best practice to retrieve the data with a query. This query can also be a stored procedure. Additionally, you can pass parameters into the query by substituting a question mark (?) where each parameter should be and then clicking the Parameters button. You'll learn more about parameterization of your queries in Chapter 5.
As with most sources, you can go to the Columns page to set the columns that you wish to output to the Data Flow, as shown in Figure 4-3. Simply check the columns you wish to output, and you can then assign the name you want to send down the Data Flow in the Output column. Select only the columns that you want to use, because the smaller the data set, the better the performance you will get.
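Tying that together, a parameterized source query might look like the following sketch. The table, the columns, and the User::LoadStartDate variable are hypothetical; the question mark is bound to a variable or parameter through the Parameters button:
SELECT SalesOrderID,       -- request only the columns the Data Flow actually needs
       OrderDate,
       TotalDue
FROM   Sales.SalesOrderHeader
WHERE  OrderDate >= ?      -- mapped to a variable such as User::LoadStartDate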
Figure 4-3
From a performance perspective, this is a case where it's better to have typed the query in the Connection Manager page rather than to have selected a table. Selecting a table to pull data from essentially selects all columns and all rows from the target table, transporting all that data across the network. Then, going to the Columns page and unchecking the unnecessary columns applies a client-side filter on the data, which is not nearly as efficient as selecting only the necessary columns in the SQL query. It is also gentler on the number of buffers you fill.
Optionally, you can go to the Error Output page (shown in Figure 4-4) and specify how you wish to handle rows that have errors. For example, you may wish to output any rows that have a data type conversion issue to a different path in the Data Flow. On each column, you can specify that if an error occurs, you wish the row to be ignored, be redirected, or fail. If you choose to ignore failures, the column for that row will be set to NULL. If you redirect the row, the row will be sent down the red path in the Data Flow coming out of the OLE DB Source. The Truncation column specifies what to do if data truncation occurs. A truncation error would happen, for example, if you try to place 200 characters of data into a column in the Data Flow that supports only 100. You have the same options available to you for Truncation as you do for the Error option. By default, if a data type or truncation error occurs, the entire Data Flow fails.
Figure 4-4
Excel Source The Excel Source is a source component that points to an Excel spreadsheet, just like it sounds. Once you point to an Excel Connection Manager, you can select the sheet from the “Name of the Excel sheet” dropdown box, or you can run a query by changing the Data Access Mode. This source treats Excel just like a database, where an Excel sheet is the table and the workbook is the database. If you do not see a list of sheets in the dropdown box, you may have a 64-bit machine that needs the ACE driver installed or you need to run the package in 32-bit mode. How to do this is documented in the next section in this chapter. SSIS supports Excel data types, but it may not support them the way you wish by default. For example, the default format in Excel is General. If you right-click a column and select Format Cells, you’ll find that most of the columns in your Excel spreadsheet have probably been set to General. SSIS translates this general format as a Unicode string data type. In SQL Server, Unicode translates into nvarchar, which is probably not what you want. If you have a Unicode data type in SSIS and you try to insert it into a varchar column, it will potentially fail. The solution is to place a Data Conversion Transformation between the source and the destination in order to change the Excel data types. You can read more about Data Conversion Transformations later in this chapter.
Excel 64-Bit Scenarios If you are connecting to an Excel 2007 spreadsheet or later, ensure that you select the proper Excel version when creating the Excel Connection Manager. You will not be able to connect to an Excel 2007, Excel 2010, or Excel 2013 spreadsheet otherwise. Additionally, the default Excel driver is a 32-bit driver only, and your packages have to run in 32-bit mode when using Excel connectivity. In the designer, you would receive the following error message if you do not have the correct driver installed:
The 'Microsoft.ACE.OLEDB.12.0' provider is not registered on the local machine.
To fix this, simply locate this driver on the Microsoft site and you’ll be able to run packages with an Excel source in 64-bit.
Flat File Source The Flat File Source provides a data source for connections such as text files or data that's delimited. Flat File Sources are typically comma- or tab-delimited files, or they could be fixed-width or ragged-right. A fixed-width file is typically received from the mainframe or government entities and has fixed start and stop points for each column. This method enables a fast load, but it takes longer at design time for the developer to map each column.
You specify a Flat File Source the same way you specify an OLE DB Source. Once you add it to your Data Flow pane, you point it to a Connection Manager connection that is a flat file or a multi-flat file. Next, from the Columns tab, you specify which columns you want to be presented to the Data Flow. All the specifications for the flat file, such as delimiter type, were previously set in the Flat File Connection Manager.
In this example, you'll create a Connection Manager that points to a file called FactSales.csv, which you can download from this book's website at www.wrox.com/go/prossis2014. The file has a date column, a few string columns, integer columns, and a currency column. Because of the variety of data types it includes, this example presents an interesting case study for learning how to configure a Flat File Connection Manager.
First, right-click in the Connection Manager area of the Package Designer and select New Flat File Connection Manager. This will open the Flat File Connection Manager Editor, as shown in Figure 4-5. Name the Connection Manager Fact Sales and point it to wherever you placed the FactSales.csv file. Check the “Column names in the first data row” option, which specifies that the first row of the file contains a header row with the column names.
Another important option is the “Text qualifier” option. Although there isn't one for this file, sometimes your comma-delimited files may require that you have a text qualifier. A text qualifier places a character around each column of data to show that any comma delimiter inside that symbol should be ignored. For example, if you had the most common text qualifier of double-quotes around your data, a row may look like the following, whereby there are only three columns even though the commas may indicate five:
"Knight,Brian", 123, "Jacksonville, FL"
In the Columns page of the Connection Manager, you can specify what will delimit each column in the flat file if you chose a delimited file. The row delimiter specifies what will indicate a new row. The default option is a carriage return followed by a line feed. The Connection Manager’s file is automatically scanned to determine the column delimiter and, as shown in Figure 4-6, use a tab delimiter for the example file.
Figure 4-5
Note: Often, once you make a major change to your header delimiter or your text qualifier, you'll have to click the Reset Columns button. Doing so requeries the file in order to obtain the new column names. If you click this option, though, all your metadata in the Advanced page will be recreated as well, and you may lose a sizable amount of work.
The Advanced page of the Connection Manager is the most important feature in the Connection Manager. In this tab, you specify the data type for each column in the flat file and the name of the column, as shown in Figure 4-7. This column name and data type will be later sent to the Data Flow. If you need to change the data types or names, you can always come back to the Connection Manager, but be aware that you need to open the Flat File Source again to refresh the metadata.
Note: Making a change to the Connection Manager's data types or columns requires that you refresh any Data Flow Task using that Connection Manager. To do so, open the Flat File Source Editor, which will prompt you to refresh the metadata of the Data Flow. Answer yes, and the metadata will be corrected throughout the Data Flow.
Figure 4-6
Figure 4-7
If you don’t want to specify the data type for each individual column, you can click the Suggest Types button on this page to have SSIS scan the first 100 records (by default) in the file to guess the appropriate data types. Generally speaking, it does a bad job at guessing, but it’s a great place to start if you have a lot of columns. If you prefer to do this manually, select each column and then specify its data type. You can also hold down the Ctrl key or Shift key and select multiple columns at once and change the data types or column length for multiple columns at the same time. A Flat File Connection Manager initially treats each column as a 50-character string by default. Leaving this default behavior harms performance when you have a true integer column that you’re trying to insert into SQL Server, or if your column contains more data than 50 characters of data. The settings you make in the Advanced page of the Connection Manager are the most important work you can do to ensure that all the data types for the columns are properly defined. You should also keep the data types as small as possible. For example, if you have a zip code column that’s only 9 digits in length, define it as a 9-character string. This will save an additional 41 bytes in memory multiplied by however many rows you have. A frustrating point with SSIS sometimes is how it deals with SQL Server data types. For example, a varchar maps in SSIS to a string column. It was designed this way to translate well into the .NET development world and to provide an agnostic product. The following table contains some of the common SQL Server data types and what they are mapped into in a Flat File Connection Manager. SQL Server Data T ype
Connec tion Manager Data T ype
Bigint
Eight-byte signed integer [DT_I8]
Binary
Byte stream [DT_BYTES]
Bit
Boolean [DT_BOOL]
Tinyint
Single-byte unsigned integer [DT_UI1]
Datetime
Database timestamp [DT_DBTIMESTAMP]
Decimal
Numeric [DT_NUMERIC]
Real
Float [DT_R4]
Int
Four-byte signed integer [DT_I4]
Image
Image [DT_IMAGE]
Nvarchar or nchar
Unicode string [DT_WSTR]
Ntext
Unicode text stream [DT_NTEXT]
Numeric
Numeric [DT_NUMERIC]
Smallint
Two-byte signed integer [DT_I2]
Text
Text stream [DT_TEXT]
Timestamp
Byte stream [DT_BYTES]
www.it-ebooks.info
c04.indd 108
3/22/2014 8:07:20 AM
Sources╇
Uniqueidentifier
Unique identifier [DT_GUID]
Varbinary
Byte stream [DT_BYTES]
Varchar or char
String [DT_STR]
Xml
Unicode string [DT_WSTR]
❘╇ 109
FastParse Option By default, SSIS issues a contract between the Flat File Source and a Data Flow. It states that the source component must validate any numeric or date column. For example, if you have a flat file in which a given column is set to a four-byte integer, every row must first go through a short validation routine to ensure that it is truly an integer and that no character data has passed through. On date columns, a quick check is done to ensure that every date is indeed a valid in-range date. This validation is fast but it does require approximately 20 to 30 percent more time to validate that contract. To set the FastParse property, go into the Data Flow Task for which you’re using a Flat File Source. Right-click the Flat File Source and select Show Advanced Editor. From there, select the Input and Output Properties tab, and choose any number or date column under Flat File Output ➪ Output Columns tree. In the right pane, change the FastParse property to True, as shown in Figure 4-8.
Figure 4-8
MultiFlatFile Connection Manager If you know that you want to process a series of flat files in a Data Flow, or you want to refer to many files in the Control Flow, you can optionally use the MultiFlatFile or “Multiple Flat File Connection Manager.” The Multiple Flat File Connection Manager refers to a list of files for copying or moving, or it may hold a series of SQL scripts to execute, similar to the File Connection Manager. The Multiple Flat File Connection Manager gives you the same view as a Flat File Connection Manager, but it enables you to point to multiple files. In either case, you can point to a list of files by placing a vertical bar (|) between each filename: C:\Projects\011305c.dat|C:\Projects\053105c.dat
In the Data Flow, the Multiple Flat File Connection Manager reacts by combining the total number of records from all the files that you have pointed to, appearing like a single merged file. Using this option will initiate the Data Flow process only once for the files whereas the Foreach Loop container will initiate the process once per file being processed. In either case, the metadata from the file must match in order to use them in the Data Flow. Most developers lean toward using Foreach Loop Containers because it’s easier to make them dynamic. With these Multiple File or Multiple Flat File Connection Managers, you have to parse your file list and add the vertical bar between them. If you use Foreach Loop Containers, that is taken care of for you.
Raw File Source The Raw File Source is a specialized type of file that is optimized for reading data quickly from SSIS. A Raw File Source is created by a Raw File Destination (discussed later in this chapter). You can’t add columns to the Raw File Source, but you can remove unused columns from the source in much the same way you do in the other sources. Because the Raw File Source requires little translation, it can load data much faster than the Flat File Source, but the price of this speed is little flexibility. Typically, you see raw files used to capture data at checkpoints to be used later in case of a package failure. These sources are typically used for cross-package or cross-Data Flow communication. For example, if you have a Data Flow that takes four hours to run, you might wish to stage the data to a raw file halfway through the processing in case a problem occurs. Then, the second Data Flow Task would continue the remaining two hours of processing.
XML Source The XML source is a powerful SSIS source that can use a local or remote (via HTTP or UNC) XML file as the source. This source component is a bit different from the OLE DB Source in its configuration. First, you point to the XML file locally on your machine or at a UNC path. You can also point to a remote HTTP address for an XML file. This is useful for interaction with a vendor. This source is also very useful when used in conjunction with the Web Service Task or the XML Task. Once you point the data item to an XML file, you must generate an XSD (XML Schema Definition) file by clicking the Generate XSD button or point to an existing XSD file. The schema definition can also be an in-line XML file, so you don’t necessarily need an XSD file. Each of these cases may vary based on the XML that you’re trying to connect. The rest of the source resembles other sources; for example, you can filter the columns you don’t want to see down the chain.
ADO.NET Source The ADO.NET Source enables you to make a .NET provider a source and make it available for consumption inside the package. The source uses an ADO.NET Connection Manager to connect to the provider. The Data Flow is based on OLE DB, so for best performance, using the OLE DB Source is preferred. However, some providers might require that you use the ADO.NET source. Its interface is identical in appearance to the OLE DB Source, but it does require an ADO.NET Connection Manager.
Destinations Inside the Data Flow, destinations accept the data from the Data Sources and from the transformations. The architecture can send the data to nearly any OLE DB–compliant Data Source, a flat file, or Analysis Services, to name just a few. Like sources, destinations are managed through Connection Managers. The configuration difference between sources and destinations is that in destinations, you have a Mappings page (shown in Figure 4-9), where you specify how the inputted data from the Data Flow maps to the destination. As shown in the Mappings page in this figure, the columns are automatically mapped based on column names, but they don’t necessarily have to be exactly lined up. You can also choose to ignore given columns, such as when you’re inserting into a table that has an identity column, and you don’t want to inherit the value from the source table.
Figure 4-9
In SQL Server 2014, you can start by configuring the destination first, but it would lack the metadata you need. So, you will really want to connect to a Data Flow path. To do this, select the source or a transformation and drag the blue arrow to the destination. If you want to output bad data or data that has had an error to a destination, you would drag the red arrow to that destination. If you try to configure the destination before attaching it to the transformation or source, you will see the error in Figure 4-10. In SQL Server 2014, you can still proceed and edit the component, but it won’t be as meaningful without the live metadata.
Figure 4-10
Excel Destination The Excel Destination is identical to the Excel Source except that it accepts data rather than sends data. To use it, first select the Excel Connection Manager from the Connection Manager page, and then specify the worksheet into which you wish to load data.
WARNING: The big caveat with the Excel Destination is that unlike the Flat File Destination, an Excel spreadsheet must already exist with the sheet into which you wish to copy data. If the spreadsheet doesn't exist, you will receive an error. To work around this issue, you can create a blank spreadsheet to use as your template, and then use the File System Task to copy the file over.
Flat File Destination The commonly used Flat File Destination sends data to a flat file, and it can be fixed-width or delimited based on the Connection Manager. The destination uses a Flat File Connection Manager. You can also add a custom header to the file by typing it into the Header option in the Connection Manager page. Lastly, you can specify on this page that the destination file should be overwritten each time the Data Flow is run.
OLE DB Destination Your most commonly used destination will probably be the OLE DB Destination (see Figure 4-11). It can write data from the source or transformation to OLE DB–compliant Data Sources such as Oracle, DB2, Access, and SQL Server. It is configured like any other destination and source, using OLE DB Connection Managers. A dynamic option it has is the Data Access Mode. If you select Table or View - Fast Load, or its variable equivalent, several options will be available, such as Table Lock. This Fast Load option is available only for SQL Server database instances and turns on a bulk load option in SQL Server instead of a row-by-row operation.
Figure 4-11
A few options of note here are Rows Per Batch, which specifies how many rows are in each batch sent to the destination, and Maximum Insert Commit Size, which specifies how large the batch size will be prior to issuing a commit statement. The Table Lock option places a lock on the destination table to speed up the load. As you can imagine, this causes grief for your users if they are trying to read from the table at the same time. Another important option is Keep Identity, which enables you to insert into a column that has the identity property set on it. Generally speaking, you can improve performance by setting Max Insert Commit Size to a number like 10,000, but that number will vary according to column width. New users commonly ask what the difference is between the fast load and the normal load (table or view option) for the OLE DB Destination. The Fast Load option specifies that SSIS will load data in bulk into the OLE DB Destination’s target table. Because this is a bulk operation, error handling via a redirection or ignoring data errors is not allowed. If you require this level of error handling, you need to turn off bulk loading of the data by selecting Table or View for the Data Access Mode option. Doing so will allow you to redirect your errors down the red line, but it causes a slowdown of the load by a factor of at least four.
Raw File Destination The Raw File Destination is an especially speedy Data Destination that does not use a Connection Manager to configure. Instead, you point to the file on the server in the editor. This destination is written to typically as an intermediate point for partially transformed data. Once written to, other packages can read the data in by using the Raw File Source. The file is written in native format, so it is very fast.
Recordset Destination The Recordset Destination populates an ADO recordset that can be used outside the transformation. For example, you can populate the ADO recordset, and then a Script Task could read that recordset by reading a variable later in the Control Flow. This type of destination does not support an error output like some of the other destinations.
Data Mining Model Training The Data Mining Model Training Destination can train (the process of a data mining algorithm learning the data) an Analysis Services data mining model by passing it data from the Data Flow. You can train multiple mining models from a single destination and Data Flow. To use this destination, you select an Analysis Services Connection Manager and the mining model. Analysis Services mining models are beyond the scope of this book; for more information, please see Professional SQL Server Analysis Services 2012 with MDX and DAX by Sivakumar Harinath and his coauthors (Wrox, 2012).
Note: The data you pass into the Data Mining Model Training Destination must be presorted. To do this, you use the Sort Transformation, discussed later in this chapter.
DataReader Destination The DataReader Destination provides a way to extend SSIS Data Flows to external packages or programs that can use the DataReader interface, such as a .NET application. When you configure this destination, ensure that its name is something that’s easy to recognize later in your program, because you will be calling that name later. After you have configured the name and basic properties, check the columns you’d like outputted to the destination in the Input Columns tab.
Dimension and Partition Processing The Dimension Processing Destination loads and processes an Analysis Services dimension. You have the option to perform full, incremental, or update processing. To configure the destination, select the Analysis Services Connection Manager that contains the dimension you would like to process on the Connection Manager page of the Dimension Processing Destination Editor. You will then see a list of dimensions and fact tables in the box. Select the dimension you want to load and process, and from the Mappings page, map the data from the Data Flow to the selected dimension. Lastly, you can configure how you would like to handle errors, such as unknown keys, in the Advanced page. Generally, the default options are fine for this page unless you have special error-handling needs. The Partition Processing Destination has identical options, but it processes an Analysis Services partition instead of a dimension.
Common Transformations Transformations or transforms are key components to the Data Flow that transform the data to a desired format as you move from step to step. For example, you may want a sampling of your data to be sorted and aggregated. Three transformations can accomplish this task for you: one to take a random sampling of the data, one to sort, and another to aggregate. The nicest thing about transformations in SSIS is that they occur in-memory and no longer require elaborate scripting as in SQL Server 2000 DTS. As you add a transformation, the data is altered and passed down the path in the Data Flow. Also, because this is done in-memory, you no longer have to create staging tables to perform most functions. When dealing with very large data sets, though, you may still choose to create staging tables. You set up the transformation by dragging it onto the Data Flow tab design area. Then, click the source or transformation you’d like to connect it to, and drag the green arrow to the target transformation or destination. If you drag the red arrow, then rows that fail to transform will be directed to that target. After you have the transformation connected, you can double-click it to configure it.
Synchronous versus Asynchronous Transformations Transformations are divided into two main categories: synchronous and asynchronous. In SSIS, you ideally want to use all synchronous components. Synchronous transformations are components such as the Derived Column and Data Conversion Transformations, where rows flow into memory buffers in the transformation, and the same buffers come out. No rows are held, and typically these transformations perform very quickly, with minimal impact to your Data Flow.
Asynchronous transformations can cause a block in your Data Flow and slow down your runtime. There are two types of asynchronous transformations: partially blocking and fully blocking.
➤➤ Partially blocking transformations, such as the Union All, create new memory buffers for the output of the transformation.
➤➤ Fully blocking transformations, such as the Sort and Aggregate Transformations, do the same thing but cause a full block of the data. In order to sort the data, SSIS must first see every single row of the data. If you have a 100MB file, then you may require 200MB of RAM in order to process the Data Flow because of a fully blocking transformation. These fully blocking transformations represent the single largest slowdown in SSIS and should be considered carefully in terms of any architecture decisions you must make.
Note: Chapter 16 covers these concepts in much more depth.
Aggregate The fully blocking asynchronous Aggregate Transformation allows you to aggregate data from the Data Flow to apply certain T-SQL functions that are done in a GROUP BY statement, such as Average, Minimum, Maximum, and Count. For example, in Figure 4-12, you can see that the data is grouped
together on the ProductKey column, and then the SalesAmount column is summed. Lastly, for every ProductKey, the maximum OrderDateKey is aggregated. This produces four new columns that can be consumed down the path, or future actions can be performed on them and the other columns dropped at that time.
Figure 4-12
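To relate this to T-SQL, the aggregation just described is roughly equivalent to the following query; the FactSales table name is simply assumed here for illustration:
SELECT   ProductKey,
         SUM(SalesAmount)  AS SalesAmount,    -- summed sales per product
         MAX(OrderDateKey) AS OrderDateKey    -- latest order date per product
FROM     FactSales
GROUP BY ProductKey;
The difference is that the Aggregate Transformation performs this work in the SSIS pipeline's memory rather than asking the database engine to do it.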
The Aggregate Transformation is configured in the Aggregate Transformation Editor (see Figure 4-12). To do so, first check the column on which you wish to perform the action. After checking the column, the input column will be filled below in the grid. Optionally, type an alias in the Output Alias column that you wish to give the column when it is outputted to the next transformation or destination. For example, if the column currently holds the total money per customer, you might change the name of the column that's outputted from SalesAmount to TotalCustomerSaleAmt. This will make it easier for you to recognize what the column represents along the data path. The most important option is Operation. For this option, you can select the following:
➤➤ Group By: Breaks the data set into groups by the column you specify
➤➤ Average: Averages the selected column's numeric data
➤➤ Count: Counts the records in a group
➤➤ Count Distinct: Counts the distinct non-NULL values in a group
➤➤ Minimum: Returns the minimum numeric value in the group
➤➤ Maximum: Returns the maximum numeric value in the group
➤➤ Sum: Returns the sum of the selected column's numeric data in the group
You can click the Advanced tab to see options that enable you to configure multiple outputs from the transformation. After you click Advanced, you can type a new Aggregation Name to create a new output. You will then be able to check the columns you'd like to aggregate again as if it were a new transformation. This can be used to roll up the same input data in different ways.
In the Advanced tab, the “Key scale” option sets an approximate number of keys. The default is Unspecified, which optimizes the transformation's cache to the appropriate level. For example, setting this to Low will optimize the transform to write 500,000 keys. Setting it to Medium will optimize it for 5,000,000 keys, and High will optimize the transform for 25,000,000 keys. You can also set the exact number of keys by using the “Number of keys” option.
The “Count distinct scale” option will optionally set the number of distinct values that can be written by the transformation. The default value is Unspecified, but if you set it to Low, the transformation will be optimized to write 500,000 distinct values. Setting the option to Medium will set it to 5,000,000 values, and High will optimize the transformation to 25,000,000. The Auto Extend Factor specifies to what factor your memory can be extended by the transformation. The default option is 25 percent, but you can specify another setting to keep your RAM from getting away from you.
Conditional Split The Conditional Split Transformation is a fantastic way to add complex logic to your Data Flow. This transformation enables you to send the data from a single data path to various outputs or paths based on conditions that use the SSIS expression language. For example, you could configure the transformation to send all products with sales that have a quantity greater than 500 to one path, and products that have more than 50 sales down another path. Lastly, if neither condition is met, the sales would go down a third path, called “Small Sale,” which essentially acts as an ELSE statement in T-SQL. This exact situation is shown in Figure 4-13.
You can drag and drop the column or expression code snippets from the tree in the top-right panel. After you complete the condition, you need to name it something logical, rather than the default name of Case 1. You'll use this case name later in the Data Flow. You also can configure the “Default output name,” which will output any data that does not fit any case. Each case in the transform and the default output name will show as a green line in the Data Flow and will be annotated with the name you typed in.
You can also conditionally read string data by using SSIS expressions, such as the following example, which reads the first letter of the City column:
SUBSTRING(City,1,1) == "F"
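As a sketch of the quantity-based scenario above, the first two outputs might be defined with expressions like these (OrderQuantity is a hypothetical column name; cases are evaluated in order, so the second case only sees rows that did not match the first):
Large Sale:   OrderQuantity > 500
Medium Sale:  OrderQuantity > 50
Any row that matches neither expression falls through to the default output, which in this scenario was named Small Sale.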
You can learn much more about the expression language in Chapter 5. Once you connect the transformation to the next transformation in the path or destination, you’ll see a pop-up dialog that lets you select which case you wish to flow down this path, as shown in Figure 4-14. In this figure, you can see three cases. The “Large Sale” condition can go down one path, “Medium Sales” down another, and the default “Small Sales” down the last path. After you complete the configuration of the first case, you can create a path for each case in the conditional split.
Figure 4-13
Figure 4-14
A much more detailed example of the Conditional Split Transformation is given in Chapter 8.
Data Conversion The Data Conversion Transformation performs a similar function to the CONVERT or CAST functions in T-SQL. This transformation is configured in the Data Conversion Transformation Editor (see Figure 4-15), where you check each column that you wish to convert and then specify to what
you wish to convert it under the Data Type column. The Output Alias is the column name you want to assign to the column after it is transformed. If you don’t assign it a new name, it will later be displayed as Data Conversion: ColumnName in the Data Flow. This same logic can also be accomplished in a Derived Column Transform, but this component provides a simpler UI.
Figure 4-15
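As a point of reference only, the kind of conversion this transformation performs is what you would write in T-SQL with CAST or CONVERT; the table and column names below are purely illustrative:

    SELECT CAST(ProductName AS varchar(50)) AS ConvertedProductName,
           CONVERT(int, OrderQuantity)      AS ConvertedOrderQuantity
    FROM dbo.StagedOrders;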
Derived Column The Derived Column Transformation creates a new column that is calculated (derived) from the output of another column or set of columns. It is one of the most important transformations in your Data Flow arsenal. You may wish to use this transformation, for example, to multiply the quantity of orders by the cost of the order to derive the total cost of the order, as shown in Figure 4-16. You can also use it to find out the current date or to fill in the blanks in the data by using the ISNULL function. This is one of the top five transformations that you will find yourself using to alleviate the need for T-SQL scripting in the package.
To configure this transformation, drag the column or variable into the Expression column, as shown in Figure 4-16. Then add any functions to it. You can find a list of functions in the top-right corner of the Derived Column Transformation Editor. You must then specify, in the Derived Column dropdown box, whether you want the output to replace an existing column in the Data Flow or create a new column. As shown in Figure 4-16, the first derived column expression is doing an in-place update of the OrderQuantity column. The expression states that if the OrderQuantity column is null, then convert it to 0; otherwise, keep the existing data in the OrderQuantity column. If you create a new column, specify the name in the Derived Column Name column, as shown in the VAT column.
Figure 4-16
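Expressed as SSIS expressions, the two derived columns described above might look like the following; the VAT rate and the UnitPrice column are invented here purely for illustration:

    OrderQuantity (Replace 'OrderQuantity'):   ISNULL(OrderQuantity) ? 0 : OrderQuantity
    VAT (add as new column):                   OrderQuantity * UnitPrice * 0.20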
You’ll find all the available functions for the expression language in the top-right pane of the editor. There are no hidden or secret expressions in this C# variant expression language. We use the expression language much more throughout this and future chapters so don’t worry too much about the details of the language yet.
Some common expressions can be found in the following table:

Example Expression               Description
SUBSTRING(ZipCode, 1,5)          Captures the first 5 numbers of a zip code
ISNULL(Name) ? "NA" : Name       If the Name column is NULL, replace with the value of NA. Otherwise, keep the Name column as is.
UPPER(FirstName)                 Uppercases the FirstName column
(DT_WSTR, 3)CompanyID + Name     Converts the CompanyID column to a string and appends it to the Name column
Lookup The Lookup Transformation performs what equates to an INNER JOIN on the Data Flow and a second data set. The second data set can be an OLE DB table or a cached file, which is loaded in the Cache Transformation. After you perform the lookup, you can retrieve additional columns from the second data set. If no match is found, an error occurs by default. You can later choose, using the Configure Error Output button, to ignore the failure (setting any additional columns retrieved from the reference table to NULL) or redirect the rows down the second nonmatched green path. Note ╇ This is a very detailed transformation; it is covered in much more depth in Chapter 7 and again in Chapter 8.
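Conceptually, the Lookup behaves like the following T-SQL, returning extra columns from the reference table for each matching row; the table and column names here are illustrative only:

    SELECT s.SalesOrderID, s.ProductID, p.Name AS ProductName
    FROM dbo.StagedSales AS s
    INNER JOIN dbo.DimProduct AS p
        ON s.ProductID = p.ProductID;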
Cache The Cache Transformation enables you to load a cache file on disk in the Data Flow. This cache file is later used for fast lookups in a Lookup Transformation. The Cache Transformation can be used to populate a cache file in the Data Flow as a transformation, and then be immediately used, or it can be used as a destination and then used by another package or Data Flow in the same package. The cache file that’s created enables you to perform lookups against large data sets from a raw file. It also enables you to share the same lookup cache across many Data Flows or packages. Note ╇ This transformation is covered in much more detail in Chapter 7.
Row Count The Row Count Transformation provides the capability to count rows in a stream that is directed to its input source. This transformation must place that count into a variable that could be used in the Control Flow — for insertion into an audit table, for example. This transformation is useful for tasks that require knowing “how many?” It is especially valuable because you don’t physically have to commit stream data to a physical table to retrieve the count, and it can act as a
destination, terminating your data stream. If you need to know how many rows are split during the Conditional Split Transformation, direct the output of each side of the split to a separate Row Count Transformation. Each Row Count Transformation is designed for an input stream and will output a row count into a Long (integer) or compatible data type. You can then use this variable to log information into storage, to build e-mail messages, or to conditionally run steps in your packages. For this transformation, all you really need to provide in terms of configuration is the name of the variable to store the count of the input stream. You will now simulate a row count situation in a package. You could use this type of logic to implement conditional execution of any task, but for simplicity you’ll conditionally execute a Script Task that does nothing.
1.
Create an SSIS package named Row Count Example. Add a Data Flow Task to the Control Flow design surface.
2.
In the Control Flow tab, add a variable named iRowCount. Ensure that the variable is package scoped and of type Int32. If you don’t know how to add a variable, select Variable from the SSIS menu and click the Add Variable button.
3.
Create a Connection Manager that connects to the AdventureWorks database. Add an OLE DB Data Source to the Data Flow design surface. Configure the source to point to your AdventureWorks database’s Connection Manager and the table [ErrorLog].
4.
Add a Row Count Transformation Task to the Data Flow tab. Open the Advanced Editor. Select the variable named User::iRowCount as the Variable property. Your editor should resemble Figure 4-17.
5.
Return to the Control Flow tab and add a Script Task. This task won’t really perform any action. It will be used to show the conditional capability to perform steps based on the value returned by the Row Count Transformation.
Figure 4-17

6.
Connect the Data Flow Task to the Script Task.

7.
Right-click the arrow connecting the Data Flow and Script Tasks. Select the Edit menu. In the Precedence Constraint Editor, change the Evaluation Operation to Expression. Set the Expression to @iRowCount>0.
When you run the package, you’ll see that the Script Task is not executed. If you are curious, insert a row into the [ErrorLog] table and rerun the package, or change the source to a table that has data. You’ll see that the Script Task will show a green checkmark, indicating that it was executed. An example of what your package may look like is shown in Figure 4-18. In this screenshot, no rows were transformed, so the Script Task never executed.
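If you need a quick way to get a row into that table for the test, an insert like the following should do it; the column list assumes the standard AdventureWorks ErrorLog schema, so adjust it to match your copy of the database:

    INSERT INTO dbo.ErrorLog (UserName, ErrorNumber, ErrorMessage)
    VALUES (SUSER_SNAME(), 50000, 'Test row for the Row Count example');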
Figure 4-18

Script Component
The Script Component enables you to write custom .NET scripts as transformations, sources, or destinations. Once you drag the component over, it will ask you if
you want it to be a source, transformation, or destination. Some of the things you can do with this transformation include the following:

➤➤ Create a custom transformation that would use a .NET assembly to validate credit card numbers or mailing addresses.

➤➤ Validate data and skip records that don’t seem reasonable. For example, you can use it in a human resource recruitment system to pull out candidates that don’t match the salary requirement at a job code level.

➤➤ Read from a proprietary system for which no standard provider exists.

➤➤ Write a custom component to integrate with a third-party vendor.
Scripts used as sources can support multiple outputs, and you have the option of precompiling the scripts for runtime efficiency. Note ╇ You can learn much more about the Script Component in Chapter 9.
Slowly Changing Dimension The Slowly Changing Dimension (SCD) Transformation provides a great head start in helping to solve a common, classic changing-dimension problem that occurs in the outer edge of your data model — the dimension or lookup tables. The changing-dimension issue in online transaction and analytical processing database designs is too big to cover in this chapter, but a brief overview should help you understand the value of service the SCD Transformation provides. A dimension table contains a set of discrete values with a description and often other measurable attributes such as price, weight, or sales territory. The classic problem is what to do in your dimension data when an attribute in a row changes — particularly when you are loading data automatically through an ETL process. This transformation can shave days off of your development time in relation to creating the load manually through T-SQL, but it can add time because of how it queries your destination and how it updates with the OLE DB Command Transform (row by row). Note ╇ Loading data warehouses is covered in Chapter 12.
Sort The Sort Transformation is a fully blocking asynchronous transformation that enables you to sort data based on any column in the path. This is probably one of the top ten transformations you will use on a regular basis because some of the other transformations require sorted data, and you’re reading data from a system that does not allow you to perform an ORDER BY clause or is not pre-sorted. To configure it, open the Sort Transformation Editor after it is connected to the path and check the column that you wish to sort by. Then, uncheck any column you don’t want passed through to the path from the Pass Through column. By default, every column will be passed through the pipeline. You can see this in Figure 4-19, where the user is sorting by the Name column and passing all other columns in the path as output.
Figure 4-19
In the bottom grid, you can specify the alias that you wish to output and whether you want to sort in ascending or descending order. The Sort Order column shows which column will be sorted on first, second, third, and so on. You can optionally check the Remove Rows with Duplicate Sort Values option to “Remove rows that have duplicate sort values.” This is a great way to do rudimentary de-duplication of your data. If a second value comes in that matches your same sort key, it is ignored and the row is dropped. Note ╇ Because this is an asynchronous transformation, it will slow down your Data Flow immensely. Use it only when you have to, and use it sparingly.
As mentioned previously, avoid using the Sort Transformation when possible, because of speed. However, some transformations, like the Merge Join and Merge, require the data to be sorted. If you place an ORDER BY statement in the OLE DB Source, SSIS is not aware of the ORDER BY statement because it could just as easily have been in a stored procedure. If you have an ORDER BY clause in your T-SQL statement in the OLE DB Source or the ADO.NET Source, you can notify SSIS in the Advanced Editor that the data is already sorted, obviating the need for the Sort Transformation. After ordering the data in your SQL statement, right-click the source and select Advanced Editor. From the Input and Output Properties tab, select OLE DB Source Output. In the Properties pane, change the IsSorted property to True.
Then, under Output Columns, select the column you are ordering on in your SQL statement, and change the SortKeyPosition to 1 if you’re sorting only by a single column ascending, as shown in Figure 4-20. If you have multiple columns, you could change this SortKeyPosition value to the column position in the ORDER BY statement starting at 1. A value of -1 will sort the data in descending order.
Figure 4-20
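For example, a source query like the following (a sketch against the AdventureWorks Production.Product table) already returns ordered data, so after setting IsSorted to True and SortKeyPosition to 1 on the Name output column, downstream Merge or Merge Join components will accept it without a Sort Transformation:

    SELECT ProductID, Name, ListPrice
    FROM Production.Product
    ORDER BY Name;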
Union All The Union All Transformation works much the same way as the Merge Transformation, but it does not require sorted data. It takes the outputs from multiple sources or transformations and combines them into a single result set. For example, in Figure 4-21, the user combines the data from three sources into a single output using the Union All Transformation. Notice that the City column is called something different in each source and that all are now merged in this transformation into a single column. Think of the Union All as essentially stacking the data on top of each other, much like the T-SQL UNION operator does.
Figure 4-21
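To make the T-SQL analogy concrete, the behavior is closest to UNION ALL, which stacks rows without removing duplicates; the three source tables below are hypothetical and only mirror the three-source layout shown in Figure 4-21:

    SELECT CustomerID, City FROM dbo.WebOrders
    UNION ALL
    SELECT CustomerID, Town AS City FROM dbo.StoreOrders
    UNION ALL
    SELECT CustomerID, CityName AS City FROM dbo.PhoneOrders;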
To configure the transformation, connect the first source or transformation to the Union All Transformation, and then continue to connect the other sources or transformations to it until you are done. You can optionally open the Union All Transformation Editor to ensure that the columns map correctly, but SSIS takes care of that for you automatically. The transformation fixes minor metadata issues. For example, if you have one input that is a 20-character string and another that is 50 characters, the output from the Union All Transformation will be the longer 50-character column. You need to open the Union All Transformation Editor only if the columns from one of the inputs that feed the Union All Transformation have different names.
Other Transformations There are many more transformations you can use to complete your more complex Data Flow. Some of these transformations, like the Audit and Copy Column Transformations, can be used in lieu of a Derived Column Transformation because they have a simpler UI. Others serve a more specialized purpose.
Audit The Audit Transformation allows you to add auditing data to your Data Flow. Because of acts such as HIPAA and Sarbanes-Oxley (SOX) governing audits, you often must be able to track who inserted
data into a table and when. This transformation helps you with that function. The task is easy to configure. For example, to track what task inserted data into the table, you can add those columns to the Data Flow path with this transformation. The functionality in the Audit Transformation can be achieved with a Derived Column Transformation, but the Audit Transformation provides an easier interface. All other columns are passed through to the path as an output, and any auditing item you add will also be added to the path. Simply select the type of data you want to audit in the Audit Type column (shown in Figure 4-22), and then name the column that will be outputted to the flow. Following are some of the available options:

➤➤ Execution instance GUID: GUID that identifies the execution instance of the package

➤➤ Package ID: Unique ID for the package

➤➤ Package name: Name of the package

➤➤ Version ID: Version GUID of the package

➤➤ Execution start time: Time the package began

➤➤ Machine name: Machine on which the package ran

➤➤ User name: User who started the package

➤➤ Task name: Data Flow Task name that holds the Audit Task

➤➤ Task ID: Unique identifier for the Data Flow Task that holds the Audit Task
Figure 4-22
Character Map The Character Map Transformation (shown in Figure 4-23) performs common character translations in the flow. This simple transformation can be configured in a single tab. To do so, check the columns you wish to transform. Then, select whether you want this modified column to be added as a new column or whether you want to update the original column. You can give the column a new name under the Output Alias column. Lastly, select the operation you wish to perform on the inputted column. The available operation types are as follows:

➤➤ Byte Reversal: Reverses the order of the bytes. For example, for the data 0x1234 0x9876, the result is 0x4321 0x6789. This uses the same behavior as LCMapString with the LCMAP_BYTEREV option.

➤➤ Full Width: Converts the half-width character type to full width

➤➤ Half Width: Converts the full-width character type to half width

➤➤ Hiragana: Converts the Katakana style of Japanese characters to Hiragana

➤➤ Katakana: Converts the Hiragana style of Japanese characters to Katakana

➤➤ Linguistic Casing: Applies the regional linguistic rules for casing

➤➤ Lowercase: Changes all letters in the input to lowercase

➤➤ Traditional Chinese: Converts the simplified Chinese characters to traditional Chinese

➤➤ Simplified Chinese: Converts the traditional Chinese characters to simplified Chinese

➤➤ Uppercase: Changes all letters in the input to uppercase
In Figure 4-23, you can see that two columns are being transformed — both to uppercase. For the TaskName input, a new column is added, and the original is kept. The PackageName column is replaced in-line.
Copy Column The Copy Column Transformation is a very simple transformation that copies the output of a column to a clone of itself. This is useful if you wish to create a copy of a column before you perform some elaborate transformations. You could then keep the original value as your control subject and the copy as the modified column. To configure this transformation, go to the Copy Column Transformation Editor and check the column you want to clone. Then assign a name to the new column. Note ╇ The Derived Column Transformation will allow you to transform the data from a column to a new column, but the UI in the Copy Column Transformation is simpler for some.
Figure 4-23
Data Mining Query The Data Mining Query Transformation typically is used to fill in gaps in your data or predict a new column for your Data Flow. This transformation runs a Data Mining Extensions (DMX) query against an SSAS data-mining model, and adds the output to the Data Flow. It also can optionally add columns, such as the probability of a certain condition being true. A few great scenarios for this transformation would be the following:

➤➤ You could take columns, such as number of children, household income, and marital income, to predict a new column that states whether the person owns a house or not.

➤➤ You could predict what customers would want to buy based on their shopping cart items.

➤➤ You could fill the gaps in your data where customers didn’t enter all the fields in a questionnaire.
The possibilities are endless with this transformation.
DQS Cleansing The Data Quality Services (DQS) Cleansing Transformation performs advanced data cleansing on data flowing through it. With this transformation, you can have your business analyst (BA) create a series of business rules that declare what good data looks like in the Data Quality Client (included in SQL Server). The BA will use a tool called the Data Quality Client to create domains that define data in your company, such as what a Company Name column should always look like. The DQS Cleansing Transformation can then use that business rule. This transformation will score the data for you and tell you what the proper cleansed value should be. Chapter 10 covers this transformation in much more detail.
Export Column The Export Column Transformation is a transformation that exports data to a file from the Data Flow. Unlike the other transformations, the Export Column Transformation doesn’t need a destination to create the file. To configure it, go to the Export Column Transformation Editor, shown in Figure 4-24. Select the column that contains the file from the Extract Column dropdown box. Select the column that contains the path and filename to send the files to in the File Path Column dropdown box.
Figure 4-24
The other options specify whether the file will be overwritten or dropped. The Allow Append checkbox specifies whether the output should be appended to the existing file, if one exists. If you check Force Truncate, the existing file will be overwritten if it exists. The Write BOM option specifies whether a byte-order mark is written to the file if it is a DT_NTEXT or DT_WSTR data type. If you do not check the Append or Truncate options and the file exists, the package will fail if the error isn’t handled. The following error is a subset of the complete error you would receive:

Error: 0xC02090A6 at Data Flow Task, Export Column [61]: Opening the file
"wheel_small.tif" for writing failed. The file exists and cannot be overwritten.
If the AllowAppend property is FALSE and the ForceTruncate property is set to
FALSE, the existence of the file will cause this failure.
The Export Column Transformation Task is used to extract blob-type data from fields in a database and create files in their original formats to be stored in a file system or viewed by a format viewer, such as Microsoft Word or Microsoft Paint. The trick to understanding the Export Column Transformation is that it requires an input stream field that contains digitized document data, and another field that can be used for a fully qualified path. The Export Column Transformation will convert the digitized data into a physical file on the file system for each row in the input stream using the fully qualified path. In the following example, you’ll use existing data in the AdventureWorksDW database to output some stored images from the database back to file storage. The database has a table named DimProduct that contains a field with an embedded GIF image of each product. Pull these images out of the database, deriving a file path from the product key, and save them into a directory on the file system.
1.
Create a directory with an easy name like c:\ProSSIS\Chapter4\Export that you can use when exporting these pictures.
2.
Create a new SSIS project and package named Export Column Example.dtsx. Add a Data Flow Task to the Control Flow design surface.
3.
On the Data Flow design surface, add an OLE DB Data Source configured to the AdventureWorksDW database table DimProduct.
4.
Add a Derived Column Transformation Task to the Data Flow design surface. Connect the output of the OLE DB data to the task.
5.
Create a Derived Column Name named FilePath. Use the Derived Column setting of <add as new column>. To derive a new filename, just use the primary key for the filename and add your path to it. To do this, set the expression to the following:

"c:\\ProSSIS\\Chapter4\\Export\\" + (DT_WSTR,50)ProductKey + ".gif"
Note ╇ The \\ is required in the expressions editor instead of \ because of its use as an escape sequence.
6.
Add an Export Column Transformation Task to the Data Flow design surface. Connect the output of the Derived Column Task to the Export Column Transformation Task, which will
consume the input stream and separate all the fields into two usable categories: fields that can possibly be in digitized data formats, and fields that can possibly be used as filenames.
7.
Set the Extract Column equal to the [LargePhoto] field, since this contains the embedded GIF image. Set the File Path Column equal to the field name [FilePath]. This field is the one that you derived in the Derived Column Task.
8.
Check the Force Truncate option to rewrite the files if they exist. (This will enable you to run the package again without an error if the files already exist.)
9.
Run the package and check the contents of the directory. You should see a list of image files in primary key sequence.
Fuzzy Lookup If you have done some work in the world of extract, transfer, and load (ETL) processes, then you’ve run into the proverbial crossroads of handling bad data. The test data is staged, but all attempts to retrieve a foreign key from a dimension table result in no matches for a number of rows. This is the crossroads of bad data. At this point, you have a finite set of options. You could create a set of hand-coded complex lookup functions using SQL Sound-Ex, full-text searching, or distance-based word calculation formulas. This strategy is time-consuming to create and test, complicated to implement, and dependent on a given language, and it isn’t always consistent or reusable (not to mention that everyone after you will be scared to alter the code for fear of breaking it). You could just give up and divert the row for manual processing by subject matter experts (that’s a way to make some new friends). You could just add the new data to the lookup tables and retrieve the new keys. If you just add the data, the foreign key retrieval issue is solved, but you could be adding an entry into the dimension table that skews data-mining results downstream. This is what we like to call a lazy-add. This is a descriptive, not a technical, term. A lazy-add would import a misspelled job title like “prasedent” into the dimension table when there is already an entry of “president.” It was added, but it was lazy. The Fuzzy Lookup and Fuzzy Grouping Transformations add one more road to take at the crossroads of bad data. These transformations allow the addition of a step to the process that is easy to use, consistent, scalable, and reusable, and they will reduce your unmatched rows significantly — maybe even altogether. If you’ve already allowed bad data in your dimension tables, or you are just starting a new ETL process, you’ll want to put the Fuzzy Grouping Transformation to work on your data to find data redundancy. This transformation can examine the contents of a suspect field in a staged or committed table and provide possible groupings of similar words based on provided tolerances. This matching information can then be used to clean up that table. Fuzzy Grouping is discussed later in this chapter. If you are correcting data during an ETL process, use the Fuzzy Lookup Transformation — my suggestion is to do so only after attempting to perform a regular lookup on the field. This best practice is recommended because Fuzzy Lookups don’t come cheap. They build specialized indexes of the input stream and the reference data for comparison purposes. You can store them for efficiency, but these indexes can use up some disk space or take up some memory if you choose to rebuild them on each run. Storing matches made by the Fuzzy Lookups over time in a translation or pre-dimension table is a great design. Regular Lookup Transformations can first be run against
this lookup table and then divert only those items in the Data Flow that can’t be matched to a Fuzzy Lookup. This technique uses Lookup Transformations and translation tables to find matches using INNER JOINs. Fuzzy Lookups whittle the remaining unknowns down if similar matches can be found with a high level of confidence. Finally, if your last resort is to have the item diverted to a subject matter expert, you can save that decision into the translation table so that the ETL process can match it next time in the first iteration. Using the Fuzzy Lookup Transformation requires an input stream of at least one field that is a string. Internally, the transformation has to be configured to connect to a reference table that will be used for comparison. The output of this transformation will be a set of columns containing the following:

➤➤ Input and Pass-Through Field Names and Values: This column contains the name and value of the text input provided to the Fuzzy Lookup Transformation or passed through during the lookup.

➤➤ Reference Field Name and Value: This column contains the name and value(s) of the matched results from the reference table.

➤➤ Similarity: This column contains a number between 0 and 1 representing similarity to the matched row and column. Similarity is a threshold that you set when configuring the Fuzzy Lookup Task. The closer this number is to 1, the closer the two text fields must match.

➤➤ Confidence: This column contains a number between 0 and 1 representing confidence of the match relative to the set of matched results. Confidence is different from similarity, because it is not calculated by examining just one word against another but rather by comparing the chosen word match against all the other possible matches. For example, the value of Knight Brian may have a low similarity threshold but a high confidence that it matches to Brian Knight. Confidence gets better the more accurately your reference data represents your subject domain, and it can change based on the sample of the data coming into the ETL process.

The Fuzzy Lookup Transformation Editor has three configuration tabs:

➤➤ Reference Table: This tab (shown in Figure 4-25) sets up the OLE DB Connection to the source of the reference data. The Fuzzy Lookup takes this reference data and builds a token-based index (which is actually a table) out of it before it can begin to compare items. This tab contains the options to save that index or use an existing index from a previous process. There is also an option to maintain the index, which will detect changes from run to run and keep the index current. Note that if you are processing large amounts of potential data, this index table can grow large. There are a few additional settings in this tab that are of interest. The default option to set is the “Generate new index” option. By setting this, a table will be created on the reference table’s Connection Manager each time the transformation is run, and that table will be populated with loads of data as mentioned earlier in this section. The creation and loading of the table can be an expensive process. This table is removed after the transformation is complete.
Figure 4-25
Alternatively, you can select the “Store new index” option, which will instantiate the table and not drop it. You can then reuse that table from other Data Flows, or from Data Flows in other packages, and over multiple days. As you can imagine, by doing this your index table becomes stale soon after its creation. There are stored procedures you can run to refresh it in SQL, or you can click the “Maintain stored index” checkbox to create a trigger on the underlying reference table to automatically maintain the index table. This is available only with SQL Server reference tables, and it may slow down your insert, update, and delete statements to that table.

➤➤ Columns: This tab allows mapping of the one text field in the input stream to the field in the reference table for comparison. Drag and drop a field from the Available Input Columns onto the matching field in the Available Lookup Columns. You can also click the two fields to be compared and right-click to create a relationship. Another neat feature is the capability to add the foreign key of the lookup table to the output stream. To do this, just check that field in the Available Lookup Columns.

➤➤ Advanced: This tab contains the settings that control the fuzzy logic algorithms. You can set the maximum number of matches to output per incoming row. The default is set to 1, which means pull the best record out of the reference table if it meets the similarity threshold. Incrementing this setting higher may generate more results that you’ll have to sift through, but it may be required if there are too many closely matching strings in your domain data. A slider controls the Similarity threshold. It is recommended that you start this setting at 0.71 when experimenting and move up or down as you review the results. This setting is normally determined based on a businessperson’s review of the data, not the developer’s review. If a row cannot be found that’s similar enough, the columns that you checked in the Columns tab will be set to NULL. The token delimiters can also be set if, for example, you don’t want the comparison process to break incoming strings up by a period (.) or spaces. The default for this setting is all common delimiters. Figure 4-26 shows an example of an Advanced tab.
Figure 4-26
It’s important not to use the Fuzzy Lookup as your primary Lookup Transformation because of the performance overhead; the Fuzzy Lookup Transformation is significantly slower than the Lookup Transformation. Always try an exact match using a Lookup Transformation first, and then redirect nonmatches to the Fuzzy Lookup if you need that level of matching. Additionally, the Fuzzy Lookup Transformation requires the BI or Enterprise Edition of SQL Server.
Although this transformation neatly packages some highly complex logic in an easy-to-use component, the results won’t be perfect. You’ll need to spend some time experimenting with the configurable settings and monitoring the results. To that end, the following short example puts the Fuzzy Lookup Transformation to work by setting up a small table of occupation titles that will represent your dimension table. You will then import a set of person records that requires a lookup on the occupation to your dimension table. Not all will match, of course. The Fuzzy Lookup Transformation will be employed to find matches, and you will experiment with the settings to learn about its capabilities.
1.
Use the following data (code file FuzzyExample.txt) for this next example. This file can also be downloaded from www.wrox.com/go/prossis2014 and saved to c:\ProSSIS\Chapter4\FuzzyExample.txt. The data represents employee information that you are going to import. Notice that some of the occupation titles are cut off in the text file because of the positioning within the layout. Also notice that this file has an uneven right margin. Both of these issues are typical ETL situations that are especially painful.

EMPID  TITLE                     LNAME
00001  EXECUTIVE VICE PRESIDEN   WASHINGTON
00002  EXEC VICE PRES            PIZUR
00003  EXECUTIVE VP              BROWN
00005  EXEC VP                   MILLER
00006  EXECUTIVE VICE PRASIDENS  WAMI
00007  FIELDS OPERATION MGR      SKY
00008  FLDS OPS MGR              JEAN
00009  FIELDS OPS MGR            GANDI
00010  FIELDS OPERATIONS MANAG   HINSON
00011  BUSINESS OFFICE MANAGER   BROWN
00012  BUS OFFICE MANAGER        GREEN
00013  BUS OFF MANAGER           GATES
00014  BUS OFF MGR               HALE
00015  BUS OFFICE MNGR           SMITH
00016  BUS OFFICE MGR            AI
00017  X-RAY TECHNOLOGIST        CHIN
00018  XRAY TECHNOLOGIST         ABULA
00019  XRAY TECH                 HOGAN
00020  X-RAY TECH                ROBERSON

2.
Run the following SQL code (code file FuzzyExampleInsert.sql) in AdventureWorksDW or another database of your choice. This code will create your dimension table and add the accepted entries that will be used for reference purposes. Again, this file can be downloaded from www.wrox.com/go/prossis2014:

CREATE TABLE [Occupation](
    [OccupationID] [smallint] IDENTITY(1,1) NOT NULL,
    [OccupationLabel] [varchar](50) NOT NULL,
    CONSTRAINT [PK_Occupation_OccupationID] PRIMARY KEY CLUSTERED
    (
        [OccupationID] ASC
    ) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO [Occupation] Select 'EXEC VICE PRES'
INSERT INTO [Occupation] Select 'FIELDS OPS MGR'
INSERT INTO [Occupation] Select 'BUS OFFICE MGR'
INSERT INTO [Occupation] Select 'X-RAY TECH'
3.
Create a new SSIS package and drop a Data Flow Task on the Control Flow design surface and click the Data Flow tab.
4.
Add a Flat File Connection to the Connection Manager. Name it Extract, and then set the filename to c:\ProSSIS\Chapter4\FuzzyExample.txt. Set the Format property to Delimited, and set the option to pull the column names from the first data row, as shown in Figure 4-27.
Figure 4-27
5.
Click the Columns tab to confirm it is properly configured and showing three columns. Click the Advanced tab and ensure the OutputColumnWidth property for the TITLE field is set to 50 characters in length. Save the connection.
6.
Add a Flat File Source to the Data Flow surface and configure it to use the Extract connection.
7.
Add a Fuzzy Lookup Transformation to the Data Flow design surface. Connect the output of the Flat File Source to the Fuzzy Lookup, and connect the output of the Fuzzy Lookup to the OLE DB Destination.
8.
Open the Fuzzy Lookup Transformation Editor. Set the OLE DB Connection Manager in the Reference tab to use the AdventureWorksDW database connection and the Occupation table. Set up the Columns tab by connecting the input to the reference table columns as in Figure 4-28, dragging the Title column to the OccupationLabel column on the right. Set up the Advanced tab with a Similarity threshold of 50 (0.50).
Figure 4-28
9.
Open the editor for the OLE DB Destination. Set the OLE DB connection to the AdventureWorksDW database. Click New to create a new table to store the results. Change the table name in the DDL statement that is presented to you to create the [FuzzyResults] table. Click the Mappings tab, accept the defaults, and save.

10.
Add a Data Viewer of type grid to the Data Flow between the Fuzzy Lookup and the OLE DB Destination.
Run the package. Your results at the Data View should resemble those in Figure 4-29. Notice that the logic has matched most of the items at a 50 percent similarity threshold — and you have the foreign key OccupationID added to your input for free! Had you used a strict INNER JOIN or Lookup Transformation, you would have made only three matches, a dismal hit ratio. These items can be seen in the Fuzzy Lookup output, where the values are 1 for similarity and confidence. A few of the columns are set to NULL now, because the row like Executive VP wasn’t 50 percent similar to the Exec Vice Pres value. You would typically send those NULL records with a conditional split to a table for manual inspection.
Figure 4-29
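If you want to review the weaker matches outside the Data Viewer, a query such as the following is a reasonable sketch; it assumes the [FuzzyResults] table kept the default _Similarity and _Confidence output column names generated by the Fuzzy Lookup:

    SELECT EMPID, TITLE, OccupationLabel, _Similarity, _Confidence
    FROM dbo.FuzzyResults
    WHERE OccupationLabel IS NULL OR _Similarity < 0.75
    ORDER BY _Similarity;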
Fuzzy Grouping In the previous section, you learned about situations where bad data creeps into your dimension tables. The blame was placed on the “lazy-add” ETL processes that add data to dimension tables to avoid rejecting rows when there are no natural key matches. Processes like these are responsible for state abbreviations like “XX” and entries that look to the human eye like duplicates but are stored as two separate entries. The occupation titles “X-Ray Tech” and “XRay Tech” are good examples of duplicates that humans can recognize but computers have a harder time with. The Fuzzy Grouping Transformation can look through a list of similar text and group the results using the same logic as the Fuzzy Lookup. You can use these groupings in a translation table to clean up source and destination data or to crunch fact tables into more meaningful results without altering the underlying data. The Fuzzy Grouping Transformation also expects an input stream of text, and it requires a connection to an OLE DB Data Source because it creates in that source a set of structures to use during analysis of the input stream. The Fuzzy Grouping Transformation Editor has three configuration tabs:

➤➤ Connection Manager: This tab sets the OLE DB connection that the transform will use to write the storage tables that it needs.

➤➤ Columns: This tab displays the Available Input Columns and allows the selection of any or all input columns for fuzzy grouping analysis. Figure 4-30 shows a completed Columns tab. Each column selected is analyzed and grouped into logical matches, resulting in a new column representing that group match for each data row. Each column can also be selected for Pass-Through — meaning the data is not analyzed, but it is available in the output stream. You can choose the names of any of the output columns: Group Output Alias, Output Alias, Clean Match, and Similarity Alias Score column.
Figure 4-30
The minimum similarity evaluation is available at the column level if you select more than one column. The numerals option (which is not visible in Figure 4-30 but can be found by scrolling to the right) enables configuration of the significance of numbers in the input stream when grouping text logically. The options are leading numbers, trailing numbers, leading and trailing numbers, or neither leading nor trailing numbers. This option needs to be considered when comparing addresses or similar types of information. Comparison flags provide the same options to ignore or pay attention to case, kana type, nonspacing characters, character width, symbols, and punctuation.

➤➤ Advanced: This tab contains the settings controlling the fuzzy logic algorithms that assign groupings to text in the input stream. You can set the names of the three additional fields that are added automatically to the output of this transformation. These fields are named _key_in, _key_out, and _score by default. A slider controls the Similarity threshold. The recommended initial setting for this transformation is 0.5, which can be adjusted up or down as you review the results. The token delimiters can also be set if, for example, you don’t want the comparison process to break incoming strings up by a period (.) or spaces. The default for this setting is all common delimiters. Figure 4-31 shows a completed Advanced tab.
Figure 4-31
Suppose you are tasked with creating a brand-new occupations table using the employee occupations text file you imported in the Fuzzy Lookup example. Using only this data, you need to create a new employee occupations table with occupation titles that can serve as natural keys and that best represent this sample. You can use the Fuzzy Grouping Transformation to develop the groupings for the dimension table, like this:
1.
Create a new SSIS project named Fuzzy Grouping Example. Drop a Data Flow Task on the Control Flow design surface and click the Data Flow tab.
2.
Add a Flat File Connection to the Connection Manager. Name it Extract. Set the filename to c:\ProSSIS\Chapter4\FuzzyExample.txt. (Use the FuzzyExample.txt file from the Fuzzy Lookup example with the same configuration.) Save the connection.
3.
Add a Flat File Source to the Data Flow surface and configure it to use the Extract connection.
4.
Add a Fuzzy Grouping Transformation to the Data Flow design surface. Connect the output of the Flat File Source to the Fuzzy Grouping Transformation.
5.
Open the Fuzzy Grouping Editor and set the OLE DB Connection Manager to a new AdventureWorksDW connection.
6.
In the Columns tab, select the Title column in the Available Input Columns. Accept the other defaults.
7.
In the Advanced tab, set the Similarity threshold to 0.50. This will be your starting point for similarity comparisons.
8.
Add an OLE DB Destination to the Data Flow design surface. Configure the destination to use the AdventureWorksDW database or another database of your choice. For the Name of Table or View, click the New button. Change the name of the table in the CREATE table statement to [FuzzyGroupingResults]. Click the Mappings tab to complete the task and then save it.
9.
Add a Data Viewer in the pipe between the Fuzzy Grouping Transformation and the OLE DB Destination. Set the type to grid so that you can review the data at this point. Run the package. The output shown at various similarity thresholds would look similar to Figure 4-32.
Figure 4-32
Now you can look at these results and see more logical groupings and a few issues even at the lowest level of similarity. The title of “X-Ray Tech” is similar to the title “X-Ray Technologist.” The title “Executive Vice Presiden” isn’t a complete title, and really should be grouped with “Exec VP,” but this is pretty good for about five minutes of work. To build a dimension table from this output, look at the two fields in the Data View named _key_in and _key_out. If these two values match, then the grouped value is the “best” representative candidate for the natural key in a dimension table. Separate the rows in the stream using a Conditional Split Transformation where these two values match, and use an OLE DB Command Transformation to insert the values in the dimension table. Remember that the more data, the better the grouping. The output of the Fuzzy Grouping Transformation is also a good basis for a translation table in your ETL processes. By saving both the original value and the Fuzzy Grouping value — with a little subject matter expert editing — you can use a Lookup Transformation and this table to provide
much better foreign key lookup results. You’ll be able to improve on this with the Slowly Changing Dimension Transformation later in the chapter.
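Once the results land in [FuzzyGroupingResults], the candidate rows described above can be pulled with a query like this sketch; the TITLE_clean alias assumes the default Group Output Alias naming, so adjust it if you renamed the output columns:

    SELECT DISTINCT TITLE_clean
    FROM dbo.FuzzyGroupingResults
    WHERE _key_in = _key_out;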
Import Column The Import Column Transformation is a partner to the Export Column Transformation. These transformations do the work of translating physical files from system file storage paths into database blob-type fields, and vice versa. The trick to understanding the Import Column Transformation is knowing that its input source requires at least one column that is the fully qualified path to the file you are going to store in the database, and you need a destination column name for the output of the resulting blob and file path string. This transformation also has to be configured using the Advanced Editor. The Advanced Editor is not intuitive, or wizard-like in appearance — hence the name “Advanced” (which, incidentally, you will be once you figure it out). In the editor, you won’t be able to merge two incoming column sources into the full file path; therefore, if your source data has the filename separate from the file path, you should use a Derived Column Transformation to concatenate the columns before connecting that stream to the Import Column Transformation. In the following example, you’ll import some images into your AdventureWorksDW database. Create a new SSIS package. Transformations live in the Data Flow tab, so add a Data Flow Task to the Control Flow, and then add an Import Column Transformation to the Data Flow surface. To keep this easy, you will complete the following short tasks:
1.
Find a small GIF file and copy it three times into c:\ProSSIS\Chapter4\import (or just copy and paste the files from the Export Column example). Change the filenames to 1.gif, 2.gif, and 3.gif.
2.
Create a text file with the following content and save it in c:\ProSSIS\Chapter4 as filelist.txt:

C:\ProSSIS\Chapter4\import\1.gif
C:\ProSSIS\Chapter4\import\2.gif
C:\ProSSIS\Chapter4\import\3.gif
3.
Run the following SQL script in AdventureWorksDW to create a storage location for the image files:

CREATE TABLE dbo.tblmyImages
(
    [StoredFilePath] [varchar](50) NOT NULL,
    [ProdPicture] image
)
4.
You are going to use the filelist.txt file as your input stream for the files that you need to load into your database, so add a Flat File Source to your Data Flow surface and configure it to read one column from your filelist.txt flat file. Rename this column ImageFilePath.
Take advantage of the opportunity to open the Advanced Editor on the Flat File Source by clicking the Show Advanced Editor link in the property window or by right-clicking the transformation and selecting Advanced Editor, which looks quite a bit different from the Advanced Editor for a source. Note the difference between this editor and the normal Flat File Editor. The Advanced Editor is
stripped down to the core of the Data Flow component — no custom wizards, just an interface sitting directly over the object properties themselves. It is possible to mess these properties up beyond recognition, but even in the worst case you can just drop and recreate the component. Look particularly at the Input and Output Properties of the Advanced Editor. You didn’t have to use the Advanced Editor to set up the import of the filelist.txt file. However, looking at how the Advanced Editor displays the information will be very helpful when you configure the Import Column Transformation. Notice that you have an External Columns (Input) and Output Columns collection, with one node in each collection named ImageFilePath. This reflects the fact that your connection describes a field called ImageFilePath and that this transformation simply outputs data with the same field name. Connect the Flat File Source to the Import Column Transformation. Open the Advanced Editor for the Import Column Transformation and click the Input Columns tab. The input stream for this task is the output stream for the flat file. Select the one available column, move to the Input and Output Properties tab, and expand these nodes. This time you don’t have much help. An example of this editor is shown in Figure 4-33.
Figure 4-33
The Input Columns collection has a column named ImageFilePath, but there are no output columns. On the Flat File Source, you could ignore some of the inputs. In the Import Column Transformation, all inputs have to be re-output. In fact, if you don’t map an output, you’ll get the following error:
Validation error. Data Flow Task: Import Column [1]: The "input column "ImageFilePath" (164)" references output column ID 0, and that column is not found on the output.
Add an output column by clicking the Output Columns folder icon and click the Add Column button. Name the column myImage. Notice that the DataType property is [DT_IMAGE] by default. That is because this transformation produces image outputs. You can also pass DT_TEXT, DT_NTEXT, or DT_IMAGE types as outputs from this task. Your last task is to connect the input to the output. Note the output column’s ID property for myImage. This ID needs to be updated in the FileDataColumnID property of the input column ImageFilePath. If you fail to link the output column, you’ll get the following error: Validation error. Data Flow Task: Import Column [1]: The "output column "myImage" (207)" is not referenced by any input column. Each output column must be referenced by exactly one input column.
The Advanced Editor for each of the different transformations has a similar layout but may have other properties available. Another property of interest in this task is Expect BOM, which you would set to True if you expect a byte-order mark at the beginning of the file path (not for this example). A completed editor resembles Figure 4-33. Complete this example by adding an OLE DB Destination to the Data Flow design surface. Connect the data from the Import Column Transformation to the OLE DB Destination. Configure the OLE DB Destination to the AdventureWorksDW database and to the tblmyImages structure that was created for database storage. Click the Mappings setting. Notice that you have two available input columns from the Import Column Transformation. One is the full path and the other will be the file as type DT_IMAGE. Connect the input and destination columns to complete the transform. Go ahead and run it. Take a look at the destination table to see the results:

FullFileName              Document
------------------------  -----------------------------------
C:\import\images\1.JPG    0xFFD8FFE120EE45786966000049492A00...
C:\import\images\2.JPG    0xFFD8FFE125FE45786966000049492A00...
C:\import\images\3.JPG    0xFFD8FFE1269B45786966000049492A00...

(3 row(s) affected)
Merge The Merge Transformation can merge data from two paths into a single output. This transformation is useful when you wish to break out your Data Flow into a path that handles certain errors and then merge it back into the main Data Flow downstream after the errors have been handled. It’s also useful if you wish to merge data from two Data Sources. This transformation is similar to the Union All Transformation, but the Merge Transformation has some restrictions that may cause you to lean toward using Union All:

➤➤ The data must be sorted before the Merge Transformation. You can do this by using the Sort Transformation prior to the merge or by specifying an ORDER BY clause in the source connection.

➤➤ The metadata must be the same between both paths. For example, the CustomerID column can’t be a numeric column in one path and a character column in another path.

➤➤ If you have more than two paths, you should choose the Union All Transformation.
To configure the transformation, ensure that the data is sorted exactly the same on both paths and drag the path onto the transform. You’ll be asked if the path you want to merge is Merge Input 1 or 2. If this is the first path you’re connecting to the transformation, select Merge Input 1. Next, connect the second path into the transformation. The transformation will automatically configure itself. Essentially, it maps each of the columns to the column from the other path, and you have the option to ignore a certain column’s data.
Merge Join One of the overriding themes of SSIS is that you shouldn’t have to write any code to create your transformation. This transformation will merge the output of two inputs and perform an INNER or OUTER join on the data. An example of when this would be useful is if you have a front-end web system in one data stream that has a review of a product in it, and you have an inventory product system in another data stream with the product data. You could merge join the two data inputs and output the review and product information into a single path. Note ╇ If both inputs are in the same database, then it would be faster to perform the join in T-SQL at the OLE DB Source level, rather than use a transformation. This transformation is useful when you have two different Data Sources you wish to merge, or when you don’t want to write your own join code.
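To illustrate the alternative mentioned in the note, the same result could come from a single OLE DB Source whose T-SQL already joins the two tables; this sketch uses the AdventureWorks Production tables only as an example:

    SELECT p.ProductID, p.Name, r.ReviewerName, r.Rating
    FROM Production.Product AS p
    INNER JOIN Production.ProductReview AS r
        ON r.ProductID = p.ProductID;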
To configure the Merge Join Transformation, connect your two inputs into the Merge Join Transformation, and then select what represents the right and left join as you connect each input. Open the Merge Join Transformation Editor and verify the linkage between the two tables. You can see an example of this in Figure 4-34. You can right-click the arrow to delete a linkage or drag a column from the left input onto the right input to create a new linkage if one is missing. Lastly, check each of the columns you want to be passed as output to the path and select the type of join you wish to make (LEFT, INNER, or FULL).
Multicast The Multicast Transformation, as the name implies, can send a single data input to multiple output paths easily. You may want to use this transformation to send a path to multiple destinations sliced in different ways. To configure this transformation, simply connect it to your input, and then drag the output path from the Multicast Transformation onto your next destination or transformation. After you connect the Multicast Transformation to your first destination or transformation, you can keep connecting it to other transformations or destinations. There is nothing to configure in the Multicast Transformation Editor other than the names of the outputs. Note ╇ The Multicast Transformation is similar to the Conditional Split Transformation in that both transformations send data to multiple outputs. The Multicast will send all the rows down every output path, whereas the Conditional Split will conditionally send each row down exactly one output path.
Figure 4-34
OLE DB Command The OLE DB Command Transformation is a component designed to execute a SQL statement for each row in an input stream. This task is analogous to an ADO Command object being created, prepared, and executed for each row of a result set. The input stream provides the data for parameters that can be set into the SQL statement, which is either an in-line statement or a stored procedure call. If you’re like us, just hearing the words “for each row” in the context of SQL makes us think of two other words: performance degradation. This involves firing an update, insert, or delete statement, prepared or unprepared some unknown number of times. This doesn’t mean there are no good reasons to use this transformation — you’ll actually be doing a few in this chapter. Just understand the impact and think about your use of this transformation. Pay specific attention to the volume of input rows that will be fed into it. Weigh the performance and scalability aspects during your design phases against a solution that would cache the stream into a temporary table and use set-based logic instead. To use the OLE DB Command Transformation, you basically need to determine how to set up the connection where the SQL statement will be run, provide the SQL statement to be executed, and configure the mapping of any parameters in the input stream to the SQL statement. Take a look at the settings for the OLE DB Command Transformation by opening its editor. The OLE DB
Command Transformation is another component that uses the Advanced Editor. There are four tabs in the editor: ➤➤
Connection Manager: Allows the selection of an OLE DB Connection. This connection is where the SQL statement will be executed. This doesn’t have to be the same connection that is used to provide the input stream.
➤➤
Component Properties: Here you can set the SQL statement to be executed in the SQLCommand property and the timeout, in seconds, in the CommandTimeout property. The timeout works the same way as it does on the ADO Command object; a CommandTimeout value of 0 indicates no timeout. You can also name the task and provide a description on this tab. (A sample parameterized statement appears after this list.)
➤➤
Column Mappings: This tab displays columns available in the input stream and the destination columns, which will be the parameters available in the SQL command. You can map the columns by clicking a column in the input columns and dragging it onto the matching destination parameter. It is a one-to-one mapping, so if you need to use a value for two parameters, you need to use a Derived Column Transformation to duplicate the column in the input stream prior to configuring the columns in this transformation.
➤➤
Input and Output Properties: Most of the time you’ll be able to map your parameters in the Column Mappings tab. However, if the OLE DB provider doesn’t provide support for deriving parameter information (parameter refreshing), you have to come here to manually set up your output columns using specific parameter names and DBParamInfoFlags.
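To make the parameter mapping concrete, here is a hypothetical SQLCommand value for the staging scenario described next; the table and column names are illustrative, and each ? placeholder surfaces as a destination parameter (Param_0, Param_1, and so on) on the Column Mappings tab.

-- Hypothetical SQLCommand for the OLE DB Command Transformation.
-- Each ? is a positional parameter mapped from an input column.
UPDATE CDCTransactionHistory
SET    ActualCost = ?        -- Param_0, mapped from the ActualCost input column
WHERE  TransactionID = ?;    -- Param_1, mapped from the TransactionID input column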
This transformation should be avoided whenever possible. It's a better practice to land the data into a staging table using an OLE DB Destination and then perform the update as a set-based process in the Control Flow with an Execute SQL Task. The Execute SQL Task's statement would look something like the following if you loaded a table called stg_TransactionHistoryUpdate and were trying to do a bulk update:

BEGIN TRAN;

UPDATE CDC
SET    TransactionDate = b.TransactionDate,
       ActualCost = b.ActualCost
FROM   CDCTransactionHistory AS CDC
       INNER JOIN [stg_TransactionHistoryUpdate] AS b
           ON CDC.TransactionID = b.TransactionID;

TRUNCATE TABLE [stg_TransactionHistoryUpdate];

COMMIT;
If you have 2,000 rows running through the transformation, the stored procedure or command will be executed 2,000 times. It might be more efficient to process these transactions in a SQL batch, but then you would have to stage the data and code the batch transaction. The main problem with this transformation is performance.
Percentage and Row Sampling The Percentage Sampling and Row Sampling Transformations enable you to take the data from the source and randomly select a subset of data. The transformation produces two outputs that you can select. One output is the data that was randomly selected, and the other is the data that was not
selected. You can use this to send a subset of data to a development or test server. The most useful application of this transformation is to train a data-mining model: you can use one output path to train the model and the other to validate it. To configure the transformation, select the percentage or number of rows you wish to be sampled. As you can guess, the Percentage Sampling Transformation enables you to select the percentage of rows, and the Row Sampling Transformation enables you to specify how many rows you wish to output randomly. Next, you can optionally name each of the outputs from the transformation. The last option is to specify the seed that randomizes the data. If you specify a seed and run the transformation multiple times, the same data will be output to the destination. If you leave this option unchecked, which is the default, a new seed is generated automatically at runtime, and you will see different data each time.
Pivot Transform Do you ever get the feeling that pivot tables are the modern-day Rosetta Stone for translating data to your business owners? You store it relationally, but they ask for it in a format that requires you to write a complex CASE statement to generate. Well, not anymore. Now you can use an SSIS transformation to generate the results. A pivot table is a result of cross-tabulated columns generated by summarizing data from a row format. Typically, a Pivot Transformation is configured using the following input column roles: ➤➤
Pivot Key: The pivot key is the element of input data to "pivot." The word "pivot" is another way of saying "to create a column for each unique instance of." However, the number of distinct values must be kept under control. Think about creating columns in a table: you wouldn't create 1,000 uniquely named columns, so for best results when choosing a data element to pivot, pick an element that can be run through a GROUP BY statement to generate 15 or fewer columns. If you are dealing with dates, use something like the DATENAME function to roll them up to a coarser grain, such as the month of the year.
➤➤
Set Key: The set key defines the rows of the pivot output; it creates one column and places all the unique values for all rows into this column, one row per unique value. Just like in any GROUP BY statement, some of the data is needed to define the group (the row), whereas other data is just along for the ride.
➤➤
Pivot Value: These columns hold the aggregated data that fills the matrix between the row (set key) columns and the pivot columns. (A rough set-based sketch of these three roles follows this list.)
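As a rough set-based sketch of these three roles (not the component's internal implementation), the T-SQL below produces the same kind of cross-tab against the AdventureWorks tables used in the upcoming example:

-- TransMonth plays the Pivot Key, ProductName the Set Key,
-- and Quantity the Pivot Value.
SELECT ProductName, [September], [October], [November], [December]
FROM   (SELECT p.[Name] AS ProductName,
               DATENAME(mm, t.TransactionDate) AS TransMonth,
               t.Quantity
        FROM   Production.Product p
               INNER JOIN Production.TransactionHistory t
                   ON t.ProductID = p.ProductID) AS src
PIVOT  (SUM(Quantity) FOR TransMonth
        IN ([September], [October], [November], [December])) AS pvt;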
The Pivot Transformation can accept an input stream, use your definitions of the preceding columns, and generate a pivot table output. It helps if you are familiar with your input needs and format your data prior to this transformation. Aggregate the data using GROUP BY statements. Pay special attention to sorting by row columns — this can significantly alter your results. To set your expectations properly, you have to define each of your literal pivot columns. A common misconception, and source of confusion, is approaching the Pivot Transformation with the idea that you can simply set the pivot column to pivot by the month of the purchase date column, and the transformation should automatically build 12 pivot columns with the month of the year for you. It will not. It is your task to create an output column for each month of the year. If you are using colors as your pivot column, you need to add an output column for every possible color. For example, if columns are set up for blue, green, and yellow, and the color red appears in the input
source, then the Pivot Transformation will fail. Therefore, plan ahead and know the possible pivots that can result from your choice of a pivot column, or provide an error output for data that doesn't match your expected pivot values. In this example, you'll use some of the AdventureWorks product and transactional history to generate a quick pivot table to show product quantities sold by month. This is a typical upper-management request, and you can cover all the options with this example. AdventureWorks management wants a listing of each product with the total quantity of transactions by month for the year 2003. First identify the pivot column. The month of the year looks like the data that is driving the creation of the pivot columns. The row data columns will be the product name and the product number. The value field will be the total quantity for the product in a matrix by month. Now you are ready to set up the Pivot Transformation:
1.
Create a new SSIS project named Pivot Example. Add a Data Flow Task to the Control Flow design surface.
2.
Add an OLE DB Source to the Data Flow design surface. Configure the connection to the AdventureWorks database. Set the Data Access Mode to SQL Command. Add the following SQL statement (code file PivotExample.sql) to the SQL Command text box: SELECT p.[Name] as ProductName, p.ProductNumber, datename(mm, t.TransactionDate) as TransMonth, sum(t.quantity) as TotQuantity FROM production.product p INNER JOIN production.transactionhistory t ON t.productid = p.productid WHERE t.transactiondate between '01/01/03' and '12/31/03' GROUP BY p.[name], p.productnumber, datename(mm,t.transactiondate) ORDER BY productname, datename(mm, t.transactiondate)
3.
Add the Pivot Transformation and connect the output of the OLE DB Source to the input of the transformation. Open the transformation to edit it.
4.
Select TransMonth for the Pivot Key. This is the column whose values become the new output columns. Change the Set Key property to ProductName. This is the column that will show on the rows, and your earlier query must be sorted by this column. Lastly, type the values [December],[November],[October],[September] in the "Generate pivot output columns from values" area and check the Ignore option above this text box. Once complete, click the Generate Columns Now button. The final screen looks like Figure 4-35. Note: The output columns are generated in exactly the same order that they appear in the output columns collection.
5.
To finish the example, add an OLE DB Destination. Configure to the AdventureWorks connection. Connect the Pivot Default Output to the input of the OLE DB Destination. Click the New button to alter the CREATE TABLE statement to build a table named PivotTable.
Figure 4-35
6.
Add a Data Viewer in the path between the Pivot Transformation and the OLE DB Destination and run the package. You'll see the data in a pivot table in the Data Viewer, as shown in the partial results in Figure 4-36.
Figure 4-36
Unpivot As you know, mainframe screens rarely conform to any normalized form. For example, a screen may show a Bill To Customer, a Ship To Customer, and a Dedicated To Customer field. Typically, the Data Source would store these three fields as three columns in a file (such as a Virtual Storage Access Method, or VSAM, file). Therefore, when you receive an extract from the mainframe you may have three columns, as shown in Figure 4-37.
Figure 4-37
Your goal is to load this file into a Customer table in SQL Server. You want a row for each customer in each column, for a total of six rows in the Customer table, as shown in the CustomerName and OrderID columns in Figure 4-38. The Unpivot Transformation is a way to accomplish this business requirement. In this example, you'll see how to use the Unpivot Transformation to create rows in the Data Flow from columns; it is essentially the opposite of the Pivot Transformation.
Figure 4-38
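If the same extract were first staged in a relational table, the equivalent set-based operation would look something like the following sketch; the staging table and column names are hypothetical, so adjust them to match your file layout.

-- Rough T-SQL equivalent of the Unpivot step, assuming a staging table
-- with one row per order and one column per customer role.
SELECT OrderID,
       CustomerName,
       OriginalColumn
FROM   (SELECT OrderID, BillToName, ShipToName
        FROM   dbo.stg_MainframeExtract) AS src
UNPIVOT (CustomerName FOR OriginalColumn
         IN (BillToName, ShipToName)) AS u;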
Your first step is to create a new package and drag a new Data Flow Task onto the Control Flow. From the Data Flow tab, configure the task. For this example, create a Flat File Connection Manager that points to UnPivotExample.csv, which looks like Figure 4-39 and can be downloaded from www.wrox.com/go/prossis2014. Name the Connection Manager FF Extract, and make the first row a header row. The file is comma-delimited, so you will want to specify the delimiter on the Columns page.
Figure 4-39
Once the Connection Manager is created, add a new Flat File Source and rename it "Mainframe Data." Point the connection to the FF Extract Connection Manager. Ensure that all the columns are checked in the Columns page on the source and click OK to go back to the Data Flow. The next step is the most important step. You need to unpivot the data and make each column into a row in the Data Flow. You can do this by dragging an Unpivot Transformation onto the Data Flow and connecting it to the source. In this example, you want to unpivot the BillTo and ShipTo columns, and the OrderID column will just be passed through for each row. To do this, check each column you wish to unpivot, as shown in Figure 4-40, and check Pass Through for the OrderID column. As you check each column that you wish to unpivot, the column will be added to the grid below (shown in Figure 4-40). You'll then need to type CustomerName for the Destination Column property for each row in the grid. This will write the data from each of the two columns into a single column called CustomerName. Optionally, you can also type Original Column for the Pivot Key Column Name property. By doing this, each row that's written by the transformation will have an additional column called Original Column. This new column will state where the data came from. The Unpivot Transformation will take care of columns that have NULL values. For example, if your ShipTo column for OrderID 1 had a NULL value, that column would not be written as a row. You may wish to handle empty string values though, which will create blank rows in the Data Flow. To throw these records out, you can use a Conditional Split Transformation. In this transformation, you can create one condition for your good data that you wish to keep with the following code, which accepts only rows with actual data: ISNULL(CustomerName) == FALSE && TRIM(CustomerName) != ""
Figure 4-40
The default (else) condition handles empty strings and NULL customers and in this example is called NULL Customer. After this, you’re ready to send the data to the destination of your choice. The simplest example is to send the data to a new SQL Server table in the AdventureWorks database. Execute the package. You’ll see that the Valid Customer output goes to the customer table, and the NULL data condition is just thrown out. You could also place a data viewer prior to the OLE DB Destination to see the data interactively.
Term Extraction If you have ever done some word and phrase analysis on websites for better search engine placement, you are familiar with the job that this transformation performs. The Term Extraction Transformation is a tool to mine free-flowing text for English word and phrase frequency. You can feed any text-based input stream into the transformation and it will output two columns: a text phrase and a statistical value for the phrase relative to the total input stream. The statistical values or scores that can be calculated can be as simple as a count of the frequency of the words and phrases, or they can be a little more complicated, such as the result of a formula named the TFIDF
score. The TFIDF acronym stands for Term Frequency and Inverse Document Frequency, and it is a formula designed to balance the frequency of the distinct words and phrases relative to the total text sampled. If you're interested, here's the formula:
TFIDF (of term or phrase) = (frequency of term) * log((# rows in sample) / (# rows with term or phrase))
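As a quick worked illustration (assuming the natural logarithm; the component's exact scoring internals are not documented), a term that appears 4 times across 4 of the 11 sample notes used below would score roughly 4.05:

-- T-SQL LOG() is the natural logarithm, used here only to illustrate the formula.
SELECT 4 * LOG(11.0 / 4) AS ApproxTFIDF;   -- returns approximately 4.05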
The results generated by the Term Extraction Transformation are based on internal algorithms and statistical models that are encapsulated in the component. You can't alter or gain any insight into this logic by examining the code. However, some of the core rules about how the logic breaks apart the text to determine word and phrase boundaries are documented in Books Online. What you can do is tweak some external settings and make adjustments to the extraction behavior by examining the resulting output. Because text extraction is domain-specific, the transformation also provides the capability to store terms and phrases that you have predetermined are noisy or insignificant in your final results. You can then automatically remove these items from future extractions. Within just a few testing iterations, you can have the transformation producing meaningful results. Before you write this transformation off as a cool utility that you'll never use, consider this: How useful would it be to query into something like a customer service memo field stored in your data warehouse and generate some statistics about the comments being made? This is the type of usage for which the Term Extraction Transformation is perfectly suited. The trick to understanding how to use the component is to remember that it has one input. That input must be either a null-terminated Unicode string (DT_WSTR) or Unicode text (DT_NTEXT). If your input stream is not one of these two types, you can use the Data Conversion Transformation to convert it. Because this transformation can best be learned by playing around with all the settings, the next example puts this transformation to work doing exactly what was proposed earlier: mining some customer service memo fields. Assume you have a set of comment fields from a customer service database for an appliance manufacturer. In this field, the customer service representative records a note that summarizes his or her contact with the customer. For simplicity's sake, you'll create these comment fields in a text file and analyze them in the Term Extraction Transformation.
1.
Create the customer service text file using the following text (you can download the code file custsvc.txt from www.wrox.com/go/prossis2014). Save it as c:\ProSSIS\ Chapter4\custsvc.txt. Ice maker in freezer stopped working model XX-YY3 Door to refrigerator is coming off model XX-1 Ice maker is making a funny noise XX-YY3 Handle on fridge falling off model XX-Z1 Freezer is not getting cold enough XX-1 Ice maker grinding sound fridge XX-YY3 Customer asking how to get the ice maker to work model XX-YY3 Customer complaining about dent in side panel model XX-Z1 Dent in model XX-Z1 Customer wants to exchange model XX-Z1 because of dent in door Handle is wiggling model XX-Z1
2.
Create a new SSIS package named TermExtractionExample. Add a Data Flow Task to the Control Flow design surface.
3.
Create a Flat File connection to c:\ProSSIS\Chapter4\custsvc.txt. Uncheck "Column names in first data row". Change the output column name in the Advanced tab to CustSvcNote. Change OutputColumnWidth to 100 to account for the length of the field. Change the data type to DT_WSTR.
4.
Add a Flat File Source to the Data Flow design surface. Configure the source to use the Flat File connection.
5.
Add a Term Extraction Transformation to the Data Flow design surface. Connect the output of the Flat File Source to its input. Open the Term Extraction Transformation Editor. Figure 4-41 shows the available input columns from the input stream and the two default-named output columns. You can change the named output columns if you wish. Only one input column can be chosen. Click the CustSvcNote column, because you defined it as a Unicode string (DT_WSTR) in the connection. If you select a column that is not a Unicode string, you'll see a validation error like the following: The input column can only have DT_WSTR or DT_NTEXT as its data type.
Figure 4-41
6.
Even though we’re not going to set these tabs, the Exclusion tab enables you to specify noise words for the Term Extraction to ignore. The Advanced tab enables you to control
how many times a word must appear before it is included in the output. Close the Term Extraction Transformation Editor. Ignore the cautionary warnings about rows sent to error outputs; you didn't configure an error location where bad rows should be saved, but that isn't necessary for this example.
7.
Add an OLE DB Destination to the Data Flow. Connect the output of the Term Extraction Task to the OLE DB Destination. Configure the OLE DB Destination to use your AdventureWorks connection.
8.
Click the New button to configure the Name of Table or View property. A window will come up with a CREATE TABLE DDL statement. Notice that the data types are a Unicode text field and a double. Alter the statement to read as follows: CREATE TABLE TermResults ( [Term] NVARCHAR(128), [Score] DOUBLE PRECISION )
9.
When you click OK, the new table TermResults will be created in the AdventureWorks database. Click the Mappings tab to confirm the mapping between the Term Extraction outputs of Term and Score to the table TermResults.
10.
Add a data viewer by right-clicking the Data Flow between the Term Extraction Transformation and the OLE DB Destination. Set the type to grid and accept the defaults.
11.
Run the package.
The package will stop on the data viewer that is shown in Figure 4-42 to enable you to view the results of the Term Extraction Transformation. You should see a list of terms and an associated score for each word. Because you accepted all of the Term Extraction settings, the default score is a simple count of frequency. Stop the package, open the Term Extraction Transformation Editor, and view the Advanced tab.
Figure 4-42
The Advanced tab, which allows for some configuration of the task, is divided into four categories: ➤➤
Term Type: Settings that control how the input stream should be broken into bits called tokens. The Noun Term Type focuses the transformation on nouns only, Noun Phrases extracts noun phrases, and Noun and Noun Phrases extracts both.
➤➤
Score Type: Choose to analyze words either by frequency or by a weighted frequency.
➤➤
Parameters: Frequency threshold is the minimum number of times a word or phrase must appear in tokens. Maximum length of term is the maximum number of words that should be combined together for evaluation.
➤➤
Options: Check this option to consider case sensitivity or leave it unchecked to disregard.
This is where the work really starts. How you set up the transformation greatly affects the results you'll see. Figure 4-43 shows an example of the results using each of the different Term Type settings combined with the different score types (frequency and TFIDF). Currently, using a combination of these statistics, you can report that customer service is logging a high percentage of calls concerning the terms "model," "model XX-Z1," "model XX-YY3," "ice maker," "dent," and "customer." From this, one can assume that there may be some issues with models XX-Z1 and XX-YY3 that your client needs to look into.
Figure 4-43
In evaluating this data, you might determine that over time some words are just not relevant to the analysis. In this example, the words “model” and “customer” serve no purpose and only dampen the scores for other words. To remove these words from your analysis, take advantage of the exclusion features in the Term Extraction Transformation by adding these words to a table. To really make sense of that word list, you need to add some human intervention and the next transformation — Term Lookup.
Term Lookup The Term Lookup Transformation uses the same algorithms and statistical models as the Term Extraction Transformation to break up an incoming stream into noun or noun phrase tokens, but it is designed to compare those tokens to a stored word list and output a matching list of terms and phrases with simple frequency counts. Now a strategy for working with both term-based transformations should become clear. Periodically use the Term Extraction Transformation to mine the text data and generate lists of statistical phrases. Store these phrases in a word list, along with phrases that you think the term extraction process should identify. Remove any phrases that you don't want identified. Use the Term Lookup Transformation to reprocess the text input to generate your final statistics. This way, you are generating statistics on known phrases of importance. A real-world application of this would be to pull out all the customer service notes that had a given set of terms or that mention a competitor's name. You can build on the results from the Term Extraction example by adding the word "model" to the [TermExclusions] table so it is ignored in future Term Extractions. You would then want to review all the terms stored in the [TermResults] table, sort them out, remove the duplicates, and add back terms that make sense to your subject matter experts reading the text. Because you want to generate some statistics about which model numbers are causing customer service calls but you don't want to restrict your extractions to only occurrences of the model number in conjunction with the word "model," remove phrases combining the word "model" and the model number. The final [TermResults] table looks like a dictionary, resembling something like the following:
term
---------
dent
door
freezer
ice
ice maker
maker
XX-1
XX-YY3
XX-Z1
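A hedged sketch of that cleanup in T-SQL follows; it assumes the TermResults table created earlier and a TermExclusions table with a single Term column, which is an assumption about your exclusion table's shape rather than something defined in this example.

-- Exclude noise words from future extractions (hypothetical exclusion table shape).
INSERT INTO TermExclusions (Term)
VALUES (N'model'), (N'customer');

-- Trim the dictionary that Term Lookup will use.
DELETE FROM TermResults
WHERE  Term IN (N'model', N'customer')
       OR Term LIKE N'model %';   -- drops combined phrases such as "model XX-Z1"

SELECT DISTINCT Term
FROM   TermResults
ORDER  BY Term;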
Using a copy of the package you built in the Extraction example, exchange the Term Extraction Transformation for a Term Lookup Transformation and change the OLE DB Destination to output to a table [TermReport]. Open the Term Lookup Transformation Editor. It should look similar to Figure 4-44. In the Reference Table tab, change the Reference Table Name option to TermResults. In the Term Lookup tab, map the ConvCustSvcNote column to the Term column on the right. Check ConvCustSvcNote as a pass-through column. Three basic tabs are used to set up this task (in the Term Lookup Transformation Editor): ➤➤
Reference Table: This is where you configure the connection to the reference table that the Term Lookup Transformation uses to validate each tokenized term it finds in the input stream.
➤➤
Term Lookup: After selecting the lookup table, you map the field from the input stream to the reference table for matching.
➤➤
Advanced: This tab has one setting to check whether the matching is case sensitive.
Figure 4-44
The result of running this package is a list of phrases that you are expecting from your stored word list. A sample of the first six rows is displayed in the following code. Notice that this result set doesn’t summarize the findings. You are just given a blow-by-blow report on the number of terms
in the word list that were found for each row of the customer service notes. In this text sample, it is just a coincidence that each term appears only once in each note.

term        Frequency  ConvCustSvcNote
----------  ---------  --------------------------------------------------
freezer     1          ice maker in freezer stopped working model XX-YY3
ice maker   1          ice maker in freezer stopped working model XX-YY3
XX-YY3      1          ice maker in freezer stopped working model XX-YY3
Door        1          door to refrigerator is coming off model XX-1
XX-1        1          door to refrigerator is coming off model XX-1
ice maker   1          ice maker is making a funny noise XX-YY3
(Only first six rows of resultset are displayed)
To complete the report, you could add an Aggregate Transformation between the Term Lookup Transformation and the OLE DB Destination. Set up the Aggregate Transformation to ignore the ConvCustSvcNote column, group by the Term column, and summarize the Frequency Column. Connect the Aggregate Transformation to the OLE DB Destination and remap the columns in the OLE DB Destination. Although this is a very rudimentary example, you can start to see the possibilities of using SSIS for very raw and unstructured Data Sources like this customer service comment data. In a short period of time, you have pulled some meaningful results from the data. Already you can provide the intelligence that model XX-Z1 is generating 45 percent of your sample calls and that 36 percent of your customer calls are related to the ice maker. Pretty cool results from what is considered unstructured data. This transformation is often used for advanced text mining.
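If you prefer to verify the summary outside the package, a rough T-SQL equivalent of what that Aggregate Transformation computes, run against the detail rows landed in TermReport, would be:

-- Assumes TermReport holds the un-aggregated Term Lookup output
-- (Term, Frequency, ConvCustSvcNote).
SELECT Term,
       SUM(Frequency) AS TotalFrequency
FROM   TermReport
GROUP  BY Term
ORDER  BY TotalFrequency DESC;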
Data Flow Example Now you can practice what you have learned in this chapter, pulling together some of the transformations and connections to create a small ETL process. This process will pull transactional data from the AdventureWorks database and then massage the data by aggregating, sorting, and calculating new columns. This extract may be used by another vendor or an internal organization. Please note that this example uses a Sort and Aggregate Transformation. In reality it would be better to use T-SQL to replace that functionality.
1.
Create a new package and rename it AdventureWorksExtract.dtsx. Start by dragging a Data Flow Task onto the Control Flow. Double-click the task to open the Data Flow tab.
2.
In the Data Flow tab, drag an OLE DB Source onto the design pane. Right-click the source and rename it TransactionHistory. Double-click it to open the editor. Click the New button next to the OLE DB Connection Manager dropdown box. The connection to the AdventureWorks database may already be in the Data Connections list on the left. If it is, select it and click OK. Otherwise, click New to add a new connection to the AdventureWorks database on any server.
3.
When you click OK, you’ll be taken back to the OLE DB Source Editor. Ensure that the Data Access Mode option is set to SQL Command. Type the following query for the command, as shown in Figure 4-45 or as follows: SELECT ProductID, Quantity, ActualCost from Production.TransactionHistoryArchive
Figure 4-45
4.
Drag a Derived Column Transformation onto the Data Flow, right-click it, and select Rename. Rename the transform Calculate Total Cost. Click the TransactionHistory OLE DB Source and drag the blue arrow (the data path) onto the Derived Column Transformation.
5.
Double-click the Derived Column Transformation to open the editor (shown in Figure 4-46). For the Expression column, type the following code or drag and drop the column names from the upper-left box: [Quantity] * [ActualCost]. The Derived Column dropdown should have <add as new column> selected; type TotalCost for the Derived Column Name option. Click OK to exit the editor.
6.
Drag an Aggregate Transformation onto the Data Flow and rename it Aggregate Data. Drag the blue arrow from the Derived Column Transformation onto this transformation. Double-click the Aggregate Transformation to open its editor (shown in Figure 4-47). Select the ProductID column and note that it is transposed into the bottom section. The ProductID column should have Group By for the Operation column. Next, check the Quantity and TotalCost columns and set the operation of both of these columns to Sum. Click OK to exit the editor.
Figure 4-46
7.
Drag a Sort Transformation onto the Data Flow and rename it Sort by Quantity. Connect the Aggregate Transformation to this transformation by the blue arrow as in the preceding step. Double-click the Sort Transformation to configure it in the editor. You can sort by the most popular products by checking the Quantity column and selecting Descending for the Sort Type dropdown box. Click OK to exit the editor.
8.
You have now done enough massaging of the data and are ready to export the data to a flat file that can be consumed by another vendor. Drag a Flat File Destination onto the Data Flow. Connect it to the Sort Transformation by using the blue arrow as shown in the last few steps. Rename the Flat File Destination Vendor Extract.
9.
Double-click the destination to open the Flat File Destination Editor. You’re going to output the data to a new Connection Manager, so click New. When prompted for the Flat File Format, select Delimited. Name the Connection Manager Vendor Extract also, and type whatever description you’d like. If you have the directory, point the File Name option to C:\ProSSIS\Chapter4\VendorExtract.csv (make sure this directory is created before proceeding). Check the “Column names in the first data row” option. Your final screen should look like Figure 4-48. Click OK to go back to the Flat File Destination Editor.
Figure 4-47
Figure 4-48
10.
From the Mappings page, ensure that each column in the Inputs table is mapped to the Destination table. Click OK to exit the editor and go back to the Data Flow.
Now your first larger ETL package is complete! This package is very typical of what you'll be doing daily inside of SSIS, and you will see this expanded on greatly in Chapter 8. Execute the package. You should see the rows flow through the Data Flow, as shown in Figure 4-49. Note that as the data flows from transformation to transformation, you can see how many records were passed through.
Figure 4-49
Summary The SSIS Data Flow moves data from a variety of sources and then transforms the data in memory prior to landing it in a destination. Because the work is done in memory, it is generally much faster than loading the data into a table and performing a series of T-SQL transformations. In this chapter you learned about the common sources, destinations, and transformations used in the SSIS Data Flow. In the next chapter, you learn how to make SSIS dynamic by using variables and expressions.
5
Using Variables, Parameters, and Expressions What’s in This Chapter? ➤➤
Reviewing variables, parameters, and expressions
➤➤
Using data types for variables and parameters
➤➤
Creating variables and parameters
➤➤
Expression syntax and usage
➤➤
Expression examples
Wrox.com Downloads for This Chapter
You can find the wrox.com code downloads for this chapter at www.wrox.com/go/prossis2014 on the Download Code tab. If you have used SSIS packages for any involved ETL processes, you have inevitably encountered the need for dynamic capabilities. A dynamic package can reconfigure itself at runtime to do things like run certain steps conditionally, create a series of auto-generated filenames for export, or retrieve and set send-to addresses on an alert e-mail from a data table. Because dynamic changes are fairly common occurrences, developers and architects turn to expressions as they begin rolling out SSIS projects in their development shops. This chapter attempts to provide a solid information base to get you up to speed on expressions. Here we will consolidate the common questions, answers, and best practices about expressions that we've heard and explained since the first release of SSIS. The good news is that expressions are easy to use and impressively powerful. As you read this chapter, you will not only gain an understanding about how expressions work but also gain some insight into how you can use variables and parameters to set up expressions on your current SSIS project.
Dynamic Package Objects SSIS includes multiple objects that can be used to create dynamic packages. Figure 5-1 provides a graphical description of this model in SSIS. SSIS can dynamically set a property in a task or component using an expression, which can be built using a set of building blocks, including variables, parameters, functions, literals, and more. When the package is run, the expression is evaluated and used in the property as each task or component is accessed.
Figure 5-1: Expressions are built using a set of building blocks, including variables, parameters, functions, literals, and more.
Variable Overview
Variables are a key feature in the SSIS package development process. This object contains a value that can be hardcoded, dynamically set once, or modified multiple times throughout the execution of the package. Principally, variables provide a method for objects in a package to communicate with each other. Similar to their use in programming languages, variables are used in specific types of logic, such as iterating through a loop, concatenating multiple values together to create a file directory, or passing an array between objects.
Parameter Overview Parameters were first introduced in SQL Server 2012. While similar to a variable in that a parameter can store information and be used in an expression, it has a few different properties and uses that you will want to understand. As demonstrated in Figure 5-2, parameters are set externally. The parameter can then be used in an expression to affect different properties in the package. SSIS uses two types of parameters: project parameters and package parameters. Project parameters are created at the project level and can be used in all packages that are included in that project. On the other hand, package parameters are created at the package level and can be used only in that package. Project parameters are best used for values that are shared among packages, such as e-mail addresses for error messages. Package parameters are best used for values specific to that package, such as directory locations. When using the project deployment model (discussed in depth in Chapter 22), parameters are the best choice to replace package configurations to create a dynamic and more flexible SSIS solution. Using the Required property of the parameter, you can also require the caller of the package to pass in a value for the parameter. If you want to set the value of a property from outside the package, whether required or not, parameters are the object to use. On the other hand, if you want to create or store values only within a package, variables are the object to use.
Figure 5-2: A package parameter (such as $MyParameter) can feed tasks and transformations throughout the package, including the Script Task, Execute SQL Task, Foreach Loop Container, Conditional Split, Script Component, and Derived Column.
Expression Overview Expressions are the key to understanding how to create dynamic packages in SSIS. One way to think about expressions is to compare them to the familiar model of a spreadsheet cell in a program like Microsoft Excel. A spreadsheet cell can hold a literal value, a reference to another cell on the spreadsheet, or functions ranging from simple to complex arrangements. In each instance, the result is a resolved value displayed in the cell. Figure 5-3 shows these same capabilities of the expression, which can hold literal values, identifiers available to the operation (references to variables or columns), or functions (built-in or user-defined). The difference in the SSIS world is that these values can be substituted directly into properties of the package model, providing powerful and dynamic workflow and operational functionalities.
Figure 5-3: Similarity of expressions to Microsoft Excel cells; a cell, like an expression, can hold a literal, an identifier, a built-in function, or a user-defined function.
Starting with SQL Server 2012, it is easy to see when an expression has been set on a property within an object. Expression adorners are special icons that are placed on top of the object icon if the object has an expression. This indicator makes it easier to understand why the package seems to be doing something behind the scenes that you weren’t expecting!
If you understand these visual analogies explaining how expressions, parameters, and variables fit into the SSIS picture, then you are almost ready to dig into the details of how to build an expression. First, however, it's time to take a look at some of the details about data types, variables, and parameters that cause many of the issues for SSIS package developers.
Understanding Data Types In SSIS, you must pay attention to data types, whether the data is coming from your Data Flow, is stored in variables, or is included in expressions. Failure to do so will cause a lot of frustration because the syntax checker will complain about incompatible data types when you are building expressions. If your Data Flow contains incompatible data types, your packages will raise either warnings or errors (if implicit conversions are made). This will happen even if the conversion is between Unicode and non-Unicode character sets. Comparison operations also are subject to either hard or soft errors during implicit conversion. Bad data type decisions can have a serious impact on performance. This seemingly simple topic causes significant grief for SSIS developers who haven’t yet learned the specifics of data types and how they are converted. The following sections provide a brief overview of how to resolve common data type conversion issues, beginning with a primer on SSIS data types.
SSIS Data Types If you research the topic of "Integration Services Data Types" in Books Online, you'll first notice that the data types are named quite differently from similar types found in .NET or T-SQL. This nomenclature is troublesome for most new users. The following table provides a matrix between SSIS data types and a typical SQL Server set of data types. You'll need this table to map between data flow columns and variable or parameter data types. The .NET managed types are important only if you are using script component, CLR, or .NET-based coding to manipulate your Data Flows. The following table is just for SQL Server. To do a similar analysis for your own data source, look at the mapping files that can be found in this directory: C:\Program Files\Microsoft SQL Server\120\DTS\MappingFiles\. If you're familiar with OLE DB data types, you'll understand these SSIS data type enumerations, because they are similar. However, there is more going on than just naming differences. First, SSIS supports some data types that may not be familiar at all, nor are they applicable to SQL Server: namely, most of the unsigned integer types and a few of the date types. You'll also notice the availability of the separate date-only (DT_DBDATE) and time-only (DT_DBTIME) types, which prior to SQL Server 2008 were available only for RDBMS databases like DB2 and Oracle. With the introduction of similar data types in the SQL Server 2008 engine, they are also applicable in SSIS. Finally, notice the arrow "➪" in the table, which indicates that these data types are converted to other SSIS data types in Data Flow operations; these conversions may be opportunities for performance enhancements.
SSIS Data Type                        SQL Server Data Type       .NET Managed Type
DT_WSTR                               nvarchar, nchar            System.String
DT_STR ➪ DT_WSTR                      varchar, char              System.String
DT_TEXT ➪ DT_WSTR                     text                       System.String
DT_NTEXT ➪ DT_WSTR                    ntext, sql_variant, xml    System.String
DT_BYTES                              binary, varbinary          Array of System.Byte
DT_IMAGE ➪ DT_BYTES                   timestamp, image           Array of System.Byte
DT_DBTIMESTAMP                        smalldatetime, datetime    System.DateTime
DT_DBTIMESTAMP2 ➪ DT_DBTIMESTAMP      datetime                   System.DateTime
DT_DBDATE ➪ DT_DBTIMESTAMP            date                       System.DateTime
DT_DATE ➪ DT_DBTIMESTAMP                                         System.DateTime
DT_FILETIME ➪ DT_DBTIMESTAMP                                     System.DateTime
DT_DBTIMESTAMPOFFSET                  datetimeoffset
DT_DBTIME2                            time                       System.TimeSpan
DT_DBTIME ➪ DT_DBTIME2                                           System.TimeSpan
DT_NUMERIC                            numeric                    System.Decimal
DT_DECIMAL ➪ DT_NUMERIC               decimal                    System.Decimal
DT_GUID                               uniqueidentifier           System.Guid
DT_I1                                                            System.SByte
DT_I2                                 smallint                   System.Int16
DT_CY                                 smallmoney, money          System.Currency
DT_I4                                 int                        System.Int32
DT_I8                                 bigint                     System.Int64
DT_BOOL ➪ DT_I4                       bit                        System.Boolean
DT_R4                                 real                       System.Single
DT_R8                                 float                      System.Double
DT_UI1                                tinyint                    System.Byte
DT_UI2                                int                        System.UInt16
DT_UI4                                bigint                     System.UInt32
DT_UI8                                numeric                    System.UInt64
Date and Time Type Support SQL Server 2008 included new data types for separate date and time values and an additional time zone-based data type compliant with the ISO 8601 standard. SSIS has always had these data type enumerations for the other RDBMS sources, but as of SQL Server 2008, these can also be used for SQL Server as well, including DT_DBTIMESTAMP2 and DT_DBTIME2, added for more precision, and DT_DBTIMESTAMPOFFSET, added for the ISO DateTimeOffset SQL Server data type. A common mistake made in SSIS packages is the improper selection of an SSIS date data type. For some reason, DT_DBDATE and DT_DATE are often used for date types in Data Flow components, but improper use of these types can result in overflow errors or the removal of the time element from the date values. SSIS data types provide a larger net for processing incoming values than you may have in your destination data source. It is your responsibility to manage the downcasting or conversion operations. Make sure you are familiar with the data type mappings in the mapping file for your data source and destination, and the specific conversion behavior of each type. A good start would be the date/time types, because there are many rules regarding their conversion, as evidenced by the large section about them in Books Online. You can find these conversion rules for date/time data types under the topic "Integration Services Data Types" found here: http://msdn.microsoft.com/en-us/library/ms141036(v=SQL.120).aspx.
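As a small illustration of the time-element loss just mentioned, the following SSIS expression (using the expression language's GETDATE() function, for example in a hypothetical Derived Column) casts the current timestamp down to a date-only type and back:

(DT_DBTIMESTAMP)(DT_DBDATE)GETDATE()

The inner (DT_DBDATE) cast discards hours, minutes, and seconds, so the round trip back to DT_DBTIMESTAMP returns the same calendar date with a time of 00:00:00.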
How Wrong Data Types and Sizes Can Affect Performance If you’ve been working with SSIS for a while, you know that it can use serious memory resources and sometimes be slower than you expect. That’s because the Data Flow components do most of their work in memory. This can be good because it eliminates the most time-consuming I/O operations. However, because SSIS uses memory buffers to accomplish this, the number of rows that can be loaded into a buffer is directly related to the width of the row. The narrower the row, the more rows that can be processed per buffer. If you are defining the data types of a large input source, pick your data types carefully, so that you are not using the default 50 characters per column for a text file, or the suggested data types of the Connection Manager, when you do not need this extra safety cushion. Also, be aware that there are some trade-offs when selecting specific data types if they require any conversion as the data is being loaded into the buffers. Data conversion is a fact of life, and you’ll have to pay for it somewhere in the ETL process. These general guidelines can give you a start: ➤➤
Convert only when necessary. Don’t convert columns from a data source that will be dropped from the data stream. Each conversion costs something.
➤➤
Convert to the closest type for your destination source using the mapping files. If a value is converted to a nonsupported data type, you’ll incur an additional conversion internal to SSIS to the mapped data type.
➤➤
Convert using the closest size and precision. Don’t import all columns as 50-character data columns if you are working with a fixed or reliable file format with columns that don’t require as much space.
➤➤
Evaluate the option to convert after the fact. Remember that SSIS is still an ETL tool and sometimes it is more efficient to stage the data and convert it using set-based methods.
The bottom line is that data type issues can be critical in high-volume scenarios, so plan with these guidelines in mind.
Unicode and Non-Unicode Conversion Issues One aspect of ETL package development that you might not be used to is the default use of Unicode data types in SSIS packages. Not only is this the default import behavior, but all the string functions in SSIS expect Unicode strings as input. Unicode is a great choice if you’re unsure of the incoming data for handling data from import files with special characters, but if you’re not familiar with using this character set, it can be confusing at first. At the very least, using Unicode requires an additional step that is frequently missed, resulting in errors. For a typical demonstration, create a package that imports an Excel data source into a table defined with non-Unicode fields, or download the samples from www.wrox.com. Excel data is imported as Unicode by default, so the mapping step in the destination component complains that the data is not compatible, as shown in Figure 5-4. Figure 5-4
Note: You may experience some data being replaced by NULLs when importing Excel files using the Excel Connection Manager. This typically occurs when numeric and text data is stored within one column. One solution is to update the extended properties section of the connection string to look like this: Extended Properties="EXCEL 14.0;HDR=YES;IMEX=1"
At first, you might assume that all you need to do is change the source data type to match the non-Unicode destination. Using the SQL conversion table as a guide, right-click the source, select the Show Advanced Editor option, and change the column type to DT_STR to match the destination SQL Server varchar data type. Now you'll find that the same error from Figure 5-4 is occurring on both the source and the destination components. As discussed earlier in this section, SSIS requires purposeful conversion and casting operations. To complete the task, you need to add only a Data Conversion Transformation to convert the DT_WSTR and DT_R8 data types to DT_STR and DT_CY, respectively. The Data Conversion Transformation should look similar to Figure 5-5.
Figure 5-5
Notice in this Data Conversion Transformation that the data types and lengths are changed to truncate and convert the incoming string to match the destination source. Also, notice the Code Page setting that auto-defaults to 1252 for ANSI Latin 1. The Code Page setting varies according to the source of the Unicode data you are working with. If you are working with international data sources, you may need to change this to interpret incoming Unicode data correctly. This type casting operation is a good, simple example of how SSIS packages handle data of differing types. However, within expressions it is not necessary to bring in the conversion component to cast between different types. You can simply use casting operators to change the data types within the expression.
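For instance, the same string conversion the Data Conversion Transformation performs could be written directly in a Derived Column expression; the column name here is hypothetical:

(DT_STR, 50, 1252)[ProductName]

This casts a Unicode column to a 50-character non-Unicode string using code page 1252.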
Casting in SSIS Expressions If you want to experience the developer's equivalent of poking your eye out, forget to put a casting operator in your Data Flow expressions. SSIS is tightly tied to data types and requires casting, which simply defines the data type for a value or expression. If you forget to use casting or choose the wrong data type, the package may fail or cause errors when trying to insert that column into the final destination. While you can run into some frustrating issues if you don't do it, the need for casting is not always intuitive. For example, the result of any string function defaults to the Unicode string type. If you are attempting to store that value in a non-Unicode column, you need to cast. Conversely, if you are storing the value in a variable, you don't need to cast. (That's because the data types in variable definitions allow only Unicode; more about that later in the section "Defining Variables.") The good news is that casting is easy. In the expression language, this looks just like a .NET primitive cast. The new data type is provided in parentheses right next to the value to be converted. A simple example is casting a 2-byte signed integer to a 4-byte signed integer:
(DT_I4)32
Of course, not all the casting operators are this simple. Some require additional parameters when specific precision, lengths, or code pages have to be considered to perform the operation. These operators are listed in the following table:

Casting Operator                  Additional Parameters
DT_STR(length, code_page)         length — final string length; code_page — Unicode character set
DT_WSTR(length)                   length — final string length
DT_NUMERIC(precision, scale)      precision — max number of digits; scale — number of digits after decimal
DT_DECIMAL(scale)                 scale — number of digits after decimal
DT_BYTES(length)                  length — number of final bytes
DT_TEXT(code_page)                code_page — Unicode character set
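As a small, hedged illustration of these parameterized casts in use (the column names are hypothetical):

(DT_NUMERIC, 10, 2)[UnitPrice]
[Quantity] > 0 ? (DT_WSTR, 20)"In stock" : (DT_WSTR, 20)[Quantity]

The first line casts a floating-point column to a numeric with a precision of 10 and a scale of 2; the second returns DT_WSTR from both branches of the conditional operator, so the whole expression resolves to a single, consistent data type.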
Casting causes the most obvious trouble during comparison operations and logical expressions. Remember that all operands in comparison operations must evaluate to a compatible data type. The same rule applies to complex or compound logical expressions. In this case, the entire expression must return a consistent data type, which may require casting sections of the expression that may not readily appear to need casting. This is similar to the situation that you have in T-SQL programming when you attempt to use a number in a where clause for a numeric column, or when using a case statement that needs to return columns of different types. In the where predicate, both the condition and the column must be convertible into a compatible type. For the case statement, each column must be cast to the same variant data type. Later in the chapter, you'll look at examples in which you need to pay attention to casting when using comparison and logical expressions. A less obvious issue with casting can occur when data becomes truncated during casting. For example, casting Unicode double-byte data to non-Unicode data can result in lost characters. Significant digits can be lost in forced casting from unsigned to signed types or within types like 32-bit integers to 16-bit integers. These errors underscore the importance of wiring up the error outputs in the Data Flow components that have them. Before you look at that, however, look at the following section about variables and parameters and how they are used in dynamic SSIS package development.
Using Variables and Parameters Variables and parameters are the glue holding dynamic package development together. As discussed earlier, both of these objects are used to move values between package components. They are no different from variables in any programming environment. Variables in SSIS packages are scoped so that they can be accessed either at the package level or within a specific package component. Parameters can be scoped to either the package or the project level.
Defining Variables Variables can be created, deleted, renamed, and have their data types changed, as long as the package is in design mode. Once the package is validated and in runtime mode, the variable definition is locked; only the value of the variable can change. This is by design, so that the package is more declarative and type-safe. Creating a new variable is done through a designer that defines the scope of the variable depending upon how it is accessed. As mentioned earlier, variables can be scoped either to the package or to a specific component in the package. If the variable is scoped at a component level, only the component or its subcomponents have access to the variable. The following important tips can keep you out of trouble when dealing with variables: ➤
Variables are case sensitive. When you refer to a variable in a script task or an expression, pay attention to the case of the name. Different shops have their own rules, but typically variables are named using camel-case style.
➤
Variables can hide other variable values higher in the hierarchy. It is a good practice to not name variables similarly. This is a standard readability programming issue. If you have one variable outside a task and one inside the task, name them using identifiers like “inner” or “outer” to differentiate them.
A variable can be created by right-clicking the design surface of the package in which you need it. The Variables dialog enables you to create, edit, and delete variables. Figure 5-6 shows an example of two variables created within two scope levels: the Data Flow Task and the package.
Figure 5-6
However, the Variables window does not expose all the capabilities of the variables. By selecting a variable and pressing F4, you will see the Properties window for the SelectSQL variable, as shown in Figure 5-7. The reason for displaying the Properties window for the SelectSQL variable is to point out the EvaluateAsExpression and Expression properties. The value of a variable either can be a literal value or can be defined dynamically. By setting the EvaluateAsExpression property to True, the variable takes on a dynamic quality that is defined by the expression provided in the Expression property. The SelectSQL variable is actually holding the result of a formula that concatenates the string value of the base select statement stored in the BaseSelect variable and a user-provided date parameter. The point often missed by beginning SSIS developers is that these variables can be used to store expressions that can be reused throughout the package. Rather than recreate the expression all over the package, you can create an expression in a variable and then plug it in where needed. This greatly improves package maintainability by centralizing the expression definition. You'll see an example that shows how to create an expression-based variable later in this chapter. As an alternative to using expressions, variables can also be set programmatically using the script tasks. Refer to Chapter 9 for examples describing how this is accomplished.
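As a hedged preview of what such an Expression property might contain (the variable and parameter names here are illustrative, and EvaluateAsExpression must be True for the expression to be evaluated):

@[User::BaseSelect] + " WHERE TransactionDate >= '" + (DT_WSTR, 30)@[$Package::ExtractDate] + "'"

At runtime the base SELECT statement stored in BaseSelect is concatenated with a date taken from a package parameter, producing the final SQL string held by SelectSQL.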
Defining Parameters

An exciting addition to the Integration Services family is the concept of parameters. Like variables, parameters store information, but they can be set in a variety of ways: package default, project
default, or execution values. Using any method, the value stored in the parameter can be used in any expression to set a package property or variable. The parameter is initially created in the package or project. To create a package parameter, select the Parameters tab in the design window and click the first icon on the toolbar. You can enter all the information shown in Figure 5-8. If you set the Required property, you force the setting of the parameter value to occur at runtime. If you set the Sensitive property, you tell Integration Services to encrypt the value in the catalog. To create a project parameter, right-click the Project.params node in the Solution Explorer and select the Open option. This will open a view similar to the package parameters view.
Figure 5-8
Once you have created a package or project parameter, you can use it to set other values. Parameter names are case sensitive and are prefixed by a dollar sign and either "Project" or "Package," depending on the parameter's type. Keep in mind that, unlike variables, parameters cannot be changed by an expression. You'll walk through a package that uses a parameter to create expressions later in this chapter.
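For reference, here is how parameter references appear once dropped into an expression. The parameter names ServerName and BatchSize are placeholders invented for this illustration:

@[$Project::ServerName]
@[$Package::BatchSize]

Because parameters are read-only at runtime, these references are combined with variables or literals inside an expression rather than assigned to.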
Variable and Parameter Data Types

You may have noticed that the data types available for variable definition are a little different from the SSIS data types that were discussed earlier in this chapter. For example, the value type for string variable storage is String instead of DT_WSTR or DT_STR. Admittedly, this is confusing. Why does SSIS use what looks like a generalized managed type in the variable definition and yet a more specific set of data types in the Data Flows? The answer lies in the implementation of variables within the SSIS engine. Variables can be set from outside of the package, so they are implemented in SSIS as COM variants. This enables the SSIS engine to use some late binding to resolve to the variable value within the package. However, note that this variant data type is not available anywhere within your control as an SSIS programmer. Variants are only an internal implementation in SSIS. Use the following table to help map the variable data types to SSIS Data Flow data types:
Variable Data Type | SSIS Data Type | Description
Boolean | DT_BOOL | Boolean value. Either True or False. Be careful setting these data types in code because the expression language and .NET languages define these differently.
Byte | DT_UI1 | A 1-byte unsigned integer. (Note this is not a byte array.)
Char | DT_UI2 | A single character
DateTime | DT_DBTIMESTAMP | A date-time structure that accommodates year, month, day, hour, minute, second, and fractional seconds
DBNull | N/A | A declarative NULL value
Double | DT_R8 | A double-precision, floating-point value
Int16 | DT_I2 | A 2-byte signed integer
Int32 | DT_I4 | A 4-byte signed integer
Int64 | DT_I8 | An 8-byte signed integer
Object | N/A | An object reference. Typically used to store data sets or large object structures
SByte | DT_I1 | A 1-byte signed integer
Single | DT_R4 | A single-precision, floating-point value
String | DT_WSTR | Unicode string value
UInt32 | DT_UI4 | A 4-byte unsigned integer
UInt64 | DT_UI8 | An 8-byte unsigned integer
For most of the data types, there is ample representation. Typically, the only significant issues with variable data types are related to the date/time and string data types, where the only options available are the higher-capacity types. This is not a big deal from a storage perspective, because variable declaration is rather finite; you won't have too many variables defined in a package. If a package requires a string data type, note in the preceding table that the default data type for strings is the Unicode version, so if you need to store non-Unicode values you must convert them. This seems like a lot of preliminary information to go over before diving into creating an expression, but with a basic understanding of these core concepts, you will avoid most of the typical issues that SSIS developers encounter. Now you can use this knowledge to dive into the expression language and some sample uses of expressions in SSIS.
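As a quick, hedged illustration of that last point, an expression that must hand a string variable to a non-Unicode (DT_STR) destination can cast it explicitly. The variable name, the column width of 50, and code page 1252 are arbitrary placeholders:

(DT_STR, 50, 1252)@[User::MyStringVariable]

The reverse cast, (DT_WSTR, 50), converts a non-Unicode value back to Unicode when an expression requires it.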
Working with Expressions

The language used to build expressions can be a bit disorienting at first. If you started out as a programmer, you will be comfortable switching between Visual Basic, C#, and T-SQL. The key to being proficient in building expressions is understanding that the syntax of this new scripting language is a combination of all these different languages.
C#-Like? Close, but Not Completely

Why not write the expression language in T-SQL or a .NET-compliant language? The answer is mostly related to marketing: expressions should reflect the multiplatform capability of operating on more than just SQL Server databases. Remember that expressions can be used on data from other RDBMS sources, like Oracle, DB2, and even data from XML files. The technical explanation, however, is that the SSIS and SQL Server core engines are written in native code, so any extension of the expression language to use .NET functions would incur the performance impact of loading the CLR and the memory management systems. The expression language without .NET integration can be optimized for the custom memory management required for pumping large row sets through data flow operations. As the SSIS product matures, you'll see the SSIS team add more expression enhancements to expand on the existing functions. Meanwhile, let's look at some of the pitfalls of using the expression language. The expression language is marketed as having a heavily C#-like syntax, and for the most part that is true. However, you can't just put on your C# hat and start working, because some peculiarities are mixed into the scripting language. The language is heavily C#-like when it comes to using logical and comparison operators, but it leans toward a Visual Basic flavor and sometimes a little T-SQL for functions. For example, notice that the following common operators are undeniably from a C# lineage:
Expression Operator | Description
|| | Logical OR operation
&& | Logical AND operation
== | Comparison of two expressions to determine equivalency
!= | Comparison of two expressions to determine inequality
? : | Conditional operator
The conditional operator may be new to you, but it is especially important for creating compound expressions. In earlier releases of SSIS, the availability of this operator wasn’t readily intuitive. If you aren’t used to this C-style ternary operator, it is equivalent to similar IF..THEN..ELSE.. or IIF(, , ) constructs. The following functions look more like Visual Basic script or T-SQL language functions than C#:
Expression Function | Description | C# Equivalent
POWER() | Raise numeric to a power | Pow()
LOWER() | Convert to lowercase | ToLower()
GETDATE() | Return current date | Now()
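As a small sketch of these SSIS spellings in practice (the literal values are arbitrary), the following expressions are valid in the Expression Builder even though a C# developer would instinctively reach for Pow(), ToLower(), and DateTime.Now:

POWER(2, 10)
LOWER("MiXeD Case")
GETDATE()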
This makes things interesting because you can't just plug in a C# function without first checking whether there is an SSIS expression function with a different name that performs the same operation. However, if you make this type of mistake, don't worry. Either the expression turns red, or you'll immediately get a descriptive error instructing you that the function is not recognized upon attempting to save. A quick look in Books Online can help resolve these types of function syntax differences. In some instances, the function you are looking for can be drastically different and cause some frustration. For example, if you are used to coding in C#, it may not be intuitive to look for the GETDATE() function to return the current date. The GETDATE() function is typically something one would expect from a T-SQL language construct. Thankfully, it performs as a T-SQL function should and returns the current date. This is not always the case. Some functions look like T-SQL functions but behave differently:
Expression Function | Description | Difference
DATEPART() | Parses date part from a date | Requires quotes around the date part
ISNULL() | Tests an expression for NULL | Doesn't allow a default value
This departure from the T-SQL standard can leave you scratching your head when the expression doesn't compile. The biggest complaint about this function is that you have to use composite DATEPART() functions to get to any date part other than month, day, or year. This is a common task for naming files for archiving. Nor does the ISNULL() function work like the T-SQL function. It returns either true or false to test a value for the existence of NULL. You can't substitute a default value as you would in T-SQL. These slight variations in the expression language between full-scale implementations of T-SQL, C#, or Visual Basic syntaxes do cause some initial confusion and frustration, but these differences are minor in the grand scheme of things. Later in this chapter, you'll find a list of expressions that you can cut and paste to emulate many of the functions that are not immediately available in the expression language.
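To make those two differences concrete, here are hedged sketches; the column name MyColumn and the replacement value "Unknown" are placeholders. The first pads an hour value with a leading zero, the kind of composite DATEPART() work typically needed when naming archive files; the second emulates the T-SQL ISNULL(value, default) behavior with the conditional operator:

RIGHT("0" + (DT_WSTR, 2) DATEPART("hh", GETDATE()), 2)
ISNULL([MyColumn]) ? "Unknown" : [MyColumn]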
The Expression Builder

Several locations in the SSIS development environment allow the creation of an expression. Whether you are in the Variables window or within any property expression editor, ultimately the expression is created within a user interface called the Expression Builder. This user interface maintains easy references to both system- and user-based variables and provides access to expression functions and operators. The most important feature of the Expression Builder is the capability it provides to test an expression — that is, to see the evaluated value — by clicking the Evaluate Expression button. This is especially helpful as you learn the syntax of the expression language. By dragging and dropping variables and operators onto the expression workspace, you can see how to format expressions properly. Inside Data Flow components, typically a specific expression builder includes additional elements related to the Data Flow. In Figure 5-9, you can see that the user interface for the Derived Column Transformation includes a folder named Columns to allow expressions to be built with data from the Data Flow.
Figure 5-9
The only downside in the Data Flow component versions of the Expression Builder is that you don't have the option to see the results of evaluating the expression to determine whether you coded it properly. The reason is that the data from the Data Flow is not available without running the package. This brings up a point about maintainability. If you have an involved expression that can be realized independently from data in the data stream, you should build the expression outside of the Data Flow component and simply plug it in as a variable. However, in some cases you have no choice but to build the expression at the Data Flow component level. If so, one of the best practices that we recommend is to create one variable at the package level called MyExpressionTest. This variable gives you a quick jumping-off point to build and test expressions to ensure that the syntax is coded correctly. Simply access the Variables property window and click the ellipsis beside the Expression property, and the Expression Builder pops up. Use this technique to experiment with some of the basic syntax of the expression language in the next section.
Syntax Basics

Building an expression in SSIS requires an understanding of the syntax details of the expression language. Each of the following sections dives into an aspect of the expression syntax and explores the typical issues encountered with it, including their resolution.
Equivalence Operator

This binary operator, which is used to compare two values, seems to create some problems for SSIS developers who are not used to using the double equal sign syntax (==). Forgetting to use the double equal sign in a comparison operation can produce head-scratching results. For example, consider a precedence operation that tests a variable value to determine whether the value is equal to True, but the expression is written with a single equal sign. Imagine that the variable is set by a previous Script Task that checks whether a file is available to process.

@[User::MyBooleanValue] = True
The expression is evaluated, and @MyBooleanValue is assigned the value of True. This overwrites any previous value for the variable. The precedence constraint succeeds, the value is true, and the tasks continue to run with a green light. If you aren’t used to using the double equal sign syntax, this will come back to bite you, which is why we have discussed this operator by itself at the front of the syntax section.
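The correct comparison simply doubles the equal sign, so the expression tests the variable rather than assigning to it:

@[User::MyBooleanValue] == True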
String Concatenation

There are many uses for building strings within an expression. Strings are built to represent a SQL statement that can be executed against a database, to provide information in the body of an e-mail message, or to build file paths for file processing. Building strings is a core task that you have to be able to do for any development effort. In SSIS the concatenation operator is the plus (+) sign. Here is an example that you can quickly put together in the Expression Builder and test:

"The Server [" + LOWER( @[System::MachineName]) + "] is running this package"
This returns the following string: The Server [myserver] is running this package
If you need to build a string for a file path, use the concatenation operator to build the fully qualified path with the addition of an escape character to add the backslashes. Later in this chapter, the section "String Literals" covers all the common escape characters that you'll need for string building. A file path expression would look like this:

"c:\\mysourcefiles\\" + @myFolder + "\\" + @myFile
Note that strings are built using double quotes (""), not single quotes ('') as you might see in T-SQL; it's important to ensure that the strings are all Unicode or all non-Unicode. A previous limitation of 4,000 characters for an expression has been removed from Integration Services. Feel free to make strings as long as you desire!
Line Continuation

There are two reasons to use line continuation characters in SSIS expressions. One is to make the expression easier to troubleshoot later, and the other is to format output for e-mail or diagnostic use. Unfortunately, the expression language does not support the use of comments, but you can use the hard returns to help the expression look more organized. In the Expression Builder,
simply press the Enter key to have your expression displayed with the carriage-return-line-feed character sequence. This formatting is maintained even after you save the expression. To format the output of the expression language, use the C-like escape character \n. Here's an example of using it with a simple expression:

"My Line breaks here\nAnd then here\n; )"
This returns the following string:

My Line breaks here
And then here
; )
Note that it is not necessary to show the expression in code form in one line. An expression can be written on multiple lines to clarify viewing of it at design time. The output would remain the same.
Literals

Literals are hard-coded information that you must provide when building expressions. SSIS expressions have three types of literals: numeric, string, and Boolean.
Numeric Literals

A numeric literal is simply a fixed number. Typically, a number is assigned to a variable or used in an expression. Numeric literals in SSIS have the same behavior that they have in C# or Java — you can't just implicitly define numeric literals. Well, that's not completely true; SSIS does interpret numeric values with a few default rules, but the point is that the rules are probably not what you might expect. A value of 12 would be interpreted as the default data type of DT_UI4, or the 4-byte unsigned integer. This might be what you want, but if the value were changed to 3000000000 during the evaluation process, an error similar to this will be generated:

The literal "3000000000" is too large to fit into type DT_UI4. The magnitude of the literal overflows the type.
SSIS operates on literals using logic similar to the underlying .NET CLR. Numeric literals are checked to see if they contain a decimal point. If they do not, the literal is cast using the unsigned integer DT_UI4 data type. If there is a decimal point, the literal is cast as a DT_NUMERIC. To override these rules, you must append a suffix to the numeric literal. The suffix enables a declarative way to define the literal. The following are examples of suffixes that can be used on numeric literals:

Suffix | Description | Example
L or l | Indicates that the numeric literal should be interpreted as the long version of either the DT_I8 or DT_R8 value types, depending upon whether a decimal is present | 3000000000L ➪ DT_I8; 3.14159265L ➪ DT_R8
U or u | Indicates that the numeric literal should represent the unsigned data type | 3000000000UL ➪ DT_UI8
F or f | Indicates that the numeric literal represents a float value | 100.55f ➪ DT_R4
E or e | Indicates that the numeric literal represents scientific notation. Note: expects at least one digit of scientific notation followed by a float or long suffix | 6.626 × 10^-34 J/s ➪ 6.626E-34F ➪ DT_R8; 6.626E won't work. If you don't have a digit, format it as 6.626E+0L or 6.626E+0f
Knowing these suffixes and rules, the previous example can be altered to 3000000000L, and the expression can be validated.
String Literals

When building strings, there are times when you need to supply special characters in them. For example, PostgreSQL database sources require the use of quoted column and table names. The key here is to understand the escape sequences that are understood by the expression syntax parser. The escape sequence for the double quote symbol is \". A sample expression-generated SQL statement might look like this:

"Select \"myData\" from \"owner\".\"myTable\""
The preceding expression would generate the following string: Select "myData" from "owner"."myTable"
Other common escape sequences that you may need are listed in this table:
Escape Sequence | Description | Example
\n | New line (carriage return/line feed) | "Print this on one line\nThis on another" evaluates to two lines: "Print this on one line" and "This on another"
\t | Tab character | "Print\twith\ttab\tseparation" evaluates to: Print with tab separation (tab-delimited)
\" | Double-quotation mark character | "\"Hey! \"" evaluates to: "Hey! "
\\ | Backslash | "c:\\myfile.txt" evaluates to: c:\myfile.txt
A few other string escape sequences are supported, but the elements in this table list those most frequently used. The backslash escape sequences come in handy when building file and path strings. The double quote escape sequence is more often used to interact with data sources that require quoted identifiers. This escape sequence is also used in combination with the remaining new line and tab characters to format strings for logging or other reporting purposes.
Boolean Literals

The Boolean literals of True and False don't have to be capitalized, nor are they case sensitive. Boolean expressions are shortcut versions of the logical operators. To drive certain package functionality conditionally based on whether the package is running in an offline mode, you could write an expression in a variable using an artificial on or off type switch mechanism, as shown here:

@[System::OfflineMode]==True ? 1 : 0 (Not Recommended)
The idea is to use the results of this operation to determine whether a precedence constraint should operate. The precedence operator would retest the expression to determine whether the value was 1 or 0. However, this is an awfully long way to do something. It's much easier to just create an expression that looks like this:

@[System::OfflineMode]==False
Then all you have to do is plug the expression into the Precedence Editor, as shown in Figure 5-10. Note that using the literal is recommended over using any numeric values for evaluation. Programming any expression to evaluate numeric versions of Boolean values is dangerous and should not be a part of your SSIS techniques.

Figure 5-10
Referencing Variables

Referencing variables is easy using the Expression Builder. Drag and drop variables onto the Expression Builder to format the variable into the expression properly. As shown in the following example, notice that the format of the variable automatically dropped into the expression is preceded with an @ symbol, followed by the namespace, a C++-like scope resolution operator, and then the variable name:

@[namespace::variablename]
Technically, if the variable is not repeated in multiple namespaces and there are no special characters (including spaces) in the variable name, you could get away with referring to the variable using a short identifier like @variablename or just the variable name. However, this type of lazy variable
referencing can get you into trouble later. We recommend that you stick with the fully qualified way of referencing variables in all SSIS expressions.
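As a brief illustration, the fully qualified forms look like this (FileName is an invented user variable; PackageName is a standard system variable), and they remain unambiguous even if a similarly named variable exists in another scope:

@[User::FileName]
@[System::PackageName]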
Referencing Parameters

Referencing parameters is just as simple in the Expression Builder. When you drag and drop the parameter name, the value is automatically preceded with an @ symbol, followed by square brackets containing a dollar sign, the namespace of the package or project, the C++-like scope resolution operator, and the parameter name:

@[$namespace::parametername]
Typically, developers can run into trouble with variable and parameter references in the Precedence Constraint Editor (refer to Figure 5-10). That's probably because there is no Expression Builder to help build the expression, so it must be manually entered. This is where the tip of creating the dummy variable MyExpressionTest comes in handy. You can build an expression within this dummy variable's Expression Builder and then simply cut and paste the value into the Precedence Constraint Editor.
Referencing Columns

Columns can be referenced in expressions, but only within components in a Data Flow Task. This makes sense. Creating a global expression to reference a value in a Data Flow is the equivalent of trying to use a single variable to capture the value of a set-based processing operation. Even a variable expression defined at the same level or scope as a Data Flow Task should not be able to reference a single column in a Data Flow that is under constant change. However, from within specific components like the Derived Column Transformation, the Expression Builder can reference a column because operations occur at the row level. Expressions within a data component can access column identifiers to allow point-and-click building of expressions. There are a couple of things to remember when referencing columns in expressions:

➤ Data Flow column names must follow the SSIS standards for special characters.

➤ Column names must be uniquely named or qualified within a Data Flow.

A common issue with building expressions referencing columns in a Data Flow has less to do with the expression language than with the names of the columns themselves. This is particularly true when dealing with Microsoft Excel or Access data, where columns can use nonstandard naming conventions. SSIS requires that the columns being used in an expression begin with either a valid Unicode letter or an underscore (_). With the exception of bracket characters, any other special characters require qualification of the column in order to be used within an expression. Brackets ([ and ]) are the designators used by SSIS to qualify a column name. Qualification of column names is required if the name contains special characters — including spaces. Because bracket characters are column name qualifiers, any column with brackets in the name must be renamed to use an expression. This doesn't require changing the column name in the originating source. Column names also must be qualified when two or more columns in a Data Flow have the same name, in order to avoid ambiguous references. The following are examples of columns that need qualification:
Column Name | Qualified Column Name | Description
My Column | [My Column] | Column names can't contain spaces.
File# | [File#] | Column names can't contain special characters.
@DomainName | [@DomainName] | Column names can't contain special characters.
Enrolled? | [Enrolled?] | Column names can't contain special characters.
Source1 ID | [Source1].[ID] | Column names can't have the same name within a Data Flow.
Source2 ID | [Source2].[ID] | Column names can't have the same name within a Data Flow.
Another way to refer to columns, unique to SSIS, is by lineage number. A lineage number is something that SSIS assigns to each input and output as it is added to a transformation component in a package. The lineage number is quickly replaced by the real column name when the expression is syntax compiled. To find the lineage number for a column, look at any advanced editor dialog and find the column in the input column properties under LineageID. Keep in mind that as you add columns, the lineage numbers may change, so they should not be used for manipulation purposes, only for troubleshooting.
Boolean Expressions

Boolean expressions evaluate to either true or false. In their simplest implementation, precedence constraints use Boolean expressions as gatekeepers to determine whether or not an operation should occur. Within Data Flow operations, Boolean expressions are typically employed in the Conditional Split Transformation to determine whether a row in a Data Flow should be directed to another output. For example, a Boolean expression to determine whether a Control Flow step should run only on Friday would require code to parse the day of the week from the current date and compare it to the sixth day, as shown here:

DATEPART( "dw", GETDATE() ) == 6
This is a useful Boolean expression for end of the week activities. To control tasks that run on the first day of the month, use an expression like this: DATEPART ("dd", GETDATE() ) == 1
This expression validates as true only when the first day of the month occurs. Boolean expressions don’t have to be singular. Compound expressions can be built to test a variety of conditions. Here is an example in which three conditions must all evaluate to true in order for the expression to return a true value: BatchAmount == DepositAmount && @Not_Previously_Deposited == True && BatchAmount > 0.00
The @Not_Previously_Deposited argument in this expression is a variable; the other arguments represent columns in a Data Flow. Of course, an expression can just as easily evaluate alternate conditions, like this:

(BatchAmount > 0.00 || BatchAmount < 0.00) && @Not_Previously_Deposited == True
In this case, the BatchAmount must not be equal to 0.00. An alternative way to express the same thing is to use the inequality operator: BatchAmount != 0.00 && @Not_Previously_Deposited == True
Don’t be tripped up by these simple examples. They were defined for packages in which the data had known column data types, so there was no need to take extra precautions with casting conversions. If you are dealing with data from less reliable data sources, however, or you know that two columns have different data types, then take casting precautions with your expression formulas, such as in this expression: (DT_CY)BatchAmount == (DT_CY)DepositAmount && @Not_Previously_Deposited == True && (DT_CY)BatchAmount > (DT_CY)0.00
The Boolean expression examples here are generally the style of expression used to enable dynamic SSIS package operations. We have not yet covered the conditional, date/time, and string-based Boolean expressions, which appear in the following sections. String expression development requires a little more information about how to handle a NULL or missing value, which is covered next. You can see some examples of these Boolean expressions put to work at the end of this chapter.
Dealing with NULLs

In SSIS, variables can't be set to NULL. Instead, each variable data type maintains a default value in the absence of a value. For strings, the default value is an empty string, rather than the default of NULL that you might be used to in database development. However, Data Flow components can most certainly contain NULL values. This creates problems when variables are intermixed within Data Flow components. This mixture occurs either within a Script Task or within an expression. However, if a value in the Data Flow needs to be set to NULL or even tested for a NULL value, this is another matter altogether and can be accomplished rather easily with the ISNULL() expression function and the NULL(type) casting functions. Just understand that variables are going to behave a little differently.
NULLs and Variables

The reason you can't set variables to NULL values is related to the COM object variant implementation of variables in the SSIS engine. Regardless of the technical issue, if you are testing a variable for the absence of a value, you have to decide ahead of time what value you are going to use to represent the equivalent of a NULL value, so that you can test for it accurately. For example, the DateTime variable data type defaults to 12/30/1899 12:00:00 a.m. if you purposely set it to NULL.
You can test this out yourself by creating a DateTime variable and setting it equal to an expression defined using the casting function NULL(DT_DBTIMESTAMP). It helps to get a handle on the default values for the SSIS variable data types. You can find them in this table:

Variable Data Type | Default Value
Boolean | False
Byte | 0
Char | 0
DateTime | 12/30/1899
DBNull | (Can't test in an expression)
Double | 0
Int16 | 0
Int32 | 0
Int64 | 0
Object | (Can't test in an expression)
SByte | 0
Single | 0
String | "" (empty string)
UInt32 | 0
UInt64 | 0
Using this table of default values, the following expression could be used in a precedence operation to test for the absence of a value in the string variable MyNullStringVar:

@[User::MyNullStringVar]==""
If the value of the user variable is an empty string, the expression evaluates to a True value and the step is executed. A frequent logic error that SSIS developers make is to use a variable to hold a value set from an expression that will be used within a multiple-instance looping structure. If the value is not reset in a way that enables clean retesting, the value of the variable will remain the same for the life of the package. No error will be raised, but the package may not perform multiple iterations as expected. Make sure a variable is reset to enable retesting if the test will be performed multiple times. This may require additional variables to cache intermediate results.
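As a hedged sketch of the reset pattern (the variable name FileWasFound is invented, and either an Expression Task, covered later in this chapter, or a Script Task could perform the assignment), clear the flag at the top of each loop iteration before the logic that sets it:

@[User::FileWasFound] = False

Resetting the variable inside the loop, ahead of the task that performs the test, guarantees that each iteration starts from the same known state.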
NULLs in Data Flow

Using the NULL function in Data Flow Transformations is a different matter because values in a Data Flow can actually be NULL. Here you can use the expression function to test for NULL values in the data stream. The trouble usually stems from a misunderstanding of either how the ISNULL() function works or what to do after a NULL value is found. First, the ISNULL() expression function tests the expression in the parentheses for the value of NULL. It does not make a substitution if a NULL value is found, as the same-named function does in T-SQL. To emulate the T-SQL function ISNULL(), build an SSIS expression in a Data Flow, as shown here:

IsNull(DATA_COLUMN) ? YOUR_DEFAULT_VALUE : DATA_COLUMN
If instead you want to set a column to NULL based on some attribute of the data in the incoming data stream, the logical structure is similar. First, provide the testing expression followed by the actions to take if the test is true or false. Here is a function that sets a data column in a Data Flow to NULL if the first character starts with “A”: SUBSTRING([MyColumn], 1, 1)=="A" ? NULL(DT_WSTR, 255) : [MyColumn]
A typical issue that occurs when handling NULLs doesn’t actually have anything to do with NULL values themselves but rather with string expressions. When creating data streams to punch back into RDBMS data destinations, you will often want to send back a column with NULL values when a test on the data can’t be completed. The logic is to either send the column data back or replace the column data with a NULL value. For most data types, this works by sending the results of the NULL function for the data type desired. For some reason, this works differently when you want to save non-Unicode data with a NULL value. You’d expect the following expression to work, but it doesn’t: SUBSTRING([MyColumn] , 1, 1)=="A" ? NULL(DT_STR, 255, 1252) : [MyColumn]
(This doesn't work in SSIS)
The preceding example won’t work because of how SSIS handles NULL values for the non-Unicode string type as parameters. The only way to fix this is to cast the NULL function as follows: SUBSTRING([MyColumn] , 1, 1)=="A" ? (DT_STR, 255, 1252)NULL(DT_STR, 255, 1252) : [MyColumn]
This section should have clarified the common issues you are likely to encounter when dealing with NULL values, especially as they relate to strings. However, there are still some tricks to learn about dealing with strings, which we cover next.
String Functions

Handling strings in SSIS expressions is different from dealing with string data in SQL Server. The previous section discussed some of the differences with handling NULL values. You also have to pay attention to Unicode and non-Unicode strings. If a package is moving data between multiple Unicode string sources, you have to pay attention to the code pages between the strings. If you are comparing strings, you also have to pay attention to string padding, trimming, and issues with data truncations. Handling strings is a little involved, but you really only need to remember a few things.
Expression functions return Unicode string results. If you are writing an expression to return the uppercase version of a varchar-type column of data, the result will be a Unicode column with all capital letters. The string function Upper() returns a Unicode string. In fact, SSIS sets all string operations to return a Unicode string. For example, note the date expression in the Derived Column Transformation in Figure 5-11.
Figure 5-11
Here you are just adding a string column that includes the concatenation of a date value. The function is using a DatePart() function whose results are cast to a non-Unicode string, but the default data type chosen in the editor is a Unicode string data type. This can be overridden, of course, but it is something to watch for as you develop packages. On the one hand, if the data type is reverted to non-Unicode, then the string has to be converted for each further operation. On the other hand, if the value is left as a Unicode string and the result is persisted in a non-Unicode format, then at some point it has to be converted to a non-Unicode value. The rule of thumb that usually works out is to leave the strings converted as Unicode and then convert back to non-Unicode if required during persistence. Of course, this depends on whether there is a concern about using Unicode data. Comparing strings requires that you have two strings of the same padding length and case. The comparison is case and padding sensitive. Expressions should use the concatenation operator (+) to
get the strings into the same padding style. Typically, this is done when putting together date strings with an expected type of padding, like this: RIGHT("0" + @Day, 2) + "/" + RIGHT("0" + @Month, 2) + "/" + RIGHT("00" + @Year, 2)
This type of zero padding ensures that the values in both variables are in the same format for comparison purposes. By padding both sides of the comparison, you ensure the proper equality check:

RIGHT("0" + @Day, 2) + "/" + RIGHT("0" + @Month, 2) + "/" + RIGHT("00" + @Year, 2) == RIGHT("0" + @FileDay, 2) + "/" + RIGHT("0" + @FileMonth, 2) + "/" + RIGHT("00" + @FileYear, 2)
A similar type of padding operation can be used to fill in spaces between two values: SUBSTRING(@Val1 + " ", 1, 5) + SUBSTRING(@Val2 + " ", 1, 5) + SUBSTRING(@Val3 + " ", 1, 5)
Typically, space padding is used for formatting output, but it could be used for comparisons. More often than not, spaces are removed from strings for comparison purposes. To remove spaces from strings in expressions, use the trim functions: LTrim(), RTrim(), and Trim(). These functions are self-explanatory, and they enable comparisons for strings that have leading and trailing spaces. For example, comparing the strings "Canterbury" and "Canterbury " returns false unless the expression is written like this:

Trim("Canterbury") == Trim("Canterbury ")
This expression returns true because the significant spaces are declaratively removed. Be careful with these extra spaces in string expressions as well. Spaces are counted in all string functions, which can result in extra character counts for extra spaces when using the LEN() function and can affect carefully counted SUBSTRING() functions that do not expect leading and trailing spaces. If these issues are of importance, employ a Derived Column Transformation to trim these columns early in the Data Flow process.
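A small hedged example of that counting behavior, using an arbitrary literal with three trailing spaces: the first expression returns 13 because the trailing spaces are counted, while the second returns 10 after trimming.

LEN("Canterbury   ")
LEN(TRIM("Canterbury   "))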
Conditional Expressions

You use the conditional expression operator to build logical evaluation expressions in the format of an IF..THEN logical structure:

Boolean_expression ? expression_if_true : expression_if_false
The first part of the operator requires a Boolean expression that is tested for a true or false return value. If the Boolean expression returns true, then the first expression after the ternary operator
(?) will be evaluated and returned as the final result of the conditional expression. If the Boolean expression returns false, then the expression after the separation operator (:) will be evaluated and returned. Both expressions, as operands, must adhere to one of the following data type rules:

➤ Both operands must be numeric data types that can be implicitly converted.

➤ Both operands must be string data types of either Unicode or non-Unicode. Each operand can evaluate to separate types — except for the issue of setting explicit NULL values. In that case, the NULL value for DT_STR non-Unicode NULL values must be cast.

➤ Both operands must be date data types. If more than one data type is represented in the operands, the result is a DT_DBTIMESTAMP data type.

➤ Both operands for a text data type must have the same code pages.
If any of these rules are broken, or the compiler detects incompatible data types, you will have to supply explicit casting operators on one or both of the operands to cause the conditional expression to evaluate. This is more of an issue as the conditional expression is compounded and nested. A typical troubleshooting issue is seeing an incompatible data type message resulting from a comparison deep in a compound conditional expression. This can be the result of a column that has changed to an incompatible data type, or a literal that has been provided without a suffix consistent with the rest of the expression. The best way to test the expression is to copy it into Notepad and test each piece of the expression until the offending portion is located. Casting issues can also create false positives. You can see casting truncation in the following example of a Boolean expression comparing a datetimestampoffset value and a date value:

(DT_DBDATE) "2014-01-31 20:34:52.123 -3:30" == (DT_DBDATE)"2014-01-31"
Casting converts the expression (DT_DBDATE) "2014-01-31 20:34:52.123-3:30" to "2014-01-31", causing the entire expression to evaluate to true. Date and time conversions are one example of casting issues, but they can occur on any data type that allows forced conversion.
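As a hedged sketch of supplying explicit casts in a conditional expression (the column name OrderTotal and the currency type are placeholders chosen for illustration), both branches are cast to the same type so the compiler has nothing to complain about:

ISNULL([OrderTotal]) ? (DT_CY)0 : (DT_CY)[OrderTotal]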
Date and Time Functions

Date and time functions tend to cause confusion for new SSIS developers. In most instances, the different syntax is what causes the difficulty. As mentioned earlier, the DatePart() function is a perfect example of this. T-SQL programmers need to double quote the date part portion of the function, or they will see an error similar to this:

The expression contains unrecognized token "dd". If "dd" is a variable then it should be expressed as "@dd". The specific token is not valid. If the token is intended to be a variable name, it should be prefixed with the @ symbol.
The fix is simple: put double quotation marks around the date part. A properly formatted DatePart() expression should look like this: DATEPART( "dd", GETDATE() )
Note that this expression returns the value of the day of the month — for example, 31 if the date is January 31, 2014. A common mistake is to expect this to be the day of the week. You can accomplish that task by changing the date part in the expression like this:

DATEPART( "dw", GETDATE() )
These are just minor adjustments to the SSIS expression language, but they can create some frustration. Another example can be found when attempting to reference date values in an expression. If you're used to MS Access date literals, you may be tempted to use something like this:

"SELECT * FROM myTable WHERE myDate >= " + #01/31/2014# (DOESN'T WORK IN SSIS)
That won’t work in SSIS; the # signs are used for a different purpose. If the string is going to be interpreted by SQL Server, just use the single quote around the date: "SELECT * FROM MYTABLE WHERE MYDATE >= '" + "01/31/2014" + "'"
If the string is just going to be printed, the single quotes aren’t needed. Alternatively, to plug in a date value from a variable, the expression would look like this: "SELECT * FROM MYTABLE WHERE MYDATE >= '" + (DT_WSTR, 255)@[System::ContainerStartTime] + "'"
Notice that the value of the date variable must be cast to match DT_WSTR, the default Unicode data type for all expression strings. The problem with simply casting a date to a string is that you get the entire date, which doesn't translate into what you may want to use as a query parameter. This is clearer if the preceding expression is resolved:

SELECT * FROM MYTABLE WHERE MYDATE >= '02/22/2014 2:28:40 PM'
If your goal is truly to see results only from after 2:28:40 p.m., then this query will run as expected. If items from earlier in the day are also expected, then you need to do some work to parse out the values from the variable value. If the intent is just to return rows for the date that the package is running, it is much easier to create the expression like this (with your proper date style, of course):

"SELECT * FROM MYTABLE WHERE MYDATE >= CONVERT(nvarchar(10), getdate(), 101)"
This method allows SQL Server to do the work of substituting the current date from the server into the query predicate. However, if you need to parse a string from a date value in an expression, take apart one of the following formulas in this section to save you a bit of time:
Description | Expression
Convert a filename with an embedded date (filename format: yyyyMMddHHmmss) into the date-time format MM/dd/yyyy HH:mm:ss | SUBSTRING(@[User::FileName],5,2) + "/" + SUBSTRING(@[User::FileName],7,2) + "/" + SUBSTRING(@[User::FileName],1,4) + " " + SUBSTRING(@[User::FileName],9,2) + ":" + SUBSTRING(@[User::FileName],11,2) + ":" + SUBSTRING(@[User::FileName],13,2)
Convert a date-time variable to a filename in the format yyyyMMddHHmmss | (DT_WSTR,4) YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2) MONTH(GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DAY( GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DATEPART("hh", GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DATEPART("mi", GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DATEPART("ss", GETDATE()), 2)
This section covered most of the major syntactical issues that new users are likely to encounter with the expression language. The issues that have caused SSIS programmers the most trouble should not be a problem for you. Now you are ready to create some expressions and walk through the process of inserting them into SSIS packages to put them to work.
Using Expressions in SSIS Packages

Creating an expression requires understanding the syntax of the SSIS expression language. As discussed in the previous section, this expression language is part C#, part Visual Basic script, and sometimes some flavor of T-SQL mixed in. Once you can code in the expression language, you are ready to put the expressions to work. This section demonstrates how expressions can be used in SSIS package development, with some typical examples that you can use in your package development tasks. You can download the packages used as examples in the following sections in their entirety by going to www.wrox.com/go/prossis2014.
Using Variables and Parameters as Expressions

Earlier in this chapter, you learned how to use expressions in variables. A good example of practical usage is to handle the task of processing files in a directory. This task should be familiar to everyone. A directory must be polled for files of a specific extension. If a file is found, it is processed and then copied into a final archival storage directory. An easy way to do this is to hardcode the source and destination paths along with the file extension into the Foreach Loop Container and File System Task. However, if you need to use a failover file server, you have to go through the package and change all these settings manually. It is much easier to use parameters that enable these properties to be set and then use expressions to create the destination directory and filenames. That way, when the server changes, only the parameter needs to change. The basic steps for such an SSIS package can be gleaned from Figure 5-12.
1. Retrieve the source directory from the parameter.
2. Retrieve the filename from the source directory.
3. Create the new destination filename using an expression.
4. Move the file from the source directory to the destination directory.
Figure 5-12
One of the things to notice in the expression of the BankFileDestinationFile is the namespace named UserExp. While there is an indicator on the variable icon to indicate whether it is an expression, it may behoove you to make the purpose of the variable even clearer using the Namespace column, which provides a nice way to separate variables. In this case, the namespace UserExp indicates that the variable is a user expression-based variable. The namespace UserVar indicates that the variable is defined by the user. For this package, the Foreach Loop Container Collection tab is set by an expression to retrieve the folder (or directory) from the variable BankFileSourcePath. This variable is statically defined either from configuration processes or manually by an administrator. This tells the Foreach Loop where to start looking for files. To enumerate files of a specific extension, an expression sets the FileSpec property to the value of the variable BankFileExtension, which is also a static variable. Nothing very complicated here except that the properties of the Foreach Loop are set by expressions, rather than hardcoded values. The container looks like what is shown in Figure 5-13.
Figure 5-13
Notice that the Foreach Loop Container is retrieving the filename only. This value is stored in the variable BankFileName. This isn't shown in Figure 5-13, but it would be shown in the Variable Mappings tab. With the raw filename, no extension, and no path, some variables set up as expressions can be used to create a destination file that is named using the current date. First, you need a destination location. The source folder is known, so use this folder as a starting point to create a subfolder called "archive" by creating a variable named BankFileDestinationFolder that has the property EvaluateAsExpression set to True and defined by this expression:

@[UserVar::BankFileSourcePath] + "\\archive"
You need the escape sequence to properly build a string path. Now build a variable named BankFileDestinationFile that will use this BankFileDestinationFolder value along with a date-based expression to put together a unique destination filename. The expression would look like this:

@[UserExp::BankFileDestinationFolder] + "\\" + (DT_WSTR,4) YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2) MONTH(GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DAY( GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DATEPART("hh", GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DATEPART("mi", GETDATE()), 2) + RIGHT("0" + (DT_WSTR,2) DATEPART("ss", GETDATE()), 2) + @[UserVar::BankFileExtension]
When evaluated, the expression results in a destination filename that looks like c:\BankFileSource\Archive\20140101154006.txt when the bank file destination folder is c:\BankFileSource\Archive. By using variables that evaluate to the value of an expression,
combined with information set statically from administrator and environmental variables like the current date and time, you can create packages with dynamic capabilities. Another best practice is to use expression-based variables to define common logic that you'll use throughout your SSIS package. If in the preceding example you wanted to use this date-based string in other places within your package, you could define the date portion of that expression in a separate variable called DateTimeExpression. Then the expression for the BankFileDestinationFile variable could be simplified to look like this:

@[UserExp::BankFileDestinationFolder] + "\\" + @[UserExp::DateTimeExpression] + @[UserVar::BankFileExtension]
The power in separating logic like this is that an expression need not be buried in multiple places within an SSIS package. This makes maintenance for SSIS packages much easier and more manageable.
Using Expressions in Connection Manager Properties

Another simple example of using expressions is to dynamically change or set properties of an SSIS component. One of the common uses of this technique is to create a dynamic connection that enables packages to be altered by external or internal means. In this example, assume a scenario in which all logins are duplicated across environments. This means you need to change only the server name to make connections to other servers. To start, create a variable named SourceServerNamedInstance that can be used to store the server name to which the package should connect for source data. Then create any connection manager in the Connection Managers section of the SSIS package and press F4, or right-click and select Properties to get to the Properties window for the connection object. The Properties window should look like Figure 5-14. The secret here is the Expressions collection property. If you click this property, an ellipsis will be displayed. Clicking that button will bring up the Property Expressions Editor shown in Figure 5-15, where you can see the properties that can be set to an expression, and ultimately do so in the Expression Builder.
Figure 5-14
Figure 5-15
This example completes the demonstration by setting the ServerName property to an expression that is simply the value of the SourceServerNamedInstance variable. Here you affected only one property in the connection string, but this is not the only option. The entire connection string, as you may have noticed in the Property dropdown, can be set by a string-based expression. This same technique can be used to set any connection property in the Data Flow components as well, to dynamically alter flat file and MS Excel-based connections. A common use is to set the connection source for a Data Flow component to a variable-based incoming filename.
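As a hedged sketch of the full-connection-string option (the provider, database name, and security settings shown here are placeholders and will differ in a real environment), an expression on the ConnectionString property might look like this:

"Data Source=" + @[User::SourceServerNamedInstance] + ";Initial Catalog=AdventureWorks;Provider=SQLNCLI11.1;Integrated Security=SSPI;"

Setting just the ServerName property, as in the example above, is often the safer choice because the rest of the connection string stays under the connection manager's control.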
Using Expressions in Control Flow Tasks

A common example of using expressions in Control Flow Tasks is to create SQL statements dynamically that are run by the Execute SQL Task. The Execute SQL Task has a property called SQLStatement that can be set to a file connection, a variable, or direct input. Instead of creating parameterized SQL statements that are subject to error and OLE provider interpretation, you can try building the SQL statement using an expression and putting the whole SQL statement into the SQLStatement property. This section walks through an example like this using a DELETE statement that should run at the start of a package to delete any data from a staging table that has the same RunJobID (a theoretical identifier for a unique SSIS data load). To start, create one variable for the DELETE statement that doesn't include the dynamic portion of the SQL statement. A variable named DeleteSQL of type String would be set to the string value:

DELETE FROM tblStaging WHERE RunJobId =
Create another variable named DeleteSQL_RunJobId with the data type of Int32 to hold the value of a variable RunJobId. This value could be set elsewhere in the SSIS package. In the Execute SQL Task, bring up the editor and ensure that SQLSourceType is set to DirectInput. You could also set this value to use a variable if you built the SQL statement in its entirety within an expression-based variable. In this example, you'll build the expression in the Execute SQL Task. To get there, click the Expressions tab of the editor to view the Expressions collection property. Click the ellipsis to access the Property Expressions Editor and build the SQL statement using the two variables that you defined. Make sure you use the casting operator to build the string like this:

@[UserVar::DeleteSQL] + (DT_WSTR, 8) @[UserVar::DeleteSQL_RunJobId]
The completed Execute SQL Task Property Expressions Editor will look like Figure 5-16. When the package runs, the expression will combine the values from both variables and construct a complete SQL statement that will be inserted into the SqlStatementSource property for the Execute SQL Task. This technique is more modular and works more consistently across the different OLE DB providers for dynamic query formation and execution than hardcoding the SQL statement. With this method it is possible to later define and then reconstruct your core SQL using initialization configurations. It is also a neat technique to show off your command of expressions and variables.
Figure 5-16
Using Expressions in Control Flow Precedence

Controlling the flow of SSIS packages is one of the key strengths of dynamic package development. Between each Control Flow Task is a precedence constraint that can have expressions attached to it for evaluation purposes. You can visually identify the precedence constraint as the arrow that connects two Control Flow Tasks. During runtime, as one Control Flow Task completes, the precedence constraint is evaluated to determine if the flow can continue to the next task. One common scenario is a single package that may have two separate sequence operations that need to occur based on some external factor. For example, on even days one set of tasks runs, and on odd days another separate set of tasks runs. A visual example of this type of package logic is shown in Figure 5-17.

Figure 5-17

This is an easy task to perform by using an expression in a precedence constraint. To set this up, define a Boolean variable as an expression called GateKeeperSequence. Make sure the variable is in the namespace UserExp to indicate that this variable is an expression-based variable. Set the expression to this formula:

DATEPART( "dd", GetDate() ) % 2
This expression takes the current day of the month and uses the modulus operator to leave the remainder as a result. Use this value in the precedence constraint to determine which sequence to run in the package. The sequence on even days should run if GateKeeperSequence returns 0 as a result, indicating that the current day of the month is evenly
divisible by two. Right-click the precedence constraint and select Edit to access the editor, and set it up to look like Figure 5-18. The expression @[UserExp::GateKeeperSequence]==0 is a Boolean expression that tests the result of the first expression to determine whether the value is equal to zero. For example, on the 14th of the month the DATEPART expression returns 14, the modulus evaluates to 0, and the even-day sequence runs. The second sequence should execute only if the current day is an odd day, so the second precedence constraint needs an expression that looks like this:

    @[UserExp::GateKeeperSequence]!=0
By factoring the first expression into a separate expression-based variable, you can reuse the same expression in both precedence constraints. This improves the readability and maintenance of your SSIS packages. With this example, you can see how a package can have sections that are conditionally executed. This same technique can also be employed to run Data Flow Tasks or other Control Flow Tasks conditionally using Boolean expressions. Refer back to the section "Boolean Expressions" if you need to review some other examples.
Using the Expression Task

Recently introduced in SQL Server 2012, the Expression Task enables you to set the value of variables in the Control Flow. If you are thinking, "But I could do that before!" you are absolutely correct. In previous editions of Integration Services, you could change a variable's value by using the EvaluateAsExpression property or by using a Script Task, as previously described in this chapter. The beauty of the Expression Task is that it specifically calls out when in the package this change occurs. This means that you can make a similar change multiple times (for example, by placing an Expression Task in a Foreach Loop Container), and it is easier to see where the variable you are using is changed. The following example shows how to use an Expression Task to create an iterator variable that counts the number of times you loop through a section of the package. Begin by dragging an Expression Task from the SSIS Toolbox onto the Control Flow and editing it. The Expression Builder that appears limits you to variables and parameters because you are in the Control Flow designer. You assign a value to a variable by using the equals sign, so the final formula will look like this:

    @[User::Iterator] = @[User::Iterator] + 1
A completed version of the property window is shown in Figure 5-19.
Figure 5-19
Using Expressions in the Data Flow

Although you can set properties on some of the Data Flow components, a typical use of an expression in a Data Flow is to alter a WHERE clause on a source component. In this example, you'll alter the SQL query in a source component using a supplied date as a variable to pull address information from the AdventureWorks database. Then you'll use an expression to build a derived column that can be used for address labels. First, set up these variables at the Data Flow scope level by selecting the Data Flow Task before creating the variables:
Variable Name              Data Type    Namespace    Description
BaseSelect                 String       UserVar      Contains base Select statement
SelectSQL_UserDateParm     DateTime     UserVar      Contains supplied date parm
SelectSQL                  String       UserExp      Derived SQL to execute
SelectSQL_ExpDateParm      String       UserExp      Safe Date Expression
Notice that the BaseSelect and SelectSQL_UserDateParm variables use the namespace UserVar. As described previously, these namespaces make it clear which variables are expression-based and which are not. Provide the following values for these variables:
Variable Name             Value
BaseSelect                SELECT AddressLine1, AddressLine2, City, StateProvinceCode, PostalCode FROM Person.Address adr INNER JOIN Person.StateProvince stp ON adr.StateProvinceID = stp.StateProvinceID WHERE adr.ModifiedDate >=
SelectSQL_UserDateParm    1/12/2000
Note that you need to put the value from the BaseSelect variable into one continuous line to get it all into the variable. Make sure the entire string is in the variable value before continuing. The remaining variables need to be set up as expression-based variables. At this point, you should be proficient at this. Set the EvaluateAsExpression property to True and prepare to add the expressions to each. Ultimately, you need a SQL string that contains the date from the SelectSQL_UserDateParm, but using dates in strings by just casting the date to a string can produce potentially unreliable results, especially if you are given the string in one culture and you are querying data stored in another collation. This is why the extra expression variable SelectSQL_ExpDateParm exists. This safe date expression looks like this:

    (DT_WSTR, 4) DATEPART("yyyy", @[UserVar::SelectSQL_UserDateParm]) + "-" +
    (DT_WSTR, 2) DATEPART("mm", @[UserVar::SelectSQL_UserDateParm]) + "-" +
    (DT_WSTR, 2) DATEPART("dd", @[UserVar::SelectSQL_UserDateParm]) + " " +
    (DT_WSTR, 2) DATEPART("hh", @[UserVar::SelectSQL_UserDateParm]) + ":" +
    (DT_WSTR, 2) DATEPART("mi", @[UserVar::SelectSQL_UserDateParm]) + ":" +
    (DT_WSTR, 2) DATEPART("ss", @[UserVar::SelectSQL_UserDateParm])
The expression parses out all the pieces of the date and creates an ISO-formatted date in a string format that can now be appended to the base SELECT SQL string. This is done in the last expression-based variable, SelectSQL. The expression looks like this:

    @[UserVar::BaseSelect] + "'" + @[UserExp::SelectSQL_ExpDateParm] + "'"
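To make this concrete, suppose SelectSQL_UserDateParm is interpreted as midnight on January 12, 2000 (how the literal 1/12/2000 is parsed depends on regional settings, which is exactly the ambiguity the safe date expression avoids). SelectSQL would then evaluate to a string along these lines (note that DATEPART returns unpadded integers):

    SELECT AddressLine1, AddressLine2, City, StateProvinceCode, PostalCode FROM Person.Address adr INNER JOIN Person.StateProvince stp ON adr.StateProvinceID = stp.StateProvinceID WHERE adr.ModifiedDate >= '2000-1-12 0:0:0'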
With all the pieces to create the SQL statement in place, all you need to do is apply the expression in a data source component. Drop an OLE DB Source component connected to the AdventureWorks database on the Data Flow surface, and set the Data access mode to retrieve the data as "SQL command from variable." Set the variable name to the SelectSQL variable. The OLE DB Source Editor should look like Figure 5-20.
Figure 5-20
Click the Preview button to look at the data pulled with the current value of the variable SelectSQL_UserDateParm. Change the value and check whether the data changes as expected. Now the OLE DB source will contain the same columns, but the predicate can be easily and safely changed with a date parameter that is safe across cultures. The final task is to create a one-column output that combines the address fields. Add a Derived Column Transformation to the Data Flow, and a new column of type WSTR, length of 2000, named FullAddress. This column will need an expression that combines the columns of the address to build a one-column output. Remember that we are dealing with Data Flow data here, so it is possible to encounter a NULL value in the data stream. If you simply concatenate every column and a NULL value exists anywhere, the entire string will evaluate to NULL. Furthermore, you don't want addresses that have blank lines in the body, so you only want to add a newline character conditionally after addresses that aren't NULL. Because the data tables involved can only contain NULL values in the two address fields, the final expression looks like this:

    (ISNULL(AddressLine1) ? "" : AddressLine1 + "\n") +
    (ISNULL(AddressLine2) ? "" : AddressLine2 + "\n") +
    City + ", " + StateProvinceCode + " " + PostalCode
The Derived Column Transformation should look similar to Figure 5-21.
Figure 5-21
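To see how the NULL guard behaves, consider a hypothetical input row (the values are illustrative only):

    AddressLine1 = 100 Main St.    AddressLine2 = NULL
    City = Tampa    StateProvinceCode = FL    PostalCode = 33601

    FullAddress evaluates to:
    100 Main St.
    Tampa, FL 33601

Because the middle term returns an empty string when AddressLine2 is NULL, no blank line appears in the label.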
Running this example creates a one-column output containing the combined address field, driven dynamically by a date parameter, with a conditional AddressLine2 line depending on whether the data exists. Using expressions to solve problems like this makes SSIS development seem almost too easy.
Summary

The aim of this chapter was to fill any gaps in your understanding of expressions. Clearly, this feature is powerful; it enables dynamic package development in an efficient way, getting you out of the code and into productive solutions. However, expressions can be frustrating if you don't pay attention to the data types and whether you are working with data in variables or in the Data Flow. This chapter has described the "gotchas" that typically plague SSIS developers, along with their solutions, so that you don't have to experience them. Along the way, we consolidated the common questions, answers, and best practices we've learned about using expressions, making them available to you in one chapter. We discussed how you can set variables programmatically and use scripting tasks and components to further the SSIS dynamic package capabilities. We also covered how to use variables and parameters in expressions to create dynamic properties. There are still scenarios in which expressions don't fit the bill, and for these, scripting tasks can save the day. Stay tuned for Chapter 9, where you explore scripting tasks in both the Control Flow and Data Flow roles and expand your SSIS capabilities. The next chapter switches over to discuss the Control Flow, specifically containers.
6
Containers

What's in This Chapter?

➤ Learning when to use containers
➤ Working with Sequence Containers
➤ Working with the For Loop Container
➤ Using a Foreach Loop Container to iterate through a list

Wrox.com Downloads for This Chapter

You can find the wrox.com code downloads for this chapter at www.wrox.com/go/prossis2014 on the Download Code tab.
In Chapter 3, you read about tasks and how they interact in the Control Flow. Now we’re going to cover one of the special types of items in the control flow: containers. Containers are objects that help SSIS provide structure to one or more tasks. They can help you loop through a set of tasks until a criterion has been met or group a set of tasks logically. Containers can also be nested, containing other containers. They are set in the Control Flow tab in the Package Designer. There are three types of containers in the Control Flow tab: Sequence, For Loop, and Foreach Loop Containers.
Task Host Containers

The Task Host Container is the default container under which single tasks fall and is used only behind the scenes by SSIS. You won't see this type of container in the Toolbox in Visual Studio; it is implicitly assigned to each task. In fact, even if you don't specify a container for a task, it will be placed in a Task Host Container. The SSIS architecture extends variables and event handlers to the task through the Task Host Container.
Sequence Containers

Sequence Containers handle the flow of a subset of a package and can help you divide a package into smaller, more manageable pieces. Some nice applications that you can use Sequence Containers for include the following:

➤ Grouping tasks so that you can disable a part of the package that's no longer needed
➤ Narrowing the scope of the variable to a container
➤ Managing the properties of multiple tasks in one step by setting the properties of the container
➤ Using one method to ensure that multiple tasks have to execute successfully before the next task executes
➤ Creating a transaction across a series of data-related tasks, but not on the entire package
➤ Creating event handlers on a single container, wherein you could send an e-mail if anything inside one container fails and perhaps page if anything else fails
Sequence Containers show up like any other task in your Control Flow tab. Once you drag and drop any container from your SSIS Toolbox onto the design pane, you just drag the tasks you want to use into the container. Figure 6-1 shows an example of a Sequence Container in which two tasks must execute successfully before the task called Run Script 3 will execute. If you click the up-pointing arrow at the top of the container, the tasks inside the container will minimize. A container can be considered a miniature package. Inside the container, all task names must be unique, just as they must be within a package that has no containers. You cannot connect a task in one container to anything outside of the container. If you try to do this, you will receive the following error:

    Cannot create connector. Cannot connect executables from different containers.

Figure 6-1
Containers such as the Sequence Container can also be nested in each other. As a best practice, each of your SSIS packages should contain a series of containers to help organize the package and to make it easy to disable subject areas quickly. For example, each set of tables that you must load probably fits into a subject area, such as Accounting or HR. Each of these loads should be placed in its own Sequence Container. Additionally, you may want to create a Sequence Container for the preparation and cleanup stages of your package. This becomes especially handy if you want to execute all the tasks in the container at one time by right-clicking on the container and selecting Execute Container.
Groups

Groups are not actually containers but simply a way to group components together. A key difference between groups and containers is that properties cannot be delegated through a group. Because of this, groups don't have precedence constraints originating from them (only from the tasks inside them), and you cannot disable an entire group as you can a Sequence Container. Groups are useful only for quickly and aesthetically compartmentalizing components in either a Control Flow or a Data Flow. To create a group, highlight the tasks that you wish to place in the group, right-click, and select Group. To ungroup the tasks, right-click the group and select Ungroup. To add additional tasks, simply drag the task into the group. This can be done in the Control Flow or Data Flow. The same type of logic you saw in the Sequence Container in Figure 6-1 is also shown in Figure 6-2. In this case, the precedence constraint originates from the task Run Script 2 to the task Run Script 3. You should always consider using Sequence Containers instead of groups, because they provide a lot more functionality.
Figure 6-2

For Loop Container
The For Loop Container enables you to create looping in your package similar to how you would loop in nearly any programming language. In this looping style, SSIS optionally initializes an expression and continues to evaluate it until the expression evaluates to false. In the example in Figure 6-3, the Script Task called Wait for File to Arrive is continuously looped through until a condition is evaluated as false. Once the loop is broken, the Script Task is executed. For another real-world example, use a Message Queue Task inside the loop to continuously loop until a message arrives in the queue. Such a configuration would allow for scaling out your SSIS environment.
Figure 6-3
The following simple example demonstrates the functionality of the For Loop Container, whereby you’ll use the container to loop over a series of tasks five times. Although this example is rudimentary, you can plug in whatever task you want in place of the Script Task.
1.
Create a new SSIS project called Chapter 6, and change the name of the default package to ForLoopContainer.dtsx.
2.
Open the ForLoopContainer.dtsx package, create a new variable, and call it Counter. You may have to open the Variables window if it isn’t already open. To do this, right-click in the design pane and select Variables or click on the Variables icon on the top right of your package designer screen. Once the window is open, click the Add Variable button. Accept all the defaults for the variable (int32) and a default value of 0.
3.
Drag the For Loop Container to the Control Flow and double-click it to open the editor. Set the InitExpression option to @Counter = 0. This will initialize the loop by setting the Counter variable to 0. Next, in the EvalExpression option, type @Counter < 5, and type @Counter = @Counter + 1 for the AssignExpression (shown in Figure 6-4). This means that the loop will iterate as long as the Counter variable is less than 5; each time it loops, 1 will be added to the variable. Click OK.
Figure 6-4
4.
Drag a Script Task into the For Loop Container and double-click the task to edit it. In the General tab, name the task Pop Up the Iteration.
5.
In the Script tab, set the ReadOnlyVariables (see Figure 6-5) to Counter and select Microsoft Visual Basic 2012. Finally, click Edit Script to open the Visual Studio designer. By typing Counter for that option, you’re going to pass in the Counter parameter to be used by the Script Task.
Figure 6-5
6.
When you click Edit Script, the Visual Studio 2012 design environment will open. Replace the Main() subroutine with the following code. This code will read the variable and pop up a message box that displays the value of the Counter variable:

    Public Sub Main()
        '
        ' Add your code here
        '
        MessageBox.Show(Dts.Variables("Counter").Value)
        Dts.TaskResult = ScriptResults.Success
    End Sub
Close the script editor and click OK to save the Script Task Editor.
7.
Drag over a Data Flow Task and name it Load File. Connect the success precedence constraint from the For Loop Container to this task. The task won't do anything, but it shows how the container can call another task.
8.
Save and exit the Visual Studio design environment, then click OK to exit the Script Task. When you execute the package, you should see results similar to what is shown in Figure 6-6. That is, you should see five pop-up boxes one at a time, starting at iteration 0 and proceeding
through iteration 4. Only one pop-up will appear at any given point. The Script Task will turn green and then back to yellow as it transitions between each iteration of the loop. After the loop is complete, the For Loop Container and the Script Task will both be green.

Figure 6-6
Foreach Loop Container

The Foreach Loop Container is a powerful looping mechanism that enables you to loop through a collection of objects. As you loop through the collection, the container assigns the value from the collection to a variable, which can later be used by tasks or connections inside or outside the container. You can also map the value to a variable. The types of objects that you will loop through vary based on the enumerator you set in the editor in the Collection page. The properties of the editor vary widely according to what you set for this option:

➤ Foreach File Enumerator: Performs an action for each file in a directory with a given file extension
➤ Foreach Item Enumerator: Loops through a list of items that are set manually in the container
➤ Foreach ADO Enumerator: Loops through a list of tables or rows in a table from an ADO recordset
➤ Foreach ADO.NET Schema Rowset Enumerator: Loops through an ADO.NET schema
➤ Foreach From Variable Enumerator: Loops through an SSIS variable
➤ Foreach Nodelist Enumerator: Loops through a node list in an XML document
➤ Foreach SMO Enumerator: Enumerates a list of SQL Management Objects (SMO)
The most important of the enumerators is the Foreach File Enumerator, because it is the one used most frequently. In this next example, you'll see how to loop over a number of files and perform an action on each file. The second most important enumerator is the Foreach ADO Enumerator, which loops over records in a table.
Foreach File Enumerator Example

The following example uses the most common type of enumerator: the Foreach File Enumerator. You'll loop through a list of files and simulate some type of action that has occurred inside the container. This example has been simplified in an effort to highlight the core functionality, but if you would like a more detailed example, turn to Chapter 8, which has an end-to-end example. For this example to work, you need a folder full of some dummy files, which SSIS will enumerate through, and an archive folder to move them into. The folder can contain any type of file.
1.
To start, create a new package called ForEachFileEnumerator.dtsx. Then create a string variable called sFileName with a default value of the word default. This variable will hold the name of the file that SSIS is working on during each iteration of the loop.
2.
Create the variable by right-clicking in the Package Designer area of the Control Flow tab and selecting Variables. Then, click the Add New Variable option, changing the data type to a String.
3.
Next, drag a Foreach Loop Container onto the Control Flow and double-click on the container to configure it, as shown in Figure 6-7. Set the Enumerator option to Foreach File Enumerator.
Figure 6-7
4.
Then, set the Folder property to the folder that has the dummy files in it and leave the default Files property of *.*. In this tab, you can store the filename and extension (Readme.txt), the fully qualified filename (c:\directoryname\readme.txt), or just the filename without the extension (readme). You can also tell the container to loop through all the files in subfolders by checking Traverse Subfolders.
5.
Click the Variable Mappings tab on the left, select the variable you created earlier from the Variable dropdown box, and then accept the default of 0 for the index, as shown in Figure 6-8. Click OK to save the settings and return to the Control Flow tab in the Package Designer.
Figure 6-8
6.
Drag a new File System Task into the container's box. Double-click the new task to configure it in the editor that appears. After setting the operation to Copy file, the screen's properties should look like what is shown in Figure 6-9. In the DestinationConnection property, choose the option to create a new connection.
Figure 6-9
7.
When the Connection Manager dialog opens, select Existing Folder and type an archive folder of your choosing, such as C:\ProSSIS\Containers\ForEachFile\Archive.
8.
Lastly, set the IsSourcePathVariable property to True and set the SourceVariable to User::sFileName.
9.
You're now ready to execute the package. Place any set of files you wish into the folder that you configured the enumerator to loop over, and then execute the package. During execution, you'll see each file picked up and moved in Windows Explorer, and in the package you'll see something resembling Figure 6-10. If you had set the OverwriteDestination property to True in the File System Task, the file would be overwritten if there was a conflict of duplicate filenames.
Figure 6-10
Foreach ADO Enumerator Example

The Foreach ADO Enumerator loops through a collection of records and will execute anything inside the container for each row that is found. For example, if you had a table such as the following
that contained metadata about your environment, you could loop over that table and reconfigure the package for each iteration of the loop. This reconfiguration is done with SSIS expressions. We cover these in much more depth in Chapter 5. At a high level, the expression can reconfigure the connection to files, databases, or variables (to name just a few items) at runtime and during each loop of the container; a sketch of such an expression follows the table.

Client     FTPLocation       ServerName    DatabaseName
Client1    C:\Client1\Pub    localhost     Client1DB
Client2    C:\Client2\Pub    localhost     Client2DB
Client3    C:\Client3\Pub    localhost     Client3DB
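As a rough sketch of that reconfiguration (the variable names User::FTPLocation and User::DatabaseName are hypothetical here, assuming the table's columns have been mapped into package variables by the Foreach ADO Enumerator), you could attach property expressions such as these:

    ConnectionString property of a Flat File Connection Manager:
        @[User::FTPLocation] + "\\Extract.csv"

    InitialCatalog property of an OLE DB Connection Manager:
        @[User::DatabaseName]

Each pass through the loop re-evaluates these expressions, so the same tasks run against a different folder and database for every client row.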
The first time through the loop, your Connection Managers would point to Client1, and retrieve their files from one directory. The next time, the Connection Managers would point to another client. This enables you to create a single package that will work for all your clients. In this next example, you will create a simple package that simulates this type of scenario. The package will loop over a table and then change the value for a variable for each row that is found. Inside the container, you will create a Script Task that pops up the current variable’s value.
1.
To start the example, create a new package called ForeachADOEnumerator.dtsx. Create a new OLE DB Connection Manager called MasterConnection that points to the master database on your development machine. Then create two variables: sDBName, a String with no default value, and objResults, which uses the Object data type.
2.
Next, drag over an Execute SQL Task. You’ll use the Execute SQL Task to populate the ADO recordset that is stored in a variable.
3.
In the Execute SQL Task, point the Connection property to the MasterConnection Connection Manager. Change the ResultSet property to Full Result Set, which captures the results of the query run in this task to a result set in a variable. Type the following query for the SQLStatement property (as shown in Figure 6-11):

    Select database_id, name from sys.databases
4.
Still in the Execute SQL Task, go to the Result Set page and type 0 for the Result Name, as shown in Figure 6-12. This is the zero-based ordinal position for the result that you want to capture into a variable. If your previously typed query created multiple recordsets, you could capture each one here by providing its ordinal position. Map this recordset to the objResults variable, which is scoped to the package and uses the Object data type. The Object data type can store up to 2GB of data in memory. If you don't use an Object variable here, the package will fail upon execution, because an object variable is the only way to store a recordset in memory in SSIS.
Figure 6-11
Figure 6-12
5.
Back in the Control Flow tab, drag over a ForEach Loop Container and drag the green line from the Execute SQL Task to the Foreach Loop. Now open the Foreach Loop to configure the container. In the Collection page, select Foreach ADO Enumerator from the Enumerator dropdown box. Next, select the objResults variable from the ADO Object Source Variable dropdown, as seen in Figure 6-13. This tells the container that you wish to loop over the results stored in that variable.
Figure 6-13
6.
Go to the Variable Mappings page for the final configuration step of the container. Just like the Foreach File Enumerator, you must tell the container where you wish to map the value retrieved from the ADO result set. Your result set contains two columns, database_id and name, from the sys.databases table.
7.
In this example, you are working with the second column, so select the sDBName variable from the Variable dropdown and type 1 for the Index (shown in Figure 6-14). Entering 1 means you want the second column in the result set, as the index starts at 0 and increments by one for each column to the right. Because of this behavior, be careful if you change the Execute SQL Task's query.
Figure 6-14
8.
With the container now configured, drag a Script Task into the container’s box. In the Script tab of the Script Task, set ReadOnlyVariables to sDBName and select Microsoft Visual Basic 2012.
9.
Finally, click Edit Script to open the Visual Studio designer. By typing sDBName for the ReadOnlyVariables option in the Script tab, you're going to pass the sDBName variable into the Script Task.
10.
When you click Edit Script, the Visual Studio 2012 design environment will open. Double-click ScriptMain.vb to open the script, and replace the Main() subroutine with the following code. This one line of code uses the MessageBox method to display the sDBName variable:

    Public Sub Main()
        '
        ' Add your code here
        '
        MessageBox.Show(Dts.Variables("sDBName").Value)
        Dts.TaskResult = ScriptResults.Success
    End Sub
11.
Close the editor and task and execute the package. The final running of the package should look like Figure 6-15, which pops up the value of the sDBName variable, showing you the current database. As you click OK to each pop-up, the next database name will be displayed. In a less contrived example, this Script Task would obviously be replaced with a Data Flow Task to load the client's data.
Figure 6-15
Summary

In this chapter, you explored groups and the four containers in SSIS: the Task Host, the Sequence Container, the For Loop Container, and the Foreach Loop Container.

➤ The Task Host Container is used behind the scenes on any task.
➤ A Sequence Container helps you compartmentalize your various tasks into logical groupings.
➤ The For Loop Container iterates through a loop until a requirement has been met.
➤ A Foreach Loop Container loops through a collection of objects such as files or records in a table.
This chapter covered the most common uses of the Foreach Loop Container: looping through all the records in a table and looping through files. Each of the looping containers will execute all the items in the container each time it iterates through the loop.
7
Joining Data

What's in This Chapter?

➤ The Lookup Transformation
➤ Using the Merge Join Transformation
➤ Building a basic package
➤ Using the Lookup Transformation
➤ Loading Lookup cache with the Cache Connection Manager and Cache Transform

Wrox.com Downloads for This Chapter

You can find the wrox.com code downloads for this chapter at www.wrox.com/go/prossis2014 on the Download Code tab.
In the simplest ETL scenarios, you use an SSIS Data Flow to extract data from a single source table and populate the corresponding destination table. In practice, though, you usually won’t see such trivial scenarios: the more common ETL scenarios will require you to access two or more data sources simultaneously and merge their results together into a single destination structure. For instance, you may have a normalized source system that uses three or more tables to represent the product catalog, whereas the destination represents the same information using a single denormalized table (perhaps as part of a data warehouse schema). In this case you would need to join the multiple source tables together in order to present a unified structure to the destination table. This joining may take place in the source query in the SSIS package or when using a Lookup Transform in an SSIS Data Flow. Another less obvious example of joining data is loading a dimension that would need to have new rows inserted and existing rows updated in a data warehouse. The source data is coming from an OLTP database and needs to be compared to the existing dimension to find the rows
that need updating. Using the dimension as a second source, you can then join the data using a Merge Join Transformation in your Data Flow. The joined rows can then be compared to look for changes. This type of loading is discussed in Chapter 12. In the relational world, such requirements can be met by employing a relational join operation if the data exists in an environment where these joins are possible. When you are creating ETL projects, the data is often not in the same physical database, the same brand of database, the same server, or, in the worst cases, even the same physical location (all of which typically render the relational join method useless). In fact, in one common scenario, data from a mainframe system needs to be joined with data from SQL Server to create a complete data warehouse in order to provide your users with one trusted point for reporting. The ETL solutions you build need to be able to join data in a similar way to relational systems, but they should not be constrained to having the source data in the same physical database. SQL Server Integration Services (SSIS) provides several methods for performing such joins, ranging from functionality implemented in Data Flow Transformations to custom methods implemented in T-SQL or managed code. This chapter explores the various options for performing joins and provides guidelines to help you determine which method you should use for various circumstances and when to use it. After reading this chapter, you should be able to optimize the various join operations in your ETL solution and understand their various design, performance, and resource trade-offs.
The Lookup Transformation

The Lookup Transformation in SSIS enables you to perform joins similar to relational inner and outer hash-joins. The main difference is that the operations occur outside the realm of the database engine and in the SSIS Data Flow. Typically, you would use this component within the context of an integration process, such as the ETL layer that populates a data warehouse from source systems. For example, you may want to populate a table in a destination system by joining data from two separate source systems on different database platforms. The component can join only two data sets at a time, so in order to join three or more data sets, you would need to chain multiple Lookup Transformations together, using an output from one Lookup Transformation as an input for another. Compare this to relational join semantics, whereby in a similar fashion you join two tables at a time and compose multiple such operations to join three or more tables. The transformation is written to behave in a synchronous manner, meaning it does not block the pipeline while it is doing its work. While new rows are entering the Lookup Transformation, rows that have already been processed are leaving through one of four outputs. However, there is a catch here: in certain caching modes (discussed later in this chapter) the component will initially block the package's execution for a period of time while it loads its internal caches with the Lookup data. The component provides several modes of operation that enable you to compare performance and resource usage. In full-cache mode, one of the tables you are joining is loaded in its entirety into memory, and then the rows from the other table are flowed through the pipeline one buffer at a time, and the selected join operation is performed. With no up-front caching, each incoming row in the pipeline is compared one at a time to a specified relational table. Between these two options is a third that combines their behavior. Each of these modes is explored later in this chapter (see the "Full-Cache Mode," "No-Cache Mode," and "Partial-Cache Mode" sections).
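As a minimal sketch of what typically gets cached in full-cache mode (using the AdventureWorks tables that appear later in this chapter), the reference query would normally return only the join key plus the columns you actually need, which keeps the in-memory cache small:

    SELECT CustomerID, AccountNumber
    FROM Sales.Customer;

Each incoming pipeline row is then matched against this cached set on CustomerID rather than triggering a round trip to the database.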
Of course, some rows will join successfully, and some rows will not be joined. For example, consider a customer who has made no purchases. His or her identifier in the Customer table would have no matches in the sales table. SSIS supports this scenario by having multiple outputs on the Lookup Transformation. In the simplest (default/legacy) configuration, you would have one output for matched rows and a separate output for nonmatched and error rows. This functionality enables you to build robust (error-tolerant) processes that, for instance, might direct nonmatched rows to a staging area for further review. Or the errors can be ignored, and a Derived Column Transformation can be used to check for null values. A conditional statement can then be used to add default data in the Derived Column. A more detailed example is given later in this chapter. The Cache Connection Manager (CCM) is a separate component that is essential when creating advanced Lookup operations. The CCM enables you to populate the Lookup cache from an arbitrary source; for instance, you can load the cache from a relational query, an Excel file, a text file, or a Web service. You can also use the CCM to persist the Lookup cache across iterations of a looping operation. You can still use the Lookup Transformation without explicitly using the CCM, but you would then lose the resource and performance gains in doing so. CCM is described in more detail later in this chapter.
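As a sketch of that pattern (the column name and default value here are illustrative only), a Derived Column expression following a Lookup configured to ignore failures might look like this:

    ISNULL(AccountNumber) ? "Unknown" : AccountNumber

Nonmatched rows arrive with a NULL in the looked-up column, so the conditional substitutes a default value instead of letting the NULL flow downstream.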
Using the Merge Join Transformation

The Merge Join Transformation in SSIS enables you to perform an inner or outer join operation in a streaming fashion within the SSIS Data Flow. The Merge Join Transformation does not preload data like the Lookup Transformation does in its cached mode. Nor does it perform per-record database queries like the Lookup Transformation does in its noncached mode. Instead, the Merge Join Transformation accepts two inputs, which must be sorted, and produces a single output, which contains the selected columns from both inputs, and uses the join criteria defined by the package developer. The component accepts two sorted input streams and outputs a single stream that combines the chosen columns into a single structure. It is not possible to configure a separate nonmatched output like the one supported by the Lookup Transformation. For situations in which unmatched records need to be processed separately, a Conditional Split Transformation can be used to find the null values on the nonmatched rows and send them down a different path in the Data Flow. The Merge Join Transformation differs from the Lookup Transformation in that it accepts its reference data via a Data Flow path instead of through direct configuration of the transformation properties. Both input Data Flow paths must be sorted, but the data can come from any source supported by the SSIS Data Flow as long as they are sorted. The sorting has to occur using the same set of columns in exactly the same order, which can create some overhead upstream. The Merge Join Transformation typically uses less memory than the Lookup Transformation because it maintains only the required few rows in memory to support joining the two streams. However, it does not support short-circuit execution, in that both pipelines need to stream their entire contents before the component considers its work done. For example, if the first input has five rows, and the second input has one million rows, and it so happens that the first five rows immediately join successfully, the component will still stream the other 999,995 rows from the second input even though they cannot possibly be joined anymore.
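Because both inputs must arrive sorted on the join columns, one common approach, sketched here under the assumption that both sides come from relational sources so the sort can be pushed into the source queries rather than performed with Sort Transformations, looks like this:

    -- Left input, ordered on the join key
    SELECT SalesOrderID, OrderDate, CustomerID
    FROM Sales.SalesOrderHeader
    ORDER BY SalesOrderID;

    -- Right input, ordered on the same key in the same direction
    SELECT SalesOrderID, SalesOrderDetailID, ProductID, OrderQty
    FROM Sales.SalesOrderDetail
    ORDER BY SalesOrderID;

You would then mark each source's output as sorted (the IsSorted property) and give SalesOrderID a SortKeyPosition of 1 in the Advanced Editor so that the Merge Join Transformation accepts the two inputs.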
Contrasting SSIS and the Relational Join

Though the methods and syntax you employ in the relational and SSIS worlds may differ, joining multiple row sets together using congruent keys achieves the same desired result. In the relational database world, the equivalent of a Lookup is accomplished by joining two or more tables together using declarative syntax that executes in a set-based manner. The operation remains close to the data at all times; there is typically no need to move the data out-of-process with respect to the database engine as long as the databases are on the same SQL Server instance (except when joining across databases, though this is usually a nonoptimal operation). When joining tables within the same database, the engine can take advantage of multiple different internal algorithms, knowledge of table statistics, cardinality, temporary storage, cost-based plans, and the benefit of many years of ongoing research and code optimization. Operations can still complete in a resource-constrained environment because the platform has many intrinsic functions and operators that simplify multistep operations, such as implicit parallelism, paging, sorting, and hashing.

In a cost-based optimization database system, the end-user experience is typically transparent; the declarative SQL syntax abstracts the underlying relational machinations such that the user may not in fact know how the problem was solved by the engine. In other words, the engine is capable of transparently transforming a problem statement as defined by the user into an internal form that can be optimized into one of many solution sets. The end-user experience is usually synchronous and nonblocking; results are materialized in a streaming manner, with the engine effecting the highest degree of parallelism possible. The operation is atomic in that once a join is specified, the operation either completes or fails in total; there are no substeps that can succeed or fail in a way the user would experience independently. Furthermore, it is not possible to receive two result sets from the query at the same time; for instance, if you specified a left join, you could not direct the matches to go one direction and the nonmatches somewhere else. Advanced algorithms allow efficient caching of multiple joins using the same tables; for instance, round-robin read-ahead enables separate T-SQL statements (using the same base tables) to utilize the same caches.

The following relational query joins two tables from the AdventureWorksDW database together. Notice how you join only two tables at a time, using declarative syntax, with particular attention being paid to specification of the join columns:

    select sc.EnglishProductSubcategoryName, p.EnglishProductName
    from dbo.DimProductSubcategory sc
    inner join dbo.DimProduct p
        on sc.ProductSubcategoryKey = p.ProductSubcategoryKey;
For reference purposes, Figure 7-1 shows the plan that SQL Server chooses to execute this join.
Figure 7-1
In SSIS, the data is usually joined using a Lookup Transformation on a buffered basis. The Merge Join Transformation can also be used, though it was designed to solve a different class of patterns. The calculus/algebra for these components is deterministic; the configuration that the user supplies is directly utilized by the engine. In other words, there is no opportunity for the platform to make any intelligent choices based on statistics, cost, cardinality, or count. Furthermore, the data is loaded into out-of-process buffers (with respect to the database engine) and is then treated in a row-by-row manner; therefore, because this moves the data away from the source, you can expect performance and scale to be affected.

Any data moving through an SSIS Data Flow is loaded into memory in data buffers. A batch process is performed on the data in synchronous transformations. The asynchronous transformations, such as the Sort or Aggregate, still perform in batch, but all rows must be loaded into memory before they complete, and therefore bring the pipeline process to a halt. Other transformations, like the OLE DB Command Transformation, perform their work using a row-by-row approach. The end-user experience is synchronous, though in the case of some modes of the Lookup Transformation the process is blocked while the cache loads in its entirety. Execution is nonatomic in that one of multiple phases of the process can succeed or fail independently. Furthermore, you can direct successful matches to flow out of the Lookup Transformation to one consumer, the nonmatches to flow to a separate consumer, and the errors to a third.

Resource usage and performance compete: in Lookup's full-cache mode, which is typically fastest with smaller data sets, the cache is acquired and then remains in memory until the process (package) terminates, and there are no implicit operators (sorting, hashing, and paging) to balance resource usage. In no-cache or partial-cache modes, the resource usage is initially lower because the cache is charged on the fly; however, overall performance will almost always be lower. The operation is explicitly parallel; individual packages scale out if and only if the developer intentionally created multiple pipelines and manually segmented the data. Even then, bringing the data back together with the Union All Transformation, which is partially blocking, can negate any performance enhancement. Benchmark testing your SSIS packages is necessary to determine the best approach. There is no opportunity for the Lookup Transformation to implicitly perform in an SMP (or scale-out) manner. The same applies to the Merge Join Transformation; on suitable hardware it will run on a separate thread from other components, but it will not utilize multiple threads within itself.
Figure 7-2 shows an SSIS package that uses a Lookup Transformation to demonstrate the same functionality as the previous SQL statement. Notice how the Product table is pipelined directly into the Lookup Transformation, but the SubCategory table is referenced using properties on the component itself. It is interesting to compare this package with the query plan generated by SQL Server for the previous SQL query. Notice how in this case SQL Server chose to utilize a hash-join operation, which happens to coincide with the mechanics underlying the Lookup Transformation when used in full-cache mode. The explicit design chosen by the developer in SSIS corresponds almost exactly to the plan chosen by SQL Server to generate the same result set.
Figure 7-2
Figure 7-3 shows the same functionality, this time built using a Merge Join Transformation. Notice how similar this looks to the SQL Server plan (though in truth the execution semantics are quite different).
Figure 7-3

Lookup Features

The Lookup Transformation allows you to populate the cache using a separate pipeline in either the same or a different package. You can use source data from any location that can be accessed by the SSIS Data Flow. This cache option makes it convenient to load a file or table into memory, and this data can be used by multiple Data Flows in multiple packages. Prior to SQL Server 2008, you needed to reload the cache every time it was used. For example, if you had two Data Flow Tasks in the same package and each required the same reference data set, each Lookup Transformation would load its own copy of the cache separately. You can persist the cache to virtual memory or to permanent file storage. This means that within the same package, multiple Lookup Transformations can share the same cache. The cache does not need to be reloaded for multiple Data Flows or if the same Data Flow is executed multiple times during a package execution, such as when the Data Flow Task is executed within a Foreach Loop Container.
You can also persist the cache to a file and share it with other packages. The cache file format is optimized for speed; it can be much faster than reloading the reference data set from the original relational source. Another enhancement in the Lookup Transformation in SQL Server 2008 SSIS is the miss-cache feature. In scenarios where the component is configured to perform the Lookups directly against the database, the miss-cache feature enables you to optimize performance by optionally loading into cache the rows without matching entries in the reference data set. For example, if the component receives the value 123 in the incoming pipeline, but there are no matching entries in the reference data set, the component will not try to find that value in the reference data set again. In other words, the component “remembers” which values it did not find before. You can also specify how much memory the miss-cache should use (expressed as a percentage of the total cache limit, by default 20%). This reduces a redundant and expensive trip to the database. The miss-cache feature alone can contribute to performance improvement especially when you have a very large data set. In the 2005 version of the component, the Lookup Transformation had only two outputsâ•–—╖╉one for matched rows and another that combined nonmatches and errors. However, the latter output caused much dismay with SSIS usersâ•–—╖╉it is often the case that a nonmatch is not an error and is in fact expected. In 2008 and later the component has one output for nonmatches and a separate output for true errors (such as truncations). Note that the old combined output is still available as an option for backward compatibility. This combined error and nonmatching output can be separated by placing a Conditional Split Transformation after the Lookup, but it is no longer necessary because of the separate outputs.
To troubleshoot issues you may have with SSIS, you can add Data Viewers into a Data Flow on the lines connecting the components. Data Viewers give you a peek at the rows in memory. They also pause the Data Flow at the point the data reaches the viewer.
Building the Basic Package

To simplify the explanation of the Lookup Transformation's operation in the next few sections, this section presents a typical ETL problem that is used to demonstrate several solutions using the components configured in various modes. The AdventureWorks database is a typical OLTP store for a bicycle retailer, and AdventureWorksDW is a database that contains the corresponding denormalized data warehouse structures. Both of these databases, as well as some secondary data, are used to represent a real-world ETL scenario. (If you do not have the databases, download them from www.wrox.com.)
The core operation focuses on extracting fact data from the source system (fact data is discussed in Chapter 12); in this scenario you will not yet be loading data into the warehouse itself. Obviously, you would not want to do one without the other in a real-world SSIS package, but it makes it easier to understand the solution if you tackle a smaller subset of the problem by itself. You will first extract sales order (fact) data from the AdventureWorks database, and later you will load it into the AdventureWorksDW database, performing multiple joins along the way. The order information in AdventureWorks is represented by two main tables: SalesOrderHeader and SalesOrderDetail. You need to join these two tables first. The SalesOrderHeader table has many columns that in the real world would be interesting, but for this exercise you will scale down the columns to just the necessary few. Likewise, the SalesOrderDetail table has many useful columns, but you will use just a few of them. Here are the table structures and first five rows of data for these two tables:

SalesOrderID    OrderDate     CustomerID
43659           2001-07-01    676
43660           2001-07-01    117
43661           2001-07-01    442
43662           2001-07-01    227
43663           2001-07-01    510

SalesOrderID    SalesOrderDetailID    ProductID    OrderQty    UnitPrice    LineTotal
43659           1                     776          1           2024.9940    2024.994000
43659           2                     777          3           2024.9940    6074.982000
43659           3                     778          1           2024.9940    2024.994000
43659           4                     771          1           2039.9940    2039.994000
43659           5                     772          1           2039.9940    2039.994000
As you can see, you need to join these two tables together because one table contains the order header information and the other contains the order details. Figure 7-4 shows a conceptual view of what the join would look like.
Figure 7-4
However, this does not get us all the way there. The CustomerID column is a surrogate key that is specific to the source system, and the very definition of surrogate keys dictates that no other system, including the data warehouse, should have any knowledge of them. Therefore, in order to populate the warehouse you need to get the original business (natural) key. Thus, you must join the SalesOrderHeader table (Sales.SalesOrderHeader) to the Customer table (Sales.Customer) in order to find the customer business key called AccountNumber. After doing that, your conceptual join now looks like Figure 7-5.
Figure 7-5
Similarly for Product, you need to add the Product table (Production.Product) to this join in order to derive the natural key called ProductNumber, as shown in Figure 7-6.
Figure 7-6
Referring to Figure 7-7, you can get started by creating a new SSIS package that contains an OLE DB Connection Manager called localhost.AdventureWorks that points to the AdventureWorks database and a single empty Data Flow Task.
Using a Relational Join in the Source

The easiest and most obvious solution in this particular scenario is to use a relational join to extract the data. In other words, you can build a package that has a single source (use an OLE DB Source Component) and set the query string in the source to utilize relational joins. This enables you to take advantage of the benefits of the relational source database to prepare the data before it enters the SSIS Data Flow.
Figure 7-7
Drop an OLE DB Source Component on the Data Flow design surface, hook it up to the localhost.AdventureWorks Connection Manager, and set its query string as follows:

    select
        --columns from Sales.SalesOrderHeader
        oh.SalesOrderID, oh.OrderDate, oh.CustomerID,
        --columns from Sales.Customer
        c.AccountNumber,
        --columns from Sales.SalesOrderDetail
        od.SalesOrderDetailID, od.ProductID, od.OrderQty, od.UnitPrice, od.LineTotal,
        --columns from Production.Product
        p.ProductNumber
    from Sales.SalesOrderHeader as oh
    inner join Sales.Customer as c on (oh.CustomerID = c.CustomerID)
    left join Sales.SalesOrderDetail as od on (oh.SalesOrderID = od.SalesOrderID)
    inner join Production.Product as p on (od.ProductID = p.ProductID);
Note that you can either type this query in by hand or use the Build Query button in the user interface of the OLE DB Source Component to construct it visually. Click the Preview button and make sure that it executes correctly (see Figure 7-8). For seasoned SQL developers, the query should be fairly intuitiveâ•–—╖╉the only thing worth calling out is that a left join is used between the SalesOrderHeader and SalesOrderDetail tables because it is conceivable that an order header could exist without any corresponding details. If an inner join was used here, it would have lost all such rows exhibiting this behavior. Conversely, inner joins were used everywhere else because an order header cannot exist without an associated customer, and a details row cannot exist without an associated product. In business terms, a customer will buy one or (hopefully) more products.
Figure 7-8
Close the preview dialog; click OK on the OLE DB Source Editor UI, and then hook up the Source Component to a Union All Transformation as shown in Figure 7-9, which serves as a temporary destination. Add a Data Viewer to the pipeline in order to watch the data travel through the system. Execute the package in debug mode and notice that the required results appear in the Data Viewer window.
Figure 7-9
The Union All Transformation has nothing to do with this specific solution; it serves simply as a dead end in the Data Flow in order to get a temporary trash destination so that you don’t have to physically land the data in a database or file. This is a great way to test your Data Flows during development; placing a Data Viewer just before the Union All gives you a quick peek at the data. After development you would need to replace the Union All with a real destination. Note that you could also use some other component such as the Conditional Split. Keep in mind that some components, like the Row Count, require extra setup (such as variables), which would make this approach more cumbersome. Third-party tools are also available (such as Task Factory by Pragmatic Works) that have trash destinations for testing purposes only.
Using the Merge Join Transformation

Another way you could perform the join is to use Merge Join Transformations. In this specific scenario it does not make much sense because the database will likely perform the most optimal joins, as all the data resides in one place. However, consider a system in which the four tables you are joining reside in different locations; perhaps the sales and customer data is in SQL Server, and the product data is in a flat file, which is dumped nightly from a mainframe. The following steps explain how you can build a package to emulate such a scenario:
1. Start again with the basic package (refer to Figure 7-7) and proceed as follows. Because you do not have any actual text files as sources, you will create them inside the same package and then utilize them as needed. Of course, a real solution would not require this step; you just need to do this so that you can emulate a more complex scenario.
2. Name the empty Data Flow Task "DFT Create Text Files." Inside this task create a pipeline that selects the required columns from the Product table in the AdventureWorks database and writes the data to a text file. Here is the SQL statement you will need:

select ProductID, ProductNumber from Production.Product;
3. Connect the source to a Flat File Destination, and then configure the Flat File Destination Component to write to a location of your choice on your local hard drive. Make sure you select the delimited option and specify column headers when configuring the destination options, as shown in Figure 7-10. Name the flat file Product.txt.
Figure 7-10
4. Execute the package to create a text file containing the Product data. Now create a second Data Flow Task and rename it "DFT Extract Source." Connect the first and second Data Flow Tasks with a precedence constraint so that they execute serially, as shown in Figure 7-11. Inside the second (new) Data Flow Task, you'll use the Lookup and Merge Join solutions to achieve the same result you did previously.
Figure 7-11

When using the Lookup Transformation, make sure that the largest table (usually a fact table) is streamed into the component, and the smallest table (usually a dimension table) is cached. That's because the table that is cached will block the flow while it is loaded into memory, so you want to make sure it is as small as possible; Data Flow execution cannot begin until all Lookup data is loaded into memory. Because all of that data is loaded into memory, the 3GB process limit on 32-bit systems can become a real challenge. In this case, all the tables are small, but imagine that the order header and details data is the largest, so you don't want to incur the overhead of caching it. Thus, you can use a Merge Join Transformation instead of a Lookup to achieve the same result, without the overhead of caching a large amount of data. In some situations you can't control which table is used as the Lookup reference, because the source data needs to run through multiple Lookups; a good example of such a multiple-Lookup Data Flow is the loading of a fact table.
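As a quick sanity check before deciding which data set to cache, you can compare the row counts of the candidate tables. This query is illustrative only and is not a required step in the exercise:

select 'Sales.SalesOrderHeader' as TableName, count(*) as RowCnt from Sales.SalesOrderHeader
union all select 'Sales.SalesOrderDetail', count(*) from Sales.SalesOrderDetail
union all select 'Sales.Customer', count(*) from Sales.Customer
union all select 'Production.Product', count(*) from Production.Product;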
The simplest solution for retrieving the relational data would be to join the order header and order details tables directly in the Source Component (in a similar manner to that shown earlier). However, the following steps take a more complex route in order to illustrate some of the other options available:
1. Drop an OLE DB Source Component on the design surface of the second Data Flow Task, name it "SRC Order Header," hook it up to the AdventureWorks Connection Manager, and use the following statement as the query:

select SalesOrderID, OrderDate, CustomerID from Sales.SalesOrderHeader;
Of course, you could just choose the Table or View option in the source UI, or use a select * query, and perhaps even deselect specific columns in the Columns tab of the UI. However, these are all bad practices that will usually lead to degraded performance. It is imperative that, where possible, you specify the exact columns you require in the select clause. Furthermore, you should use a predicate (where clause) to limit the number of rows returned to just the ones you need.
2. Confirm that the query executes successfully by using the Preview button, and then hook up a Sort Transformation downstream of the source you have just created. Open the editor for the Sort Transformation and choose to sort the data by the SalesOrderID column, as shown in Figure 7-12. The reason you do this is that you will use a Merge Join Transformation
later, and it requires sorted input streams. (Note that the Lookup Transformation does not require sorted inputs.) Also, an ORDER BY clause in the source would be better for performance, but this example is giving you experience with the Sort Transform.
Figure 7-12
3. To retrieve the SalesOrderDetails data, drop another OLE DB Source Component on the design surface, name it SRC Details, and set its query as follows. Notice how in this case you have included an ORDER BY clause directly in the SQL select statement. This is more efficient than the way you sorted the order header data, because SQL Server can sort it for you before passing it out-of-process to SSIS. Again, you will see different methods to illustrate the various options available:

select SalesOrderID, SalesOrderDetailID, ProductID, OrderQty, UnitPrice, LineTotal
from Sales.SalesOrderDetail
order by SalesOrderID, SalesOrderDetailID, ProductID;
4. Now drop a Merge Join Transformation on the surface and connect the outputs from the two Source Components to it. Specify the input coming from SRC Header (via the Sort Transformation) to be the left input, and the input coming from SRC Details to be the right input. You need to do this because, as discussed previously, you want to use a left join in order to keep rows from the header that do not have corresponding detail records.
After connecting both inputs, try to open the editor for the Merge Join Transformation; you should receive an error stating that "The IsSorted property must be set to True on both sources of this transformation." You get this error because the Merge Join Transformation requires inputs that are sorted exactly the same way. However, you did ensure this by using a Sort Transformation on one stream and an explicit T-SQL ORDER BY clause on the other stream, so what's going on? The simple answer is that the OLE DB Source Component works in a pass-through manner, so it doesn't know that the ORDER BY clause was specified in the second SQL query statement; the metadata returned by SQL Server includes column names, positions, and data types, but it does not include the sort order. By using the Sort Transformation, you forced SSIS to perform the sort, so it is fully aware of the ordering.
In order to remedy this situation, you have to tell the Source Component that its data is presorted. Be very careful when doing this: by specifying the sort order in the following way, you are asking the system to trust that you know what you are talking about and that the data is in fact sorted. If the data is not sorted, or it is sorted other than the way you specified, then your package can act unpredictably, which could lead to data integrity issues and data loss. Use the following steps to specify the sort order:
1. Right-click the SRC Details Component and choose Show Advanced Editor. Select the Input and Output Properties tab, shown in Figure 7-13, and click the Root Node for the default output (not the error output). In the property grid on the right-hand side is a property called IsSorted. Change this to True.
Figure 7-13
2. The preceding step tells the component that the data is presorted, but it does not indicate the order. Therefore, the next step is to select the columns that are being sorted on, and assign them values as follows:
➤➤ If the column is not sorted, then the value should be zero.
➤➤ If the column is sorted in ascending order, then the value should be positive.
➤➤ If the column is sorted in descending order, then the value should be negative.
The absolute value of the number should correspond to the column's position in the order list. For instance, if the query was sorted as follows, "SalesOrderID ascending, ProductID descending," then you would assign the value 1 to SalesOrderID and the value -2 to ProductID, with all other columns being 0. (A short illustration of this mapping appears after these steps.)
3. Expand the Output Columns Node under the same default Output Node, and then select the SalesOrderID column. In the property grid, set the SortKeyPosition value to 1, as shown in Figure 7-14.
Figure 7-14
4. Close the dialog and try again to open the Merge Join UI; this time you should be successful. By default, the component works in inner join mode, but you can change that very easily by selecting (in this case) Left Outer Join from the Join type dropdown (see Figure 7-15). You can also choose a Full Outer Join, which returns all rows from both inputs whether or not they match; depending on the size of the source data, this can have a high memory overhead.
Figure 7-15
5. If you had made a mistake earlier while specifying which input was the left and which was the right, you can click the Swap Inputs button to switch their places. The component will automatically figure out which columns you are joining on based on their sort orders; if it gets it wrong, or there are more columns you need to join on, you can drag a column from the left to the right in order to specify more join criteria. However, the component will refuse any column combinations that are not part of the ordering criteria.
6. Finally, drop a Union All Transformation on the surface and connect the output of the Merge Join Transformation to it. Place a Data Viewer on the output path of the Merge Join Transformation and execute the package. Check the results in the Data Viewer; the data should be joined as required.
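To make the SortKeyPosition rules from step 2 concrete, here is a sketch of a hypothetical ORDER BY clause (not part of this exercise) and the values you would enter in the Advanced Editor for it:

-- Hypothetical sort used only to illustrate SortKeyPosition values
select SalesOrderID, ProductID, OrderQty
from Sales.SalesOrderDetail
order by SalesOrderID asc, ProductID desc;

-- Corresponding SortKeyPosition values in the Advanced Editor:
--   SalesOrderID = 1    (first sort column, ascending)
--   ProductID    = -2   (second sort column, descending)
--   OrderQty     = 0    (not part of the sort)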
Merge Join is a useful component to use when memory limits or data size restricts you from using a Lookup Transformation. However, it requires the sorting of both input streams, which may be challenging to do with large data sets, and by design it does not provide any way of caching either data set. The next section examines the Lookup Transformation, which can help you solve join problems in a different way.
Using the Lookup Transformation

The Lookup Transformation solves joins differently than the Merge Join Transformation. The Lookup Transformation typically caches one of the data sets in memory, and then compares each row arriving from the other data set in its input pipeline against the cache. The caching mechanism is highly configurable, providing a variety of different options in order to balance the performance and resource utilization of the process.
Full-Cache Mode

In full-cache mode, the Lookup Transformation stores all the rows resulting from a specified query in memory. The benefit of this mode is that Lookups against the in-memory cache are very fast, often an order of magnitude or more faster than a no-cache Lookup. Full-cache mode is the default because in most scenarios it has the best performance of all of the techniques discussed in the chapter.

Continuing with the example package you built in the previous section ("Using the Merge Join Transformation"), you will in this section extend the existing package in order to join the other required tables. You already have the related values from the order header and order detail tables, but you still need to map the natural keys from the Product and Customer tables. You could use Merge Join Transformations again, but this example demonstrates how the Lookup Transformation can be of use here:
1. Open the package you created in the previous step. Remove the Union All Transformation. Drop a Lookup Transformation on the surface, name it LKP Customer, and connect the output of the Merge Join Transformation to it. Open the editor of the Lookup Transformation.
2. Select Full-Cache Mode, specifying an OLE DB Connection Manager. There is also an option to specify a Cache Connection Manager (CCM), but you won't use this just yet; later in this chapter you will learn how to use the CCM. (After you have learned about the CCM, you can return to this exercise and try to use it here instead of the OLE DB Connection Manager.)
3. Click the Connection tab and select the AdventureWorks connection, and then use the following SQL query:

select CustomerID, AccountNumber from Sales.Customer;
4. Preview the results to ensure that everything is set up OK, then click the Columns tab. Drag the CustomerID column from the left-hand table over to the CustomerID column on the right; this creates a linkage between these two columns, which tells the component that this column is used to perform the join. Click the checkbox next to the AccountNumber column on the right, which tells the component that you want to retrieve the AccountNumber values from the Customer table for each row it compares. Note that it is not necessary to retrieve the CustomerID values from the right-hand side because you already have them from the input columns. The editor should now look like Figure 7-16.
Figure 7-16
5. Click OK on the dialog, and hook up a "trash" Union All Transformation (refer to Figure 7-9), choosing Lookup Match Output on the dialog that is invoked when you do this. Create a Data Viewer on the match output path of the Lookup Transformation and execute the package (you could also attach a Data Viewer on the no-match output and error output if needed). You should see results similar to Figure 7-17. Notice you have all the columns from the order and details data, as well as the selected column from the Customer table.
Figure 7-17
Because the Customer table is so small and the package runs so fast, you may not have noticed what happened here. As part of the pre-execution phase of the component, the Lookup Transformation fetched all the rows from the Customer table using the query specified (because the Lookup was configured to execute in full-cache mode). In this case there are only 20,000 or so rows, so this happens very quickly. Imagine that there were many more rows, perhaps two million. In this case you would likely experience a delay between executing the package and seeing any data actually traveling down the second pipeline. Figure 7-18 shows a decision tree that demonstrates how the Lookup Transformation in full-cache mode operates at runtime. Note that the Lookup Transformation can be configured to send found and not-found rows to the same output, but the illustration assumes they are going to different outputs. In either case, the basic algorithm is the same.
Figure 7-18
Check the Execution Results tab on the SSIS design surface (see Figure 7-19) and see how long it took for the data to be loaded into the in-memory cache. In larger data sets this number will be much larger and could even take longer than the execution of the primary functionality!
Figure 7-19
If during development and testing you want to emulate a long-running query, use the T-SQL waitfor statement in the query in the following manner.
waitfor delay '00:00:05'; --Wait 5 seconds before returning any rows
select CustomerID, AccountNumber from Sales.Customer;
After fetching all the rows from the specified source, the Lookup Transformation caches them in memory in a special hash structure. The package then continues execution; as each input row enters
the Lookup Transformation, the specified key values are compared to the in-memory hash values, and, if a match is found, the specified return values are added to the output stream.
No-Cache Mode

If the reference table (the Customer table in this case) is too large to cache all at once in the system's memory, you can choose to cache nothing or you can choose to cache only some of the data. This section explores the first option: no-cache mode.

In no-cache mode, the Lookup Transformation is configured almost exactly the same as in full-cache mode, but at execution time the reference table is not loaded into the hash structure. Instead, as each input row flows through the Lookup Transformation, the component sends a request to the reference table in the database server to ask for a match. As you would expect, this can have a high performance overhead on the system, so use this mode with care. Depending on the size of the reference data, this mode is usually the slowest, though it scales to the largest number of reference rows. It is also useful for systems in which the reference data is highly volatile, such that any form of caching would render the results stale and erroneous.

Figure 7-20 illustrates the decision tree that the component uses during runtime. As before, the diagram assumes that separate outputs are configured for found and not-found rows, though the algorithm would be the same if all rows were sent to a single output.
Figure 7-20
Here are the steps to build a package that uses no-cache mode:
1. Rather than build a brand-new package to try out no-cache mode, use the package you built in the previous section ("Full-Cache Mode"). Open the editor for the Lookup Transformation, and on the first tab (General), choose the No-Cache option. This mode also enables you to customize (optimize) the query that SSIS will submit to the relational engine. To do this, click the Advanced tab and check the Modify the SQL Statement checkbox. In this case, the auto-generated statement is close enough to optimal, so you don't need to touch it. (If you have any problems reconfiguring the Lookup Transformation, then delete the component, drop a new Lookup on the design surface, and reconnect and configure it from scratch.)
2. Execute the package. It should take slightly longer to execute than before, but the results should be the same.
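For reference, the statement that the Lookup auto-generates in no-cache (and partial-cache) mode is a parameterized per-row query. The exact text in your Advanced tab may differ slightly, but it typically looks something like the following; the ? is the OLE DB parameter placeholder that SSIS binds for each incoming row, so this statement is not meant to be run as-is in Management Studio:

select * from
  (select CustomerID, AccountNumber from Sales.Customer) [refTable]
where [refTable].[CustomerID] = ?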
The trade-off you make between the caching modes is one of performance versus resource utilization. Full-cache mode can potentially use a lot of memory to hold the reference rows in memory, but it is usually the fastest because Lookup operations do not require a trip to the database. No-cache mode, on the other hand, requires next to no memory, but it’s slower because it requires a database call for every Lookup. This is not a bad thing; if your reference table is volatile (i.e., the data changes often), you may want to use no-cache mode to ensure that you always have the latest version of each row.
Partial-Cache Mode

Partial-cache mode gives you a middle ground between the no-cache and full-cache options. In this mode, the component caches only the most recently used data within the memory boundaries specified under the Advanced tab in the Lookup Transform. As soon as the cache grows too big, the least-used cache data is thrown away.

When the package starts, much like in no-cache mode, no data is preloaded into the Lookup cache. As each input row enters the component, it uses the specified key(s) to attempt to find a matching record in the reference table using the specified query. If a match is found, then both the key and the Lookup values are added to the local cache on a just-in-time basis. If that same key enters the Lookup Transformation again, it can retrieve the matching value from the local cache instead of the reference table, thereby saving the expense and time of requerying the database.

In the example scenario, for instance, suppose the input stream contains a CustomerID of 123. The first time the component sees this value, it goes to the database and tries to find it using the specified query. If it finds the value, it retrieves the AccountNumber and then adds the CustomerID/AccountNumber combination to its local cache. If CustomerID 123 comes through again later, the component will retrieve the AccountNumber directly from the local cache instead of going to the database.

If, however, the key is not found in the local cache, the component will check the database to see if it exists there. Note that the key may not be in the local cache for several reasons: maybe it is the first time it was seen, maybe it was previously in the local cache but was evicted because of memory pressure, or finally, it could have been seen before but was also not found in the database. For example, if CustomerID 456 enters the component, it will check the local cache for the value. Assuming it is not found, it will then check the database. If it finds it in the database, it will add 456 to its local cache. The next time CustomerID 456 enters the component, it can retrieve the value directly from its local cache without going to the database. However, it could also be the case that memory pressure caused this key/value to be dropped from the local cache, in which case the component will incur another database call.

If CustomerID 789 is not found in the local cache, and it is not subsequently found in the reference table, the component will treat the row as a nonmatch, and will send it down the output you have chosen for nonmatched rows (typically the no-match or error output). Every time that CustomerID
789 enters the component, it will go through this same set of operations. If you have a high degree of expected misses in your Lookup scenario, this latter behavior, though proper and expected, can be a cause of long execution times because database calls are expensive relative to a local cache check.

To avoid these repeated database calls while still getting the benefit of partial-cache mode, you can use another feature of the Lookup Transformation: the miss cache. Using the partial-cache and miss-cache options together, you can realize further performance gains. You can specify that the component remembers values that it did not previously find in the reference table, thereby avoiding the expense of looking for them again. This feature goes a long way toward solving the performance issues discussed in the previous paragraph, because ideally every key is looked for once, and only once, in the reference table. To configure this mode, follow these steps (refer to Figure 7-21):
1. Open the Lookup editor, and in the General tab select the Partial Cache option. In the Advanced tab, specify the upper memory boundaries for the cache and edit the SQL statement as necessary. Note that both 32-bit and 64-bit boundaries are available because the package may be built and tested on a 32-bit platform but deployed to a 64-bit platform, which has more memory. Providing both options makes it simple to configure the component's behavior on both platforms.
Figure 7-21
2. If you want to use the miss-cache feature, configure what percentage of the total cache memory you want to use for this secondary cache (say, 25%).
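Before deciding how much memory to give the miss cache, it can help to estimate how many incoming keys will actually miss the reference table. The following illustrative query (not part of the exercise) counts distinct order customers with no match in Sales.Customer; against a clean AdventureWorks database it should return zero, but against real-world data it gives a feel for the expected miss rate:

select count(distinct oh.CustomerID) as CustomersWithoutMatch
from Sales.SalesOrderHeader as oh
left join Sales.Customer as c on (oh.CustomerID = c.CustomerID)
where c.CustomerID is null;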
The decision tree shown in Figure 7-22 demonstrates how the Lookup Transformation operates at runtime when using the partial-cache and miss-cache options. Note that some of the steps are conceptual; in reality, they are implemented using a more optimal design. As per the decision trees shown for the other modes, this illustration assumes separate outputs are used for the found and not-found rows.
Figure 7-22
Multiple Outputs

At this point, your Lookup Transformation is working, and you have learned different ways to optimize its performance using fewer or more resources. In this section, you'll learn how to utilize some of the other features in the component, such as the different outputs that are available. Using the same package you built in the previous sections, follow these steps:
1. Reset the Lookup Transformation so that it works in full-cache mode. It so happens that, in this example, the data is clean and thus every row finds a match, but you can emulate rows not being found by playing quick and dirty with the Lookup query string. This is a useful trick to use at design time in order to test the robustness and behavior of your Lookup Transformations. Change the query statement in the Lookup Transformation as follows:

select CustomerID, AccountNumber
from Sales.Customer
where CustomerID % 7 <> 0; --Remove 1/7 of the rows
2. Run the package again. This time, it should fail to execute fully because the cache contains one-seventh fewer rows than before, so some of the incoming keys will not find a match, as shown in Figure 7-23. Because the default error behavior of the component is to fail on any nonmatch or error condition such as truncation, the Lookup halts as expected.
Figure 7-23
Try some of the other output options. Open the Lookup editor and, on the dropdown listbox in the General tab, choose how you want the Lookup Transformation to behave when it does not manage to find a matching join entry:
➤➤ Fail Component should already be selected. This is the default behavior, which causes the component to raise an exception and halt execution if a nonmatching row is found or a row causes an error such as data truncation.
➤➤ Ignore Failure sends any nonmatched rows and rows that cause errors down the same output as the matched rows, but the Lookup values (in this case AccountNumber) will be set to null. If you add a Data Viewer to the flow, you should be able to see this; several of the AccountNumbers will have null values.
➤➤ Redirect Rows to Error Output is provided for backward compatibility with SQL Server 2005. It causes the component to send both nonmatched and error-causing rows down the same error (red) output.
➤➤ Redirect Rows to No Match Output causes errors to flow down the error (red) output, and no-match rows to flow down the no-match output.
3. Choose Ignore Failure and execute the package. The results should look like Figure 7-24. You can see that the number of incoming rows on the Lookup Transformation matches the number of rows coming out of its match output, even though one-seventh of the rows were not actually matched. This is because the rows failed to find a match, but because you configured the Ignore Failure option, the component did not stop execution.
Figure 7-24
4. Open the Lookup Transformation and this time select "Redirect rows to error output." In order to make this option work, you need a second trash destination on the error output of the Lookup Transformation, as shown in Figure 7-25. When you execute the package using this mode, the found rows will be sent down the match output; unlike the previous modes, not-found rows will not be ignored or cause the component to fail but will instead be sent down the error output.
Figure 7-25
5. Finally, test the "Redirect rows to no match output" mode. You will need a total of three trash destinations for this to work, as shown in Figure 7-26.
In all cases, add Data Viewers to each output, execute the package, and examine the results. The outputs should not contain any errors such as truncations, though there should be many nonmatched rows.

Figure 7-26

So how exactly are these outputs useful? What can you do with them to make your packages more robust? In most cases, the errors or nonmatched rows can be piped off to a different area of the package where the values can be logged or fixed as per the business requirements. For example, one common solution is for all missing rows to be tagged with an Unknown member value. In this scenario, all nonmatched rows might have their AccountNumber set to 0000. These fixed values are then joined back into the main Data Flow and from there treated the same as the rows that did find a match. Use the following steps to configure the package to do this:
1. Open the Lookup editor. On the General tab, choose the "Redirect rows to no match output" option. Click the Error Output tab (see Figure 7-27) and configure the AccountNumber column to have the value Fail Component under the Truncation column. This combination of settings means that you want a no-match output, but you don't want an error output; instead you want the component to fail on any errors. In a real-world scenario, you may want to have an error output that you can use to log values to an error table, but this example keeps it simple.
Figure 7-27
2. At this point, you could drop a Derived Column Transformation on the design surface and connect the no-match output to it. Then you would add the AccountNumber column in the derived column, and use a Union All to bring the data back together. This approach works, but the partially blocking Union All slows down performance.
However, there is a better way to design the Data Flow. Set the Lookup to Ignore Failure. Drop a Derived Column on the Data Flow and connect the match output to the derived column. Open the Derived Column editor and replace the AccountNumber column with the following expression (see Chapter 5 for more details):

ISNULL(AccountNumber) ? (DT_STR,10,1252)"0000" : AccountNumber
The Derived Column Transformation dialog editor should now look something like Figure 7-28.
Figure 7-28
Close the Derived Column editor, and drop a Union All Transformation on the surface. Connect the default output from the Derived Column to the Union All Transformation and then execute the package, as usual utilizing a Data Viewer on the final output. The package and results should look something like Figure 7-29. The output should show AccountNumbers for most of the values, with 0000 shown for those keys that are not present in the reference query (in this case because you artificially removed them).
Figure 7-29
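If you preferred to handle the unknown-member substitution in the source query rather than in the Data Flow, the same logic can be expressed in T-SQL. This sketch is offered only for comparison; it is an alternative approach, not what the exercise above builds:

select oh.SalesOrderID, oh.CustomerID,
       isnull(c.AccountNumber, '0000') as AccountNumber
from Sales.SalesOrderHeader as oh
left join Sales.Customer as c on (oh.CustomerID = c.CustomerID);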
Expressionable Properties

If you need to build a package whose required reference table is not known at design time, this feature will be useful for you. Instead of using a static query in the Lookup Transformation, you can use an expression, which can dynamically construct the query string, or it could load the query string using the parameters feature. Parameters are discussed in Chapter 5 and Chapter 22.
Figure 7-30 shows an example of using an expression within a Lookup Transformation. Expressions on Data Flow Components can be accessed from the property page of the Data Flow Task itself. See Chapter 5 for more details.
Figure 7-30
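As a rough sketch, an expression assigned to the Lookup's SqlCommand property (exposed in the Data Flow Task's Expressions collection, typically as [LKP Customer].[SqlCommand]) might build the query string from a package variable. The variable @[User::TerritoryID] used here is a hypothetical one that you would create yourself:

"select CustomerID, AccountNumber from Sales.Customer where TerritoryID = "
  + (DT_WSTR, 12) @[User::TerritoryID]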
Cascaded Lookup Operations

Sometimes the requirements of a real-world Data Flow call for several Lookup Transformations to get the job done. By using multiple Lookup Transformations, you can sometimes achieve a higher degree of performance without incurring the associated memory costs and processing times of using a single Lookup.

Imagine you have a large list of products that ideally you would like to load into one Lookup. You consider using full-cache mode; however, because of the sheer number of rows, either you run out of memory when trying to load the cache or the cache-loading phase takes so long that it becomes impractical (for instance, the package takes 15 minutes to execute, but 6 minutes of that time is spent just loading the Lookup cache). Therefore, you consider no-cache mode, but the expense of all those database calls makes the solution too slow. Finally, you consider partial-cache mode, but again the expense of the initial database calls (before the internal cache is populated with enough data to be useful) is too high.

The solution to this problem is based on a critical assumption: there is a subset of reference rows (in this case product rows) that are statistically likely to be found in most, if not all, data loads. For instance, if the business is a consumer goods chain, then it's likely that a high proportion of sales transactions are from people who buy milk. Similarly, there will be many transactions for sales of bread, cheese, beer, and baby diapers. On the contrary, there will be a relatively low number of sales for expensive wines. Some of these trends may be seasonal: more suntan lotion
sold in summer, and more heaters sold in winter. This same assumption applies to other dimensions besides products; for instance, a company specializing in direct sales may know historically which customers (or customer segments or loyalty members) have responded to specific campaigns. A bank might know which accounts (or account types) have the most activity at specific times of the month. This statistical property does not hold true for all data sets, but if it does, you may derive great benefit from this pattern. If it doesn't, you may still find this section useful as you consider the different ways of approaching a problem and solving it with SSIS.

So how do you use this statistical approach to build your solution? Using the consumer goods example, if it is the middle of winter and you know you are not going to be selling much suntan lotion, then why load the suntan products in the Lookup Transformation? Rather, load just the high-frequency items like milk, bread, and cheese. Because you know you will see those items often, you want to put them in a Lookup Transformation configured in full-cache mode. If your Product table has, say, 1 million items, then you could load the top 20% of them (in terms of frequency/popularity) into this first Lookup. That way, you don't spend too much time loading the cache (because it is only 200,000 rows and not 1,000,000); by the same reasoning, you don't use as much memory.

Of course, in any statistical approach there will always be outliers; for instance, in the previous example suntan lotion will still be sold in winter to people going on holiday to sunnier places. Therefore, if any Lookups fail on the first full-cache Lookup, you need a second Lookup to pick up the strays. The second Lookup would be configured in partial-cache mode (as detailed earlier in this chapter), which means it would make database calls in the event that the item was not found in its dynamically growing internal cache. The first Lookup's not-found output would be connected to the second Lookup's input, and both of the Lookups would have their found outputs combined using a Union All Transformation in order to send all the matches downstream. Then a third Lookup is used in no-cache mode to look up any remaining rows not found already. This final Lookup output is combined with the others in another Union All. Figure 7-31 shows what such a package might look like.
Figure 7-31
The benefit of this approach is that at the expense of a little more development time, you now have a system that performs efficiently for the most common Lookups and fails over to a slower mode for those items that are less common. That means that the Lookup operation will be extremely efficient for most of your data, which typically results in an overall decrease in processing time.

In other words, you have used the Pareto principle (80/20 rule) to improve the solution. The first (full-cache) Lookup stores 20% of the reference (in this case product) rows and hopefully succeeds in answering 80% of the Lookup requests. This is largely dependent on creating the right query to get the proper 20%; if the wrong data is cached, this can be the worst approach of all. The 20% of Lookups that fail are redirected to, and serviced by, the partial-cache Lookup, which operates against the other 80% of data. Because you are constraining the size of the partial cache, you can ensure you don't run into any memory limitations; at the extreme, you could even use a no-cache Lookup instead of, or in addition to, the partial-cache Lookup.

The final piece to this puzzle is how you identify up front which items occur the most frequently in your domain. If the business does not already keep track of this information, you can derive it by collecting statistics within your packages and saving the results to a temporary location. For instance, each time you load your sales data, you could aggregate the number of sales for each item and write the results to a new table you have created for that purpose. The next time you load the product Lookup Transformation, you join the full Product table to the statistics table and return only those rows whose aggregate count is above a certain threshold. (You could also use the data-mining functionality in SQL Server to derive this information, though the details of that are beyond the scope of this chapter.)
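A minimal sketch of that statistics-driven query follows. The dbo.ProductSalesStats table, its SalesCount column, and the threshold value are all hypothetical; you would create and populate them in your own environment:

-- Load only the high-frequency products into the first (full-cache) Lookup
select p.ProductID, p.ProductNumber
from Production.Product as p
inner join dbo.ProductSalesStats as s on (p.ProductID = s.ProductID)
where s.SalesCount >= 1000;   -- illustrative threshold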
Cache Connection Manager and Cache Transform

The Cache Connection Manager (CCM) and Cache Transform enable you to load the Lookup cache from any source. The Cache Connection Manager is the more critical of the two components; it holds a reference to the internal memory cache and can both read and write the cache to a disk-based file. In fact, the Lookup Transformation internally uses the CCM as its caching mechanism.

Like other Connection Managers in SSIS, the CCM is instantiated in the Connection Managers pane of the package design surface. You can also create new CCMs from the Cache Transformation Editor and Lookup Transformation Editor. At design time, the CCM contains no data, so at runtime you need to populate it. You can do this in one of two ways:
➤➤ You can create a separate Data Flow Task within the same package to extract data from any source and load the data into a Cache Transformation, as shown in Figure 7-32. You then configure the Cache Transformation to write the data to the CCM. Optionally, you can configure the same CCM to write the data to a cache file (usually with the extension .caw) on disk. When you execute the package, the Source Component will send the rows down the pipeline into the input of the Cache Transformation. The Cache Transformation will call the CCM, which loads the data into a local memory cache. If configured, the CCM will also save the cache to disk so you can use it again later. This method enables you to create persisted caches that you can share with other users, solutions, and packages.
➤➤ Alternatively, you can open up the CCM editor and directly specify the filename of an existing cache file (.caw file). This option requires that a cache file has previously been created for you to reuse. At execution time, the CCM loads the cache directly from disk and populates its internal memory structures.
Figure 7-32
When you configure a CCM, you can specify which columns of the input data set should be used as index fields and which columns should be used as reference fields (see Figure 7-33). This is a necessary step: the CCM needs to know up front which columns you will be joining on, so that it can create internal index structures to optimize the process.
Figure 7-33
Whichever way you created the CCM, when you execute the package, the CCM will contain an in-memory representation of the data you specified. That means that the cache is now immediately available for use by the Lookup Transformation. Note that the Lookup Transformation is the only
component that uses the caching aspects of the CCM; however, the Raw File Source can also read .caw files, which can be useful for debugging. If you are using the Lookup Transformation in full-cache mode, you can load the cache using the CCM (instead of specifying a SQL query as described earlier in this chapter). To use the CCM option, open the Lookup Transformation and select Full Cache and Cache Connection Manager in the general pane of the editor, as shown in Figure 7-34. Then you can either select an existing CCM or create a new one. You can now continue configuring the Lookup Transformation in the same way you would if you had used a SQL query. The only difference is that in the Columns tab, you can only join on columns that you earlier specified as index columns in the CCM editor.
Figure 7-34
The CCM gives you several benefits. First of all, you can reuse caches that you previously saved to file (in the same or a different package). For instance, you can load a CCM using the Customer table and then save the cache to a .caw file on disk. Every other package that needs to do a Lookup against customers can then use a Lookup Transformation configured in full-cache/CCM mode, with the CCM pointing at the .caw file you created.

Second, reading data from a .caw file is generally faster than reading from OLE DB, so your packages should run faster. Of course, because the .caw file is an offline copy of your source data, it can become stale; therefore, it should be reloaded every so often. Note that you can use an expression for the CCM filename, which means that you can dynamically load specific files at runtime.

Third, the CCM enables you to reuse caches across loop iterations. If you use a Lookup Transformation in full-cache/OLE DB mode within an SSIS For Loop Container or Foreach Loop Container, the cache will be reloaded on every iteration of the loop. This may be your intended design, but if not, then it is difficult to mitigate the performance overhead. However, if you used a Lookup configured in full-cache/CCM mode, the CCM would be persistent across loop iterations, improving your overall package performance.
Summary

This chapter explored different ways of joining data within an SSIS solution. Relational databases are highly efficient at joining data within their own stores; however, you may not be fortunate enough to have all your data living in the same database, for example, when loading a data warehouse. SSIS enables you to perform these joins outside the database and provides many different options for doing so, each with different performance and resource-usage characteristics.

The Merge Join Transformation can join large volumes of data without much memory impact; however, it has certain requirements, such as sorted input columns, that may be difficult to meet. Remember to use the source query to sort the input data, and avoid the Sort Transformation when possible, because of performance issues. The Lookup Transformation is very flexible and supports multiple modes of operation. The Cache Connection Manager adds more flexibility to the Lookup by allowing caches to be explicitly shared across Data Flows and packages. With the CCM, the Lookup cache is also maintained across loop iterations. In large-scale deployments, many different patterns can be used to optimize performance, one of them being cascaded Lookups.

As with all SSIS solutions, there are no hard-and-fast rules that apply to all situations, so don't be afraid to experiment. If you run into any performance issues when trying to join data, try a few of the other options presented in this chapter. Hopefully, you will find one that makes a difference.
8
Creating an End-to-End Package What’s in This Chapter? ➤➤
Walking through a basic transformation
➤➤
Performing mainframe ETL with data scrubbing
➤➤
Making packages dynamic
Wrox.com Code Downloads for this Chapter
You can find the wrox.com code downloads for this chapter at www.wrox.com/go/prossis2014 on the Download Code tab.
Now that you’ve learned about all the basic tasks and transformations in SSIS, you can jump into some practical applications for SSIS. You’ll first start with a normal transformation of data from a series of flat files into SQL Server. Next you’ll add some complexity to a process by archiving the files automatically. The last example demonstrates how to make a package that handles basic errors and makes the package more dynamic. As you run through the tutorials, remember to save your package and your project on a regular basis to avoid any loss of work.
Basic Transformation Tutorial

As you can imagine, the primary reason people use SSIS is to read the data from a source and write it to a destination after it's potentially transformed. This tutorial walks you through a common scenario: you want to copy data from a Flat File Source to a SQL Server table without altering the data. This may be a simple example, but the examples will get much more complex in later chapters.
Start the tutorial by going online to the website for this book and downloading the sample extract that contains zip code information about cities. The zip code extract was retrieved from public record data from the 1990 census and has been filtered to contain only Florida cities, in order to reduce download time. You'll use this in the next tutorial as well, so it's very important not to skip this first tutorial. You can download the sample extract file, called ZipCodeExtract.csv, from this book's web page at www.wrox.com. Place the file into a directory called C:\ProSSIS\Data\Ch08.

Open SQL Server Data Tools (SSDT) and select File ➪ New ➪ Project. Then select Integration Services Project as your project type. Type ProSSISCh08 as the project name, and accept the rest of the defaults (as shown in Figure 8-1). You can place the project anywhere on your computer; the default location is under the users\"Your User Name"\Documents\Visual Studio 2012\Projects folder. In this example, the solution is created in c:\ProSSIS\Data\Ch08, but you may use the default location if you so desire.
Figure 8-1
The project will be created, and you’ll see a default Package.dtsx package file in the Solution Explorer. Right-click the file, select Rename, and rename the file ZipLoad.dtsx. If the package isn’t open yet, double-click it to open it in the package designer.
Creating Connections

Now that you have the package created, you need to add a connection that can be used across multiple packages. In the Solution Explorer, right-click the Connection Managers folder and select New Connection Manager. This opens the Add SSIS Connection Manager window.
Select the OLEDB connection and click Add, which opens the Configure OLEDB Connection Manager window. If this is the first time you are creating a connection, then the Data Connection list in the left window will be empty.

Note: There are many ways to create the connection. For example, you could create it as you're creating each source and destination. You can also use the new Source and Destination Assistants when building a Data Flow. Once you're more experienced with the tool, you'll find what works best for you.
Click the New button at the bottom of the window. Your first Connection Manager for this example will be to SQL Server, so select Native OLE DB\SQL Native Client 11.0 as the Provider. For the Server Name option, type the name of your SQL Server and enter the authentication mode that is necessary for you to read and write to the database, as shown in Figure 8-2. If you are using a local instance of SQL Server, then you should be able to use Localhost as the server name and Windows authentication for the credentials. Lastly, select the AdventureWorks database. If you don't have the AdventureWorks database, select any other available database on the server. You can optionally test the connection. Now click OK. You will now have a Data Source in the Data Source box that should be selected. Click OK.

Figure 8-2

You will now see a Data Source under the Connection Manager folder in the Solution Explorer, and the same Data Source in the Connection Manager window in the ZipLoad package you first created. This new Data Source will start with "(project)". The name of the connection automatically contains the server name and the database name. This is not a good naming convention because packages are almost always moved from server to server: for example, development, QA, and production servers. Therefore, a better naming convention would be the connection type and the database name. Right-click on the Connection Manager you just created in the Solution Explorer and rename it to OLEDB_AdventureWorks.

You can create connections in the package and convert them to project connections also. Any project-level connections automatically appear in the Connection Manager window in all packages in that project. If a connection is needed only in
one package in the project, then it can be created at the package level. To create a connection at the package level, right-click in the Connection Manager window in a package and follow the same steps used to create the project-level connection described previously.

Next, create a Flat File connection and point it to the ZipCodeExtract.csv file in your C:\ProSSIS\Data\Ch08 directory. Right-click in the Connection Manager area of the package designer, and select New Flat File Connection. Name the connection FF_ZipCode_CSV, and add any description you like. Point the File Name option to C:\ProSSIS\Data\Ch08\ZipCodeExtract.csv or browse to the correct location by clicking Browse.

Note: If you can't find the file, ensure that the file type filter is adjusted so
you’re not looking for just *.txt files, which is the default setting. Set the filter to All Files to ensure you can see the file. Set the Format dropdown box to Delimited, with set for the Text Qualifier option; these are both the default options. The Text Qualifier option enables you to specify that character data is wrapped in quotes or some type of qualifier. This is helpful when you have a file that is delimited by commas and you also have commas inside some of the text data that you do not wish to separate by. Setting a Text Qualifier will ignore any commas inside the text data. Lastly, check the “Column names in the first data row” option. This specifies that your first row contains the column names for the file. Select the Columns tab from the left side of the editor to see a preview of the file data and to set the row and column delimiters. The defaults are generally fine for this screen. The Row Delimiter option should be set to {CR}{LF}, which means that a carriage return and line feed separates each row. The Column Delimiter option is carried over from the first page and is therefore set to “Comma {,}”. In some extracts that you may receive, the header record may be different from the data records, and the configurations won’t be exactly the same as in the example. Now select the Advanced tab from the left side of the editor. Here, you can specify the data types for each of the three columns. The default for this type of data is a 50-character string, which is excessive in this case. Click Suggest Types to comb through the data and find the best data type fit for it. This will open the Suggest Column Types dialog, where you should accept the default options and click OK. At this point, the data types in the Advanced tab have changed for each column. One column in particular was incorrectly changed. When combing through the first 100 records, the Suggest Column Types dialog selected a “four-byte signed integer [DT_I4]” for the zip code column, but your Suggest Types button may select a smaller data type based on the data. While this would work for the data extract you currently have, it won’t work for states that have zip codes that begin with a zero in the northeast United States. Change this column to a string by selecting “string [DT_STR]” from the DataType dropdown, and change the length of the column to 5 by changing the OutputColumnWidth option (see Figure 8-3). Finally, change the TextQualified option to False, and then click OK.
Figure 8-3
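The leading-zero problem is easy to demonstrate directly in T-SQL. This quick illustration is not part of the tutorial, but it shows why an integer type is the wrong choice for zip codes:

-- A northeastern zip code stored as an integer loses its leading zero
select cast('02134' as int) as ZipAsInt;      -- returns 2134
-- Keeping it as a 5-character string preserves the value exactly
select cast('02134' as char(5)) as ZipAsChar; -- returns '02134'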
Creating the Control Flow

With the first two connections created, you can now define the package's Control Flow. In this tutorial, you have only a single task, the Data Flow Task. In the Toolbox, drag the Data Flow Task over to the design pane in the Control Flow tab. Next, right-click the task and select Rename to rename the task Load ZipCode Data.
Creating the Data Flow

This section reflects the most detailed portion of almost all of your packages. The Data Flow is where you will generally spend 70 percent of your time as an SSIS developer. To begin, double-click the Data Flow Task to drill into the Data Flow.

Note: When opening a task for editing, always double-click the icon on the left
side of the task. Otherwise, you may start renaming the task instead of opening it. You can also right-click on the task and select Edit.
Double-clicking on the Data Flow Task will automatically take you to the Data Flow tab. Note that the name of the task, Load ZipCode Data, is displayed in the Data Flow Task dropdown. If you had more than one Data Flow Task, the names of each would appear as options in the dropdown.

Drag and drop a Source Assistant from the Toolbox onto the Data Flow design pane. Select Flat File in the left pane of the Source Assistant. Then select FF_ZipCode_CSV in the right pane and click OK. Rename the source Florida ZipCode File in the Properties window.

Note: All the rename instructions in these tutorials are optional, but they will
keep you on the same page as the tutorials in this book. In a real-world situation, ensuring the package task names are descriptive will make your package self-documenting, because the names of the tasks are logged by SSIS. Logging is discussed in Chapter 22.
However, suppose this won’t do for your picky DBA, who is concerned about performance. In this case, you should rename the table ZipCode (taking out the brackets) and change each column’s data type to a more suitable size and type, as shown in the ZipCode and StateAbbr columns (Ch08SQL. txt): CREATE TABLE [ZipCode] ( [StateFIPCode] smallint, [ZipCode] char(5), [StateAbbr] char(2),
    [City] varchar(16),
    [Longitude] real,
    [Latitude] real,
    [Population] int,
    [AllocationPercentage] real
)
When you are done changing the DDL, click OK. The table name will be transposed into the Table dropdown box. Finally, select the Mapping tab to ensure that the inputs are mapped to the outputs correctly. SSIS attempts to map the columns based on name; in this case, because you just created the table with the same column names, it should be a direct match, as shown in Figure 8-4. After confirming that the mappings look like Figure 8-4, click OK.
Figure 8-4
Completing the Package

With the basic framework of the package now constructed, you need to add one more task to the Control Flow tab to ensure that you can run this package multiple times. To do this, click the Control Flow tab and drag an Execute SQL Task over to the design pane. Rename the task Purge ZipCode Table. Double-click the task and select OLEDB_AdventureWorks from the Connection dropdown. Finally, type the following query for the SQLStatement option (you can also click the ellipsis button and enter the query):

TRUNCATE TABLE ZipCode
Click OK to complete the task configuration. To connect the task as a parent to the Load ZipCode Data Task, click the Purge ZipCode Table Task and drag the green arrow onto the Load ZipCode Data Task.
Saving the Package

Your first package is now complete. To save the package, click the Save icon in the top menu or select File ➪ Save Selected Items. Note here that by clicking Save, you're saving the .DTSX file to the project, but you have not saved it to the server yet. To do that, you have to deploy the project. Also, SSIS does not version control your packages independently. To version control your packages, you need to integrate a solution like Subversion (SVN) into SSIS, as described in Chapter 17.
Executing the Package

With the package complete, you can attempt to execute it by right-clicking the package name in the Solution Explorer and selecting Execute Package. This is a good habit to get into when executing, because other methods, like using the green debug arrow at the top, can cause more to execute than just the package. The package will take a few moments to validate, and then it will execute. You can see the progress under the Progress tab or in the Output window.

In the Control Flow tab, the two tasks display a small yellow circle that begins to spin (in the top right of the task). If all goes well, they will change to green circles with a check. If both get green checks, then the package execution was successful. If your package failed, you can check the Output window to see why. The Output window should be open by default, but in case it isn't, you can open it by clicking View ➪ Output. You can also see a graphical version of the Output window in the Progress tab (it is called the Execution Results tab if your package stops). The Execution Results tab always shows the results from the latest package run in the current SSDT session. You can go to the Data Flow tab, shown in Figure 8-5, to see how many records were copied over. Notice the number of records displayed in the path as SSIS moves from source to destination.
Figure 8-5
By default, when you execute a package, you are placed in debug mode. Changes you make in this mode are not made available until you run the package again, and you cannot add new tasks or enter some editors. To break out of this mode, click the square Stop icon or click Debug ➪ Stop Debugging.
Typical Mainframe ETL with Data Scrubbing

With the basic ETL out of the way, you will now jump into a more complex SSIS package that attempts to scrub data. You can start this scenario by downloading the 010305c.dat public data file from the website for this book into a directory called C:\ProSSIS\Data\Ch08. This file contains public record data from the Department of State of Florida. In this scenario, you run a credit card company that's interested in marketing to newly formed domestic corporations in Florida. You want to prepare a data extract each day for the marketing department to perform a mail merge and a bulk mailing. Yes, your company is an old-fashioned, snail-mail spammer. Luckily, the Florida Department of State has an interesting extract you can use to empower your marketing department. The business goals of this package are as follows:

➤ Create a package that finds the files in the C:\ProSSIS\Data\Ch08 directory and loads the file into your relational database.
➤ Archive the file after you load it to prevent it from being loaded twice.
➤ The package must self-heal. If a column is missing data, the data should be added automatically.
➤ If the package encounters an error in its attempt to self-heal, output the row to an error queue.
➤ You must audit the fact that you loaded the file and how many rows you loaded.
Start a new package in your existing ProSSISCh08 SSDT project from the first tutorial. Right-click the SSIS Packages folder in the Solution Explorer and select New SSIS Package. This will create Package1.dtsx, or some numeric variation on that name. Rename the file CorporationLoad.dtsx. Double-click the package to open it if it is not already open. Because the OLEDB_AdventureWorks connection you created earlier was a project-level connection, it should automatically appear in the Connection Manager window of the package. You now have two packages using the same project-level connection. If you were to change the database or server name in this connection, it would change for both packages. Next, create a new Flat File Connection Manager just as you did in the last tutorial. When the configuration screen opens, call the connection FF_Corporation_DAT in the General tab.

Note: Using a naming convention like this is a best practice in SSIS. The name tells you the type of file and the type of connection.
Enter any description you like. For this Connection Manager, you’re going to configure the file slightly differently. Click Browse and point to the C:\ProSSIS\Data\Ch08\010305c.dat file (keep in mind that the default file filter is *.txt so you will have to change the filter to All Files in order to see the file). You should also change the Text Qualifier option to a single double-quote (“). Check the “Column names in the first data row” option. The final configuration should resemble Figure 8-6. Go to the Columns tab to confirm that the column delimiter is Comma Delimited.
Figure 8-6
Next, go to the Advanced tab. By default, each of the columns is set to a 50-character [DT_STR] column. However, this will cause issues with this file, because several columns contain more than 100 characters of data, which would result in a truncation error. Therefore, keep the String [DT_STR] data type for the AddressLine1 and AddressLine2 columns but change their OutputColumnWidth to 150 characters, as shown in Figure 8-7. After you've properly set these two columns, click OK to save the Connection Manager.
Figure 8-7
Creating the Data Flow

With the mundane work of creating the connections now out of the way, you can create the transformations. As you did in the last package, you must first create a Data Flow Task by dragging it from the Toolbox. Name this task Load Corporate Data. Double-click the task to go to the Data Flow tab. Drag and drop a Flat File Source onto the design pane and rename it Uncleansed Corporate Data. (You could also use the Source Assistant as shown previously; you are being shown a different method here intentionally.) Double-click the source and select FF_Corporation_DAT as the Connection Manager you'll be using. Click OK to close the screen. You'll add the destination and transformation in a moment, after the scenario is expanded a bit.
Handling Dirty Data

Before you go deeper into this scenario, take a moment to look more closely at this data. As you were creating the connection, if you are a very observant person (I did not notice this until it was too late), you may have noticed that some of the important data that you'll need is missing. For example, the city and state are missing from some of the records.

Note: The Data Profiling Task can also help with this situation; it is covered in Chapter 12.
To fix this for the marketing department, you'll use some of the transformations that were discussed in the last few chapters to send the good records down one path and the bad records down a different path. You will then attempt to cleanse the bad records and send those back through the main path. There may be some records you can't cleanse (such as corporations with foreign postal codes), which you'll have to write to an error log and deal with another time.

First, standardize the postal code to a five-digit format. Currently, some have five digits and some have the full nine-digit zip code with a dash (five digits, a dash, and four more digits). Some are nine-digit zip codes without the dash. To standardize the zip code, you use the Derived Column Transformation. Drag it from the Toolbox and rename it Standardize Zip Code. Connect the source to the transformation and double-click the Standardize Zip Code Transformation to configure it. Expand the Columns tree in the upper-left corner, find [ZipCode], and drag it onto the Expression column in the grid below. This will prefill some of the information for you in the derived column's grid area. You now need to create an expression that will take the various zip code formats in the [ZipCode] output column and output only the first five characters. An easy way to do this is with the SUBSTRING function, which would look like this:

SUBSTRING([ZipCode],1,5)
This code should be typed into the Expression column in the grid. Next, specify that the derived column will replace the existing ZipCode output by selecting that option from the Derived Column dropdown box. Figure 8-8 shows the completed options. When you are done with the transformation, click OK.
Figure 8-8
The Conditional Split Transformation

Now that you have standardized the data slightly, drag and drop the Conditional Split Transformation onto the design pane and connect the blue arrow from the Derived Column Transformation called Standardize Zip Code to the Conditional Split. Rename the Conditional Split Transformation Find Bad Records. The Conditional Split Transformation enables you to push bad records into a data-cleansing process.
To cleanse the data that lacks city or state, you'll write a condition specifying that any row missing a city or state should be moved to a cleansing path in the Data Flow. Double-click the Conditional Split Transformation after you have connected it to the Derived Column Transformation in order to edit it. Create a condition called Missing State or City by typing its name in the Output Name column. You now need to write an expression that looks for empty records. One way to do this is to use the LTRIM function. The two vertical bars (||) in the following code represent a logical OR in the expression language, and two ampersands (&&) would represent a logical AND condition. (You can read much more about the expression language in Chapter 5.) The following expression checks for a blank State or City column:

LTRIM([State]) == "" || LTRIM([City]) == ""
The last thing you need to do is give a name to the default output if the coded condition is not met. Call that output Good Data, as shown in Figure 8-9. The default output contains the data that did not meet your conditions. Click OK to close the editor.
Figure 8-9
Note: If you have multiple cases, always place the conditions that you think will capture most of the records at the top of the list, because at runtime the list is evaluated from top to bottom, and you don't want to evaluate records more times than needed.
The Lookup Transformation

Next, drag and drop the Lookup Transformation onto the design pane. When you connect to it from the Conditional Split Transformation, you'll see the Input Output Selection dialog (shown in Figure 8-10). Select Missing State or City and click OK. This will send any bad records from the Conditional Split to the Lookup Transformation. Rename the Lookup Transformation Fix Bad Records.
Figure 8-10
The Lookup Transformation enables you to map a city and state to the rows that are missing that information by looking the record up in the ZipCode table you loaded earlier. Open the transformation editor for the Lookup Transformation. Then, in the General tab, ensure that the Full Cache property is set and that you have the OLE DB Connection Manager property set for the Connection Type. Change the No Matching Entries dropdown box to “Redirect rows to no match output,” as shown in Figure 8-11.
Figure 8-11
In the Connection tab, select OLEDB_AdventureWorks as the Connection Manager that contains your Lookup table. Select ZipCode from the “Use a Table or View” dropdown menu. For simplicity, you are just selecting the table, but the best practice is always to type in a SQL command and select only the needed columns. Next, go to the Columns tab and drag ZipCode from the left Available Input Columns to the right ZipCode column in the Available Lookup Columns table. This will create an arrow between the two tables, as shown in Figure 8-12. Then, check the StateAbbr and City columns that you wish to output. This will transfer their information to the bottom grid. Change the Add as New Column option to Replace for the given column name as well. Specify that you want these columns to replace the existing City and State. Refer to Figure 8-12 to see the final configuration. Click OK to exit the transformation editor. There are many more options available here, but you should stick with the basics for the time being. With the configuration you just did, the potentially blank or bad city and state columns will be populated from the ZipCode table.
Figure 8-12
The Union All Transformation

Now that your dirty data is cleansed, send the sanitized data back into the main data path by using a Union All Transformation. Drag and drop the Union All Transformation onto the design pane and connect the Fix Bad Records Lookup Transformation and the Find Bad Records Conditional Split Transformation to the Union All Transformation. When you drag the blue line from the Lookup Transformation, you are prompted to define which output you want to send to the Union All Transformation. Select the Lookup Match Output. There is nothing more to configure with the Union All Transformation.
Finalizing

The last step in the Data Flow is to send the data to an OLE DB Destination. Drag the OLE DB Destination to the design pane and rename it Mail Merge Table. Connect the Union All Transformation to the destination. Double-click the destination and select OLEDB_AdventureWorks from the Connection Manager dropdown. For the Use a Table or View option, click the New button next to the dropdown. The default DDL for creating the table uses the destination's name (Mail Merge Table), and the data types may not be exactly what you want, as shown here:

CREATE TABLE [Mail Merge Table] (
    [CorporateNumber] varchar(50),
    [CorporationName] varchar(50),
    [CorporateStatus] varchar(50),
    [FilingType] varchar(50),
    [AddressLine1] varchar(150),
    [AddressLine2] varchar(150),
    [City] varchar(50),
    [State] varchar(50),
    [ZipCode] varchar(50),
    [Country] varchar(50),
    [FilingDate] varchar(50)
)
Go ahead and change the schema to something a bit more useful. Change the table name and each column to something more meaningful, as shown in the following example (Ch08SQL.txt). These changes may cause the destination to show warnings about truncation after you click OK. If so, these warnings can be ignored for the purpose of this example.
Note: Warnings in a package do not indicate that the package will fail. In this case the zip code is trimmed to 5 characters, so you know the data is not going to be truncated as the warning suggests. It is acceptable to run packages with warnings, especially in cases where unnecessary tasks would need to be added to remove the warning.
CREATE TABLE MarketingCorporation(
    CorporateNumber varchar(12),
    CorporationName varchar(48),
    FilingStatus char(1),
    FilingType char(4),
    AddressLine1 varchar(150),
    AddressLine2 varchar(50),
    City varchar(28),
    State char(2),
    ZipCode varchar(10),
    Country char(2),
    FilingDate varchar(10) NULL
)
You may have to manually map some of the columns this time because the column names are different. Go to the Mappings tab and map each column to its new name. Click OK to close the editor.
Handling More Bad Data

The unpolished package is essentially complete, but it has one fatal flaw that you're about to discover. Execute the package. As shown in Figure 8-13, when you do this, you can see, for example, that in the 010305c.dat file, four records were sent to be cleansed by the Lookup Transformation. Of those, only two had the potential to be cleansed. The other two records were for companies outside the country, so they could not be located in the Lookup Transformation that contained only Florida zip codes. These two records were essentially lost because you specified in the Lookup Transformation to redirect the rows without a match to a "no match output" (refer to Figure 8-11), but you have not set up a destination for this output.

Figure 8-13

Recall that the business requirement was to send marketing a list of domestic addresses for their mail merge product. They didn't care about the international addresses because they didn't have a business presence in other countries. In this example, you want to send those two rows to an error queue for further investigation by a business analyst and to be cleaned manually. To do this properly, you need to audit each record that fails the match and create an ErrorQueue table on the SQL Server. Drag over the Audit Transformation found under the Other Transformations section of the SSIS Toolbox. Rename the Audit Transformation Add Auditing Info and connect the remaining blue arrow from the Fix Bad Records Transformation to the Audit Transformation. With the Lookup problems now being handled, double-click the Audit Transformation to configure it. Add two additional columns to the output. Select Task Name and Package Name from the dropdown boxes in the Audit Type column. Remove the spaces in each default output column name, as shown in Figure 8-14, to make it easier to query later. You should output this auditing information because you may have multiple packages and tasks loading data into the corporation table, and you'll want to track from which package the error actually originated. Click OK when you are done.
Figure 8-14
The last thing you need to do to polish up the package is send the bad rows to the SQL Server ErrorQueue table. Drag another OLE DB Destination over to the design pane and connect the Audit Transformation to it. Rename the destination Error Queue. Double-click the destination and select
OLEDB_AdventureWorks as the Connection Manager, and click New to add the ErrorQueue table. Name the table ErrorQueue and follow a schema similar to the one shown here (Ch08SQL.txt):

CREATE TABLE [ErrorQueue] (
    [CorporateNumber] varchar(50),
    [CorporationName] varchar(50),
    [CorporateStatus] varchar(50),
    [FilingType] varchar(50),
    [AddressLine1] varchar(150),
    [AddressLine2] varchar(150),
    [City] varchar(50),
    [StateAbbr] varchar(50),
    [ZipCode] varchar(50),
    [Country] varchar(50),
    [FilingDate] varchar(50),
    [TaskName] nvarchar(19),
    [PackageName] nvarchar(15)
)
Note: In error queue tables like the one just illustrated, be very generous when defining the schema. In other words, you don't want to create another transformation error trying to write into the error queue table. Instead, consider defining everything as a varchar column, providing more space than actually needed.
You may have to map some of the columns this time because of the column names being different. Go to the Mappings tab and map each column to its new name. Click OK to close the editor. You are now ready to re-execute the package. This time, my data file contained four records that need to be fixed, and two of those were sent to the error queue. The final package would look something like the one shown in Figure 8-15 when executed.
Figure 8-15
Looping and the Dynamic Tasks

You've gone a long way in this chapter toward creating a self-healing package, but it's not very reusable yet. Your next task in the business requirements is to configure the package so that it reads a directory for any .DAT file and performs the preceding tasks on that collection of files. To simulate this example, copy the rest of the *.DAT files from the Chapter 8 download content for this book, available at www.wrox.com, into C:\ProSSIS\Data\Ch08.
Looping

Your first task is to loop through any set of .DAT files in the C:\ProSSIS\Data\Ch08 folder and load them into your database just as you did with the single file. To meet this business requirement, you need to use the Foreach Loop Container. Go to the Control Flow tab in the same package that you've been working in, and drag the container onto the design pane. Then, drag the "Load Corporate Data" Data Flow Task onto the container. Rename the container Loop Through Files. Double-click the container to configure it. Go to the Collection tab and select Foreach File Enumerator from the Enumerator dropdown box. Next, specify that the folder will be C:\ProSSIS\Data\Ch08 and that the files will have the *.DAT extension, as shown in Figure 8-16.
Figure 8-16
You now need to map the variables to the results of the Foreach File enumeration. Go to the Variable Mappings tab inside the Foreach Loop Editor and select <New Variable...> from the Variable column dropdown box. This will open the Add Variable dialog. For the container, you'll remain at the package level. You could assign the scope of the variable to the container, but keep things simple for this example. Name the variable strExtractFileName and click OK, leaving the rest of the options at their default settings. You will then see the User::strExtractFileName variable in the Variable column and the number 0 in the Index option. Because the Foreach File Enumerator option has only one column, you'll see only an index of 0 for this column. If you used a different enumerator option, you could enter a number for each column that was returned from the enumerator. Click OK to leave the Foreach Loop editor.
Making the Package Dynamic

Now that the loop is created, you need to set the filename in the FF_Corporation_DAT Connection Manager to be equal to the filename that the enumerator retrieves dynamically. To meet this business requirement, right-click the FF_Corporation_DAT Connection Manager and select Properties (note that you're clicking Properties, not Edit as you've done previously). In the Properties pane for this Connection Manager, click the ellipsis button next to the Expressions option. This opens the Property Expressions Editor. Select ConnectionString from the Property dropdown box, and then click the ellipsis under the Expression column next to the ConnectionString property you just selected; this opens the Expression Builder window, as shown in Figure 8-17. You can either type @[User::strExtractFileName] in the Expression column or click the ellipsis button and then drag and drop the variable into the expression window. By entering @[User::strExtractFileName], you are setting the filename in the Connection Manager to be equal to the current value of the strExtractFileName variable that you set in the Foreach Loop earlier. Click OK to exit the open windows. Back in the Properties window, you can see the single expression you created by clicking the plus sign next to Expressions.
Figure 8-17
As it stands right now, each time the loop finds a .DAT file in the C:\ProSSIS\Data\Ch08 directory, it will set the strExtractFileName variable to that path and filename. Then, the Connection Manager will use that variable as its filename and run the Data Flow Task one time for each file it finds. You now have a reusable package that can be run against any file in the format you designated earlier. The only missing technical piece is the archiving of the files after you load them. Before you begin solving that problem, manually create an archive directory under C:\ProSSIS\Data\Ch08 called C:\ProSSIS\Data\Ch08\Archive. Right-click in the Connection Manager window and select New File Connection. Select Existing Folder for the Usage Type, and point the file to the C:\ProSSIS\Data\Ch08\Archive directory. Click OK and rename the newly created Connection Manager Archive.

Next, drag a File System Task into the Loop Through Files Container and connect the "Load Corporate Data" Data Flow Task to it with an On Success precedence constraint (the green arrow should be attached to the File System Task). Rename that task Archive File. Double-click the "Archive File" File System Task to open the editor (shown in Figure 8-18). Set the Operation dropdown box to Move file. Next, change the DestinationConnection from a variable to the Archive Connection Manager that you just created. Also, select True for the OverwriteDestination option, which overwrites a file if it already exists in the archive folder. The SourceConnection dropdown box should be set to the FF_Corporation_DAT Connection Manager that you created earlier in this chapter. You have now configured the task to move the file currently in the Foreach Loop to the directory in the Archive Connection Manager. Click OK to close the editor.
Figure 8-18
Your complete package should now be ready to execute. Save the package before you execute it. If you successfully implemented the solution, your Control Flow should look something like Figure 8-19 when executed. When you execute the package, you’ll see the Control Flow items flash green once for each .DAT file in the directory. To run the package again, you must copy the files back into the working directory from the archive folder.
Figure 8-19

Summary

This chapter focused on driving home the basic SSIS transformations, tasks, and containers. You performed a basic ETL procedure, and then expanded the ETL to self-heal when bad data arrived from your data supplier. You then set the package to loop through a directory, find each .DAT file, and load it into the database. The finale was archiving the file automatically after it was loaded. With this type of package now complete, you could use any .DAT file that matched the format you configured, and it will load with reasonable certainty. In the upcoming chapters, you'll dive deeply into Script Tasks and Components.
9

Scripting in SSIS

What's in This Chapter?

➤ Selecting your scripting language and getting started
➤ Adding assemblies to SSIS Script objects
➤ Understanding Script Task usage
➤ Understanding Script Component usage
➤ Using external SSIS objects from within a script
➤ Using events and logging in debugging scripts
Wrox.com Downloads for This Chapter
You can find the wrox.com code downloads for this chapter at www.wrox.com/go/prossis2014 on the Download Code tab.
Scripting is the Swiss Army knife of SSIS. As shown in previous chapters, many different SSIS features are available out-of-the-box. If you need to do something that you just can’t find anywhere else, you will find additional functionality in three features: the Script Task, the Script Component, and expressions. Expressions, covered in Chapter 5, are small scripts that set properties. The other two scripting concepts provide access into a scripting development environment using Microsoft Visual Studio Tools for Applications (VSTA) that enables SSIS developers to script logic into packages using Microsoft Visual Basic 2012 or Microsoft Visual C# 2012 .NET code. In this chapter, you will learn the differences between these script components and when you should use one over the other. You’ll also learn all about the various scripting options available and how to use them in your package development tasks to control execution flow, perform custom transformations, manage variables, and provide runtime feedback.
Introducing SSIS Scripting

If you think of scripting as something that compiles at runtime and contains unstructured or unmanaged coding languages, then scripting in SSIS does not conform to your idea of scripting. Conversely, if you think of scripting as using small bits of code in specialized places to execute specific tasks, then SSIS scripting won't be an alien concept. It is helpful to understand why scripting is separated according to functional usage. In this chapter, you will examine the differences and look into the scripting IDE environment, walking through the mechanics of applying programmatic logic in these components, including how to add your own classes and compiled assemblies. ETL developers have had creative ways of handling logic in their packages. Specifically, digging into what developers were doing, the functional activities can be divided into the following categories:

➤ Retrieving or setting the value of package variables
➤ Retrieving or setting properties within the package
➤ Applying business logic to validate or format data
➤ Controlling workflow in a package
Retrieving and setting the value of variables and package properties is so prevalent an activity that the SSIS team created a completely separate feature that enables this to be less of a programmatic task. Using the Expression Builder, you can easily alter package components by setting component properties to an expression or a variable that represents an expression.
Note: Refer to Chapter 5 for detailed information about how to use expressions, parameters, and variables.
To modify properties that connect to, manipulate, or define data, you can use the Data Flow Task to visually represent this activity. However, to achieve functionality not provided out of the box, you still need scripting, so the Script Component was added. The primary role of the Script Component is to extend the Data Flow capabilities and allow programmatic data manipulation within the context of the data flow. However, it can do more, as you’ll learn later in this chapter. To continue to enable the numerous miscellaneous tasks that are needed in ETL development, use the Script Task, which can be used only in the Control Flow design surface. In this task, you can perform various manipulations within the managed code framework of .NET. The Script Task and Script Component use the Visual Studio Tools for Applications (VSTA) environment. VSTA is essentially a scaled-down version of Visual Studio that can be added to an application that allows coding extensions using managed code and .NET languages. Even though SSIS packages are built inside of Visual Studio, when you are in the context of a Script Task or Script Component, you are actually coding in the VSTA environment that is, in fact, a mini-project within the package. The VSTA IDE provides IntelliSense, full edit-and-continue capabilities, and the ability to code in either Visual Basic or C#. You can even access some of the .NET assemblies and╯use web references for advanced scripting.
Note: To gain the most from this scripting chapter, you need a basic understanding of programming in either C# or Visual Basic. If you don't already have it, you can obtain this knowledge from either Beginning Visual C# 2012 Programming by Karli Watson and colleagues (Wrox; ISBN: 978-1-118-31441-8) or Beginning Visual Basic 2012 by Bryan Newsome (Wrox; ISBN: 978-1-118-31181-3).
Getting Started in SSIS Scripting

The Script Task and Script Component have greatly increased your possibilities when it comes to script-based ETL development in SSIS. However, it is important to know when to use which component and what can be done in each. The following explains when to use each component:

➤ Script Task: This task is used in the Control Flow. Use this task when you need to program logic that either controls package execution or performs a task of retrieving or setting variables within a package during runtime.
➤ Script Component: This component is used in the Data Flow. Use this component when moving data using the Data Flow Task. Here you can apply programmatic logic to massage, create, or consume data in the pipeline.
To get a good look at the scripting model, the next example walks through a simple “Hello World” coding project in SSIS. Although this is not a typical example of ETL programming, it serves as a good introduction to the scripting paradigm in SSIS, followed by the specific applications of the Script Task and Script Component.
Selecting the Scripting Language

SSIS allows the developer to choose between two different scripting languages: C# or Visual Basic (VB). To see where you can make this choice, drop a Script Task onto the Control Flow design surface. Right-click the Script Task and click Edit from the context menu. The first thing you'll notice is the availability of two scripting languages, Microsoft Visual C# 2012 and Microsoft Visual Basic 2012, in the ScriptLanguage property of the task. Figure 9-1 shows these options in the Script Task Editor. After clicking the Edit Script button, you'll be locked into the script language that you chose, and you won't be able to change it without deleting and recreating the Script Task or Script Component.

Figure 9-1
This is because each Script item contains its own internal Visual Studio project in VB or C#. You can create separate Script items whereby each one uses a different language within a package. However, using Script items in both languages within the same package is not recommended, as it makes maintenance of the package more complex. Anyone maintaining the package would have to be competent in both languages.
Using the VSTA Scripting IDE

Clicking the Edit Script button on the editor allows you to add programmatic code to a Script Task or Script Component. Although the Script Task and Script Component editors look different, they both provide an Edit Script button to access the development IDE for scripting, as shown in Figure 9-2.
Figure 9-2
Once you are in the IDE, notice that it looks and feels just like Visual Studio. Figure 9-3 shows an example of how this IDE looks after opening the Script Task for the VB scripting language.
Figure 9-3
The code window on the left side of the IDE contains the code for the item selected in the Solution Explorer on the top-right window. The Solution Explorer shows the structure for the project that is being used within the Scripting Task. A complete .NET project is created for each Script Task or Component and is temporarily written to a project file on the local drive where it can be altered in the Visual Studio IDE. This persistence of the project is the reason why once you pick a scripting language, and generate code in the project, you are locked into that language for that Scripting item. Notice in Figure 9-3 that a project has been created with the namespace of ST_a8363e166ca246a3bedda7. However, you can’t open this project directly, nor need you worry about the project during deployment. These project files are extracted from stored package metadata. With the project created and opened, it is ready for coding.
Example: Hello World

In the IDE, the Script Task contains only a class named ScriptMain. In the entry-point function, Main(), you'll put the code that you want executed. Part of that code can make calls to additional functions or classes. However, if you want to change the name of the entry-point function for some reason, type the new name in the property called EntryPoint on the Script page of the editor. (Alternatively, you could change the name of the entry point at runtime using an expression.) In the VSTA co-generated class ScriptMain, you'll also see a set of assembly references already added to your project, and namespaces set up in the class. Depending upon whether you chose VB or C# as your scripting language, you'll see either:

C#
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

or

VB
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
These assemblies are needed to provide base functionality as a jump-start to your coding. The remainder of the class includes VSTA co-generated methods for startup and shutdown operations, and finally the entry-point Main() function, shown here in both languages:

C#
public void Main()
{
    // TODO: Add your code here
    Dts.TaskResult = (int)ScriptResults.Success;
}
VB
Public Sub Main()
    '
    ' Add your code here
    '
    Dts.TaskResult = ScriptResults.Success
End Sub
Note that the Script Task must return a result to notify the runtime of whether the script completed successfully or not. The result is passed using the Dts.TaskResult property. By setting the result to ScriptResults.Success, the script informs the package that the task completed successfully.
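If the logic in your script detects a problem, you can return ScriptResults.Failure instead so that precedence constraints and error handling in the package react accordingly. The following is a minimal sketch of that pattern (not from the book's sample packages); the file-existence test is only a hypothetical condition you might check:

C#
public void Main()
{
    // Hypothetical condition: succeed only if an expected input file is present.
    if (System.IO.File.Exists(@"C:\ProSSIS\Data\Expected.txt"))
    {
        Dts.TaskResult = (int)ScriptResults.Success;
    }
    else
    {
        Dts.TaskResult = (int)ScriptResults.Failure;
    }
}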
Note: The Script Component does not have to do this, since it runs in the context of a Data Flow with many rows. Other differences pertaining to each component are discussed separately later in the chapter.

To get a message box to pop up with the phrase "Hello World!" you need access to a class called MessageBox in a namespace called System.Windows.Forms. This namespace can be called directly by its complete name, or it can be added after the Microsoft.SqlServer.Dts.Runtime namespace to shorten the coding required in the class. Both of these methods are shown in the following code (ProSSIS\Code\Ch09_ProSSIS\02HelloWorld.dtsx) to insert the MessageBox code into the Main() function:

C#
using System.Windows.Forms;
...
MessageBox.Show("Hello World!");

or

System.Windows.Forms.MessageBox.Show("Hello World!");

VB
Imports System.Windows.Forms
...
MessageBox.Show("Hello World!")

or

System.Windows.Forms.MessageBox.Show("Hello World!")
Get in the habit now of building the project after adding this code. The Build option is directly on the menu when you are coding. Previous versions of SSIS gave you the opportunity to run in precompile or compiled modes. SSIS now will automatically compile your code prior to executing the package at runtime. Compiling gives you an opportunity to find any errors before the package finds them. Once the build is successful, close the IDE and the editor, and right-click and execute the Script Task. A pop-up message box should appear with the words “Hello World!” (see Figure 9-4).
Figure 9-4
Adding Code and Classes

Using modal message boxes is obviously not the type of coding desired in production SSIS package development. Message boxes are synchronous and block until a click event is received, so they can stop a production job dead in its tracks. However, this is a basic debugging technique to demonstrate the capabilities of the scripting environments before getting into some of the details of passing values in and out using variables. You also don't want to always put the main blocks of code in the Main() function. With just a little more work, you can get some code reuse from previously written code using some cut-and-paste development techniques. At the very least, code can be structured in a less procedural way. As an example, consider the common task of generating a unique filename for a file you want to archive. Typically, the filename might be generated by appending a prefix and an extension to a variable like a guid. These functions can be added within the ScriptMain class bodies to look like this (ProSSIS\Code\Ch09_ProSSIS\03BasicScript.dtsx):

C#
public partial class ScriptMain
{
    ...
    public void Main()
    {
        System.Windows.Forms.MessageBox.Show(GetFileName("bankfile", "txt"));
        Dts.TaskResult = (int)ScriptResults.Success;
    }

    public string GetFileName(string Prefix, string Extension)
    {
        return Prefix + "-" + Guid.NewGuid().ToString() + "." + Extension;
    }
}
VB
Partial Class ScriptMain
    ...
    Public Sub Main()
        System.Windows.Forms.MessageBox.Show(GetFileName("bankfile", "txt"))
        Dts.TaskResult = ScriptResults.Success
    End Sub

    Public Function GetFileName(ByVal Prefix As String, _
        ByVal Extension As String) As String
        GetFileName = Prefix + "-" + Guid.NewGuid.ToString + _
            "." + Extension
    End Function
End Class
Instead of all the code residing in the Main() function, structured programming techniques can separate and organize SSIS scripting. In this example, the GetFileName function builds the filename and then returns the value, which is displayed in a message box, as shown in Figure 9-5. Of course, copying and pasting the same code into multiple Script Tasks is pretty inefficient and produces solutions that are difficult to maintain.

Figure 9-5

If you have preexisting compiled code, shouldn't you be able to reuse this code without finding the original source for the copy-and-paste operation? What a great question! You can, with some caveats.
Using Managed Assemblies

Reusing code, no matter what language it was written in, increases the maintainability and supportability of an application. While you can only write SSIS scripts using Visual Basic and C#, SSIS provides the capability to reuse code by reusing assemblies that are part of the .NET Framework or any assembly created using a .NET-compliant language, including C#, J#, and even Delphi. However, there are some important qualifications:

➤ For a managed assembly to be used in an Integration Services package, you must install the assembly in the global assembly cache (GAC).
➤ All dependent or referenced assemblies must also be registered in the GAC. This implies that the assemblies must be strongly named.
➤ For development purposes only, VSTA can use managed assemblies anywhere on the local machine.
If you think about this it makes sense, but within SSIS, it might seem confusing at first. On the one hand, a subproject is created for the Script Task, but it is deployed as part of the metadata of the package; there is no separate physical DLL file for the assembly. In this case, you don’t have to worry about deployment of individual script projects. However, when you use an external assembly, it is not part of the package metadata, and here you do have to worry about deployment of the assembly. Where then do you deploy the assembly you want to use? Because SSIS packages are typically deployed within SQL Server, the most universal place to find the assembly would be in the GAC. If you are using any of the standard .NET assemblies, they are already loaded and stored in the GAC and the .NET Framework folders. As long as you are using the same framework for your development and production locations, using standard .NET assemblies requires no additional work in your environment. To use a standard .NET assembly in your script, you must reference it. To add a reference in a scripting project, open the VSTA environment for editing your script code — not the SSIS package itself. Right-click the project name in the Solution Explorer or go to the Project menu and select the Add Reference option. The new Reference Manager dialog will appear, as in Figure 9-6.
Figure 9-6
Select the assemblies from the list that you wish to reference and click the OK button to add the references to your project. Now you can use any objects located in the referenced assemblies either by directly referencing the full assembly or by adding the namespaces to your ScriptMain classes for shorter references, similar to the Windows Forms assembly used in the Hello World example. References can be removed from the project References screen. Find this screen by double-clicking the My Project node of the Solution Explorer. Select the References menu to see all references included in your project. To remove a reference, select the name and click the Delete key.
Example: Using Custom .NET Assemblies

Although using standard .NET assemblies is interesting, being able to use your own compiled .NET assemblies really extends the capabilities of your SSIS development. Using code already developed and compiled means not having to copy-and-paste code into each Script Task, enabling you to reuse code already developed and tested. To examine how this works, in this section you'll create an external custom .NET library that can validate a postal code and learn how to integrate this simple validator into a Script Task. (To do this, you need the standard class library project templates that are part of Visual Studio. If you installed only SQL Server Data Tools, these templates are not installed by default.) You can also download the precompiled versions of these classes, as well as any code from this chapter, at www.wrox.com/go/prossis2014. To start, open a standard class library project in the language of your choice, and create a standard utility class in the project that looks something like this (ProSSIS\Code\Ch09_ProSSIS\SSISUtilityLib_VB\SSISUtilityLib_VB\DataUtilities.vb):

C#
using System.Text.RegularExpressions;

namespace SSISUtilityLib_CSharp
{
    public static class DataUtilities
    {
        public static bool isValidUSPostalCode(string PostalCode)
        {
            return Regex.IsMatch(PostalCode, "^[0-9]{5}(-[0-9]{4})?$");
        }
    }
}
VB
Imports System.Text.RegularExpressions

Public Class DataUtilities
    Public Shared Function isValidUSPostalCode(ByVal PostalCode As String) As Boolean
        isValidUSPostalCode = Regex.IsMatch(PostalCode, "^[0-9]{5}(-[0-9]{4})?$")
    End Function
End Class
Because you are creating projects for both languages, the projects (and assemblies) are named SSISUtilityLib_VB and SSISUtilityLib_Csharp. Notice the use of static or shared methods. This isn’t required, but it’s useful because you are simulating the development of what could later be a utility library loaded with many stateless data validation functions. A static or shared method allows the utility functions to be called without instantiating the class for each evaluation. Now sign the assembly by right-clicking the project to access the Properties menu option. In the Signing tab, note the option to “Sign the assembly,” as shown in Figure 9-7. Click New on the dropdown and name the assembly to have a strong name key added to it.
Figure 9-7
In this example, the VB version of the SSISUtilityLib project is being signed. Now you can compile the assembly by clicking the Build option in the Visual Studio menu. The in-process DLL will be built with a strong name, enabling it to be registered in the GAC.
On the target development machine, open a command-line prompt window to register your assembly with a command similar to this:

C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools>
Gacutil /i C:\ProSSIS\Code\SSISUtilityLib_CSharp\SSISUtilityLib_CSharp\bin\Release\SSISUtilityLib_CSharp.dll
Note: You may have to run the command line as administrator or have the User Account Control feature turned off to register the assembly.
If you are running on a development machine, you also need to copy the assembly into the appropriate .NET Framework directory so that you can use the assembly in the Visual Studio IDE. Using the Microsoft .NET Framework Configuration tool, select Manage the Assembly Cache. Then select Add an Assembly to the Assembly Cache to copy an assembly file into the global cache.
Note: For a detailed step-by-step guide to the deployment, see the SSIS Developer's Guide on Custom Objects located at http://msdn.microsoft.com/en-us/library/ms403356(v=SQL.110).aspx.
To use the compiled assembly in an SSIS package, open a new SSIS package and add a new Script Task to the Control Flow surface. Select the scripting language you wish and click Edit Script. You'll need to right-click the references node in the Solution Explorer and find the reference for SSISUtilityLib_VB.dll or SSISUtilityLib_CSharp.dll, depending on which one you built. If you have registered the assembly in the GAC, you can find it in the .NET tab. If you are in a development environment, you can simply browse to the .dll to select it. Then add the namespace to the ScriptMain class:

C#
using SSISUtilityLib_CSharp;
VB
Imports SSISUtilityLib_VB
Note that the SSIS C# Script Task in the sample packages you’ll see if you download the chapter materials from www.wrox.com/go/prossis2014 use both the C# and the VB versions of the utility library. However, this is not required. The compiled .NET class libraries may be intermixed within the SSIS Script Task or Components regardless of the scripting language you choose.
Now you just need to code a call to the utility function in the Main() function like this (ProSSIS\Code\Ch09_ProSSIS\04SSISPackageUsingAssembly.dtsx):

C#
public void Main()
{
    string postalCode = "12345-1111";
    string msg = string.Format(
        "Validating PostalCode {0}\nResult..{1}",
        postalCode, DataUtilities.isValidUSPostalCode(postalCode));
    MessageBox.Show(msg);
    Dts.TaskResult = (int)ScriptResults.Success;
}
VB
Public Sub Main()
    Dim postalCode As String = "12345-1111"
    Dim msg As String = String.Format("Validating PostalCode {0}" + _
        vbCrLf + "Result..{1}", postalCode, _
        DataUtilities.isValidUSPostalCode(postalCode))
    MessageBox.Show(msg)
    Dts.TaskResult = ScriptResults.Success
End Sub
Compile the Script Task and execute it. The result should be a message box displaying a string to validate the postal code 12345-1111. The postal code format is validated by the DataUtility function IsValidUSPostalCode. There was no need to copy the function in the script project. The logic of validating the format of a U.S. postal code is stored in the shared DataUtility function and can easily be used in both Script Tasks and Components with minimal coding and maximum consistency. The only downside to this is that there is now an external dependency in the SSIS package upon this assembly. If the assembly changes version numbers, you’ll need to open and recompile all the script projects for each SSIS package using this. Otherwise, you could get an error if you aren’t following backward compatibility guidelines to ensure that existing interfaces are not broken. If you have a set of well-tested business functions that rarely change, using external assemblies may be a good idea for your SSIS development.
Using the Script Task Now that you have a good overview of the scripting environment in SSIS, it’s time to dig into the Script Task and give it a spin. The Script Task was used heavily to demonstrate how the SSIS scripting environment works with Visual Studio and during the execution of a package. Generally, anything that you can script in the .NET managed environment that should run once per package or code loop belongs in the Script Task. The Script Task is used in the Control Flow of a package. Script Tasks are extremely useful and end up being the general-purpose utility component if the desired functionality is not available in the out-of-the-box Control Flow tasks.
www.it-ebooks.info
c09.indd 286
3/24/2014 9:18:44 AM
Using the Script Task╇
❘╇ 287
Configuring the Script Task Editor An earlier look at the Script Task Editor pointed out that two selections are available for the scripting language, but there are other options as well. Drop a Script Task on the Control Flow surface to display the Script Task Editor shown in Figure 9-8. Here are the four properties on the Script tab to configure the Script Task: ➤
ScriptLanguage: This property defines the .NET language that will be used for the script. As demonstrated earlier, VB and C# are your two options.
➤
EntryPoint: This is the name of the method that will be called inside your script to begin execution.
Figure 9-8
➤
ReadOnlyVariables: This property enumerates a case-sensitive, comma-separated list of SSIS variables to which you allow explicit rights to be read by the Script Task.
➤
ReadWriteVariables: This property enumerates a case-sensitive, comma-separated list of SSIS variables to which you allow the script to read from and write to.
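As a preview of how these last two properties surface in code (variable access is covered in detail later in this chapter), the following is a minimal sketch. It assumes a package with two hypothetical variables, User::InputPath listed in ReadOnlyVariables and User::RowCount listed in ReadWriteVariables; any variable named in either list can be reached through the Dts.Variables collection without explicit locking:

C#
public void Main()
{
    // User::InputPath must be listed in ReadOnlyVariables on the Script tab.
    string inputPath = Dts.Variables["User::InputPath"].Value.ToString();

    // User::RowCount must be listed in ReadWriteVariables to allow the assignment.
    Dts.Variables["User::RowCount"].Value = 0;  // placeholder value for illustration

    Dts.TaskResult = (int)ScriptResults.Success;
}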
All scripts are precompiled by default, which improves performance and reduces the overhead of loading the language engine when running a package. The second tab on the left, General, contains the task name and description properties. The final page available on the left of this dialog is the Expressions tab, which provides access to the properties that can be set using an expression or expression-based variable. (See Chapter 5 for details about how to use expressions and variables.) Keep in mind that changing the ScriptLanguage property at runtime is neither possible nor desirable, even though it is listed as a possibility in the Expression Editor. Once the script language is set and the script accessed, a project file with a class named ScriptMain and a default entry point named Main() is created. As a reminder, an example of the Main() function is provided here (ProSSIS\Code\Ch09_ProSSIS\01EmptyPackage.dtsx), without the supporting class:

C#
public void Main()
{
    Dts.TaskResult = (int)ScriptResults.Success;
}
VB
Public Sub Main()
    Dts.TaskResult = ScriptResults.Success
End Sub
The code provided includes the statement to set the TaskResult of the Dts object to the enumerated value for success. The Script Task itself is a task in the collection of tasks for the package. Setting the TaskResult property of the task sets the return value for the Script Task and tells the package whether the result was a success or a failure. By now, you have probably noticed all the references to Dts. What is this object and what can you do with it? This question is answered in the next section, as you peel back the layers of the Dts object.
The Script Task Dts Object

The Dts object is actually a property on your package that is an instance of the Microsoft.SqlServer.Dts.Tasks.ScriptTask.ScriptObjectModel class. The Dts object provides a window into the package in which your script executes. Although you can't change properties of the package as it executes, the Dts object has seven properties and one method that allow you to interact with the package. The following is an explanation of these members (a short sketch that uses several of them together follows the list):

➤ Connections: A collection of Connection Managers defined in the package. You can use these connections in your script to retrieve any extra data you may need.

➤ Events: A collection of events that are defined for the package. You can use this interface to fire off these predefined events and any custom events.

➤ ExecutionValue: A read-write property that enables you to specify additional information about your task's execution using a user-defined object. This can be any information you want.

➤ TaskResult: This property enables you to return the success or failure status of your Script Task to the package. This is the main way of communicating processing status or controlling flow in your package. This property must be set before exiting your script.

➤ Transaction: Obtains the transaction associated with the container in which your script is running.

➤ VariableDispenser: Gets the VariableDispenser object, which you can use to retrieve variables when using the Script Task.

➤ Variables: A collection of all the variables available to the script.

➤ Log: You can use this method to write to any log providers that have been enabled.
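To make these members concrete, here is a minimal hedged sketch (C#) of a Main() routine that touches several of them at once. The variable name User::RowCount is only an illustration; it would need to exist in your own package and be listed in ReadOnlyVariables.

C#

public void Main()
{
    // Write a message to any enabled log providers (message, data code, data bytes)
    Dts.Log("Script Task starting", 0, new byte[0]);

    // Read a variable that was passed in through ReadOnlyVariables
    int rowCount = (int)Dts.Variables["User::RowCount"].Value;

    // Raise an informational event that event handlers or log providers can pick up
    bool fireAgain = true;
    Dts.Events.FireInformation(0, "Script Task",
        "Row count is " + rowCount.ToString(), "", 0, ref fireAgain);

    // Report success back to the package
    Dts.TaskResult = (int)ScriptResults.Success;
}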
The next few sections describe some of the common things that the Script Task can be employed to accomplish.
Accessing Variables in the Script Task

Variables and expressions are an important feature of the SSIS road map. In this context, "variables" are the objects that serve as intermediate communication mediums between your Script Task and the rest of your package. As discussed in Chapter 5, variables are used to drive the runtime changes within a package by allowing properties to infer their values at runtime from variables, which can be static or defined through the expression language. The common method of using variables is to send them into a Script Task as decision-making elements or to drive downstream decisions by setting the value of the variable in the script based on some business rules.

To use a variable in a script, the variable must be locked, accessed, and then unlocked. There are two ways of doing this: explicitly and implicitly. The explicit method uses the VariableDispenser object, which provides methods for locking variables for read-only or read-write access and then retrieving them. At one time, this was the standard way of accessing variables in scripts. The explicit locking mechanism allows control in the Script Task to keep two processes from competing for accessing and changing a variable. It also reduces the amount of time the variable is locked, but it forces the developer to write more code. To retrieve a variable using the VariableDispenser object, you have to deal with the implementation details of locking semantics, and write code like the following (ProSSIS\Code\Ch09_ProSSIS\13VarsScriptTask.dtsx):

C#

Variables vars = null;
String myval = null;
Dts.VariableDispenser.LockForRead("User::SomeStringVariable");
Dts.VariableDispenser.GetVariables(ref vars);
myval = vars[0].Value.ToString();
vars.Unlock(); //Needed to unlock the variables
System.Windows.Forms.MessageBox.Show(myval);
VB

Dim vars As Variables
Dim myval As String
Dts.VariableDispenser.LockForRead("User::SomeStringVariable")
Dts.VariableDispenser.GetVariables(vars)
myval = vars(0).Value.ToString()
vars.Unlock() 'Needed to unlock the variables
MsgBox(myval)
The implicit option of handling variables is the alternative to manually locking, using, and unlocking the variable. This option is best when you simply want the variables that you are using in a Script Task to be locked when you are reading and writing; you don’t want to worry about the locking implementation details. The Variables collection on the Dts object and the ReadOnlyVariables and ReadWriteVariables properties for the Script Task allow you to set up the implicit variable locking. The only constraint is that you have to define up front which variables going into the Script Task can be read but not written to versus both readable and writable. The ReadOnlyVariables and ReadWriteVariables properties tell the Script Task which variables to lock and how. The Variables collection in the Dts object is then populated with these variables. This simplifies the code to retrieve a variable, and the complexities of locking are abstracted,
so you have to worry about only one line of code to read a variable (ProSSIS\Code\Ch09_ProSSIS\13VarsScriptTask.dtsx):
C#

Dts.Variables["User::SomeStringVariable"].Value = "MyValue";
VB

Dts.Variables("User::SomeStringVariable").Value = "MyValue"
It is safest to use the fully qualified variable name, such as User::SomeStringVariable. Attempting to read a variable from the Variables collection that hasn't been specified in one of the variable properties of the task will throw an exception. Likewise, attempting to write to a variable not included in the ReadWriteVariables property also throws an exception. The biggest frustration for new SSIS developers writing VB script is dealing with the following error message:

Error: 0xc0914054 at VB Script Task: Failed to lock variable "SomestringVariable" for read access with error 0xc0910001 "The variable cannot be found. This occurs when an attempt is made to retrieve a variable from the Variables collection on a container during execution of the package, and the variable is not there. The variable name may have changed or the variable is not being created."
The resolution is simple: the variable name listed in the Script Task Editor and the variable name in the script don't match, so one must be changed to agree with the other. This tends to confuse VB developers in particular, because the VB language is not case sensitive. However, SSIS variable names are case sensitive, even within a VB script.
Note: Although Visual Basic .NET is not case sensitive, SSIS variables are.
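If you would rather detect a missing variable yourself than let the lock attempt fail, the Variables collection exposes a Contains method. A small hedged sketch (C#), assuming a variable named User::SomeStringVariable:

C#

if (Dts.Variables.Contains("User::SomeStringVariable"))
{
    // Safe to read: the variable exists and was listed in ReadOnlyVariables or ReadWriteVariables
    string myval = Dts.Variables["User::SomeStringVariable"].Value.ToString();
    System.Windows.Forms.MessageBox.Show(myval);
}
else
{
    // Surface a warning instead of failing with the lock error
    Dts.Events.FireWarning(0, "Script Task",
        "Variable User::SomeStringVariable was not passed in.", "", 0);
}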
Another issue that happens occasionally is that a developer can create more than one variable with the same name with different scopes. When this happens, you have to ensure that you explicitly refer to the variable by the fully qualified variable name. SSIS provides a Select Variables dialog, shown in Figure 9-9, that enables selection of the variables. Fortunately, the Script Task property for the ReadOnlyVariables or ReadWriteVariables is auto-filled with the fully qualified names: User::DecisionIntVar and User::DecisionStrVar. This reduces most of the common issues that can occur when passing variables into the Script Task. All this information will now come in handy as you walk through an example using the Script Task and variables to control SSIS package flow.

Figure 9-9
Example: Using Script Task Variables to Control Package Flow

This example sets up a Script Task that uses two variables to determine which one of two branches of Control Flow logic should be taken when the package executes. First, create a new SSIS package and set up these three variables:

Variable          Type      Value
DecisionIntVar    Int32     45
DecisionStrVar    String    txt
HappyPathEnum     Int32     0
Then drop three Script Tasks on the Control Flow design surface so that the package looks like Figure 9-10. There are two variables, DecisionIntVar and DecisionStrVar, that represent the number of rows determined to be in a file and the file extension, respectively. These variables are fed into the Script Task. Assume that these values have been set by another process. Logic in the Script Task will determine whether the package should execute the CRD File Path Script Task or the TXT File Script Task. The control of the package is handled by the other external variable named HappyPathEnum. If the value of this variable is equal to 1, then the TXT File Script Task will be executed. If the value of the variable is equal to 2, then the CRD File Path Script Task will be executed. Open the Script Task Editor for the Parent Script Task to set up the properties (see Figure 9-11). Set the Script Language and then use the ellipsis button to bring up the variable selection user interface (refer to Figure 9-9). Select the variables for ReadOnlyVariables and ReadWriteVariables separately if you are using this dialog. You can also type these variables in, but remember that the variable names are case sensitive. As shown in Figure 9-12, note the ordinal positions of the variables for this example.
Figure 9-10

Figure 9-11

Figure 9-12
Keep this script simple for demonstration purposes. The most important parts are the retrieving and setting of the variables. This code uses the named references for the variables for the retrieval of the variable values:

C#

int rowCnt = (int)Dts.Variables["User::DecisionIntVar"].Value;
VB

Dim rowCnt As Integer = Dts.Variables("User::DecisionIntVar").Value
The setting of variables uses the same syntax but reverses the assignment. The code that should be pasted into the Main() function of the ScriptMain class will evaluate the two variables and set the HappyPathEnum variable (ProSSIS\Code\Ch09_ProSSIS\05STVarControlFlow.dtsx):

C#

//Retrieving the value of Variables
int rowCnt = (int)Dts.Variables["User::DecisionIntVar"].Value;
string fileExt = (string)Dts.Variables["User::DecisionStrVar"].Value;

if (fileExt.Equals("txt") && rowCnt > 0)
{
    Dts.Variables["User::HappyPathEnum"].Value = 1;
}
else if (fileExt.Equals("crd") && rowCnt > 0)
{
    Dts.Variables["User::HappyPathEnum"].Value = 2;
}

Dts.TaskResult = (int)ScriptResults.Success;
VB

'Retrieving the value of Variables
Dim rowCnt As Integer = Dts.Variables("User::DecisionIntVar").Value
Dim fileExt As String = Dts.Variables("User::DecisionStrVar").Value

If (fileExt.Equals("txt") And rowCnt > 0) Then
    Dts.Variables("User::HappyPathEnum").Value = 1
ElseIf (fileExt.Equals("crd") And rowCnt > 0) Then
    Dts.Variables("User::HappyPathEnum").Value = 2
End If

Dts.TaskResult = ScriptResults.Success
To alter the flow of the package, set the two precedence constraints in the package hierarchy to be based on a successful completion of the previous Script Task and an expression that tests the expected values of the HappyPathEnum variable. This precedence specifies that the Control Flow should go in a direction only if the value of an expression tests true. Set the precedence between each Script Task to one of these expressions going to the TXT and CRD tasks, respectively:

@HappyPathEnum == 1

or

@HappyPathEnum == 2
A sample of the precedence between the Script Task and the TXT File Script Task should look like Figure 9-13. Now, to give the package something to do, simply retrieve the value of the set variable in each child Script Task to provide visual proof that the HappyPathEnum variable was properly set. Add this code into the Main() function of each child Script Task (make sure you set the message to display TXT or CRD for each associated Script Task) (ProSSIS\Code\Ch09_ProSSIS\05STVarControlFlow.dtsx):

C#
Figure 9-13
int ival = (int)Dts.Variables[0].Value;
string msg = string.Format("TXT File Found\nHappyPathEnum Value = {0}",
    Dts.Variables[0].Value.ToString());
System.Windows.Forms.MessageBox.Show(msg);
Dts.TaskResult = (int)ScriptResults.Success;
VB

Dim ival As Integer = Dts.Variables(0).Value
Dim msg As String = _
    String.Format("TXT File Found" + vbCrLf + "HappyPathEnum Value = {0}", _
    Dts.Variables(0).Value.ToString())
System.Windows.Forms.MessageBox.Show(msg)
Dts.TaskResult = ScriptResults.Success
To see how this works, set the value of the User::DecisionIntVar variable to a positive integer value, and the User::DecisionStrVar variable to either txt or crd, and watch the package switch from one Control Flow path to the other. If you provide a value other than txt or crd (even "txt" with quotes will cause this), the package will not run either leg, as designed. This is a simple example that you can refer back to as your packages get more complicated and you want to update variables within a Script Task. Later in this chapter, you'll see how the Script Component accesses variables in a slightly different way.
Connecting to Data Sources in a Script Task

A common use of an ETL package is to grab a connection to retrieve decision-making data from various data sources, such as Excel files, INI files, flat files, or databases like Oracle or Access. This capability allows other data sources to configure the package or to retrieve data for objects that can't use a direct connection object. In SSIS, with the Script Task you can make connections using any of the .NET libraries directly, or you can use connections that are defined in a package. Connections in SSIS are abstractions for connection strings that can be copied, passed around, and easily configured.
The Connections collection is a property of the Dts object in the Script Task. To retrieve a connection, you call the AcquireConnection method on a specific named (or ordinal position) connection in the collection. The only thing you really should know ahead of time is what type of connection you are going to be retrieving, because you need to cast the returned connection to the proper connection type. In .NET, connections are not generic. Examples of concrete connections are SqlConnection, OleDbConnection, OdbcConnection, and the OracleConnection managers that connect using SqlClient, OLE DB, ODBC, and even Oracle data access libraries, respectively. There are some things you can do to query the Connection Manager to determine what is in the connection string or whether it supports transactions, but you shouldn’t expect to use one connection in SSIS for everything, especially with the additional Connection Managers for FTP, HTTP, and WMI. Assuming that you’re up to speed on the different types of connections covered earlier in this book, it’s time to look at how you can use them in everyday SSIS Script Tasks.
Example: Retrieving Data into Variables from a Database

Although SSIS provides configurable abilities to set package-level values, there are use cases that require you to retrieve actionable values from a database that can be used for package Control Flow or other functional purposes. While this example could be designed using other components, we'll use it to show how to access variables from a script. For example, some variable aspect of the application may change, like an e-mail address to use for event notifications. In this example, you'll retrieve a log file path for a package at runtime using a connection within a Script Task. The database that contains the settings for the log file path stores this data using the package ID.

You first need a table in the AdventureWorks database called SSIS_SETTING. Create the table with three fields, PACKAGE_ID, SETTING, and VALUE, or use this script (ProSSIS\Scripts\Ch09_ProSSIS\Ch09_Table_Create_Script.sql):

CREATE TABLE [dbo].[SSIS_SETTING](
    [PACKAGE_ID] [uniqueidentifier] NOT NULL,
    [SETTING] [nvarchar](2080) NOT NULL,
    [VALUE] [nvarchar](2080) NOT NULL
) ON [PRIMARY]
GO

INSERT INTO SSIS_SETTING
SELECT '{INSERT YOUR PACKAGE ID HERE}', 'LOGFILEPATH', 'c:\myLogFile.txt'
You can find the package identifier in the properties of the package. Then create an SSIS package with one ADO.NET Connection Manager to the AdventureWorks database called AdventureWorks and add a package-level variable named LOGFILEPATH of type String. Add a Script Task to the project and send in two variables: a read-only variable
System::PackageID and a read-write variable User::LOGFILEPATH. Click the Edit Script button to open the Script project and add the namespace System.Data.SqlClient in the top of the class. Then add the following code to the Main() method (ProSSIS\Code\Ch09_ProSSIS\06aScriptDataIntoVariable.dtsx):
C#

public void Main()
{
    string myPackageId = Dts.Variables["System::PackageID"].Value.ToString();
    string myValue = string.Empty;
    string cmdString = "SELECT VALUE FROM SSIS_SETTING " +
        "WHERE PACKAGE_ID= @PACKAGEID And SETTING= @SETTINGID";
    try
    {
        SqlConnection mySqlConn =
            (SqlConnection)Dts.Connections[0].AcquireConnection(null);
        mySqlConn = new SqlConnection(mySqlConn.ConnectionString);
        mySqlConn.Open();
        SqlCommand cmd = new SqlCommand();
        cmd.CommandText = cmdString;
        SqlParameter parm = new SqlParameter("@PACKAGEID",
            SqlDbType.UniqueIdentifier);
        parm.Value = new Guid(myPackageId);
        cmd.Parameters.Add(parm);
        parm = new SqlParameter("@SETTINGID", SqlDbType.NVarChar);
        parm.Value = "LOGFILEPATH";
        cmd.Parameters.Add(parm);
        cmd.Connection = mySqlConn;
        cmd.CommandText = cmdString;
        SqlDataReader reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            myValue = reader["value"].ToString();
        }
        Dts.Variables["User::LOGFILEPATH"].Value = myValue;
        reader.Close();
        mySqlConn.Close();
        mySqlConn.Dispose();
    }
    catch
    {
        Dts.TaskResult = (int)ScriptResults.Failure;
        throw;
    }
    System.Windows.Forms.MessageBox.Show(myValue);
    Dts.TaskResult = (int)ScriptResults.Success;
}
VB

Public Sub Main()
    Dim myPackageId As String = _
        Dts.Variables("System::PackageID").Value.ToString()
    Dim myValue As String = String.Empty
    Dim cmdString As String = "SELECT VALUE FROM SSIS_SETTING " + _
        "WHERE PACKAGE_ID= @PACKAGEID And SETTING= @SETTINGID"
    Try
        Dim mySqlConn As SqlClient.SqlConnection
        mySqlConn = DirectCast(Dts.Connections(0).AcquireConnection(Nothing), _
            SqlClient.SqlConnection)
        mySqlConn = New SqlClient.SqlConnection(mySqlConn.ConnectionString)
        mySqlConn.Open()
        Dim cmd = New SqlClient.SqlCommand()
        cmd.CommandText = cmdString
        Dim parm As New SqlClient.SqlParameter("@PACKAGEID", _
            SqlDbType.UniqueIdentifier)
        parm.Value = New Guid(myPackageId)
        cmd.Parameters.Add(parm)
        parm = New SqlClient.SqlParameter("@SETTINGID", SqlDbType.NVarChar)
        parm.Value = "LOGFILEPATH"
        cmd.Parameters.Add(parm)
        cmd.Connection = mySqlConn
        cmd.CommandText = cmdString
        Dim reader As SqlClient.SqlDataReader = cmd.ExecuteReader()
        Do While (reader.Read())
            myValue = reader("value").ToString()
        Loop
        Dts.Variables("User::LOGFILEPATH").Value = myValue
        reader.Close()
        mySqlConn.Close()
        mySqlConn.Dispose()
    Catch ex As Exception
        Dts.TaskResult = ScriptResults.Failure
        Throw
    End Try
    System.Windows.Forms.MessageBox.Show(myValue)
    Dts.TaskResult = ScriptResults.Success
End Sub
In this code, the package ID is passed into the Script Task as a read-only variable and is used to build a T-SQL statement to retrieve the value of the LOGFILEPATH setting from the SSIS_SETTING table. The AcquireConnection method creates an instance of a connection to the AdventureWorks database managed by the Connection Manager and allows other SqlClient objects to access the data source. The retrieved setting from the SSIS_SETTING table is then stored in the writable variable LOGFILEPATH. This is a basic example, but you can use this exact same technique to retrieve a whole recordset into an object variable that can be iterated within your package as well; a short sketch of that variation follows, and the next example applies the same iteration idea to a list of files sitting on an FTP server.
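A minimal hedged sketch (C#) of the recordset variation: fill a DataTable and hand it to an Object-typed package variable (the name User::SettingsRecordset is purely illustrative) so that a Foreach Loop Container with the Foreach ADO enumerator can walk its rows.

C#

// Assumes using System.Data; and using System.Data.SqlClient;
// plus an Object-typed variable User::SettingsRecordset listed in ReadWriteVariables
SqlConnection conn = (SqlConnection)Dts.Connections[0].AcquireConnection(null);
conn = new SqlConnection(conn.ConnectionString);
conn.Open();

SqlDataAdapter da = new SqlDataAdapter(
    "SELECT PACKAGE_ID, SETTING, VALUE FROM SSIS_SETTING", conn);
DataTable settings = new DataTable();
da.Fill(settings);

// The Foreach ADO enumerator can iterate the rows of a DataTable stored in an Object variable
Dts.Variables["User::SettingsRecordset"].Value = settings;

conn.Close();
Dts.TaskResult = (int)ScriptResults.Success;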
Example: Retrieving Files from an FTP Server

A frequent source of data to use in a solution is files retrieved from an FTP server. SSIS provides an FTP Connection Manager and FTP Task to assist in this function. To use these objects, you need to know what file you want to retrieve from the FTP server. But what do you do if you don't know what the file name is, and you just want to pull everything from the server? This is a perfect use for a Script Task. The final package that we will create can be seen in Figure 9-14.

Figure 9-14

Begin by adding an FTP Connection Manager that points to your FTP server and a Script Task to your package. The Script Task will use one
read/write variable, named FileList, to pass back the list of files to be transferred from the FTP server. We can then add the following code inside the script (ProSSIS\Code\Ch09_ProSSIS\06bSTVariableForEachLoop.dtsx):

VB

Dim conn As ConnectionManager
Dim ftp As FtpClientConnection
Dim folderNames As String()
Dim fileNames As String()
Dim fileArray As New ArrayList

conn = Dts.Connections("FTPServer")
ftp = New FtpClientConnection(conn.AcquireConnection(Nothing))
ftp.Connect()
ftp.GetListing(folderNames, fileNames)
For Each s As String In fileNames
    fileArray.Add(s)
Next
Dts.Variables("FileList").Value = fileArray
ftp.Close()
Dts.TaskResult = ScriptResults.Success
C#

// ArrayList requires using System.Collections;
ConnectionManager conn = default(ConnectionManager);
FtpClientConnection ftp = default(FtpClientConnection);
string[] folderNames = null;
string[] fileNames = null;
ArrayList fileArray = new ArrayList();

conn = Dts.Connections["FTPServer"];
ftp = new FtpClientConnection(conn.AcquireConnection(null));
ftp.Connect();
ftp.GetListing(out folderNames, out fileNames);
foreach (string s in fileNames)
{
    fileArray.Add(s);
}
Dts.Variables["FileList"].Value = fileArray;
ftp.Close();
Dts.TaskResult = (int)ScriptResults.Success;
This code connects to the FTP server and returns a list of the files available for download. To allow the information to be used in a Foreach Loop Container, the file names are put into an ArrayList and then into the FileList variable. Our next step is to add the Foreach Loop Container, which will enumerate over the variable FileList. Each iteration will store the name of the file in the FileName variable. Finally, an FTP Task placed inside of the container will use the FileName variable as the source variable to retrieve the file.
With just a few steps, we were able to find out what files are available on the server and download all of them. Next we will look at saving information to an XML file.
Example: Saving Data to an XML File

Another common requirement is to generate data in a certain output format. When the output is a common format like Flat File, Excel, CSV, or another database format, you can simply pump the data stream into one of the Data Flow Destinations. If you want to save data to an XML file, the structure is not homogeneous, and it is not as easy to transform a column-based data stream into an XML structure without some logic around it. This is where the Script Task comes in handy.
Note: If you want to parse out the XML file and put the data into a destination, a Script Component could also be used here.
The easiest way to get data into an XML file is to load and save the contents of a data set using the WriteXml method on the DataSet. With a new Script Task in a package with an ADO.NET connection to AdventureWorks, add a reference to System.Xml.dll and then add the namespaces for System.Data.SqlClient, System.IO, and System.Xml. Code the following (ProSSIS\Code\Ch09_ProSSIS\07ScriptDataintoXMLFile.dtsx) into the Script Task to open a connection and get all the SSIS_SETTING rows and store them as XML:
Note: See the previous example for the DDL to create this table in the AdventureWorks database.

C#

public void Main()
{
    SqlConnection sqlConn;
    string cmdString = "SELECT * FROM SSIS_SETTING ";
    try
    {
        sqlConn = (SqlConnection)(Dts.Connections["AdventureWorks"])
            .AcquireConnection(Dts.Transaction);
        sqlConn = new SqlConnection(sqlConn.ConnectionString);
        sqlConn.Open();
        SqlCommand cmd = new SqlCommand(cmdString, sqlConn);
        SqlDataAdapter da = new SqlDataAdapter(cmd);
        DataSet ds = new DataSet();
        da.Fill(ds);
        ds.WriteXml(new System.IO.StreamWriter
            ("C:\\ProSSIS\\Files\\myPackageSettings.xml"));
        sqlConn.Close();
    }
    catch
    {
        Dts.TaskResult = (int)ScriptResults.Failure;
        throw;
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}
VB

Public Sub Main()
    Dim sqlConn As New SqlConnection
    Dim cmdString As String = "SELECT * FROM SSIS_SETTING "
    Try
        sqlConn = DirectCast(Dts.Connections("AdventureWorks") _
            .AcquireConnection(Dts.Transaction), SqlConnection)
        sqlConn = New SqlConnection(sqlConn.ConnectionString)
        sqlConn.Open()
        Dim cmd = New SqlCommand(cmdString, sqlConn)
        Dim da = New SqlDataAdapter(cmd)
        Dim ds = New DataSet
        da.Fill(ds)
        ds.WriteXml(New StreamWriter("C:\ProSSIS\Files\myPackageSettings.xml"))
        sqlConn.Close()
    Catch
        Dts.TaskResult = ScriptResults.Failure
        Throw
    End Try
    Dts.TaskResult = ScriptResults.Success
End Sub
There is not much to note about these results, except that the file is in XML format:
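Assuming the default names a DataSet picks up when it is filled this way (a NewDataSet root and Table row elements), the file should look roughly like this; the GUID shown is only a placeholder for your own package ID:

<NewDataSet>
  <Table>
    <PACKAGE_ID>34050406-2e0f-423a-8af3-1ec95399a6c2</PACKAGE_ID>
    <SETTING>LOGFILEPATH</SETTING>
    <VALUE>c:\myLogFile.txt</VALUE>
  </Table>
</NewDataSet>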
If you need more control of the data you are exporting, or you need to serialize data, you need to use the Script Task in a different way. The next example provides some tips on how to do this.
Example: Serializing Data to XML

In the last example, you looked at simply dumping database data into an XML format by loading data into a DataSet and using the WriteXml method to push the XML out to a file stream. If you need more control over the format, or the data is hierarchical, using .NET XML object-based serialization can be helpful. Imagine implementations that pull data from flat-file mainframe feeds and fill fully hierarchical object models. Alternatively, imagine serializing data into an object structure to pop an entry into an MSMQ application queue. This is easy to do using some of the same concepts.
Create another package with a connection to the AdventureWorks database; add a Script Task with a reference to the System.Data.SqlClient namespace. Use the data from the previous example and create a class structure within your ScriptMain to hold the values for each row of settings that looks like this (ProSSIS\Code\Ch09_ProSSIS\08ScriptDataintoSerializableObject.dtsx):

C#

[Serializable()]
public class SSISSetting
{
    public string PackageId { get; set; }
    public string Setting { get; set; }
    public string Value { get; set; }
}
VB

Public Class SSISSetting
    Private m_PackageId As String
    Private m_Setting As String
    Private m_Value As String

    Public Property PackageId() As String
        Get
            PackageId = m_PackageId
        End Get
        Set(ByVal Value As String)
            m_PackageId = Value
        End Set
    End Property

    Public Property Setting() As String
        Get
            Setting = m_Setting
        End Get
        Set(ByVal Value As String)
            m_Setting = Value
        End Set
    End Property

    Public Property Value() As String
        Get
            Value = m_Value
        End Get
        Set(ByVal Value As String)
            m_Value = Value
        End Set
    End Property
End Class
This class will be filled based on the data set shown in the last example. It is still a flat model, but more complex class structures would have collections within the class. An example would be a student object with a collection of classes, or an invoice with a collection of line items. To persist this type of data, you need to traverse multiple paths to fill the model. Once the model is filled, the rest is easy.
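As a purely hypothetical illustration (C#) of that kind of hierarchy, the nested collection is what turns the serialized output into nested XML; these invoice classes are not part of the chapter's sample code:

C#

// Hypothetical classes for illustration only
public class InvoiceLineItem
{
    public string ProductCode { get; set; }
    public int Quantity { get; set; }
    public decimal Price { get; set; }
}

public class Invoice
{
    public string InvoiceNumber { get; set; }
    public System.DateTime InvoiceDate { get; set; }

    // Each serialized Invoice element contains a nested LineItems element
    // with repeating InvoiceLineItem children
    public System.Collections.Generic.List<InvoiceLineItem> LineItems { get; set; }
}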
First, add the namespaces System.Xml.Serialization, System.Collections.Generic, System.IO, and System.Data.SqlClient to your Script Task project. A simple example with the SSIS_SETTING table would look like this (ProSSIS\Code\Ch09_ProSSIS\08ScriptDataintoSerializableObject.dtsx):

C#

public void Main()
{
    SqlConnection sqlConn;
    string cmdString = "SELECT * FROM SSIS_SETTING ";
    try
    {
        sqlConn = (SqlConnection)(Dts.Connections["AdventureWorks"])
            .AcquireConnection(Dts.Transaction);
        sqlConn = new SqlConnection(sqlConn.ConnectionString);
        sqlConn.Open();
        SqlCommand cmd = new SqlCommand(cmdString, sqlConn);
        SqlDataReader dR = cmd.ExecuteReader();
        List<SSISSetting> arrayListSettings = new List<SSISSetting>();
        while (dR.Read())
        {
            SSISSetting oSet = new SSISSetting();
            oSet.PackageId = dR["PACKAGE_ID"].ToString();
            oSet.Setting = dR["SETTING"].ToString();
            oSet.Value = dR["VALUE"].ToString();
            arrayListSettings.Add(oSet);
        }
        StreamWriter outfile = new StreamWriter
            ("C:\\ProSSIS\\Files\\myObjectXmlSettings.xml");
        XmlSerializer ser = new XmlSerializer(typeof(List<SSISSetting>));
        ser.Serialize(outfile, arrayListSettings);
        outfile.Close();
        outfile.Dispose();
        sqlConn.Close();
    }
    catch
    {
        Dts.TaskResult = (int)ScriptResults.Failure;
        throw;
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}
VB

Public Sub Main()
    Dim sqlConn As SqlConnection
    Dim cmdString As String = "SELECT * FROM SSIS_SETTING "
    Try
        sqlConn = DirectCast(Dts.Connections("AdventureWorks") _
            .AcquireConnection(Dts.Transaction), SqlConnection)
        sqlConn = New SqlConnection(sqlConn.ConnectionString)
        sqlConn.Open()
        Dim cmd As SqlCommand = New SqlCommand(cmdString, sqlConn)
        Dim dR As SqlDataReader = cmd.ExecuteReader()
        Dim arrayListSettings As New List(Of SSISSetting)
        Do While (dR.Read())
            Dim oSet As New SSISSetting()
            oSet.PackageId = dR("PACKAGE_ID").ToString()
            oSet.Setting = dR("SETTING").ToString()
            oSet.Value = dR("VALUE").ToString()
            arrayListSettings.Add(oSet)
        Loop
        Dim outfile As New StreamWriter("C:\ProSSIS\Files\myObjectXmlSettings.xml")
        Dim ser As New XmlSerializer(GetType(List(Of SSISSetting)))
        ser.Serialize(outfile, arrayListSettings)
        outfile.Close()
        outfile.Dispose()
        sqlConn.Close()
    Catch
        Dts.TaskResult = ScriptResults.Failure
        Throw
    End Try
    Dts.TaskResult = ScriptResults.Success
End Sub
Note: Keep in mind that while this example uses a connection directly in the code, you can also use an SSIS Connection Manager, as shown in the FTP example. Using a Connection Manager will make your package more portable to a production environment if you use parameters or configurations.
The StreamWriter here just gets an I/O stream from the file system to use for data output. The XmlSerializer does the heavy lifting and converts the data from the object format into an XML format. The only trick here is understanding how to deal with the generic List, or collection, of SSISSetting objects. This is handled by passing the closed generic list type to the XmlSerializer constructor (typeof(List<SSISSetting>) in C#, GetType(List(Of SSISSetting)) in VB); an overload also lets you supply additional types to the serializer if the object graph needs them. The resulting XML payload will now look like this:

<ArrayOfSSISSetting>
  <SSISSetting>
    <PackageId>34050406-2e0f-423a-8af3-1ec95399a6c2</PackageId>
    <Setting>LOGFILEPATH</Setting>
    <Value>c:\myLogFile.txt</Value>
  </SSISSetting>
</ArrayOfSSISSetting>
Although the XML content looks a little bit different from dumping the content of the recordset directly to XML as shown in the earlier example, it is optimized for object serialization. This is the type of content that you could push into application queues or share with external applications.
Raising an Event in a Script Task

All existing SSIS Tasks and Components raise events that can be captured and displayed by the Execution Results tab by default. Optionally, these events can also be captured and logged into SSIS logging or event handlers. Event handlers are Control Flows that you set up and define to respond to specific events. They are literally Control Flow workflows within a package, and they enable you to customize the diagnostic information that the packages can provide at runtime. If you have done any Windows GUI programming, you are familiar with events. An event is simply a message sent from some object saying that something just happened or is about to happen. To raise or fire an event within a Script Task, you use the Events property of the Dts object. More information about events can be found in Chapter 18.

The Events property on the Dts object is really an instance of the IDTSComponentEvents interface. This interface specifies seven methods for firing events:

➤ FireBreakpointHit: Supports the SQL Server infrastructure and is not intended to be used directly in code.

➤ FireError: Fires an event when an error occurs.

➤ FireInformation: Fires an event with information. You can fire this event when you want a set of information to be logged, possibly for auditing later.

➤ FireProgress: Fires an event when a certain progress level has been met.

➤ FireQueryCancel: Fires an event to determine if package execution should stop.

➤ FireWarning: Fires an event that is less serious than an error but more than just information.

➤ FireCustomEvent: Fires a custom-defined event.
In SSIS, any events you fire are written to all enabled log handlers that are set to log that event. Logging enables you to check what happened with your script when you’re not there to watch it run. Using events is a best practice for troubleshooting and auditing purposes, as you’ll see in the following example.
Example: Raising Some Events

The default way to view events while designing your package is to use the Execution Results tab at the top of your package in the SQL Server Data Tools design environment. To fire off some sample events and view them in this Execution Results tab, create a new package with a Script Task and add the System variable System::TaskName as a read-only variable. Then add the following code to the Main() function (ProSSIS\Code\Ch09_ProSSIS\09RaisingEvents.dtsx):

C#

public void Main()
{
    string taskName = Dts.Variables["System::TaskName"].Value.ToString();
    bool retVal = false;
    Dts.Events.FireInformation(0, taskName,
        String.Format("Starting Loop Operation at {0} ",
        DateTime.Now.ToString("MM/dd/yyyy hh:mm:ss")), "", 0, ref retVal);
    for (int i = 0; i <= 10; i++)
    {
        Dts.Events.FireProgress(String.Format("Loop in iteration {0}", i),
            i * 10, 0, 10, taskName, ref retVal);
    }
    Dts.Events.FireInformation(0, taskName,
        String.Format("Completion Loop Operation at {0} ",
        DateTime.Now.ToString("MM/dd/yyyy hh:mm:ss")), "", 0, ref retVal);
    Dts.Events.FireWarning(1, taskName,
        "This is a warning we want to pay attention to...", "", 0);
    Dts.Events.FireWarning(2, taskName,
        "This is a warning for debugging only...", "", 0);
    Dts.Events.FireError(0, taskName, "If we had an error it would be here", "", 0);
}
VB

Public Sub Main()
    Dim i As Integer = 0
    Dim taskName As String = Dts.Variables("System::TaskName").Value.ToString()
    Dim retVal As Boolean = False
    Dts.Events.FireInformation(0, taskName, _
        String.Format("Starting Loop Operation at {0} ", _
        DateTime.Now.ToString("MM/dd/yyyy hh:mm:ss")), "", 0, retVal)
    For i = 0 To 10
        Dts.Events.FireProgress( _
            String.Format("Loop in iteration {0}", i), _
            i * 10, 0, 10, taskName, retVal)
    Next
    Dts.Events.FireInformation(0, taskName, _
        String.Format("Completion Loop Operation at {0} ", _
        DateTime.Now.ToString("MM/dd/yyyy hh:mm:ss")), "", 0, retVal)
    Dts.Events.FireWarning(1, taskName, _
        "This is a warning we want to pay attention to ...", _
        "", 0)
    Dts.Events.FireWarning(2, taskName, _
        "This is a warning for debugging only ...", _
        "", 0)
    Dts.Events.FireError(0, taskName, _
        "If we had an error it would be here", "", 0)
End Sub
This code will perform a simple loop operation that demonstrates firing the information, progress, warning, and error events. If you run the package, you can view the information embedded in these fire event statements in the final tab, either named Execution Results or Progress, depending on
whether the designer is in Debug mode or not. These events are shown in Figure 9-15. Note that raising the error event results in the Script Task's failure. All the statements prefixed with the string [Script Task] were generated by the events fired from the Script Task. You can comment out the Dts.Events.FireError method call to demonstrate to yourself that the task can complete successfully with only warnings and informational events. Note that with the firing of an error, you can also force the task to generate a custom error with an error code and description. In fact, each of the events has a placeholder as the first parameter to store a custom information code. Continue to the next example to see how you can create an error handler to respond to the warning events that are fired from this Script Task.

Figure 9-15
Example: Responding to an Event

If you have already created a package for the preceding example, navigate to the Event Handlers tab. Event handlers are separate Control Flows that can be executed in response to an event. In the Raising Some Events example, you generated two warning events. One had an information code of one (1) and the other had the value of two (2). In this example, you are going to add an event handler to respond to those warning events and add some logic to respond to the event if the information code is equal to one (1). Select the Script Task executable and then select the OnWarning event handler. Click the hot link that states the following:

Click here to create an 'OnWarning' event handler for executable 'Script Task'
This will create a Control Flow surface onto which you can drop SSIS Control Tasks that will execute if an OnWarning event is thrown from the Script Task you added to the package earlier. Drop a new Script Task into the Event Handler Control Flow surface and name it OnWarning Script Task. Your designer should look like Figure 9-16.
Figure 9-16
To retrieve the information code sent in the Dts.Events.FireWarning method call, add two system-level variables, System::ErrorCode and System::ErrorDescription, to the Read-Only Variables collection of the OnWarning Script Task. These variables will contain the values of the InformationCode and Description parameters of the Dts.Events methods. You can then retrieve and evaluate these values when an event is raised by adding the following code (ProSSIS\Code\Ch09_ProSSIS\09RaisingEvents.dtsx):

C#

long lWarningCode = long.Parse(Dts.Variables[0].Value.ToString());
String sMsg = string.Empty;
if (lWarningCode == 1)
{
    sMsg = String.Format(
        "Would do something with this warning:\n{0}: {1}",
        lWarningCode.ToString(), Dts.Variables[1].Value.ToString());
    System.Windows.Forms.MessageBox.Show(sMsg);
}
Dts.TaskResult = (int)ScriptResults.Success;
VB

Dim lWarningCode As Long = _
    Long.Parse(Dts.Variables(0).Value.ToString())
Dim sMsg As String
If lWarningCode = 1 Then
    sMsg = String.Format("Would do something with this warning: " _
        + vbCrLf + "{0}: {1}", _
        lWarningCode.ToString(), Dts.Variables(1).Value.ToString())
    System.Windows.Forms.MessageBox.Show(sMsg)
End If
Dts.TaskResult = ScriptResults.Success
The code checks the value of the first parameter, which is the value of the System::ErrorCode and the value raised in the Dts.Events.FireWarning method. If the value is equivalent to one (1), an action is taken to show a message box. This action could just as well be logging an entry to a database or sending an e-mail. If you rerun the package now, you’ll see that the first FireWarning event will be handled in your event handler and generate a message box warning. The second FireWarning event will also be captured by the event handler, but no response is made. The event handler counter in the Progress or Execution Results tab is incremented to two (2). Raising events in the Script Tasks is a great way to get good diagnostic information without resorting to message boxes in your packages. See Chapter 18 for more details about handling errors and events in SSIS.
Example: Logging Event Information

Scripts can also be used to fire custom event information, which can then be logged as described previously. To configure the previous example events SSIS package to log event information, go to SSIS ➪ Logging in the SQL Server Data Tools application. The Configure SSIS Logs dialog will appear. Select "SSIS log provider for XML files" in the Provider Type dropdown and click Add.
Click the Configuration column and then select <New connection...> from the list; this opens the File Connection Manager Editor. For Usage type, select Create File and specify a path to a filename similar to C:\ProSSIS\Files\myLogFile.xml.
Note: In a production package you would set this value using an expression or parameter at runtime.
Click OK to close the File Connection Manager Editor dialog box. Your screen should look something like Figure 9-17.
Figure 9-17
Now click the Package Node to start selecting what tasks in the package should log to the new provider, and check the box next to the provider name so that the log will be used. In the Details tab, select the specific OnWarning events to log. You can choose to log any of the available event types to the providers by also selecting them in the Details tab. Now your provider configuration should look like Figure 9-18.
Figure 9-18
You can also use the Advanced tab for each selected event to control exactly what properties of the event are logged as well. If you run the package again, the file specified in the logging provider will be created with content similar to the following:

<record>
  <event>OnWarning</event>
  <message>This is a warning we want to pay attention to ...</message>
  <computer>MYCOMPUTER</computer>
  <operator>MYCOMPUTER\ADMIN</operator>
  <sourceid>{D86FF397-6C9B-4AD9-BACF-B4D41AC89ECB}</sourceid>
  <executionid>{8B6F6392-1818-4EE5-87BF-EDCB5DC37ACB}</executionid>
  <starttime>1/22/2012 9:30:08 PM</starttime>
  <endtime>1/22/2012 9:30:08 PM</endtime>
  <datacode>2</datacode>
  <databytes>0x</databytes>
</record>
You’ll have other events in the file too, such as Package Start and Package End, but the preceding code snippet focuses on the event that your code fired. This record contains basic information about the event, including the message, event execution time, and the computer and user that raised the event. Using the Script Task to raise an event is just one way to get more diagnostic information into your SSIS log files. Read on to get a brief look at generating simple log entries.
Writing a Log Entry in a Script Task

Within a Script Task, the Log method of the Dts object writes a message to all enabled log providers. The Log method is simple and has three arguments:

➤ messageText: The message to log

➤ dataCode: A field for logging a message code

➤ dataBytes: A field for logging binary data
The Log method is similar to the FireInformation method of the Events property, but it is easier to use and more efficient — and you don’t need to create a specialized event handler to respond to the method call. All you need to do is set up a log provider within the package. In the previous section, you learned how to add a log provider to a package. The code in the next section logs a simple message with some binary data to all available log providers. This is quite useful for troubleshooting and auditing purposes. You can write out information at important steps in your script and even print out variable values to help you track down a problem.
Example: Scripting a Log Entry

This example demonstrates how to script a log entry by adding a few lines of code to the package in the previous examples that you used to raise events. First, add the following lines to the appropriate Script Task that matches the language you chose in the previous example (ProSSIS\Code\Ch09_ProSSIS\09RaisingEvents.dtsx):

C#

byte[] myByteArray = new byte[0];
Dts.Log("Called procedure: usp_Upsert with return code 4", 0, myByteArray);
VB

Dim myByteArray(0) As Byte
Dts.Log("Called procedure: usp_Upsert with return code 4", 0, myByteArray)
Next, select the ScriptTaskLogEntry event in the Details tab of the logging configuration. This tells the SSIS package logger to handle any custom logging instructions such as the one you just coded. Then run the package. You'll see an additional logging entry that looks like this:

<record>
  <event>User:ScriptTaskLogEntry</event>
  <message>Called Procedure: usp_Upsert with return code 4</message>
  <computer>MYCOMPUTER</computer>
  <operator>MYCOMPUTER\ADMIN</operator>
  <sourceid>{CE53C1BB-7757-47FF-B173-E6088DA0A2A3}</sourceid>
  <executionid>{B7828A35-C236-451E-99DE-F679CF808D91}</executionid>
  <starttime>4/27/2008 2:54:04 PM</starttime>
  <endtime>4/27/2008 2:54:04 PM</endtime>
  <datacode>0</datacode>
  <databytes>0x</databytes>
</record>
As you can see, the Script Task is highly flexible with the inclusion of the .NET-based VSTA capabilities. As far as controlling package flow or one-off activities, the Script Task is clearly very important. However, the Script Task doesn’t do all things well. If you want to apply programmatic logic to the Data Flow in an SSIS package, then you need to add to your knowledge of scripting in SSIS with the Script Component.
Using the Script Component

The Script Component provides another area where programming logic can be applied in an SSIS package. This component, which can be used only in the Data Flow portion of an SSIS package, allows programmatic tasks to occur in the data stream. This component exists to provide, consume, or transform data using .NET code. To differentiate between the various uses of the Script Component, when you create one you have to choose one of the following three types:

➤ Source Type Component: The role of this Script Component is to provide data to your Data Flow Task. You can define outputs and their types and use script code to populate them. An example would be reading in a complex file format, possibly XML or something that requires custom coding to read, like HTTP or RSS Sources.

➤ Destination Type Component: This type of Script Component consumes data much like an Excel or Flat File Destination. This component is the end of the line for the data in your data stream. Here, you'll typically put the data into a DataSet variable to pass back to the Control Flow for further processing, or send the stream to custom output destinations not supported by built-in SSIS components. Examples of these output destinations can be web service calls, custom XML formats, and multi-record formats for mainframe systems. You can even programmatically connect and send a stream to a printer object.

➤ Transformation Type Component: This type of Script Component can perform custom transformations on data. It consumes input columns and produces output columns. You would use this component when one of the built-in transformations just isn't flexible enough.
In this section, you’ll get up to speed on all the specifics of the Script Component, starting first with an explanation of the differences between the Script Task and the Script Component, and then looking at the coding differences in the two models. Finally, you’ll see an example of each implementation type of the Script Component to put all of this information to use.
Differences from a Script Task

You might ask, "Why are there two controls, both the Script Task and the Script Component?" Well, underlying the SSIS architecture are two different implementations that define how the VSTA environment is used for performance. Each Script Task is called only once within a Control Flow, unless it is in a looping control. The Script Component has to be higher octane because it is going to be called per row of data in the data stream. You are also in the context of being able to access the data buffers directly, so you will be able to perform more tasks.

When you are working with these two controls, the bottom line is that there are slightly different ways of doing the same types of things in each. This section of the chapter cycles back through some
of the things you did with the Script Task and points out the differences. First you’ll look at the differences in configuring the editor. Then you’ll see what changes when performing programmatic tasks such as accessing variables, using connections, raising events, and logging. Finally, you’ll look at an example that ties everything together.
Configuring the Script Component Editor

You'll notice the differences starting with the item editor. Adding a Script Component to the Data Flow designer brings up the editor shown in Figure 9-19, requesting the component type.
Note: In order to add the Script Component, you must first add a Data Flow Task to a package.
Selecting one of these options changes how the editor is displayed to configure the control. Essentially, you are choosing whether the control has input buffers, output buffers, or both. Figure 9-20 shows an example of a Script Component Transformation that has both buffers.
Figure 9-19
Figure 9-20
The Script Component Source has only output buffers available, and the Script Component Destination has only input buffers available. You are responsible for defining these buffers by providing the set of typed columns for either the input or outputs. If the data is being fed into the
component, the editor can set these up for you. Otherwise, you have to define them yourself. You can do this programmatically in the code, or ahead of time using the editor. Just select the input or output columns collection on the user interface, and click the Add Column button to add a column, as shown in Figure 9-21. A helpful tip is to select the Output Columns node on the tree view, so that the new column is added to the bottom of the collection. Once you add a column, you can't move it up or down. After adding the column, you need to set the Data Type, Length, Precision, and Scale. For details about the SSIS data types, see Chapter 5.

Figure 9-21
When you access the scripting environment, you'll notice some additional differences between the Script Component and the Script Task. Namely, some new classes have been added to the Solution Explorer, as shown in Figure 9-22. The name of the class that is used to host custom code is different from that used for the Script Task. Rather than ScriptMain, the class is called main. Internally there are also some differences. The primary difference is the existence of more than one entry point method. The methods you'll see in the main class depend upon the Script Component type. At least three of the following methods are typically coded and can be used as entry points in the Script Component (a skeleton of these overrides follows the list):

➤ PreExecute is used for preprocessing tasks like creating expensive connections or file streams.

➤ PostExecute is used for cleanup tasks or setting variables once all rows have been processed.

➤ CreateNewOutputRows is the method to manage the output buffers.

➤ Input0_ProcessInputRow is the method to manage anything coming from the input buffers. Note that the Input0 part of the name will differ based on the name of the input set in the editor.
Figure 9-22
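A rough hedged sketch (C#) of how these overrides sit in the generated component class; the exact set you see depends on the component type and on the input and output names you define in the editor, and SomeColumn is only a placeholder:

C#

public override void PreExecute()
{
    base.PreExecute();
    // One-time setup, such as opening an expensive connection or file stream
}

// Generated for Source components (and asynchronous outputs): push rows into the output buffer
public override void CreateNewOutputRows()
{
    // Output0Buffer.AddRow();
    // Output0Buffer.SomeColumn = ...;
}

// Generated when the component has an input named "Input 0": called once per incoming row
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Inspect or modify the columns exposed on Row here
}

public override void PostExecute()
{
    base.PostExecute();
    // Cleanup, or write read-write variables once, after all rows are processed
}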
The remaining classes are generated automatically based on your input and output columns when you enter into the script environment, so don’t make any changes to these; otherwise, they will be overwritten when you reenter the script environment. One problem you might encounter in the Script Component Editor and the generation of the BufferWrapper class is that you can name columns in the editor that use keywords or are otherwise
invalid when the BufferWrapper class is generated. An example would be an output column named 125K_AMOUNT. If you create such a column, you'll get an error in the BufferWrapper class stating the following:

Invalid Token 125 in class, struct, or interface member declaration
Don't attempt to change the property in the buffer class to something like _125K_AMOUNT, because this property is rebuilt the next time you edit the script. Change the name of the output column to _125K_AMOUNT, and the buffer class will change automatically. The biggest difference that you need to pay attention to with the Script Component is that if you make any changes to this editor, you'll need to open the script environment so that all these base classes can be generated.

Last, but not least, you'll notice a Connection Managers tab that is not available in the Script Task Editor. This enables you to name specifically the connections that you want to be able to access within the Script Component. Although you are not required to name these connections up front, it is extremely helpful to do so. You'll see why later, when you connect to a data source. Figure 9-23 shows an example of the AdventureWorks connection added to a Script Component.

Figure 9-23

Now that you understand the differences between the Script Task and Script Component from a setup perspective, you can examine how the coding differs.
Accessing Variables in a Script Component

The same concepts behind accessing variables also apply to the Script Component. You can send the variables into the control by adding them to the ReadOnlyVariables or ReadWriteVariables properties of the editor. You can also choose not to specify them up front and just use the variable dispenser within your Script Component to access, lock, and manipulate variables. We recommend using the properties in the editor for this component because the variables provided in the editor are added to the auto-generated base class variables collection as strongly typed variables. In this control, adding variables to the editor not only removes the need to lock and unlock the variables but also means you don't have to remember the variable name within the component. Keep in mind that variables can't be modified everywhere in the Script Component; read-write variables, for example, can only be written in the PostExecute method. Here's an example of setting the variable ValidationErrors within a Script Component:

C#

this.Variables.ValidationErrors = 1;
VB

Me.Variables.ValidationErrors = 1
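A small hedged sketch (C#) of that PostExecute pattern, assuming a ReadWriteVariables entry named ValidationErrors and a generated input buffer with a column called SomeColumn (both names are placeholders for whatever exists in your own component):

C#

private int errorCount = 0;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Count problems per row; writing to Variables here would fail at runtime
    if (Row.SomeColumn_IsNull)
    {
        errorCount++;
    }
}

public override void PostExecute()
{
    base.PostExecute();
    // Read-write variables are only writable here, after all rows have been processed
    this.Variables.ValidationErrors = errorCount;
}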
www.it-ebooks.info
c09.indd 313
3/24/2014 9:18:50 AM
314╇
❘╇ CHAPTER 9╇ Scripting in SSIS
As you can see, using variables is easier and more maintainable than in the Script Task because the variable names are available in IntelliSense and checked at compile time. However, if you don’t want to add a variable to each Script Component for some reason, you can still use the variable dispenser in this component. It is located on the base class and can be accessed using the base class, instead of the Dts object. Other than these differences, the variable examples in the Script Task section of this chapter are still applicable. The remaining tasks of connecting to data sources, raising events, and logging follow a similar pattern. The methods for performing the tasks are more strongly named, which makes sense because any late binding (or runtime type checking) within a high-performing Data Flow Task would slow it down.
Connecting to Data Sources in a Script Component

A typical use of a connection is in the Source type of the Script Component, because in these types of Data Flow Tasks, the mission is to create a data stream. The origination of that data is usually another external source. If you had a defined SSIS Source Component, then it would be used and you wouldn't need the Script Component to connect to it. The coding to connect to a Connection Manager is very simple. You can instantiate a specific Connection Manager and assign the reference to a connection in the component's collection.

Using the connections collection in the Script Component is very similar to using the variables collection. The collection of strongly typed Connection Managers is created every time the script editor is opened. Again, this is helpful because you don't have to remember the names, and you get compile-time verification and checking. For example, if you had a package with an OLE DB Connection Manager named myOracleServer and added it to the Script Component with the name OracleConnection, you'd have access to the connection using this code:

C#

ConnectionManagerOleDb oracleConnection =
    (ConnectionManagerOleDb)base.Connections.OracleConnection;
VB

Dim oracleConnection As ConnectionManagerOleDb
oracleConnection = Connections.OracleConnection
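When the script needs an open connection rather than just the Connection Manager reference, a common pattern is to acquire it in the component's AcquireConnections override and give it back in ReleaseConnections. A hedged sketch (C#), assuming an ADO.NET (SqlClient) Connection Manager was added on the Connection Managers tab with the name AdventureWorks:

C#

// Assumes using System.Data.SqlClient; and the Microsoft.SqlServer.Dts.Pipeline.Wrapper namespace
private IDTSConnectionManager100 connMgr;
private SqlConnection sqlConn;

public override void AcquireConnections(object Transaction)
{
    // The strongly typed Connections property is generated from the Connection Managers tab
    connMgr = this.Connections.AdventureWorks;
    sqlConn = (SqlConnection)connMgr.AcquireConnection(null);
}

public override void ReleaseConnections()
{
    connMgr.ReleaseConnection(sqlConn);
}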
Raising Events

For the Script Task, you've looked at SSIS's ability to raise events, and you walked through some examples that demonstrated its scripting capabilities for managing how the package can respond to these events. These same capabilities exist in Script Components, although you need to keep in mind that Script Components run in a data pipeline or stream, so the potential for repeated calls is highly likely. You should fire events sparingly within a Script Component that is generating or processing data in the pipeline to reduce overhead and increase performance. The methods are essentially the same, but without the static Dts object.

Note: Event handling is covered in more detail in Chapter 18.
Here is the code to raise an informational event in a Script Component (ProSSIS\Code\Ch09_ProSSIS\09RaisingEvents.dtsx):
C#
Boolean myBool = false;
this.ComponentMetaData.FireInformation(0, "myScriptComponent",
    "Removed non-ASCII Character", "", 0, ref myBool);
VB
Dim myBool As Boolean
Me.ComponentMetaData.FireInformation(0, _
    "myScriptComponent", "Removed non-ASCII Character", "", 0, myBool)
Either version of code will generate an event in the Progress Tab that looks like this:

[myScriptComponent] Information: Removed non-ASCII Character
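The same ComponentMetaData object also exposes FireWarning and FireError for higher-severity events. The following C# sketch is illustrative only: the error codes, subcomponent names, and messages are invented, and the cancel-flag behavior shown for FireError should be confirmed against the documentation for your SSIS version.

C#
// Raise a warning without interrupting the Data Flow
this.ComponentMetaData.FireWarning(0, "myScriptComponent",
    "Row contained suspicious characters", "", 0);

// Raise an error; the out parameter is returned by the runtime
// to indicate whether execution should be cancelled
bool cancel;
this.ComponentMetaData.FireError(0, "myScriptComponent",
    "Row could not be processed", "", 0, out cancel);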
Raising an event is preferred to logging because it enables you to develop a separate workflow for handling the event, but in some instances logging may be preferred.
Logging

Like the Script Task, logging in the Script Component writes a message to all enabled log providers. It has the same interface as the Script Task, but it is exposed on the base class. Remember that Script Components run in a data pipeline or stream, so the potential for repeated calls is highly likely. Follow the same rules as those for raising events, and log sparingly within a Script Component that is generating or processing data in the pipeline to reduce overhead and increase performance. If you need to log a message within a Data Flow, you can improve performance by logging only in the PostExecute method, so that the results are logged only once.
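For example, a common pattern in a transformation-style Script Component is to count conditions row by row and then write a single log entry when the component finishes. The following C# sketch uses a hypothetical counter named rowsCleansed; the column work inside the row method is only a placeholder.

C#
// Hypothetical counter maintained while rows stream through the component
private int rowsCleansed = 0;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // ... per-row cleansing work would go here ...
    rowsCleansed++;
}

public override void PostExecute()
{
    base.PostExecute();

    // One log entry for the whole Data Flow instead of one per row
    this.Log("Rows cleansed: " + rowsCleansed.ToString(), 0, new byte[0]);
}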
Example: Scripting a Log Entry

This example shows how to log one informational entry to the log file providers at the end of a Data Flow Task. To use this code, create a package with a Data Flow Task and add a Script Component as a source with one output column named NewOutputColumn. Create these integer variables as private fields of the main script class: validationCharErrors, validationLengthErrors, and validationFormatErrors. Then add the following code to the CreateNewOutputRows() method (ProSSIS\Code\Ch09_ProSSIS\11aSCBasicLogging.dtsx):

C#
int validationLengthErrors = 0;
int validationCharErrors = 0;
int validationFormatErrors = 0;

//..in the CreateNewOutputRows() method
string validationMsg = string.Format("Validation Errors:\nBad Chars {0}\nInvalid Length " +
    "{1}\nInvalid Format {2}", validationCharErrors, validationLengthErrors,
    validationFormatErrors);
this.Log(validationMsg, 0, new byte[0]);

//This is how to add rows to the Output0Buffer output collection.
Output0Buffer.AddRow();
Output0Buffer.NewOutputColumn = 1;
VB
Dim validationLengthErrors As Integer = 0
Dim validationCharErrors As Integer = 0
Dim validationFormatErrors As Integer = 0

'..in the CreateNewOutputRows() method
Dim validationMsg As String
validationMsg = String.Format("Validation Errors:" + _
    vbCrLf + "Bad Chars {0}" + _
    vbCrLf + "Invalid Length {1}" + _
    vbCrLf + "Invalid Format {2}", _
    validationCharErrors, validationLengthErrors, _
    validationFormatErrors)
Dim myByteArray(0) As Byte
Me.Log(validationMsg, 0, myByteArray)

Output0Buffer.AddRow()
Output0Buffer.NewOutputColumn = 1
In order for this sample to produce a log entry, remember that you have to set up a logging provider (use the menu option SSIS ➪ Logging). Make sure you specifically select the Data Flow Task in which the Script Component is hosted within SSIS and the logging events specifically for the Script Component. Running the package will produce logging similar to this:

User:ScriptComponentLogEntry,MYPC,MYPC\ADMIN,"CSharp Basic Logging Script Component" (1),
{00000001-0000-0000-0000-000000000000},{3651D743-D7F6-43F8-8DE2-F7B40423CC28},
4/27/2012 10:38:56 PM,4/27/2008 10:38:56 PM,0,0x,
Validation Errors:
Bad Chars 0
Invalid Length 0
Invalid Format 0
OnPipelinePostPrimeOutput,MYPC,MYPC\ADMIN,Data Flow Task,
{D2118DFD-DAEE-470B-9AC3-9B01DFAA993E},{3651D743-D7F6-43F8-8DE2-F7B40423CC28},
4/27/2008 10:38:55 PM,4/27/2008 10:38:55 PM,0,0x,
A component has returned from its PrimeOutput call. : 1 : CSharp Basic Logging Script Component
Example: Data Validation

Compared to the Script Task, the Script Component has a steeper learning curve. The example presented in this section is more comprehensive and should enable you to get the bigger picture of how you can use this component in your everyday package development. A typical use of the Script Component is to validate data within a Data Flow. In this example, contact information from a custom application did not validate its data entry, resulting in poor data quality. Because the destination database has a strict set of requirements for the data, your task is to validate the contact information from a Flat File Source and separate valid from invalid records into two streams: the good stream and the error stream. The good records can continue to another Data Flow; the error records will be sent to an error table for manual cleansing.
Create the contacts table with the following script (ProSSIS\Scripts\Ch09_ProSSIS\Ch09_Table_Create_Script.sql):

CREATE TABLE [dbo].[Contacts](
    [ContactID] [int] IDENTITY(1,1) NOT NULL,
    [FirstName] [varchar](50) NOT NULL,
    [LastName] [varchar](50) NOT NULL,
    [City] [varchar](25) NOT NULL,
    [State] [varchar](15) NOT NULL,
    [Zip] [char](11) NULL
) ON [PRIMARY]
The error queue table is virtually identical except it has no strict requirements and a column has been added to capture the rejection reason. All data fields are nullable and set to the maximum known size (ProSSIS\Scripts\Ch09_ProSSIS\Ch09_Table_Create_Script.sql):

CREATE TABLE dbo.ContactsErrorQueue (
    ContactErrorID int NOT NULL IDENTITY (1, 1),
    FirstName varchar(50) NULL,
    LastName varchar(50) NULL,
    City varchar(50) NULL,
    State varchar(50) NULL,
    Zip varchar(50) NULL,
    RejectReason varchar(50) NULL
) ON [PRIMARY]
Finally, the incoming data format is fixed-width and is defined as follows:

Field         Starting Position   New Field Name
First Name    1                   FirstName
Last Name     11                  LastName
City          26                  City
State         44                  State
Zip           52                  Zip
The data file provided as a test sample looks like this (ProSSIS\Files\Ch09_ProSSIS\contacts.dat):

Jason     Gerard         Jacksonville      FL      32276-1911
Joseph    McClung        JACKSONVILLE      FLORIDA 322763939
Andrei    Ranga          Jax               fl      32276
Chad      Crisostomo     Orlando           FL      32746
Andrew    Ranger         Jax               fl
Create a sample of this data file or download a copy from www.wrox.com/go/prossis2014. Create a new package and add a Data Flow Task. Click on the Data Flow design surface and add a Flat File Connection Manager to the Connection Managers tab. Name the Connection Manager "Contacts Mainframe Extract," browse to the data file, and set the file format to Ragged Right. Flat files with spaces at the end of the specifications are typically difficult to process in some ETL platforms. The Ragged Right option in SSIS provides a way to handle these easily without having to run the file through a Script Task to put a character into a consistent spot or without having the origination system reformat its extract files.

Use the Columns tab to visually define the columns. Flip to the Advanced tab to define each of the column names, types, and widths to match the desired values and the new database field name. (You may need to delete an unused column if one is added by the designer.) The designer at this point looks like Figure 9-24. Typically, you may want to define some data with strong types. You can decide to do that here in the Connection Manager, or you can do so later using a derived column, depending on how confident you are in the source of the data. If the data source is completely unreliable, import the data using Unicode strings and use your Data Flow Tasks to validate the data. Then move good data into a strong data type using the Derived Column Transformation.

On the Data Flow surface, drag a Flat File Source to the Data Flow editor pane. Edit the Flat File Source and set the Connection Manager to the Contacts Mainframe Extract Connection Manager. This sets up the origination of the data to stream into the Data Flow Task. Check the box labeled "Retain null values from the source as null values in the Data Flow." This feature provides for consistent testing of null values later.

Now add a Script Component to the Data Flow. When you drop the Script Component, you will be prompted to pick the type of component to create. Select Transformation and click OK. Connect the output of the Flat File Source to the Script Component to pipe the data into this component, where you can program some validation on the data. Open the Script Component and set the ScriptLanguage property to the language of your choice.

On the Input Columns tab, you will notice that Input Name is a dropdown with the name Input 0. It is possible to have more than one source pointed to this Script Component. If so, this dropdown would allow you to individually configure the inputs and select the columns from each input. For this example, select all the input columns. Set the Usage Type for the State and Zip columns to ReadWrite. The reason will become clear later.
Select the Inputs and Outputs tab to see the collection of inputs and outputs and the input columns defined previously. Here you can create additional input and output buffers and columns within each. Expand all the nodes and add these two output columns:
Column Name     Type       Size
GoodFlag        DT_BOOL    N/A
RejectReason    DT_STR     50
You'll use the flag to separate the good data from the error data in the stream. The rejection reason will be useful to the person who has to perform any manual work on the data later. The designer with all nodes expanded should look like Figure 9-25.

Back on the Script tab, click the Edit Script button to enter the VSTA scripting IDE. In the main class, the rules for validation need to be programmatically applied to each data row. In the Input0_ProcessInputRow method that was code-generated by SSIS using the Script Component designer, add the rules for data validation:

➤ All fields are required except for the zip code.
➤ The zip code must be in the format #####-#### or ##### and use numeric digits from 0 through 9. If the zip code is valid for the first five characters but the whole string is not, strip the trailing characters and use the first five.
➤ The state must be two uppercase characters.

Figure 9-25
Here’s the overall plan: the contents of the file will be sent into the Script Component. This is where programmatic control will be applied to each row processed. The incoming row has three data fields that need to be validated to determine whether all necessary data is present. The State and Zip columns need to be validated additionally by rule, and even cleaned up if possible. The need to fix the data in the stream is why the Zip and State column usage types had to be set to ReadWrite in the designer earlier. To aid in accomplishing these rules, the data will be validated using regular expressions. Regular expressions are a powerful utility that should be in every developer’s tool belt. They enable you to perform powerful string matching and replacement routines. You can find an excellent tutorial on regular expressions at www.regular-expressions.info. The regular expressions for matching the data are shown here:
Regular Expression      Validation Description
^\d{5}([\-]\d{4})?$     Matches a five-digit or nine-digit zip code with dashes
\b([A-Z]{2})\b          Ensures that the state is only two capital characters
To use the regular expression library, add the .NET System.Text.RegularExpressions namespace to the top of the main class. For performance reasons, create the instances of the Regex class to validate the ZipCode and the State in the PreExecute() method of the Script Component. This method and the private instances of the Regex classes should look like this (ProSSIS\Code\Ch09_ProSSIS\10SCContactsExample.dtsx):

C#
private Regex zipRegex;
private Regex stateRegex;

public override void PreExecute()
{
    base.PreExecute();
    zipRegex = new Regex("^\\d{5}([\\-]\\d{4})?$", RegexOptions.None);
    stateRegex = new Regex("\\b([A-Z]{2})\\b", RegexOptions.None);
}
VB
Private zipRegex As Regex
Private stateRegex As Regex

Public Overrides Sub PreExecute()
    MyBase.PreExecute()
    zipRegex = New Regex("^\d{5}([\-]\d{4})?$", RegexOptions.None)
    stateRegex = New Regex("\b([A-Z]{2})\b", RegexOptions.None)
End Sub
To break up the tasks, create two new private functions to validate the ZipCode and State. Using by-reference arguments for the reason and the ZipCode enables the data to be cleaned and the encapsulated logic to return both a true or false and the reason. The ZipCode validation functions should look like this (ProSSIS\Code\Ch09_ProSSIS\10SCContactsExample.dtsx):

C#
private bool ZipIsValid(ref string zip, ref string reason)
{
    zip = zip.Trim();
    if (zipRegex.IsMatch(zip))
    {
        return true;
    }
    else
    {
        if (zip.Length > 5)
        {
            zip = zip.Substring(0, 5);
            if (zipRegex.IsMatch(zip))
            {
                return true;
            }
            else
            {
                reason = "Zip larger than 5 Chars, " +
                    "Retested at 5 Chars and Failed";
                return false;
            }
        }
        else
        {
            reason = "Zip Failed Initial Format Rule";
            return false;
        }
    }
}
VB
Private Function ZipIsValid(ByRef zip As String, _
    ByRef reason As String) As Boolean
    zip = zip.Trim()
    If (zipRegex.IsMatch(zip)) Then
        Return True
    Else
        If (zip.Length > 5) Then
            zip = zip.Substring(0, 5)
            If (zipRegex.IsMatch(zip)) Then
                Return True
            Else
                reason = "Zip larger than 5 Chars, " + _
                    "Retested at 5 Chars and Failed"
                Return False
            End If
        Else
            reason = "Zip Failed Initial Format Rule"
            Return False
        End If
    End If
End Function
The state validation functions look like this (ProSSIS\Code\Ch09_ProSSIS\10SCContactsExample.dtsx):

C#
private bool StateIsValid(ref string state, ref string reason)
{
    state = state.Trim().ToUpper();
    if (stateRegex.IsMatch(state))
    {
        return true;
    }
    else
    {
        reason = "Failed State Validation";
        return false;
    }
}
VB
Private Function StateIsValid(ByRef state As String, _
    ByRef reason As String) As Boolean
    state = state.Trim().ToUpper()
    If (stateRegex.IsMatch(state)) Then
        Return True
    Else
        reason = "Failed State Validation"
        Return False
    End If
End Function
Now, to put it all together, add the driver method Input0_ProcessInputRow() that is fired upon each row of the flat file (ProSSIS\Code\Ch09_ProSSIS\10SCContactsExample.dtsx):

C#
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    Row.GoodFlag = false;
    string myZip = string.Empty;
    string myState = string.Empty;
    string reason = string.Empty;

    if (!Row.FirstName_IsNull && !Row.LastName_IsNull &&
        !Row.City_IsNull && !Row.State_IsNull && !Row.Zip_IsNull)
    {
        myZip = Row.Zip;
        myState = Row.State;
        if (ZipIsValid(ref myZip, ref reason) &&
            StateIsValid(ref myState, ref reason))
        {
            Row.Zip = myZip;
            Row.State = myState;
            Row.GoodFlag = true;
        }
        else
        {
            Row.RejectReason = reason;
        }
    }
    else
    {
        Row.RejectReason = "All Required Fields not completed";
    }
}
VB
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' Default the flag so rejected rows are explicitly marked bad (matches the C# version)
    Row.GoodFlag = False
    Dim myZip As String = String.Empty
    Dim myState As String = String.Empty
    Dim reason As String = String.Empty

    If (Row.FirstName_IsNull = False And _
        Row.LastName_IsNull = False And _
        Row.City_IsNull = False And _
        Row.State_IsNull = False And _
        Row.Zip_IsNull = False) Then
        myZip = Row.Zip
        myState = Row.State
        If (ZipIsValid(myZip, reason) And _
            StateIsValid(myState, reason)) Then
            Row.Zip = myZip
            Row.State = myState
            Row.GoodFlag = True
        Else
            Row.RejectReason = reason
        End If
    Else
        Row.RejectReason = "All Required Fields not completed"
    End If
End Sub
Notice that all fields are checked for null values using a property on the Row class that is the field name with an _IsNull suffix. This property is code-generated by SSIS when you set up the input and output columns on the Script Component. Properties like Zip_IsNull explicitly allow the checking of a null value without encountering a null exception. This is handy because the property returns true if the particular column is NULL.

Next, if the Zip column is not NULL, its value is matched against the regular expression to determine whether it's in the correct format. If it is, the value is assigned back to the Zip column as a cleaned data element. If the value of the Zip column doesn't match the regular expression, the script checks whether it is at least five characters long. If so, the first five characters are retested for a valid ZipCode pattern. Nonmatching values result in the GoodFlag output column being set to False.

The state is trimmed of any leading or trailing white space, and then converted to uppercase and matched against the regular expression. The expression simply checks whether it contains two uppercase letters between A and Z. If it does, the GoodFlag is set to True and the state value is updated; otherwise, the GoodFlag is set to False.

To send the data to the appropriate table based on the GoodFlag, you must use the Conditional Split Transformation. Add this task to the Data Flow designer and connect the output of the Script Component to the Conditional Split Transformation. Edit the Conditional Split Transformation, add an output named Good with the condition GoodFlag == TRUE, and name the default output Bad. This separates the data rows coming out of the Script Component into two separate streams. The Conditional Split Transformation Editor should look like Figure 9-26.
Figure 9-26
Add an OLE DB Connection Manager that uses the database you created for the Contacts and ContactsErrorQueue tables. Add two SQL Server Destinations to the Data Flow designer. One, named Validated Contacts SQL Server Destination, should point to the Contacts table; the other, named Error Contacts SQL Server Destination, should point to the ContactsErrorQueue table. Drag the output of the Conditional Split Transformation to the Validated Destination. Set the output stream named Good to the destination. Then open the Mappings tab in the Destination to map the input stream to the columns in the Contacts table. Repeat this for the other Bad output of the Conditional Split Transformation to the Error Destination. Your final Data Flow should look something like Figure 9-27.

If you run this package with the Contacts.dat file described at the top of the use case, three contacts will validate, and two will fail with these rejection reasons:

Failed State Validation           Joseph McClung    JACKSONVILLE    FLORIDA    322763939
Zip Failed Initial Format Rule    Andrew Ranger     Jax             fl
Figure 9-27
Synchronous versus Asynchronous

Data Flow transformations can handle data rows in one of two ways: synchronously or asynchronously.

➤ A synchronous component performs its stated operation for every row in the buffer of rows. It does not need to copy the buffer to a new memory space, and does not need to look at multiple rows to create its output. Examples of synchronous components include the Derived Column Transformation and the Row Count Transformation.
➤ The second type of transformation, an asynchronous component, creates another buffer for the output. It typically uses multiple (or all) of the input rows to create a new output. The output usually looks quite different from the input, and the component tends to be
slower because of the copying of memory. Asynchronous component examples include the Aggregate Transformation and Sort Transformation.

Script Components can be written to act synchronously or asynchronously. The Data Validation example previously discussed is an example of a synchronous component. Let's create an asynchronous example for comparison. This example will show how to derive the median value from a set of source values.
Example: Creating a Median Value Asynchronously

As a starting point, use the AdventureWorks database to pull a set of values using an OLE DB Source, such as the TotalDue column from the Sales.SalesOrderHeader table. Similar to when you create a synchronous component, you can use a Script Component from the SSIS Toolbox as a transformation object and select the appropriate input columns, which in this case is the TotalDue column. The Inputs and Outputs tab is where you veer off the path that you would have followed with the synchronous component. The output property named SynchronousInputID needs to be set to None, which lets the component know that it should create a new buffer. The inputs and outputs created can be seen in Figure 9-28.
Figure 9-28
Once the inputs and outputs are prepared, it is time to write the script to perform the median calculation. The full script can be seen here in both languages (ProSSIS\Code\Ch09_ProSSIS\11bSCAsync.dtsx):

VB
Private valueArray As ArrayList

Public Overrides Sub PreExecute()
    MyBase.PreExecute()
    valueArray = New ArrayList
End Sub

Public Overrides Sub CreateNewOutputRows()
    MedianOutputBuffer.AddRow()
End Sub

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    valueArray.Add(Row.TotalDue)
End Sub

Public Overrides Sub FinishOutputs()
    valueArray.Sort()
    If valueArray.Count Mod 2 = 0 Then
        MedianOutputBuffer.Value = (CDec(valueArray(valueArray.Count / 2 - 1)) + _
            CDec(valueArray(valueArray.Count / 2))) / 2
    Else
        MedianOutputBuffer.Value = CDec(valueArray(Math.Floor(valueArray.Count / 2)))
    End If
End Sub
C#
// Note: ArrayList requires a using System.Collections; directive at the top of the script.
private ArrayList valueArray;

public override void PreExecute()
{
    base.PreExecute();
    valueArray = new ArrayList();
}

public override void CreateNewOutputRows()
{
    MedianOutputBuffer.AddRow();
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    valueArray.Add(Row.TotalDue);
}

public override void FinishOutputs()
{
    base.FinishOutputs();
    valueArray.Sort();
    if (valueArray.Count % 2 == 0)
    {
        MedianOutputBuffer.Value =
            (Convert.ToDecimal(valueArray[valueArray.Count / 2 - 1]) +
             Convert.ToDecimal(valueArray[valueArray.Count / 2])) / 2;
    }
    else
    {
        MedianOutputBuffer.Value = Convert.ToDecimal(valueArray[Convert.ToInt32(
            Math.Floor(valueArray.Count / 2.0))]);
    }
}
Note that there is an ArrayList that sits outside of the methods. This variable is accessed by multiple methods throughout the execution of the component, so it needs to be accessible by all of them. When the component runs its pre-execute phase, it will initialize the ArrayList and prepare it to be used. Then as each input row is processed, the value will be added to the ArrayList. Finally, in the FinishOutputs method, the median is calculated by sorting the values and pulling the middle value. This value is added to the output buffer and can be inserted into a file or database. The finished and executed package is shown in Figure 9-29.

Figure 9-29
At this point, you have a good overview of how scripting works in SSIS and the difference between the Script Task and the Script Component, but as with any programming environment, you need to know how to troubleshoot and debug your code to ensure that everything works correctly. The next section describes some techniques you can use for more advanced SSIS scripting development.
Essential Coding, Debugging, and Troubleshooting Techniques

You have now been all over the VSTA development environment and have been introduced to the different languages that move SSIS development into the managed code arena. Now, it is time to dig into some of the techniques for hardening your code for unexpected issues that may occur during runtime, and to look at some ways to troubleshoot SSIS packages. Any differences between the Script Task and the Script Component for some of these techniques are highlighted.
Structured Exception Handling

Structured exception handling (SEH) enables you to catch specific errors as they occur and perform any appropriate action needed. In many cases, you just want to log the error and stop execution, but in some cases you may want to try a different plan of action, depending on the error. Here is an example of exception handling in SSIS scripting code in both languages (ProSSIS\Code\Ch09_ProSSIS\12ScriptErrorSEH.dtsx):

C#
public void Main()
{
    try
    {
        string fileText = string.Empty;
        fileText = System.IO.File.ReadAllText("c:\\data.csv");
    }
    catch (System.IO.FileNotFoundException ex)
    {
        //Log Error Here
        //MessageBox here for demo purposes only
        System.Windows.Forms.MessageBox.Show(ex.ToString());
        Dts.TaskResult = (int)ScriptResults.Failure;
        return;
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}
VB
Public Sub Main()
    Try
        Dim fileText As String
        fileText = FileIO.FileSystem.ReadAllText("C:\data.csv")
    Catch ex As System.IO.FileNotFoundException
        'Log Error Here
        'MessageBox here for demo purposes only
        System.Windows.Forms.MessageBox.Show(ex.ToString())
        Dts.TaskResult = ScriptResults.Failure
        Return
    End Try
    Dts.TaskResult = ScriptResults.Success
End Sub
This trivial example attempts to read the contents of the file at C:\data.csv into a string variable. The code makes some assumptions that might not be true. An obvious assumption is that the file exists. That is why this code was placed in a Try block. It is trying to perform an action that has the potential for failure. If the file isn't there, a System.IO.FileNotFoundException is thrown.

A Try block marks a section of code that contains function calls with potentially known exceptions. In this case, the FileSystem.ReadAllText function has the potential to throw a concrete exception. The Catch block is the error handler for this specific exception. You would probably want to add some code to log the error inside the Catch block. For now, the exception is sent to a message box as a string so that it can be viewed. This code obviously originates from a Script Task, as it returns a result. The result is set to Failure, and the script is exited with the Return statement if the exception occurs. If the file is found, no exception is thrown, and the next line of code is executed. In this case, it would go to the line that sets the TaskResult to the value of the Success enumeration, right after the End Try statement.

If an exception is not caught, it propagates up the call stack until an appropriate handler is found. If none is found, the exception stops execution. You can have as many Catch blocks associated with a Try block as you wish. When an exception is raised, the Catch blocks are walked from top to bottom until an appropriate one is found that fits the context of the exception. Only the first block that matches is executed. Execution does not fall through to the next block, so it's important to place the most specific Catch block first and descend to the least specific. A Catch block specified with no filter will catch all exceptions. Typically, the coarsest Catch block is listed last.

The previous code was written to anticipate the error of a file not being found, so not only does the developer have an opportunity to add some recovery code, but the framework assumes that you will handle the details of the error itself. If the same code contained only a generic Catch statement, the error would simply be written to the package output. To see what this looks like, replace the Catch statement in the preceding code snippet with these:

C#
catch
VB
Catch
In this case, the error would simply be written to the package output like this:

SSIS package "Package.dtsx" starting.
Error: 0x1 at VB Script Task: System.Reflection.TargetInvocationException, mscorlib
System.IO.FileNotFoundException, mscorlib
System.Reflection.TargetInvocationException: Exception has been thrown by the target
of an invocation. ---> System.IO.FileNotFoundException: Could not find file 'C:\data.csv'.
File name: 'C:\data.csv'
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights,
    Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options,
    SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy)
...
Task failed: VB Script Task
SSIS package "Package.dtsx" finished: Success.
The full stack is omitted for brevity and to point out that the task status shows that it failed.

Another feature of structured error handling is the Finally block. The Finally block exists inside a Try block and executes after any code in the Try block and any Catch blocks that were entered. Code in the Finally block is always executed, regardless of what happens in the Try block and in any Catch blocks. You would put code to dispose of any resources, such as open files or database connections, in the Finally block. Following is an example of using the Finally block to free up a connection resource:

C#
public void OpenConnection(string myConStr)
{
    SqlConnection con = new SqlConnection(myConStr);
    try
    {
        con.Open();
        //do stuff with con
    }
    catch (SqlException ex)
    {
        //log error here
    }
    finally
    {
        if (con != null)
        {
            con.Dispose();
        }
    }
}
VB
Public Sub OpenConnection(myConStr As String)
    Dim con As SqlConnection = New SqlConnection(myConStr)
    Try
        con.Open()
        'do stuff with con
    Catch ex As SqlException
        'Log Error Here
        Dts.TaskResult = ScriptResults.Failure
        Return
    Finally
        If Not con Is Nothing Then con.Dispose()
    End Try
End Sub
In this example, the Finally block is hit regardless of whether the connection was opened successfully or not. A check confirms that the connection object exists before disposing of it to conserve resources. Typically, you want to follow this pattern if you are doing anything resource intensive, like using the System.IO or System.Data assemblies.
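To tie the pieces together, here is a hedged sketch (C# only, with purely illustrative messages and method names) of how several Catch blocks can be ordered from most specific to least specific, with a Finally block for cleanup, as described earlier:

C#
public void ReadFileSafely(string path)
{
    string fileText = null;
    try
    {
        fileText = System.IO.File.ReadAllText(path);
    }
    catch (System.IO.FileNotFoundException ex)
    {
        // Most specific handler first: the file simply is not there
        System.Diagnostics.Debug.WriteLine("File missing: " + ex.Message);
    }
    catch (System.IO.IOException ex)
    {
        // Broader I/O problems (locked file, bad device, and so on)
        System.Diagnostics.Debug.WriteLine("I/O problem: " + ex.Message);
    }
    catch (Exception ex)
    {
        // Coarsest handler last: anything not caught above lands here
        System.Diagnostics.Debug.WriteLine("Unexpected error: " + ex.Message);
    }
    finally
    {
        // Always runs, whether or not an exception occurred
        System.Diagnostics.Debug.WriteLine("Finished attempting to read " + path);
    }
}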
Note: For a full explanation of the Try/Catch/Finally structure in Visual Basic or C#, see the language reference in MSDN or Books Online.
Script Debugging and Troubleshooting

Debugging is an important feature of scripting in SSIS. You can still use the technique of popping up a message box function to see the value of variables, but there are more sophisticated techniques that will help you pinpoint the problem. Using the Visual Studio Tools for Applications environment, you now have the capability to set breakpoints, examine variables, and even evaluate expressions interactively.
Breakpoints

Breakpoints enable you to flag a line of code where execution pauses while debugging. Breakpoints are invaluable for determining what's going on inside your code, as they enable you to step into it to see what's happening as it executes.
Note: A new feature since Integration Services 2012 is the ability to debug Script Components, which includes breakpoints and step abilities.
You can set a breakpoint in several ways. One way is to click in the gray margin at the left of the text editor at the line where you wish to stop execution. Another way is to move the cursor to the line you wish to break on and press F9. Yet another way is to select Debug ➪ Toggle Breakpoint. To continue execution from a breakpoint, press F10 to step to the next line, or F5 to run all the way through to the next breakpoint. When you have a breakpoint set on a line, the line has a red highlight like the one shown in Figure 9-30 (though you can’t see the color in this figure). When a Script Task has a breakpoint set somewhere in the code, it will have a red dot on it similar to the one in Figure 9-31.
Figure 9-30
Figure 9-31
Row Count Component and Data Viewers

Previously, you looked at using the Visual Studio Tools for Applications environment to debug a Script Task or Script Component using breakpoints and other tools. Alternatively, you can inspect the data as it moves through the data flow using the Row Count Component or a Data Viewer. The Row Count Component is very straightforward; it simply states how many rows passed through it. The Data Viewer contains additional information if desired.

To add a Data Viewer, select the connector arrow that leaves the component for which you want to see data. In the previous example, this would be the connector from the Script Component to the Conditional Split Task. Right-click this connection, and select Enable Data Viewer. This automatically adds a Data Viewer that will show all columns on the stream. To remove any columns, double-click the connector and select the Data Viewer menu. Figure 9-32 shows how to turn on the Data Viewer on the Data Flow Path.

Now when you run this package again, you will get a Data Viewer window after the Script Component has executed. This view will show the data output by the Script Component. Figure 9-33 shows an example. Click the play button to continue package execution, or simply close the window.
Figure 9-32
Figure 9-33
While using the Data Viewer certainly helps with debugging, it is no replacement for being able to step into the code. An alternative is to use the FireInformation event on the ComponentMetaData class in the Script Component. It is like the message box but without the modal effect.
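As a rough illustration of that approach, a FireInformation call placed inside a row-processing method writes the values you care about to the execution output without pausing the pipeline. The column and message text below are illustrative only:

C#
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Emit the current row's Zip value to the execution log for inspection
    bool fireAgain = true;
    this.ComponentMetaData.FireInformation(0, "Debugging",
        "Processing Zip: " + Row.Zip, "", 0, ref fireAgain);
}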
Autos, Locals, and Watches

The SQL Server Data Tools environment provides you with some powerful views into what is happening with the execution of your code. These views consist of three windows: the Autos window, the Locals window, and the Watch window. These windows share a similar layout and display the value of expressions and variables, though each has a distinct method of determining what data to display.
The Locals window displays variables that are local to the current statement, as well as to the three statements behind and in front of the current statement. For a running example, the Locals window would appear as shown in Figure 9-34.
Figure 9-34
Watches are another very important feature of debugging. Watches enable you to specify a variable to watch. You can set up a watch to break execution when a variable’s value changes or some other condition is met. This enables you to see exactly when something is happening, such as a variable that has an unexpected value. To add a watch, select the variable you want to watch inside the script, right-click it, and select Add Watch. This will add an entry to the Watch window. You can also use the Quick Watch window, accessible from the Debug menu, or through the Ctrl+Alt+Q key combination. The Watch window shown in Figure 9-35 is in the middle of a breakpoint, and you can see the value of Iterator as it is being assigned the variable value of 2.
Figure 9-35
This window enables you to evaluate an expression at runtime and see the result in the window. You can then click the Add Watch button to move it to the Watch window.
The Immediate Window

The Immediate window enables you to evaluate expressions, execute procedures, and print out variable values. It is really a mode of the Command window, which enables you to issue commands to the IDE. Unfortunately, this too is useful only when you are within a breakpoint, and this can be done only within a Script Task.
Note: If you can't find the Immediate window but see the Command window, just type the command immed and press Enter.
The Immediate window is very useful while testing. You can see the outcome of several different scenarios. Suppose you have an object obj of type MyType, and MyType declares a method called DoMyStuff() that takes a single integer as an argument. Using the Immediate window, you could pass different values into the DoMyStuff() method and see the results. To evaluate an expression in the Immediate window and see its results, you must start the command with a question mark (?):

?obj.DoMyStuff(2)
"Hello"
Commands are terminated by pressing the Enter key. The results of the execution are printed on the next line. In this case, calling DoMyStuff() with a value of 2 returns the string “Hello.” You can also use the Immediate window to change the value of variables. If you have a variable defined in your script and you want to change its value, perhaps for negative error testing, you can use this window, shown in Figure 9-36.
Figure 9-36
In this case, the value of the variable greeting is printed out on the line directly below the expression. After the value is printed, it is changed to “Goodbye Cruel World.” The value is then queried again, and the new value is printed. If you are in a Script Task and need to get additional information, this is a useful way to do it.
Summary

In this chapter, you learned about the available scripting options in SSIS, including those that support managed code development and a robust IDE development environment. You used the Visual Studio Tools for Applications IDE to develop some basic Script Tasks. Then, to see how all this fits together in SSIS, you dove right in to using the Script Task to retrieve data into variables and to save data into external XML files, and used some .NET serialization techniques that enable custom serialization into MSMQ queues or web services. To understand how to leverage existing code libraries, you even created a utility class, registered it into the GAC, and accessed it in an SSIS script to validate data. SSIS scripting is powerful, but it has been difficult for some developers to differentiate between when to use a Script Task and when a Script Component is appropriate. You have now examined both of these in detail in this chapter and should be able to use them with confidence in your daily development.
Experiment with the scripting features of SSIS using the examples in this chapter, and you will find all kinds of uses for them. Don’t forget to review Chapter 5, which covers expressions, to learn about the capabilities of controlling properties within the SSIS model at runtime. Now you are going to take what you have learned so far about SSIS’s capabilities — from Control Flow and Data Flow Tasks to expressions and Scripting Tasks and Components — and put it to work. In the next chapter, you’ll learn all about the techniques you have at your disposal to do a typical job of loading a data warehouse using SSIS.
10
Advanced Data Cleansing in SSIS

What's in This Chapter?

➤ Using the Derived Column Transformation for advanced data cleansing
➤ Applying the Fuzzy Lookup and Fuzzy Grouping transformations and understanding how they work
➤ Introducing Data Quality Services
➤ Introducing Master Data Services
Wrox.com Code Downloads for this Chapter
You can find the wrox.com code downloads for this chapter at http://www.wrox.com/go/prossis2014 on the Download Code tab.
In this chapter, you will learn the ins and outs of data cleansing in SSIS, from the basics to the advanced. In a broad sense, one of SSIS's main purposes is to cleanse data — that is, transform data from a source to a destination and perform operations on it along the way. In that sense, someone could correctly say that every transformation in SSIS is about data cleansing. For example, consider the following transformations:

➤ The Data Conversion adjusts data types.
➤ The Sort removes duplicate data.
➤ The Merge Join correlates data from two sources.
➤ The Derived Column applies expression logic to data.
➤ The Data Mining predicts values and exceptions.
➤ The Script applies .NET logic to data.
➤ The Term Extraction and Term Lookup perform text mining.
In a stricter sense, data cleansing is about identifying incomplete, incorrect, or irrelevant data and then updating, modifying, or removing the "dirty" data. From this perspective, SSIS has four primary data cleansing transformations, which are reviewed in this chapter:

➤ Derived Column Transformation: This transformation can perform advanced expression-based data cleansing. If you have just basic data cleansing needs, like blanks or nulls or simple text parsing, this is the right place to start. The next section will walk through some examples.
➤ Fuzzy Lookup Transformation: Capable of joining to external data based on data similarity, the Fuzzy Lookup Transformation is a core data cleansing tool in SSIS. This transformation is perfect if you have dirty data input that you want to associate to data in a table in your database based on similar values. Later in the chapter, you'll take a look at the details of the Fuzzy Lookup Transformation and what happens behind the scenes.
➤ Fuzzy Grouping Transformation: The main purpose is de-duplication of similar data. The Fuzzy Grouping Transformation is ideal if you have data from a single source and you know you have duplicates that you need to find.
➤ DQS Cleansing: The Data Quality Services Cleansing Transformation leverages the DQS engine to perform predefined data quality rules and mapping. If you have any advanced cleansing where you would like to apply rules and manage cleansing logic, the DQS Transformation using the DQS engine is the right choice for you.
In addition to these data cleansing transformations, SSIS also has a Data Profiling Task that can help you identify any issues within your dirty data as you plan its necessary data cleansing. See Chapter 3 for an overview of the Data Profiling Task and Chapter 12 for a more detailed review of its functionality. This chapter will also explore Master Data Services as a way of standardizing reference data. MDS gives users the familiar interface of Excel to manage and correct data to truly have one version of the truth.
Advanced Derived Column Use

If you've used the data flow in SSIS for any amount of data transformation logic, you will no doubt have used the Derived Column Transformation. It has many basic uses, from replacing NULLs or blanks to text parsing and manipulation.
Using SSIS expressions, the Derived Column Transformation can be used for more advanced data cleansing operations than a simple single expression, such as the following:

➤ Advanced text code logic to identify and parse text values
➤ Checking for data ranges and returning a specified value
➤ Mathematical operations with advanced logic
➤ Date comparison and operations
Chapter 5 reviews the expression language in thorough detail. Figure 10-1 highlights the Derived Column Transformation expression toolbox within the Derived Column Transformation Editor.
Figure 10-1
One challenge with the Derived Column Transformation is parsing more complicated text strings and effectively using expressions without duplicating expression logic. This next section walks you through an example of pulling out information from values.
Text Parsing Example

To see an example of text parsing, consider the example source data from the following list. It contains oceanographic buoy locations off the coast of the United States. Some of them are near cities, while others are in locations farther off the coast. In addition to the location, the text values also contain some codes and switches irrelevant to what you need to parse.

6N26 /V S. HATTERAS, NC
3D13 /A EDISTO, SC
3D14 /A GRAYS REEF
6N46 /A CANAVERAL, FL
6N47 /A CANAVERAL EAST, FL
3D56 /A ST. AUGUSTINE, FL
3D55 /A FRYING PAN SHOALS
3D36 /D BILOXI, MS
3D35 /D LANEILLE, TX
3D44 /D EILEEN, TX

Can you use the Derived Column Transformation to pull out the locations embedded within the text? For locations that are near cities, can you also identify the appropriate state code? More important, can you do this efficiently and clearly? Most ETL developers would try to do this in a single Derived Column step with one expression. They would end up with something like this:

SUBSTRING((ISNULL(Location) ? "Unknown" : TRIM(Location)),
    FINDSTRING((ISNULL(Location) ? "Unknown" : TRIM(Location)),"/",1) + 3,
    (FINDSTRING((ISNULL(Location) ? "Unknown" : TRIM(Location)),",",1) == 0 ?
        (LEN((ISNULL(Location) ? "Unknown" : TRIM(Location))) -
            FINDSTRING((ISNULL(Location) ? "Unknown" : TRIM(Location)),"/",1) + 4) :
        (FINDSTRING((ISNULL(Location) ? "Unknown" : TRIM(Location)),",",1) -
            FINDSTRING((ISNULL(Location) ? "Unknown" : TRIM(Location)),"/",1) - 3)))
To be sure, this code will work. It identifies text values, where the location begins, and when a location has a state code appended to it. However, the clarity of the code leaves much to be desired. One thing you can notice in the preceding code is the redundancy of some expressions. For example, it is replacing a NULL value in the Location column with "Unknown". In addition, several FINDSTRING functions are used to locate the "/" in the code. A better approach is to break the code into multiple steps. Figure 10-2 illustrates a Data Flow that contains two Derived Column Transformations.

Figure 10-2
The first Derived Column Transformation performs a few preparation steps in the data that is then used in the second transformation. Figure 10-3 highlights the expressions used in the first “Parsing Preparation” transformation.
Figure 10-3
This transformation performs the three common expressions needed to handle the string logic that pulls out the location information from the data:

➤ LocationPosition: This new column simply identifies where the "/" is in the code, since that is immediately before the location is named.
➤ StatePosition: This expression looks for the existence of a comma (,), which would indicate that the location is a city with an accompanying state as part of the location description.
➤ Location: This column is replaced with "Unknown" if the Location value is missing.
With these preparation steps, the expression logic needed to perform the parsing of the text becomes a lot cleaner. The following code is part of the second Derived Column Transformation, which parses out the name of the location:
SUBSTRING(Location,LocationPosition + 3,(StatePosition == 0 ? (LEN(Location) - LocationPosition + 4) : (StatePosition - LocationPosition - 3)))
Now the expression is more readable and easier to follow. Note that to employ this approach, you need to break your Data Flow into two Derived Column Transformations because in order for expression logic to reference a Data Flow column, it must be available in the input of the transformation.
Advanced Fuzzy Lookup and Fuzzy Grouping

The two fuzzy transformations within SSIS, Fuzzy Lookup and Fuzzy Grouping, deal with associating data through data similarity, rather than exact data matching. The "fuzzy" part of the transformation name refers to data coupling based on selected data mapping using defined similarity and confidence measurements. Here is a brief description of each:

➤ Fuzzy Lookup: The Fuzzy Lookup Transformation takes input data from a Data Flow and matches it to a specified table within SQL Server joined across data similarity column matching. The Fuzzy Lookup is like the Lookup Transformation, except that the column mapping can be adjusted to evaluate data likeness and the output can be tuned to return one or more potential results.
➤ Fuzzy Grouping: This transformation takes a single input from the Data Flow and performs a comparison with itself to try to identify potential duplicates in the data. The grouping doesn't evaluate all the columns in the source input; it only searches for duplicates across the columns you select based on the similarity settings that you define.
This section begins with the Fuzzy Lookup Transformation by reviewing its general functionality. It then digs a little deeper to reveal how it works under the covers. The Fuzzy Grouping Transformation works very similarly to the Fuzzy Lookup Transformation.
Fuzzy Lookup

The very basic purpose of the Fuzzy Lookup is to match input data to a lookup table across columns whose values do not necessarily match exactly. The Fuzzy Lookup Transformation is therefore very similar to the Lookup Transformation, except you are not joining with identical values; you are joining with similar values. Figure 10-4 shows the (regular) Lookup Transformation, whereby several columns are mapped to a lookup table and a key column is returned. The input data in Figure 10-4 is from Excel, a common source of dirty data due to issues with data conversion, missing data, or typographical errors. The simple Data Flow in Figure 10-5 shows that the Lookup has the error output configured to redirect missing rows; as you can see, seven rows do not match to the Lookup table when the Data Flow is executed.

To find the missing record matches for the seven rows, you can use the Fuzzy Lookup Transformation. The best way to use the Fuzzy Lookup is when you have a set of data rows that you have already tried matching with a Lookup, but there were no matches. The Fuzzy Lookup does not use cached data and requires SQL Server to help during the processing, so it is more efficient to take advantage of a cached Lookup to handle the large majority of records before using the Fuzzy Lookup.
Figure 10-4
Figure 10-5
Figure 10-6 shows the Fuzzy Lookup Transformation Editor. The first tab, Reference Table, requires you to select the reference table that the Fuzzy Lookup needs to match, just like the Lookup Transformation. Later in this section, you will see the advanced settings. On the Columns tab, you need to join the matching columns from the input to the Lookup reference table. Because the purpose is to find matches, you can then determine which columns in the lookup reference table need to be added to the Data Flow. The Fuzzy Lookup example in Figure 10-7 is identical to the Lookup mapping in Figure 10-4, where the primary key column, CustomerID, is returned to the Data Flow.
Figure 10-6
Figure 10-7
The Fuzzy Lookup Transformation has a few advanced features (see Figure 10-8) to help you determine what should be considered a match:
➤ For every input row, the "Maximum number of matches to output per lookup" option will limit the potential matches to the number that is set. The Fuzzy Lookup will always select the top matches ordered by similarity, highest to lowest.
➤ The "Similarity threshold" option defines whether you want to limit the matches to only values above a defined likeness (or similarity). If you set this to 0, you will always get the same number of lookup rows per input row as defined in the "Maximum number of matches to output per lookup" setting.
➤ Because the Fuzzy Lookup is matching on text, some custom features enable the Lookup to determine when to identify a separation in characters (like more than one word). These are the token delimiters.

Figure 10-8
In the example we are building, once the Fuzzy Lookup is configured a Union All is added to the Data Flow and the output of the Lookup and the Fuzzy Lookup are both connected to the Union All. The output of the Union All is then connected to the destination. Figure 10-9 shows the completed Data Flow with the execution results. The seven rows that didn’t match the Lookup Transformation have been successfully matched with the Fuzzy Lookup, and the data has been brought back together with the Union All. In order to better understand how the Fuzzy Lookup is matching the data, you can add a Data Viewer to the output path in the Fuzzy Lookup. As Figure 10-10 demonstrates, right-click on the path and select Enable Data Viewer.
Figure 10-9
Figure 10-10
The Fuzzy Lookup has added more than just the reference table's lookup column, as shown in the Data Viewer output in Figure 10-11:

➤ _Similarity: This is the overall similarity of the source input row to the match row that the Fuzzy Lookup found.
➤ _Confidence: This is not about the current row but how many other rows are close in similarity. If other rows are identified as close in similarity, the confidence drops, because the Fuzzy Lookup is less confident about whether the match found is the right match.
➤ _Similarity_[Column Name]: For every column used in the match (refer to Figure 10-7), the Fuzzy Lookup includes the individual similarity of the input column to the match row in the reference table. These columns begin with "_Similarity_" and have the original column name as a suffix.
Figure 10-11
As you can see in the Data Viewer output from Figure 10-11, the similarity of the matching rows varies between 91 and 96 percent. The columns on the right-hand side of Figure 10-11 indicate the degree of similarity between the matching columns. Notice that many of them have a value of 1, which indicates a perfect match. A value less than 1 indicates the percentage of similarity between the input and reference join. Note that the confidence is in the 50 percent range. This is because most of the sample data is from matching cities and states, which increases the similarity of other rows and therefore reduces the confidence. One final feature of the Fuzzy Lookup Transformation is the capability to define similarity thresholds for each column in the match. Referring back to Figure 10-7, if you double-click on one of the relationship lines, it will open the Create Relationships dialog, shown in Figure 10-12.
Figure 10-12
In this example, the StateProvinceName has been set to an Exact Match type, which is a minimum similarity of 1. Therefore, the Fuzzy Lookup will identify a potential match between rows only when the StateProvinceName is identical for both the input row and the reference table. The easiest way to understand how the Fuzzy Lookup Transformation works (behind the scenes) is to open the Fuzzy Lookup Transformation Editor, edit the reference table, and then check the “Store new index” checkbox, as Figure 10-13 shows.
Figure 10-13
The Fuzzy Lookup requires a connection to a SQL Server database using the OLE DB provider because the transformation uses SQL Server to compute the similarity. To see how this works, begin by using SSMS to connect to the server and database where the lookup table is located. Expand the Tables folder, as shown in Figure 10-14.
Figure 10-14
The Fuzzy Lookup has created a few tables. The FuzzyLookupMatchIndex tables contain the data in the reference table, tokenized for the Fuzzy Lookup operation. In addition, if you checked the “Maintain stored index” checkbox (refer to Figure 10-13), you will also get a couple of additional tables that contain data for inserts and deletes from the reference table. Not shown are the indexes on the reference table, which keep the data updated. Figure 10-15 shows sample data from the FuzzyLookupMatchIndex table. The Token column contains partial data from the values for each row in the reference table. The ColumnNumber is the ordinal of the column from the input data set (basically, which column is being referenced in each row). The values in the Rids column look quite strange. This is because SSMS cannot display the binary data in text. However, this column contains the Row Identifiers (RIDs) for every row in the reference table that contains the same token. If you trace the Fuzzy Lookup during package execution, you will find that the input row is also tokenized and matched to the data in the Match Index table, which is how the engine determines the similarity.
Figure 10-15
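If you want to peek at the tokenization yourself, a quick query such as the following works (a sketch only; the table name depends on the index name you chose when saving the index, so substitute your own):

--Inspect the tokens the Fuzzy Lookup generated (table name is illustrative)
SELECT TOP (20) Token, ColumnNumber, Rids
FROM dbo.FuzzyLookupMatchIndex
ORDER BY Token;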
As you may have guessed from looking at how the Fuzzy Lookup Transformation works, it can consume a lot of server resources. This is why you may want to handle the exact matches first using a standard Lookup Transformation.
Fuzzy Grouping
The Fuzzy Grouping Transformation is similar to the Fuzzy Lookup in that it uses the same approach to find matches and it requires SQL Server. Rather than reference an external table, however, the Fuzzy Grouping matches the input data to itself in order to find duplicates. This process is commonly referred to as de-duplication. Figure 10-16 shows an example Data Flow that performs several common transformations. Data is imported from Excel and transformed in a few steps. Right before the destination, a Fuzzy Grouping is added to the Data Flow.
Figure 10-16
When you edit the Fuzzy Grouping, you will find settings similar to those of the Fuzzy Lookup. Note that on the Connection Manager tab, shown in Figure 10-17, the only property is the connection to SQL Server. This is because there is no reference table for the Fuzzy Grouping to join to; it just needs a connection where it can store its temporary data.
Figure 10-17
Each input column has two settings. The first is the checkbox (Figure 10-18 shows a few columns selected), which determines whether the Fuzzy Grouping will use that column to identify duplicates. The Pass Through setting enables columns to appear downstream even when they are not used in the identification of duplicates. Figure 10-18 also highlights that the Fuzzy Grouping Transformation provides the same capability as the Fuzzy Lookup to set a minimum similarity on a column-by-column basis.
On the Advanced tab, shown in Figure 10-19, you can fine-tune the Fuzzy Grouping by specifying the overall Similarity threshold. If a potential matching row does not meet this threshold, it is not considered in the de-duplication. You can also set the output columns.
Figure 10-18
Figure 10-19
Just as with the Fuzzy Lookup, you can see the output by adding a Data Viewer to the output path from the Fuzzy Grouping. Figure 10-20 illustrates how the Fuzzy Grouping works. A _key_in column and a _key_out column are added to the Data Flow; the _key_in value identifies each original input row. When the Fuzzy Grouping identifies a potential match, it places the row next to the row it matched, and the matched rows share that first row's key in their _key_out column.
Figure 10-20
As the example in Figure 10-20 shows, there are a couple of matches. LastName was misspelled in the row with a _key_in value of 6, but because the similarity _score is 95 percent, the engine determined it was a match (it was above the similarity threshold of 80 percent defined in Figure 10-19). In another couple of highlighted rows, the street address is slightly different. The key to the Fuzzy Grouping is the _score column. If you wanted to simply accept the Fuzzy Grouping results and de-duplicate your source, you would add a Conditional Split Transformation to the Data Flow and allow through only the rows whose _score == 1 (the double equals is the expression language's Boolean equality check). Alternately, you could define custom expression logic to choose an alternate row.
As the preceding two sections have demonstrated, both the Fuzzy Lookup and the Fuzzy Grouping provide very powerful data cleansing features that can be used in a variety of data scenarios.
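If you would rather land the Fuzzy Grouping output in a staging table and de-duplicate there instead of using the in-pipeline Conditional Split just described (an alternative approach; the staging table name below is hypothetical), the equivalent T-SQL filter relies on the fact that the canonical row of each group is the one whose _key_in equals its _key_out:

--Keep one canonical row per duplicate group (staging table name is illustrative)
SELECT *
FROM dbo.FuzzyGroupingStaging
WHERE [_key_in] = [_key_out];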
DQS Cleansing
Introduced in SQL Server 2012 was a component called Data Quality Services (DQS). This is not a feature of Integration Services, but it is very much connected to the data cleansing processes within SSIS. In fact, there is a Data Flow transformation called the DQS Cleansing Transformation, which connects to DQS and enables you to run incoming Data Flow data through its data cleansing operations.
Because this book focuses on SSIS, a full DQS tutorial is not included; however, this section provides a brief overview of DQS and highlights a few data quality examples. To gain more understanding, you can also watch the DQS one day course by the Microsoft DQS Team at http://technet.microsoft.com/en-us/sqlserver/hh780961.aspx.
Data Quality Services
The workflow to use DQS within SSIS requires a few preparatory steps. These need to be performed within the DQS client tool connected to a DQS service. The DQS client is available in the SQL Server 2014 Programs folder (from the Start button). There is a 32-bit version and a 64-bit version. In order to use Data Quality Services, you must have installed it during the SQL Server setup and run the configuration executable, called DQSInstaller.exe. The full setup instructions can be found on MSDN, http://msdn.microsoft.com/en-us/library/gg492277(v=SQL.120).aspx. Once you pull up the client and connect to the server, you will be in the DQS main screen, shown in Figure 10-21.
Figure 10-21
You can perform three primary tasks with DQS: ➤➤
Knowledge Base Management is how you define the data cleansing rules and policies.
➤➤
Data Quality Projects are for applying the data quality definitions (from the knowledge base) against real data. We will not be considering projects in this chapter; instead, you will see how to use the SSIS DQS Cleansing Task to apply the definitions.
➤➤
Administration is about configuring and monitoring the server and external connections.
To begin the process of cleansing data with DQS, you need to perform two primary steps within the Knowledge Base Management pane:
1.
Create a DQS Knowledge Base (DQS KB). A DQS KB is a grouping of related data quality definitions and rules (called domains) that are defined up front. These definitions and rules are applied against data with various outcomes (such as corrections, exceptions, etc.). For example, a DQS KB could be a set of domains that relate to address cleansing, or a grouping of valid purchase order code rules and code relationships within your company.
2.
Define DQS domains and composite domains. A DQS domain is a targeted definition of cleansing and validation properties for a given data point. For example, a domain could be “Country” and contain the logic on how to process values that relate to countries around the world. The value mapping and rules define what names are valid and how abbreviations map to which countries.
When you select the Open knowledge base option, you are presented with a list of KBs that you have worked with. The built-in KB included with DQS, DQS Data, contains several predefined domains and rules, and connections to external data. Figure 10-22 shows the right-click context menu, which enables you to open the KB and see the definition details.
Figure 10-22
Knowledge bases are about domains, which are the building blocks of DQS. A domain defines what the DQS engine should do with data it receives: Is it valid? Does it need to be corrected? Should it look at external services to cleanse the data? For example, Figure 10-23 highlights the Domain Values tab of the State domain. It shows how values are cleansed and which values should be grouped. In this example, it lists state abbreviations and names and the Correct To value.
Figure 10-23
In the next example, a composite domain is selected. A composite domain is just what it sounds like: a group of domains. In this case, the domains involve companies, based on the business name, city, country, and state. Figure 10-24 shows the partial configuration of a composite domain. In this case, there is an external web service reference called "D&B - D&B Company Cleanse & Match" through which data will be processed. There are many sources you could connect to, such as Melissa Data for address cleansing (www.melissadata.com) or a host of premium data sources from the Windows Azure Data Marketplace (https://datamarket.azure.com); some are free on a trial basis, while others require a paid subscription.
Domains can also contain rules that validate the data as it is processed through DQS. In the example in Figure 10-25, the Zip (Address Check) field is validated so that the length is equal to 6. You can also see some of the other options in the list. Multiple rules can be applied with logical AND or OR conditions. If a data element fails the rules, it is marked as bad data during the processing.
Figure 10-24
Figure 10-25
Other common rules include range rules to check that numeric data values fall within a given range and value lists to make sure that the data coming in meets specific requirements. As shown in these few examples, DQS can serve as a powerful data quality engine for your organization. In addition to the common data validation and cleansing operations, you can apply a host of custom rules, matching criteria, and external links. The next step, after your knowledge base is defined, is to process your data through SSIS.
DQS Cleansing Transformation
SSIS can connect to DQS using the DQS Cleansing Transformation. This is one of two ways that data can be applied against the knowledge bases within DQS. (A data quality project is the primary way to process data if you are not using SSIS for ETL. It is found in the DQS client tool, but it's not described in this book, which focuses on SSIS.)
In order to use the DQS Cleansing Transformation, you will first connect to a source within your Data Flow that contains the data you plan to associate with the knowledge base. The next step is to connect the source (or other transformation) to a DQS Cleansing Transformation and edit it. Figure 10-26 shows the Connection Manager tab of the DQS Cleansing Transformation. You need to connect to the DQS server and choose the knowledge base that you will be using for your source input data within SSIS.
Figure 10-26
In this example, the source data contains states/provinces and countries, so you will use the built-in DQS Data KB to connect the states and countries. To see the list of domains, choose DQS Data from the Data Quality Knowledge Base dropdown, as shown in Figure 10-27.
Figure 10-27
The Mapping tab contains the list of input columns that can be used against the KB domains. In Figure 10-28, both the StateProvinceCode and the CountryRegionName columns are selected in the input column list and matched to the US - State (2-letter leading) and Country/Region domains in the Domain dropdown. You are also able to redirect the errors to the error output for the rows that do not meet the domain criteria and rules, using the Configure Error Output dropdown at the bottom of the DQS editor. Figure 10-29 shows the simple Data Flow with a couple of Multicast Transformations so that the data can be viewed (for demo purposes).
Figure 10-28
Figure 10-29
In addition to mapping the inputs to the DQS domain, the DQS Cleansing Transformation also provides additional data in the output of the transformation. Figure 10-30 shows a Data Viewer with the output rows and columns resulting from the DQS cleansing process.
Figure 10-30
In this example, note the highlighted row indicating where the country was corrected and standardized to the DQS domain definition. Besides the original and corrected values, you can also see a reason code, as well as a confidence level for the correction. These are similar to the fuzzy transformation outputs shown earlier, except that you have much more control and flexibility in terms of how you define your data cleansing process within DQS and apply it in SSIS.
An alternate way to see the data flowing through the DQS transformation is to use a Data Tap, which applies when your package is deployed to an SSIS server catalog. Chapter 22 covers how to use a Data Tap in SSIS.
Master Data Management
Master data management (MDM) is the process an organization goes through to discover and define data with the ultimate goal of compiling a master list of data. Gartner, the well-known technology research and advisory company, defines master data management as "a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets" (http://www.gartner.com/it-glossary/master-data-management-mdm).
Along with choosing an MDM technology, any project will include an evaluation phase in which the business checks and validates records. A successful MDM solution delivers reliable, centralized data that can be analyzed, resulting in better and more accurate business decisions. Having a tool that can manage a consistent data source can ease many common headaches that occur during data warehouse projects.
For example, say your company recently acquired a former competitor. As the data integration expert, your first task is to merge the newly acquired customer data into your data warehouse. As expected, your former competitor had many of the same customers you have listed in their transactional database. You clearly need a master customer list that stores the most accurate customer data from your own database as well as from the newly acquired data set. This is a very typical scenario in which an MDM solution can preserve the data integrity of your data warehouse. Without such a solution, the customer table in the data warehouse will start to accumulate less accurate information and even duplicate records.
Master Data Services
Master Data Services (MDS) was originally released in SQL Server 2008 R2 as Microsoft SQL Server's solution for master data management. Master Data Services includes the following components and tools to help configure, manage, and administer each feature: ➤➤
Master Data Services Configuration Manager is the tool you use to create and configure the database and web applications that are required for MDS.
➤➤
Master Data Manager is the web application that users can access to update data and also where administrative tasks may be performed.
➤➤
MDSModelDeploy.exe is the deployment tool used to create packages of your model objects that can be sent to other environments.
➤➤
Master Data Services web service is an access point that .NET developers can use to create custom solutions for MDS.
➤➤
Master Data Services Add-in for Excel is used to manage data and create new entities and attributes.
Again, because this book focuses on SSIS, a full MDS tutorial is not included; however, this section provides a brief overview of MDS. To gain more understanding, you can also watch the MDS one-day course by the Microsoft MDS Team at http://msdn.microsoft.com/en-us/sqlserver/ff943581.aspx.
To get started you must first run the Master Data Services Configuration Manager with the executable MDSConfigTool.exe. This requires the creation of a configuration database that stores the system settings that are enabled. Once the database is created, you can configure the web application called the Master Data Manager. You can find the full setup instructions on MSDN, http://technet.microsoft.com/en-us/library/ee633884(SQL.120).aspx.
The majority of the work in Master Data Services post-configuration is done by an information worker, who will use both the Master Data Manager web application and the Master Data Services Add-in for Excel. The Excel Add-in is a free download available at http://go.microsoft.com/fwlink/?LinkId=219530. The Master Data Manager allows the administrator to create new models and derived hierarchies, while the information worker uses the web interface to work with the model and hierarchy relationships. For example, if your company's data steward found that an entire subcategory of products was placed under the wrong category, the data steward could use the Master Data Manager to quickly and easily correct that problem. Figure 10-31 shows the interface that the data steward would use to drag and drop hierarchy corrections.
Figure 10-31
After installing the Master Data Services Add-in for Excel (Figure 10-32), the information worker has MDS in the environment they're most comfortable in. Using Excel, a user can connect to the MDS server, shown in Figure 10-33, and then import the appropriate data sets into MDS. To do this, the user simply selects a table of data in Excel and then clicks Create Entity using the Excel Master Data Ribbon (Figure 10-34).
Figure 10-32
Figure 10-33
Figure 10-34
Any changes that are made through MDS remain in the MDS database. However, when you're ready to push these changes back to the destination database, you can create a simple MDS view through the Master Data Manager and sync the tables using a T-SQL update statement. You would likely schedule these updates to occur once a day, or even more frequently depending on your needs. Because the heavy lifting with MDS is done by the information worker, there is no direct integration with SSIS. Tables that are updated from MDS views are used with traditional components like the Lookup Transformation to ensure incoming new data fits appropriately into the organization's master list.
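As a rough sketch of what that sync statement might look like (the subscription view and destination table names here are hypothetical and depend on the model and view you defined in Master Data Manager):

--Push the master list from an MDS subscription view into a destination table
--(view, table, and column names are illustrative)
UPDATE d
SET    d.CustomerName = v.Name,
       d.City         = v.City
FROM   dbo.DimCustomer AS d
INNER JOIN mdm.CustomerSubscriptionView AS v
        ON v.Code = d.CustomerCode;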
Summary
In this chapter, you looked at various data cleansing features within SSIS — in particular, the Derived Column, the Fuzzy Grouping and Fuzzy Lookup, and the DQS Cleansing Transformation. These can be categorized as basic data cleansing, dirty data matching, and advanced data rules. The Derived Column, as the basic feature, allows you to do the most common data cleansing tasks — handling blanks and text parsing. At the next level, you can use the Fuzzy Lookup and Fuzzy Grouping to find values that should match or should be the same but, because of bad data in the source, do not. When your requirements necessitate advanced rules and domain-based cleansing, the DQS tooling in SQL Server provides a thorough platform to handle data quality needs. Outside of SSIS you also explored tools like Master Data Services and the Data Quality client, which now make the information worker part of the data quality solution. The bottom line: no two data cleansing solutions are exactly the same, but SSIS gives you the flexibility to customize your solution to whatever ETL needs you have.
11
Incremental Loads in SSIS What’s in This Chapter? ➤➤
Using the Control Table pattern for incrementally loading data
➤➤
Working with Change Data Capture
Wrox.com Code Downloads for this Chapter
You can find the wrox.com code downloads for this chapter at http://www.wrox.com/go/prossis2014 on the Download Code tab.
So far, most of the data loading procedures explained in this book have done a full load or a truncate and load. While this is fine for smaller numbers of rows, it is infeasible with millions of rows. In this chapter, you're going to learn how to take the knowledge you've gained and apply it to an incremental load of data. The first pattern is the control table pattern, in which you use a table to record when the last load of the data occurred; the package then determines which rows to load based on that last load date. The other alternative used in this chapter is a Change Data Capture (CDC) pattern, which requires the Enterprise Edition of SQL Server and automatically identifies the rows to be transferred based on a given date.
Control Table Pattern
The most conventional incremental load pattern is the control table pattern. The pattern uses a table that the developer creates to store operational data about the last load. A sample table looks like this:

CREATE TABLE [dbo].[ControlTable](
    [SourceTable] [varchar](50) NOT NULL,
    [LastLoadID] [int] NOT NULL,
    [LastLoadDate] [datetime] NOT NULL,
    [RowsInserted] [int] NOT NULL,
    CONSTRAINT [PK_ControlTable] PRIMARY KEY CLUSTERED
    (
        [SourceTable] ASC
    )
) ON [PRIMARY]
In this pattern, you would have a row in the control table for each table that you wish to create a load process for. This table is not only used by your SSIS package to determine how much data to load, but it also becomes an audit table showing which tables have and have not been loaded. Each of the incremental load patterns in this chapter follows these steps:
1. An Execute SQL Task reads the last load date from the control table into a variable.
2. A Data Flow Task reads from the source table the rows that were modified or created after the date in the variable.
3. An Execute SQL Task sets the last load date in the control table to the time when the package began.
To start the example, run the Control Table Example Creation.sql file in the chapter's accompanying material (which you can download from http://www.wrox.com/go/prossis2014). This will create a table to read from called SourceTable and a table to load called DestTable. It will also create and load the control table. Notice that the ControlTable table shows a LastLoadDate value of 1900-01-01, meaning the SourceTable has never been read from. The SourceTable column holds a record for each table you wish to read from. Optionally, there's a LastLoadID that could be used for identity columns.
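The downloadable script seeds the control table for you; the statement below is only a sketch of the initial state it creates, shown here so you can see why the first run picks up every row:

--Seed the control row so the first run picks up everything
INSERT INTO dbo.ControlTable (SourceTable, LastLoadID, LastLoadDate, RowsInserted)
VALUES ('SourceTable', 0, '1900-01-01', 0);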
Querying the Control Table
Querying the control table you created is simply done through an Execute SQL Task. Start by creating an OLE DB connection manager to whichever database you ran the setup script in. For the purpose of this example, we'll assume you created the tables in a variant of the AdventureWorks database. To configure the Execute SQL Task, direct the task to use the previously created connection manager. Then, use a query similar to the one that follows for your SQLStatement property. This query will find out the last time you retrieved data from the table called SourceTable.

SELECT LastLoadDate from ControlTable where SourceTable = 'SourceTable'
The answer to the query should be stored in a variable to be used later. To do this, set the ResultSet property (shown in Figure 11-1) to Single Row. Doing this will allow you to use the ResultSet tab. Go to that tab, and click the Add button to create a new resultset. Then change the ResultName property from NewResultName to 0. This stores the result from the first column into the variable of your choosing. You could also have typed the column name from the query (LastLoadDate) into the property.
Figure 11-1
Next, select New Variable from the drop-down box in the ResultSet tab. This will open the Add Variable dialog box (shown in Figure 11-2). Ensure the variable is scoped to the package and call it SourceTableLoadDate. Define the data type of the variable as DateTime and set the default value to 2099-01-01. This ensures that if someone runs the package without running this Execute SQL Task, no data will be retrieved.
Figure 11-2
Querying the Source Table
With the date now set in the variable, you're ready to retrieve any new data from your table called SourceTable. You'll do this with a Data Flow Task that you connect to the Execute SQL Task. Create an OLE DB Source in the Data Flow and have it use the connection manager you created earlier. Then, set the Data Access Mode property to SQL Command and type the following query in the query window below:

SELECT * from SourceTable WHERE CreatedDate BETWEEN ? and ?
The two question marks represent input parameters that will be passed into the query. To set the values for the placeholders, click Parameters, which opens the Set Query Parameters dialog box (shown in Figure 11-3). Set the first parameter to User::SourceTableLoadDate and the second parameter to System::StartTime. The StartTime variable represents the start time of the package. When both parameters are passed into the query, it essentially requests all the rows created or modified between the last load and the time the package started.
With the OLE DB Source now configured, drag an OLE DB Destination over to the Data Flow and connect it to the OLE DB Source. Configure the OLE DB Destination to use the same connection manager and load the table called DestTable. After configuring the mappings, the simple Data Flow is complete, and the final Data Flow should resemble Figure 11-4.
Figure 11-3
Updating the Control Table
Back in the Control Flow, you need one more Execute SQL Task to update the control table. Connect it to the Data Flow Task. To configure the Execute SQL Task, connect it to the same connection manager you have been using and type the following query into the SQLStatement property.
Figure 11-4
UPDATE ControlTable SET LastLoadDate = ? WHERE SourceTable = 'SourceTable'
The question mark in this case represents the time the package began, and to pass it in, go to the Parameter Mapping page and configure it as shown in Figure 11-5. Click the Add button and set the variable name being passed in as System::StartTime, the Data Type as Date, and the Parameter Name as 0.
Figure 11-5
Once you’re done configuring this, your package is ready to run (shown in Figure 11-6). This task is going to update the Control Table and set the last load date as the time the package started, so the next time the package runs it will only get current values. The first time you run the package, you should see three rows go through the data flow. Subsequent runs should show zero rows but try also adding a row to the SourceTable table and running the package over again. If configured correctly, only the single row will go through. You’ll also see that the control table is constantly being updated with a new LastLoadDate column.
Figure 11-6
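Between runs, you can confirm the bookkeeping directly against the tables the setup script created:

--Check the watermark and how many rows have landed so far
SELECT SourceTable, LastLoadDate FROM ControlTable;
SELECT COUNT(*) AS RowsLoaded FROM DestTable;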
SQL Server Change Data Capture
The advantage of the control table pattern you saw previously is that it works across any database platform you may use. The negatives are that it requires a date or identity column to hook into, and it doesn't handle deleted records. The alternative is the Change Data Capture (CDC) feature built into SQL Server. This feature works only in the Enterprise Edition of SQL Server, but it handles deletes and is easier to configure than the previous control table example. This section focuses on how to configure CDC for SSIS; if you want more information on CDC, it is covered extensively in Books Online (by the actual developer of CDC) and demonstrated in more detail in the related CodePlex samples (www.codeplex.com).
In most nightly batches for your ETL, you want to ensure that you are processing only the most recent data — for instance, just the data from the preceding day's operations. Obviously, you don't want to process every transaction from the last five years during each night's batch. However, that's the ideal world, and sometimes the source system is not able to tell you which rows belong to the time window you need. This problem space is typically called Change Data Capture, or CDC. The term refers to the fact that you want to capture just the changed data from the source system within a specified window of time. The changes may include inserts, updates, and deletes, and the required window of time may vary, anything from "all the changes from the last few minutes" all the way through to "all the changes from the last day/week/year," and so on. The key requisite for a CDC solution is that it needs to identify the rows that were affected since a specific, granular point in time.
Following are some common techniques to handle this problem: ➤➤
Adding new date/time columns to the source system: This isn’t usually feasible, either because it is a legacy system and no one knows how to add new functionality, or it is possible but the risk and change management cost is too high, or simply because the DBA or data steward won’t let you! On some systems, such as ERP applications, this change is impossible because of the sheer number and size of tables and the prohibitive cost thereof.
➤➤
Adding triggers to the source system: Such triggers may watch for any data changes and then write an audit record to a separate logging table that the ETL then uses as a source. Though this is less invasive than the previous method, the same challenges apply. An issue here is that every database operation now incurs more I/O cost — when a row is inserted or updated, the original table is updated, and then the new log table is updated too in a synchronous manner. This can lead to decreased performance in the application.
➤➤
Complex queries: It is academically possible to write long complex queries that compare every source row/column to every destination row/column, but practically, this is usually not an alternative because the development and performance costs are too high.
➤➤
Dump and reload: Sometimes there is no way around the problem, and you are forced to delete and recopy the complete set of data every night. For small data sets, this may not be a problem, but once you start getting into the terabyte range you are in trouble. This is the worst possible situation and one of the biggest drivers for non-intrusive, low-impact CDC solutions.
➤➤
Third-party solutions: Some software vendors specialize in CDC solutions for many different databases and applications. This is a good option to look into, because the vendors have the experience and expertise to build robust and high-performance tools.
➤➤
Other solutions: Besides the preceding options, there are solutions such as using queues and application events, but some of these are nongeneric and tightly coupled.
➤➤
Change Data Capture: Last, but not least — and the subject of this section — is the functionality called Change Data Capture, which provides CDC right out of the box. This technology is delivered by the SQL Replication team, but it was designed in concert with the SSIS team. Note that there is another similarly named technology called Change Tracking, which is a synchronous technique that can also be used in some CDC scenarios.
Benefits of SQL Server CDC
Here are some of the benefits that SQL Server 2014 CDC (hereafter referred to as CDC) provides: ➤➤
Low impact: You do not need to change your source schema tables in order to support CDC. Other techniques for Change Data Capture, such as triggers and replication, require you to add new columns (such as timestamps and GUIDs) to the tables you want to track. With CDC, you can be up and running immediately without changing the schema. Obviously, your source system needs to be hosted on SQL Server 2008 or higher in order to take advantage of the CDC functionality.
➤➤
Low overhead: The CDC process is a job that runs asynchronously in the background and reads the changes off the SQL Server transaction log. What this means in plain English is
that, unlike triggers, any updates to the source data do not incur a synchronous write to a logging table. Rather, the writes are delayed until the server is idle, or the writes can be delayed until a time that you specify (for instance, 2:00 a.m. every day). ➤➤
Granular configuration: The CDC process allows you to configure the feature on a per-table basis, which means it is not an all-or-nothing proposition. You can try it out on one table, and once you iron out any issues, you can slowly start using it on more tables.
➤➤
High fidelity capture: The technology flags which rows were inserted, updated, and deleted. It can also tell you exactly which columns changed during updates. Other auditing details such as the event timestamp, as well as the specific transaction ID, are also provided.
➤➤
High fidelity requests: The CDC infrastructure allows you to make very granular requests to the CDC store, so that you can find out exactly when certain operations occurred. For instance, you can ask for changes within any batch window, ranging from a few minutes (near real time) to hours, days, weeks, or more. You can ask for the final aggregated image of the rows, and you can ask for the intermediate changes too.
➤➤
Ease of use: The APIs that you use to request the data are based on the same SQL semantics you are already used to — SELECT statements, user-defined functions, and stored procedures.
➤➤
Resilient to change: The replication team built the technology with change management in mind, meaning that if you set up CDC to work on a certain table, and someone adds or deletes a column in that table, the process is robust enough in most cases to continue running while you make the appropriate fixes. This means you don’t lose data (or sleep!).
➤➤
Transactional consistency: The operations enable you to request changes in a transactionally consistent manner. For instance, if two tables in the source were updated within the context of the same source transaction, you have the means to establish that fact and retrieve the related changes together.
➤➤
SSIS CDC components: CDC and SSIS work hand in hand in the 2014 version of SQL Server because of the components added to the SSIS toolbox. These tools make it easier to query your CDC solution.
Preparing CDC
There are a few steps you need to take to get CDC working. CDC is intended for sources that reside on a SQL Server 2008 or later database. If your data resides on an earlier version of SQL Server or another vendor's solution, unless you migrate the data, this solution is probably not for you. However, you may still want to test the waters and see what benefits you can gain from the functionality — in which case find yourself a test server and follow these same steps.
First, the DBA or a member of the SQL sysadmin fixed server role needs to enable CDC on the SQL Server database. This is a very important point; there should be a clear separation of roles and duties, and open dialog between the DBA and the ETL developer. The ETL developer may be tempted to turn CDC on for every single table, but that is a bad idea. Although CDC has low overhead, it does not have zero overhead. DBAs, conversely, may be protective of their data store and not want anyone to touch it.
Whether the DBA and the ETL developer are different individuals or the same person, the respective parties should consider the pros and cons of the solution from all angles. Books Online has more details about these considerations, so this section will forge ahead with the understanding that much of this may be prototypical. The rest of this discussion assumes that you are using a variant of the AdventureWorks database on a SQL Server 2014 installation. For the script below, you will need Enterprise or Developer edition, and you may have to change the AdventureWorks database name to your own flavor of AdventureWorks. Here is how to enable the functionality at a database level:

USE AdventureWorks;
GO
EXEC sp_changedbowner 'sa'
GO
--Enable CDC on the database
EXEC sys.sp_cdc_enable_db;
GO
--Check CDC is enabled on the database
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE database_id = DB_ID();
When you flip this switch at the database level, SQL Server sets up some of the required infrastructure that you will need later. For instance, it creates a database schema called cdc, as well as the appropriate security, functions, and procedures.
The next step is to ensure that SQL Server Agent is running on the same server on which you just enabled CDC. Agent allows you to schedule when the CDC process will crawl the database logs and write entries to the capture instance tables (also known as shadow tables; the two terms are used interchangeably here). If these terms don't make sense to you right now, don't worry; they soon will. The important thing to do at this point is to use SQL Server 2014 Configuration Manager to ensure that Agent is running. Because this chapter is focused not on the deep technical details of CDC itself but rather on how to use its functionality within the context of ETL, you should visit Books Online if you are not sure how to get Agent running.
Next, you can enable CDC functionality on the tables of your choice. Run the following command in order to enable CDC on the HumanResources.Employee table:

USE AdventureWorks;
GO
--Enable CDC on a specific table
EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'HumanResources'
    ,@source_name = N'Employee'
    ,@role_name = N'cdc_Admin'
    ,@capture_instance = N'HumanResources_Employee'
    ,@supports_net_changes = 1;
The supports_net_changes option enables you to retrieve only the final image of a row, even if it was updated multiple times within the time window you specified. If there were no problems, then you should see the following message displayed in the output of the query editor:

Job 'cdc.AdventureWorks_capture' started successfully.
Job 'cdc.AdventureWorks_cleanup' started successfully.
If you want to verify that CDC is enabled for any particular table, you can issue a command of the following form:

--Check CDC is enabled on the table
SELECT [name], is_tracked_by_cdc
FROM sys.tables
WHERE [object_id] = OBJECT_ID(N'HumanResources.Employee');
--Alternatively, use the built-in CDC help procedure
EXECUTE sys.sp_cdc_help_change_data_capture
    @source_schema = N'HumanResources',
    @source_name = N'Employee';
GO
If all has gone well, the CDC process is now alive and well and watching the source table for any changes. You used the default configuration for setting up CDC on a table, but there are optional parameters that give you much more power. For instance, you can configure exactly which columns should and shouldn’t be tracked, and the filegroup where the shadow table should live, and you can enable other modes. For now, simple is good, so the next step is to have a look at what SQL Server has done for you.
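As a quick illustration of those optional parameters (a sketch only; the capture instance name, column list, and filegroup below are hypothetical and not part of this chapter's example), an enable call that restricts the captured columns and targets a specific filegroup might look like this:

--Hypothetical: capture only selected columns and place the change table
--on a filegroup named SECONDARY (which must already exist)
EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'HumanResources'
    ,@source_name = N'Employee'
    ,@role_name = N'cdc_Admin'
    ,@capture_instance = N'HumanResources_Employee_Subset'
    ,@supports_net_changes = 1
    ,@captured_column_list = N'BusinessEntityID, HireDate, VacationHours'
    ,@filegroup_name = N'SECONDARY';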
Capture Instance Tables
Capture instance tables — also known as shadow tables and change tables — are the tables that SQL Server creates behind the scenes to help the magic of CDC happen. Here is how the CDC process works:
1.
The end user makes a data change in the source system table you are tracking. SQL Server writes the changes to the database log, and then writes the changes to the database. Note that SQL Server always does the log-write (and always has) regardless of whether or not CDC is enabled — in other words, the database log is not a new feature of CDC, but CDC makes good use of it.
2.
CDC includes a process that runs on server idle time, or on a scheduled interval (controlled by SQL Server Agent) that reads the changes back out of the log and writes them to a separate change tracking (shadow) table with a special schema. In other words, the user wrote the change to the database; the change was implicitly written to the SQL log; the CDC process read it back out of the log and wrote it to a separate table. Why not write it to the second table in the first place? The reason is that synchronous writes impact the source system; users may experience slow application performance if their updates cause two separate writes to two separate tables. By using an asynchronous log reader, the DBA can amortize the writes to the shadow table over a longer period. Of course, you may decide to schedule the Agent job to run on a more frequent basis, in which case the experience may be almost synchronous, but that is an ETL implementation decision. Normally, the log reader runs during idle time or when users are not using the system, so there is little to no application performance overhead.
3.
The ETL process then reads the data out of the change table and uses it to populate the destination. You will learn more about that later; for now let’s continue our look at the SQL change tables.
4.
There is a default schedule that prunes the data in the change tables to keep the contents down to three days’ worth of data to prevent the amount of CDC data from becoming unwieldy. You should change this default configuration to suit your specific needs.
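For example, the cleanup job's retention window mentioned in step 4 can be adjusted with sys.sp_cdc_change_job (the seven-day value below is only an illustration; retention is specified in minutes):

--Keep 7 days of change data instead of the default 3 days (4320 minutes)
EXECUTE sys.sp_cdc_change_job
    @job_type = N'cleanup',
    @retention = 10080;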
When you enabled CDC on the HumanResources.Employee table, SQL used a default naming convention to create a shadow table in the same database called cdc.HumanResources_Employee_CT. This table has the same schema as the source table, but it also has several extra metadata columns that CDC needs to do its magic. Issue the following command to see what the shadow table looks like. There should be no rows in the table right now, so you will get back an empty result set.

SELECT * FROM cdc.HumanResources_Employee_CT;
Here is a brief overview of the main metadata columns: ➤➤
The __$start_lsn and __$seqval columns identify the original transaction and order in which the operations occurred. These are important values — the API (which you will look at later) operates purely in terms of the LSNs (commit log sequence numbers), but you can easily map date/time values to and from LSNs to make things simpler.
➤➤
The __$operation column shows the source operation that caused the change (1 = delete, 2 = insert, 3 = update [before image], 4 = update [after image], and 5 = merge).
➤➤
The __$update_mask column contains a bit mask indicating which specific columns changed during an update. It specifies what columns changed on a row-by-row basis; however, the mask is just a bitmap, so you need to map the ordinal position of each bit to the column name that it represents. CDC provides functions such as sys.fn_cdc_has_column_changed to help you make sense of these masks.
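As a small sketch of the helper function mentioned in the last bullet (using the capture instance created earlier; the column choice is just an example), you can ask which change rows touched a particular column:

--Which change rows touched the HireDate column?
SELECT __$start_lsn, __$operation, BusinessEntityID
FROM cdc.HumanResources_Employee_CT
WHERE sys.fn_cdc_has_column_changed(N'HumanResources_Employee',
          N'HireDate', __$update_mask) = 1;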
Okay, now for the exciting part. Make a data change in the source table and then look at the shadow table again to see what happened. To keep it simple, update one specific field on the source table using the following command. Remember that the process runs asynchronously, so you may have to wait a few seconds before the changes appear in the shadow table. Therefore, after running the following statement, wait a few seconds and then run the preceding SELECT statement again.

--Make an update to the source table
UPDATE HumanResources.Employee
SET HireDate = DATEADD(day, 1, HireDate)
WHERE [BusinessEntityID] IN (1, 2, 3);
Rather than wait for the asynchronous log reader process to occur, you can also force the process to happen on demand by issuing this command:

--Force CDC log crawl
EXEC sys.sp_cdc_start_job;
The shadow table should contain two rows for every source row you updated. Why two rows, when you performed only one update per row? The reason is that for updates, the change table contains the before and after images of the affected rows. Now try inserting or deleting a row in the source and note what rows are added to the shadow table.
The CDC API
The previous section was just academic background on what is happening; you don't actually need all this knowledge in order to apply the solution to the problem at hand. CDC provides a set of functions and procedures that abstract away the details of the technology and make it very simple to
use. When you enabled CDC on the table, SQL automatically generated several function wrappers for you so that you can query the shadow table with ease. Here is an example:

USE AdventureWorks;
GO
--Let's check for all changes since the same time yesterday
DECLARE @begin_time AS DATETIME = GETDATE() - 1;
--Let's check for changes up to right now
DECLARE @end_time AS DATETIME = GETDATE();
--Map the time intervals to a CDC query range (using LSNs)
DECLARE @from_lsn AS BINARY(10) =
    sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time);
DECLARE @to_lsn AS BINARY(10) =
    sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);
--Validate @from_lsn using the minimum LSN available in the capture instance
DECLARE @min_lsn AS BINARY(10) = sys.fn_cdc_get_min_lsn('HumanResources_Employee');
IF @from_lsn < @min_lsn SET @from_lsn = @min_lsn;
--Return the NET changes that occurred within the specified time
SELECT *
FROM cdc.fn_cdc_get_net_changes_HumanResources_Employee(@from_lsn, @to_lsn,
    N'all with mask');
The CDC functions understand only LSNs. Therefore, you first need to map the date/time values to LSN numbers, being careful to check the minimum and maximum extents. You then call a wrapper function for the table called cdc.fn_cdc_get_net_changes_<capture_instance>(), which returns the rows that have changed. You specify all with mask, which means that the __$update_mask column is populated to tell you which columns changed. If you don't need the mask, just specify all, because calculating the mask is expensive. The all and all with mask options both populate the __$operation column accordingly. If you had used the parameter value all with merge, the same results would come back, but the __$operation flag would contain only either 1 (delete) or 5 (merge). This is useful if you only need to know whether the row was deleted or changed, but you don't care what the specific change was. This option is computationally cheaper for SQL to execute.
The function you used in this example returns the net changes for the table — meaning if any specific row had multiple updates applied against it in the source system, the result returned would be the net combined result of those changes. For instance, if someone inserted a row and then later (within the same batch window) updated that same row twice, the function would return a row marked as Inserted (__$operation = 2), but the data columns would reflect the latest values after the second update. Net changes are most likely what you will use for loading your warehouse, because they give you the final image of the row at the end of the specified window, and do not encumber you with any interim values the row might have had. Some near-real-time scenarios, and applications such as auditing and compliance tracking, may require the interim values too.
Instead of asking for only the net changes to the source table, you can also ask for the granular (interim) changes. To do this you use another function that CDC automatically generated for you, in this case called cdc.fn_cdc_get_all_changes_<capture_instance>(). Here is an example that uses
the update mask and the all-changes mode together (note that the BusinessEntityID column may be called EmployeeID in previous versions of AdventureWorks):

USE AdventureWorks;
GO
--First update another column besides the HireDate so you can
--test the difference in behavior
UPDATE HumanResources.Employee
SET VacationHours = VacationHours + 1
WHERE BusinessEntityID IN (3, 4, 5);
WAITFOR DELAY '00:00:10'; --Wait 10s to let the log reader catch up
--Map times to LSNs as you did previously
DECLARE @begin_time AS DATETIME = GETDATE() - 1;
DECLARE @end_time AS DATETIME = GETDATE();
DECLARE @from_lsn AS BINARY(10) =
    sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time);
DECLARE @to_lsn AS BINARY(10) =
    sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);
DECLARE @min_lsn AS BINARY(10) = sys.fn_cdc_get_min_lsn('HumanResources_Employee');
IF @from_lsn < @min_lsn SET @from_lsn = @min_lsn;
--Get the ordinal position(s) of the column(s) you want to track
DECLARE @hiredate_ord INT =
    sys.fn_cdc_get_column_ordinal(N'HumanResources_Employee', N'HireDate');
DECLARE @vac_hr_ord INT =
    sys.fn_cdc_get_column_ordinal(N'HumanResources_Employee', N'VacationHours');
--Return ALL the changes and a flag to tell us if the HireDate changed
SELECT BusinessEntityID,
    --Boolean value to indicate whether hire date was changed
    sys.fn_cdc_is_bit_set(@hiredate_ord, __$update_mask) AS [HireDateChg],
    --Boolean value to indicate whether vacation hours was changed in the source
    sys.fn_cdc_is_bit_set(@vac_hr_ord, __$update_mask) AS [VacHoursChg]
FROM cdc.fn_cdc_get_all_changes_HumanResources_Employee(@from_lsn, @to_lsn, N'all');
This call should return every row from the shadow table without aggregating them into a net-changes view. This is useful if your destination system needs to track everything that happened to a source table, including interim values. It includes two BIT fields that indicate whether specific columns were changed.
If you want to disable CDC on a table, use a command of the following form. Be careful, though; this command will drop the shadow table and any data it contains:

EXECUTE sys.sp_cdc_disable_table
    @source_schema = N'HumanResources',
    @source_name = N'Employee',
    @capture_instance = N'HumanResources_Employee';
Using the SSIS CDC Tools
Now that you know how to set up CDC in your relational database engine, it is time to use SSIS to pull the data you need. Three tools in the SSIS toolbox work hand in hand with CDC: ➤➤
CDC Control Task: Used to control the CDC sequence of events in CDC packages. It controls the CDC package synchronization with the initial load package. It also governs the LSN ranges in a CDC package. It can also deal with errors and recovery.
➤➤
CDC Source: Used to query the CDC change tables.
➤➤
CDC Splitter: Sends data down different data paths for Inserts, Updates, and Deletions.
In the CDC Control Task there are several options for control operations: ➤➤
Mark Initial Load Start: Records the first load starting point
➤➤
Mark Initial Load End: Records the first load ending point
➤➤
Mark CDC Start: Records the beginning of the CDC range
➤➤
Get Processing Range: Retrieves the range for the CDC values
➤➤
Mark Processed Range: Records the range of values processed
These tools make it much easier for you to control CDC packages in SSIS. The CDC Control Task creates a new table in a database of your choosing that holds the CDC state, and that state can then be retrieved using the same CDC Control Task. The CDC state records the point your CDC processing has reached; it might mean the database needs to be queried back one hour, one week, or one year. The CDC state tells your SSIS package where to gather data from, and it can also tell whether the data is already up-to-date. After determining the CDC state, a Data Flow is used to move the data: the CDC Source queries the data and pushes it down the Data Flow path, and the CDC Splitter sends the data down the appropriate paths for updating, deleting, or inserting. After the Data Flow is done loading the data, another CDC Control Task can update the CDC state.
The first step in getting these CDC tools to work is to set up the CDC state table and create an ADO.NET connection to the database where the CDC state table is found. You can use the CDC Control Task itself to help with these steps. Open the CDC Control Task and you will see a screen like Figure 11-7. Click the New button next to the first Connection Manager option and create an ADO.NET connection to the AdventureWorks database. Then click the New button next to the "Tables to use for storing state" option. This will open a window as seen in Figure 11-8. The SQL command for creating the table and the index it automatically generated is shown in this window, and clicking the Run button will create the table for the CDC state in the AdventureWorks database. You will also need a variable in the SSIS package to hold the CDC state. Click the New button next to the "Variable containing the CDC state" option to create this variable (shown in Figure 11-7). The default name of this variable is CDC_State.
Now that you have the CDC state table created, a connection to it, and a variable to hold the state, you are ready to create a package to load the CDC data. In most situations you will have a CDC package that does the initial load, meaning a package that runs one time to dump all of the data from the source into your data warehouse or whatever destination you are trying to keep updated. This initial load consists of a package with a CDC Control Task to mark the initial load start, then a Data Flow, and then a CDC Control Task to mark the initial load end. The following example builds the package that will be used on a schedule to update the data in your destination after the initial load has been done. To set the initial load end, just run a CDC Control Task set to "Mark Initial Load End." Afterward, you can query the CDC state table to ensure you have a value set.
Figure 11-7
Figure 11-8
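If you cannot run the dialog yourself, the script it generates is roughly equivalent to the following (a sketch; the exact table, schema, and index names depend on what you enter in the dialog):

--Approximation of the state table the CDC Control Task generates
CREATE TABLE [dbo].[cdc_states]
(
    [name]  NVARCHAR(256) NOT NULL,
    [state] NVARCHAR(256) NOT NULL
);
GO
CREATE UNIQUE NONCLUSTERED INDEX [ux_cdc_states_name]
    ON [dbo].[cdc_states] ([name] ASC);
GO

After the initial load package has run its Mark Initial Load End operation, querying this table (SELECT name, state FROM dbo.cdc_states;) should show a row for your state name.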
Note: The State Name in the CDC Control Task must match the name in the initial load CDC package.
The incremental package will start with a CDC Control Task that gets the processing range. A Data Flow is then used with the CDC Source and CDC Splitter. Then another CDC Control Task is used to mark the processed range. This ensures that the package is ready to run again and pick up the next batch of changes in the CDC. In the following example we will not write the records to a destination, but you can use Data Viewers to see the CDC data.
1.
Create a package named CDC Demo. Drag in a CDC Control Task and set up the options to match Figure 11-9. This will get the process range from the CDC state table and save it in the CDC_State variable in the SSIS package.
Figure 11-9
Note: If you have been following along with the examples in the chapter, then you already have CDC enabled on the Employee table in the AdventureWorks database. If not, you will need to go back and run the code in this chapter to see the expected results in this package.
2.
Now drag in a Data Flow and connect the CDC Control Task to the Data Flow. The package should look like Figure 11-10.
3.
Open the Data Flow and drag in the CDC Source and set the options in the source to match Figure 11-11. You are selecting the Employee table as the source and the All option to get all of the changes from the table. The CDC_State variable also needs to be set in the source.
Figure 11-10
4.
Once you have the source configured, drag in the CDC Splitter Transformation. Connect the CDC Source to the CDC Splitter. Drag in three Union All Transformations and connect each of the CDC Splitter outputs to the Union All Transformations. Once complete, your Data Flow should look like Figure 11-12.
Figure 11-11
Figure 11-12
5.
You can now place a Data Viewer on each of the CDC Splitter outputs and see the rows that have been updated, deleted, and inserted. If you want to test this further, go back to the Employee table and run some SQL commands to add rows, delete rows, and perform some updates. These changes will show in the Data Viewers.
6.
To complete this package you will need to add three destinations in the Data Flow, one for each of the CDC Splitter outputs. You will also need to add another CDC Control Task after the Data Flow Task to mark the processed range. This ensures the package will not pick up the same rows twice.
7.
Back in the Control Flow, add one more CDC Control Task. Configure it the same way you configured the first, but change the CDC Control Operation to Mark Processed Range. Attach this task to the Data Flow Task. This task writes back to the state table what transactions have already been transferred.
This example shows you the functionality of the CDC tools in SQL Server Integration Services. These tools should make CDC work much easier inside of SSIS. As you can see, though, the steps performed here resemble the steps you would take with a control table.

Note: There is also an SSIS Oracle CDC option now, too, though it is not specifically covered in this book.
Summary

As this chapter has shown, you can take advantage of many opportunities to use CDC in concert with SSIS in your ETL solution to incrementally load data into your database. You can also use the control table pattern to load data in a very similar way. While it requires more work to implement, it won't require Enterprise Edition of SQL Server.
12

Loading a Data Warehouse

What's in This Chapter?
➤➤ Data profiling
➤➤ Dimension and fact table loading
➤➤ Analysis Services cube processing

Wrox.com Downloads for This Chapter
You can find the wrox.com code downloads for this chapter at www.wrox.com/go/prossis2014 on the Download Code tab.
Among the various applications of SQL Server Integration Services (SSIS), one of the more common is loading a data warehouse or data mart. SSIS provides the extract, transform, and load (ETL) features and functionality to efficiently handle many of the tasks required when dealing with transactional source data that will be extracted and loaded into a data mart, a centralized data warehouse, or even a master data management repository, including the capabilities to process data from the relational data warehouse into SQL Server Analysis Services (SSAS) cubes. SSIS provides all the essential elements of data processing — from your source, to staging, to your data mart, and onto your cubes (and beyond!). A few common architectures are prevalent in data warehouse solutions. Figure 12-1 highlights one common architecture of a data warehouse with an accompanying business intelligence (BI) solution.
Figure 12-1 (common BI architecture: source systems such as ERP, HR/CRM, and other data feed the SSIS ETL area for staging and integration, then the data mart/data warehouse and cubes, which support the presentation layer of reports, KPIs, scorecards, portals, and analytic tools)
The presentation layer on the right side of Figure 12-1 shows the main purpose of the BI solution, which is to provide business users (from the top to the bottom of an organization) with meaningful data from which they can take actionable steps. Underlying the presentation data are the back-end structures and processes that make it possible for users to access the data and use it in a meaningful way. Another common data warehouse architecture employs a central data warehouse with subject-oriented data marts loaded from the data warehouse. Figure 12-2 demonstrates this data warehouse architecture.
Figure 12-2 (hub-and-spoke architecture: source systems feed a central data warehouse, which in turn loads subject-oriented data marts and their cubes for presentation)
ETL is an important part of a data warehouse and data mart back-end process because it is responsible for moving and restructuring the data between the data tiers of the overall BI solution. This involves many steps, as you will see — including data profiling, data extraction, dimension table loading, fact table processing, and SSAS processing. This chapter will set you on course to architecting and designing an ETL process for data warehouse and business intelligence solutions. In fact, SSIS contains several out-of-the-box tasks and transformations to get you well on your way to a stable and straightforward ETL process. Some of these components include the Data Profiling Task, the Slowly Changing Dimension Transformation, and the Analysis Services Execute DDL Task.

The tutorials in this chapter, like other chapters, use the sample databases for SQL Server, called AdventureWorks and AdventureWorksDW. In addition to the databases, a sample SSAS cube database solution is also used. These databases represent a transactional database schema and a data warehouse schema. The tutorials in this chapter use the sample databases and demonstrate a coordinated process for the Sales Quota Fact table and the associated SSAS measure group, which includes the ETL required for the Employee dimension. You can go to www.wrox.com/go/prossis2014 and download the code and package samples found in this chapter, including the version of the SSAS AdventureWorks database used.
Data Profiling

Ultimately, data warehousing and BI are about reporting and analytics, and the first step toward that objective is understanding the source data, because that has an immeasurable impact on how you design the structures and build the ETL. Data profiling is the process of analyzing the source data to better understand its condition in terms of cleanliness, patterns, number of nulls, and so on. In fact, you have probably profiled data before with scripts and spreadsheets without even realizing that it was called data profiling. A helpful way to profile data in SSIS, the Data Profiling Task, is reviewed in Chapter 3, but let's drill into some more details about how to leverage it for data warehouse ETL.
Initial Execution of the Data Profiling Task

The Data Profiling Task is unlike the other tasks in SSIS because it is not intended to be run repeatedly through a scheduled operation. Consider SSIS as the wrapper for this tool. You use SSIS to configure and run the Data Profiling Task, which outputs an XML file with information about the data you select. You then observe the results through the Data Profile Viewer, which is a standalone application. The output of the Data Profiling Task will be used to help you in your development and design of the ETL and dimensional structures in your solution. Periodically, you may want to rerun the Data Profiling Task to see how the data has changed, but the task will not run in the recurring ETL process.
1. Open Visual Studio and create a new SSIS project called ProSSIS_Ch12. You will use this project throughout this chapter.
2. In the Solution Explorer, rename Package.dtsx to Profile_EmployeeData.dtsx.
3. The Data Profiling Task requires an ADO.NET connection to the source database (as opposed to an OLE DB connection). Therefore, create a new ADO.NET connection in the Connection Manager window by right-clicking and choosing "New ADO.NET Connection" and then click the New button. After you create a connection to the AdventureWorks database, return to the Solution Explorer window.
4. In the Solution Explorer, create a new project connection to your local machine or where the AdventureWorks sample database is installed, as shown in Figure 12-3.

Figure 12-3

5. Click OK to save the connection information and return to the SSIS package designer. (In the Solution Explorer, rename the project connection to ADONETAdventureWorks.conmgr so that you will be able to distinguish this ADO.NET connection from other connections.)
6. Drag a Data Profiling Task from the SSIS Toolbox onto the Control Flow and double-click the new task to open the Data Profiling Task Editor.
7. The Data Profiling Task includes a wizard that will create your profiling scenario quickly; click the Quick Profile Button on the General tab to launch the wizard.
8. In the Single Table Quick Profile Form dialog, choose the ADONETAdventureWorks connection; and in the Table or View dropdown, select the [Sales].[vSalesPerson] view from the list. Enable all the checkboxes in the Compute list and change the Functional Dependency Profile to use 2 columns as determinant columns, as shown in Figure 12-4. The next section reviews the results and describes the output of the data profiling steps.
Figure 12-4
9. Click OK to save the changes, which will populate the Requests list in the Data Profiling Task Editor, as shown in Figure 12-5. Chapter 3 describes each of these different request types, and you will see the purpose and output of a few of these when we run the viewer.
Figure 12-5
10. Return to the General tab of the editor. In the Destination property box, choose New File Connection. This is where you will select the location of the XML file where the Data Profiling Task stores its profile output when it is run.
11. In the File Connection Manager Editor, change the Usage type dropdown to "Create file" and enter C:\ProSSIS\Data\Employee_Profile.xml in the File text box. Click OK to save your changes to the connection, and click OK again to save your changes in the Data Profiling Task Editor.
12. Now it is time to execute this simple package. Run the package in Visual Studio, which will initiate several queries against the source table or view (in this case, a view). Because this view returns only a few rows, the Data Profiling Task will execute rather quickly, but with large tables it may take several minutes (or longer if your table has millions of rows and you are performing several profiling tests at once).
The results of the profile are stored in the Employee_Profile.xml file, which you will next review with the Data Profile Viewer tool.
Reviewing the Results of the Data Profiling Task

Despite common user expectations, data cannot be magically generated, no matter how creative you are with data cleansing. For example, suppose you are building a sales target analysis that uses employee data, and you are asked to build into the analysis a sales territory group, but the source column has only 50 percent of the data populated. In this case, the business user needs to rethink the value of the data or fix the source. This is a simple example for the purpose of the tutorials in this chapter, but consider a more complicated example or a larger table. The point is that your source data is likely to be of varying quality. Some data is simply missing, other data has typos, sometimes a column has so many different discrete values that it is hard to analyze, and so on.

The purpose of doing data profiling is to understand the source, for two reasons. First, it enables you to review the data with the business user, which can effect changes; second, it provides the insight you need when developing your ETL operations. In fact, even though we're bringing together business data that the project stakeholders use every day, we're going to be using that data in ways that it has never been used before. Because of this, we're going to learn things about it that no one knows — not even the people who are the domain experts. Data profiling is one of the up-front tasks that helps the project team avoid unpleasant (and costly) surprises later on. Now that you have run the Data Profiling Task, your next objective is to evaluate the results:
1. Observing the output requires using the Data Profile Viewer. This utility is found in the Integration Services subdirectory for Microsoft SQL Server 2014 (Start Button ➪ All Programs ➪ Microsoft SQL Server 2014 ➪ Integration Services) or, in Windows 8, simply type Data Profile Viewer at the Start screen.
2. Open the Employee_Profile.xml file created earlier by clicking the Open button and navigating to the C:\ProSSIS\Data folder (or the location where the file was saved), highlighting the file, and clicking Open again.
3. In the Profiles navigation tree, first click the table icon on the top left to put the tree viewer into Column View. Then drill down into the details by expanding Data Sources, server (local), Databases, AdventureWorks, and the [Sales].[vSalesPerson] table, as shown in Figure 12-6.
Figure 12-6
4. The first profiling output to observe is the Candidate Key Profiles, so click this item under the Columns list, which will open the results in the viewer on the right. Note that the Data Profiling Task has identified seven columns that are unique across the entire table (with 100 percent uniqueness), as shown in Figure 12-7.
Figure 12-7
Given the small size of this table, all these columns are unique, but with larger tables you will see fewer unique columns, less than 100 percent uniqueness, and any exceptions or key violations. The question is, which column looks to be the right candidate key for this table? In the next section you will see how this answer affects your ETL.
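If you want to confirm a candidate key outside the Data Profile Viewer, a simple duplicate check against the source gives the same answer; here is a minimal sketch using the BusinessEntityID column from this example:

    -- Any rows returned mean BusinessEntityID is not a reliable candidate key
    SELECT BusinessEntityID, COUNT(*) AS RowsPerKey
    FROM Sales.vSalesPerson
    GROUP BY BusinessEntityID
    HAVING COUNT(*) > 1;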
5. Click the Functional Dependency Profile object on the left and observe the results. This shows the relationship between values in multiple columns. Two columns are shown: Determinant Column(s) and Dependent Column. The question is, for every unique value (or combination) in the Determinant Column, is there only one unique value in the Dependent Column? Observe the output. What is the relationship between these combinations of columns: TerritoryGroup and TerritoryName, StateProvinceName, and CountryRegionName? Again, in the next section you will see how these results affect your ETL.
6. In the profile tree, click the "View Single Column by Profile" icon at the top right of the profile tree. Next, expand the TerritoryName column and highlight the Column Length Distribution. Then, in the distribution profile on the right, double-click the length distribution of 6, as shown in Figure 12-8.
Figure 12-8
The column length distribution shows the number of rows by length. What are the maximum and minimum lengths of values for the column?
7. Under TerritoryName in the profile browser, select the Column Null Ratio Profile and then double-click the row in the profile viewer on the right to view the detail rows. The Column Null Ratio shows what percentage of rows in the entire table have NULL values. This is valuable for ETL considerations because it spells out when NULL handling is required for the ETL process, which is one of the most common transformation processes.
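You can double-check this ratio directly against the source with a simple aggregate; this query is only a verification sketch and is not part of the package:

    -- Percentage of salespeople with no territory assigned
    SELECT 100.0 * SUM(CASE WHEN TerritoryName IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS NullPercent
    FROM Sales.vSalesPerson;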
8. Select the Column Value Distribution Profile on the left under TerritoryName and observe the output in the results viewer. How many unique values are there in the entire table? How many values are used only one time in the table?
9. In the left navigation pane, expand the PhoneNumber column and then click the Column Pattern Profile. Double-click the first pattern, number 1, in the list on the right, as shown in Figure 12-9. As you can see, the bottom right of the window shows the actual data values for the phone numbers matching the selected pattern. This data browser is helpful in seeing the actual values so that you can analyze the effectiveness of the Data Profiling Task.
Figure 12-9
The Column Pattern Profile uses regular expression syntax to display what pattern or range of patterns the data in the column contains. Notice that for the PhoneNumber column, two patterns emerge. The first is for phone numbers that are in the syntax ###-555-####, which is translated to \d\d\d-555-\d\d\d\d in regular expression syntax. The other pattern begins with 1 \(11\) 500 555- and ends with four variable numbers.
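T-SQL does not support full regular expressions, but you can approximate the first pattern with a LIKE predicate if you want to pull the matching rows yourself; this is just an illustrative check:

    SELECT PhoneNumber
    FROM Sales.vSalesPerson
    WHERE PhoneNumber LIKE '[0-9][0-9][0-9]-555-[0-9][0-9][0-9][0-9]';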
10. The final data profiling type to review is the Column Statistics Profile. This is applicable only to data types related to numbers (integer, float, decimal, numeric) and dates (dates allow only minimum and maximum calculations). In the Profiles tree view on the left of the Data Profile Viewer, expand the SalesYTD column and then click the Column Statistics Profile. Four results are calculated across the spread of values in the numeric column:
a. Minimum: The lowest number value in the set of column values
b. Maximum: The highest number value in the set of column values
c. Mean: The average of values in the set of column values
d. Standard Deviation: The average variance between the values and the mean
The Column Statistics Profile is very valuable for fact table source evaluation, as the measures in a fact table are almost always numeric based, with a few exceptions. Overall, the output of the Data Profiling Task has helped to identify the quality and range of values in the source. This naturally leads to using the output results to formulate the ETL design.
Turning Data Profile Results into Actionable ETL Steps

The typical first step in evaluating source data is to check the existence of source key columns and referential completeness between source tables or files. Two of the data profiling outputs can help in this effort:
➤➤ The Candidate Key Profile will provide the columns (or combination of columns) with the highest uniqueness. It is crucial to identify a candidate key (or composite key) that is 100 percent unique, because when you load your dimension and fact tables, you need to know how to identify a new or existing source record. In the preceding example, shown in Figure 12-7, several columns meet the criteria. The natural selection from this list is the BusinessEntityID column.
➤➤ The Column NULL Ratio is another important output of the Data Profiling Task. This can be used to verify that foreign keys in the source table have completeness, especially if the primary key to foreign key relationships will be used to relate a dimension table to a fact, or a dimension table to another dimension table. Of course, this doesn't verify that the primary-to-foreign key values line up, but it will give you an initial understanding of referential data completeness.
As just mentioned, the Column NULL Ratio can be used for an initial review of foreign keys in source tables or files that have been loaded into SQL Server for data profiling review. The Column NULL Ratio is an excellent output because it can be used for almost every destination column type, such as dimension attributes, keys, and measures. Anytime you have a column that has NULLs, you will most likely have to replace them with unknowns or perform some data cleansing to handle them. In step 7 of the previous section, the Territory Name has approximately a 17 percent NULL ratio. In your dimension model destination this is a problem, because the Employee dimension has a foreign surrogate key to the Sales Territory dimension. Because there isn't completeness in the SalesTerritory, you don't have a reference to the dimension. This is an actionable item that you will need to address in the dimension ETL section later.

Other useful output of the Data Profiling Task includes the column length and statistics presented. Data type optimization is important to define; when you have a large inefficient source column where most of the space is not used (such as a char(1000)), you will want to scale back the data type to a reasonable length. To do so, use the Column Length Distribution (refer to Figure 12-8). The column statistics can be helpful in defining the data type of your measures. Optimization of data types in fact tables is more important than dimensions, so consider the source column's max and min values to determine what data type to use for your measure. The wider a fact table, the slower it will perform, because fewer rows will fit in the server's memory for query execution, and the more disk space it will occupy on the server.

Once you have evaluated your source data, the next step is to develop your data extraction, the "E" of ETL.
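The dimension ETL later in this chapter resolves the missing territory with a Lookup, but the simplest general pattern for NULL handling is to substitute an Unknown member. In T-SQL terms the idea looks like the following; inside the package you would typically implement it with a Derived Column Transformation instead:

    SELECT BusinessEntityID,
           ISNULL(TerritoryName, N'Unknown') AS SalesTerritoryRegion
    FROM Sales.vSalesPerson;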
Data Extraction and Cleansing

Data extraction and cleansing applies to many types of ETL, beyond just data warehouse and BI data processing. In fact, several chapters in this book deal with data extraction for various needs, such as incremental extraction, change data capture, and dealing with various sources. Refer to the following chapters to plan your SSIS data extraction components:
➤➤ Chapter 4 takes an initial look at the Source components in the Data Flow that will be used for your extraction.
➤➤ Chapter 10 considers data cleansing, which is a common task for any data warehouse solution.
➤➤ Chapter 13 deals with using the SQL Server relational engine to perform change data capture.
➤➤ Chapter 14 is a look at heterogeneous, or non-SQL Server, sources for data extraction.
The balance of this chapter deals with the core of data warehouse ETL, which is dimension and fact table loading, SSAS object processing, and ETL coordination.
Dimension Table Loading

Dimension transformation and loading is about tracking the current values, and sometimes the history, of associated attributes in a dimension table. Figure 12-10 shows the dimensions related to the Sales Quota Fact table in the AdventureWorksDW database (named FactSalesQuota). The objective of this section is to process data from the source tables into the dimension tables.
Figure 12-10
In this example, notice that each dimension (DimEmployee, DimSalesTerritory, and DimDate) has a surrogate key (the dimension Key column) as well as a candidate key (the dimension AlternateKey column). The surrogate key is the most important concept in data warehousing because it enables the tracking of change history and optimizes the structures for performance. See The Data Warehouse Toolkit, Third Edition, by Ralph Kimball and Margy Ross (Wiley, 2013), for a detailed review of the use and purpose of surrogate keys. Surrogate keys are often auto-incrementing identity columns that are contained in the dimension table; a simplified table sketch follows the list below.

Dimension ETL has several objectives, each of which is reviewed in the tutorial steps to load the DimSalesTerritory and DimEmployee tables, including the following:
➤➤ Identifying the source keys that uniquely identify a source record and that will map to the alternate key
➤➤ Performing any data transformations to align the source data to the dimension structures
➤➤ Handling the different change types for each source column and adding or updating dimension records
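To make the surrogate key and alternate (business) key distinction concrete, the following is a simplified sketch of a dimension table with an identity-based surrogate key; it is illustrative only and is not the exact AdventureWorksDW definition:

    CREATE TABLE dbo.DimSalesTerritory_Sketch (
        SalesTerritoryKey          INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
        SalesTerritoryAlternateKey INT NOT NULL,          -- business (candidate) key from the source
        SalesTerritoryRegion       NVARCHAR(50) NOT NULL,
        SalesTerritoryCountry      NVARCHAR(50) NOT NULL,
        SalesTerritoryGroup        NVARCHAR(50) NULL
    );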
SSIS includes a built-in transformation called the Slowly Changing Dimension (SCD) Transformation to assist in the process. This is not the only transformation that you can use to load a dimension table, but you will use it in these tutorial steps to accomplish dimension loading. The SCD Transformation also has some drawbacks, which are reviewed at the end of this section.
Loading a Simple Dimension Table

Many dimension tables are like the Sales Territory dimension in that they contain only a few columns, and history tracking is not required for any of the attributes. In this example, the DimSalesTerritory table is sourced from the [Sales].[SalesTerritory] table, and any source changes to any of the three columns will be updated in the dimension table. These columns are referred to as changing dimension attributes, because the values can change.
1. To begin creating the ETL for the DimSalesTerritory table, return to your SSIS project created in the first tutorial and create a new package named ETL_DimSalesTerritory.dtsx.
2. Because you will be extracting data from the AdventureWorks database and loading data into the AdventureWorksDW database, create two OLE DB project connections to these databases named AdventureWorks and AdventureWorksDW, respectively. Refer to Chapter 2 for help about defining the project connections.
3. Drag a new Data Flow Task from the SSIS Toolbox onto the Control Flow and navigate to the Data Flow designer.
4. Drag an OLE DB Source component into the Data Flow and double-click the new source to open the editor. Configure the OLE DB Connection Manager dropdown to use the AdventureWorks database and leave the data access mode selection as "Table or view." In the "Name of the table or the view" dropdown, choose [Sales].[SalesTerritory], as shown in Figure 12-11.
5. On the Columns property page (see Figure 12-12), change the Output Column value for the TerritoryID column to SalesTerritoryAlternateKey, change the Name column to SalesTerritoryRegion, and change the Output Column for the Group column to SalesTerritoryGroup. Also, uncheck all the columns under SalesTerritoryGroup because they are not needed for the DimSalesTerritory table.
Figure 12-11
Figure 12-12
6. Click OK to save your changes and then drag a Lookup Transformation onto the Data Flow and connect the blue data path from the OLE DB Source onto the Lookup.
7. On the General property page, shown in Figure 12-13, edit the Lookup Transformation as follows: leave the Cache mode setting at Full cache, and leave the Connection type setting at OLE DB Connection Manager.
Figure 12-13
8. On the Connection property page, set the OLE DB Connection Manager dropdown to the AdventureWorks connection. Change the "Use a table or a view" dropdown to [Person].[CountryRegion].
9. On the Columns property page, drag the CountryRegionCode from the Available Input Columns list to the matching column in the Available Lookup Columns list, then select the checkbox next to the Name column in the same column list. Rename the Output Alias of the Name column to SalesTerritoryCountry, as shown in Figure 12-14.
Figure 12-14
10. Select OK in the Lookup Transformation Editor to save your changes. At this point in the process, you have performed some simple initial steps to align the source data with the destination dimension table. The next steps are the core of the dimension processing and use the SCD Transformation.
11. Drag a Slowly Changing Dimension Transformation from the SSIS Toolbox onto the Data Flow and connect the blue data path output from the Lookup onto the Slowly Changing Dimension Transformation. When you drop the path onto the SCD Transformation, you will be prompted to select the output of the Lookup. Choose Lookup Match Output from the dropdown and then click OK.
12. To invoke the SCD wizard, double-click the transformation, which will open a splash screen for the wizard. Proceed to the second screen by clicking Next.
13. The first input of the wizard requires identifying the dimension table to which the source data relates. Therefore, choose AdventureWorksDW as the Connection Manager and then choose [dbo].[DimSalesTerritory] as the table or view, which will automatically display the dimension table's columns in the list, as shown in Figure 12-15. For the SalesTerritoryAlternateKey, change the Key Type to Business key. Two purposes are served here:
➤➤ One, you identify the candidate key (or business key) from the dimension table and which input column it matches. This will be used to identify row matches between the source and the destination.
➤➤ Two, columns are matched from the source to attributes in the dimension table, which will be used on the next screen of the wizard to identify the change tracking type. Notice that the columns are automatically matched between the source input and the destination dimension because they have the same name and data type. In other scenarios, you may have to manually perform the match.

Figure 12-15

14. On the next screen of the SCD wizard, you need to identify what type of change each matching column is identified as. It has already been mentioned that all the columns are changing attributes for the DimSalesTerritory dimension; therefore, select all the columns and choose the "Changing attribute" Change Type from the dropdown lists, as shown in Figure 12-16.
Figure 12-16
Three options exist for the Change Type: Changing attribute, Historical attribute, and Fixed attribute. As mentioned earlier, a Changing attribute is updated if the source value changes. For the Historical attribute, when a change occurs, a new record is generated, and the old record preserves the history of the change. You'll learn more about this when you walk through the DimEmployee dimension ETL in the next section of this chapter. Finally, a Fixed attribute means no changes should happen, and the ETL should either ignore the change or break.
15. The next screen, titled "Fixed and Changing Attribute Options," prompts you to choose which records you want to update when a source value changes. The "Fixed attributes" option is grayed out because no Fixed attributes were selected on the prior screen. Under the "Changing attributes" option, you can choose to update the changing attribute column for all the records that match the same candidate key, or you can choose to update only the most recent one. It doesn't matter in this case because there will be only one record per candidate key value, as there are no historical attributes that would cause a new record. Leave this box unchecked and proceed to the next screen.
16. The "Inferred Dimension Members" screen is about handling placeholder records that were added during the fact table load, because a dimension member didn't exist when the fact load was run. Inferred members are covered in the DimEmployee dimension ETL, later in this chapter.
17. Given the simplicity of the Sales Territory dimension, this concludes the wizard, and on the last screen you merely confirm the settings that you configured. Select Finish to complete the wizard.
The net result of the SCD wizard is that it will automatically generate several downstream transformations, preconfigured to handle the change types based on the candidate keys you selected. Figure 12-17 shows the completed Data Flow with the SCD Transformation.
Figure 12-17
Since this dimension is simple, there are only two outputs. One output is called New Output, which will insert new dimension records if the candidate key identified from the source does not have a match in the dimension. The second output, called Changing Attribute Updates Output, is used when you have a match across the candidate keys and one or more of the changing attributes does not match between the source input and the dimension table. This OLE DB command uses an UPDATE statement to perform the operation.
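The wizard generates that statement for you, so you never have to write it, but a representative sketch of the parameterized UPDATE for this dimension would look something like this (each question mark is mapped to a Data Flow column):

    UPDATE [dbo].[DimSalesTerritory]
    SET [SalesTerritoryRegion] = ?, [SalesTerritoryCountry] = ?, [SalesTerritoryGroup] = ?
    WHERE [SalesTerritoryAlternateKey] = ?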
Loading a Complex Dimension Table

Dimension ETL often requires complicated logic that causes the dimension project tasks to take the longest amount of time for design, development, and testing. This is due to change requirements for various attributes within a dimension, such as tracking history, updating inferred member records, and so on. Furthermore, with larger or more complicated dimensions, the data preparation tasks often require more logic and transformations before the history is even handled in the dimension table itself.

Preparing the Data

To exemplify a more complicated dimension ETL process, in this section you will create a package for the DimEmployee table. This package will deal with some missing data, as identified earlier in your data profiling research:
1. In the SSIS project, create a new package called ETL_DimEmployee.dtsx. Since you've already created project connections for AdventureWorks and AdventureWorksDW, you do not need to add these to the new DimEmployee SSIS package.
2. Create a Data Flow Task and add an OLE DB Source component to the Data Flow.
3. Configure the OLE DB Source component to connect to the AdventureWorks connection and change the data access mode to SQL command. Then enter the following SQL code in the SQL command text window (see Figure 12-18):

SELECT e.NationalIDNumber as EmployeeNationalIDAlternateKey
    , manager.NationalIDNumber as ParentEmployeeNationalIDAlternateKey
    , s.FirstName, s.LastName, s.MiddleName, e.JobTitle as Title
    , e.HireDate, e.BirthDate, e.LoginID, s.EmailAddress
    , s.PhoneNumber as Phone, e.MaritalStatus, e.SalariedFlag
    , e.Gender, e.VacationHours, e.SickLeaveHours, e.CurrentFlag
    , s.CountryRegionName as SalesTerritoryCountry
    , s.TerritoryGroup as SalesTerritoryGroup
    , s.TerritoryName as SalesTerritoryRegion
    , s.StateProvinceName
FROM [Sales].[vSalesPerson] s
INNER JOIN [HumanResources].[Employee] e
    ON e.[BusinessEntityID] = s.[BusinessEntityID]
LEFT OUTER JOIN HumanResources.Employee manager
    ON (e.OrganizationNode.GetAncestor(1)) = manager.[OrganizationNode]
Figure 12-18
4. Click OK to save the changes to the OLE DB Source component.
5. Drag a Lookup Transformation to the Data Flow and connect the blue data path output from the OLE DB Source to the Lookup. Name the Lookup Sales Territory.
6. Double-click the Lookup Transformation to bring up the Lookup editor. On the General page, change the dropdown named "Specify how to handle rows with no matching entries" to "Redirect rows to no match output." Leave the Cache mode as Full cache and the Connection type as OLE DB Connection Manager.
7. On the Connection property page, change the OLE DB connection to AdventureWorksDW and then select [dbo].[DimSalesTerritory] in the dropdown below called "Use a table or a view."
8. On the Columns property page, join the SalesTerritoryCountry, SalesTerritoryGroup, and SalesTerritoryRegion columns between the input columns and lookup columns, as shown in Figure 12-19. In addition, select the checkbox next to SalesTerritoryKey in the lookup columns to return this column to the Data Flow.
At this point, recall from your data profiling that some of the sales territory columns in the source have NULL values. Also recall that TerritoryGroup and TerritoryName have a one-to-many functional relationship. In fact, assume that you have conferred with the business users, and they confirmed that you can look at the StateProvinceName and CountryRegionName, and if another salesperson has the same combination of values, you can use their SalesTerritory information.
Figure 12-19
9. To handle the missing SalesTerritories with the preceding requirements, add a second Lookup Transformation to the Data Flow, and name it Get Missing Territories. Then connect the blue path output of the Sales Territory Lookup to this new Lookup. You will be prompted to choose the Output; select Lookup No Match Output from the dropdown list, as shown in Figure 12-20.

Figure 12-20

10. Edit the new Lookup and configure it to connect to the AdventureWorks connection. Then change the data access mode to SQL command. Enter the following SQL code in the SQL command text window:

SELECT DISTINCT CountryRegionName as SalesTerritoryCountry
    , TerritoryGroup as SalesTerritoryGroup
    , TerritoryName as SalesTerritoryRegion
    , StateProvinceName
FROM [Sales].[vSalesPerson]
WHERE TerritoryName IS NOT NULL
11. On the Columns property page, join the SalesTerritoryCountry and StateProvinceName between the input and lookup columns list and then enable the checkboxes next to SalesTerritoryGroup and SalesTerritoryRegion on the lookup list. Append the word "New" to the OutputAlias, as shown in Figure 12-21.
Figure 12-21
Next, you will recreate the SalesTerritory Lookup from the prior steps to get the SalesTerritoryKey for the records that originally had missing data.
12. Add a new Lookup to the Data Flow named Reacquire SalesTerritory and connect the output of the Get Missing Territories Lookup (use the Lookup Match Output when prompted). On the General tab, edit the Lookup as follows: leave the Cache mode as Full cache and the Connection type as OLE DB Connection Manager.
13. On the Connections page, specify the AdventureWorksDW Connection Manager and change the "Use a table or a view" option to [dbo].[DimSalesTerritory].
14. On the Columns property page (shown in Figure 12-22), match the columns between the input and lookup table, ensuring that you use the "New" Region and Group columns. Match across SalesTerritoryCountry, SalesTerritoryGroupNew, and SalesTerritoryRegionNew. Also return the SalesTerritoryKey and name its Output Alias SalesTerritoryKeyNew.
Figure 12-22
15. Click OK to save your Lookup changes and then drag a Union All Transformation onto the Data Flow. Connect two inputs into the Union All Transformation:
➤➤ The Lookup Match Output from the original Sales Territory Lookup
➤➤ The Lookup Match Output from the Reacquire SalesTerritory Lookup (from steps 12–14)
16. Edit the Union All Transformation as follows: locate the SalesTerritoryKey column and change the value in the dropdown for the input coming from the second lookup to use the SalesTerritoryKeyNew column. This is shown in Figure 12-23.
Figure 12-23
17. Click OK to save your changes to the Union All. At this point, your Data Flow should look similar to the one pictured in Figure 12-24.
These steps described how to handle one data preparation task. When you begin to prepare data for your dimension, chances are good you will need to perform several steps to get it ready for the dimension data changes. You can use many of the other SSIS transformations for this purpose, described in the rest of the book. A couple of examples include using the Derived Column to convert NULLs to Unknowns and the Fuzzy Lookup and Fuzzy Grouping to cleanse dirty data. You can also use the Data Quality Services of SQL Server 2014 to help clean data. A brief overview of DQS is included in Chapter 10.
Figure 12-24
Handling Complicated Dimension Changes with the SCD Transformation

Now you are ready to use the SCD Wizard again, but for the DimEmployee table, you need to handle different change types and inferred members:
1. Continue development by adding a Slowly Changing Dimension Transformation to the Data Flow and connecting the data path output of the Union All to the SCD Transformation. Then double-click the SCD Transformation to launch the SCD Wizard.
2. On the Select a Dimension Table and Keys page, choose the AdventureWorksDW Connection Manager and the [dbo].[DimEmployee] table.
a. In this example, not all the columns have been extracted from the source, and other destination columns are related to the dimension change management, which are identified in step 3. Therefore, not all the columns will automatically be matched between the input columns and the dimension columns.
b. Find the EmployeeNationalIDAlternateKey and change the Key Type to Business Key.
c. Select Next.
3. On the Slowly Changing Dimension Columns page, make the following Change Type designations, as shown in Figure 12-25:
Figure 12-25
a. Fixed Attributes: BirthDate, HireDate
b. Changing Attributes: CurrentFlag, EmailAddress, FirstName, Gender, LastName, LoginID, MaritalStatus, MiddleName, Phone, SickLeaveHours, Title, VacationHours
c. Historical Attributes: ParentEmployeeNationalIDAlternateKey, SalariedFlag, SalesTerritoryKey
4. On the Fixed and Changing Attribute Options page, uncheck the checkbox under the Fixed attributes label. The result of this is that when a value changes for a column identified as a fixed attribute, the change will be ignored, and the old value in the dimension will not be updated. If you had checked this box, the package would fail.
5. On the same page, check the box for Changing attributes. As described earlier, this ensures that all the records (current and historical) will be updated when a change happens to a changing attribute.
6. You will now be prompted to configure the Historical Attribute Options, as shown in Figure 12-26. The SCD Transformation needs to know how to identify the current record when a single business key has multiple values (recall that when a historical attribute changes, a new copy of the record is created). Two options are available. One, a single column is used to identify the record. The better option is to use a start and end date. The DimEmployee table has a StartDate and EndDate column; therefore, use the second configuration option button and set the "Start date column" to StartDate, and the "End date column" to EndDate. Finally, set the "Variable to set date values" dropdown to System::StartTime.

Figure 12-26
7. Assume for this example that you may have missing dimension records when processing the fact table. In this case, a new inferred member is added to the dimension. Therefore, on the Inferred Dimension Members page, leave the "Enable inferred member support" option checked. The SCD Transformation needs to know when a dimension member is an inferred member. The best option is to have a column that identifies the record as inferred; however, the DimEmployee table does not have a column for this purpose. Therefore, leave the "All columns with a change type are null" option selected.
8. This concludes the wizard settings. Click Finish so that the SCD Transformation can build the downstream transformations needed based on the configurations. Your Data Flow will now look similar to the one shown in Figure 12-27.
Figure 12-27
As you have seen, when dealing with historical attribute changes and inferred members, the output of the SCD Transformation is more complicated with updates, unions, and derived calculations. One of the benefits of the SCD Wizard is rapid development of dimension ETL. Handling changing attributes, new members, historical attributes, inferred members, and fixed attributes is a complicated process that usually takes hours to code, but with the SCD Wizard, you can accomplish this in minutes. Before looking at some drawbacks and alternatives to the SCD Transformation, consider the outputs (refer to Figure 12-27) and how they work:
➤➤ Changing Attribute Updates Output: The changing attribute output records are records for which at least one of the attributes that was identified as a changing attribute goes through a change. This update statement is handled by an OLE DB Command Transformation with the code shown here:

UPDATE [dbo].[DimEmployee]
SET [CurrentFlag] = ?, [EmailAddress] = ?, [FirstName] = ?, [Gender] = ?,
    [LastName] = ?, [LoginID] = ?, [MaritalStatus] = ?, [MiddleName] = ?,
    [Phone] = ?, [SickLeaveHours] = ?, [Title] = ?, [VacationHours] = ?
WHERE [EmployeeNationalIDAlternateKey] = ?
The question marks (?) in the code are mapped to input columns sent down from the SCD Transformation. Note that the last question mark is mapped to the business key, which ensures that all the records are updated. If you had unchecked the changing attribute checkbox in step 4 of the preceding list, then the current identifier would have been included and only the latest record would have changed.
➤➤ New Output: New output records are simply new members that are added to the dimension. If the business key doesn't exist in the dimension table, then the SCD Transformation will send the row out this output. Eventually these rows are inserted with the Insert Destination (refer to Figure 12-27), which is an OLE DB Destination. The Derived Column 1 Transformation shown in Figure 12-28 is there to add the new StartDate of the record, which is required for the metadata management.
Figure 12-28
This dimension is unique, because it has both a StartDate column and a Status column (most dimension tables that track history have either a Status column that indicates whether the record is current or datetime columns that indicate the start and end of the record's current status, but usually not both). The values for the Status column are Current and NULL, so you should add a second derived column to this transformation called Status and force a "Current" value into it. You also need to include it in the destination mapping.
➤➤ Historical Attribute Inserts Output: The historical output is for any attributes that you marked as historical and that underwent a change. Therefore, you need to add a new row to the dimension table. Handling historical changes requires two general steps:
    ➤➤ Update the old record with the EndDate (and NULL Status). This is done through a Derived Column transformation that defines the EndDate as the System::StartTime variable and an OLE DB command that runs an update statement with the following code:

    UPDATE [dbo].[DimEmployee]
    SET [EndDate] = ?, [Status] = NULL
    WHERE [EmployeeNationalIDAlternateKey] = ?
        AND [EndDate] IS NULL

    This update statement was altered to also set the Status column to NULL because of the requirement mentioned in the new output. Also, note that [EndDate] IS NULL is included in the WHERE clause because it identifies the latest version of the record.
    ➤➤ Insert the new version of the dimension record. This is handled by a Union All transformation to the new outputs. Because both require inserts, this can be handled in one destination. Also note that the Derived Column shown earlier in Figure 12-28 is applicable to the historical output.
➤➤ Inferred Member Updates Output: Handling inferred members is done through two parts of the ETL. First, during the fact load, when a dimension member is missing, an inferred member is inserted. Second, during the dimension load, if one of the missing inferred members shows up in the dimension source, then the attributes need to be updated in the dimension table. The following update statement is used in the OLE DB Command 1 transformation:

    UPDATE [dbo].[DimEmployee]
    SET [BirthDate] = ?, [CurrentFlag] = ?, [EmailAddress] = ?, [FirstName] = ?,
        [Gender] = ?, [HireDate] = ?, [LastName] = ?, [LoginID] = ?,
        [MaritalStatus] = ?, [MiddleName] = ?, [ParentEmployeeNationalIDAlternateKey] = ?,
        [Phone] = ?, [SalariedFlag] = ?, [SalesTerritoryKey] = ?, [SickLeaveHours] = ?,
        [Title] = ?, [VacationHours] = ?
    WHERE [EmployeeNationalIDAlternateKey] = ?

    What is the difference between this update statement and the update statement used for the changing attribute output? This one includes updates of the changing attributes, the historical attributes, and the fixed attributes. In other words, because you are updating this as an inferred member, all the attributes are updated, not just the changing attributes.
➤➤ Fixed Attribute Output (not used by default): Although this is not used by default by the SCD Wizard, it is an additional output that can be used in your Data Flow. For example, you may want to audit records whose fixed attribute has changed. To use it, you can simply take the blue output path from the SCD Transformation and drag it to a Destination component where your fixed attribute records are stored for review. You need to choose the Fixed Attribute Output when prompted by adding the new path.
➤➤ Unchanged Output (not used by default): This is another output not used by the SCD Transformation by default. As your dimensions are being processed, chances are good that most of your dimension records will not undergo any changes. Therefore, the records do not need to be sent out for any of the prior outputs. However, you may wish to audit the number of records that are unchanged. You can do this by adding a Row Count Transformation and then dragging a new blue data path from the SCD Transformation onto the Row Count Transformation and choosing the Unchanged Output when prompted by adding the new path.

With SSIS in SQL Server 2014, you can also report on the Data Flow performance and statistics when a package is deployed to the SSIS server. Chapter 16 and Chapter 22 review the Data Flow reporting.
Considerations and Alternatives to the SCD Transformation

As you have seen, the SCD Transformation boasts powerful, rapid development, and it is a great tool to understand SCD and ETL concepts. It also helps to simplify and standardize your dimension ETL processing. However, the SCD Transformation is not always the right choice for handling your dimension ETL. Some of the drawbacks include the following:
➤➤ For each row in the input, a new lookup is sent to the relational engine to determine whether changes have occurred. In other words, the dimension table is not cached in memory. That is expensive! If you have tens of thousands of dimension source records or more, the performance of this approach can be a limiting feature of the SCD Transformation.
➤➤ For each row in the source that needs to be updated, a new update statement is sent to the dimension table (and updates are used by the changing output, historical output, and inferred member output). If a lot of updates are happening every time your dimension package runs, this will also cause your package to run slowly.
➤➤ The Insert Destination is not set to fast load. This is because deadlocks can occur between the updates and the inserts. When the insert runs, each row is added one at a time, which can be very expensive.
➤➤ The SCD Transformation works well for historical, changing, and fixed dimension attributes, and, as you saw, changes can be made to the downstream transformations. However, if you open the SCD Wizard again and make a change to any part of it, you will automatically lose your customizations.
Consider some of these approaches to optimize your package that contains the output from the SCD Wizard:
➤➤ Create an index on your dimension table for the business key, followed by the current row identifier (such as the EndDate). If a clustered index does not already exist, create this index as a clustered index, which avoids an additional lookup in the query plan to retrieve the underlying row. This will help the lookup that happens in the SCD Transformation, as well as all of the updates. (A T-SQL sketch of this index, along with the set-based update described below, appears after this list.)
➤➤ The row-by-row updates can be changed to set-based updates. To do this, you need to remove the OLE DB Command Transformation and add a Destination component in its place to stage the records to a temporary table. Then, in the Control Flow, add an Execute SQL Task to perform the set-based update after the Data Flow is completed (see the sketch following this list).
➤➤ If you remove all the OLE DB Command Transformations, then you can also change the Insert Destination to use fast load and essentially bulk insert the data, rather than perform per-row inserts.
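To make the index and set-based update suggestions concrete, here is a T-SQL sketch of both. The staging table name (stg.DimEmployeeChanges) and the column list are assumptions for illustration only; DimEmployee in the sample database already has a clustered primary key on its surrogate key, so a nonclustered index is shown:

    -- 1) Supporting index on the business key plus the current-row indicator
    CREATE NONCLUSTERED INDEX IX_DimEmployee_BusinessKey_EndDate
        ON dbo.DimEmployee (EmployeeNationalIDAlternateKey, EndDate);

    -- 2) Set-based update driven from a staging table loaded by the Data Flow
    --    (replaces the row-by-row OLE DB Command); run it from an Execute SQL Task
    UPDATE de
    SET de.EmailAddress  = s.EmailAddress,
        de.Phone         = s.Phone,
        de.VacationHours = s.VacationHours
    FROM dbo.DimEmployee AS de
    INNER JOIN stg.DimEmployeeChanges AS s
        ON s.EmployeeNationalIDAlternateKey = de.EmployeeNationalIDAlternateKey;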
Overall, these alterations may provide you with enough performance improvements that you can continue to use the SCD Transformation effectively for higher data volumes. However, if you still need an alternate approach, try building the same SCD process through the use of other built-in SSIS transformations such as these:
➤➤ The Lookup Transformation and the Merge Join Transformation can be used to cache the dimension table data. This will greatly improve performance because only a single select statement will run against the dimension table, rather than potentially thousands.
➤➤ The Derived Column Transformation and the Script Component can be used to evaluate which columns have changed, and then the rows can be sent out to multiple outputs. Essentially, this would mimic the change evaluation engine inside of the SCD Transformation.
➤➤ After the data is cached and evaluated, you can use the same SCD output structure to handle the changes and inserts, and then you can use set-based updates for better performance.
Fact Table Loading

Fact table loading is often simpler than dimension ETL, because a fact table usually involves just inserts and, occasionally, updates. When dealing with large volumes, you may need to handle partition inserts and deal with updates in a different way. In general, fact table loading involves a few common tasks:
➤➤ Preparing your source data to be at the same granularity as your fact table, including having the dimension business keys and measures in the source data
➤➤ Acquiring the dimension surrogate keys for any related dimension
➤➤ Identifying new records for the fact table (and potentially updates)
The Sales Quota fact table is relatively straightforward and will give you a good start toward developing your fact table ETL:
1. In your SSIS project for this chapter, create a new package and rename it ETL_FactSalesQuota.dtsx.
2. Just like the other packages you developed in this chapter, you will use two Connection Managers, one for AdventureWorks and the other for AdventureWorksDW. If you haven't already created project-level Connection Managers for these in Solution Explorer, add them before continuing.
3. Create a new Data Flow Task and add an OLE DB Source component. Name it Sales Quota Source. Configure the OLE DB Source component to connect to the AdventureWorks Connection Manager, and change the data access mode to SQL command, as shown in Figure 12-29. Add the following code to the SQL command text window:

SELECT QuotaDate, SalesQuota,
    NationalIDNumber as EmployeeNationalIDAlternateKey
FROM Sales.SalesPersonQuotaHistory
INNER JOIN HumanResources.Employee
    ON SalesPersonQuotaHistory.BusinessEntityID = Employee.BusinessEntityID
Figure 12-29
4. To acquire the surrogate keys from the dimension tables, you will use a Lookup Transformation. Drag a Lookup Transformation onto the Data Flow and connect the blue data path output of the OLE DB Source component onto the Lookup Transformation. Rename the Lookup Employee Key.
5. Double-click the Employee Key Transformation to bring up the Lookup Editor. On the General property page, leave the Cache mode set to Full cache and the Connection type set to OLE DB Connection Manager.
6. On the Connection property page, change the OLE DB Connection Manager dropdown to AdventureWorksDW and enter the following code:
SELECT EmployeeKey, EmployeeNationalIDAlternateKey
FROM DimEmployee
WHERE EndDate IS NULL
Including the EndDate IS NULL filter ensures that the most current dimension record surrogate key is acquired in the Lookup.
7. Change to the Columns property page and map the EmployeeNationalIDAlternateKey from the input columns to the lookup columns. Then select the checkbox next to the EmployeeKey of the Lookup, as shown in Figure 12-30.
Figure 12-30
8. Click OK to save your changes to the Lookup Transformation.
9. For the DateKey, a Lookup is not needed because the DateKey is a "smart key," meaning the key is an integer value based on the date itself in YYYYMMDD format. Therefore, you will use a Derived Column Transformation to calculate the DateKey for the fact table. Add a Derived Column Transformation to the Data Flow and connect the blue data path output of the Employee Lookup to the Derived Column Transformation. When prompted, choose the Lookup Match Output from the Lookup Transformation. Name the Derived Column Date Keys.
10. Double-click the Derived Column Transformation and add the following three new derived columns and their associated expressions, as shown in Figure 12-31 (an equivalent T-SQL spot-check appears after the expression list):
Figure 12-31
➤➤ DateKey: YEAR([QuotaDate]) * 10000 + MONTH([QuotaDate]) * 100 + DAY([QuotaDate])
➤➤ CalendarYear: (DT_I2) YEAR([QuotaDate])
➤➤ CalendarQuarter: (DT_UI1) DATEPART("q", [QuotaDate])
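If it helps to see the smart key logic outside of SSIS, the following T-SQL (a minimal sketch for illustration only, not part of the package) produces the same YYYYMMDD integer and calendar values for a sample date:

-- Illustrative only: the same smart key calculation expressed in T-SQL
SELECT
    YEAR(d) * 10000 + MONTH(d) * 100 + DAY(d) AS DateKey,   -- e.g. 20140322
    CAST(YEAR(d) AS SMALLINT)                 AS CalendarYear,
    CAST(DATEPART(QUARTER, d) AS TINYINT)     AS CalendarQuarter
FROM (SELECT CAST('2014-03-22' AS DATE) AS d) AS SampleDate;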
At this point in your Data Flow, the data is ready for the fact table. If your data has already been incrementally extracted, so that you are getting only new rows, you can use an OLE DB Destination to insert it right into the fact table. Assume for this tutorial that you need to identify which records are new and which records are updates, and handle them appropriately. The rest of the steps accomplish fact updates and inserts. A Merge Join will be used to match source input records to the actual fact table records, but before you add the Merge Join, you need to add a Sort Transformation to the source records (a requirement of the Merge Join) and extract the fact data into the Data Flow.
11. Add a Sort Transformation to the Data Flow and connect the blue data path output from the Derived Column Transformation to the Sort Transformation. Double-click the Sort Transformation to bring up the Sort Transformation Editor and sort the input data by the following columns: EmployeeKey, CalendarYear, and CalendarQuarter, as shown in Figure 12-32. The CalendarYear and CalendarQuarter are important columns for this fact table because they identify the date grain, the level of detail at which the fact table
is associated with the date dimension. As a general rule, the Sort transformation is a very powerful transformation as long as it is working with manageable data sizes, in the thousands and millions, but not the tens or hundreds of millions (if you have a lot of memory, you can scale up as well). An alternate to the Sort is described in steps 12–14, as well as in Chapters 7 and 16. Figure 12-33 shows what your Data Flow should look like at this point.
Figure 12-32
Figure 12-33
12. Add a new OLE DB Source component to the Data Flow and name it Sales Quota Fact. Configure the OLE DB Source to use the AdventureWorksDW Connection Manager and use the following SQL command:
SELECT EmployeeKey, CalendarYear,
    CalendarQuarter, SalesAmountQuota
FROM dbo.FactSalesQuota
ORDER BY 1, 2, 3
13. Because you are using an ORDER BY statement in the query (sorting by the first three columns in order), you need to configure the OLE DB Source component to know that the data is entering the Data Flow sorted. First, click OK to save the changes to the OLE DB Source, and then right-click the Sales Quota Fact component and choose Show Advanced Editor.
14. On the Input and Output Properties tab, click the OLE DB Source Output object in the left window; in the right window, change the IsSorted property to True, as shown in Figure 12-34.
Figure 12-34
15. Expand the OLE DB Source Output on the left and then expand the Output Columns folder. Make the following changes to the Output Column properties:
a. Select the EmployeeKey column and change its SortKeyPosition to 1, as shown in Figure 12-35. (If the sort order were descending, you would enter a -1 for the SortKeyPosition.)
Figure 12-35
b. Select the CalendarYear column and change its SortKeyPosition to 2.
c. Select the CalendarQuarter column and change its SortKeyPosition to 3.
d. Click OK to save the changes to the advanced properties.
16. Add a Merge Join Transformation to the Data Flow. First, connect the blue data path output from the Sort Transformation onto the Merge Join. When prompted, choose the input option named Merge Join Left Input. Then connect the blue data path output from the Sales Quota Fact Source to the Merge Join.
17. Double-click the Merge Join Transformation to open its editor. You will see that the EmployeeKey, CalendarYear, and CalendarQuarter columns are already joined between inputs. Make the following changes, as shown in Figure 12-36:
Figure 12-36
a. Change the Join type dropdown to a Left outer join.
b. Check the SalesQuota, EmployeeKey, DateKey, CalendarYear, CalendarQuarter, and QuotaDate columns from the Sort input list, and then change the Output Alias for QuotaDate to Date.
c. Check the SalesAmountQuota from the Sales Quota Fact column list and then change the Output Alias for this column to SalesAmountQuota_Fact.
18. Click OK to save your Merge Join configuration.
19. Your next objective is to identify which records are new quotas and which are changed sales quotas. A Conditional Split will be used to accomplish this task; therefore, drag a Conditional Split Transformation onto the Data Flow and connect the blue data path output from the Merge Join Transformation to the Conditional Split. Rename the Conditional Split to Identify Inserts and Updates.
20. Double-click the Conditional Split to open the editor and make the following changes, as shown in Figure 12-37:
Figure 12-37
a. Add a new condition named New Fact Records with the following condition: ISNULL([SalesAmountQuota_Fact]). If the measure from the fact is null, it indicates that the fact record does not exist for the employee and date combination.
b. Add a second condition named Fact Updates with the following condition: [SalesQuota] != [SalesAmountQuota_Fact].
c. Change the default output name to No Changes.
21. Click OK to save the changes to the Conditional Split.
22. Add an OLE DB Destination component to the Data Flow and name it Fact Inserts. Drag the blue data path output from the Conditional Split Transformation to the OLE DB Destination. When prompted to choose an output from the Conditional Split, choose the New Fact Records output.
23. Double-click the Fact Inserts Destination and change the OLE DB Connection Manager to AdventureWorksDW. In the "Name of the table or view" dropdown, choose the [dbo].[FactSalesQuota] table.
24. Switch to the Mappings property page and match up the SalesQuota column from the Available Input Columns list to the SalesAmountQuota column in the Available Destination Columns list, as shown in Figure 12-38. The other columns (EmployeeKey, DateKey, CalendarYear, and CalendarQuarter) should already match. Click OK to save your changes to the OLE DB Destination.
Figure 12-38
25. To handle the fact table updates, drag an OLE DB Command Transformation to the Data Flow and rename it Fact Updates. Drag the blue data path output from the Conditional Split onto the Fact Updates Transformation, and when prompted, choose the Fact Updates output from the Conditional Split.
26. Double-click the OLE DB Command Transformation and change the Connection Manager dropdown to AdventureWorksDW. On the Component Properties tab, add the following code to the SQLCommand property (make sure you click the ellipsis button to open an editor window):
UPDATE dbo.FactSalesQuota
SET SalesAmountQuota = ?
WHERE EmployeeKey = ?
    AND CalendarYear = ?
    AND CalendarQuarter = ?
27. Switch to the Column Mappings tab and map SalesQuota to Param_0, EmployeeKey to Param_1, CalendarYear to Param_2, and CalendarQuarter to Param_3, as shown in Figure 12-39.
Figure 12-39
28. Click OK to save your changes to the OLE DB Command update. Your fact table ETL for the FactSalesQuota is complete and should look similar to Figure 12-40.
Figure 12-40
If you test this package out, you will find that the inserts fail. This is because the date dimension is populated only through 2006, but several 2007 and 2008 dates are needed for the fact table. For the purposes of this exercise, you can simply drop the foreign key constraint on the table, which will enable your FactSalesQuota package to execute successfully. In reality, as part of your ETL, you would create a recurring script that populates the DimDate table with new dates:
ALTER TABLE [dbo].[FactSalesQuota]
DROP CONSTRAINT [FK_FactSalesQuota_DimDate]
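Such a date-population script is not included in the sample database. A minimal sketch of the idea is shown below; it assumes a simplified DimDate that contains only the columns listed (the real AdventureWorksDW DimDate has many additional NOT NULL attribute columns that a production script would also have to populate):

-- Sketch only: add any quota dates that are missing from a simplified DimDate
INSERT INTO dbo.DimDate (DateKey, FullDateAlternateKey, CalendarYear, CalendarQuarter)
SELECT DISTINCT
    YEAR(q.QuotaDate) * 10000 + MONTH(q.QuotaDate) * 100 + DAY(q.QuotaDate),
    CAST(q.QuotaDate AS DATE),
    YEAR(q.QuotaDate),
    DATEPART(QUARTER, q.QuotaDate)
FROM AdventureWorks.Sales.SalesPersonQuotaHistory AS q
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimDate AS d
                  WHERE d.FullDateAlternateKey = CAST(q.QuotaDate AS DATE));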
Here are some final considerations for fact table ETL:
➤➤ A Merge Join was used in this case to help identify which records were updates or inserts, based on matching the source to the fact table. Refer to Chapter 7 to see other alternatives for associating the source to the fact table.
➤➤ For the inserts and updates, you may want to leverage the relational engine to handle either the insert or the update at the same time. T-SQL in SQL Server supports a MERGE statement that will perform either an insert or an update depending on whether the record exists. See Chapter 13 for more about how to use this feature; a hedged sketch of the pattern appears after this list.
➤➤ Another alternative to the OLE DB Command fact table updates is to use a set-based update. The OLE DB Command works well and is easy for small data volumes; however, your situation may not allow per-row updates. Consider staging the updates to a table and then performing a set-based update (through a multirow SQL UPDATE statement) by joining the staging table to the fact table and updating the sales quota that way (see the sketch after this list).
➤➤ Inserts are another area to consider for improvement. Fact tables often contain millions of rows, so you should look for ways to optimize the inserts. Consider dropping the indexes, loading the fact table, and then recreating the indexes. This could be much faster. See Chapter 16 for ideas on how to tune the inserts.
➤➤ If you have partitions in place, you can insert the data right into the partitioned fact table; however, when you are dealing with high volumes, the relational engine overhead may inhibit performance. In these situations, consider switching the current partition out in order to load it separately; then you can switch it back into the partitioned table.
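To make the last two points concrete, here is a hedged sketch of both patterns. The staging table dbo.FactSalesQuota_Stage is hypothetical (assumed to be loaded by the Data Flow with the same columns used by the fact table); adjust the column lists to whatever your version of FactSalesQuota actually defines:

-- Option 1: let MERGE decide between insert and update in a single pass
MERGE dbo.FactSalesQuota AS tgt
USING dbo.FactSalesQuota_Stage AS src   -- hypothetical staging table
    ON  tgt.EmployeeKey     = src.EmployeeKey
    AND tgt.CalendarYear    = src.CalendarYear
    AND tgt.CalendarQuarter = src.CalendarQuarter
WHEN MATCHED AND tgt.SalesAmountQuota <> src.SalesAmountQuota THEN
    UPDATE SET SalesAmountQuota = src.SalesAmountQuota
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EmployeeKey, DateKey, CalendarYear, CalendarQuarter, SalesAmountQuota)
    VALUES (src.EmployeeKey, src.DateKey, src.CalendarYear, src.CalendarQuarter, src.SalesAmountQuota);

-- Option 2: a set-based UPDATE from the staging table, replacing the row-by-row OLE DB Command
UPDATE f
SET    f.SalesAmountQuota = s.SalesAmountQuota
FROM   dbo.FactSalesQuota AS f
INNER JOIN dbo.FactSalesQuota_Stage AS s
    ON  f.EmployeeKey     = s.EmployeeKey
    AND f.CalendarYear    = s.CalendarYear
    AND f.CalendarQuarter = s.CalendarQuarter
WHERE  f.SalesAmountQuota <> s.SalesAmountQuota;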
Inferred members are another challenge for fact table ETL. How do you handle a missing dimension key? One approach includes scanning the fact table source for missing keys and adding the inferred member dimension records before the fact table ETL runs (a minimal sketch of this approach follows). An alternative is to redirect the missing row when the Lookup doesn't have a match, add the dimension key during the ETL, and then bring the row back into the ETL through a Union All. One final approach is to handle the inferred members after the fact table ETL finishes. You would need to stage the records that have missing keys, add the inferred members, and then reprocess the staged records into the fact table.
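A hedged sketch of the first approach is shown below. It assumes a hypothetical staging copy of the source rows (staging.SalesQuotaSource) that still carries the business key, and a simplified DimEmployee that requires only the columns shown (the real table has additional required columns you would also need to default):

-- Sketch only: add placeholder (inferred) members for any business keys
-- in the source that do not yet exist in the dimension
INSERT INTO dbo.DimEmployee (EmployeeNationalIDAlternateKey, FirstName, LastName, CurrentFlag)
SELECT DISTINCT s.EmployeeNationalIDAlternateKey, 'Unknown', 'Unknown', 1
FROM staging.SalesQuotaSource AS s        -- hypothetical staging table of source rows
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimEmployee AS d
                  WHERE d.EmployeeNationalIDAlternateKey = s.EmployeeNationalIDAlternateKey);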
As you can see, fact tables have some unique challenges, but overall they can be handled effectively with SSIS. Now that you have loaded both your dimensions and fact tables, the next step is to process your SSAS cubes, if SSAS is part of your data warehouse or business intelligence project.

SSAS Processing

Processing SSAS objects in SSIS can be as easy as using the Analysis Services Processing Task. However, if your SSAS cubes require adding or processing specific partitions or changing the names of cubes or servers, then you will need to consider other approaches. In fact, many, if not most, solutions require using other processing methods. SSAS in SQL Server 2014 has two types of models, multidimensional and tabular. Both of these models require processing. For multidimensional models, you are processing dimensions and cube partitions. For tabular models, you are processing tables and partitions. However, both models have similar processing options. The primary ways to process SSAS models through SSIS include the following:
➤➤ Analysis Services Processing Task: Can be defined with a unique list of dimensions, tables, and partitions to process. However, this task does not allow modification of the objects through expressions or configurations.
➤➤ Analysis Services Execute DDL Task: Can process objects through XMLA scripts. The advantage of this task is the capability to make the script dynamic by changing the script contents before it is executed.
➤➤ Script Task: Can use the API for SSAS, which is called AMO (Analysis Management Objects). With AMO, you can create objects, copy objects, process objects, and so on.
➤➤ Execute Process Task: Can run ascmd.exe, which is the SSAS command-line tool that can run XMLA, MDX, and DMX queries. The advantage of the ascmd.exe tool is the capability to pass in parameters to a script that is run.
To demonstrate the use of some of these approaches, this next tutorial demonstrates processing a multidimensional model, using the Analysis Services Processing Task to process the dimensions related to the sales quotas and then the Analysis Services Execute DDL Task to handle processing of the partitions. Before beginning these steps, create a new partition in SSAS for the Sales Targets measure group called Sales_Quotas_2014. This is for demonstration purposes. An XMLA script called Sales_Quotas_2014.xmla has been created for this chapter and is included in the downloadable content at www.wrox.com/go/prossis2014.
1. In your SSIS project for this chapter, create a new package and rename it SSAS_SalesTargets.dtsx.
2. Since this is the only package that will be using the SSAS connection, you will create a package connection rather than a project connection. Right-click in the Connection Managers window and choose New Analysis Services Connection. In the Add Analysis Services Connection Manager window, click the Edit button to bring up the connection properties, as shown in Figure 12-41.
Figure 12-41
a. Specify your server in the "Server or file name" text box (such as localhost if you are running SSAS on the same machine).
b. Change the "Log on to the server" option to Use Windows NT Integrated Security.
c. In the Initial catalog dropdown box, choose the Adventure Works SSAS database, which by default is named Adventure Works DW Multidimensional. Please remember that you will need to download and install the sample SSAS cube database, which is available from www.wrox.com/go/prossis2014.
d. Click OK to save your changes to the Connection Manager and then click OK in the Add Analysis Services Connection Manager window.
e. Finally, rename the connection in the SSIS Connection Managers window to AdventureWorksAS.
3. To create the dimension processing, drag an Analysis Services Processing Task from the SSIS Toolbox onto the Control Flow and rename the task Process Dimensions.
4. Double-click the Process Dimensions Task to bring up the editor and navigate to the Processing Settings property page.
a. Confirm that the Analysis Services Connection Manager dropdown is set to AdventureWorksAS.
b. Click the Add button to open the Add Analysis Services Object window. As shown in Figure 12-42, check the Date, Employee, and Sales Territory dimensions and then click OK to save your changes.
c. For each dimension, change the Process Options dropdown to Process Default, which will either perform a dimension update or, if the dimension has never been processed, fully process the dimension.
Figure 12-42
d. Click the Change Settings button, and in the Change Settings editor, click the Parallel selection option under the Processing Order properties. Click OK to save your settings.
e. Click OK to save your changes to the Analysis Services Processing Task.
5. Before continuing, create an SSIS package variable that designates the XMLA partition for processing. Name the SSIS variable Sales_Quota_Partition and define the variable with a String data type and a value of "Fact Sales Quota."
6. Drag an Analysis Services Execute DDL Task onto the Control Flow and drag the green precedence constraint from the Process Dimensions Task onto the Analysis Services Execute DDL Task. Rename the Analysis Services Execute DDL Task Process Partition.
a. Edit the Process Partition Task and navigate to the DDL property page.
b. Change the Connection property to AdventureWorksAS and leave the SourceType as Direct Input, as shown in Figure 12-43.
Figure 12-43
c. Change to the Expressions property page of the editor and click the Expressions property in the right window. Click the ellipsis on the right side of the text box, which will open the Property Expressions Editor. Choose Source from the dropdown, as shown in Figure 12-44.
d. Now you need to add the XMLA code that will execute when the package is run. The expression will dynamically update the code when this task executes. Click the ellipsis on the right side of the Source property (refer to Figure 12-44) to open Expression Builder.
Figure 12-44
e. Enter the expression in the Expression text box, as shown in Figure 12-45. The expression is a quoted string that builds an XMLA Process command, using ProcessFull as the processing type and UseExisting for writeback table creation, and concatenating the Sales_Quota_Partition variable into the PartitionID element.
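The expression looks similar to the following sketch. The DatabaseID, CubeID, and MeasureGroupID values shown are assumptions based on the sample Adventure Works SSAS database; verify them against the XMLA that SSMS generates for your own partition (see the next step). Because this is an SSIS expression, the double quotes inside the XMLA must be escaped with a backslash:

"<Batch xmlns=\"http://schemas.microsoft.com/analysisservices/2003/engine\">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>Adventure Works DW Multidimensional</DatabaseID>
        <CubeID>Adventure Works</CubeID>
        <MeasureGroupID>Fact Sales Quota</MeasureGroupID>
        <PartitionID>" + @[User::Sales_Quota_Partition] + "</PartitionID>
      </Object>
      <Type>ProcessFull</Type>
      <WriteBackTableCreation>UseExisting</WriteBackTableCreation>
    </Process>
  </Parallel>
</Batch>"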
Figure 12-45
f. This code generates the XMLA and includes the Sales_Quota_Partition variable. The good news is that you don't need to know XMLA; you can use SSMS to generate it for you.
g. To automatically generate the XMLA code that will process a Sales Quota partition, open SSMS and connect to SSAS. Expand the Databases folder, then the Adventure Works SSAS database, then the Cubes folder; then expand the Adventure Works cube; and finally expand the Sales Targets measure group. Right-click the Sales Quota 2014 partition and choose Process, as shown in Figure 12-46. The processing tool in SSMS looks very similar to the SSAS Processing Task in SSIS, except that the SSMS processing tool has a Script button near the title bar. Click the Script button.
7. Click OK in the open windows to save your changes. The purpose of creating the script and saving the file is to illustrate that you can build your own processing with XMLA and either execute the code in SSMS (by clicking the Execute button) or execute the file through the Analysis Services Execute DDL Task.
Figure 12-46
The SSIS package that you have just developed should look similar to Figure 12-47.
If you were to fully work out the development of this package, you would likely have a couple more tasks involved in the process. First, the current partition is entered in the variable, but you haven't yet put any code in place to update this variable when the package is run. For this task, you could use an Execute SQL Task to pull the current partition name from a configuration table (or derive it from the system date) into the variable, or you could use a Script Task to populate the variable.
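For example, an Execute SQL Task could run a query like the following and map the single-row result to the Sales_Quota_Partition variable. The partition naming convention shown here is an assumption for illustration; use whatever convention your partitions actually follow:

-- Sketch only: derive the current partition name from the system date
SELECT 'Sales_Quotas_' + CAST(YEAR(GETDATE()) AS VARCHAR(4)) AS CurrentPartitionName;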
Figure 12-47
Second, if you have a larger solution with many partitions at the weekly or monthly grain, you would need a task that creates a new partition, as needed, before the partition is processed. This could be an Analysis Services Execute DDL Task similar to the one you just created for the processing task, or you could use a Script Task and leverage AMO to create or copy an existing partition to a new partition. As you have seen, processing SSAS objects in SSIS can require a few simple steps or several more complex steps, depending on the processing needs of your SSAS solution.
Using a Master ETL Package

Putting it all together is perhaps the easiest part of the ETL process because it involves simply using SSIS to coordinate the execution of the packages in the required order.
The best practice to do this is to use a master package that executes the child packages, leveraging the Execute Package Task. The determination of precedence is a matter of understanding the overall ETL and the primary-to-foreign key relationships in the tables. The following steps assume that you are building your solution using the project deployment model. With the project deployment model, you do not need to create connections to the other packages in the project. If you are instead deploying your packages to the file system, you need to create and configure File Connection Managers for each child package, as documented in SQL Server Books Online.
1. Create a new package in your project called Master_ETL.dtsx.
2. Drag an Execute Package Task from the SSIS Toolbox into the Control Flow.
3. Double-click the Execute Package Task to open the task editor.
4. On the Package property page, leave the ReferenceType property set to Project Reference. For the PackageNameFromProjectReference property, choose the ETL_DimSalesTerritory.dtsx package.
Your Execute Package task will look like the one pictured in Figure 12-48.
Figure 12-48
The ETL packages for the dimension tables are executed, followed by the fact table ETL and concluding with the cube processing. The master package for the examples in this chapter is shown in Figure 12-49.
Figure 12-49
The related packages are grouped with Sequence containers to help visualize the processing order. In this case, the Dim Sales Territory package needs to be run before the Dim Employee package because of the foreign key reference in the DimEmployee table. Larger solutions will have multiple dimension and fact packages.
Summary

Moving from start to finish in a data warehouse ETL effort requires a lot of planning and research. This research should include both data profiling and interviews with the business users to understand how they will be using the dimension attributes, so that you can identify the different attribute change types. As you develop your dimension and fact packages, you will need to carefully consider how to most efficiently perform inserts and updates, paying particular attention to data changes and missing members. Finally, don't leave your SSAS processing packages for the last minute. You may be surprised at the time it can take to develop a flexible package that can dynamically handle selective partition processing and creation. In the next chapter, you will learn about the pros and cons of using the relational engine instead of SSIS.
13
Using the Relational Engine

What's in This Chapter?
➤➤ Using the relational engine to facilitate SSIS data extraction
➤➤ Loading data with SSIS and the relational engine
➤➤ Understanding when to use the relational engine versus SSIS
Wrox.com Code Downloads for this Chapter
You can find the wrox.com code downloads for this chapter at http://www.wrox.com/go/prossis2014 on the Download Code tab.
An old adage says that when you're holding a hammer, everything looks like a nail. When you use SSIS to build a solution, make sure that you use the right tool for each problem you tackle. SSIS will be excellent for some jobs, and SQL Server will shine at other tasks. When used in concert, the combination of the two can be powerful. This chapter discusses other features in the SQL Server arsenal that can help you build robust and high-performance ETL solutions. The SQL Server relational database engine has many features that were designed with data loading in mind, and as such the engine and SSIS form a perfect marriage to extract, load, and transform your data.

This chapter assumes you are using SQL Server 2014 as the source system, though many of the same principles will apply to earlier versions of SQL Server and to other relational database systems too. You should also have the SQL Server 2014 versions of AdventureWorks and AdventureWorksDW installed; these are available from www.wrox.com.

The easiest way to understand how the relational database engine can help you design ETL solutions is to segment the topic into the three basic stages of ETL: extraction, transformation, and loading. Because the domain of transformation is mostly within SSIS itself, there is not much to say there about the relational database engine, so our scope of interest here is narrowed down to extraction and loading.
Data Extraction

Even if a data warehouse solution starts off simple, using one or two sources, it can rapidly become more complex when the users begin to realize the value of the solution and request that data from additional business applications be included in the process. More data increases the complexity of the solution, but it also increases the execution time of the ETL. Storage is certainly cheap today, but the size and amount of data are growing exponentially. If you have a fixed batch window of time in which you can load the data, it is essential to minimize the expense of all the operations. This section looks at ways of lowering the cost of extraction and how you can use those methods within SSIS.
SELECT * Is Bad

In an SSIS Data Flow, the OLE DB Source and ADO.NET Source components allow you to select a table name that you want to load, which makes for a simple development experience but terrible runtime performance. At runtime the component issues a SELECT * FROM «table» command to SQL Server, which obediently returns every single column and row from the table. This is a problem for several reasons:
➤➤ CPU and I/O cost: You typically need only a subset of the columns from the source table, so every extra column you ask for incurs processing overhead in all the subsystems it has to travel through in order to get to the destination. If the database is on a different server, then the layers include NTFS (the file system), the SQL Server storage engine, the query processor, TDS (tabular data stream, SQL Server's data protocol), TCP/IP, OLE DB, the SSIS Source component, and finally the SSIS pipeline (and probably a few other layers). Therefore, even if you are extracting only one redundant integer column of data from the source, once you multiply that cost by the number of rows and processing overhead, it quickly adds up. Saving just 5 percent on processing time can still help you reach your batch window target.
➤➤ Robustness: If the source table has ten columns today and your package requests all the data in a SELECT * manner, then if tomorrow the DBA adds another column to the source table, your package could break. Suddenly the package has an extra column that it doesn't know what to do with, things could go awry, and your Data Flows will need to be rebuilt.
➤➤ Intentional design: For maintenance, security, and self-documentation reasons, the required columns should be explicitly specified.
➤➤ DBA 101: If you are still not convinced, find any seasoned DBA, and he or she is likely to launch into a tirade about why SELECT * is the root of all evil.
As Figure 13-1 shows, the Source components also give you the option of using checkboxes to select or deselect the columns that you require, but the problem with this approach is that the filtering occurs on the client-side. In other words, all the columns are brought across (incurring all that I/O overhead), and then the deselected columns are deleted once they get to SSIS. So what is the preferred way to extract data using these components? The simple answer is to forget that the table option exists and instead use only the query option. In addition, forget that the
column checkboxes exist. For rapid development and prototyping these options may be useful, but for deployed solutions you should type in a query that returns only the necessary columns. SSIS makes it simple to do this by providing a query builder in both the OLE DB and ADO.NET Source components, which enables you to construct a query in a visual manner, as shown in Figure 13-2.
Figure 13-1
Figure 13-2
If you forget to use the query option, or you use a SELECT * while using the query option and do not need the extraneous columns, SSIS will gently remind you during the execution of the package. The Data Flow Task's pipeline recognizes the unused columns and throws warning events when the package runs. These messages provide an easy way to verify your column listing and performance tune your package. When running the package in debug mode, you can see the messages on the Progress tab, as shown in Figure 13-3. An example full message states: "[SSIS.Pipeline] Warning: The output column "PersonType" (31) on output "OLE DB Source Output" (29) and component "Table Option Source - Bad Practice" (18) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance." This reminds you to remove the column PersonType from the source query to prevent the warning
from reoccurring and affecting your future package executions.
Figure 13-3
Note: When using other SSIS sources, such as the Flat File Source, you do not have the option of selecting specific columns or rows with a query. Therefore, you will need to use the method of unchecking the columns to filter these sources.
WHERE Is Your Friend

As an ancillary to the previous tenet, the WHERE clause (also called the query predicate) is one of the most useful tools you can use to increase performance. Again, the table option in the Source components does not allow you to narrow down the set of columns, nor does it allow you to limit the number of rows. If all you really need are the rows from the source system that are tagged with yesterday's date, then why stream every single other row over the wire just to throw them away once they get to SSIS? Instead, use a query with a WHERE clause to limit the number of rows being returned. As before, the less data you request, the less processing and I/O is required, and thus the faster your solution will be.
--BAD programming practice (returns 11 columns, 121,317 rows)
SELECT * FROM Sales.SalesOrderDetail;

--BETTER programming practice (returns 6 columns, 121,317 rows)
SELECT SalesOrderID, SalesOrderDetailID, OrderQty,
    ProductID, UnitPrice, UnitPriceDiscount
FROM Sales.SalesOrderDetail;

--BEST programming practice (returns 6 columns, 79 rows)
SELECT SalesOrderID, SalesOrderDetailID, OrderQty,
    ProductID, UnitPrice, UnitPriceDiscount
FROM Sales.SalesOrderDetail
WHERE ModifiedDate = '2008-07-01';
Note: All code samples in this chapter are available as part of the Chapter 13 code download for the book at http://www.wrox.com/go/prossis2014.
In case it is not clear, Figure 13-4 shows how you would use this SELECT statement (and the other queries discussed next) in the context of SSIS. Drop an OLE DB or ADO.NET Source component onto the Data Flow design surface, point it at the source database (which is AdventureWorks in this case), select the SQL command option, and plug in the preceding query.
Figure 13-4
Transform during Extract

The basic message here is to do some of your transformations while you are extracting. This is not a viable approach for every single transformation you intend to do, especially if your ETL
solution is used for compliance reasons, and you want to specifically log any errors in the data, but it does make sense for primitive operations, such as trimming whitespace, converting magic numbers to NULLs, sharpening data types, and even something as simple as providing a friendlier column name.

Note: A magic number is a value used to represent the "unknown" or NULL value in some systems. This is generally considered bad database design practice; however, it is necessary in some systems that don't have the concept of a NULL state. For instance, you may be using a source database for which the data steward could not assign the value "Unknown" or NULL to, for example, a date column, so instead the operators plugged in 1999/12/31, not expecting that one day the "magic number" would suddenly gain meaning!
The practice of converting data values to the smallest type that can adequately represent them is called data sharpening. In one of the following examples, you convert a DECIMAL(38,0) value to BIT because the column only ever contains the values 0 or 1, as it is more efficient to store and process the data in its smallest (sharpest) representation.
Many data issues can be cleaned up as you’re extracting the data, before it even gets to SSIS. This does not mean you physically fix the data in the source system (though that would be an ideal solution).
Note: The best way to stop bad data from reaching your source system is to restrict the entry of the data in operational applications by adding validation checks, but that is a topic beyond the scope of this book.

To fix the extraction data means you will need to write a query smart enough to fix some basic problems and send the clean data to the end user or the intended location, such as a data warehouse. If you know you are immediately going to fix dirty data in SSIS, fix it with the SQL query instead so SSIS receives it clean from the source. By following this advice, you can offload the simple cleanup work to the SQL Server database engine, and because it is very efficient at doing this type of set-based work, this can improve your ETL performance as well as lower the package's complexity. A drawback of this approach is that data quality issues in your source systems are further hidden from the business, and hidden problems tend not to be fixed!

To demonstrate this concept, imagine you are pulling data from the following source schema. The problems demonstrated in this example are not merely illustrative; they reflect some real-world issues that the authors have seen.
Column Name: CUSTOMER_ID
Data Type: Decimal(8,0)
Examples: 1, 2, 3
Notes: The values in this column are integers (4 bytes), but the source is declared as a decimal, which takes 5 bytes of storage per value.

Column Name: CUSTOMER_NAME
Data Type: Varchar(100)
Examples: "Contoso Traders__", "_XXX", "_Adventure Works", "_", "Acme Apples", "___", ""
Notes: The problem with this column is that where the customer name has not been provided, a blank string "" or "XXX" is used instead of NULL. There are also many leading and trailing blanks in the values (represented by "_" in the examples).

Column Name: ACTIVE_IND
Data Type: Decimal(38,0)
Examples: 1, 0, 1, 1, 0
Notes: Whether by intention or mistake, this simple True/False value is represented by a 17-byte decimal!

Column Name: LOAD_DATE
Data Type: DateTime
Examples: "2000/1/1", "1972/05/27", "9999/12/31"
Notes: The only problem in this column is that unknown dates are represented using a magic number, in this case "9999/12/31". In some systems dates are represented using text fields, which means the dates can be invalid or ambiguous.
If you retrieve the native data into SSIS from the source just described, it will obediently generate the corresponding pipeline structures to represent this data, including the multi-byte decimal ACTIVE_IND column that will only ever contain the values 1 or 0. Depending on the number of rows in the source, allowing this default behavior incurs a large amount of processing and storage overhead. All the data issues described previously will be brought through to SSIS, and you will have to fix them there. Of course, that may be your intention, but you could make your life easier by dealing with them as early as possible. Here is the default query that you might design:
--Old Query
SELECT [BusinessEntityID]
    ,[FirstName]
    ,[EmailPromotion]
FROM [AdventureWorks].[Person].[Person]
You can improve the robustness, performance, and intention of the preceding query. In the spirit of the “right tool for the right job,” you clean the data right inside the query so that SSIS receives it in a
cleaner state. Again, you can use this query in a Source component, rather than use the table method or plug in a default SELECT * query:
--New Query
SELECT
    --Cast the ID to an Int and use a friendly name
    CAST([BusinessEntityID] AS int) AS BusinessID
    --Trim whitespace, convert empty strings to Null
    ,NULLIF(LTRIM(RTRIM(FirstName)), '') AS FirstName
    --Cast the Email Promotion to a bit
    ,CAST((CASE EmailPromotion WHEN 0 THEN 0 ELSE 1 END) AS bit) AS EmailPromoFlag
FROM [AdventureWorks].[Person].[Person]
--Only load the dates you need
WHERE [ModifiedDate] > '2008-12-31'
Let's look at what you have done here:
➤➤ First, you have cast the BusinessEntityID column to a 4-byte integer. You didn't do this conversion in the source database itself; you just converted its external projection. You also gave the column a friendlier name that your ETL developers may find easier to read and remember.
➤➤ Next, you trimmed all the leading and trailing whitespace from the FirstName column. If the column value were originally an empty string (or if after trimming it ended up being an empty string), then you convert it to NULL.
➤➤ You sharpened the EmailPromotion column to a Boolean (BIT) column and gave it a name that is simpler to understand.
➤➤ Finally, you added a WHERE clause in order to limit the number of rows.
What benefit did you gain? Well, because you did this conversion in the source extraction query, SSIS receives the data in a cleaner state than it was originally. Of course, there are bound to be other data quality issues that SSIS will need to deal with, but at least you can get the trivial ones out of the way while also improving basic performance. As far as SSIS is concerned, when it sets up the pipeline column structure, it will use the names and types represented by the query. For instance, it will believe the IsActive column is (and always has been) a BIT; it doesn't waste any time or space treating it as a 17-byte DECIMAL. When you execute the package, the data is transformed inside the SQL engine, and SSIS consumes it in the normal manner (albeit more efficiently because it is cleaner and sharper).

You also gave the columns friendlier names that your ETL developers may find more intuitive. This doesn't add to the performance, but it costs little and makes your packages easier to understand and maintain. If you are planning to use the data in a data warehouse and eventually in an Analysis Services cube, these friendly names will make your life much easier in your cube development.

The results of these queries running in a Data Flow in SSIS are very telling. The old query returns over 19,000 rows, and it took about 0.3 seconds on the test machine. The new query returned only a few dozen rows and took less than half the time of the old query. Imagine this was millions of rows or even billions of rows; the time savings would be quite significant. So query tuning should always be performed when developing SSIS Data Flows.
Many ANDs Make Light Work

OK, that is a bad pun, but it's also relevant. What this tenet means is that you should let the SQL engine combine different data sets for you where it makes sense. In technical terms, this means do any relevant JOINs, UNIONs, subqueries, and so on directly in the extraction query. That does not mean you should use relational semantics to join rows from the source system to the destination system or across heterogeneous systems (even though that might be possible), because that will lead to tightly coupled and fragile ETL design. Instead, this means that if you have two or more tables in the same source database that you are intending to join using SSIS, then JOIN or UNION those tables together as part of the SELECT statement. For example, you may want to extract data from two tables, SalesQ1 and SalesQ2, in the same database. You could use two separate SSIS Source components, extract each table separately, then combine the two data streams in SSIS using a Union All Component, but a simpler way would be to use a single Source component that uses a relational UNION ALL operator to combine the two tables directly:
--Extraction query using UNION ALL
SELECT --Get data from Sales Q1
    SalesOrderID,
    SubTotal
FROM Sales.SalesQ1
UNION ALL --Combine Sales Q1 and Sales Q2
SELECT --Get data from Sales Q2
    SalesOrderID,
    SubTotal
FROM Sales.SalesQ2
Here is another example. In this case, you need information from both the Product and the Subcategory tables. Instead of retrieving both tables separately into SSIS and joining them there, you issue a single query to SQL and ask it to JOIN the two tables for you (see Chapter 7 for more information):
--Extraction query using a JOIN
SELECT
    p.ProductID,
    p.[Name] AS ProductName,
    p.Color AS ProductColor,
    sc.ProductSubcategoryID,
    sc.[Name] AS SubcategoryName
FROM Production.Product AS p
INNER JOIN --Join two tables together
    Production.ProductSubcategory AS sc
    ON p.ProductSubcategoryID = sc.ProductSubcategoryID;
SORT in the Database

SQL Server has intimate knowledge of the data stored in its tables, and as such it is highly efficient at operations such as sorting, especially when it has indexes to help it do the job. While SSIS allows you to sort data in the pipeline, you will find that for large data sets SQL Server is more proficient. As an example, you may need to retrieve data from a table, then immediately sort it so that a Merge Join Transformation can use it (the Merge Join Transformation requires pre-sorted inputs).
You could sort the data in SSIS by using the Sort Transformation, but if your data source is a relational database, you should try to sort the data directly during extraction in the SELECT statement. Here is an example:
--Extraction query using a JOIN and an ORDER BY
SELECT
    p.ProductID,
    p.[Name] AS ProductName,
    p.Color AS ProductColor,
    sc.ProductSubcategoryID,
    sc.[Name] AS SubcategoryName
FROM Production.Product AS p
INNER JOIN --Join two tables together
    Production.ProductSubcategory AS sc
    ON p.ProductSubcategoryID = sc.ProductSubcategoryID
ORDER BY --Sorting clause
    p.ProductID,
    sc.ProductSubcategoryID;
In this case, you are asking SQL Server to pre-sort the data so that it arrives in SSIS already sorted. Because SQL Server is more efficient at sorting large data sets than SSIS, this may give you a good performance boost. The Sort Transformation in SSIS must load all of the data in memory; therefore, it is a fully blocking asynchronous transform that should be avoided whenever possible. See Chapter 7 for more information on this.

Note that the OLE DB and ADO.NET Source components submit queries to SQL Server in a pass-through manner, meaning they do not parse the query in any useful way themselves. The ramification is that the Source components will not know that the data is coming back sorted. To work around this problem, you need to tell the Source components that the data is ordered, by following these steps:
1. Right-click the Source component and choose Show Advanced Editor.
2. Select the Input and Output Properties tab and click the root node for the default output (not the error output). In the property grid on the right is a property called IsSorted. Change this to True. Setting the IsSorted property to True just tells the component that the data is pre-sorted, but it does not tell it in what order.
3. Next, select the columns that are being sorted on, and assign them values as follows: If the column is not sorted, the value should be zero. If the column is sorted in ascending order, the value should be positive. If the column is sorted in descending order, the value should be negative. The absolute value of the number should correspond to the column's position in the order list. For instance, if the query were sorted with ColumnA ascending and ColumnB descending, then you would assign the value 1 to ColumnA and the value -2 to ColumnB, with all other columns set to 0.
4. In Figure 13-5, the data is sorted by the ProductID column. Expand the Output Columns node under the default output node, and then select the ProductID column. In the property grid, set the SortKeyPosition value to 1. Now the Source component is aware that the query is returning a sorted data set; furthermore, it knows exactly which columns are used for the sorting. This sorting information will be passed downstream to the next tasks. The passing down of information allows you to use components like the Merge Join Transformation, which requires sorted inputs, without using an SSIS Sort Transformation in the Data Flow.
Figure 13-5
Be very careful when specifying the sort order: you are by contract telling the system to trust that you know what you are talking about, and that the data is in fact sorted. If the data is not sorted, or it is sorted in a manner other than you specified, then your package can act unpredictably, which could lead to data and integrity loss.
Modularize

If you find you have common queries that you keep using, then try to encapsulate those queries in the source system. This statement is based on ideal situations; in the real world you may not be allowed to touch the source system, but if you can, then there is a benefit. Encapsulating the queries in the source system entails creating views, procedures, and functions that read the data; you are not writing any data changes into the source. Once the (perhaps complex) queries are encapsulated in the source, your queries can be used in multiple packages by multiple ETL developers. Here is an example:
USE SourceSystemDatabase;
GO
CREATE PROCEDURE dbo.up_DimCustomerExtract(@date DATETIME)
-- Test harness (also the query statement you'd use in the SSIS source component):
-- Sample execution: EXEC dbo.up_DimCustomerExtract '2004-12-20';
AS
BEGIN
SET NOCOUNT ON;
SELECT
    --Convert to INT and alias using a friendlier name
    CAST(CUSTOMER_ID AS int) AS CustomerID
    --Trim whitespace, convert empty strings to NULL and alias
    ,NULLIF(LTRIM(RTRIM(CUSTOMER_NAME)), '') AS CustomerName
    --Convert to BIT and use friendly alias
    ,CAST(ACTIVE_IND AS bit) AS IsActive
    ,CASE --Convert magic dates to NULL
        WHEN LOAD_DATE = '9999-12-31' THEN NULL
        --Convert date to smart surrogate number of form YYYYMMDD
        ELSE CONVERT(INT, (CONVERT(NVARCHAR(8), LOAD_DATE, 112)))
     --Alias using friendly name
     END AS LoadDateID
FROM dbo.Customers
--Filter rows using input parameter
WHERE LOAD_DATE = @date;
SET NOCOUNT OFF;
END;
GO
To use this stored procedure from SSIS, you would simply call it from within an OLE DB or ADO.NET Source component. The example shows a static value for the date parameter, but in your solution you would use a variable or expression instead, so that you could call the procedure using different date values (see Chapter 5 for more details):
EXEC dbo.up_DimCustomerExtract '2013-12-20';
Here are some notes on the benefits you have gained:
➤➤ In this case you have encapsulated the query in a stored procedure, though you could have encased it in a user-defined function or view just as easily. A side benefit is that this complex query definition is not hidden away in the depths of the SSIS package; you can easily access it using SQL Server.
➤➤ The benefit of a function or procedure is that you can simply pass a parameter to the module (in this case @date) in order to filter the data (study the WHERE clause in the preceding code). Note, however, that SSIS Source components have difficulty parsing parameters in functions, so you may need to use a procedure instead (which SSIS has no problems with), or you can build a dynamic query in SSIS to call the function (see Chapter 5 for more information).
➤➤ If the logic of this query changes, perhaps because you need to filter in a different way or you need to point the query at an alternative set of tables, then you can simply change the definition in one place, and all the callers of the function will get the benefit. However, there is a risk here too: if you change the query by, for example, removing a column, then the packages consuming the function might break, because they are suddenly missing a column they previously expected. Make sure any such query updates go through a formal change management process in order to mitigate this risk.
SQL Server Does Text Files Too

It is a common pattern for source systems to export nightly batches of data into text files and for the ETL solution to pick up those batches and process them. This is typically done using a Flat File Source component in SSIS, and in general you will find SSIS is the best tool for the job. However, in some cases you may want to treat the text file as a relational source and sort it, join it, or perform
calculations on it in the manner described previously. Because the text file lives on disk, and it is a file, not a database, this is not possible. Or is it? Actually, it is possible! SQL Server includes a table-valued function called OPENROWSET that is an ad hoc method of connecting and accessing remote data using OLE DB from within the SQL engine. In this context, you can use it to access text data, using the OPENROWSET(BULK ...) variation of the function.

Note: Using the OPENROWSET and OPENQUERY statements has security ramifications, so they should be used with care in a controlled environment. If you want to test this functionality, you may need to enable the functions in the SQL Server Surface Area Configuration Tool. Alternatively, use the T-SQL configuration function as shown in the following code. Remember to turn this functionality off again after testing it (unless you have adequately mitigated any security risks). See Books Online for more information.

sp_configure 'show advanced options', 1; --Show advanced configuration options
GO
RECONFIGURE;
GO
sp_configure 'Ad Hoc Distributed Queries', 1; --Switch on OPENROWSET functionality
GO
RECONFIGURE;
GO
sp_configure 'show advanced options', 0; --Remember to hide advanced options
GO
RECONFIGURE;
GO
The SQL Server documentation has loads of information about how to use these two functions, but the following basic example demonstrates the concepts. First create a text file with the following data in it. Use a comma to separate each column value. Save the text file using the name BulkImport.txt in a folder of your choice.
1,AdventureWorks
2,Acme Apples Inc
3,Contoso Traders
Next create a format file that will help SQL Server understand how your custom flat file is laid out. You can create the format file manually, or you can have SQL Server generate it for you. Create a table in the database where you want to use the format file (you can delete the table later; it is just a shortcut to build the format file). Execute the following statement in SQL Server Management Studio. For this example, you are using the AdventureWorks database to create the table, but you can use any database because you will delete the table afterward. The table schema should match the layout of the file.
--Create temporary table to define the flat file schema
USE AdventureWorks
GO
CREATE TABLE BulkImport(ID INT, CustomerName NVARCHAR(50));
Now open a command prompt, navigate to the folder where you saved the BulkImport.txt file, and type the following command, replacing "AdventureWorks" with the database where you created the BulkImport table:
bcp AdventureWorks..BulkImport format nul -c -t , -x -f BulkImport.fmt -T
If you look in the folder where you created the data file, you should now have another file called BulkImport.fmt. This is an XML file that describes the column schema of your flat file; well, actually it describes the column schema of the table you created, but hopefully you created the table schema to match the file. Here is what the format file should look like:
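The generated file should look similar to the following; the exact attribute values (such as MAX_LENGTH and COLLATION) depend on your table definition and server settings:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="12"/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="100"
         COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="CustomerName" xsi:type="SQLNVARCHAR"/>
 </ROW>
</BCPFORMAT>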
Remember to delete the table (BulkImport) you created, because you don't need it anymore. If you have done everything right, you should now be able to use the text file in the context of a relational query. Type the following query into SQL Server Management Studio, replacing the file paths with the exact folder path and names of the two files you created:
--Select data from a text file as if it was a table
SELECT T.*              --SELECT * used for illustration purposes only
FROM OPENROWSET(        --This is the magic function
    BULK 'c:\ProSSIS\Data\Ch13\BulkImport.txt',          --Path to data file
    FORMATFILE = 'c:\ProSSIS\Data\Ch13\BulkImport.fmt'   --Path to format file
) AS T;                 --Command requires a table alias
After executing this command, you should get back rows in the same format they would be in if they had come from a relational table. To prove that SQL Server is treating this result set in the same manner it would treat any relational data, try using the results in the context of more complex operations such as sorting:
--Selecting from a text file and sorting the results
SELECT T.OrgID,            --Not using SELECT * anymore
    T.OrgName
FROM OPENROWSET(
    BULK 'c:\ProSSIS\Data\Ch13\BulkImport.txt',
    FORMATFILE = 'c:\ProSSIS\Data\Ch13\BulkImport.fmt'
) AS T(OrgID, OrgName)     --For fun, give the columns different aliases
ORDER BY T.OrgName DESC;   --Sort the results in descending order
You can declare if and how the text file is pre-sorted. If the system that produced the text file did so in a sorted manner, then you can inform SQL Server of that fact. Note that this is a contract from you, the developer, to SQL Server. SQL Server uses something called a streaming assertion when
reading the text file to double-check your claims, but in some cases this can greatly improve performance. Later you will see how this ordering contract helps with the MERGE operator, but here's a simple example to demonstrate the savings. Run the following query. Note how you are asking for the data to be sorted by OrgID this time. Also note that you have asked SQL Server to show you the query plan that it uses to run the query:
SET STATISTICS PROFILE ON;   --Show query plan
SELECT T.OrgID,
    T.OrgName
FROM OPENROWSET(
    BULK 'c:\ProSSIS\Data\Ch13\BulkImport.txt',
    FORMATFILE = 'c:\ProSSIS\Data\Ch13\BulkImport.fmt'
) AS T(OrgID, OrgName)
ORDER BY T.OrgID ASC;        --Sort the results by OrgID
SET STATISTICS PROFILE OFF;  --Hide query plan
Have a look at the following query plan that SQL Server generates. The query plan shows the internal operations SQL Server has to perform to generate the results. In particular, note the second operation, which is a SORT:
SELECT <...snipped...>
  |--Sort(ORDER BY:([BULK].[OrgID] ASC))
       |--Remote Scan(OBJECT:(STREAM))
This is obvious and expected; you asked SQL Server to sort the data, and it does so as requested. Here's the trick: in this case, the text file happened to be pre-sorted by OrgID anyway, so the sort you requested was actually redundant. (Note the text data file; the ID values increase monotonically from 1 to 3.) To prove this, type the same query into SQL again, but this time use the OPENROWSET(... ORDER) clause:
SET STATISTICS PROFILE ON;   --Show query plan
SELECT T.OrgID,
    T.OrgName
FROM OPENROWSET(
    BULK 'c:\ProSSIS\Data\Ch13\BulkImport.txt',
    FORMATFILE = 'c:\ProSSIS\Data\Ch13\BulkImport.fmt',
    ORDER (OrgID ASC)        --Declare the text file is already sorted by OrgID
) AS T(OrgID, OrgName)
ORDER BY T.OrgID ASC;        --Sort the results by OrgID
SET STATISTICS PROFILE OFF;  --Hide query plan
Once again you have asked for the data to be sorted, but you have also contractually declared that the source file is already pre-sorted. Have a look at the new query plan. Here's the interesting result: even though you asked SQL Server to sort the result in the final ORDER BY clause, it didn't bother doing so because you indicated (and it confirmed) that the file was already ordered as such:
SELECT <...snipped...>
  |--Assert <...snipped...>
       |--Sequence Project(<...snipped...>)
            |--Segment
                 |--Remote Scan(OBJECT:(STREAM))
As you can see, there is no SORT operation in the plan. There are other operators, but they are just inexpensive assertions that confirm the contract you specified is true. For instance, if a row arrived that was not ordered in the fashion you declared, the statement would fail. The streaming assertion check is cheaper than a redundant sort operation, and it is good logic to have in place in case you got the ordering wrong, or the source system one day starts outputting data in a different order than you expected. So after all that, why is this useful to SSIS? Here are a few examples:
➤ You may intend to load a text file in SSIS and then immediately join it to a relational table. Now you could do all that within one SELECT statement, using a single OLE DB or ADO.NET Source Component.
➤ Some of the SSIS components expect sorted inputs (the Merge Join Component, for example). Assuming the source is a text file, rather than sort the data in SSIS you can sort it in SQL Server. If the text file happens to be pre-sorted, you can declare it as such and save even more time and expense.
➤ The Lookup Transformation can populate data from almost anywhere (see Chapter 7), so this may still prove a useful technique in some scenarios.

Warning: Using OPENROWSET to select from a text file should be used only as a temporary solution, as it has many downfalls. First, there are no indexes on the file, so performance is going to be degraded severely. Second, if the data file changes in structure (for instance, a column is dropped), and you don't keep the format file in sync, then the query will fail. Third, if the format file is deleted or corrupted, the query will also fail. This technique can be used when SSIS is not available or does not meet your needs. In most cases, loading the file into a table with SSIS and then querying that table will be your best option.
Using Set-Based Logic

Cursors are infamous for being very slow. They usually perform row-by-row operations that are time-consuming. The SQL Server relational database engine, along with SSIS, performs much faster with set-based logic. The premise here is simple: avoid any use of cursors like the plague. Cursors are nearly always avoidable, and they should be used only as a final resort. Try out the following features and see if they can help you build efficient T-SQL operations:
➤ Common table expressions (CTEs) enable you to modularize subsections of your queries, and they also support recursive constructs, so you can, for instance, retrieve a self-linked (parent-child) organizational hierarchy using a single SQL statement (see the sketch after this list).
➤ Table-valued parameters enable you to pass arrays into stored procedures as variables. This means that you can program your stored procedure logic using the equivalent of dynamic arrays.
➤ UNION is now joined by its close cousins, INTERSECT and EXCEPT, which complete the primitive set of operations you need to perform set arithmetic. UNION joins two rowsets together, INTERSECT finds their common members, and EXCEPT finds the members that are present in one rowset but not the other.
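To make the recursive CTE idea concrete, here is a minimal sketch. It assumes a hypothetical dbo.Employee table with EmployeeID and ManagerID columns; the names are illustrative only and are not part of the AdventureWorks examples used elsewhere in this chapter.

--Walk a parent-child hierarchy with a recursive CTE (illustrative sketch)
WITH OrgChart AS (
    SELECT EmployeeID, ManagerID, 0 AS Depth
    FROM dbo.Employee
    WHERE ManagerID IS NULL          --Anchor member: top of the hierarchy
    UNION ALL
    SELECT e.EmployeeID, e.ManagerID, c.Depth + 1
    FROM dbo.Employee AS e
    INNER JOIN OrgChart AS c
        ON e.ManagerID = c.EmployeeID  --Recursive member: walk down one level
)
SELECT EmployeeID, ManagerID, Depth
FROM OrgChart
ORDER BY Depth, EmployeeID;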
The following example brings all these ideas together. In this example scenario, suppose you have two tables of data, both representing customers. The challenge is to group the data into three subsets: one set containing the customers who exist in the first table only, the second set containing customers who exist in the second table only, and the third set containing the customers who exist in both tables. The specific example illustrates the power and elegance of common table expressions (CTEs) and set-arithmetic statements. If you remember Venn diagrams from school, what you are trying to achieve is the relational equivalent of the diagram shown in Figure 13-6.
Figure 13-6 (Venn diagram: customers in the first table only, customers in both tables, customers in the second table only)
Following is a single statement that will partition the data as required. This statement is not meant to convey good programming practice, because it is not the most optimal or concise query you could write to derive these results. It is simply meant to demonstrate the manner in which these constructs can be used. By studying the verbose form, you can appreciate the elegance, composability, and self-documenting nature of the syntax. For convenience you will use related tables from AdventureWorks and AdventureWorksDW. Note the use of multiple CTE structures to generate intermediate results (though the query optimizer is smart enough to not execute the statements separately). Also notice the use of UNION, EXCEPT, and INTERSECT to derive specific results:

WITH SourceRows AS (            --CTE containing all source rows
  SELECT TOP 1000 AccountNumber
  FROM AdventureWorks.Sales.Customer
  ORDER BY AccountNumber
),
DestinationRows(AccountNumber) AS (   --CTE containing all destination rows
  SELECT CustomerAlternateKey
  FROM AdventureWorksDW.dbo.DimCustomer
),
RowsInSourceOnly AS (           --CTE: rows where AccountNumber is in source only
  SELECT AccountNumber FROM SourceRows        --select from previous CTE
  EXCEPT                                      --EXCEPT means 'subtract'
  SELECT AccountNumber FROM DestinationRows   --select from previous CTE
),
RowsInSourceAndDestination AS ( --CTE: AccountNumber in both source & destination
  SELECT AccountNumber FROM SourceRows
  INTERSECT                     --INTERSECT means 'find the overlap'
  SELECT AccountNumber FROM DestinationRows
),
RowsInDestinationOnly AS (      --CTE: AccountNumber in destination only
  SELECT AccountNumber FROM DestinationRows
  EXCEPT                        --Simply doing the EXCEPT the other way around
  SELECT AccountNumber FROM SourceRows
),
RowLocation(AccountNumber, Location) AS (   --Final CTE
  SELECT AccountNumber, 'Source Only' FROM RowsInSourceOnly
  UNION ALL                     --UNION means 'add'
  SELECT AccountNumber, 'Both' FROM RowsInSourceAndDestination
  UNION ALL
  SELECT AccountNumber, 'Destination Only' FROM RowsInDestinationOnly
)
SELECT * FROM RowLocation       --Generate final result
ORDER BY AccountNumber;
Here is a sample of the results:

AccountNumber  Location
-------------  ----------------
AW00000700     Source Only
AW00000701     Source Only
AW00011000     Both
. . .
AW00011298     Both
AW00011299     Destination Only
AW00011300     Destination Only
SQL Server provides many powerful tools for use in your data extraction arsenal. Learn about them and then start using the SQL Server relational database engine and SSIS in concert to deliver optimal extraction routines. The list presented previously is not exhaustive; you can use many other similar techniques to improve the value of the solutions you deliver.
Data Loading

This section focuses on data loading. Many of the same techniques presented in the data extraction section apply here too, so the focus is on areas that have not been covered before.
Database Snapshots

Database snapshots were introduced as a way to persist the state of a database at a specific point in time. The underlying technology is referred to as copy-on-first-write, which is a fancy way of saying that once you create the database snapshot, it is relatively cheap to maintain because it only tracks things that have changed since the database snapshot was created.

Once you have created a database snapshot, you can change the primary database in any way, for instance, changing rows, creating indexes, and dropping tables. If at any stage you want to revert all your changes back to when you created the database snapshot, you can do that very easily by doing a database restore using the database snapshot as the media source. In concept, the technology sounds very similar to backup and restore, the key difference being that this is a completely online operation, and depending on your data loads, the operations can be near instantaneous. This is because when you create the snapshot, it is a metadata operation only; you do not physically "back up" any data. When you "restore" the database from the snapshot, you do not restore all the data; rather, you restore only what has changed in the interim period.

This technique proves very useful in ETL when you want to prototype any data changes. You can create a package that makes any data changes you like, confident in the knowledge that you can easily roll back the database to a clean state in a short amount of time. Of course, you could achieve the same goals using backup and restore (or transactional semantics), but those methods typically have more overhead and/or take more time. Snapshots may also be a useful tool in operational ETL; you can imagine a scenario whereby a snapshot is taken before an ETL load, and then if there are any problems, the data changes can be easily rolled back.

There is a performance overhead to using snapshots, because you can think of them as a "live" backup. Any activity on the source database incurs activity on the snapshot database, because the first change to any database page causes that page to be copied to the database snapshot. Any subsequent changes to the same page do not cause further copy operations but still have some overhead due to the writing to both source and snapshot. You need to test the performance overhead in the solutions you create, though you should expect to see an overhead of anywhere from 5 percent to 20 percent.

Because you are writing data to the destination database in this section, it is useful to create a database snapshot so you can roll back your changes very easily. Run this complete script:

--Use a snapshot to make it simple to rollback the DML
USE master;
GO

--To create a snapshot you need to close all other connections on the DB
ALTER DATABASE [AdventureWorksDW] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
ALTER DATABASE [AdventureWorksDW] SET MULTI_USER;

--Check if there is already a snapshot on this DB
IF EXISTS (SELECT [name] FROM sys.databases
           WHERE [name] = N'AdventureWorksDW_Snapshot')
BEGIN
  --If so RESTORE the database from the snapshot
  RESTORE DATABASE AdventureWorksDW
  FROM DATABASE_SNAPSHOT = N'AdventureWorksDW_Snapshot';

  --If there were no errors, drop the snapshot
  IF @@error = 0 DROP DATABASE [AdventureWorksDW_Snapshot];
END; --if

--OK, let's create a new snapshot on the DB
CREATE DATABASE [AdventureWorksDW_Snapshot] ON
( NAME = N'AdventureWorksDW_Data',
  --Make sure you specify a valid location for the snapshot file here
  FILENAME = N'c:\ProSSIS\Data\Ch13\AdventureWorksDW_Data.ss')
AS SNAPSHOT OF [AdventureWorksDW];
GO
The script should take only a couple of seconds to run. It creates a database file in the specified folder that is tagged as being a snapshot of the AdventureWorksDW database. You can run the following command to list all the database snapshots on the server:

--List database snapshots
SELECT d.[name] AS DatabaseName, s.[name] AS SnapshotName
FROM sys.databases AS s
INNER JOIN sys.databases AS d ON (s.source_database_id = d.database_id);
You should now have a snapshot called "AdventureWorksDW_Snapshot." This snapshot is your "live backup" of AdventureWorksDW. Once you have ensured that the database snapshot is in place, test its functionality by changing some data or metadata in AdventureWorksDW. For instance, you can create a new table in the database and insert a few rows:

--Create a new table and add some rows
USE AdventureWorksDW;
GO
CREATE TABLE dbo.TableToTestSnapshot(ID INT);
GO
INSERT INTO dbo.TableToTestSnapshot(ID)
SELECT 1 UNION SELECT 2 UNION SELECT 3;
You can confirm the table is present in the database by running this statement. You should get back three rows:

--Confirm the table exists and has rows
SELECT * FROM dbo.TableToTestSnapshot;
Now you can test the snapshot rollback functionality. Imagine that the change you made to the database had much more impact than just creating a new table (perhaps you dropped the complete sales transaction table, for instance) and you now want to roll the changes back. Execute the same script that you used to originally create the snapshot; you will notice that the script includes a check to ensure that the snapshot exists; then, if so, it issues a RESTORE ... FROM DATABASE_SNAPSHOT command.

After running the script, try running the SELECT command again that returned the three rows. You should get an error saying the table "TableToTestSnapshot" does not exist. This is good news; the database has been restored to its previous state! Of course, this same logic applies whether you created a table or dropped one, added or deleted rows, or performed just about any other operation. The really cool benefit is that it should have taken only a couple of seconds to run this "live restore."

As part of the original snapshot script, the database was rolled back, but the script should also have created a new snapshot in the old one's place. Make sure the snapshot is present before continuing with the next sections, because you want to make it simple to roll back any changes you make. Not only can you use database snapshots for prototyping, but you can also add tasks to your regularly occurring ETL jobs to create the snapshots, and even restore them if needed! A solution that uses this methodology is robust enough to correct its own mistakes and can be part of an enterprise data warehouse solution.
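For reference, the rollback itself boils down to the RESTORE ... FROM DATABASE_SNAPSHOT statement embedded in the script above; if you ever need to revert manually, this fragment on its own is enough:

--Revert AdventureWorksDW to the state captured by the snapshot
USE master;
GO
RESTORE DATABASE AdventureWorksDW
FROM DATABASE_SNAPSHOT = N'AdventureWorksDW_Snapshot';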
The MERGE Operator

If your source data table is conveniently partitioned into data you want to insert, data you want to delete, and data you want to update, then it is simple to use the INSERT, UPDATE, and DELETE statements to perform the respective operations. However, it is often the case that the data is not presented to you in this format. More often than not you have a source system with a range of data that needs to be loaded, but you have no way of distinguishing which rows should be applied in which way. The source contains a mix of new, updated, and unchanged rows.
One way you can solve this problem is to build logic that compares each incoming row with the destination table, using a Lookup Transformation (see Chapter 7 for more information). Another way to do this would be to use Change Data Capture (see Chapter 11 for more information) to tell you explicitly which rows and columns were changed, and in what way. There are many other ways of doing this too, but if none of these methods are suitable, you have an alternative, which comes in the form of the T-SQL operator called MERGE (also known in some circles as "upsert" because of its mixed Update/Insert behavior).

The MERGE statement is similar in usage to the INSERT, UPDATE, and DELETE statements; however, it is more useful in that it can perform all three of their duties within the same operation. Here is pseudocode to represent how it works; after this you will delve into the real syntax and try some examples:

MERGE INTO Destination
Using these semantics:
{
  If a row in the Destination matches a row in the Source then: UPDATE
  If a row exists in the Source but not in the Destination then: INSERT
  If a row exists in the Destination but not in the Source then: DELETE
}
FROM Source;
As you can see, you can issue a single statement to SQL Server, and it is able to figure out on a row-by-row basis which rows should be INSERT-ed, UPDATE-ed, and DELETE-ed in the destination. This can provide a huge time savings compared to doing it the old way: issuing two or three separate statements to achieve the same goal. Note that SQL Server is not just cleverly rewriting the MERGE query back into INSERT and UPDATE statements behind the scenes; this functionality is a DML primitive deep within the SQL core engine, and as such it is highly efficient.

Now you are going to apply this knowledge to a real set of tables. In the extraction section of this chapter you used customer data from AdventureWorks and compared it to data in AdventureWorksDW. There were some rows that occurred in both tables, some rows that were only in the source, and some rows that were only in the destination. You will now use MERGE to synchronize the rows from AdventureWorks to AdventureWorksDW so that both tables contain the same data. This is not a real-world scenario, because you would not typically write rows directly from the source to the destination without cleaning and shaping the data in an ETL tool like SSIS, but for the sake of convenience the example demonstrates the concepts.

First, you need to add a new column to the destination table so you can see what happens after you run the statement. This is not something you would need to do in a real solution.

USE AdventureWorksDW;
GO
--Add a column to the destination table to help us track what happened
--You would not do this in a real solution, this just helps the example
ALTER TABLE dbo.DimCustomer ADD Operation NVARCHAR(10);
GO
Now you can run the MERGE statement. The code is commented to explain what it does. The destination data is updated from the source in the manner specified by the various options. There are blank lines between each main section of the command to improve readability, but keep in mind that this is a single statement:

USE AdventureWorksDW;
GO

--Merge rows from source into the destination
MERGE
--Define the destination table
INTO AdventureWorksDW.dbo.DimCustomer AS [Dest]   --Friendly alias

--Define the source query
USING (
  SELECT
    AccountNumber AS CustomerAlternateKey,
    --Keep example simple by using just a few data columns
    p.FirstName,
    p.LastName
  FROM AdventureWorks.Sales.Customer c
  INNER JOIN AdventureWorks.Person.Person p ON c.PersonID = p.BusinessEntityID
) AS [Source]   --Friendly alias

--Define the join criteria (how SQL matches source/destination rows)
ON [Dest].CustomerAlternateKey = [Source].CustomerAlternateKey

--If the same key is found in both the source & destination
WHEN MATCHED
--For *illustration* purposes, only update every second row
--AND CustomerAlternateKey % 2 = 0
--Then update data values in the destination
THEN UPDATE SET
  [Dest].FirstName = [Source].FirstName,
  [Dest].LastName = [Source].LastName,
  [Dest].Operation = N'Updated'   --Note: clause is implicit

--If a key is in the source but not in the destination
WHEN NOT MATCHED BY TARGET
--Then insert row into the destination
THEN INSERT (
  GeographyKey, CustomerAlternateKey, FirstName,
  LastName, DateFirstPurchase, Operation
)
VALUES (
  1, [Source].CustomerAlternateKey, [Source].FirstName,
  [Source].LastName, GETDATE(), N'Inserted'
)

--If a key is in the destination but not in the source...
WHEN NOT MATCHED BY SOURCE
--Then do something relevant, say, flagging a status field
THEN UPDATE SET [Dest].Operation = N'Deleted';   --Note: clause is implicit

--Alternatively you could have deleted the destination row
--but in AdventureWorksDW that would fail due to FK constraints
--WHEN NOT MATCHED BY SOURCE THEN DELETE;
GO
After running the statement, you should get a message in the query output pane telling you how many rows were affected:

(19119 row(s) affected)
You can now check the results of the operation by looking at the data in the destination table. If you scroll through the results, you should see each row's Operation column populated with the operation that was applied to it:

--Have a look at the results
SELECT CustomerAlternateKey, DateFirstPurchase, Operation
FROM AdventureWorksDW.dbo.DimCustomer;
Here is a subset of the results. For clarity, the different groups of rows have been separated in this book by blank lines:

CustomerAlternateKey DateFirstPurchase       Operation
-------------------- ----------------------- -----------
AW00019975           2002-04-11 00:00:00.000 NULL
AW00019976           2003-11-27 00:00:00.000 Updated
AW00019977           2002-04-26 00:00:00.000 NULL

AW00019978           2002-04-20 00:00:00.000 Deleted
AW00019979           2002-04-22 00:00:00.000 Deleted

AW00008000           2008-02-24 20:48:12.010 Inserted
AW00005229           2008-02-24 20:48:12.010 Inserted
AW00001809           2008-02-24 20:48:12.010 Inserted
As you can see, a single MERGE statement has inserted, updated, and deleted rows in the destination in the context of just one operation. Some of the updated rows would show a NULL operation if you had used the commented-out predicate in the WHEN MATCHED section that updates only every second row.

Note that the source query can retrieve data from a different database (as per the example); furthermore, it can even retrieve data using the OPENROWSET() function you read about earlier. However, MERGE requires sorting the source data stream on the join key; SQL Server will automatically sort the source data for you if required, so ensure that the appropriate indexes are in place for a more optimal experience. These indexes should be on the join key columns. Do not confuse this operator with the Merge Join Transformation in SSIS. If the source query happens to be of the form OPENROWSET(BULK...), in other words, you are reading from a text file, then make sure you have specified any intrinsic sort order that the text file may already have. If the text file is already sorted in the same manner as the order required for MERGE (or you can ask the source extract system to do so), then SQL Server is smart enough to not incur a redundant sort operation.

The MERGE operator is a very powerful technique for improving mixed-operation data loads, but how do you use it in the context of SSIS? If you do not have the benefit of Change Data Capture (discussed in Chapter 11) and the data sizes are too large to use the Lookup Transformation in an efficient manner (see Chapter 7), then you may have to extract your data from the source, clean and shape it in SSIS, and then dump the results to a staging table in SQL Server. From the staging table, you now need to apply the rows against the true destination table. You could certainly do this using two or three separate INSERT, UPDATE, and DELETE statements, with each statement joining the staging table and the destination table together in order to compare the respective row and column values. However, you can now use a MERGE statement instead, as sketched below. The MERGE operation is more efficient than running the separate statements, and it is more intentional and elegant to develop and maintain. It is also more efficient with larger data sets than the SCD Wizard and its OLE DB Command Transformation approach.

Make sure you execute the original snapshot script again in order to undo the changes you made in the destination database.
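As a rough illustration of that staging pattern, the following sketch merges a hypothetical staging table (dbo.StagingCustomer, loaded by an SSIS Data Flow) into DimCustomer. The staging table, its index name, and the reduced column list are assumptions made for the example, not objects that ship with AdventureWorksDW; in a real solution you would supply all required destination columns, as in the full MERGE example above.

--Index the join key on the staging table so MERGE can avoid an expensive sort
CREATE INDEX IX_StagingCustomer_Key
    ON dbo.StagingCustomer (CustomerAlternateKey);

--Apply the staged rows to the destination in a single statement
MERGE INTO dbo.DimCustomer AS [Dest]
USING dbo.StagingCustomer AS [Stage]
    ON [Dest].CustomerAlternateKey = [Stage].CustomerAlternateKey
WHEN MATCHED THEN
    UPDATE SET [Dest].FirstName = [Stage].FirstName,
               [Dest].LastName  = [Stage].LastName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (GeographyKey, CustomerAlternateKey, FirstName, LastName, DateFirstPurchase)
    VALUES (1, [Stage].CustomerAlternateKey, [Stage].FirstName,
            [Stage].LastName, GETDATE());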
Summary

As this chapter has shown, you can take advantage of many opportunities to use SQL in concert with SSIS in your ETL solution. The ideas presented in this chapter are not exhaustive; many other ways exist to increase your return on investment using the Microsoft data platform. Every time you find a way to use a tool optimized for the task at hand, you can lower your costs and improve your efficiencies. There are tasks that SSIS is much better at doing than the SQL Server relational database engine, but the opposite statement applies too. Make sure you think about which tool will provide the best solution when building ETL solutions; the best solutions often utilize a combination of the complete SQL Server business intelligence stack.
14

Accessing Heterogeneous Data

What's in This Chapter?
➤ Dealing with Excel and Access data
➤ Integrating with Oracle
➤ Working with XML files and web services
➤ Extracting from flat files
➤ Integrating with ODBC
Wrox.com Code Downloads for this Chapter
You can find the wrox.com code downloads for this chapter at http://www.wrox.com/go/prossis2014 on the Download Code tab.
In this chapter, you will learn about importing and working with data from heterogeneous, or various non–SQL Server, sources. In today's enterprise environments, data may exist in many diverse systems, such as Oracle, DB2, Teradata, SQL Azure, SQL Parallel Data Warehouse (PDW), Office documents, XML, or flat files, to name just a few. The data may be generated within the company, or it may be delivered through the Internet from a trading partner. Whether you need to import data from a spreadsheet to initially populate a table in a new database application or pull data from other sources for your data warehouse, accessing heterogeneous data is probably a big part of your job.

You can load data into SQL Server using SSIS through any ODBC-compliant, OLE DB-compliant, or ADO.NET managed source. Many ODBC, OLE DB, and .NET providers are supplied by Microsoft for sources like Excel, Access, DB2, FoxPro, Sybase, Oracle, Teradata, and dBase. Others are available from database vendors. A variety of Data Source Components are found in SSIS. These include Excel, Flat File, XML, ADO.NET (which is used to connect to .NET Sources), OLE DB (which allows connections to many different types of data), and Raw File (a special source used to read data that has been previously exported to a Raw File Destination). If the supplied Data Sources do not meet your needs, you can also create custom Data Sources.

SSIS can consume many of these sources with out-of-the-box features. In addition, Microsoft has provided a set of free downloads in the SQL Server feature pack for advanced data source extraction. These include a set of source components from Attunity, third-party components that Microsoft has licensed for use with SSIS. The Attunity connectors allow advanced sourcing from Oracle (with bulk load capabilities), Teradata, and ODBC sources.

Figure 14-1 highlights the Source Assistant within the Data Flow Toolbox. It shows the various source options within SSIS. Many of them require the installation of a client tool; the gray information window at the bottom of the figure describes where to find the additional application if required.
Figure 14-1
This chapter begins with the built-in features and walks you through accessing data from several of the most common sources. In addition to working in SSIS, you will become familiar with the differences between 32-bit and 64-bit drivers, as well as the client tools you need to install for the providers for DB2, Oracle, SAP BI, and Teradata, as available from those websites. This chapter targets the following sources:
➤ Excel and MS Access (versions 2013 and earlier): Excel is often used as a quick way to store data because spreadsheets are easy to set up and use. Access applications are frequently upsized to SQL Server as the size of the database and number of users increase.
➤ Oracle: Even companies running their business on Oracle or another of SQL Server's competitors sometimes make use of SQL Server because of its cost-effective reporting and business intelligence solutions.
➤ XML and Web Services: XML and web services (which is XML delivered through HTTP) are standards that enable very diverse systems to share data. The XML Data Source enables you to work with XML as you would with almost any other source of data.
➤ Flat Files: Beyond just standard delimited files, SSIS can parse flat files of various types and code page encodings, which allows files to be received from and exported to different operating systems and non-Windows-based systems. This reduces the need to convert flat files before or after working with them in SSIS.
➤ ODBC: Many organizations maintain older systems that use legacy ODBC providers for data access. Because of the complexities and cost of migrating systems to newer versions, ODBC is still a common source.
➤ Teradata: Teradata is a data warehouse database engine that scales out on multiple nodes. Large organizations that can afford Teradata's licensing and ongoing support fees often use it for centralized warehouse solutions.
➤ Other Heterogeneous Sources: The sources listed previously are the most common; however, this only touches on the extent of Data Sources that SSIS can access. The last section of this chapter provides third-party resources for when you are trying to access other sources such as SAP or Sybase.
Excel and Access

SSIS deals with Excel and Access data in a similar fashion because they use the same underlying provider technology for data access. For Microsoft Office 2003 and earlier, the data storage technology is called the JET Engine, which stands for Joint Engine Technology; therefore, when you access these legacy releases of Excel or Access, you will be using the JET OLE DB Provider (32-bit only). Office 2007 introduced a new engine called ACE that is essentially a newer version of the JET but supports the new file formats of Excel and Access. ACE stands for Access Engine and is used for Office 2007 and later. In addition, with the release of Office 2010, Microsoft provided a 64-bit version of the ACE provider. You will find both the 32-bit and 64-bit drivers under the name "Microsoft Office 12.0 Access Database Engine OLE DB Provider" in the OLE DB provider list. Therefore, when connecting to Access or Excel in these versions, you will use the ACE OLE DB Provider. If you have the 64-bit version of Office 2010 or 2013 installed, the next section will also review working with the 32-bit provider, because it can be confusing. Later in this section you will learn how to connect to both Access and Excel for both the JET and ACE engines.
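To make the provider difference concrete, these are roughly the connection strings the two providers use for Excel files. The file paths are placeholders, and the Extended Properties values shown are the commonly used ones rather than an exhaustive list.

JET provider (Excel 97-2003 .xls files, 32-bit only):
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\ProSSIS\Data\Sample.xls;
Extended Properties="Excel 8.0;HDR=YES";

ACE provider (Excel 2007 and later .xlsx files, 32-bit or 64-bit):
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\ProSSIS\Data\Sample.xlsx;
Extended Properties="Excel 12.0 Xml;HDR=YES";

HDR=YES tells the provider that the first row of the worksheet contains column headers.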
64-Bit Support

In older versions of Office (Office 2007 and earlier), only a 32-bit driver was available. That meant if you wanted to extract data from Excel or Access, you had to run SSIS in 32-bit mode. Beginning with Office 2010, however, a 64-bit version of Office became available that enables you to extract data from Excel and Access using SSIS on a 64-bit server in native mode. In order to use the 64-bit version of the ACE engine, you need to install the 64-bit Access provider, either by installing the 64-bit version of Microsoft Office 2010 or later or by installing the 64-bit driver from Microsoft's download site: http://www.microsoft.com/en-us/download/details.aspx?id=39358.

However, even though a 64-bit version of the ACE provider is available, you cannot develop packages with the 64-bit driver. This is because Visual Studio is a 32-bit application and is unable to see a 64-bit driver. With ACE, if you try to install both the 32-bit and 64-bit versions, you will receive the error shown in Figure 14-2. Therefore, the 64-bit driver can be used for a test or production server where packages are not executed through the design-time environment. The approach to using the 64-bit driver is to design your package with the 32-bit driver and then deploy your package to a server that has the 64-bit ACE driver installed.

Figure 14-2

To be sure, SSIS can run natively on a 64-bit machine (just like it can on a 32-bit machine). This means that when the operating system is running the X64 version of Windows Server 2003, Windows 7, Windows 8, Windows Server 2008, or a future version of Windows Server, you can natively install and run SQL Server in the X64 architecture (an IA64 Itanium build is also available from Microsoft support). When you execute a package in either 64-bit or 32-bit mode, the driver needs to either work in both execution environments or, like the ACE provider, have the right version for either the 32-bit or 64-bit execution mode.

When you install SSIS with the native X64 installation bits, you also get the 32-bit runtime executables that you can use to run packages that need access to 32-bit drivers not supported in the 64-bit environment. When working on a 64-bit machine, you can run packages in 32-bit emulation mode through the SSDT design environment and through the 32-bit version of DTExec. In addition, when using the SSIS Server Catalog in SQL Server 2014, you are also able to run packages in 32-bit or 64-bit mode. Here are the details:
➤ Visual Studio 2012: By default, when you are in a native 64-bit environment and you run a package, you are running the package in 64-bit mode. However, you can change this behavior by modifying the properties of your SSIS project. Figure 14-3 shows the Run64bitRuntime property on the Debugging property page. When you set this to False, the package runs in 32-bit emulation mode even though the machine is 64-bit.
➤ 32-bit version of DTExec: By default, a 64-bit installation of SSIS references the 64-bit version of DTExec, usually found in the C:\Program Files\Microsoft SQL Server\120\DTS\Binn folder. However, a 32-bit version is also included in C:\Program Files (x86)\Microsoft SQL Server\120\DTS\Binn, and you can reference that directly if you want a package to run in 32-bit emulation mode in order to access the ACE and JET providers (see the example command line after this list).
Figure 14-3

➤ 32-bit version for packages deployed to the SSIS catalog: When running a package that has been deployed to the SSIS 2014 catalog, an advanced configuration option, "32-bit runtime," will allow your package to be executed in legacy 32-bit execution mode. This option is available both in SQL Agent and in the package execution UI in the SSIS 2014 catalog. The default is to have this option unchecked so that packages run in 64-bit mode.
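As a quick illustration of the 32-bit DTExec option above, a command line along these lines runs a package with the 32-bit runtime. The package path is a placeholder, and the /File switch assumes a file-system (rather than catalog) deployment:

"C:\Program Files (x86)\Microsoft SQL Server\120\DTS\Binn\DTExec.exe" /File "C:\ProSSIS\Packages\ExportExcel.dtsx"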
Be careful not to run all your packages in 32-bit emulation mode when running on a 64-bit machine; run only the ones that need 32-bit support that way. The 32-bit emulation mode limits memory accessibility and performance. The best approach is to modularize your packages by developing more packages with less logic in each of them. One benefit of this is that the packages that need 32-bit execution can be separated and run separately.
Working with Excel Files

Excel is a common source and destination because it is often the favorite "database" software of many people without database expertise (especially in your accounting department!). SSIS has Data Flow Source and Destination Components made just for Excel that ease the connection setup, whether connecting to Excel 2003 or earlier or to Excel 2007 or later (the JET and ACE providers). You can be sure that these components will be used in many SSIS packages, because data is often imported from Excel files into a SQL Server database or exported into Excel for many high-level tasks such as sales forecasting.

Because Excel is so easy to work with, it is common to find inconsistencies in the data. For example, while lookup lists or data type enforcement is possible to implement, it is less likely for an Excel workbook to have it in place. It's often possible for the person entering data to type a note in a cell where a date should go. Of course, cleansing the data is part of the ETL process, but it may be even more of a challenge when importing from Excel.
Exporting to All Versions of Excel

In this section, you will use SSDT to create SSIS packages to export data to Excel files. The first example shows how to create a package that exports a worksheet that the AdventureWorks inventory staff will use to record the physical inventory counts. Follow these steps to learn how to export data to Excel:
1. Create a new SSIS package in SSDT and rename the package Export Excel.dtsx.
2. Drag a Data Flow Task from the Toolbox to the Control Flow design area and then switch to the Data Flow tab.
3. Create a Connection Manager pointing to the AdventureWorks database.
4. Add an OLE DB Source Component.
5. Double-click the OLE DB Source Component to bring up the OLE DB Source Editor. Make sure that Connection Manager is selected on the left.
6. Choose the AdventureWorks Connection Manager for the OLE DB Connection Manager property. The data access mode should be set to SQL Command. In this case, you will write a query (Excel_Export_SQL.txt) to specify which data to export:

SELECT ProductID, LocationID, Shelf, Bin, NULL AS PhysicalCount
FROM Production.ProductInventory
ORDER BY LocationID, Shelf, Bin
7. If you select Columns in the left pane, you have the opportunity to deselect some of the columns or change the name of the output columns. Click OK to accept the configuration.
8. Drag an Excel Destination Component from the SSIS Toolbox, found under the Other Destinations grouping, and drag the Data Flow Path (blue arrow on your screen) from the OLE DB Source to the Excel Destination. Double-click the Excel Destination.
9. Click the New button for the Connection Manager, and in the Excel Connection Manager window, choose Microsoft Excel 2007 from the Excel Version dropdown, and then enter the path to your destination (C:\ProSSIS\Data\Inventory_Worksheet.xlsx).
10. Select OK in the Excel Connection Manager window, and then click New on the Name of Excel sheet dropdown to create a new worksheet.
11. In the Create Table window, you can leave the name of the worksheet or change it and modify the columns as Figure 14-4 shows. Click OK to create the new worksheet.
12. The data access mode should be set to Table or View (more about this later). Click OK to create a new worksheet with the appropriate column headings in the Excel file. Make sure that Name of the Excel sheet is set to Inventory Worksheet.
13. You must click Mappings on the left to set the mappings between the source and the destination. Each one of the Available Input Columns should match up exactly with an Available Output Column. Click OK to accept the Inventory Worksheet settings.

Figure 14-4

Run the package to export the product list. The fields selected from the Production.ProductInventory table will export to the Excel file, and your inventory crew members can each use a copy of this file to record their counts.
Importing from Excel 2003 and Earlier

For this example of importing Excel data, assume that you work for AdventureWorks and the AdventureWorks inventory crew divided up the assignments according to product location. As each assignment is completed, a partially filled-out worksheet file is returned to you. In this example, you create a package to import the data from each worksheet that is received:

1. Open SQL Server Data Tools (SSDT) and create a new SSIS package.
2. Drag a Data Flow Task to the Control Flow design pane.
3. Click the Data Flow tab and add an Excel Source and an OLE DB Destination Component. Rename the Excel Source to Inventory Worksheet and rename the OLE DB Destination to Inventory Import.
4. Drag the blue Data Flow Path from the Inventory Worksheet Component to the Inventory Import Component.

Note: The OLE DB Destination sometimes works better than the SQL Server Destination Component for importing data from non-SQL Server sources! When using the SQL Server Destination Component, you cannot import into integer columns or varchar columns from an Excel spreadsheet, and must import into double precision and nvarchar columns. The SQL Server Destination Component does not support implicit data type conversions and works as expected when moving data from SQL Server as a source to SQL Server as a destination.

5. Create a Connection Manager for the Excel file you have been working with by following the instructions in the previous section (select Microsoft Excel 97-2003 in the Excel version dropdown).
6. Rename the Excel Connection Manager in the Properties window to Inventory Source.
7. Create a Connection Manager pointing to the AdventureWorks database.
8. Double-click the Inventory Worksheet Component to bring up the Excel Source Editor (see Figure 14-5).
9. For this example the data access mode should be set to SQL Command because you only want to import rows with the physical count filled in. Type the following query (Excel_Import_SQL.txt) into the SQL command text box (see Figure 14-6):

SELECT ProductID, PhysicalCount, LocationID, Shelf, Bin
FROM Inventory_Worksheet
WHERE PhysicalCount IS NOT NULL
Figure 14-5
Figure 14-6
10. Double-click the Inventory Import Component to bring up the OLE DB Destination Editor. Make sure the AdventureWorks connection is chosen. Under Data access mode, choose Table or View.
11. Click the New button next to Name of the table or the view to open the Create Table dialog.
12. Change the name of the table to InventoryImport and click OK to create the table. Select Mappings. Each field from the worksheet should match up to a field in the new table.
13. Click OK to accept the configuration.
While this is a simple example, it illustrates just how easy it is to import from and export to Excel files.
Importing from Excel 2007 and Later

Setting up an SSIS package to import from Excel 2007 and later is very similar to setting up the connection when exporting to Excel 2007. When you set up the connection, choose Excel 2007 from the Excel version dropdown (step 5 above). Once you have set up the connection as already shown in Figures 14-5 and 14-6, you need to create an OLE DB Source adapter in the Data Flow. You can either reference the worksheet directly or specify a query that extracts data from specific Excel columns. Figure 14-7 shows a worksheet directly referenced, called "Inventory_Worksheet$."
Figure 14-7
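If you go the query route instead of referencing the worksheet directly, the statement has the same style as the earlier Excel 97-2003 example. This is only a sketch, and it assumes the Inventory_Worksheet sheet created earlier in the chapter:

-- Query an Excel 2007 (ACE) worksheet; the $ suffix denotes a sheet name
SELECT ProductID, PhysicalCount, LocationID, Shelf, Bin
FROM [Inventory_Worksheet$]
WHERE PhysicalCount IS NOT NULL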
Working with Access

MS Access is the departmental database of choice for countless individual users and small workgroups. It has many great features and wizards that enable a small application or prototype to be quickly developed. Often, when an application has outgrown its humble Access origins, discussions about moving the data to SQL Server emerge. Many times, the client will be rewritten as a web or desktop application using VB.NET or another language. Sometimes the plan is to link to the SQL Server tables, utilizing the existing Access front end. Unfortunately, if the original application was poorly designed, moving the data to SQL Server will not improve performance.

This section demonstrates how you can use SSIS to integrate with Microsoft Access. Access 2007 and later use the same ACE provider as Excel does, so as you work with Access in 32-bit or 64-bit mode, please refer to the 64-bit discussion of Excel in the previous section.
Configuring an Access Connection Manager

Once the Connection Manager is configured properly, importing from Access is simple. First, look at the steps required to set up the Connection Manager:
1. Create a new SSIS package and create a new Connection Manager by right-clicking in the Connection Managers section of the design surface.
2. Select New OLE DB Connection to bring up the Configure OLE DB Connection Manager dialog.
3. Click New to open the Connection Manager. In the Provider dropdown list, choose one of the following Access provider types:

➤ Microsoft Jet 4.0 OLE DB Provider (for Access 2003 and earlier)
➤ Microsoft Office 12.0 Access Database Engine OLE DB Provider (for Access 2007 and later)

If you do not see the Microsoft Office 12.0 Access Database Engine provider in the list, you need to install the 32-bit ACE driver described earlier. Click OK after making your selection.
4. The Connection Manager dialog changes to an Access-specific dialog. In the Server or file name box, enter the path to the Northwind database, C:\ProSSIS\Data\Northwind.mdb, as Figure 14-8 shows. You are using the Northwind MS Access sample database for this example.
5. By default, the database user name is blank, with a blank password. If user security has been enabled for the Access database, a valid user name and password must be entered. Enter the password on the All pane in the Security section. The user Password property is also available in the Properties window. Check the Save my password option.
6. If, conversely, a database password has been set, enter the database password in the Password property on the Connection pane. This also sets the Jet OLEDB:Database Password property found on the All tab.
7. If both user security and a database password have been set up, enter both passwords on the All pane. In the Security section, enter the user password, and enter the database password for the Jet OLEDB:New Database Password property. Check the Save my password option.
8. Be sure to test the connection and click OK to save the properties.
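For reference, the resulting connection strings look roughly like the following. The paths and password are placeholders, and the .accdb variant assumes the ACE provider described above:

JET provider for an .mdb file (Access 2003 and earlier), with a database password:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\ProSSIS\Data\Northwind.mdb;
Jet OLEDB:Database Password=MyPassword;

ACE provider for an .accdb file (Access 2007 and later):
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\ProSSIS\Data\Northwind.accdb;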
Figure 14-8
Importing from Access

Once you have the Connection Manager created, follow these steps to import from Access:
1. Using the project you created in the last section with the Access Connection Manager already configured, add a Data Flow Task to the Control Flow design area.
2. Click the Data Flow tab to view the Data Flow design area. Add an OLE DB Source Component and name it Customers.
3. Double-click the Customers icon to open the OLE DB Source Editor. Set the OLE DB Connection Manager property to the Connection Manager that you created in the last section.
4. Select Table or View from the Data access mode dropdown list. Choose the Customers table from the list under Name of the table or the view (see Figure 14-9).
5. Click Columns on the left of the Source Editor to choose which columns to import and change the output names if needed.
6. Click OK to accept the configuration.
7. Create a Connection Manager pointing to AdventureWorks.
8. Create an OLE DB Destination Component and name it NW_Customers. Drag the connection (blue arrow on your screen) from the Customers Source Component to the NW_Customers Destination Component.
Figure 14-9
9. Double-click the Destination Component to bring up the OLE DB Destination Editor and configure it to use the AdventureWorks Connection Manager.
10. You can choose an existing table or you can click New to create a new table as the Data Destination. If you click New, notice that the Create Table designer does not script any keys, constraints, defaults, or indexes from Access. It makes its best guess as to the data types, which may not be the right ones for your solution. When building a package for use in a production system, you will probably want to design and create the SQL Server tables in advance (see the sketch after these steps).
11. For now, click New to bring up the table definition (see Figure 14-10). Notice that the table name is the same as the Destination Component, so change the name to NW_Customers if you did not name the OLE DB Destination as instructed previously.
12. Click OK to create the new table.
13. Click Mappings on the left to map the source and destination columns.
14. Click OK to accept the configuration.
15. Run the package. All the Northwind customers should now be listed in the SQL Server table. Check this by clicking New Query in Microsoft SQL Server Management Studio. Run the following query (Access_Import.txt) to see the results:
USE AdventureWorks
GO
SELECT * FROM NW_Customers
Figure 14-10
16. Empty the table to prepare for the next example by running this query:

TRUNCATE TABLE NW_CUSTOMERS
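As mentioned in step 10, for a production package you would normally design the destination table yourself rather than rely on the Create Table designer. The following is only a sketch of what such a pre-designed table might look like; the column list mirrors the Northwind Customers table, but the data types and key are assumptions you would adjust for your own solution:

--A hand-designed destination table for the Northwind customer data (illustrative only)
CREATE TABLE dbo.NW_Customers (
    CustomerID   NVARCHAR(5)   NOT NULL PRIMARY KEY,
    CompanyName  NVARCHAR(40)  NOT NULL,
    ContactName  NVARCHAR(30)  NULL,
    ContactTitle NVARCHAR(30)  NULL,
    Address      NVARCHAR(60)  NULL,
    City         NVARCHAR(15)  NULL,
    Region       NVARCHAR(15)  NULL,
    PostalCode   NVARCHAR(10)  NULL,
    Country      NVARCHAR(15)  NULL,
    Phone        NVARCHAR(24)  NULL,
    Fax          NVARCHAR(24)  NULL
);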
Using a Parameter

Another interesting feature is the capability to pass a parameter from a package variable to a SQL command. The following steps demonstrate how:
Note: In Access, you can create a query that prompts the user for parameters at runtime. You can import most Access select queries as tables, but data from an Access parameter query cannot be imported using SSIS.
1. Select the package you started in the last section. Navigate back to the Control Flow tab and right-click the design area.
2. Choose Variables and add a variable by clicking the Add Variable icon. Name it CustomerID. Change the Data Type to String, and give it a value of ANTON (see Figure 14-11).
3. Close the Variables window and navigate back to the Data Flow tab.
Figure 14-11
Note: The design area or component that is selected determines the scope of the variable when it is created. The scope can be set to the package if it is created right after clicking the Control Flow design area. You can also set the scope to a Control Flow Task, Data Flow Component, or Event Handler Task.
4. Double-click the Customers Component to bring up the OLE DB Source Editor and change the data access mode to SQL Command. A SQL command text box and some buttons appear. You can click the Build Query button to bring up a designer to help build the command or click Browse to open a file with the command you want to use. For this example, type in the following SQL statement (Access_Import_Parameter.txt) (see Figure 14-12):

SELECT CustomerID, CompanyName, ContactName, ContactTitle,
       Address, City, Region, PostalCode, Country, Phone, Fax
FROM Customers
WHERE (CustomerID = ?)
Figure 14-12
5. The ? symbol is used as the placeholder for the parameter in the query. Map the parameters to variables in the package by clicking the Parameters button. Choose User::CustomerID from the Variables list and click OK (see Figure 14-13). Note that you cannot preview the data after setting up the parameter because the package must be running to load the value into the parameter.
Figure 14-13
Note: Variables in SSIS belong to namespaces. By default, there are two namespaces, User and System. Variables that you create belong to the User namespace. You can also create additional namespaces.
6. Click OK to accept the new configuration and run the package. This time, only one record is imported (see Figure 14-14). You can also return to SQL Server Management Studio to view the results:

USE AdventureWorks
GO
SELECT * FROM NW_Customers
Figure 14-14
If you wish to use multiple parameters in your SQL command, use multiple question marks (?) in the query and map them in order to the parameters in the parameter mapping. To do this, you set up a second package-level variable for CompanyName and set the value to Island Trading. Change the query in the Customers Component to the following (Access_Import_Parameter2.txt):

SELECT CustomerID, CompanyName, ContactName, ContactTitle,
       Address, City, Region, PostalCode, Country, Phone, Fax
FROM Customers
WHERE (CustomerID = ?) OR (CompanyName = ?)
Now the Parameters dialog will show the two parameters. Associate each parameter with the appropriate variable (see Figure 14-15).

Figure 14-15

Importing data from Access is a simple process as long as Access security has not been enabled. Often, porting an Access application to SQL Server is the desired result. Make sure you have a good book or other resource to help ensure success.
Importing from Oracle

Because of SQL Server's world-class reporting and business intelligence tools, more and more shops running Oracle rely on SQL Server for their reporting needs. Luckily, importing data from Oracle is much like importing from other sources, such as a text file or another SQL Server instance. In this section, you learn how to access data from an Oracle database with the built-in OLE DB provider and the Oracle client.
Oracle Client Setup

Connecting to Oracle in SSIS is a two-step process. First you install the Oracle client software, and then you use the OLE DB provider in SSIS to connect to Oracle. To be sure, the Microsoft Data Access Components (MDAC) that come with the operating system include an OLE DB provider for Oracle. This is the 32-bit Microsoft-written OLE DB provider to access an Oracle source system. However, even though the OLE DB provider is installed, you cannot use it until you install a second component, the Oracle client software. In fact, when you install the Oracle client software, Oracle includes an OLE DB provider that can be used to access an Oracle source. The OLE DB providers have subtle differences, which are referenced later in this section.
Installing the Oracle Client Software

To install the Oracle client software, you first need to locate the right download from the Oracle website at www.oracle.com. Click Downloads and then click the button to download 12c. Accept the licensing agreement and select your operating system. As you are well aware, there are several versions of Oracle (currently Oracle 11g, 11g Release 2, and 12c), and each has a different version of the Oracle client. Some of them are backward compatible, but it is always best to go with the version that you are connecting to. It is best to install the full client software in order to ensure that you have the right components needed for the OLE DB providers.
Configuring the Oracle Client Software

Once you download and install the right client for the version of Oracle you will be connecting to and the right platform of Windows you are running, the final step is configuring it to reference the Oracle servers. You will probably need help from your Oracle DBA or the support team of the Oracle application to configure this. There are two options: an Oracle names server or manually configuring a TNS file. The TNS file is more common and is found in the Oracle install directory under the network\ADMIN folder. The install directory is called the Oracle Home directory. The Oracle client uses the Windows environment variables %Path% and %ORACLE_HOME% to find the location of the client files. Either replace the default TNS file with one provided by an Oracle admin or create a new entry in it to connect to the Oracle server.
A typical TNS entry looks like this:

[Reference name] =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = [Server])(PORT = [Port Number]))
    )
    (CONNECT_DATA =
      (SID = [Oracle SID Name])
    )
  )
Replace the brackets with valid entries. The [Reference Name] will be used in SSIS to connect to the Oracle server through the provider.
64-Bit Considerations

As mentioned earlier, after you install the Oracle client software, you can then use the OLE DB provider for Oracle in SSIS to extract data from an Oracle source or to send data to an Oracle Destination. These procedures are described next. However, if you are working on a 64-bit server, you may need to make some additional configurations.

First, if you want to connect to Oracle with a native 64-bit connection, you have to use the Oracle-written OLE DB provider for Oracle, because the Microsoft-written OLE DB driver for Oracle is available only in 32-bit mode. Be sure you also install the right 64-bit Oracle client (Itanium IA64 or X64) if you want to connect to Oracle in native 64-bit mode. Although it is probably obvious to you, it bears mentioning that even though you may have X64 hardware, in order to leverage it in 64-bit mode, the operating system must be installed with the X64 version. Furthermore, even though you may be working on a 64-bit server, you can still use the 32-bit provider through the 32-bit Windows emulation mode. Review the 64-bit details in the "Excel and Access" section earlier in this chapter for details about how to work with packages in 32-bit mode when you are on a 64-bit machine. You need to use the 32-bit version of DTExec for package execution, and when working in SSDT, you need to change the Run64bitRuntime property of the project to False.
Importing Oracle Data

In this example, the alias ORCL is used to connect to an Oracle database named orcl. Your Oracle administrator can provide more information about how to set up your tnsnames.ora file to point to a test or production database in your environment. The following tnsnames file entry is being used for the subsequent steps:

ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = VPC-XP)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = orcl)
    )
  )
To extract data from an Oracle server, perform the following steps. These assume that you have installed the Oracle client and configured a tnsnames file or an Oracle names server.
1. Create a new Integration Services project using SSDT.
2. Add a Data Flow Task to the design area. On the Data Flow tab, add an OLE DB source. Name the OLE DB source Oracle.
3. In the Connection Managers area, right-click and choose New OLE DB Connection to open the Configure OLE DB Connection Manager dialog.
4. Click New to open the Connection Manager dialog. Select Microsoft OLE DB Provider for Oracle from the list of providers and click OK.
5. Type the alias from your tnsnames.ora file for the Server Name.
6. Type in the user name and password and check Save my password (see Figure 14-16). This example illustrates connecting to the widely available scott sample database schema. The user name is scott; the password is tiger. Verify the credentials with your Oracle administrator. Test the connection to ensure that everything is configured properly. Click OK to accept the configuration.
Figure 14-16
7. In the custom properties section of the Oracle component's property dialog, change the AlwaysUseDefaultCodePage property to True.
8. Open the OLE DB Source Editor by double-clicking the Oracle source component. With the Connection Manager tab selected, choose the Connection Manager pointing to the Oracle database.
9. Select Table or view from the Data access mode dropdown. Click the dropdown list under Name of the table or the view to see a list of the available tables. Choose the "Scott"."Dept" table from the list.
10. Select the Columns tab to see a list of the columns in the table.
11. Click Preview to see sample data from the Oracle table. At this point, you can add a Data Destination Component to import the data into SQL Server or another OLE DB destination. This is demonstrated several times elsewhere in the chapter, so it isn't covered again here.
Importing Oracle data is very straightforward, but there are a few things to watch out for. The current Microsoft ODBC driver and Microsoft-written OLE DB provider for Oracle were designed for Oracle 7, while, at the time of this writing, Oracle 12c is the latest version available. Specific functionality and data types that were implemented after the Oracle 7 release will probably not work as expected. See Microsoft's Knowledge Base article 244661 for more information. If you want to take advantage of newer Oracle features, you should consider using the Oracle-written OLE DB provider for Oracle, which is installed with the Oracle client software.
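For reference, the two providers surface as different connection strings in the package's Connection Manager. The sketches below reuse the ORCL alias and scott/tiger credentials from the earlier example; treat the exact property list as an assumption and let the Connection Manager dialog generate the final string:

    Microsoft-written OLE DB provider for Oracle (32-bit only, Oracle 7-era feature set):
    Provider=MSDAORA.1;Data Source=ORCL;User ID=scott;Password=tiger;Persist Security Info=True;

    Oracle-written OLE DB provider for Oracle (installed with the Oracle client, available in 64-bit):
    Provider=OraOLEDB.Oracle.1;Data Source=ORCL;User ID=scott;Password=tiger;Persist Security Info=True;

The Provider keyword is what actually selects the driver; everything else is the same alias and credentials you configured above.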
Using XML and Web Services

Although XML is not a common source for large volumes of data, it is an integral technology standard in the realm of data. This section considers XML from a couple of different perspectives. First, you will work with the Web Service Task to interact with a public web service. Second, you will use the XML Source adapter to extract data from an XML document embedded in a file. In one of the web service examples, you will also use the XML Task to read the XML file.
Configuring the Web Service Task

In very simple terms, a web service is to the web as a function is to a code module. It accepts a message in XML, including arguments, and returns the answer in XML. The value of XML technology is that it enables computer systems that are completely foreign to each other to communicate in a common language. When using web services, this transfer of XML data occurs across the enterprise or across the Internet using the HTTP protocol. Many web services — for example, stock tickers and movie listings — are freely available for anyone's use. Some web services, of course, are private or require a fee. Two common and useful applications are to enable orders or other data to be exchanged easily by corporate partners, and to receive information from a service — either one that you pay for or a public service that is exposed free on the Internet. In the following examples, you'll learn how to use a web service to get the weather forecast for a U.S. zip code by subscribing to a public web service, and how to use the Web Service Task to perform currency conversion. Keep in mind that the Web Service Task depends on the availability of the remote server; it can return errors if the server is unreachable or is experiencing internal errors.
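To make the request/response exchange concrete, the message the Web Service Task sends for the weather example below is a SOAP envelope along these lines. This is only an illustrative sketch; the element names follow the GetWeatherByZipCode method described later, but the exact namespaces come from the service's WSDL file, not from this book:

    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <!-- Method name and its single argument, as advertised in the WSDL -->
        <GetWeatherByZipCode xmlns="http://www.webservicex.net">
          <ZipCode>30303</ZipCode>
        </GetWeatherByZipCode>
      </soap:Body>
    </soap:Envelope>

The response comes back in the same envelope structure, with the forecast data in the body; the Web Service Task builds and parses these envelopes for you.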
Weather by Zip Code Example

This example demonstrates how to use a web service to retrieve data:
1. Create a new package and create an HTTP Connection by right-clicking in the Connection Managers pane and choosing New Connection.
2. Choose HTTP and click Add to bring up the HTTP Connection Manager Editor. Type http://www.webservicex.net/WeatherForecast.asmx?wsdl as the Server URL (see Figure 14-17). In this case, you'll use a publicly available web service, so you won't have to worry about any credentials or certificates. If you must supply proxy information to browse the web, fill that in on the Proxy tab.
3. Before continuing, click the Test Connection button, and then click OK to accept the Connection Manager.
Figure 14-17
4. Add a Web Service Task from the Toolbox to the Control Flow workspace.
5. Double-click the Web Service Task to bring up the Web Service Task Editor. Select the General pane. Make sure that the HttpConnection property is set to the HTTP connection you created in step 2.
6. In order for a web service to be accessed by a client, a Web Service Definition Language (WSDL) file must be available that describes how the web service works — that is, the methods available and the parameters that the web service expects. The Web Service Task provides a way to automatically download this file.
7. In the WSDLFile property, enter the fully qualified path c:\ProSSIS\Data\weather.wsdl where you want the WSDL file to be created (see Figure 14-18).
Figure 14-18
8. Set the OverwriteWSDLFile property to True and then click Download WSDL to create the file. If you are interested in learning more about the file's XML structure, you can open it with Internet Explorer. By downloading the WSDL file, the Web Service Task now knows the web service definition.
9. Select the Input pane of the Web Service Task Editor. Then, next to the Service property, open the dropdown list and select the one service provided, called WeatherForecast.
10. After selecting the WeatherForecast service, click in the Method property and choose the GetWeatherByZipCode option.
11. Web services are not limited to providing just one method. If multiple methods are provided, you'll see all of them listed. Notice another option called GetWeatherByPlaceName. You would use this if you wanted to enter a city instead of a zip code. Once the GetWeatherByZipCode method is selected, a list of arguments appears. In this case, a ZipCode property is presented. Enter a zip code location of a U.S. city (such as 30303 for Atlanta, or, if you live in the U.S., your own zip code). See Figure 14-19.
12. Now that everything is set up to invoke the web service, you need to tell the Web Service Task what to do with the result. Switch to the Output property page of the Web Service Task Editor. Choose File Connection in the dropdown of the OutputType property. You can also store the output in a variable to be referenced later in the package.
Figure 14-19
13. In the File property, open the dropdown list and choose <New Connection...>.
14. When you are presented with the File Connection Manager Editor, change the Usage type property to Create file and change the File property to C:\ProSSIS\Data\weatheroutput.xml, as shown in Figure 14-20.
Figure 14-20
15. Select OK in the File Connection Manager Editor, and OK in the Web Service Task Editor to finish configuring the SSIS package.
Now you're ready to run the package. After executing it, wait for the Web Service Task to complete successfully. If all went well, use Internet Explorer to open the XML file returned by the web service (c:\ProSSIS\data\weatheroutput.xml) and view the weather forecast for the zip code. It will look something like this:

    <WeatherForecasts>
      <Latitude>33.7525024</Latitude>
      <Longitude>84.38885</Longitude>
      <AllocationFactor>0.000285</AllocationFactor>
      <FipsCode>13</FipsCode>
      <PlaceName>ATLANTA</PlaceName>
      <StateCode>GA</StateCode>
      <Details>
        <WeatherData>
          <Day>Thursday, September 01, 2011</Day>
          <WeatherImage>http://forecast.weather.gov/images/wtf/nfew.jpg</WeatherImage>
          <MaxTemperatureF>93</MaxTemperatureF>
          <MinTemperatureF>72</MinTemperatureF>
          <MaxTemperatureC>34</MaxTemperatureC>
          <MinTemperatureC>22</MinTemperatureC>
        </WeatherData>
        <WeatherData>
          <Day>Friday, September 02, 2011</Day>
          ...
        </WeatherData>
      </Details>
    </WeatherForecasts>
The Currency Conversion Example

In this second example, you learn how to use a web service to get a value that can be used in the package to perform a calculation. To convert a price list to another currency, you'll use the value with the Derived Column Transformation:
1. Begin by creating a new SSIS package. This example requires three variables. To set them up, ensure that the Control Flow tab is selected. If the Variables window is not visible, right-click in the design area and select Variables. Set up the three variables as shown in the following table. At this time, you don't need initial values. (You can also use package parameters instead of variables for this example.)

       Name              Scope      Data Type
       XMLAnswer         Package    String
       Answer            Package    String
       ConversionRate    Package    Double

2. Add a Connection Manager pointing to the AdventureWorks database.
3. Add a second connection. This time, create an HTTP Connection Manager and set the Server URL to http://www.webservicex.net/CurrencyConvertor.asmx?wsdl.
Note: This web service was valid at the time of publication, but the authors cannot guarantee its future availability.
4. Drag a Web Service Task to the design area and double-click the task to open the Web Service Task Editor. Set the HTTPConnection property to the Connection Manager you just created.
5. Type in a location to store the WSDL file, such as c:\ProSSIS\data\CurrencyConversion.wsdl, and then click the Download WSDL button as you did in the last example to download the WSDL file.
6. Click Input to see the web service properties. Select CurrencyConvertor as the Service property and ConversionRate as the Method.
7. Two parameters will be displayed: FromCurrency and ToCurrency. Set FromCurrency equal to USD, and ToCurrency equal to EUR (see Figure 14-21).
Figure 14-21
8. Click Output and set the OutputType to Variable. The variable name to use is User::XMLAnswer (see Figure 14-22).
9. Click OK to accept the configuration.
Figure 14-22
Note: At this point, you may be interested in viewing the XML that is returned from the web service. You can save the XML in a file instead of a variable. Then, after running the task, examine the file. Alternately, you can set a breakpoint on the task and view the variable at runtime. See Chapter 18 to learn more about breakpoints and debugging.
The value of the XML returned will look something like this:

    <double xmlns="http://www.webserviceX.NET/">0.836</double>
10. Because (for the sake of the example) you just need the number and not the XML, add an XML Task to the designer to evaluate the XML.
11. Drag the precedence constraint from the Web Service Task to the XML Task, and then open the XML Task Editor by double-clicking the XML Task.
12. Change the OperationType to XPATH. The properties available will change to include those specific to the XPATH operation. Set the properties to match those in the following table:
    Section             Property                Value
    Input               OperationType           XPATH
                        SourceType              Variable
                        Source                  User::XMLAnswer
    Output              SaveOperationResult     True
    Operation Result    OverwriteDestination    True
                        Destination             User::Answer
                        DestinationType         Variable
    Second Operand      SecondOperandType       Direct Input
                        SecondOperand           /
    Xpath Options       PutResultInOneNode      False
                        XpathOperation          Values
A discussion about the XPATH query language is beyond the scope of this book, but this XML is very simple with only a root element that can be accessed by using the slash character (/). Values are returned from the query as a list with a one-character unprintable row delimiter. In this case, only one value is returned, but it still has the row delimiter that you can’t use. You have a couple of options here. You could save the value to a file, then import using a File Source Component into a SQL Server table, and finally use the Execute SQL Task to assign the value to a variable; but in this example, you will get a chance to use the Script Task to eliminate the extra character:
1. Add a Script Task to the design area and drag the precedence constraint from the XML Task to the Script Task.
2. Open the Script Task Editor and select the Script pane.
3. In order for the Script Task to access the package variables, they must be listed in the ReadOnlyVariables or ReadWriteVariables properties (as appropriate, depending on whether you will be updating the variable value in the script) in a semicolon-delimited list. Enter User::Answer in the ReadOnlyVariables property and User::ConversionRate in the ReadWriteVariables property (see Figure 14-23).
4. Click Edit Script to open the code window. A Microsoft Visual Studio Tools for Applications environment opens. The script will save the value returned from the web service call to a variable. One character will be removed from the end of the value, leaving only the conversion factor. This is then converted to a double and saved in the ConversionRate variable for use in a later step.
Figure 14-23
5. Replace Sub Main with the following code (Currency_Script.txt):

       Public Sub Main()
           ' Grab the XPath result that the XML Task stored in the Answer variable
           Dim strConversion As String
           strConversion = Dts.Variables("User::Answer").Value.ToString

           ' Strip the trailing one-character row delimiter left by the XPath Values operation
           strConversion = strConversion.Remove(strConversion.Length - 1, 1)

           ' Convert the remaining text to a Double for use by the Derived Column expression
           Dts.Variables("User::ConversionRate").Value = CType(strConversion, Double)

           ' ScriptResults is the enum defined in the Script Task's default template
           Dts.TaskResult = ScriptResults.Success
       End Sub
6. Close the scripting environment, and then click OK to accept the Script Task configuration.
7. Add a Data Flow Task to the design area and connect the Script Task to the Data Flow Task. The Control Flow area should resemble what is shown in Figure 14-24.
8. Move to the Data Flow tab and add a Connection Manager pointing to the AdventureWorks database, if you did not do so when getting started with this example.
9. Drag an OLE DB Source Component to the design area.
10. Open the OLE DB Source Editor and set the OLE DB Connection Manager property to the AdventureWorks connection. Change the data access mode property to SQL Command. Type the following query (Currency_Select.txt) in the command window:
    SELECT ProductID, ListPrice
    FROM Production.Product
    WHERE ListPrice > 0
Figure 14-24
11. Click OK to accept the properties.
12. Add a Derived Column Transformation to the design area.
13. Drag the Data Flow Path from the OLE DB Source to the Derived Column Component.
14. Double-click to open the Derived Column Transformation Editor dialog. Variables, columns, and functions are available for easily building an expression. Add a derived column called EuroListPrice. In the Expression field, type the following (Currency_Expression.txt):

        ListPrice * @[User::ConversionRate]

15. The Data Type should be a decimal with a scale of 2. Click OK to accept the properties (see Figure 14-25).
16. Add a Flat File Destination Component to the Data Flow design area. Drag the Data Flow Path from the Derived Column Component to the Flat File Destination Component.
17. Bring up the Flat File Destination Editor and click New to open the Flat File Format dialog.
18. Choose Delimited and click OK. The Flat File Connection Manager Editor will open.
19. Browse to or type in the path to a file, C:\ProSSIS\data\USD_EUR.txt. Here you can modify the file format and other properties if required (see Figure 14-26). Chec