The first time you had to load a Slowly Changing Dimension table with SSIS you probably thought no problem. There is a transformation for that. Let us take a look at the pros and cons and see how it compares to using the hash function.
Why use an SCD Transformation?
• External Feed
• Legacy system with no timestamp
• Change Data Capture not available
• Wizard creates necessary code.
• Customization has to be done again if wizard is rerun.
• Emphasis is on Slow.
SCD with Hash Function
• Very Fast
• Initial Setup has a few more steps then the wizard
• Source table with 64,000 records of name and address data.
• 1344 updates (2%)
• 32 new records (0.05%)
• Type 1 SCD. Change data, no history.
• Staging table with new source records
• Target table to be updated
SCD Transformation test
Execute the wizard to map input and output tables.
Execution time: Almost 23 minutes.
SCD with Hash Setup
• Add Hashcode nvarchar(255) to target table
• Update hashcode with hash of data fields
• Create an empty update table in Staging DB that is a copy of the target table.
hashcode = hashbytes(‘SHA1’, FirstName + LastName + Address1 + StateCode + city
+ cast(zipcode as nvarchar(10)) + comment)
Target table already contains Hash Code for each row:
When selecting from the source table, calculate the Hash Code.
,cast(hashbytes(‘SHA1’, FirstName + LastName + Address1 + StateCode
+ city + cast(zipcode as nvarchar(10)) + comment) as nvarchar(255)) as hashcodealpha
order by custID
Control flow for hash SCD:
Data Flow for Hash SCD:
In the Lookup transformation, get ID and Hashcode from production table. If error (no ID in production table) perform and insert.
In the Condition Split transformation, if the hashcodes are different write to update table. If they are the same ignore record.
Execute the update script for all changed records:
SET FirstName = u.FirstName
,LastName = u.LastName
,Address1 = u.Address1
,StateCode = u.StateCode
,City = u.City
,ZipCode = u.ZipCode
,Comment = u.Comment
,hashcode = hashbytes(‘SHA1’, u.FirstName + u.LastName + u.Address1 + u.StateCode + u.city + cast(u.zipcode as nvarchar(10)) + u.comment)
from t_customer t
join u_customer u
on t.custid = u.custid
Hash SCD Results:
Execution time: just over 3 seconds
As you can see, a little extra setup with an update table and writing a hash function can increase your throughput almost 400 times. On a small table the update went from 23 minutes down to 3 seconds. The hash function also reduces the number of fields you have to compare to one. If you have a very wide dimension table, the proportional savings will be even greater.
Give it a try. It will provide immediate benefits.
by Kevin Bucher