Background:
I have a data model that contains System-Versioned Tables, one of which required an initial history load of 50+ million rows. The tool I used to load the data is an SSIS package run via Azure Data Factory using the Azure-SSIS Integration Runtime.
I created an SSIS package to accomplish this task, which was deployed as part of a project to the SSIS catalog residing on an Azure SQL Database.
The package uses an OLE DB Source connection to a SQL Server 2014 database residing on an Azure VM and a destination connection to an Azure SQL Database. The database deployment includes a staging table with one index and a system-versioned dimension table with two indexes: one supporting the system-generated primary key, and another on the columns used for extract and reporting.
The dilemma:
It was taking over 2 hours to load the data into the staging table, and I assumed it would take another 2+ hours to load the data into the destination tables. I wanted to increase load performance and cut the overall load time, so I decided to drop all the indexes during the load process to increase throughput. To do that, I had to re-sequence the table-creation process for the system-versioned table, because a table must already have a primary key when system versioning is enabled on it.
The solution:
I dropped and recreated the dimension table as a non-system-versioned table that still contains the columns necessary for system versioning. I also added an identity column that would eventually become the table's primary key. At this point the table contains no indexes.
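One wrinkle in the drop step: a table cannot be dropped while system versioning is on, so versioning has to be switched off first. A minimal sketch of that step, assuming the table and history table carry the same sample names used below:

-- Detach the history table by turning versioning off; both tables become regular tables
ALTER TABLE [dim].[TableTest] SET (SYSTEM_VERSIONING = OFF);
-- Drop the current table, and the detached history table if one already exists
DROP TABLE [dim].[TableTest];
DROP TABLE [history].[TableTest];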
Dimension table create logic containing system-versioning columns:
CREATE TABLE [dim].[TableTest]
(
    [Id] BIGINT IDENTITY(1, 1) NOT NULL,
    [COL1] BIGINT NOT NULL,
    [COL2] BIGINT NULL,
    [COL3] VARCHAR(25) NULL,
    [RowStartTimeUTC] DATETIME2 GENERATED ALWAYS AS ROW START DEFAULT (GETUTCDATE()) NOT NULL,
    [RowEndTimeUTC] DATETIME2 GENERATED ALWAYS AS ROW END DEFAULT (CONVERT([datetime2], '9999-12-31 23:59:59.9999999')) NOT NULL,
    PERIOD FOR SYSTEM_TIME ([RowStartTimeUTC], [RowEndTimeUTC])
) ON [PRIMARY];
I updated the SSIS package to do the following:
Staging-table load process
- Execute SQL Task to drop the index(es) on the staging table.
- Data Flow Task to load data from the source table into the staging table.
- Execute SQL Task to re-create an index on the columns needed to increase lookup performance (both index steps are sketched after this list).
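As a rough illustration only, those index steps might look like the following; the staging table and index names here are placeholders, not the actual objects from the project.

-- Before the Data Flow Task: drop the index so rows land in a plain heap with no index maintenance
DROP INDEX IF EXISTS [StageTableTest_IX1] ON [stage].[TableTest];

-- After the Data Flow Task: re-create the index on the columns the dimension load looks up
CREATE NONCLUSTERED INDEX [StageTableTest_IX1]
    ON [stage].[TableTest] ( [COL1] ASC, [COL2] ASC );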
Dimension-table load process
- Execute SQL Task containing an INSERT statement to load the dimension table from the staging table; my thought was to do all of the work on the database server (a sketch of this INSERT appears after this list).
- Execute SQL Task that contains the following SQL code:
- Make the identity column the primary key of the dimension table
ALTER TABLE [dim].[TableTest] ADD PRIMARY KEY CLUSTERED ( [Id] ASC ) ON [PRIMARY];
- Turn on system-versioning for the dimension table
ALTER TABLE [dim].[TableTest] SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [history].[TableTest]));
- Create an additional index for reporting performance
CREATE NONCLUSTERED INDEX [DimTableTest_IX1]
    ON [dim].[TableTest] ( [COL1] ASC, [COL2] ASC, ..., [COLN] ASC )
    WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
    ON [PRIMARY];
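For completeness, the staging-to-dimension insert from the first step of this list could be as simple as the set-based sketch below; the column names follow the sample table above and are assumptions, not the real model.

-- Single set-based insert executed on the server; [Id] (identity) and the
-- period columns are populated automatically, so none of them are listed
INSERT INTO [dim].[TableTest] ( [COL1], [COL2], [COL3] )
SELECT [COL1], [COL2], [COL3]
FROM [stage].[TableTest];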
Conclusion
By dropping all the indexes during the load process, I was able to cut the combined load time for the staging and dimension tables from 4+ hours down to 2 hours. So, how's your load time? If you still have questions about optimizing any load process, please don't hesitate to reach out to us. We'd love to help you get started.
By Paul Olimpo