The title should actually read “8 Random Ways to Optimise SSIS”.
One of the recent project I have been working on involved building a distributed (scaled-out) SSIS environment, this means multiple VMs with a standalone SSIS (2012 in Package mode) instances installed (so no SQL Server Database Engine), all pushing massive amount of data to a staging database.
I have been brought in on this project to suggest a few techniques to improve the performance of the scaled-out SSIS environment, by basically increasing the throughput to the staging database, below I discuss some of the general approaches I have taken to achieve that goal. Some of the advice might be a bit random, and others might not be pertinent to your particular situation, but over-all you should find a gem or two on optimising SSIS performance in there somewhere!
Minimize SSIS File Movement
This was a big issue in the environment I was analyzing, considering the SSIS packages are processing and transforming raw text files, there was a lot of data movement in order to distributed the files to the SSIS VMs for processing, then back to the distribution server for archiving.
Considering there was a large number of files, the SSIS process was having to share I/O and Network resources with the file copy process, causing sluggishness while both processes were being executed at the same time.
Ideally all file movement should be minimized if I/O is a bottleneck in your SSIS environment, this can be accomplished through fast access shared storage (that appears as if it was local to each of the SSIS VMs), HyperV has a clustered disk solution that means each VM would see the disk as local storage, but in reality they are shared across multiple VMs that can edit files concurrently. It is important to note that normal network share causes files to be copied locally (to where the SSIS process is being executed) before execution can begin, which causes a huge delay in the throughout.
Another thing to consider is that if all the file movement steps are required in your execution logic, for example in my scenario we had:
Distribution Box (Downloading data) –(move)> SSIS Box (Processing data) –(move)> Distribution Box (Archiving data)
We reworked this path to minimize network traffic by copying the file from the distribution box to the SSIS box, while archiving it the same time, once SSIS finished processing a file it will just delete it. This means that files are transferred over network only once.
Additionally, SSIS, as with many data processing solutions, works better on larger files rather than a large number of small files, this is because it takes time to establish a connection to a new file each time, hence to shave off these extra milliseconds, it is recommended to combine the data files into smaller number of files. It makes a difference when processing very large number of files.
Drop/Create Indexes on Staging Tables
Pushing data into an indexed table will take much more time than pushing the data into an empty table and then rebuilding indexes afterwards.
This is because while pushing data into an indexed table, the database engine is having to build the index alongside the updates to the table, which causes a considerable delay.
Additionally, and depending on your insert pattern, if you are inserting data into an indexed table you might end-up with a fragmented index, which is slower to access and deal with than a non-fragmented index (obviously), hence it is highly recommended to follow the index drop/recreate procedure below:
- Drop all indexes on destination table
- Push data (perform ETL operations)
- Create all indexes on destination table
Dedicate VM Resources
I have noticed, while analyzing the live environment I was trying to optimise, that VMWare (which is our virtualization provider) seems to de-allocate and reallocate CPU resources on the SSIS boxes depending on the activity (and the activity trend over time), this is to allow the host to re-allocate that resource around the VM stack, giving boxes that require more resources a way to get what they need, while maintaining operations on existing VMs.
This is all well and good, if you have a steady workload that doesn’t change over time, but if it does, then the box will suffer from “resource suffocation” until the VM controller decides to re-allocate the resources back to the box. This re-allocation will take some time during which your process is probably running on very low throughout.
If you have a variable load, it is highly recommended to allocate non-reclaimable resources (Memory and CPU) to those boxes, this will reduce resource contention due to the dynamic VM resource re-arrangement and provide consistent performance figures.
Enable Bulk Operations on the SSIS Login
By providing the login being used to load data into the database from SSIS (the Destination Connection login) with the BulkAdmin server-role you will enable SSIS to load the data in bulk, if it is possible.
This recommendation has to be in synergy with utilizing the fast load data access mode in the Destination component of the SSIS’s Data Flow, which allows SSIS to utilize the BULK INSERT statement.
Set FastParse Option on the Flat File Source
The give-away is in the name, the FastParse option allows the Flat File data source output to be parsed more quickly (sometimes significantly) at the expense of supporting locale-specific data formats (such as numeric, date and time data). This is an option definitely worth considering if you do not have locale specific data or can control the format of the input files. Also, it is something worth keeping in mind when building an output file that is intended to be consumed by an SSIS package.
The option can be located for each column (or node) under the following path: Flat File Source -> Show Advance Editor -> Input and Output Properties (tab) -> Flat File Source Output (node) -> Output Columns (node).
Configure and Tweak SSIS Destination Operators
There are a few options you could tweak on the SQL Server Data Destination to better accommodate the data you are trying to load, these are:
- MaximumInsertCommitSize: Used to determine how much data is pushed into the table before SSIS commits the transaction, by default this is set to 2147483647, but should be tweaked to better suite the type of data being pushed. This value’s effect becomes more important in an environment were multiple SSIS packages are loading data into one table, and each package is trying to take a table lock on the destination table. Its important to note that this value also influence how much is rolled-back if the SSIS fails before committing the transaction.
- Check Constraints: This flags whether the destination component should check the constraint imposed on the destination table when inserting data, un-ticking this box will speed up loading into the destination table, in the cases were there are constraints on the table.
Configure Connection Packet Size Property
By default, an SSIS connection is setup with a packet transfer size (Packet Size) of 4,096 bytes (4KB), this is all well and good for old networks that can’t handle larger packet size contention, but these days you will be hard done to find an enterprise grade network that isn’t capable of handling packet sizes 8 times the size of that.
Packaging data into bigger size packets over the network will reduces the overhead associated with the packaging and unpackaging of those packets, leading to better throughput.
It is generally recommended to increase this value to 32,767 (32KB) which is the maximum allowed value. It is highly recommended to monitor your network and ensure no contention or packet loss is happening due to this newly configured value.
Destination Table Partitioning
The importance of partitioning can not be stressed enough in a scaled-out SSIS environment. Whether each SSIS package inserts into a separate staging table which then gets switched into one staging table (ideal), or SSIS itself inserts into specific partitions of the staging table (you’ll need to take out table-lock escalation), partitioning (when done correctly) should eleminate any bottlenecks to do with sql engine table lock specific waits, which can be considerable in a multi-ssis environment.
There is an excellent article by Len, Tim and David from Microsoft that discuss a partitioning approach when loading huge amount of data in SSIS that I recommend reading.
Further Reading on SSIS Performance Optimisation
- Microsoft’s recommendation’s on improving data flow performance.
- SSIS best practices by Denny Lee and Thomas Kejser.
- SSIS best practices by Arshad Ali
- Threading SSIS by Jamie Thomson
- Parallelism in SSIS by Josef Richberg