Simple Series:Data Flow Properties - Misc

Many times people overlook some of the Misc Properties of given components.  The Data Flow Component is one of the most powerful components and is heavily used.  It's how data is moved from table to table, server to server, and manipulated to suit data transformations.  There are 5  properties I will be talking about in this post:


The first two properties, BLOB & BufferTempStoragePath are used when you run out of memory space to store either BLOBs (such as imported documents) or the buffers you request (we will see that in the paragraphs to come).  It is a good practice to set these to known locations (be it locally or remotely) so that you have accountability when it comes to performance and troubleshooting.  If you allow the defaults to take effect, you might run out of TMP space, because you didn't account for another process that uses it and you spend a few hours trying to tie out why your package failed.  If you don't do BLOB work, then you don't need to set it, but it doesn't hurt.

 The next two rows are many times overlooked by people and can be somewhat confusing.  They actually work against one another, that is to say the system will pick the lesser of the two to satisfy what it does.  Here's how it works.

The way the defaults are set up, when you have and any source component, SSIS will read in no more than 10,000 records (DefaultBufferMaxRows)OR the number of records which fill 10MB (DefaultBufferSize).  Note the DefaultBufferSizeis in bytes.  Make sense?  I tend to look at the avg number of records in my source and then the avg record size and toggle the numbers until I get the number of records per pull from the source.  There is a 100MB limit for the DefaultBufferSizeso you will also have a limit on the number of records you can pull at each cycle.  If you turn on logging while tuning your package, you only need the Diagnostic section.  That gives you some nice information about how many buffers are being used, how much memory, etc.  It's a little more elegant than setting the max rows to say 20k and run your package, to find you only pull 12k rows, so you up your memory by say 5MB and rerun.  Here is the section in the logging choices:


The last property is the EngineThreads, which defaults to 10 in SSIS 2008 and 5 in SSIS 2005.  In SSIS 2008 the range is 2-1000, while in SSIS 2005 it is 2-60.  This basically tells SSIS the maximum number source/destination objects can be run at the same time within a given Data Flow object.  Now you might be thinking , "I only have a 4 processor box, how can that accomodate 10 simultaneous source/destinations?".  In a given flow of source-to-destination, I might have to wait on a transformation or on a destination object (maybe I have a block in sql server on my OLEDB Destination) in that case I have a free thread that can do other work.  There is always a little time to slice away from one component and give to another.   Do not confuse this with MaxConcurrentExecutables, which is the number of concurrent Control Flow components you can run at the same time.  An example would be you have a two processor box and you leave the settings on their default.  If you have a singleData Flow component with say 5 source/destination paths you are using a single CPU to run 10 threads.  It would be much more efficient to split that into two and have the 10 threads across 2 cpus.

The best way to determine what the above components should be set at is testing,testing, and more testing.  I personally don't like to pack dozens of source/destinationn objects into a single Data Flow component.  I tend to like to keep them grouped by some common function.  I might decide to break up a single 10 source/destination Data Flow component into two that each has 5 source/destination objects.  Why?  Well, for one thing, you cannot disable any of the components within a Data Flow component, so troubleshooting became that much more difficult :)  Also, keep in mind that while the EngineThreads setting is a max, the others are not.  Your memory/buffer settings are Per Source!  That means that it will try to get as many records as it can within those parameters.  If you have some Data Flow components that don't need heavy lifting, move them to another one with smaller constraints.  Keep in mind what I wrote above about how the CPUs are divided up vs the threads.

 Performance tuning is all about resource use.  Take the time to plan out your workflow, which will enable you to effectively use the resources you have.