One-Way hashing, HashBytes, and SSIS

I helped develop an application that stores documents within SQLServer (v2005) and are then retrieved for use by a web-based front end. I had the task of co-developing the back end pieces; table structures, indexes, relationships, and anSSISpackage to manage the load. The design is straight forward: Documents are stored in a varbinary column with ancillary data stored in additional tables to facilitate look ups. The current document form is a .pdf around 80k each. The current logic states that we keep only one copy of a given document and the first copy we get is the one we keep. It is easy to maintain in that I do a look up in the ancillary tables prior to loading up the "new" .pdf and if this document exists, I don't bother loading up the new one. Then came the curve.

Since this is application stores various business documents (and any physical type .pdf,.xls,.doc, etc) and those documents have various business rules..a new rule came to bear. A new business document had a rule that stated we needed to maintain a certain number of distinct copies. Unlike the previous document type whose rule stated you keep the first copy, this one stated you had to maintain multiple, let's say the last ten. To satisfy this business need I had to figure a way to keep 10 DIFFERENT copies, not 10 identical copies. How do I determine if a binary file has changed from one version to the next; by using a cryptographic hash function (I'll explain more about cryptographic hash functions in a soon to come post).  HashBytes to the rescue, or so I thought.

SQLServer 2005 has a function called HashBytes which takes in the name of a known hash function (saySHA-1) and an input of varbinary(max) and hands back avarbinaryof the appropriate size (depending upon the hash function you use). Well, it seems as though BOL has a bit of a misprint. Turns out HashBytes has a max input size of 8000 bytes, which makes it unsuitable for me (remember the avg size of a document is 80k). This wouldn't be as simple as adding a computed column (although this could only be done in SP2, since it seems prior to that HashBytes was considered non-deterministic). Back to SSIS and the load process.

SSIShas two very powerful 'widgets'. The script task widget and the script component widget. The script component widget is the one to solve this task. Within the widget I can create a VB .NET component and VB .NET has a few cryptographic hash functions in

the System.Security.Cryptography library. The code reads in a blob from the input stream and calculates a hash, which it then passes down the stream for the rest of the SSIS package to use.


Public ClassScriptMain
DimmySHA1 As NewSHA1CryptoServiceProvider()

Public Overrides Sub Input0_ProcessInputRow(ByValRow As Input0Buffer)
Dimblobdata() As Byte
DimfSizeAs Int32

fSize= Convert.ToInt32(Row.c002data.Length)
blobdata= Row.c002data.GetBlobData(0,fSize)
End Sub

End Class

In my script component I have an incomingdataflowcolumn called c002data,which contains the .pdf. I have an outgoing dataflow column called HashValue. I get the entire blob out of the stream and into the blob data array, which I then pass to theSHA1 function which computes the hash and sends it on it's merry way! I now can check to see if the digital document has changed and if it has, add it to the table. This is a simple, quick, and efficient way to manage binary data.