Cryptographic Hash Functions

Cryptographic hash functions take an arbitrary number of input bytes and reduce it to a fixed size. This resultant size is dependent upon the function you use, be it MD4,MD5,SHA-1, etc. I prefer to use SHA-1 since it is the most secure of the 3 listed above. There are many other cryptographic functions, but they are not included in all libraries.

SHA-1 has an output size of 160 bits (20bytes) for an input as large as 2^64-1 bytes. The output size of 160 bits means that the chance of a two completely random input streams to produce the same hash is 1 in 2^80 ( or 2^(output size/2)). I have not yet worked on a data set that large, so it's good enough for me to use in this fashion; digital fingerprint. Let's say you have a customer table withfirstname,lastname,address1-address5, city,state,zip. You want to update the record every time something changes in any of the 10 columns. If you happen to have a trigger on the table, then you just do your updates based upon the records that changed, which is happily provided to you by the trigger. If on the other hand you are working with very large data sets or don't have the option of a trigger you need an effective way to determine which records have changed. You might not want to run a 10k record update to catch the 100 or so that actually changed. Here is where the hash comes into play. Depending upon the function you are using (either SQLServer's built-in HashBytes) or through SSIS and a .Net function, you want to take all of the columns you that can change and condense them into a single value. In the example above, which was allchar(x)fields, you can simply do firstname+lastname+address1+address2....Feed the entire concatenation into the function and you have a digital fingerprint. If anything in any of the fields change, then the 'fingerprint' will be different and a simple comparison (orig.digitalprint!=delta.digitalprint) enables you to identify the record. With SQLServer 2005 sp2,HashBytes, is deterministic so you can make a computed column which will calculate the HashByte of each record automatically. You can then run a sql statement which will do an update if the hashes match or an insert if they don't. This is a very good unique key, providing very random placement (the purpose of a cryptographic function is to be as truly random as possible) versus adding a incrementing column simply to provide a unique index.