I posted a while back about how to you a cryptographic hash as a way to identify differences in BLOBs for storage within SQL Server. Looking back I realized it was more of an abstract explanation and I was just recently asked for real world example, so here it is.
The first thing I do I pick a column from the 'stream', which is a way of saying 'all of the columns that have been acquired in the SSIS package until now. This would be any columns that are selected from a source and/or added via derived columns. I have imported my .pdfs under the column name 'FileData'. The image below shows how I select that as the source for my script.
I now have to do a little manual labor to set up the script so it can process the incoming data and properly produce an output column. If you notice I created an output column called 'Fingerprint', it is of type DT_BYTES and it is of length 20. I know the length because the SHA algorithm produces a 20 byte (160 bit) output regardless of the input (see my post about cryptographic hash functions )
Below is the simple one line script that takes my input column FileData and transforms it into Fingerprint
Now that I have Fingerprint in my downstream, I can use it like any other derived column. Here is use it in a Lookup object to check if that specific Fingerprint exists. If it does, I toss it out, otherwise I insert that with a few other columns later on in the package.