I built a system that ingests .PDF files which are of varying sizes (100-400k on avg) which represent various types of files (invoices,statements, pro-formas). I can ingest any binary file of any size, but we currently only use .pdfs. This has an interesting SSIS package (which I will discuss after the table structure) The table structure has many columns, but the ones of interest for #TSQL2sDay are the following:
- [C002_Data] [varbinary](max) NOT NULL -- stores pdf
- [C002_Fingerprint] [binary](20) NULL -- stores 20 byte hash of pdf
- [C002_webunique] [int] IDENTITY(1,1) NOT NULL --some web apps have issues with the fingerprint
Users of the front end enter a bunch of criteria to narrow down the .pdf files they want to see. Then they click on the row in the front end and that grabs the webunique number which efficiently pulls the Data column for display. One finger print, one binary (.pdf for now)
A portion SSIS package looks like this:
As you can the C002_FingerPrint column is used to not only pull the .pdf for viewing, but to validate that the .pdf going into the table is not duplicated. Using SHA1, there is a 1 in 2^80 chance that two totally different .pdf files, or any binary file, run through the algorithm will have the same 20 byte binary hash. This works incredibly well for us. We don't have to parse through the binary file looking for bits and pieces and we can include a new binary (say a word or excel doc) without having to build a new library to peek inside. The script used to calculate the hash is explained below.
The binary file is pulled in using the Import Column component and then passed down the stream to this script which then builds the hash using the C# crypto library.
Later on down the line we actually push the data into a sql table, not via the standard OLEDB Destination object, but via a custom Script Destination Component, so we can get an auto-increment number (seems not all of the .net front ends can easily grab a 20 byte binary, go figure). I will go into this custom script in a future post.