Cryptographic hash functions take an arbitrary number of input bytes and reduce them to a fixed-size output. The size of that output depends on the function you use, be it MD4, MD5, SHA-1, etc. I prefer to use SHA-1 since it is the most secure of the three listed above. There are many other cryptographic hash functions, but they are not included in all libraries.
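To make the "fixed size" point concrete, here is a minimal sketch in Python using the standard hashlib module. It is only an illustration and not part of the application described below; the helper name digest_sizes and the sample inputs are made up for this example. It hashes a 3-byte input and an 80,000-byte input and reports the digest length in bytes for each function.

```python
import hashlib


def digest_sizes(data: bytes) -> dict:
    """Return the digest length in bytes for MD5 and SHA-1 over `data`.

    Hypothetical helper for illustration only.
    """
    return {
        "md5": len(hashlib.md5(data).digest()),
        "sha1": len(hashlib.sha1(data).digest()),
    }


short_input = b"abc"
long_input = b"x" * 80_000  # made-up stand-in for a much larger input

# The digest length depends on the function, not on the input length.
print(digest_sizes(short_input))
print(digest_sizes(long_input))

# The hex form of a SHA-1 digest is 40 characters (20 bytes).
print(hashlib.sha1(short_input).hexdigest())
```

Both inputs should report the same lengths: 16 bytes for MD5 and 20 bytes for SHA-1.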
I helped develop an application that stores documents in SQL Server (v2005), from which they are retrieved for use by a web-based front end. I had the task of co-developing the back-end pieces: table structures, indexes, relationships, and an SSIS package to manage the load. The design is straightforward: documents are stored in a varbinary column, with ancillary data stored in additional tables to facilitate lookups. Each document is currently a .pdf of around 80 KB. The current logic states that we keep only one copy of a given document, and the first copy we get is the one we keep. This is easy to maintain: I do a lookup in the ancillary tables before loading the "new" .pdf, and if the document already exists, I don't bother loading the new one. Then came the curve.