Tuesday, February 9, 2010

Google's Fixed Content Store

Note to Google:

Google wants to make the world's information accessible, right? Here's a way to better do that.

Much of the world's information is contained in discrete documents. Google downloads and caches most of these documents already.

Every such document has a hash based solely on its content (SHA1, for example) that is unique for all practical purposes.

Google should make all its cached documents publicly available under a URL that is based on such a hash. This would just be a giant Fixed Content Storage system for the world to enjoy.

URLs might look something like http://fcs.google.com/8843d7f92416211de9ebb963ff4ce28125932878
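As a rough sketch of how anyone could derive such a URL from a document's bytes, here is a small Python example. The host name is simply the one used in the example above (no such service exists), and the function name fcs_url is made up for illustration.

```python
import hashlib

# Hypothetical base URL, copied from the example in this post.
FCS_BASE = "http://fcs.google.com/"


def fcs_url(content: bytes) -> str:
    """Build a content-addressed URL from a document's raw bytes.

    Only the bytes are hashed, so the same document always maps to the
    same URL, no matter who computes it or where.
    """
    digest = hashlib.sha1(content).hexdigest()  # 40 hex characters
    return FCS_BASE + digest


if __name__ == "__main__":
    # Any two people holding the same bytes get the same link.
    print(fcs_url(b"some document contents"))
```

Nothing here talks to a server: the link can be computed offline and emailed before the document has even been uploaded.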

This would give every document a URL that:
  • Is canonical and universal
  • Is globally unique
  • Can be generated locally (i.e., without querying any sort of directory), given the document

Even better, Google would allow users to upload arbitrary documents into this store. Initially the document's size would be charged against your quota. However, Google would also track hits to the URL. Documents that got many hits would stay in the cache, and go down in "quota cost".

Instead of emailing attachments to your friends, you could email links instead. Etc.

This clearly separates the job of storing a document from the job of referencing a document. Today those jobs are often mixed together due to technical limitations. For example, the only reason people attach files to emails is that there's no easy way to provide your email recipients a usable reference to a file sitting on your computer (which may be off the network, or just plain off).

These URLs would also be great for intermediate caches, such as enterprise proxy servers.

Now what if you're paranoid about hash collisions? Then a UUID or unique serial number could be created for each new document and used in the canonical URL, with an HTTP redirect created for the SHA1 version of the URL (and MD5, etc.). The SHA1 redirect would stay in place until the unlikely event that someone uploaded a different document with the same hash, at which point it would start returning an error page showing the multiple matches.
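The collision-safe scheme could be sketched in memory roughly like this. Everything here is an assumption made for illustration: the class name FixedContentStore, the use of uuid4 for canonical IDs, and the status codes (302 for the redirect, 300 for the "multiple matches" page, 404 for an unknown hash). The hash function is pluggable only so that a collision can be simulated.

```python
import hashlib
import uuid
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class FixedContentStore:
    """Toy store: canonical IDs are UUIDs, hashes are just redirects."""

    def __init__(self, hash_fn: Callable[[bytes], str] = sha1_hex) -> None:
        self._hash_fn = hash_fn
        self._docs: Dict[str, bytes] = {}  # canonical id -> content
        self._by_hash: Dict[str, List[str]] = defaultdict(list)

    def put(self, content: bytes) -> str:
        """Store a document and return its canonical ID."""
        digest = self._hash_fn(content)
        # The same bytes uploaded twice keep their existing ID.
        for doc_id in self._by_hash[digest]:
            if self._docs[doc_id] == content:
                return doc_id
        doc_id = uuid.uuid4().hex
        self._docs[doc_id] = content
        self._by_hash[digest].append(doc_id)
        return doc_id

    def resolve_hash(self, digest: str) -> Tuple[int, List[str]]:
        """Return (status, canonical ids) for a hash-based URL."""
        ids = self._by_hash.get(digest, [])
        if not ids:
            return 404, []
        if len(ids) == 1:
            return 302, ids  # redirect to the single canonical URL
        return 300, ids  # collision: list every matching document
```

With a real SHA1 the collision branch should essentially never fire; passing a deliberately weak hash_fn (say, one that returns a constant) is an easy way to exercise it.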

Google would use the normal tricks to infer the document's MIME type from the content, as the MIME type cannot be included in the data that gets hashed.
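One of those "normal tricks" is checking the first few bytes of the content against well-known file signatures. Below is a minimal sketch of that idea; the function name sniff_mime and the short signature table are illustrative assumptions, and a real sniffer would know far more formats.

```python
# A few well-known file signatures ("magic numbers") and their MIME types.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime(content: bytes) -> str:
    """Guess a MIME type from the content alone."""
    for magic, mime in _SIGNATURES:
        if content.startswith(magic):
            return mime
    # No signature matched: treat valid UTF-8 as plain text,
    # and anything else as opaque binary data.
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"
```

Because the guess depends only on the bytes, it is as reproducible as the hash itself: the same document gets the same type every time it is served.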