File Hashing: If you look close enough, even files have fingerprints
A little while back I spent some time looking at an interesting issue with a colleague. He was trying to load a virtual machine from an .OVA file but kept receiving error messages saying the file was invalid. Somehow, though, other people had used the same file without trouble. This one could be fun to diagnose.
To start with, the premise of our argument has a problem: it wasn’t the same file but rather a copy of the file. Others had pulled it directly from a local network resource, whereas my colleague’s copy was downloaded via HTTP through a cloud storage provider. They should be the same, but are they? The fastest and easiest way for us to check is by comparing hashes.
A file’s hash is like its fingerprint. When two files have the same hash, you can be reasonably certain that they are the same file.[1] Change a single bit anywhere in the stream and you’ll wind up with a drastically different hash.
There are several different hashing algorithms available. For our purposes we’ll use SHA-1.[2] Now let’s take a second to verify our claim that a single bit being different will change the hash.
$ echo "Hello World" | shasum
648a6a6ffffdaa0badb23b8baf90b6168dd16b3a  -
$ echo "Hello Wnrld" | shasum
ade3cd129c0310e5726138fb4654da0311056d3b  -
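The same experiment can be sketched in Python with the standard hashlib module. This is only an illustration, not part of the original diagnosis: it assumes the input bytes are exactly what echo produces (the text plus a trailing newline), and the variable names are made up for the example. It counts how many bits differ between the two inputs and between the two digests.

```python
import hashlib

original = b"Hello World\n"  # echo appends a trailing newline
altered = b"Hello Wnrld\n"

# XOR the inputs byte by byte and count the set bits to see how many bits differ.
input_bits_changed = sum(bin(a ^ b).count("1") for a, b in zip(original, altered))

digest_a = hashlib.sha1(original).digest()
digest_b = hashlib.sha1(altered).digest()

# Do the same for the two 160-bit SHA-1 digests.
output_bits_changed = sum(bin(a ^ b).count("1") for a, b in zip(digest_a, digest_b))

print(hashlib.sha1(original).hexdigest())
print(hashlib.sha1(altered).hexdigest())
print(f"input bits changed:  {input_bits_changed}")
print(f"output bits changed: {output_bits_changed} of {len(digest_a) * 8}")
```

Because the bytes are the same as in the shell example, the first digest printed should match the first shasum output above.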
So that appears to confirm it: changing a single bit, in this case turning the second ‘o’ (0x6F) into ‘n’ (0x6E), produced a completely different hash. Now we can have shasum hash our file by passing the filename as the first argument.
$ shasum my_large_file.img
c77050ddf414d6e73b78e7c713a3fdef1b258fae  my_large_file.img
In our case, by hashing both of our files we could see if there were any differences. As you can most likely already guess, the hashes did not match. VirtualBox was right; the file was not valid. Some corruption had occurred while downloading the file. Retransferring the file fixed the problem.
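For anyone who would rather script the comparison, here is a minimal Python sketch of the same check. The function names and the chunk size are my own choices for illustration, not anything we used at the time. It reads each file in fixed-size chunks so that a multi-gigabyte image never has to fit in memory at once.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of the file at `path`."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Feed the hash one chunk at a time until read() returns no more data.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """Return True when both files produce the same SHA-1 digest."""
    return sha1_of_file(path_a) == sha1_of_file(path_b)
```

With two hypothetical copies on disk, `same_contents("local_copy.ova", "downloaded_copy.ova")` would return False for a corrupted download like ours.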
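A quick note on the samples above: shasum is the name of the utility on Linux and OS X (many Linux systems also ship the equivalent sha1sum). On Windows, you can use the built-in certutil -hashfile command or PowerShell’s Get-FileHash cmdlet, or an external utility such as Quick Hash GUI.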
[1] I said “reasonably certain”. When two files differ but share a hash, that is known as a collision. While it is theoretically possible that two files could unintentionally collide, it is amazingly unlikely.
[2] For our purposes, we chose SHA-1, but we could have chosen any of a number of different hashes including MD5, SHA-224, SHA-256, or SHA-512. Even then, there are others still.
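As a rough illustration of that second note, the algorithms named there are all exposed through Python's hashlib module. The sketch below hashes the same bytes with each one and prints the digest size. It assumes a standard Python build where all five are enabled; some locked-down builds disable MD5.

```python
import hashlib

data = b"Hello World\n"  # the same bytes that echo "Hello World" produces
digest_bits = {}

for name in ("md5", "sha1", "sha224", "sha256", "sha512"):
    hasher = hashlib.new(name, data)
    digest_bits[name] = hasher.digest_size * 8  # digest_size is in bytes
    print(f"{name:>6}  {digest_bits[name]:>3} bits  {hasher.hexdigest()}")
```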