I have been using a program called backintime for making periodic snapshots of my source code. It’s really great for taking regular snapshots without using too much space. The basic operation is simple. An initial copy is made. The next time a snapshot is taken, it is compared against the previous one. Files that haven’t changed simply get symbolic links back to the original file. Files that have changed get a new copy. This keeps the size of each snapshot quite small. The program also has a rotation feature that can remove snapshots after some period of time has passed. This is exactly the functionality I was looking for to make hourly snapshots of my source code files. If I messed up, I could simply restore from any previous point in time. It worked so well that I decided to try it on my website. That’s where I ran into a couple of problems.
One of the problems is that my backup machine sometimes loses its connection to the web server. When it then tries to take a backup, everything looks like it has been removed, so it snapshots an empty directory. When the connection comes back, the next snapshot looks like an entirely new system. The program doesn’t know the network drive is unavailable, nor does it know it already has copies of everything.
I decided to look into what it would take to make a similar system that addressed these shortcomings. In my system, original files are stored by hash. Snapshots are simply symbolic links to these hash files. This means only unique content is preserved: files with duplicate content link to the same source. If a file is removed, its link simply isn’t made in the next snapshot. If the file reappears, the link again points to the original hash file. And if the data is moved, only the links are updated.
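In rough outline, taking a snapshot looks something like the sketch below. This is a simplified illustration rather than my actual script: the directory layout and the take_snapshot and hash_file helpers are placeholders, and SHA-256 stands in for the hash function I actually used (covered next).

```python
import hashlib
import os
import shutil


def hash_file(path, chunk_size=1 << 20):
    # SHA-256 stands in here; the real script uses xxHash (see below).
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def take_snapshot(source_dir, store_dir, snapshot_dir):
    """Store each unique file once under its hash, then rebuild the
    source tree inside snapshot_dir as symbolic links into the store."""
    os.makedirs(store_dir, exist_ok=True)
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            stored = os.path.join(store_dir, hash_file(path))
            # New content is copied into the store exactly once;
            # unchanged and duplicate files reuse the existing copy.
            if not os.path.exists(stored):
                shutil.copy2(path, stored)
            # The snapshot itself is just a tree of symlinks.
            link = os.path.join(snapshot_dir, os.path.relpath(path, source_dir))
            os.makedirs(os.path.dirname(link), exist_ok=True)
            os.symlink(os.path.abspath(stored), link)
```

Because unchanged and duplicate files all resolve to the same stored copy, a network drive that disappears and comes back simply links to content the store already has, instead of looking like an entirely new system.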
I put together a quick Python script to try this concept out. It only took an hour or two to get the basic concept working. I used a non-cryptographic hash function called xxHash, which produces a 64-bit hash of a file’s content and is very fast.
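For the hashing itself, the python xxhash package makes this a few lines. The helper below is a sketch, with a function name and chunk size of my own choosing:

```python
import xxhash


def xxhash_file(path, chunk_size=1 << 20):
    """Return the 64-bit xxHash of a file's content as a hex string."""
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()  # 16 hex characters for a 64-bit digest
```

Dropping this in as the hash_file helper in the earlier sketch gives the same behavior, just much faster than a cryptographic hash would be.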