So here’s how I lost my data because of RAID1. Apparently, my mobo (Asus P5W DH Deluxe) has an onboard RAID1 option, which is on by default. Simply plugging the SATA cable from a hard drive full of data into the second port (my OS disk was on the first port) was enough to cripple that data beyond repair: my OS disk was mirrored over my data at the bit level.
Needless to say, my heart aged a bit that day. And my mind got a crash course in data recovery and the NTFS filesystem, but to no avail. All I could recover from the crippled data drive were files that were actually on my OS disk.
How could I miss that? This mobo has been in my computer for ages. Or did I know, but forget, somewhere along the way? And why, for crying out loud, didn’t I make any backups in the last three years? I can only conclude this is all my own fault.
On a higher level, I drew two conclusions:
- Because of how my data is organized, not all was lost: my data is spread over Gmail, Gdrive, BitBucket, DropBox, NAS server, Laptop, remote VPS servers and multiple disks.
- Because of how my data is organized, a good backup regime is quite hard to come up with and to implement.
Instead of looking back, I decided to look forward, and I came up with a list of properties that I would like to see in a backup system:
- Multiple locations to back up from, and multiple locations to back up to
- Fine-grained control at the folder and file level of:
  - What to back up
  - Where to back up to, and the number of replicas to maintain
  - The number of versions to maintain and what to do upon deletion
- Backup over a network connection
- Moving files or folders should not result in data duplication on the backup site
- Deletion of files should not immediately result in deletion from the backup
- Deletion of files should not result in infinite retention inside the backup
- Dashboard with stats to monitor free space, backup site availability and deletions
- Work on Linux, and preferably also with Windows
- Recovery should be possible based on individual or multiple backup sites
Does this exist? I don’t know. I didn’t look too far, but instead started to work on this myself. It’s probably stupid, but only time will tell. Here are a few key aspects:
- Written in PHP. I know PHP and I benefit from learning more about it. Furthermore it is available from the command line and via webserver (good for the Dashboard part) on all my systems.
- SQLite as database on both ends. Especially for the backup location this means that the metadata is in exactly the same folder as the file data and can be copied around easily when needed.
- Backup files are named md4_filesize.bck. This means that files with exactly the same content are only stored once. I hope MD4 + file size is enough to reasonably prevent collisions. SHA would be better, but slower. Perhaps MurmurHash3?
- Data transfer is over a socket and uses hashing to check data integrity. Files already at the backup location are skipped (based on information on both ends) or, when incomplete, resumed.
- Using regexps for fine-grained control over what to back up, where, how many versions, etc.
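To make the naming scheme from the list above a bit more concrete, here is a minimal sketch in Python (not the PHP code of the actual project). It assumes SHA-256 instead of MD4, because Python's hashlib does not guarantee that MD4 is available everywhere. The function names `backup_name` and `store` and the flat repository folder are made up for this illustration.

```python
import hashlib
import os
import shutil


def backup_name(path, chunk_size=1 << 20):
    """Return '<hex digest>_<size in bytes>.bck' for the file at path.

    Assumption: SHA-256 stands in for the MD4 hash mentioned in the post.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return f"{digest.hexdigest()}_{size}.bck"


def store(path, repository):
    """Copy path into repository unless identical content is already there.

    Returns (name, copied): the backup file name and whether a copy was made.
    """
    name = backup_name(path)
    target = os.path.join(repository, name)
    if os.path.exists(target):
        # Same hash and same size: treat it as already backed up.
        return name, False
    shutil.copyfile(path, target)
    return name, True
```

Because the name depends only on the content, a file that is moved or renamed maps to the same .bck file, so `store` skips it on the next run. The original path would have to live in the metadata database, which this sketch leaves out.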
File transfer is already working nicely. Backing up a folder structure with fine-grained control over what to back up and how many versions to maintain is working as well. There’s a bit of logging up and running and a very rudimentary dashboard as well. A simple recover script has also been tested (but this doesn’t work over the network yet). Scaling this to multiple locations and repositories, and handling file deletions (I’m planning to report deletions and keep them for X days, which may be checked and corrected manually via the dashboard) are on the todo list. Up until now there’s no security. I could add some, or simply tunnel over SSH for remote connections.
The nice thing is that this backup solution could help me back up my documents, my pictures, my mp3s (which would be fine on a different disk inside the same building) and also database dumps and logs from my webservers.
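The "keep deletions for X days" idea could look roughly like the sketch below. This is only an illustration in Python under my own assumptions: the function name `expired_deletions` and the shape of the `deleted_at` mapping are hypothetical, and nothing like this exists in the project yet.

```python
from datetime import datetime, timedelta


def expired_deletions(deleted_at, now, keep_days):
    """Return the paths whose deletion was noticed at least keep_days ago.

    deleted_at maps a path to the datetime at which the source file was
    first seen missing (hypothetical structure). Nothing is removed here;
    the result is only a list of candidates to review or purge.
    """
    cutoff = now - timedelta(days=keep_days)
    return sorted(path for path, seen in deleted_at.items() if seen <= cutoff)


if __name__ == "__main__":
    seen_missing = {
        "docs/old-report.odt": datetime(2024, 1, 1),
        "pics/blurry.jpg": datetime(2024, 1, 25),
    }
    # With a 30 day retention, only deletions from 1 January or earlier qualify.
    print(expired_deletions(seen_missing, datetime(2024, 1, 31), 30))
```

Keeping this step as a pure function that only reports candidates fits the plan of checking and correcting deletions manually via the dashboard before anything is really removed from the backup.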
I think I want to also include BitBucket, Gmail and Gdrive exports, as well as DropBox files. Or can we trust those parties with our digital lives?