Destroyed by RAID1. Backup plan 2.

So here’s how I lost my data because of RAID1. Apparently, my mobo (Asus P5W DH Deluxe) has an onboard RAID1 option, which is on by default. Simply plugging the SATA cable from a data-containing hard drive into the second port (my OS disk was on the first port) was enough to cripple my data beyond repair: my OS disk was mirrored over my data at the bit level.

Needless to say, my heart aged a bit that day. And my mind got a crash course in data recovery and the NTFS filesystem, but to no avail. All I could find on the crippled data drive were files that were actually on my OS disk.

How could I miss that? This mobo has been in my computer for ages. Or did I know, but forgot somewhere along the way? And why, for crying out loud, didn’t I make any backups for the last three years? I can only conclude this is all my own fault.

On a higher level, I drew two conclusions:

  1. Because of how my data is organized, not all was lost: my data is spread over Gmail, Gdrive, BitBucket, DropBox, a NAS server, my laptop, remote VPS servers and multiple disks.
  2. Because of how my data is organized, a good backup regime is quite hard to come up with and to implement.

Instead of looking back, I decided to look forward, and I came up with a list of properties that I would like to see in a backup system:

  • Multiple locations to back up from, and multiple locations to back up to
  • Fine-grained control at the folder and file level over:
    • What to back up
    • Where to back up to, and the number of replicas to maintain
    • The number of versions to maintain and what to do upon deletion
  • Backup over a network connection
  • Moving files or folders should not result in data duplication on the backup site
  • Deletion of files should not immediately result in deletion from the backup
  • Deletion of files should not result in infinite retention inside the backup
  • Dashboard with stats to monitor free space, backup site availability and deletions
  • Work on Linux, and preferably also with Windows
  • Recovery should be possible from individual or multiple backup sites

Does this exist? I don’t know. I didn’t look too far, but instead started working on this myself. It’s probably stupid, but only time will tell. Here are a few key aspects:

  • Written in PHP. I know PHP and benefit from learning more about it. Furthermore, it is available from the command line and via a webserver (good for the dashboard part) on all my systems.
  • SQLite as the database on both ends. Especially for the backup location, this means the metadata sits in exactly the same folder as the file data and can be copied around easily when needed.
  • Backup files are named md4_filesize.bck (a sketch of this naming scheme follows this list). This means that files that are exactly the same are only stored once. I hope md4 + filesize is enough to reasonably prevent collisions. SHA would be better, but slower. Perhaps MurMur3?
  • Data transfer runs over a socket and uses hashing to check data integrity. Files already on the backup location are skipped (based on information on both ends) or (when incomplete) resumed.
  • Regular expressions give fine-grained control over what to back up, where, how many versions to maintain, etc.
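
To make the naming scheme concrete, here is a minimal PHP sketch of how such a backup filename could be derived; the function name backupName() is purely illustrative and not taken from the actual code.

    <?php
    // Sketch only: derive a backup filename from a file's md4 hash and size,
    // following the md4_filesize.bck scheme described above.
    function backupName(string $path): string
    {
        $hash = hash_file('md4', $path); // PHP's hash extension supports md4
        $size = filesize($path);         // size in bytes
        return $hash . '_' . $size . '.bck';
    }

    // Two bit-identical files map to the same name, so identical data
    // is only stored once on the backup location.
    echo backupName('/home/user/documents/report.pdf'), PHP_EOL;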

File transfer is already working nicely. Backing up a folder structure with fine-grained control over what to back up and how many versions to maintain works as well. There’s a bit of logging up and running, and a very rudimentary dashboard too. A simple recovery script has also been tested (but this doesn’t work over the network yet). Scaling this to multiple locations and repositories, and handling file deletions (I’m planning to report deletions and keep them for X days, so they can be checked and corrected manually via the dashboard), are on the to-do list. So far there’s no security; I could add some, or simply tunnel over SSH for remote connections.
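
As a rough illustration of that deletion-retention idea, here is a hedged PHP/SQLite sketch: a missing source file is reported with a timestamp, and only files whose retention window has passed are offered for purging. The table and column names (deletions, deleted_at), the helper names and the database path are assumptions for the example, not the actual schema.

    <?php
    // Sketch only: report deletions instead of removing backups immediately,
    // and purge them only after a retention window. Names are hypothetical.
    $db = new PDO('sqlite:/backup/meta.sqlite3');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $db->exec('CREATE TABLE IF NOT EXISTS deletions (
        path       TEXT PRIMARY KEY,
        deleted_at INTEGER NOT NULL
    )');

    // Called when a source file turns out to be missing during a backup run.
    function reportDeletion(PDO $db, string $path): void
    {
        $stmt = $db->prepare('INSERT OR IGNORE INTO deletions (path, deleted_at) VALUES (?, ?)');
        $stmt->execute([$path, time()]);
    }

    // Returns paths whose retention window has passed, e.g. for review in the dashboard.
    function expiredDeletions(PDO $db, int $retentionDays): array
    {
        $stmt = $db->prepare('SELECT path FROM deletions WHERE deleted_at < ?');
        $stmt->execute([time() - $retentionDays * 86400]);
        return $stmt->fetchAll(PDO::FETCH_COLUMN);
    }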

The nice thing is that this backup solution could help me back up my documents, my pictures, my mp3s (which would be fine on a different disk inside the same building), and also database dumps and logs from my webservers.

I think I want to also include BitBucket, Gmail and Gdrive exports, as well as DropBox files. Or can we trust those parties with our digital lives?


2 comments

  1. anon says:

    md5 would be a better balance between speed and collision resistance than md4.

    Also if you name files after their hash, you should not need to store a size in the filename. A change in size should result in a change in hash, otherwise you’ve found a collision. So your files could be named just md5.bck.

    You do, however, need to keep a map of hash->filename somewhere. I assume this is the sqlite db. Are you backing it up to the backup location? If so, what is it named? Otherwise, when disk X dies, how will you know which md5.sqlite3 file corresponds to disk X’s files?

    • admin says:

      Thanks for your remarks!

      I thought about md5 but decided to go for md4 after some reading around on Stack Overflow. Perhaps I should do some testing to see if the speed difference is actually noticeable (compared to the disk speed).
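
      For what it’s worth, such a test could be as small as the PHP sketch below (the file path is a placeholder); it times hash_file() per algorithm so the difference can be compared against disk speed.

        <?php
        // Sketch only: time md4 vs md5 (and sha1) on one large file.
        $file = '/path/to/large/file.bin'; // placeholder path
        foreach (['md4', 'md5', 'sha1'] as $algo) {
            $start = microtime(true);
            hash_file($algo, $file);
            printf("%s: %.3f s\n", $algo, microtime(true) - $start);
        }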

      As for the file size: this way both the size AND the hash need to be the same before a collision occurs. I assume (without any testing or research) that this minimizes that chance. And since the size is readily available, adding it is cheap.

      Yes, the map of hash->filename is on the backup location. There’s only one db, for all ‘repositories’. The origin of each file (i.e. disk X) is stored in the database. I call this a ‘repository’. A single ‘source’ can have multiple repositories (i.e. photos, documents, code). File paths are stored relative to the ‘root’ of a repository.

      Knowing a repository is enough to restore the entire thing from the sqlite db and the bck files, all of which reside at the backup location.
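
      To illustrate that hash-to-filename map, here is a hedged sketch of what such an SQLite schema could look like; the table and column names are invented for the example and are not necessarily the actual schema.

        <?php
        // Sketch only: a possible shape for the metadata db at the backup location.
        $db = new PDO('sqlite:/backup/meta.sqlite3');

        $db->exec('CREATE TABLE IF NOT EXISTS repositories (
            id     INTEGER PRIMARY KEY,
            source TEXT NOT NULL,  -- e.g. "disk X" or a hostname
            name   TEXT NOT NULL,  -- e.g. "photos", "documents", "code"
            root   TEXT NOT NULL   -- the root that relative paths are resolved against
        )');

        $db->exec('CREATE TABLE IF NOT EXISTS files (
            repository_id INTEGER NOT NULL REFERENCES repositories(id),
            relative_path TEXT NOT NULL,  -- path relative to the repository root
            hash          TEXT NOT NULL,  -- md4 of the file contents
            size          INTEGER NOT NULL,
            PRIMARY KEY (repository_id, relative_path)
        )');

        // Restoring a repository then means walking the files table and copying
        // each hash_size.bck file back to root + relative_path.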
