2010/10/01 § 2 Comments
You don’t know it, but I’ve started writing this blog several times before it actually went live, and every time I scraped whatever post I started with (the initial run was on blogger.com). I just didn’t think these posts were all too interesting, they were about my monstrous home server, donny. Maybe this is still not interesting, but I’m metaphorically on the verge of tears and I just have to tell someone of what happened, to repent me of my horrible sin. You may not read if you don’t want to. I bought donny about 2.5-3 years ago, to replace my aging home storage server (I had about 3x250GB at the time, no RAID). There’s not much to say about donny’s hardware (Core 2 Duo, 2GB of RAM, Asus P5K-WS motherboard), other than the gargantuan CoolerMaster Stacker 810 chassis with room for 14 (!) SATA disks. Initially I bought 8×0.5TB SATA Hitachi disks for it, and added more as I had the chance. I guess I bought it because at the time I’d hang around disks all day long, I must’ve felt the need to compensate for something (my job at the time was mostly around software, but still, you can’t ignore the shipping crates of SATA disks in lab).
Anyway, most of its life donny ran OpenSolaris. One of our customers had a big ZFS deployment, I’ve always liked Solaris most of all the big Unices (I never thought it really better than Linux, it just Sucked Less™ than the other big iron Unices), I totally drank the cool-aid about “ZFS: the last word in File System” (notice how the first Google hit for this search term is “Bad Request” :) and dtrace really blew me away. So I chose OpenSolaris. Some of those started-but-never-finished posts were about whether I was happy with OpenSolaris and ZFS or not, I never found them interesting enough to even finish them. So even if I don’t wanna discuss that particularly, it should be noted that if we look at how I voted with my feet, I ended up migrating donny to Ubuntu 10.04.1/mdadm RAID5/ext4 when my wife and I got back from our long trip abroad.
Migration was a breeze, the actual migration process convinced me I’ve made the right choice in this case. Over the time with ZFS (both at work and at home) I realized it’s probably good but certainly not magical and not the end of human suffering with regard to storage. In exchange for giving up zfs and dtrace I received the joys of Ubuntu, most notably a working package management system and sensible defaults to make life so much easier, along with the most vibrant eco-system there is. I bought donny 4×2.0TB SATA WD Cavier Green disks, made a rolling upgrade for the data while relying on zfs-fuse (it went well, despite a small and old bug) and overall the downtime was less than an hour for the installation of the disks. At the time of the disaster, donny held one RAID5 array made of 4x2TB, one RAID5 array made of 4x.5TB, one soon-to-be-made RAID5 array made of 3x1TB+1×1.5TB (I bought a larger drive after one of the 1TB failed a while ago), and its two boot disks. I was happy. donny, my wife and me, one happy family. Until last night.
I was always eyeing donny’s small boot disks (what a waste of room… and with all these useless SATA disks I’ve accumulated over the years and have lying about…), so last night I wanted to break the 2x80GB mirror and roll-upgrade to a 2x1TB boot configuration, planning on using the extra space for… well, I’ll be honest, I don’t know for what. I’ll admit it – I got a bit addicted to seeing the TB suffix near the free space column of df -h at home (at work you can see better suffixes :). I just have hardware lying around, and I love never deleting ANYTHING, and I love keeping all the .isos of everything ever (Hmm… RHEL3 for Itanium… that must come in handy some day…) and keeping an image of every friend and relative’s Windows computer I ever fixed (it’s amazing how much time this saves), and never keeping any media I buy in plastic… and, well, the fetish of just having a big server. Heck, it sure beats farmville.
So, indeed, last night I broke that mirror, and installed that 1TB drive, and this morning I started re-mirroring the boot, and while I was at it I started seeing some of the directory structure was wrong so I redistributed stuff inside the RAIDs, and all the disks where whirring merrily at the corner of the room, and I was getting cold so I turned off the AC, and suddenly donny starts beeping (I didn’t even remember I installed that pcspkr script for mdadm) and I get a flurry of emails regarding disks failures in the md devices. WTF? Quickly I realized that donny was practically boiling hot (SMART read one of the disks at 70 degrees celsius), at which point I did an emergency shutdown and realized… that last night I disconnected the power cable running from the PSU to several fans, forgot to reconnect it, and now I’ve effectively cooked my server. Damn.
I’m not sure what to do now. I still have some friends who know stuff about harddisks (like, know the stuff you have to sign NDAs with the disk manufacturers in order to know), and I’m trying to pick my network about what to do next. Basically, from what I hear so far, I should keep donny off, let the disks cool down, be ready with lots of room on a separate host to copy stuff out of it, boot it up in a cool room, take the most critical stuff out and then do whatever, it doesn’t matter, cuz the disks are dead even if they seem alive. I’m told never to trust any of the disks that were inside during the malfunction (that’s >$1,000USD worth of disks…), once a disk reached 70 degrees, even far less, don’t get near it, even if it’s new. Admittedly, these guys are used to handling enterprise disk faults, where $1,000USD in hardware costs (and even many many times that amount) is nothing compared to data loss, but this is the gist of what I’m hearing so far. If you have other observations, let me know. It’s frustratingly difficult to get reliable data about disk failures on the Internet; I know just what to do in case of logical corruption of any sort; but I don’t know precisely what to do in a case like this, and in case of a controller failure, and a head crash, and so on, and so forth. I know it’s a lot about luck, but what’s the best way to give donny the highest chance of survival?
On a parting note, I’ll add that I’m a very sceptic kind of guy, but when it comes to these things I’m rather mystical. It all stems from my roots as a System Administrator; what else can comfort a lonely 19-year-old sysadmin trying to salvage something from nothing in a cold server room at 03:27AM on a Saturday? So now I blame all of this for the name I gave donny. I named it so because I name all my hosts at home after characters from Big Lebowski (I’m typing this on Dude, my laptop), and I called the server donny. The email address I gave it (so it could send me those FzA#$%! S.M.A.R.T reports it never did!) was named Theodore Donald Kerabatsos. The backup server, which is tiny compared to donny and doesn’t hold nearly as much stuff as I’d like to have back now, is called Francis Donnelly. The storage pools (and then RAID volumes) were called folgers, receptacle, modest and cookies (if you don’t understand, you should have paid more attention to Donny’s funeral in The Big Lebowski). And, indeed, as I predicted without knowing it, it ended up being friggin’ cremated. May Stallman bless its soul.
I guess I’m a moron for not thinking about this exact scenario; I was kinda assuming smartmontools and would be SMART (ha!) enough to shutdown when the disks reach 50 degrees, and maybe there is such a setting and I just didn’t enable it… I guess by now it doesn’t matter. I’m one sad hacker. I can’t believe I did this to myself.