You know, I had just gotten home from about five hours of grading, dealing with a couple of entirely unrelated matters, and so on, and was looking forward to a calm night... Suddenly, out of nowhere, came a spate of weird error messages when I tried to check my e-mail.
I pinged the server. Did it answer? Yes.
Did it answer SSH? No.
Did the main file server answer pings? No.
Oh, bugger. That file server is a tank; I'm kinda surprised that anything short of it catching on fire could panic it. But it didn't catch fire, since the mail server is only a meter and a half away, and if the building were on fire it wouldn't be answering pings either.
So I got there, and had a fun time fixing it. Apparently the memory had somehow failed - but after taking it out and cleaning the contacts, everything was fine again. I'm not sure if there's a deeper problem in there somewhere.
(Of course, this makes it sound easy. What really happened is that I got there and found it panicked and incapable of bouncing; the motherboard was giving an error message. During the bloody eternal time it took one of the machines to load up the mboard manual to decipher what that meant, I had grabbed another not-quite-as-important machine, and was about 90% of the way through a full-brain transplant: taking every part of the file server that knew it was a file server and putting it in that small computer. Then I found out that the error code meant memory trouble, so I decided to check and find the source; taking out the RAM chips and fiddling a bit made it work again, so I reverse-brain-transplanted. This mostly worked, except that the ethernet card was suddenly unhappy; but taking it out, cleaning its contacts and putting it back seemed to take care of that too.)
So, some important discoveries for the evening:
- Response time is doing pretty damned well. From server failure to my finding out about it was an hour; from my finding out about it to it being completely fixed (including time to drive like a bat out of hell from home to campus) was just under an hour. The only real way to speed that up would be to have it on a pager, and frankly they don't pay me enough for that.
- Lack of hardware, OTOH, is a problem. When I was doing that full-brain transplant I realized that the little box of screws, mounting rails, and so on - simple pieces of machined steel that one needs when fixing things - was nowhere to be found. This was turning into a significant issue when I decided to reverse the procedure. Conclusion: A box of hardware always needs to be easily on hand.
- Good hardware: The SuperMicro SC760 server case is fscking excellent. It's possible to open it up completely, from all sides, and field-strip it within seconds, or keep it running, with basically no effort. In future, all cases, rack and tower alike, that I buy will be from this company. The motherboard (an Epox EP-8KHA+) is also nice; its diagnostics, both LEDs and on-screen, are excellent.
- Flaky hardware: Something strange is going on in there. I don't know if it's the RAM (Crucial, registered ECC), the motherboard, or something else that caused this. The CPU fan was making some noise until I played with it a bit, and I noticed that the main CPU voltage was running a bit low - 3.1V instead of 3.3V. But it's back up to full now. I need to figure out what the hell is going on.
- Metal: Ow ow fsck fsck goddammit... but adrenaline keeps one from noticing that until after the repairs are done with.
Ah, the joys of sysadmin life. At least I can bill for this.
