Wow! This was the status update I was waiting to see. Thank you for your work and for the time you took. I hope you had a nice vacation (these things always happen whenever you're going on vacation).
If you don't mind, I would like to make some suggestions (obviously I'm not trying to tell you how to do your job but merely trying to help in some way). Your bullet points one by one:
1: This is understandable. Though does the budget allow for $6.99/month? If so, you might wish to consider a Kimsufi server. Kimsufi is an OVH brand; OVH is a huge provider in Europe and they build and operate a few data centers of their own (I think it's 4 or 5). Kimsufi is their 'cheapest' brand, which they use to convince people to move up to their 'soyoustart.com' brand and eventually to the main OVH brand. The Kimsufi servers are extremely cheap, the cheapest going at $6.99 with a 500GB hard disk, which should be plenty of space to make backups to. I'm also vouching for them: I've used them and I'm currently on their Soyoustart brand. (Hint! Soyoustart also sells dedicated game servers - might be worth looking into!)
If you can consider this - I would definitely recommend you look into rsync as well. rsync essentially makes incremental backups instead of 'full' ones: the first run copies everything (obviously), and later runs only transfer what has changed. That might save you from having to recover an entire hard disk, and transferring data from one server to another would be reasonably fast in case it's ever needed.
rsync doesn't work well on Windows. I tried it before and it didn't work out well. On Linux rsync can be a beast, but on Windows, no. I might look into the $7/month VPS thingy. I am tight on budget right now, but for the future, sure.
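To make the 'incremental' idea concrete, here is a minimal sketch in plain Python. It is not rsync and does not use rsync's delta algorithm; it only copies files whose size or modification time differs from the copy already in the backup folder, which is the same basic principle. The function name and the folder layout are made up for illustration, and it deliberately ignores deletions, locked files and permissions.

```python
import shutil
from pathlib import Path


def incremental_backup(source: Path, dest: Path) -> list:
    """Copy only files that are new or changed since the previous run.

    A file counts as changed when its size or its modification time
    (in whole seconds) differs from the copy already in dest.
    Returns the relative paths that were copied.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        dst_file = dest / rel
        src_stat = src_file.stat()
        if dst_file.exists():
            dst_stat = dst_file.stat()
            unchanged = (
                src_stat.st_size == dst_stat.st_size
                and int(src_stat.st_mtime) == int(dst_stat.st_mtime)
            )
            if unchanged:
                continue  # already backed up, skip it
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also copies the mtime
        copied.append(rel.as_posix())
    return copied
```

The first run copies everything and later runs only touch what changed. Since it only uses the standard library it should behave the same on Windows and Linux, though for anything serious a proper backup tool is the better choice.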
2. Yes, this is also understandable. One of my other suggestions was to set up backup 'shims'. For ET this would mean a small ET server running on another machine, to which people would connect and get the "Server is full, go to <different server>" message. In that case, the DNS entry for silent.clan-fa.com could be updated temporarily, or the IP re-routed (if the servers are at the same provider, that is). The suggestion I'm making here is that the 'server is full' message would actually say 'The server is under maintenance' or something like that, so people get 'caught and redirected' to another server. That would be more informative and I would guess it'd help minimize player losses. Heck, it doesn't even need to be an actual ET server with a config - it merely needs to be a program mimicking ET and its redirect. Something to consider at least as part of a 'rapid response' in case shit hits the fan.
* In fact, the IP wouldn't even need to be re-routed - one could essentially boot a Linux live CD and set up iptables to temporarily redirect packets to another IP if needed.
We already use DNS, but if the whole machine is down there is no point in a redirect. We've got 1 machine in the US and 2 in Europe. So if one in Europe goes down, I can redirect, but from US to Europe players would whine about ping and we would see 100 reports of "OMG my ping increased".
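For what the packet-redirect idea could look like without iptables, here is a rough userspace stand-in: a small UDP relay in Python that passes player packets on to another server and hands the replies back. It is a sketch under assumptions, not a tested ET setup. The function name, the `stop_after` parameter and the addresses in the usage note are all hypothetical, and it only helps while the box (or some box behind the old IP/DNS name) is still able to run it.

```python
import select
import socket


def run_udp_relay(listen_sock, target_addr, stop_after=None, idle_timeout=1.0):
    """Forward datagrams arriving on listen_sock to target_addr and relay replies.

    Each player gets a separate upstream socket so that replies from the
    target can be routed back to the right player. stop_after limits the
    number of datagrams handled (handy for testing); None means run forever.
    """
    upstream_by_client = {}  # player address -> socket talking to the target
    client_by_upstream = {}  # upstream socket -> player address
    handled = 0
    try:
        while stop_after is None or handled < stop_after:
            sockets = [listen_sock] + list(client_by_upstream)
            readable, _, _ = select.select(sockets, [], [], idle_timeout)
            for sock in readable:
                data, addr = sock.recvfrom(65535)
                if sock is listen_sock:
                    # Packet from a player: pass it on to the other server.
                    upstream = upstream_by_client.get(addr)
                    if upstream is None:
                        upstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
                        upstream_by_client[addr] = upstream
                        client_by_upstream[upstream] = addr
                    upstream.sendto(data, target_addr)
                else:
                    # Reply from the other server: hand it back to the player.
                    listen_sock.sendto(data, client_by_upstream[sock])
                handled += 1
    finally:
        for upstream in upstream_by_client.values():
            upstream.close()
```

Usage would be roughly: bind a UDP socket on the game port of the broken server (27960 is assumed here as the usual ET default) and call `run_udp_relay(sock, ("203.0.113.10", 27960))`, where the target address is a placeholder. Two caveats: the other server sees every player as coming from the relay's IP, and it does nothing about the extra ping for players who end up on another continent.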
3. It always happens when one goes on vacation! Generally when I go on holiday - I tend to let my servers know I'll beat 'm up if they break whilst I'm away (They have yet to actually listen, defiant machines I'll tell ya).
Hehe, somewhat agree, but the current hard drive served us for like 5+ years.
4. I get this as well - but perhaps combined with the 1st point it wouldn't be that much overhead: the 'backup' server would run the monitoring installation and the other servers merely an agent reporting their status. All in all, a simple I/O check could've given you an edge and it might even have allowed you to prevent this entire issue altogether. Now, there's no guarantee, but that's why monitoring is so important. It seems like overhead, but not in the situations I've been in. [I actually had an internship at a data center a few years ago and worked with quite a few of their clients. We monitored the boxes for being up/down, but we didn't do extensive monitoring on disks and the like since that wasn't our place - though if a box went down and the owner called for us to check it out, we would (if covered under the agreement; otherwise we couldn't legally touch the box).] I really can't stress this enough though: monitoring = information = rapid response & preemptively preventing problems. It gives a HUGE amount of insight that can also be used to uncover performance issues and the like.
Well, I/O checks on a Windows server are often not possible. HDTune is the only way I know of, and it needs to be checked manually. For Linux, yes.
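As a very crude illustration of what a 'simple I/O check' could be on either OS, the sketch below times a small write-and-flush to disk and looks at free space, using only the Python standard library. It is not a replacement for HDTune or SMART data and it won't predict a failing drive on its own; the function names and the thresholds (2 seconds, 10% free) are arbitrary assumptions picked for the example.

```python
import os
import shutil
import tempfile
import time


def probe_disk(directory, payload_size=1024 * 1024):
    """Write and fsync a temporary file in directory; report timing and free space."""
    payload = os.urandom(payload_size)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to the disk
    elapsed = time.perf_counter() - start
    usage = shutil.disk_usage(directory)
    return {
        "write_seconds": elapsed,
        "free_fraction": usage.free / usage.total,
    }


def check_disk(directory, max_write_seconds=2.0, min_free_fraction=0.10):
    """Return a list of warning strings; an empty list means nothing looked off."""
    try:
        result = probe_disk(directory)
    except OSError as exc:
        # A failing write is itself a strong hint that something is wrong.
        return [f"write probe failed: {exc}"]
    warnings = []
    if result["write_seconds"] > max_write_seconds:
        warnings.append(f"slow write: {result['write_seconds']:.2f}s")
    if result["free_fraction"] < min_free_fraction:
        warnings.append(f"low free space: {result['free_fraction']:.0%}")
    return warnings
```

Run from a scheduled task every few minutes, something like this could at least flag a disk that suddenly gets slow or starts throwing write errors, and the result could be mailed or fed into whatever monitoring is already in place.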
5. Nagios is actually something that generally runs on Linux (though you can also use it to monitor Windows machines) - what goodies are you referring to specifically though? top, htop, iotop?
We run Munin on our other Linux servers, with one as master and the others as slaves. Nagios is overkill for our needs and extra overhead. Munin is light and does the same job for us.
Once again I sincerely hope you had a great vacation and thank you for the hard work. I hope you'll take my suggestions under advisement and you should definitely come to Silent or HC one time whenever I'm there, let's see who's the better shooter eh!
Take a break man, you deserve it.