Case Study: Recovering from a Server Disaster

Wednesday, April 9th, 2008

Business continuity is very important to my clients, and how quickly you recover from a disaster is crucial.

Here is the back story.

Server: PowerEdge 1600SC running Windows 2000 Terminal Server, Number of Users: 30*

*These are Terminal Server Users, not just file and print. So they have full virtual computing desktops running Microsoft Office applications.

I remotely access this site to carry out backups. When I tried to connect over the weekend, I could not reach the server. This was the first sign of trouble.

Assuming it was a simple issue, such as a loose power cord or a server that needed to be manually rebooted, I waited until 8am on Monday morning to contact the site.
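An unreachable server like this can also be flagged by a small scheduled check rather than discovered at the next manual login. The following is a minimal sketch, not what was in place at this site: it assumes the Terminal Server listens on the default RDP port (TCP 3389), and the function name `is_reachable` is made up for illustration.

```python
import socket


def is_reachable(host: str, port: int = 3389, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

Run from a scheduled task against the server's address, a `False` result could trigger an email or text alert. Note that a machine in the state described here (powered up but not responding) would fail this check even though its power light is on.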

Day 1

The server is still powered up but unreachable, and has no video output.

Once I get this news, I leave for the customer’s site.

Within 1 hour I am on-site with the server out of its environment for testing. The Dell PowerEdge 1600SC has power but does not POST (Power-On Self-Test).

In the next hour I thoroughly check the machine for obvious faults such as damaged capacitors or faulty memory. The most likely cause of the problem appears to be the power supply (I am later proved wrong).

Now I have some hard choices:

1) Gamble that a replacement power supply will fix it. I can get one the next day.

2) Start building a replacement server for immediate deployment.

I decide to do both.

I order the power supply and spend the rest of the day installing Windows 2000 Server, configuring the thirty user profiles and restoring the data. I have to log into each user's account and carry out seven customisation tasks, so this is a lengthy job that I complete at 8pm.

The most recent day’s data is still trapped on the unbootable server, but I have nearly all of the data, enough to get the company back online.

Day 2

8am. The replacement Terminal Server is deployed and the customer is back in business.

I then take another spare server and put the broken server’s hard disk in as a second drive so I can grab the latest data that wasn’t backed up due to the original failure. The data recovery is 100% successful.

I then restore the last of the data to the replacement server, which is now live.
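Pulling only the files that changed after the last good backup is easy to script once the old disk is mounted as a second drive. The sketch below is illustrative only and is not the procedure used on the day: the function name `copy_newer_files`, the paths, and the idea of using file modification time as the cutoff are all assumptions.

```python
import os
import shutil


def copy_newer_files(src_root: str, dst_root: str, cutoff: float) -> list:
    """Copy files under src_root modified after cutoff (epoch seconds).

    The directory layout is recreated under dst_root. Returns the
    relative paths of the copied files, sorted.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.getmtime(src) <= cutoff:
                continue  # unchanged since the cutoff, so already in the backup
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            copied.append(rel)
    return sorted(copied)
```

With the old disk visible as, say, `E:\` and the time of the last successful backup as the cutoff, the returned list doubles as a record of what was recovered. Copying to a staging folder first, rather than straight onto the live server, leaves room to review the files before restoring them.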

At 10am the replacement power supply for the PowerEdge 1600SC turns up. I quickly install it, only to find that exactly the same thing happens: it powers up, but does not POST.

So the PowerEdge 1600SC is dead. The only remaining suspect is the motherboard. The machine is well outside its 3-year on-site warranty from Dell. However, we had this recovery plan in place for protection.

Now the users are back online and the customer and I can decide how best to proceed.

Summary

A business-critical server installed, configured and replaced within 1 business day, with no data loss.

Lots of things went wrong in reality, but getting the users back within 1 business day was never really in question, as I had access to a replacement server and the problem-solving skills to complete the job.

Tips

1) Have a disaster recovery plan that you have confidence in.

2) Put a service level agreement in place with your support partner.

3) Back up, back up, back up.