|
Disaster Recovery Options with VMware ESX Server

Virtual Infrastructure 3 is more than just a leap and a bound ahead of ESX 2.x and the competition. The options it now has are things we have been wanting and needing, so VMware has clearly listened to the feedback people gave. One frustrating thing, though, is that the disaster recovery aspect of VMware has not really changed. In fact I think it has been neglected. Let me explain.
Disaster recovery in ESX 2.x was no easy achievement if you wanted to set up failover to another site with minimal data loss. There are lots of different ways to build a disaster recovery setup with VMware, and below I will outline a few. If you are looking for ideas on how to do DR with ESX then this may help you.

The basic premise behind disaster recovery is to have the smallest amount of data loss and the smallest amount of downtime possible between the disaster and the restoration of service. Keep in mind that this is normally based on SLAs agreed with the business. Depending on the agreed SLAs, resourcing and budget constraints, you may choose different options. I have tried to list the most common setups used in a VMware environment.

Minimal Data Loss

So you have your disaster recovery site and you have, or can get, the disaster recovery hardware. The first thing to plan is how to back up and restore your data, and which plan is going to be acceptable under your SLAs.

One option is daily agent-based filesystem backups inside each VM. You then restore that backed-up data at your DR site into new VMs as if they were normal servers. That's a great idea, but it could take a while and requires a bit of fiddling depending on your backup and restore solution. While this approach is not ideal and could be time consuming, VMware technology at least gives you the exact same (virtual) hardware for your restore. Imagine trying to get hardware at a DR site that matches your aging fleet of physical servers. Good luck.

If you want to take it a step further, why not replicate a fresh copy of the VMDK files holding the data to your DR site? That is easy if the sites are connected with high speed links and data transfer is not expensive. If you do not have such links, why not take a backup of the VMDK files instead? You could do this hot in ESX 2.x using ESX Ranger from Vizioncore or by scripting it.
In ESX 3.x you could script it or use VMware Consolidated Backup to do the same.

Got large budgets and nice toys to play with? Then how about using hardware to do the replication to the DR site, either with snapshot technology or with SAN replication. The latter is the better option for minimal data loss. While expensive, this seems like the *ultimate* solution for minimal data loss. It can be a pig to configure, and it will be a pig to configure. If you know anyone who has set up SAN replication, ask them about it and then remember to duck. They will be throwing things at you! Once configured though, it is a fantastic solution.

I have the data, so now what?

"I have the process and the data at the DR site. How should I restore the VMs, and what about my VM configuration?" I hear you say.

First things first: get your networking sorted out. Decide whether the DR site will use the same subnet or a different one. Have the disaster recovery site prepared and ready to go, with connectivity to your remote sites and to the internet for mail, web and any other online services you need. Then get your ESX hosts set up properly at the DR site. Remember, if you are on a different subnet then you will need a different ESX farm (2.x) or ESX cluster (3.x) at the DR site.

To keep your virtual machine configurations you will need to copy the .vmx files to the disaster recovery site as well. In ESX 2.x these were stored on the local ESX host; in ESX 3.x they are stored on shared storage if you have it. Identify where these .vmx files are and back them up or replicate them to the DR site. This way you have not only your data files but your config files as well. In the event of a disaster you just need your .vmx files in the correct location, and slight tweaks to them will point each VM at the correct disk location for its .vmdk files.
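The .vmx tweak described above can be scripted. What follows is only a minimal sketch, not a tested recovery tool. It assumes ESX 3.x style disk entries of the form `scsi0:0.fileName = "..."` holding an absolute path, and the function name `repoint_disks` and the datastore paths in the example are made up for illustration. Check it against your own .vmx files before relying on it.

```python
import re

# Matches lines such as:
#   scsi0:0.fileName = "/vmfs/volumes/prod-lun1/web01/web01.vmdk"
# Assumption: disks are referenced by "<device>.fileName" keys whose value
# ends in .vmdk. Adjust the pattern if your .vmx files look different.
DISK_LINE = re.compile(
    r'^(\s*[\w:.]+\.fileName\s*=\s*")([^"]*\.vmdk)(".*)$', re.IGNORECASE
)


def repoint_disks(vmx_text, old_prefix, new_prefix):
    """Return vmx_text with old_prefix swapped for new_prefix in .vmdk paths.

    Lines that are not disk entries, and disk paths that do not start with
    old_prefix (for example relative paths), are returned unchanged.
    """
    out = []
    for line in vmx_text.splitlines():
        m = DISK_LINE.match(line)
        if m and m.group(2).startswith(old_prefix):
            new_path = new_prefix + m.group(2)[len(old_prefix):]
            line = m.group(1) + new_path + m.group(3)
        out.append(line)
    result = "\n".join(out)
    # Keep the trailing newline if the input had one.
    if vmx_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    # Hypothetical .vmx fragment; the paths are illustrative only.
    sample = "\n".join([
        'displayName = "web01"',
        'scsi0:0.present = "TRUE"',
        'scsi0:0.fileName = "/vmfs/volumes/prod-lun1/web01/web01.vmdk"',
        'scsi0:1.fileName = "web01_data.vmdk"',
    ])
    print(repoint_disks(sample, "/vmfs/volumes/prod-lun1/", "/vmfs/volumes/dr-lun1/"))
```

Relative disk paths are deliberately left alone, since they keep working as long as the .vmdk sits next to the .vmx file at the DR site.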
If your subnet is different you will also need to tweak the virtual machine operating system and DNS entries to reflect the IP address changes. You could use the vmxfile backup utility located here to do the .vmx file copy if you like.

So what is the ultimate solution?

In my mind the ultimate solution would have been for the VMware High Availability option of Virtual Infrastructure 3 to do this for us. High Availability works by detecting a host failure and restarting VMs elsewhere in the resource cluster based on rules. This is possible because the .vmx files and the .vmdk files are on shared storage. Where it fails in a true disaster situation is that the storage has to be in one place or another. So if you lose that shared storage, how can HA start the VMs on another host in the cluster? Simple: it can't, even if that host is in the cluster at another physical location on a high speed link. If the storage is gone, so are the virtual machines.

My mind wandered when I learnt about the High Availability option, and I thought about a cluster spread across two geographical locations, looking at two identical SANs kept in sync via hardware replication. Maybe there was a way to trick HA into failing over to a replica of the shared storage? Alas, at this stage there is not. I believe the HA option will in future come in two separate components: one that does the HA function as it is now, and one that does the true DR function that VMware is currently dancing around the edges of.

Customers have implemented .vmdk and .vmx file replication already, and some have even automated their own HA options with scripts so that when the machines fall over the DR site comes online. This is the true DR that we want from VMware, and they haven't delivered the automated solution that we know they are capable of.

So for now, if you are serious about implementing this then you need to sit down and plan.
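As a starting point for that planning, here is a rough sketch of the failure-detection half of a home-grown failover script. It is an illustration under my own assumptions, not something VMware ships. The function name `should_fail_over`, the threshold of three missed checks, and the idea of feeding it a list of check results are all hypothetical. How you probe the primary site, and what "bring the DR site online" actually does, are left out on purpose.

```python
def should_fail_over(heartbeats, max_missed=3):
    """Return True once the most recent `max_missed` checks have all failed.

    heartbeats is a list of booleans, oldest first: True means the primary
    site answered that check, False means it did not. Requiring several
    consecutive misses avoids failing over on a single dropped check.
    """
    if max_missed <= 0:
        raise ValueError("max_missed must be positive")
    recent = heartbeats[-max_missed:]
    # Not enough history yet, or at least one recent check succeeded.
    return len(recent) == max_missed and not any(recent)


if __name__ == "__main__":
    history = [True, True, False, False, False]
    if should_fail_over(history):
        print("primary site looks down: start the DR procedure")
    else:
        print("primary site still answering: do nothing")
```

The threshold is the part worth arguing about with the business: too low and a flaky link triggers a failover, too high and you eat into the downtime your SLA allows.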
There are other options out there, and products that will do part or all of this for you with some integration and tweaking. Whichever way you choose to do it, make sure you have fun, because DR always is! |