Maintaining Business Continuity with Virtualization
Added 13th Jan 2010
Article Highlights
- 51% of Indian CIOs say that business continuity drives their information security spending. Source: CIO Research
- No matter how you decide to do it, your storage configuration — whether it involves SAN or host-based replication — is the most important part of the warm site design and should not be treated lightly.
EMC VMware's ESX 3.0 was released a bit more than three years ago. While ESX 2.5 was a solid virtualization platform, ESX 3.0 seemed to push server virtualization into the realm where a lot of small and large businesses alike could really sink their teeth into it. The new high-availability features in ESX 3.0 were a huge draw to many businesses seeking better uptime, and the refined centralized management offered by VirtualCenter 2.0 was compelling. Support for a wider set of hardware such as iSCSI SANs also allowed high-end functionality at a lower price.
Now that we're three years down the road, many of these initial adopters of ESX 3.0 are starting to replace their hosts with new ones and preparing to upgrade to vSphere 4.0. That seems to be leaving a lot of server admins staring at a stack of three-year-old virtualization hosts that aren't yet finished doing their jobs. Sure, they might not be quite fast enough to go the distance with increased production loads, and you might like to have some more performance headroom, but it's always a painful decision to turn off a bunch of expensive servers and not do anything with them.
Instead of tossing their old hosts in a dumpster, many enterprises are opting to reuse them. Some turn them into development clusters to separate development loads from production loads. Some make them available for testing and training. And some use them as the seed hardware for a warm site. Even if the old hardware can't run your entire production load at full capacity, having some immediately available production capability in a production site failure scenario is better than none - and it bridges the gap between the time of the disaster and the time that you can get replacement hardware on site.
Assuming that business continuity is important to your organization and you have multiple offices or a sufficiently large campus, building a warm site is a great use of your old hardware. It certainly isn't free and there are a number of common pitfalls that you'll want to steer clear of, but it's definitely a worthy endeavor if downtime costs you money. Here's how to do it.
Define the Service Level
First, you need to define the level of service you want to grant with your warm site. Do you want to protect all of your machines or just a subset? How quickly do you want to be able to recover (RTO)? How old can your data be when you do recover (RPO)? Your answers to these questions may change as you work through the design process and start attaching price tags to varying levels of service, but you should never let what you can afford directly drive what you provide. It may be that, to be useful, a warm site would cost more than you can currently afford to spend on it. In that case it's better to save your pennies and do it correctly than to implement something that won't accomplish your organization's goals.
Assess Your SAN Situation
The SAN is the first piece of hardware that needs to be looked at, as it tends to be the most expensive. If possible, using asynchronous SAN-to-SAN replication is the best way to implement a warm site. Depending on the SAN platform in use, such replication might simply be impossible or uneconomical. For example, if you run a Fibre Channel SAN with no iSCSI connectivity and don't have the tremendous luck to have dark fiber running to your warm site, implementing SAN replication might be out of the question without hardware such as an FCIP gateway or software such as EMC's RepliStor. If you're in this boat, be sure to consider these factors the next time you are weighing an upgrade to or replacement of your current SAN. On the other hand, users of devices such as NetApp filers need only add SnapMirror licensing, and users of Dell EqualLogic PeerStorage arrays have everything they need already. No matter what your SAN, to perform SAN-to-SAN replication, you're going to need a second one.
If performing SAN-to-SAN replication is out of the question, you still have options. There are several good host-based replication software packages available that will run on the ESX hosts and do direct host-to-host replication. These include Vizioncore vReplicator and NSI DoubleTake for VI. They are usually licensed per VM rather than per host, which can make them unattractive depending upon the number of guests you want to replicate. The big caveat here is that you will need a large amount of directly attached storage on the old hosts that are being moved across to the warm site. (If they had been attached to your production SAN, they may no longer have any disks in them.)
No matter how you decide to do it, your storage configuration - whether it involves SAN or host-based replication - is the most important part of the warm site design and should not be treated lightly.
Figure Your Bandwidth Needs
Once you've determined what the storage is going to be at your warm site, you need to consider how you're going to get your data there. If your warm site is on your campus or otherwise fiber-attached, there's not much to worry about unless your data sets are truly massive. Although the network connectivity to your warm site is probably the most straightforward of all of the decisions that you need to make, it can easily blow your budget, since WAN bandwidth generally has a recurring monthly cost. Failing to properly estimate the required WAN bandwidth can have disastrous long-term budgetary consequences.
For example, let's say your initial calculations show that you're going to need two T1s' worth of bandwidth (3.0Mbps) to replicate an estimated 25GB of storage deltas per 24-hour period to maintain whatever RPO you've set. But it turns out you actually need to move 35GB per day to meet that RPO - a difference of roughly one more T1 circuit. Depending on your bandwidth costs, that small difference could cost as much as an entirely new SAN or a few new virtualization hosts over three years' time. So if estimating your replication bandwidth needs is so important, there must be a tried-and-true way of doing it, right? Not really. There are some tricks to determine how much data is turning over on your VMs, but you can't always trust what they tell you.
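The arithmetic behind the T1 example can be sketched in a few lines of Python. This is a rough illustration only: it assumes decimal gigabytes, a replication stream spread evenly over the full 24 hours, and no protocol overhead or compression, none of which will hold exactly on a real link. The function names are made up for this sketch.

```python
import math

T1_MBPS = 1.544  # nominal line rate of a single T1 circuit


def required_mbps(gb_per_day: float, window_hours: float = 24.0) -> float:
    """Average throughput (Mbps) needed to move gb_per_day within window_hours.

    Assumes decimal GB (10**9 bytes) and a perfectly even transfer rate.
    """
    bits = gb_per_day * 1e9 * 8
    return bits / (window_hours * 3600) / 1e6


def t1_circuits(mbps: float) -> int:
    """Number of whole T1 circuits needed to carry the given average rate."""
    return math.ceil(mbps / T1_MBPS)


for gb in (25, 35):
    rate = required_mbps(gb)
    print(f"{gb} GB/day needs about {rate:.2f} Mbps, or {t1_circuits(rate)} T1(s)")
```

Under these assumptions, 25GB per day averages out to about 2.3Mbps, which fits in two T1s, while 35GB per day needs about 3.2Mbps and pushes you to a third circuit.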
The first and easiest method is to use VMware's built-in snapshot functionality. Take a snapshot of every VM you want to replicate, wait a period of time equal to what you'd like your replication period to be based on your RPO, then examine the snapshot files on your VMFS volumes to see how big they are. (Note: Be sure you have enough free space on your VMFS volumes before you do this.) That figure is roughly how much data has changed on those VMs in that period. If you do this at several times during different parts of your production day and month, you should get a reasonably good idea of how quickly your data is changing.
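Adding up the snapshot files by hand gets tedious across many VMs, so a small script can help. The sketch below assumes the datastore directories are readable from a machine with Python 3 and that snapshot delta files follow the `*-delta.vmdk` naming pattern; both are assumptions you should verify in your own environment. The linear extrapolation to a daily figure is also a simplification, since change rates vary through the day.

```python
from pathlib import Path


def snapshot_delta_bytes(datastore_root: str) -> int:
    """Sum the sizes of all snapshot delta files found under datastore_root.

    The '*-delta.vmdk' pattern is an assumption about how the delta
    files are named; adjust it to match what you see on your volumes.
    """
    root = Path(datastore_root)
    return sum(p.stat().st_size for p in root.rglob("*-delta.vmdk") if p.is_file())


def daily_change_gb(delta_bytes: int, hours_elapsed: float) -> float:
    """Extrapolate a measured delta linearly to a 24-hour figure in decimal GB."""
    return delta_bytes / 1e9 * (24.0 / hours_elapsed)
```

For instance, calling `snapshot_delta_bytes` on a datastore path four hours after taking the snapshots and passing the result to `daily_change_gb(..., 4)` gives a rough daily change estimate for that sample. Repeating it at different times of the day and month shows how much the figure swings.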
However, that's not all there is to it. Depending on your SAN platform, your SAN may replicate data in larger blocks than VMware's snapshot files allocate. Thus, a single change of a 1KB file within a VM may be seen as a change to a 16MB block on your SAN - essentially magnifying the amount of data that needs to move by roughly 16,000 times. This magnitude difference would be a fairly rare occurrence, but it shows that you can't easily predict actual data volumes based on snapshots.
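To make the block-size effect concrete, here is a minimal Python sketch that counts how many replication blocks a set of writes touches. It assumes the SAN re-sends every block containing at least one changed byte, which is a simplified model and not a description of any particular vendor's replication engine; the function name and the (offset, length) write format are invented for the illustration.

```python
def replicated_bytes(writes, block_size):
    """Bytes a block-granular replicator would send for the given writes.

    writes is an iterable of (offset, length) pairs in bytes. Every block
    touched by any write is counted once, at its full block_size.
    """
    touched = set()
    for offset, length in writes:
        if length <= 0:
            continue
        first = offset // block_size
        last = (offset + length - 1) // block_size
        touched.update(range(first, last + 1))
    return len(touched) * block_size


KB = 1024
MB = 1024 * KB

one_small_write = [(0, 1 * KB)]
sent = replicated_bytes(one_small_write, 16 * MB)
print(f"1 KB changed, {sent // MB} MB replicated, amplification {sent // KB}x")
```

In this model, a single 1KB write costs a full 16MB block, while several small writes that land in the same block cost no more than one. How your real change pattern maps onto blocks is exactly what snapshot sizes cannot tell you.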
To combat this problem and generally increase the amount of data your WAN can carry, using some form of WAN accelerator that includes de-duplication technology is a wise move. Examples of such products include Cisco's WAAS and Riverbed's Steelhead. Both platforms have their own strengths and weaknesses, but they operate in much the same way. They optimize the WAN data flow through intelligent re-windowing and other TCP enhancements, but they also retain a remote cache of what has previously been sent over the WAN link.
In the event that they get a cache hit (a packet that has the same data payload as one seen previously), that packet is not re-sent. Instead, just a pointer to that packet's payload is sent to the device on the other end of the circuit. In the example of a 1KB change requiring 16MB of data transmission, a WAN accelerator could essentially nullify the problem.
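The cache-and-pointer idea can be illustrated with a toy Python model. This is only a sketch of the general principle described above, not how WAAS or Steelhead are actually implemented; the class names, the use of SHA-256 as the pointer, and the whole-payload granularity are all assumptions made for the example.

```python
import hashlib


class DedupSender:
    """Sends a payload in full the first time and a short digest afterwards."""

    def __init__(self):
        self._seen = set()

    def encode(self, payload: bytes):
        digest = hashlib.sha256(payload).digest()
        if digest in self._seen:
            return ("ref", digest)  # cache hit: send only the pointer
        self._seen.add(digest)
        return ("data", payload)  # cache miss: send the full payload


class DedupReceiver:
    """Rebuilds payloads, keeping a cache keyed by the same digest."""

    def __init__(self):
        self._cache = {}

    def decode(self, message):
        kind, body = message
        if kind == "data":
            self._cache[hashlib.sha256(body).digest()] = body
            return body
        return self._cache[body]  # look the payload up by its pointer


sender, receiver = DedupSender(), DedupReceiver()
block = b"\x00" * (64 * 1024)  # stand-in for mostly unchanged replicated data

first = sender.encode(block)
second = sender.encode(block)
assert receiver.decode(first) == block
assert receiver.decode(second) == block
print(f"first send: {len(first[1])} bytes, repeat send: {len(second[1])} bytes")
```

The second transmission carries only a 32-byte digest instead of the full payload, which is why mostly unchanged 16MB blocks become cheap to replicate once the remote cache has seen them.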