It is no overstatement to say that in the last five years virtualization has radically changed the landscape of IT infrastructure for the better. Workloads encapsulated into standardized virtual machines have significantly increased our ability to optimize and use physical resources in a way that saves considerable time and money. In addition to these economic benefits, new avenues have opened up to tackle data protection and disaster recovery, allowing us to increase service uptime while also reducing business risk. This white paper focuses on some of the common challenges experienced while implementing and using SAN-based replication for disaster recovery, and it examines an alternative approach to disaster recovery to help resolve these issues.
Pre-virtualization disaster recovery plans were underpinned by application-level features hooking directly into specific hardware to achieve the required business recovery goals. To ensure that disaster recovery could be achieved, network infrastructure, hardware, software and application data were replicated to an offsite location, commonly referred to as the Disaster Recovery (DR) site. Depending on an application’s required Recovery Point Objective (RPO) and Recovery Time Objective (RTO), costs could spiral upwards to achieve small improvements in either. When you increase the application uptime provided from 99.99% to 99.999%, the cost increase is not linear; it is exponential. With the advent of virtualization, the infrastructure stack gained a new layer, enabling the movement of workloads between geographically dispersed locations. Importantly, this is achieved without requiring application-specific engineering, because workloads are compartmentalized and encapsulated into virtual machines. In a virtual machine, everything needed to support that workload can be encapsulated into a set of files in a folder and moved as a single contiguous entity.
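To put the uptime figures above in perspective, the short Python sketch below converts an availability percentage into a yearly downtime budget. It says nothing about cost, which depends on the environment; it only shows that each additional "nine" cuts the permitted downtime by a factor of ten, from roughly 52.6 minutes per year at 99.99% to roughly 5.3 minutes at 99.999%. The function name and the 365-day year are my own assumptions for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for three common availability targets.
for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```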
Scope and Definitions
The virtualization and storage layers are examined below; i.e., virtual machines (VMs), hypervisors and storage area networks (SANs). Application-level replication is beyond the scope of this document.
There are many potentially overlapping terms, which people often interpret differently. For the purposes of this paper, I will use the terms “Continuous Data Protection (CDP),” “synchronous,” “fault tolerant,” “asynchronous” and “high availability” as described in the paragraphs that follow.
CDP consists of synchronous replication, which in turn involves double-writing to two different devices: an application (or hypervisor) receives confirmation of a successful write to storage only when both devices have acknowledged completion of the operation. CDP can help you achieve a zero RPO and RTO but requires strict hardware compatibility at both source and destination sites. This allows you to deploy VMs in a cross-site, fault-tolerant configuration, so if you have an infrastructure problem, you can fail over to the DR site without any downtime.
Synchronous solutions are expensive and require a lot of network bandwidth but are appropriate for some mission-critical applications where no downtime or data loss can be tolerated. One issue with synchronous replication is that data is transferred to the DR site in real time. This means that if the disaster is driven by some kind of data corruption, malware or virus, then the problem that brings down the production site simultaneously does the same to the DR site. This is why synchronous implementations should always be combined with an asynchronous capability.
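The double-write acknowledgement and the corruption risk described above can be sketched in a few lines of Python. This is a toy in-memory model under my own assumptions, not any vendor's implementation: the `Device` class and `synchronous_write` function are hypothetical names, and a real array would also have to handle the case where one side succeeds and the other fails.

```python
class Device:
    """Hypothetical in-memory stand-in for a storage device at one site."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}  # block_id -> data

    def write(self, block_id, data):
        """Store the block and return True, or return False if the device is down."""
        if not self.healthy:
            return False
        self.blocks[block_id] = data
        return True


def synchronous_write(primary, replica, block_id, data):
    """Acknowledge the write only if BOTH devices confirm it."""
    primary_ok = primary.write(block_id, data)
    replica_ok = replica.write(block_id, data)
    return primary_ok and replica_ok


production = Device("production")
dr_site = Device("dr")

# A normal write is acknowledged once both sites hold the block.
acknowledged = synchronous_write(production, dr_site, 7, b"good data")

# A corrupting write is replicated just as faithfully: both copies are now bad.
synchronous_write(production, dr_site, 7, b"corrupted")
both_corrupted = production.blocks[7] == dr_site.blocks[7] == b"corrupted"
```

The second write is the reason a synchronous copy alone does not protect against corruption or malware: the DR site has no earlier point in time to fall back to.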
This paper is primarily concerned with asynchronous replication of virtual infrastructures for disaster recovery purposes. An asynchronous strategy takes a point-in-time copy of a portion of the production environment and transfers it to the DR site in a time frame that matches the required RPO/RTO goals. The transfer may be “near real-time/near CDP” or scheduled (hourly, daily, etc.). This is more akin to high availability than fault tolerance. High availability in virtual environments refers primarily to having cold standby copies of VMs that can be powered on and booted in the event that the live production VM is lost. This approach underpins most currently implemented DR strategies.
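As a rough way to relate a replication schedule to an RPO, the sketch below uses a simplified model of my own: copies are taken at a fixed interval, each takes a fixed time to reach the DR site, and transfers never fall behind. Under those assumptions, just before a new copy lands the DR site is still holding the previous one, so the worst-case data loss is the interval plus the transfer time. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Oldest the DR copy can be, just before the next copy finishes arriving."""
    return interval + transfer_time


def meets_rpo(interval: timedelta, transfer_time: timedelta, rpo: timedelta) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(interval, transfer_time) <= rpo


# An hourly schedule with a 10-minute transfer does not satisfy a 1-hour RPO...
hourly_ok = meets_rpo(timedelta(hours=1), timedelta(minutes=10), rpo=timedelta(hours=1))

# ...whereas a 15-minute schedule with a 5-minute transfer does.
frequent_ok = meets_rpo(timedelta(minutes=15), timedelta(minutes=5), rpo=timedelta(hours=1))
```

The point of the model is that the replication interval alone is not the RPO; the time needed to move each copy across the link has to be counted as well.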
The next section examines how SAN technologies approach asynchronous replication and the differences between SAN-level and VM-level strategies to achieve DR objectives.
SAN Replication Overview
SAN devices are typically engineered to aggregate disk resources to deal with large amounts of data. In recent years, additional processing power has been built into the devices to offload processing tasks from hosts serving up resources to the virtual environment. The basic unit of management for a SAN device is a Logical Unit Number (LUN). A LUN is a unit of storage, which may consist of several physical hard disks or a portion of a single disk. There are several considerations to balance when specifying a LUN configuration. One LUN intended to support VMs running Tier-1 applications may be backed by high-performance SSDs, whereas another LUN may be backed by large, inexpensive disks and used primarily for test VMs. Once created, LUNs are made available to hypervisors, which in turn format them to create volumes; e.g., Virtual Machine File System (VMFS) on VMware vSphere and Cluster Shared Volume (CSV) on Microsoft Hyper-V. From this point on, I will use the terms “LUN” and “volume” interchangeably. A LUN can contain one or more VMs.
For SAN devices, the basic mechanism for creating a point-in-time copy of VM disk data is the LUN snapshot. SANs are able to create LUN-level snapshots of the data they are hosting. A LUN snapshot freezes the entire volume at the point it is taken, while read and write operations continue uninterrupted against another area of the array.
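The behaviour of a LUN snapshot can be illustrated with a small in-memory model. This is a sketch under my own assumptions rather than a description of how any particular array works: the `Volume` class is hypothetical, and real SANs implement snapshots at the block level using techniques such as copy-on-write or redirect-on-write. Here, taking a snapshot freezes a copy of the block map, and later writes replace entries in the live map without touching the frozen one.

```python
class Volume:
    """Hypothetical in-memory model of a LUN that supports snapshots."""

    def __init__(self):
        self.live = {}       # block_id -> data currently visible to hosts
        self.snapshots = []  # frozen block maps, oldest first

    def write(self, block_id, data):
        """Writes always go to the live map, never to a snapshot."""
        self.live[block_id] = data

    def read(self, block_id):
        return self.live.get(block_id)

    def snapshot(self):
        """Freeze the whole volume: every block of every VM on it."""
        frozen = dict(self.live)  # copies the map, not the underlying data
        self.snapshots.append(frozen)
        return frozen


lun = Volume()
lun.write("vm1-disk", b"vm1 state A")
lun.write("vm2-disk", b"vm2 state A")

snap = lun.snapshot()                  # point-in-time copy of the entire volume
lun.write("vm1-disk", b"vm1 state B")  # production I/O continues afterwards
```

After the last write, `snap` still holds the "state A" data for both VMs while `lun.read("vm1-disk")` returns the newer data. Note that the snapshot captures every VM on the volume, whether or not each one needed protecting at that moment.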