
Data Protection in Data Centers

Natural disasters such as tsunamis, strong earthquakes, volcanic eruptions, or extreme floods are rare, but a large fire, a damaged power line, or a power outage can just as completely paralyze a data center. To keep IT systems running even under extreme conditions, businesses rely on so-called metro clusters, or stretched clusters, whose nodes are distributed across two or more sites.

High availability is always achieved through redundancy. This also holds when preparing for extreme situations, in which an entire data center must be protected against power supply failures or natural disasters. If one data center goes down, the geographically dispersed cluster should switch automatically, without interrupting the workflow, to the second and, if necessary, a third data center. In essence, it is nothing more than a local cluster stretched across two or three sites, with locally mirrored storage systems.

In a geographically distributed cluster, each site must have its own storage layer, which itself follows the high-availability principle: it is a cluster of two nodes. This cluster provides storage space for the service nodes. The current data is mirrored between the two sites, and the two storage clusters together form one four-node geographically distributed cluster.

Geographically dispersed clusters can be organized in such a way that no single point of failure remains. When hardware fails, at whatever level, there is then no need to switch manually between sites. The great advantage of such a solution is that when a problem occurs, the switchover happens transparently and without any administrator intervention. If only asynchronous data replication were used instead, a person would still have to decide to take emergency measures, which would lead to significant delays. It would also require an emergency plan that clearly states how and when to carry out the switchover. Automating this process ensures continuous operation of all applications.

In addition to protecting enterprise data, distributing a cluster across sites has another important advantage: a metro cluster does not need to be stopped to refresh its hardware or software. Moreover, metro clusters are quite simple to implement and operate. However, the connection between the sites must have very low latency, because long delays degrade the performance of the entire system. Since latency grows with distance, the data centers must be no more than 50 km apart.
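To get a feel for why distance matters, the best-case propagation delay over optical fiber can be estimated. The sketch below assumes a signal speed in fiber of roughly 200,000 km/s (about 5 microseconds per kilometer) and ignores switching equipment and cable detours, both of which add to the real figure. The function name is illustrative only.

```python
# Rough propagation-delay estimate for a link between two sites.
# Assumption: signal speed in optical fiber is about 200,000 km/s
# (roughly two thirds of the speed of light in vacuum).
FIBER_SPEED_KM_PER_S = 200_000.0


def round_trip_ms(distance_km: float) -> float:
    """Return the best-case round-trip propagation delay in milliseconds."""
    one_way_s = distance_km / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1000.0


if __name__ == "__main__":
    for km in (10, 50, 200):
        print(f"{km:>4} km: {round_trip_ms(km):.2f} ms round trip")
```

Under these assumptions a 50 km link costs at least about 0.5 ms per round trip, so any write that waits for an acknowledgement from the remote site is slowed by at least that much.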

A metro cluster is therefore beneficial to enterprises that either occupy a very large area or have branches separated by no more than 50 km. This is also why the concept is not widespread: in most cases the distance between a company's branches is much greater, which rules out a metro cluster. In all other cases, companies can raise the availability of their systems to a new, previously unattainable level with a small investment.

Scenarios of System Failure:

Every geographically dispersed cluster has many weak points that can paralyze the system. The main task is therefore to provide an automatic fallback for each possible case, so that applications are not disrupted.

Here are some of the most important failure scenarios and their possible consequences:

Hard Drive Failure:

In this case, there are usually no negative consequences for the continued operation of the systems. The administrator can hot-swap the failed drive, after which the data is resynchronized automatically.

Failure of Important Disk Shelf Components:

If a SAS (Serial Attached SCSI) cable, a SAS HBA (Host Bus Adapter), or a SAS expander fails, multipathing in the storage nodes ensures continuous operation of all services. In this case too, the administrator can quickly replace the defective components.

Failure of an Entire Disk Shelf:

The hard drives of each RAID-Z2 array are distributed across JBOD (Just a Bunch of Disks) enclosures in such a way that even the complete failure of one JBOD enclosure can be survived without data loss. When that enclosure comes back online, only the data that has changed in the meantime is resynchronized. All services therefore continue to run without downtime or a significant drop in performance.
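The distribution rule can be checked mechanically. RAID-Z2 tolerates the loss of two disks per array, so an array survives the loss of a whole shelf only if no shelf holds more than two of its disks. The following is a minimal sketch of that check; the layout format and all names are hypothetical and not taken from any ZFS tool.

```python
from collections import Counter

# RAID-Z2 keeps two parity blocks per stripe, so each array (vdev)
# survives the loss of at most two of its member disks.
RAIDZ2_MAX_LOST_DISKS = 2


def survives_any_shelf_loss(vdev_layout: dict[str, list[str]]) -> bool:
    """Check a hypothetical layout: vdev name -> shelf name of each disk.

    Returns True if no single shelf holds more than two disks of any vdev.
    """
    for shelves in vdev_layout.values():
        disks_per_shelf = Counter(shelves)
        if any(n > RAIDZ2_MAX_LOST_DISKS for n in disks_per_shelf.values()):
            return False
    return True


if __name__ == "__main__":
    spread = {"raidz2-0": ["jbod1", "jbod1", "jbod2", "jbod2", "jbod3", "jbod3"]}
    packed = {"raidz2-0": ["jbod1", "jbod1", "jbod1", "jbod2", "jbod2", "jbod2"]}
    print(survives_any_shelf_loss(spread))  # layout spread over three shelves
    print(survives_any_shelf_loss(packed))  # three disks of one vdev per shelf
```

In the first layout, losing any shelf removes two disks from the array, which RAID-Z2 can absorb. In the second, losing a shelf removes three, which it cannot.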

Failure of a Storage Node:

If an entire storage node fails, its duties pass within a few seconds to the second server at the same site. The service nodes above it see a brief interruption in the I/O flow, but applications are not affected, because the data is always mirrored to the second site as well.

Failure of a Service Node:

If an entire service node running the ZFS file system fails, there is a brief interruption, lasting a few seconds, of the I/O streams to and from applications. The switchover time depends on the number of services in use, such as NFS shares, CIFS shares, or iSCSI targets, and not on the volume of data. One feature that distinguishes ZFS from other file systems and storage systems is that it never requires a full file system check. For application servers the switchover is transparent; if Fibre Channel is used, they need an operating system multipath driver that supports Asymmetric Logical Unit Access (ALUA), which in many cases is a standard feature. The cluster is configured so that on failure, services are first transferred to neighboring nodes, and a switch to the geographically remote site happens only if the local site has failed completely.
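The failover order described here, neighboring nodes first and the remote site only as a last resort, can be sketched as a simple selection rule. This is an illustration under simplified assumptions, not how any particular cluster manager works; the `Node` type and `pick_failover_target` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    site: str
    healthy: bool


def pick_failover_target(failed: Node, nodes: list[Node]) -> Optional[Node]:
    """Prefer a healthy node at the failed node's site, else a remote one."""
    candidates = [n for n in nodes if n.healthy and n.name != failed.name]
    local = [n for n in candidates if n.site == failed.site]
    if local:
        return local[0]
    # No healthy local node is left, so any remaining candidate is remote.
    return candidates[0] if candidates else None


if __name__ == "__main__":
    a1 = Node("a1", "site-A", healthy=False)
    nodes = [
        a1,
        Node("a2", "site-A", healthy=True),
        Node("b1", "site-B", healthy=True),
        Node("b2", "site-B", healthy=True),
    ]
    print(pick_failover_target(a1, nodes))
```

With a healthy neighbor available, the services of `a1` stay on site A. Only when every node at site A is down does the rule return a node at site B.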

Unavailability of the Entire Site:

In the worst case, an entire site may fail. Only in this situation does the geographically dispersed cluster use its redundancy at the data center level for failover: the systems at the second site take over all services. Application servers thus retain access to all services, albeit with only half of the service nodes and therefore with limited performance. Because in this scenario no data is mirrored, read, or written between the geographically separated sites, latency decreases. In operation, database performance, for example, is often even better. Only the data modified during the downtime needs to be transferred, so once the local problems at the affected data center are resolved, it can quickly return to normal operation.
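The idea of transferring only the data modified during the downtime can be illustrated with a toy model. It is a sketch of the general principle using a set of changed block numbers, not a description of how ZFS tracks changes internally; the class and its attributes are hypothetical.

```python
class MirroredStore:
    """Toy two-site mirror that remembers which blocks changed in an outage."""

    def __init__(self) -> None:
        self.local: dict[int, bytes] = {}
        self.remote: dict[int, bytes] = {}
        self.remote_online = True
        self.dirty: set[int] = set()  # blocks written while remote was down

    def write(self, block: int, data: bytes) -> None:
        """Write locally; mirror at once if possible, else mark as dirty."""
        self.local[block] = data
        if self.remote_online:
            self.remote[block] = data
        else:
            self.dirty.add(block)

    def resync(self) -> int:
        """Bring the remote site back and copy only the dirty blocks."""
        copied = len(self.dirty)
        for block in self.dirty:
            self.remote[block] = self.local[block]
        self.dirty.clear()
        self.remote_online = True
        return copied


if __name__ == "__main__":
    store = MirroredStore()
    store.write(1, b"a")
    store.write(2, b"b")
    store.remote_online = False      # the remote site fails
    store.write(2, b"c")
    store.write(3, b"d")
    print(store.resync())            # number of blocks copied on recovery
    print(store.remote == store.local)
```

The recovery cost in this model depends on how much changed during the outage, not on the total amount of stored data, which is why the affected site can return to normal operation quickly.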
