Why model cloud system reliability?

How often have you had services such as your online e-mail account, streaming video, or social profile page hang on you? Chances are you have experienced at least one situation where you could not navigate smoothly through these services, or worse, lost data stored in the cloud. That brings me to the crux of this blog: why, and how, is reliability modeling of cloud systems crucial to a robust and uninterrupted user experience?

Under the hood of cloud services lies a complex interaction of software and bare-metal hardware tailored to an application (cold storage, compute storage, etc.). The hard truth is that both hardware devices (hard drives, servers, switches, networks, etc.) and software (the OS, Swift, etc.) fail, hopefully gracefully, at some point. There is also a chance of rare external events, such as power outages and inclement weather, contributing to system failures. All of these events can degrade system performance and lead to downtime, which often translates directly into frustrated customers and lost revenue for the service provider. It is therefore very important for a data center (DC) architect to make the cloud services as immune to these calamities as possible.

Cloud reliability models can be invaluable to a DC architect trying to design a resilient system. These models provide estimates of hardware system reliability, system availability, data durability, data availability, the number of annual replacements, and replacement costs. DC architects can quickly vary hardware and software parameters (number of drives, types of drives, replication/erasure coding, storage zones, network layouts, risk choices, etc.) to meet their quality-of-service goals. In the Cloud Modeling and Analytics team at Seagate, we develop detailed system-level reliability models. One such model is the storage-node-based Markov chain data availability model for OpenStack Swift object storage (Figure 1). This model estimates the probability that the system is in an unavailable state by accounting for node failures, switch failures, node rebuilds, and node reboots.

Figure 1: Schematic of the Markov Chain data availability model for a configuration with three storage servers per rack, incorporating the effects of switch failures, node failures, node rebuilds, and node reboots.
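To make the idea concrete, here is a minimal sketch of a continuous-time Markov chain availability model in the same spirit. It is not the actual model in Figure 1: the state space is collapsed to four states, and the failure, rebuild, and reboot rates are illustrative assumptions rather than Seagate parameters.

```python
# Minimal sketch of a continuous-time Markov chain (CTMC) availability model,
# in the spirit of Figure 1. The state space and rates below are illustrative
# assumptions, not the actual Figure 1 model or measured parameters.
import numpy as np

# Hypothetical rates, in events per hour.
lam_node   = 1.0 / 8760.0   # node failure rate   (~1 failure per node-year)
lam_switch = 1.0 / 17520.0  # switch failure rate (~1 failure per 2 years)
mu_rebuild = 1.0 / 8.0      # node rebuild rate   (~8 h per rebuild)
mu_reboot  = 4.0            # switch reboot rate  (~15 min per reboot)

# States for one rack with three storage nodes behind a single switch:
#   0: all nodes up          1: one node down (rebuilding)
#   2: two or more nodes down 3: switch down
# Data is treated as unavailable in states 2 and 3.
Q = np.array([
    [-(3*lam_node + lam_switch), 3*lam_node,                            0.0,                       lam_switch],
    [ mu_rebuild,              -(mu_rebuild + 2*lam_node + lam_switch), 2*lam_node,                lam_switch],
    [ 0.0,                       mu_rebuild,                          -(mu_rebuild + lam_switch),  lam_switch],
    [ mu_reboot,                 0.0,                                   0.0,                      -mu_reboot],
])

# Steady-state distribution pi solves pi @ Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(Q.shape[0])])
b = np.append(np.zeros(Q.shape[0]), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

unavailability = pi[2] + pi[3]
print(f"Estimated steady-state unavailability: {unavailability:.2e}")
```

The unavailability estimate is simply the steady-state probability of the "bad" states; a richer model such as the one in Figure 1 would track individual node and switch states rather than lumping them together.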

Hopefully, I have convinced you of the value that cloud reliability models add. Now, let me let you in on a little secret: an additional exciting benefit of these models is the prospect of real-time reliability monitoring and health prediction for a DC. Live environmental data (e.g., temperature) and system-level data (e.g., workload) collected from a DC can be integrated with these models to estimate its current health. Furthermore, the interdependencies between components (e.g., drives) can be incorporated, and the models refined on the fly based on monitoring data, to provide accurate health predictions. To learn more about these exciting applications, stay tuned for my next blog on real-time cloud reliability analytics and prediction.
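As a small preview of that idea, here is a toy sketch of one way live telemetry could feed a model: scaling a baseline failure rate by an Arrhenius-style temperature acceleration factor and then re-solving a model like the CTMC above. The activation energy, reference temperature, and baseline rate are assumptions made purely for illustration.

```python
# Sketch of folding live telemetry into a reliability model: scale a baseline
# node failure rate with an Arrhenius-style temperature acceleration factor.
# The activation energy and reference temperature are illustrative assumptions.
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def accelerated_rate(base_rate, temp_c, ref_temp_c=25.0, activation_energy_ev=0.5):
    """Return the failure rate accelerated for operation at temp_c (Celsius)."""
    t, t0 = temp_c + 273.15, ref_temp_c + 273.15
    return base_rate * math.exp((activation_energy_ev / BOLTZMANN_EV) * (1.0/t0 - 1.0/t))

base_lam_node = 1.0 / 8760.0  # baseline node failure rate, ~1 per node-year
hot_lam_node = accelerated_rate(base_lam_node, 40.0)
print(f"Acceleration at 40 C: {hot_lam_node / base_lam_node:.1f}x")
# With these assumed parameters, running 15 C hotter multiplies the failure
# rate by roughly 2.5x; feeding hot_lam_node back into the CTMC sketch above
# would update the rack's estimated unavailability in near-real time.
```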

Author: Ajaykumar Rajasekharan

March 18, 2014
