How To Optimize Datacenter Performance with Numerical Models

The primary purpose of the modern datacenter is to house clusters of computer systems running cloud services. Most cloud services are focused on the storage and retrieval of data objects such as web pages, photos, or movies. These data objects range in size from one-kilobyte text files all the way up to multi-terabyte datasets uploaded for long-term archival.

In an ideal world, these clusters of computers would have all the storage, network, and computation capacity required for any application. But ours is not an ideal world, and system designs face practical constraints such as limited budgets, energy usage, and cooling and space allowances. Therefore, computer clusters must be designed to fit these constraints while still being optimized for the target application.

Depending on the type and use of the data, certain levels of performance are expected, which dictate minimum hardware requirements. There are some obvious rules of thumb that can be followed when designing a system. For example, a system that serves email should be very responsive, so more expensive, faster storage may be the best choice. However, a long-term data archival system does not need to be as responsive, and because capacity matters more, less expensive, slower, higher-capacity storage is probably the better choice.

Simple rules of thumb can be helpful in architecting optimized systems, but they are not precise. Companies would prefer not to build a computer system only to find out later that it does not perform to the required specifications. Or worse, to find out that they spent too much money on an under-utilized system! Building small-scale prototypes can have some predictive power, but some performance issues only manifest at full scale. This is where numerical models of datacenter performance are useful. Models can run scenarios on many different hardware configurations without wasting time or money, and they can provide specific performance predictions, which simple rules of thumb cannot.

Numerical modeling can take several forms. At its simplest, a model compares the peak capacities of the various parts of a computer cluster. For example, this kind of analysis can quickly determine whether the hard drives or the network is likely to be the bottleneck for a storage system. More precise models use statistical distributions of system load and response times to estimate performance. This kind of analysis can give some idea of the most likely performance of a system under realistic loads. Models can also be highly detailed, event-based simulations that track individual requests to the system throughout the fulfillment process. These models can expose specific constraints on the system that simpler models may miss.
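
To make that simplest form concrete, here is a minimal sketch in Python of a peak-capacity comparison. It is only an illustration of the idea, not a model of any particular system; the function and every hardware figure in it are hypothetical placeholders.

```python
# A minimal sketch of the simplest kind of model: compare the aggregate
# peak capacities of cluster components to find the likely bottleneck.
# All numbers are illustrative placeholders, not measurements.

def peak_bottleneck(num_drives, drive_mb_s, network_gb_s):
    """Return the component with the lowest aggregate peak throughput."""
    drive_total_mb_s = num_drives * drive_mb_s      # aggregate disk bandwidth
    network_total_mb_s = network_gb_s * 1000 / 8    # convert Gb/s to MB/s
    capacities = {
        "hard drives": drive_total_mb_s,
        "network": network_total_mb_s,
    }
    bottleneck = min(capacities, key=capacities.get)
    return bottleneck, capacities

# Hypothetical cluster: 100 drives at 150 MB/s each behind a 100 Gb/s uplink.
which, caps = peak_bottleneck(num_drives=100, drive_mb_s=150, network_gb_s=100)
print(caps)                  # {'hard drives': 15000, 'network': 12500.0}
print("bottleneck:", which)  # network
```

Even a comparison this crude can tell an architect whether faster drives or faster network links is the change that actually moves the bottleneck.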

There are many questions that can be answered using numerical models. What are the performance differences between ten thousand 7,200 RPM drives and fourteen thousand 5,400 RPM drives, and are the slower drives acceptable for my use case? What about between two types of 7,200 RPM drives, one of which is a hybrid drive? Would a multi-tiered archival storage solution provide enough speed and capacity at a lower cost than a single tier of storage? Should network links be made faster at increased cost, and if so, which links in particular? Similarly, would a slightly different network topology give better performance?
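
As a back-of-the-envelope illustration of the first question, the snippet below compares the two drive fleets using assumed per-drive figures (roughly 160 MB/s and 75 random IOPS for a 7,200 RPM drive, 120 MB/s and 55 IOPS for a 5,400 RPM drive). These numbers are plausible guesses rather than specifications, and a real decision would also weigh cost, power, space, and the workload's object-size mix.

```python
# Back-of-the-envelope comparison of two hypothetical drive fleets.
# Per-drive throughput and IOPS figures are assumptions, not specs.

fleets = {
    "10,000 x 7,200 RPM": {"count": 10_000, "mb_s": 160, "iops": 75},
    "14,000 x 5,400 RPM": {"count": 14_000, "mb_s": 120, "iops": 55},
}

for name, fleet in fleets.items():
    agg_gb_s = fleet["count"] * fleet["mb_s"] / 1000  # aggregate sequential throughput
    agg_iops = fleet["count"] * fleet["iops"]         # aggregate random operations/sec
    print(f"{name}: {agg_gb_s:,.0f} GB/s sequential, {agg_iops:,} IOPS")

# With these assumptions the larger fleet of slower drives comes out
# slightly ahead on raw aggregate bandwidth and IOPS.
```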

For example, the plot below shows the results of comparing the peak throughputs of the hard drives and network connections in a simulated OpenStack Swift system. Each workload has a different average object size and distribution of uploads and downloads, which gives each a different peak performance. Some workloads are proxy server bandwidth limited (in teal), others are object server bandwidth limited (in black), and one workload (“Small Object Operations”) is hard drive limited (in red).

[Figure: Prototype OpenStack Swift System Datacenter Performance]
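
For a rough idea of how a per-workload comparison like this might be set up (this is not the actual model behind the plots), the sketch below assumes a simplified Swift-like data path of proxy servers in front of object servers with attached drives. The function name, the second workload, and every throughput figure are hypothetical.

```python
# A rough per-workload peak-throughput comparison for a simplified
# Swift-like data path (clients -> proxy servers -> object servers -> drives).
# All hardware figures and workload parameters are hypothetical.

def workload_peak_mb_s(avg_object_kb,
                       proxy_mb_s=10_000,    # aggregate proxy server bandwidth
                       object_mb_s=20_000,   # aggregate object server bandwidth
                       drive_iops=50_000,    # aggregate drive operations per second
                       drive_mb_s=15_000):   # aggregate drive bandwidth
    """Return the peak throughput in MB/s and the component that limits it."""
    # Small objects tend to be limited by drive operations per second,
    # large objects by raw bandwidth somewhere along the path.
    drive_limit = min(drive_mb_s, drive_iops * avg_object_kb / 1024)
    limits = {
        "proxy server bandwidth": proxy_mb_s,
        "object server bandwidth": object_mb_s,
        "hard drives": drive_limit,
    }
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter

for name, size_kb in [("Small Object Operations", 64),
                      ("Large Object Streaming", 100_000)]:
    mb_s, limiter = workload_peak_mb_s(size_kb)
    print(f"{name}: {mb_s:,.0f} MB/s, limited by {limiter}")
# Small Object Operations: 3,125 MB/s, limited by hard drives
# Large Object Streaming: 10,000 MB/s, limited by proxy server bandwidth
```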


What if the hard drives in the system are replaced with a different, lower-performance model? The plot below shows that while most workloads show no performance degradation, the “Small Object Operations” throughput drops by about a factor of two. Depending on the target use of this prototype Swift system, the substantially lower small-object performance may or may not be acceptable, but it is predicted and quantified, and the system architect can take it into consideration.

[Figure: Prototype OpenStack Swift Datacenter Performance]
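
Continuing the hypothetical sketch from above, exploring this "what if" is just a matter of re-running the same function with slower drive parameters. With these made-up numbers, the drive-limited workload drops by roughly half while the proxy-limited workload is untouched, mirroring the behavior described above.

```python
# Re-run the hypothetical workload_peak_mb_s() sketch from above with a
# slower drive fleet: half the aggregate IOPS and reduced bandwidth.
# All figures remain made-up placeholders.
slow_drives = dict(drive_iops=25_000, drive_mb_s=12_000)

for name, size_kb in [("Small Object Operations", 64),
                      ("Large Object Streaming", 100_000)]:
    base_mb_s, _ = workload_peak_mb_s(size_kb)
    slow_mb_s, limiter = workload_peak_mb_s(size_kb, **slow_drives)
    print(f"{name}: {base_mb_s:,.0f} -> {slow_mb_s:,.0f} MB/s "
          f"(limited by {limiter})")
# Small Object Operations: 3,125 -> 1,562 MB/s (limited by hard drives)
# Large Object Streaming: 10,000 -> 10,000 MB/s (limited by proxy server bandwidth)
```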

Unlike rules of thumb, numerical models can quantify the consequences of relatively small design changes, which helps save money, build a better system for the target application, and prevent design mistakes from happening in the first place. Numerical models are therefore a valuable tool when designing complicated computer systems and should be consulted early in the design process.

Author: Stephen Skory

2014-03-25
