Preventing Disaster: Factors for Choosing an Enterprise Lustre Solution

I read an interesting blog post from CSC (IT Center for Science) in Finland describing the near-catastrophic crash of their scratch file system in February 2016. The first sentence of the blog gives you an idea of what happened:

“A month ago CSC’s high-performance computing services suffered the most severe unplanned outage in years. The outage was due to data corruption that occurred in the parallel filesystem that houses the work directory.”

Imagine the impact of your company losing, say, 1.7 PB of data. That’s exactly what almost happened to CSC. Losing that data would not only cost CSC time but would most likely also have a significant financial impact.

Scratch File Systems Are Not Intended for All Data

While a scratch file system is supposed to contain only data you can afford to lose, parallel file systems such as Lustre® are increasingly being used for all data storage within a datacenter. The right thing to do is to make sure you have a backup of “important data,” but with today’s capacities and the rate at which data is created, this isn’t always feasible. Most data could probably be restored from backups, re-created at the cost of additional compute cycles, or imported from other sites. But this would take significant time (most likely weeks) and effort, and even then some data might be irretrievably lost.

The blog post sparked a number of questions in my mind about the choices made and what could have been done to prevent the CSC near disaster. Those questions focus on 1) Choice of storage vendor, 2) Choice of Lustre version, and 3) Buying support vs. self-support.

Choice of Storage Vendor

When evaluating an HPC storage system, there are many factors in the purchase decision. In HPC, the decision usually starts with performance or capacity, and, let’s face it, with enough hardware any vendor can promise to hit the requirement. This is why the architecture of the storage system must be considered.

Seagate’s unique architecture co-locates a balanced amount of storage processing, capacity, and I/O throughput capability, allowing the system to scale in the most effective and balanced way.

The architecture that was at fault for the CSC data loss uses a set of central controllers to manage a large number of RAID sets (called OSTs in Lustre) and exports those OSTs to two or more OSS servers. This design becomes more and more vulnerable as systems grow: in CSC’s case, a single storage controller affected a large portion of the file system (at least 1.7 PB). A storage controller is essentially a specialized server with lots of I/O, and most server motherboards have an MTBF of only a few hundred thousand hours (there are many components, and the math is fairly straightforward). Compare that to a disk drive, with an MTBF of roughly 1M to 2M hours; drives still fail, of course, which is why technologies like RAID are used to protect the data.
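As a rough illustration of that math, here is a minimal back-of-the-envelope sketch assuming independent components with a constant failure rate; the fleet sizes and MTBF figures below are illustrative round numbers, not measured values for any particular product:

    # Back-of-the-envelope failure math, assuming independent components with a
    # constant failure rate (exponential model). MTBF values are illustrative
    # round numbers, not measured product data.
    HOURS_PER_YEAR = 8760

    def expected_failures_per_year(units: int, mtbf_hours: float) -> float:
        """Expected number of unit failures per year across a fleet."""
        return units * HOURS_PER_YEAR / mtbf_hours

    # Hypothetical fleet: 20 central controllers at 300,000 h MTBF
    # vs 1,000 disk drives at 1,500,000 h MTBF.
    print(expected_failures_per_year(20, 300_000))      # ~0.58 controller failures/year
    print(expected_failures_per_year(1000, 1_500_000))  # ~5.8 drive failures/year

Even with generous MTBF assumptions, neither drives nor controllers can be treated as components that never fail at scale; the difference is that RAID protects against drive failures, while a controller failure can expose everything behind it.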

[Image: Seagate ClusterStor with Lustre]

At Seagate we recognized many years ago that, as systems get bigger and bigger, ‘motherboard’ failures would become common, and that the impact is at best massive system performance degradation and, increasingly, data loss. That is why we pioneered a distributed processing architecture in which each ClusterStor® solution is made up of a number of redundant OSS storage controllers, each supporting a single OST. Should one of these OSS controllers/servers malfunction, only a single OST is affected. This granularity not only delivers higher performance than other solutions on the market (Seagate solutions currently sustain over 120 GB/s per rack of real, client-side benchmarked IOR throughput), but also confines any potential data corruption to a single small component serving a bounded number of drives (42), versus up to 1,200 drives behind a third-party controller/server pair.
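To put those drive counts in perspective, here is a quick sketch of the capacity exposed by a single controller fault in each design; the per-drive capacity is an assumed round figure for illustration, not a product specification:

    # Rough failure-domain comparison using the drive counts quoted above:
    # 42 drives behind a single ClusterStor OSS/OST vs up to 1,200 drives
    # behind a central controller pair. Per-drive capacity is an assumed figure.
    DRIVE_TB = 8  # hypothetical capacity per drive, in TB

    def exposed_capacity_tb(drives_in_domain: int) -> int:
        """Capacity placed at risk when the controller for this domain faults."""
        return drives_in_domain * DRIVE_TB

    print(exposed_capacity_tb(42))     # 336 TB behind one failed OSS
    print(exposed_capacity_tb(1200))   # 9,600 TB behind one failed controller pair

The smaller the failure domain, the smaller the slice of the file system that any single fault can corrupt or take offline.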

While the intent of Lustre is to provide very fast streaming I/O between compute systems and the storage backend, the Lustre parallel file system has changed from being exclusively a scratch file system to being the file system where ALL data is stored (the main reasons being capacity and performance). Lustre is one of the few file systems that can grow to 100+ PB in a single namespace, and its performance scales linearly with capacity (in most cases).
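As a toy illustration of that linear scaling, aggregate throughput grows with the number of OSTs because clients stripe files across them in parallel; the per-OST figure below is an assumed value for illustration only, not a benchmark result:

    # Toy model of Lustre scale-out: aggregate throughput grows with the number
    # of OSTs because clients stripe files across them in parallel.
    PER_OST_GBPS = 5  # assumed GB/s per OST (illustrative)

    for osts in (10, 100, 1000):
        print(f"{osts:4d} OSTs -> ~{osts * PER_OST_GBPS} GB/s aggregate (idealized)")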

Choice of Lustre Version

CSC is using a “self-supported Lustre filesystem,” and the blog author writes about commercial Lustre versions. Support is clearly one aspect, but you can also be proactive rather than reactive by getting a system that is pre-engineered and tested with the underlying platform. When you buy a Lustre solution from Seagate, it’s delivered as a fully engineered system rather than a bag of bits: all hardware and software is pre-tested and covered by a single support contract from a single vendor. The TCO over three years or more is also lower, and a supported solution delivers peace of mind.

Buying Support vs Self-support

Today, the tools available to manage a Lustre solution (both open source and from vendors such as Seagate) make the day-to-day management of a large storage solution easier and more straightforward than in the past. It’s when something goes horribly wrong that you need deep expertise. Very few sites have that knowledge in-house, and most of the experienced people capable of doing root cause analysis and providing a fix reside in two companies: Seagate and Intel. So the recent announcement that Seagate is adopting IEEL (Intel Enterprise Edition for Lustre) and that the two companies are working together to support Seagate Lustre customers will hopefully provide real peace of mind that the top experts in the world have their back when the s*** hits the fan.

Over the last year, four new systems have made the Top 10 (of the Top 500), all of which are powered by Seagate. That’s real market leadership and there is a reason why the leaders in HPC trust Seagate with their data.

Torben Kling Petersen, PhD, is a Principal Engineer at Seagate

