At Seagate, we see every day how data continues to evolve from its past role as static information stored and forgotten, into its new state as a living organism, touching every interaction of modern life. Nowhere has this impact been greater than in helping advance the life sciences, where the organic and expanding nature of data literally impacts our ability to study, understand and improve the health of actual living organisms in the real world. The growing ability to capture, access and analyze ever-increasing amounts of data has led to enormous data sets, thus also increasing the need for deeper and more complex data analysis. For the data scientist working in life sciences, this intensifies the need to accelerate I/O intensive workloads in order to shorten time to discovery, improve efficiency and reduce the overall total cost of ownership (TCO) of systems managing these growing data sets.
As biomedical data repositories are now measured in petabytes, the data management infrastructure that supports scientific research has become critical to researchers’ ability to glean insights more quickly and efficiently from their analysis tools. As a central part of that toolkit, the storage and data management infrastructure must be designed for data integration, implementing a solution such as the IBM Spectrum Scale™ parallel file system, so that even when storage is in different locations, a shared namespace removes the need to duplicate data. It must sustain high Input/Output Operations Per Second (IOPS) with small block data so that applications don’t hang and analysis doesn’t stall. It must also be highly scalable, so that as more data enters the system and requires more analysis, additional blocks can be added without diminishing performance.
Ongoing advances in life science computing are remarkable, enabling more frequent and faster breakthroughs in medical research. In the coming months we expect to see numerous innovations enter the deployment phase, including the further development of synthetic blood, a new ultrasound therapy to mitigate the amyloid plaques that clump around neurons and contribute to Alzheimer’s disease, and chimeric antigen receptor (CAR) T-cell therapies, in which a patient’s own T cells are genetically engineered to express chimeric antigen receptors and then reintroduced to the body in order to destroy leukemia cells.
The importance of designing data infrastructure that can keep up with these advanced life sciences workloads can’t be overstated. The high number of IOPS with small block data can overtax many systems, so deploying a solution that accelerates I/O-intensive workloads is crucial. Continuing advances in next-generation genomic sequencing are leading to a future in which we’ll benefit from personalized medicine, helping to solve previously intractable diseases, accelerating drug and device development, and improving disease control and health for all.
Today, supercomputers and advanced data storage systems are working to:
- Advance understanding of Hepatitis B and the discoveries of new therapies
- Investigate brain synapses to aid research into Alzheimer’s disease, schizophrenia and manic-depressive disorders
- Help prevent cancers caused by radiation therapy, and reduce the occurrence of heart disease
Why life science data grows so fast
The rapid improvement in genome sequencing over recent years illustrates why the data-management challenge has also grown so quickly. Not too many years ago, sequencing a single complete human genome cost $1 billion and took several years of effort. Back then, since each research set was limited to a small sample at a time, the size of the total sequencing data set remained small for any given biomedical or biological research project.
But today, according to data from the National Institutes of Health (NIH), the cost of generating a high-quality whole human genome sequence has fallen below $1,500 and continues to drop. Genomic sequencers have dropped drastically in cost, and that has allowed many labs and healthcare providers to conduct sophisticated analyses on genomes, transcriptomes and interactomes. The results of that research have enabled researchers to understand the causes of certain cancers, develop personalized treatment, and prevent disease for specific regions and populations. For example, pharmacogenomics, part of precision medicine research, leverages both pharmacology and genomics to develop effective, safe medications and dosing tailored to variations in an individual’s genes.
Looking to the near future, as research tools continue to advance, data sets will grow beyond the petascale and reach the exascale. For example, consider the exponential growth in data produced by scientists who study and work to resolve the chemical details of human cells. A human cell is approximately 10 micrometers long; today scientists can simulate it at a scale of about 1/100th to 1/1000th. Soon, we’ll reduce that simulation size by a factor of 10, thus requiring a computer 1,000 times as powerful.
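The jump from a factor of 10 to a factor of 1,000 is not explained above. One plausible reading, offered here as an assumption rather than the stated reasoning, is that the simulation divides the cell into volume elements, so shrinking each element's edge by 10 in all three dimensions multiplies the element count by 10 cubed. The function name and the `dimensions` parameter below are illustrative only.

```python
def relative_compute_cost(refinement_factor: int, dimensions: int = 3) -> int:
    """Return how many times more volume elements a simulation needs when
    each element's edge length shrinks by `refinement_factor`.

    Assumes (hypothetically) that compute cost scales linearly with the
    number of elements and that the simulated volume is three-dimensional.
    """
    return refinement_factor ** dimensions


# A tenfold finer resolution in 3D needs 10 * 10 * 10 times as many elements.
print(relative_compute_cost(10))  # 1000
```

Under this assumption the data volume grows by the same factor as the compute requirement, which is why storage has to scale alongside the computing speed.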
Indeed, across every type of biomedical research, these advancing capabilities are generating huge amounts of data in diverse formats. As this digital knowledge continues to explode, new high performance computing (HPC) systems that support sophisticated analysis and workflows are in high demand, and powerful HPC storage and data management solutions are needed to keep pace with this increased computing speed.
IT architects support rapidly increasing data needs with ClusterStor and IBM Spectrum Scale
To enable system architects to meet these burgeoning demands, IBM and Seagate now recommend the IBM Spectrum Scale high-performance parallel file system with the Seagate ClusterStor G300N storage system and Nytro XD flash accelerator as one of the best solutions for life sciences applications. Because the solution accelerates I/O-intensive workloads and shortens time to discovery, it helps researchers improve efficiency and reduce total cost of ownership (TCO).
This next-generation storage technology meets the demands of next-generation genomics and biomedical analytics, powering a “learning system” that helps identify disease patterns, treatments and outcomes. That gives researchers and physicians an advantage, so they can identify effective treatments, publish results and spread the word today, not tomorrow.
The demanding I/O requirements of many life sciences workflows mean that traditional storage solutions can’t scale. A high-performance parallel file system like IBM Spectrum Scale, paired with the Seagate ClusterStor G300N, is needed to accelerate I/O-intensive workloads.
The main advantages of a parallel file system include high performance, scalability, and the ability to distribute massive amounts of data across multiple nodes.
Life sciences applications use petabytes of data, and performance is enormously important to providing actionable insights. Seagate ClusterStor G300N paired with IBM Spectrum Scale is built to deliver the I/O performance these workloads require. Most life science customers will need multi-protocol access to their storage systems as well as high-performance parallel access. IBM Spectrum Scale with Cluster Export Services (CES) supports these alternative access methods while still meeting the high-performance I/O requirements.
Parallel file systems distribute the data of a single object across multiple storage nodes. Application servers, or clients, have simultaneous parallel access over multiple data paths to storage servers or nodes, which makes a parallel file system well suited to serve as the I/O subsystem. This prevents the bottlenecks common to other systems: instead of going through a single server to reach the data, clients access multiple servers. For example, if you have a Spectrum Scale file system with 10 NSD storage servers, a client reading data will issue 10 reads to those servers in rapid succession. The servers then respond with data almost simultaneously.
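The 10-server read described above can be sketched in a few lines. This is a toy model, not Spectrum Scale's actual placement or client logic: it assumes simple round-robin striping, stands in for each NSD server with an in-memory dictionary, and uses hypothetical names (`stripe`, `read_block`, `parallel_read`) and a tiny block size chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SERVERS = 10  # matches the 10-server example in the text
BLOCK_SIZE = 4    # bytes; unrealistically small, for illustration only


def stripe(data: bytes, num_servers: int, block_size: int) -> list[dict[int, bytes]]:
    """Split data into blocks and place block i on server i % num_servers."""
    servers: list[dict[int, bytes]] = [{} for _ in range(num_servers)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        servers[block_index % num_servers][block_index] = data[offset:offset + block_size]
    return servers


def read_block(servers: list[dict[int, bytes]], block_index: int) -> bytes:
    """Fetch one block from the server that holds it."""
    return servers[block_index % len(servers)][block_index]


def parallel_read(servers: list[dict[int, bytes]], num_blocks: int) -> bytes:
    """Issue one read per block concurrently and reassemble them in order."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        # pool.map yields results in block order, regardless of completion order.
        blocks = pool.map(lambda i: read_block(servers, i), range(num_blocks))
        return b"".join(blocks)


data = bytes(range(40))  # 10 blocks of 4 bytes, one per server
servers = stripe(data, NUM_SERVERS, BLOCK_SIZE)
assert parallel_read(servers, num_blocks=10) == data
```

In this sketch each of the 10 reads goes to a different server, so no single server has to deliver the whole file. That is the property the paragraph credits with removing the bottleneck.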
Importantly, a parallel file system can also scale out as needed. As more sequencing analysis is performed, a system architect in a traditional data center may typically add storage, but performance may suffer because other systems can’t efficiently scale to utilize the added storage. With the Seagate ClusterStor G300N and IBM Spectrum Scale, the IT team can add an additional block and build from there without diminishing performance. Instead, performance grows in proportion to the capacity added.
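The contrast between the two scaling behaviors can be expressed as a simple back-of-the-envelope model. The throughput figures, the function names, and the idea of a fixed controller limit in the traditional case are all hypothetical assumptions chosen for illustration; they are not measured ClusterStor or Spectrum Scale numbers.

```python
def scale_out_throughput(num_blocks: int, per_block_gbps: float) -> float:
    """Idealized scale-out: every added block contributes its full throughput."""
    return num_blocks * per_block_gbps


def capped_throughput(num_shelves: int, per_shelf_gbps: float,
                      controller_limit_gbps: float) -> float:
    """Assumed traditional design: added shelves sit behind one controller,
    so aggregate throughput cannot exceed that controller's limit."""
    return min(num_shelves * per_shelf_gbps, controller_limit_gbps)


# Hypothetical figures: 10 GB/s per block or shelf, 20 GB/s controller limit.
for n in (1, 2, 4, 8):
    print(n, scale_out_throughput(n, 10.0), capped_throughput(n, 10.0, 20.0))
```

In this model the capped design stops improving after two shelves, while the scale-out design keeps growing linearly. Real systems will fall somewhere below the ideal line because of network and metadata overheads.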
Why Seagate ClusterStor G300N paired with IBM Spectrum Scale?
The right storage architecture for a specific life sciences workload depends on performance, reliability, availability, efficiency and budget requirements. In healthcare, for example, the Health Insurance Portability and Accountability Act (HIPAA) influences the type of storage required. Similarly, in biotech and pharmaceuticals, security and privacy are critical requirements. The growing need for virtualization also influences storage requirements in other life science industry segments. Among these different needs, the most important today are the ability to access data from various sources, to accelerate complex analytics, and to ensure the highest efficiency, performance and capacity with minimum downtime.
Seagate’s new Nytro Intelligent I/O Manager is software that handles mixed I/O, including random, unaligned or small block I/O, making it the most flexible system to handle any workload, anytime. Because it is a true hybrid system incorporating the right balance of SSD and HDD storage, the G300N provides the best performance with the best value.
In nearly every case, the best choice is a hybrid architecture that combines hard disk drives (HDD) with flash or solid state drives (SSD), powered by a parallel file system and workload accelerator with active archive object storage.