Seagate and UC Researchers Collaborate to Accelerate Genomics Data Analysis

Seagate has entered a multi-year joint research and development agreement with the Genomics Institute and Baskin School of Engineering at UC Santa Cruz to accelerate genomics data analysis using computational storage technology. The project’s initial focus is to accelerate the analysis of the Human Cell Atlas (HCA).

The Human Cell Atlas (HCA) is a global, scientist-led collaboration to map all cells in a healthy human body as a resource for studies of health and disease. The human body is made of trillions of cells. Different types of cells express different sets of genes, and make up the various tissues and organs of the body. The Human Cell Atlas will catalogue all cell types and sub-types in the human body, identify the genes they express, and characterize all cell types, numbers, locations, relationships, and molecular components.

Advances in single-cell genomics have only recently made it realistic to complete such an atlas, which could reveal a new view of human biology and new opportunities for diagnosing, monitoring, and treating disease. Available freely to scientists all over the world, the Human Cell Atlas will make work in almost every biomedical research lab go faster, much like the human genome sequence did after it was completed and published.

Access to massive HCA data sets

The Human Cell Atlas project is organizing and standardizing enormous amounts of data for billions of cells, across multiple modalities, generated by hundreds of labs around the world. Recently, the Chan Zuckerberg Initiative (CZI) funded HCA-selected “seed networks.” These projects involve 20 countries and more than 200 labs and will begin sequencing specific organs, such as the heart, eye, or liver, in the healthy human body. The resulting cellular and molecular maps will be a resource for understanding what goes wrong when disease strikes.

It’s important to make this data open and easily accessible to all researchers, to enable the scientific community to innovate rapidly without barriers to data access, and to make it easy for computational researchers to develop and share new analysis approaches. To do that, the Genomics Institute is working with the European Bioinformatics Institute, the Broad Institute of MIT and Harvard, and other specialists in biology, computation, and medicine to design and build the Data Coordination Platform, an advanced, modular cloud-based architecture for organizing and sharing data.

As these maps grow in size, traditional architectures are being strained. By leveraging Seagate’s Active Drive™ computational storage technology, the UC Santa Cruz research team hopes to increase access and accelerate analysis of these molecular maps to reduce the time from data generation to insight and discovery.

Accelerate discovery by speeding analysis from hours to seconds

“Our goal is to speed up the analysis from batch time scales of hours to interactive time scales of seconds or faster. If the time from question to answer for an investigator drops from hours or minutes to seconds, the entire experience and approach changes, ultimately accelerating discovery,” said Peter Alvaro, Assistant Professor, Computer Science and Engineering, UC Santa Cruz Baskin School of Engineering.

“This partnership with Seagate is an excellent example of how university-industry collaborations can accelerate meaningful research for the benefit of society,” said Alexander Wolf, dean of the Baskin School of Engineering. “Baskin Engineering’s expertise in genomics and computational biology, as well as in storage, data, and distributed systems — combined with Seagate’s computational storage technology — could lead to consequential results and far-reaching impact that might not otherwise be possible.”

What is causing the profusion of data

Steadily declining sequencing costs and the rapid development of new techniques are leading to an explosion in genomic data ready for analysis. Historically, the last stages of genomic analysis required computation on relatively small amounts of data compared to the source genomic sequences. As the cost of sequencing has dropped to unexpectedly affordable levels, the architectures to handle the data analysis have not kept up.

Existing sequencing techniques combine millions of cells to generate a single ‘bulk’ measurement. As recent advances enable massively parallel single-cell sequencing of millions of individual cells, the size of the resulting data has increased by several orders of magnitude. Single-cell sequencing is rapidly moving from research into the clinic, where reducing the analysis and exploration time could ultimately accelerate precision medicine at scale.

“Today, the primary users of this architecture will be researchers around the world, as all the data in the Human Cell Atlas will be public. But once the atlas is complete, these techniques will be translated into the clinic where interactive time scales are a requirement,” said Josh Stuart, Professor, Biomolecular Engineering and Associate Director, UC Santa Cruz Genomics Institute. “Single-cell sequencing is poised to revolutionize cancer treatment, as it helps convey a detailed picture of the tumor microenvironment, which facilitates selection of combination, targeted, and precision therapies,” Stuart explained.

Seagate has a track record of developing storage systems with integrated compute capability. As part of this project, Seagate is identifying vertical and associated applications where computation can be moved closer to storage, using proximity to minimize total computational time. The intention is to create efficiencies by bringing computational power closer to where the data resides. As the quantity of data and computing power grows exponentially, capturing these efficiencies becomes critical to operating at the speed business and science require.
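The idea of moving computation closer to storage can be illustrated with a small simulation. The sketch below is only a toy model under stated assumptions: `SimulatedDrive`, `CellRecord`, the gene names, and the filter are all hypothetical and do not represent Seagate's Active Drive interface or the Human Cell Atlas data format. It compares how many records cross the storage boundary when the host filters the data with how many cross when the filter runs next to the data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CellRecord:
    """Hypothetical record: one gene expression value for one cell."""
    cell_id: int
    gene: str
    expression: float


class SimulatedDrive:
    """Toy stand-in for a storage device (an assumption, not a real API).

    It counts how many records leave the device so the two access
    patterns can be compared.
    """

    def __init__(self, records: Iterable[CellRecord]) -> None:
        self._records: List[CellRecord] = list(records)
        self.records_transferred = 0

    def read_all(self) -> List[CellRecord]:
        # Conventional path: every record is shipped to the host.
        self.records_transferred += len(self._records)
        return list(self._records)

    def query(self, predicate: Callable[[CellRecord], bool]) -> List[CellRecord]:
        # Near-data path: the filter runs "on the drive" and only
        # matching records are shipped to the host.
        matches = [r for r in self._records if predicate(r)]
        self.records_transferred += len(matches)
        return matches


def make_records(num_cells: int) -> List[CellRecord]:
    """Build deterministic synthetic records (made-up values)."""
    genes = ("GENE_A", "GENE_B", "GENE_C", "GENE_D")
    return [CellRecord(i, genes[i % 4], float(i % 10)) for i in range(num_cells)]


def is_high_gene_a(record: CellRecord) -> bool:
    """Hypothetical question: which cells express GENE_A strongly?"""
    return record.gene == "GENE_A" and record.expression >= 8.0


def host_side_filter(drive: SimulatedDrive) -> List[CellRecord]:
    return [r for r in drive.read_all() if is_high_gene_a(r)]


def near_data_filter(drive: SimulatedDrive) -> List[CellRecord]:
    return drive.query(is_high_gene_a)


if __name__ == "__main__":
    records = make_records(1000)
    host_drive = SimulatedDrive(records)
    smart_drive = SimulatedDrive(records)

    host_result = host_side_filter(host_drive)
    near_result = near_data_filter(smart_drive)

    print("same answer:", host_result == near_result)
    print("records moved, host-side filter:", host_drive.records_transferred)
    print("records moved, near-data filter:", smart_drive.records_transferred)
```

Both paths return the same answer, but the near-data path moves only the matching records. In a real system the saving would depend on how selective the query is and on how much compute the storage device can offer, neither of which this toy model captures.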

One of the highest-leverage areas where new computational architectures can have an impact is the point of analysis, where insights occur and science leaps forward. We look forward to advancing this application of computational storage, with the hope that clinicians and their patients will ultimately benefit.


About the Author:

Edward Gage
Edward Gage is Vice President, Seagate Research Group.