
Storage and compute: tandem needs for AI workflows.

Hard drives and SSDs join GPUs, CPUs, HBM, and DRAM as vital components in AI applications.


The adoption of artificial intelligence (AI) applications continues to grow worldwide. Simultaneously, the capabilities of the IT solutions that enable AI are accelerating rapidly. Unprecedented innovation follows. 

Currently, the processor (logic) side gets most of the attention from enterprise leaders and investors for its contribution to AI. To be sure, processors are essential to AI and high-performance computing. But AI success doesn’t solely depend on compute and high-speed performance. Just as important, AI applications also rely on data storage, which provides an initial repository of raw data, enables checkpointing that builds trust into AI workflows, and stores inferences and the results of AI analysis. 

Any successful AI implementation calls for a synergy of compute and data storage resources. 

As large data centres scale their AI capabilities, it becomes clearer that AI applications do not rely solely on the compute side of an AI data centre architecture. The compute cluster comprises processors with high-performance, high-bandwidth memory (HBM), dynamic random-access memory (DRAM), and fast-performing local solid-state drives (SSDs) — together forming the powerful engine for AI training. The compute cluster components are local, typically right beside each other, because any added distance could introduce latency and performance issues.

AI applications also depend on the storage cluster, which includes high-capacity network hard drives and network SSDs (higher in capacity than the more performant local SSDs in the compute cluster). The storage cluster is networked (distributed) because storage performance at scale is less latency-sensitive: the physical distance between components is a smaller factor in its total latency than it is for the compute cluster, where expected latencies can be measured in nanoseconds. Data ultimately flows to the storage cluster, which consists predominantly of mass-capacity hard drives for long-term retention.

This article examines how compute and storage work together in multiple phases of a typical AI workflow.

Performance and scalability for AI.

Some technologies in AI workflows are more performant and some more scalable, but each is integral to the process. On-device memory is highly performant, commonly composed of HBM or DRAM attached to processors — graphics processing units (GPUs), central processing units (CPUs), or data processing units (DPUs). DPUs are offload engines attached to CPUs that handle specific tasks; some architectures use them, while others do not. Memory’s high throughput enables the efficient data ingestion and model training aspects of AI.

SSDs’ low latency and sufficient capacity allow for fast inferencing and frequent access to stored content. In AI data centre architecture, fast-performing local SSDs are included in the compute cluster, close to the processors and memory. Local SSDs typically use triple-level cell (TLC) NAND and offer high durability, but they are usually more expensive than network SSDs and don’t reach the same high capacities.

Network SSDs, with higher data storage capacity than local SSDs, are included in the storage cluster and take on specific responsibilities throughout an AI application’s workflow. Their performance does not match that of local SSDs, and they are comparatively less durable in drive writes per day, but they make up for it with larger capacity.

Network hard drives, also part of the storage cluster in AI data centre architecture, are the most scalable and efficient IT devices in AI workflows. They have comparatively moderate access speeds but very high capacity, which suits data that does not require rapid, frequent access.

AI’s infinite loop.

AI workflows operate in an infinite loop of consumption and creation, requiring not only compute-enabling processors and memory, but also storage components. The interrelated steps of an AI workflow include source data, train models, create content, store content, preserve data and reuse data. Let’s look at the roles compute and storage play in these stages.

Step 1: source data.

The data sourcing stage involves the definition, discovery, and preparation of data for AI analysis.

Compute: GPUs play a foundational role in the data sourcing stage by enabling high-speed data preprocessing and transformation. They complement the CPUs, running repetitive calculations in parallel while the main application runs on the CPU. The CPU acts as the primary unit, managing multiple general-purpose computing tasks while the GPU performs a smaller set of more specialised tasks.
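
To make that division of labour concrete, here’s a rough sketch of how a preprocessing step might be offloaded to a GPU when one is available, while the CPU orchestrates the batches. PyTorch is assumed here purely for illustration; the workflow itself isn’t tied to any particular framework, and the data is a placeholder.

```python
# Sketch: CPU orchestrates data sourcing while the GPU (if present)
# accelerates a numeric preprocessing step. PyTorch is an assumption;
# the article does not prescribe a specific framework.
import torch

def preprocess_batch(raw_batch: list[list[float]]) -> torch.Tensor:
    """Normalise a batch of raw records, on GPU when available."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    data = torch.tensor(raw_batch, dtype=torch.float32, device=device)
    # The GPU runs this element-wise maths in parallel across the batch.
    return (data - data.mean(dim=0)) / (data.std(dim=0) + 1e-8)

# CPU-side orchestration: iterate over sourced records in batches.
raw_records = [[float(i), float(i) * 2.0] for i in range(1_000)]
batch = preprocess_batch(raw_records[:256])
print(batch.shape, batch.device)
```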

Storage: In the data sourcing stage, both network SSDs and network hard drives are used to store the vast amounts of data needed to create something new. The network SSDs act as an immediately accessible data tier, offering faster performance. Network hard drives provide spacious, dense, scalable capacity, along with long-term retention and data protection for the raw data.

Step 2: train models.

In the model training step, the model learns from stored data. Training is a trial-and-error process in which the model converges and is safeguarded with checkpoints, and it requires high-speed data access.

Compute: GPUs are critical during the model training stage, where their parallel processing capabilities allow them to handle the massive computational loads involved in deep learning. AI training involves thousands of matrix multiplications, which GPUs handle simultaneously, speeding up the process and making it possible to train complex models with billions of parameters. CPUs work alongside GPUs, orchestrating data flow between memory and compute resources. CPUs manage tasks like batch preparation and queue management, so the right data is fed into the GPUs at the correct times. They also handle the optimisation of the model’s hyperparameters, performing calculations that may not require the parallel processing power of GPUs.
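
As a simplified illustration of those matrix multiplications, here’s a minimal single training step, again assuming PyTorch; the model and data are placeholders rather than a real workload.

```python
# Sketch: one training step, assuming PyTorch. The linear layers below
# are matrix multiplications that a GPU executes in parallel.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# The CPU prepares and queues the batch; the heavy compute runs on the GPU.
inputs = torch.randn(64, 512).to(device)
labels = torch.randint(0, 10, (64,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)   # forward pass: batched matmuls
loss.backward()                         # backward pass: more matmuls
optimizer.step()
print(f"loss: {loss.item():.4f}")
```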

In model training, HBM and DRAM are essential for fast data access, holding active datasets in close proximity to the processors. HBM, which is typically integrated into GPUs, significantly boosts the speed at which data can be processed by keeping the most frequently used data accessible to the GPUs during training.

Local SSDs serve as fast-access storage for the datasets used in this stage. They store intermediate training results and allow quick retrieval of large datasets. They are particularly useful for training models that require quick access to large amounts of data, such as image recognition models involving millions of images.

Storage: Hard drives economically store the vast amounts of data needed to train AI models. In addition to providing the required scalable capacity, hard drives help maintain data integrity — storing and protecting the replicated versions of created content. Hard drives are cost-effective in comparison to other storage options, providing reliable long-term storage and preserving and managing large datasets efficiently.

Among other things, network hard drives and network SSDs store checkpoints to protect and refine model training. Checkpoints are saved snapshots of a model’s state at specific moments during training, tuning, and adapting. These snapshots may be called upon later to prove intellectual property or to show how the algorithm arrived at its conclusions. When SSDs are used for checkpointing, checkpoints are written at a quick interval (for example, every minute) thanks to their low-latency access; however, that data typically gets overwritten after a short duration because of SSDs’ smaller capacity compared to hard drives. In contrast, checkpoints saved to hard drives are typically written at a slower interval (for example, every five minutes) but can be kept nearly perpetually due to the hard drive’s scalable capacity.
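
A minimal sketch of that two-tier checkpointing cadence might look like the following; the paths are hypothetical, and the one-minute and five-minute intervals simply mirror the examples above — in practice both would be tuned to the cluster.

```python
# Sketch: two-tier checkpointing, assuming PyTorch. Paths and intervals
# are illustrative only.
import time
import torch

FAST_TIER = "/mnt/nvme/checkpoints/latest.pt"      # network SSD tier (hypothetical path)
ARCHIVE_TIER = "/mnt/hdd/checkpoints/step_{}.pt"   # hard drive tier (hypothetical path)

def maybe_checkpoint(model, optimizer, step, last_fast, last_archive, now=None):
    """Write frequent, overwritten SSD checkpoints and slower, retained HDD ones."""
    now = time.time() if now is None else now
    state = {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()}
    if now - last_fast >= 60:                         # roughly every minute
        torch.save(state, FAST_TIER)                  # overwritten each time
        last_fast = now
    if now - last_archive >= 300:                     # roughly every five minutes
        torch.save(state, ARCHIVE_TIER.format(step))  # retained long term
        last_archive = now
    return last_fast, last_archive
```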

Step 3: create content.

The content creation phase involves the inference process that uses the trained model to create outputs.

Compute: During content creation, GPUs execute the AI inference tasks, applying the trained model to new data inputs. Their parallelism allows GPUs to perform multiple inferences simultaneously, making them indispensable for real-time applications like video generation or conversational AI systems. While GPUs dominate the computational tasks during content creation, CPUs are crucial for managing the control logic and executing any operations that require serial processing. This includes generating scripts, handling user inputs, and running lower-priority background tasks that don’t need the high throughput of a GPU.
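
Here’s a minimal sketch of batched inference, again assuming PyTorch; the model is a stand-in for a trained network, and batching is what lets the GPU serve multiple requests in a single pass.

```python
# Sketch: batched inference, assuming PyTorch and a placeholder model.
# The GPU evaluates the whole batch of inputs in one pass.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device).eval()        # stand-in for a trained model

@torch.no_grad()
def infer(batch: torch.Tensor) -> torch.Tensor:
    """Run one inference pass over a batch of inputs."""
    return model(batch.to(device)).argmax(dim=1)

requests = torch.randn(32, 512)                     # 32 concurrent requests
predictions = infer(requests)
print(predictions.shape)                            # torch.Size([32])
```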

The content creation step uses HBM and DRAM. Memory plays a crucial role here in real-time data access, briefly storing the results of AI inferences and feeding them back into the model for further refinement. High-capacity DRAM allows for multiple iterations of content creation without slowing down the workflow, especially in applications like video generation or real-time image processing.

During content creation, local SSDs provide the fast read/write speeds necessary for real-time processing. Whether AI is generating new images, videos, or text, SSDs allow the system to handle frequent, high-speed I/O operations without bottlenecks, ensuring content is produced quickly.

Storage: The primary storage enablers of the creation step are HBM, DRAM and local SSDs.

Step 4: store content.

In the content storage stage, the newly created data is saved for continued refinement, quality assurance and compliance.

Compute: Although not directly involved in long-term storage, GPUs and CPUs may assist in compressing or encrypting data as it’s being prepared for storage. Their ability to quickly process large data volumes means content is ready for archiving without delay. Memory is used as a temporary cache before data is moved into long-term storage. DRAM speeds up write operations, saving AI-generated content quickly and efficiently. This is especially important in real-time AI applications, where delays in storing data could lead to bottlenecks.
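
As a rough illustration, the sketch below compresses generated content before it is staged for the capacity tier; it uses only the Python standard library, and the archive path is hypothetical.

```python
# Sketch: compressing AI-generated content before it moves to the
# capacity tier. Standard library only; the output path is hypothetical.
import gzip
from pathlib import Path

ARCHIVE_DIR = Path("/mnt/hdd/archive")              # hard drive tier (hypothetical)

def archive_content(name: str, content: bytes) -> Path:
    """Compress generated content and stage it for long-term storage."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    target = ARCHIVE_DIR / f"{name}.gz"
    with gzip.open(target, "wb") as fh:
        fh.write(content)
    return target

# Example usage (commented out to avoid writing to a real mount point):
# archive_content("render_0001", b"generated content bytes")
```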

Storage: The content storage stage relies on both network SSDs and network hard drives to save data for continued refinement, quality assurance and compliance. Network SSDs provide a speed-matching data tier and are used for short-term, high-speed storage of AI-generated content. Given their lower capacity compared to hard drives, SSDs typically store frequently accessed content or content that must be immediately available for editing and refining.

The process of iteration gives rise to new, validated data that needs to be stored. Hard drives store and protect the replicated versions of created content and provide the critical capacity to hold the content generated during AI processes. They are particularly suited for this because they offer high storage capacity at a relatively low cost compared with other storage options like SSDs.

Step 5: preserve data.

In the data preservation stage, replicated datasets are retained across regions and environments. This stage relies primarily on storage resources.

Storage: Stored data is the backbone of trustworthy AI, allowing data scientists to ensure models act as expected. Network SSDs are used as a performance gasket to connect hard drives to the local SSD layer and help data move around the ecosystem.
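
One simple way to picture the performance-gasket idea is a read-through cache: requests are served from the SSD tier when possible and promoted from the hard drive tier on a miss. The sketch below assumes hypothetical mount points and is illustrative only.

```python
# Sketch: network SSDs as a "performance gasket" between the hard drive
# tier and the faster layers, modelled as a simple read-through cache.
import shutil
from pathlib import Path

HDD_TIER = Path("/mnt/hdd/datasets")                # capacity tier (hypothetical)
SSD_CACHE = Path("/mnt/nvme/cache")                 # performance tier (hypothetical)

def fetch(name: str) -> Path:
    """Serve a dataset from the SSD cache, promoting it from HDD on a miss."""
    cached = SSD_CACHE / name
    if cached.exists():
        return cached                               # cache hit: fast path
    SSD_CACHE.mkdir(parents=True, exist_ok=True)
    shutil.copy2(HDD_TIER / name, cached)           # promote from the capacity tier
    return cached
```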

Hard drives are the primary enablers of longer-term data storage and protection. They help maintain the outcomes of AI content creation, securely storing the generated content so it can be accessed when needed. They also provide the scalability needed to handle increasing data volumes efficiently.

Step 6: reuse data.

Finally, in the data reuse stage, the source, training, and inference data is applied to the next iteration of the workflow.

Compute: GPUs play a significant role in the data reuse phase by re-running models on archived datasets for new inferences or additional training, allowing the AI data cycle to start again. Their ability to perform parallel computations on large datasets allows AI systems to continuously improve model accuracy with minimal time investment. CPUs query and retrieve stored data for reuse. They efficiently filter and process historical data, feeding relevant portions back into the training models. In large-scale AI systems, CPUs often perform these tasks while managing the interactions between storage systems and compute clusters.
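
As a simple illustration of that retrieval-and-filter step, the sketch below streams archived items back for reuse; the archive path and the filter predicate are hypothetical, and any real pipeline would be considerably more involved.

```python
# Sketch: pulling archived data back into the workflow for reuse.
# Standard library only; paths and the filter are hypothetical.
import gzip
from pathlib import Path

ARCHIVE_DIR = Path("/mnt/hdd/archive")              # hard drive tier (hypothetical)

def reusable_records(keyword: bytes):
    """Stream archived items whose content matches a simple filter."""
    for path in sorted(ARCHIVE_DIR.glob("*.gz")):
        with gzip.open(path, "rb") as fh:
            content = fh.read()
        if keyword in content:                      # CPU-side filtering
            yield path.name, content

# Example usage: feed matching records back into the next training run.
# for name, content in reusable_records(b"label"):
#     ...
```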

When historical data is retrieved for reuse in another iteration of the AI model’s analysis, memory ensures fast access to large datasets. HBM allows the rapid loading of datasets into GPU memory, where they can be immediately used for retraining or real-time inferencing.

Storage: Content outputs feed back into the model, improving accuracy and enabling new models. Network hard drives and SSDs support geo-dispersed AI data creation, and raw datasets and outcomes become sources for new workflows. SSDs accelerate the retrieval of previously stored data; their low-latency access promotes rapid reintegration of this data into AI workflows, reducing wait times and increasing overall system efficiency. Hard drives fulfil the mass-capacity storage requirements of the data reuse stage, allowing the model’s next iteration to be implemented at a reasonable cost.

Storage is the backbone of AI.

As we’ve seen, AI workflows call for high-performing processors as well as data storage solutions. On-device memory and SSDs have their place in AI applications due to their high-speed performance, allowing for fast inferencing. But we like to think of hard drives as the backbone of AI. They are especially critical given their economic scalability, a must-have in many AI workflows.

Seagate hard drives featuring Mozaic 3+™ technology — our unique implementation of heat-assisted magnetic recording (HAMR) technology — are a powerful choice for AI applications due to their areal density, efficiency and space optimisation benefits. They provide unprecedented areal density of 3 TB+ per platter, currently available in capacities starting at 30 TB and shipping in volume to hyperscale customers. Seagate is already testing the Mozaic platform, achieving 4 TB+ and 5 TB+ per platter.

Compared with current-generation perpendicular magnetic recording (PMR) hard drives, Mozaic 3+ hard drives consume a quarter of the operating power and carry a tenth of the embodied carbon per terabyte.

In AI workloads, compute and storage work in tandem. Compute-centric processing and memory — as well as highly performant SSDs — are essential in AI applications. So too are scalable mass-capacity data storage solutions, with Seagate hard drives leading the way.