AI drives unprecedented data growth.

As models advance and AI becomes pervasive, data creation will grow exponentially.

Creation and innovation will explode with AI.

Generative AI is ushering in a new era where rich media proliferates in nearly every facet of daily life, from personalised gaming to medical imaging to content production and beyond.

The AI applications that empower users to create, analyse, and develop are becoming more accessible, unleashing AI-driven data growth. And it’s just the beginning. People and machines will generate data at a pace unlike any before as innovative use cases scale.

AI is a data growth force multiplier.

AI has always been a data consumer. Now, it's a powerful data creator.

In just 1.5 years, AI created 15 billion images.¹ By 2028, image and video creation with AI models will grow 167-fold.² Ultimately, the age of AI is sparking a major data growth inflection point driven by three key factors: richer content, more replication and longer retention.

Richer content.

The transformative potential of AI lies in multimodal models that consume and produce rich media.

More replication.

AI data gets copied countless times as models are trained and produce outputs.

Longer retention.

Preserving data fuels AI development and provides transparency.

Richer content.

The transformative potential of AI lies in multimodal models that consume and produce rich media.

The smart chatbots and search summaries we use today are mere baby steps in AI's growth. The real transformative potential lies in multimodal AI models that consume and produce rich media.

Richer inputs — like imagery, audio, video and 3D animation — create richer outputs that can support stronger, more intuitive experiences. As multimodal AI applications expand in scope and capabilities, people and businesses will be able to create at an unprecedented pace.

Future rich media AI will touch industries everywhere.
  • High-resolution 3D motion graphics for gaming
  • Ultra HD video for virtual sets in filmmaking, complete with animated extras
  • 3D CAD generators and physics simulators for architecture, engineering, construction and manufacturing
  • AI medical assistants in radiology, oncology and surgery
  • Molecular synthesis for drug discovery and testing
  • Hyper-personalised advertising, games and online experiences
     
All this rich media will be used to enhance next-gen AI models.

In this new world where we can create hours of content, thousands of images, and terabytes of data, three things will happen: more people will use AI to create increasingly data-intensive content; AI will vacuum up all that data for training the next generation of models; and the amount of data the world creates and stores will explode.

More replication.

AI data gets copied countless times as models are trained and produce outputs.

Enabling successful AI models and applications requires more data replication. Whether it's to ensure model quality through checkpointing, distribute applications geographically, iterate on outputs, or modify them into multiple formats, copying data is integral to AI as models are dispersed across cloud and enterprise environments.

Generating and duplicating new content is only a portion of the replication that happens throughout the AI data lifecycle. Data footprints mushroom during the AI development and production processes, and expand exponentially once AI is deployed and begins generating content. Throughout the cycle, the entire data ecosystem gets duplicated repeatedly for regulatory compliance.

Replication multiplies data at every step.
  • As data is discovered, collated, and labelled for training, it's also duplicated.
  • Regular checkpoints during training back up progress, creating hundreds of heavy files in a typical training run (see the sketch after this list).
  • When models and applications are deployed, their data is copied at numerous nodes and instances.
  • More and more people will use AI to create and iterate on multiple concepts, experiments and versions.
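
To illustrate how checkpointing alone multiplies stored data, here is a minimal sketch of a periodic checkpoint loop in PyTorch. The model, interval and paths are hypothetical placeholders, and real training runs typically shard checkpoints across many nodes, but the pattern is the same: every periodic save writes a full copy of the model and optimiser state to storage.

```python
import os

import torch
import torch.nn as nn

# Hypothetical model and optimiser; a production model would be far larger.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

CHECKPOINT_DIR = "checkpoints"   # assumed local path
CHECKPOINT_EVERY = 1_000         # assumed interval, in training steps
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

for step in range(1, 10_001):
    # ... forward pass, loss, backward pass and optimizer.step() would go here ...

    if step % CHECKPOINT_EVERY == 0:
        # Each checkpoint is a full copy of the model weights plus optimiser
        # state, so a long run steadily accumulates large files on disk.
        path = os.path.join(CHECKPOINT_DIR, f"step_{step:07d}.pt")
        torch.save(
            {
                "step": step,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
            },
            path,
        )
```

Ten thousand steps at this interval already leave ten full copies of the model and optimiser state on disk; for large models, each copy can reach hundreds of gigabytes or more.
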
Longer retention.

Preserving data fuels AI development and provides transparency.

The data an AI model consumes and creates is a treasure trove of model behaviour, usage patterns, and raw material. The more data we preserve, the better we can train and optimise models to produce higher quality output.

Training a model begins with a large pool of labelled data. Saving data throughout the training run, including checkpoint data, can provide insight into future model behaviour. Once the model is deployed and generating results, each prompt and response is a valuable source for evaluating model performance, tuning the model, and preparing the next training run.
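
To make that retention concrete, here is a minimal sketch of logging each prompt and response as an append-only JSON Lines record so it can later feed evaluation and the next training run. The file path, field names and the placeholder `generate` function are illustrative assumptions, not any particular product's API.

```python
import json
import time
import uuid

LOG_PATH = "inference_log.jsonl"  # assumed retention location


def log_interaction(prompt: str, response: str, model_version: str) -> None:
    """Append one prompt/response pair, with metadata, to a JSON Lines log."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "prompt": prompt,
        "response": response,
    }
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def generate(prompt: str) -> str:
    # Placeholder standing in for whatever model actually serves the request.
    return "placeholder response"


prompt = "Summarise this quarter's sales report."
response = generate(prompt)
log_interaction(prompt, response, model_version="demo-0.1")
```

Records preserved this way can later be sampled for evaluation, filtered into tuning sets or replayed when auditing a model's behaviour.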

Data should be preserved at every reasonable point in the data cycle. 
  • Improving and developing AI requires fresh data and insights; preserved data can provide both.
  • Smarter AI in the future may be able to draw insights from stored data, creating new value. 
  • Copyright laws demand works be licensed for use; preserving data provides an auditable trail.
  • Regulations require secure storage to demonstrate compliance with privacy, legal and ethical guidelines.

Trustworthy AI depends on data transparency.

Keeping data long term is critical for establishing an AI model's trustworthiness. Documenting each decision the model makes and analysing the results helps developers spot model drift and hallucinations.

Tracing errors back to training data can help unpack a given model's decision-making processes and provide data for retraining and optimisation. All these data points should be preserved and shared to provide objective, transparent evidence of the model's performance.
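
As one concrete example of how preserved data can surface drift, the sketch below compares a recent window of a logged quality score against a historical baseline using the population stability index. The scores and the 0.2 threshold are illustrative assumptions; this is just one of many checks that long-term retention makes possible.

```python
import numpy as np


def population_stability_index(baseline, recent, bins: int = 10) -> float:
    """Compare two samples of a numeric signal; a higher PSI means more drift."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_counts, _ = np.histogram(baseline, bins=edges)
    recent_counts, _ = np.histogram(recent, bins=edges)
    # Convert counts to proportions, with a small floor to avoid log(0).
    base_p = np.clip(base_counts / base_counts.sum(), 1e-6, None)
    recent_p = np.clip(recent_counts / recent_counts.sum(), 1e-6, None)
    return float(np.sum((recent_p - base_p) * np.log(recent_p / base_p)))


# Illustrative data: quality scores preserved from an earlier period versus
# scores logged over the most recent window.
rng = np.random.default_rng(0)
baseline_scores = rng.normal(0.80, 0.05, 5_000)
recent_scores = rng.normal(0.72, 0.07, 1_000)

psi = population_stability_index(baseline_scores, recent_scores)
if psi > 0.2:  # a commonly cited rule-of-thumb threshold for significant shift
    print(f"Possible model drift detected (PSI={psi:.2f}); review recent outputs.")
```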