Demystifying Gen AI Infrastructure

The AI economy will be larger than the $400 Bn/Yr cloud economy and requires a transformation of data center infrastructure.

No single chip – even flagship GPUs – can power a Gen AI solution. Thousands of machines work together, networked by 10x the amount of fiber used today in traditional data centers.

Gen AI needs new architectures and technologies to achieve the scale it is demanding.

The current footprint of the cloud is not designed for Gen AI. Alongside new chips and new power and cooling, new fiber connectivity is an essential component of fielding a complete solution.

"The multiple GPUs required to train LLMs don't all operate on the same board or even the same rack. They're spread out due to the power consumption and heat, and we want to network and cluster as many GPUs as we can."

— Microsoft Azure CTO Mark Russinovich, Microsoft Mechanics 05/23/23

AI Economics

The AI Economy Will Be Larger Than the $400 Bn/Yr Cloud Economy

$10trn

US businesses spend $10 trillion on payroll; for every 10% benefit (by way of productivity increases or labor replacement) driven by Generative AI, the economic gain would be $1 trillion annually.

$3.5trn

McKinsey & Co estimates that Gen AI could add the equivalent of $2.6 to $4.4 trillion annually.¹ (For comparison, the United Kingdom's entire GDP in 2021 was $3.1 trillion.)

66%

Goldman Sachs estimates that two-thirds of occupations could be partially automated by AI.²

To feed the explosion of AI applications, the CapEx for Gen AI-capable data centers will eclipse current cloud CapEx.

By the time infrastructure suppliers catch up with demand over the next several years, aggregate data center CapEx will more than double from current levels.

Since 2006, hyperscalers and digital infrastructure providers have spent over $5 trillion to support public cloud computing.³ Today, over $400 billion in annual global CapEx is being spent on digital infrastructure.⁴

The internet-scale datasets powering Gen AI require specialized hardware, power, and connectivity—well beyond what the current infrastructure supports.

High-performance computing has emerged with new architectural patterns, new transmissions protocols, and new connectivity technology to train and run AI models.

Fiber, deployed at many multiples of today's density, is making training and runtime possible.

Gen AI Beneficiaries

multiple companies are involved at multiple layers

Powering AI

Compute & Data

Traditional software hosted in the cloud today uses only a fraction of the compute on shared servers—the hardware is much more powerful than what runs on it.

Training Generative AI models is an entirely different problem. The workload consumes thousands of interconnected and networked GPUs that run for months, processing massive datasets.

Despite this burden, training is merely the entry fee: Running the models consumes the lion's share of costs and maintenance.

This paradigm shift in data center engineering across the multiple use cases of training and runtime radically challenges the design and management principles for hyperscalers.

"It's 150 miles of fiber optic cable."

— NVIDIA CEO Jensen Huang, introducing the next-gen DGX GH200 supercomputer

Training vs Runtime

Training

Runtime

Training a model often spans many days and incurs substantial costs, reaching into millions of dollars. Research firm Epoch AI estimated that the computing power necessary to train a cutting-edge model was doubling every six to ten months; by that measure, training costs would exceed $1 Bn in 2026.⁵

Extensive networking infrastructure distributes training data and recombines the efforts of chips distributed throughout a datacenter, located for optimal power and heat management.⁶

For instance, OpenAI's ChatGPT3 was trained on an Azure supercomputer utilizing a massive assembly of 285,000 Infiniband-connected CPU cores and 10,000 NVIDIA Tensor Core GPUs. Providers like Inflection AI and CoreWeave are developing massive capacity positions for access to chips, attracting over $1.3 Bn in a recent around of funding to build out a cloud comprising 22,000 NVIDIA H100s.⁷

Even once they are trained, models like PaLM and ChatGPT need robust computing power and dense networking to perform inference and respond to users.⁸ AWS' VP of EC2, Dave Brown, estimates 90% of infrastructure spend will be from inference.⁹

Reducing the cost, chip requirements, and power requirements of model inference is an area of intense research and advancement, and recent months have seen remarkable progress in condensing these models to run on less powerful hardware.

Certain open-source large language models (LLMs) exhibiting GPT-3-level performance have now been optimized to a scale that enables them to operate at the edge, even on a smartphone.¹⁰

Modeling Chip Requirements

different types of models have varying requirements for power and cost

The "large" in LLMs is aptly named: The amount of data required to train one is so huge, most people don't have a working frame of reference for it. LLaMa, an open-source model from Meta, was trained on "1.4 trillion tokens" – a single human would take 22,000 years to read it all.¹¹ Google's PaLM is trained on a dataset 3,000,000x larger than one used to train models merely a decade ago.

Dataset size helps drive the number of chips needed in training, where the workload is parceled out across many-to-many fiber meshes to thousands of GPUs. Even smaller training workloads like fine-tuning might require hundreds of chips for multiple days to adjust the model to a new or narrower use case.

No other area of computing is growing at this rate, and the pace of innovation is only outstripped by the rush of hyperscalers trying to compete meaningfully in the AI economy.

NVIDIA CEO on Gen AI Infrastructure

Chip Disambiguation

MTIA training chips by Meta, image credit meta

CPU

GPU

Others

Central Processing Units (CPUs) are the general-purpose chips that power most computers. They are flexible, require the least amount of power of any chip, and are designed to handle a wide variety of tasks.

However, they are not optimized for any particular workload; for long-running or intensive operations, even moderate specialization can dramatically reduce costs—this makes them a poor fit for the more complex activities in data science and artificial intelligence.

Graphics Processing Units (GPUs) were originally designed to draw and render graphics. With this specialization in geometry and support for parallel processing, GPUs are a superior choice for machine learning operations over CPUs.

While GPUs are the most common type of chip used for training large language models, their original design for graphics means there are a number of additional optimizations possible for improving their performance for artificial intelligence.

Tensor Processing Units (TPUs) are specialized chips designed by Google to accelerate machine learning workloads. They are specifically optimized for matrix multiplication, a common operation in neural networks and are highly performant for inference.

Google is not the only player creating custom silicon. AWS has developed two different chips: Trainium and Inferentia, each of which are designed for specific phases of the AI/ML lifecycle of training & inference.

Tesla's D1, Meta's MTIA, and potentially Microsoft's Athena are all examples of published activity in the design and development of chips, each tightly optimized for their company's specific use cases and expanding their supply chain options.¹²

Securing Generative AI

Hallucinations & Exposure

Dataset size puts whole-cloth custom models out of reach, but their growth using human input also introduces two serious problems for enterprises: Hallucinations and Data Exposure.

LLMs today predict the next-most-likely word, but they don't ensure accuracy. They confidently "hallucinate" facts, quotations, mis-reference or misinterpret correlations from their training data alongside useful and accurate information. On the topic of data exposure, Gartner's Avivah Litan is quoted in Forbes, "Employees can easily expose sensitive and proprietary enterprise data when interacting with generative AI chatbot solutions." She continued, "There is a pressing need for a new class of AI trust-, risk-, and security-management tools..."^X

Caught between not having enough data to build entirely new large models and not having the necessary security controls to use commercially available LLMs safely, enterprises can explore techniques like transfer learning or other fine-tuning approaches like LoRA. This allows the use of smaller datasets to specialize existing LLMs with company-specific data safely.

For cloud providers, this poses new challenges: (1) the large scale of densely networked, high-performance compute needed to train massive models and (2) the proliferating workload from retraining many customer-specific enterprise models.

Hallucination

Transfer Learning

LoRA

Definition

Instances where the model generates output that is not grounded in reality or its training data. These "hallucinations" are a consequence of the statistical nature of LLMs, which generate text probabilistically based on patterns learned during training, rather than a genuine understanding of reality. Vector databases open possibilities for improving data quality in responses.

A flexible strategy in machine learning where a model developed for one task is reused as the starting point for another. This technique is particularly beneficial when the new task has limited training data. By leveraging the learned patterns from the original task, the model can achieve better performance on the new task more rapidly than starting from scratch. This approach is common in fields like computer vision and natural language processing.

Low-Rank Adaptive fine-tuning (LoRA) is an approach to fine-tuning. Unlike traditional fine-tuning that modifies all model parameters, LoRA introduces a small, learnable, low-rank factorized matrix to the model's pre-existing parameters. This significantly reduces the number of parameters needed to adapt the model to a new task.

internal data center view with nvidia supercomputers

Inside the Data Center

High-performance Compute & AI

Fiber to connect GPUs together is just as critical as the chip itself. To enable the transformation of the data center, fiber density will increase 10x.

For example, the next-gen supercomputer from NVIDIA, the DGX GH200, features over 150 miles of optical fiber in a single unit.

Prior fiber architectures were not concerned with aggregating workloads; a many-to-many networking footprint is now an absolute requirement for training these models.

AI applications like ChatGPT can require the resources of the entire data center to support a single purpose.

The following section discusses the changes needed to deploy and operate Gen AI infrastructure:

Transmission Protocols
Topologies
Physical Connectivity Design

"To train a model of GPT-5 size, I wouldn't be surprised if they use 30,000, maybe 50,000 H100s from NVIDIA."

— Tesla CEO Elon Musk ¹³

Transmission Protocols

The incredible growth in volume of fiber infrastructure is not the only change to support HPC networks: We have to revisit the protocols that we are deploying.

Widely used, Ethernet is the most flexible network protocol that offers scalability, easy management, and cost-efficiency. However, Ethernet does not service the complete problem set required to train AI applications. For this reason, Remote Direct Access Memory (RDMA) protocol supported by InfiniBand technology has been adopted by infrastructure providers, enabling high data transfer with low latency.

Providers are exploring new protocols to meet the needs for high throughput and low latency in the data center with the scalability and cost effectiveness of Ethernet. The general sense is RDMA over Converged Ethernet (RoCE) will become the HPC transmission protocol of choice for hyperscale data centers supporting AI applications.

Microsoft CTO on AI Infrastructure

Software-based protocol; utilizes CPU resources to manage the data flow.
Packet loss is tolerated. Lost packets are retransmitted and reassembled by the recipient.
Data Center Ethernet is a lossless protocol, supporting high-bandwidth, low-latency applications. Advancements such as Explicit Congestion Notification (ECN), Congestion Notification Packet (CNP) and Priority Flow Control (PFC) were developed to manage the traffic flow and virtually eliminate dropped packets.

High-speed, low-latency transport technology that is scalable and efficient without the use of CPU resources.
Natively supports Remote Direct Memory Access (RDMA), enabling data transfer directly from memory devices in a serialized manner from one node to another without utilizing CPU resources.
Dramatically accelerates transport of data between server clusters while reducing latency.

RDMA over Converged Ethernet leverages RDMA via a converged Ethernet transport network.
Additional congestion management protocols are used with ECN, CNP, and PFC to prioritize RDMA traffic.
Maximizes performance and low latency of RDMA with the high data rate and flexibility of Ethernet.
3 modes of RoCE can be configured – Lossless, Semi-lossless, or Lossy.

Data Center Topology

Upgrading the Network

In addition to deploying more fiber and new transmission protocols to support Generative AI, there are even changes to consider to the topology of the network infrastructure. InfiniBand is not managed the same way that Ethernet is.

Data center topologies greatly affect performance, reliability, and cost. Choosing a specific topology requires a strategic choice between those three considerations. No design is perfectly balanced and whichever is deployed will shape the output and performance of the data center.

Before we dig into InfiniBand topologies, it is important to understand a basic principle of transmission: Ethernet is a shared medium with traffic managed across single links; InfiniBand is a switched fabric topology. This means nodes are connected directly through a single serial-switched path—not a shared one.

This is critical when designing the network and physical layer infrastructure that require clusters of GPUs to be interconnected to process data efficiently for a single application. To scale InfiniBand clusters reliably, you simply add additional switches. This requires a scalable fiber infrastructure to easily connect additional InfiniBand switches when required.

A fat tree network topology is a hierarchical design that consists of a central core switch, connected to aggregation switches, which are in turn connected to a number of edge switches. The edge switches are responsible for connecting to the end nodes, such as servers or workstations. Fat tree networks are well-suited for high-performance computing (HPC) and other applications that require high bandwidth and low latency. They are also relatively easy to scale, as additional edge switches can be added to the network without disrupting the existing connections.

For clusters spanning multiple locations, a dragonfly topology may be used to connect remote clusters. A dragonfly network is also a hierarchical design consisting of a central spine switch connected to several leaf switches. The leaf switches are responsible for connecting to the end nodes, such as servers or workstations. Any topology can be used within each cluster that uses dragonfly to connect clusters.

Because latency is critical when connecting nodes working on a shared workload, shorter distances are required between clusters. When designing an InfiniBand physical layer infrastructure, multiple options are available for interconnecting nodes and switches.

With ever-greater fiber density, new transmission protocols, and changing topologies, hyperscalers are facing an unprecedented level of change to support the newest advancements in Generative AI. However, even the physical layer of connectivity has competing technologies. DAC is most prevalent for short distances, given its low cost and power consumption. However, these chips are not close together: they have to be distributed throughout a data center for power consumption and temperature management. DAC is an insufficiently antifragile solution for Generative AI, and there are viable alternatives.

"The computing fabric is one of the most vital systems of the AI supercomputer. 400 Gbps ultra-low latency NVIDIA Quantum InfiniBand with in-network processing connects hundreds and thousands of DGX nodes into an AI supercomputer."

— NVIDIA CEO Jensen Huang, GTC 2023

Up to 800 Gbps for maximum distance of 500m
Can be used as single 800 Gbps or broken out to 2x400 Gbps
Dual MPO8 interface over singlemode fiber
Highest flexibility in patching using singlemode jumpers for long or short runs

Active and passive twinax copper cable
OSFP/QSFP-DD 800G transceiver attached to each end of cable
Utilized for very short patching – max length of 2 meters
Lowest power consumption

800 Gbps over both singlemode and multimode fiber cable
Single 800G AOC or 800G – 2x400G breakout cable options
MMF lengths up to 100 meters

Denser connections, different network designs, different transmission protocols: Generative AI is driving complex changes for data center providers for both training and inference. These advancements in networking and connectivity are critical to marry up all the processors working together in training and executing this next generation of models.

We have just started to unpack the new complexities for data center providers and model trainers: HVAC, power, supply chain access, likely regulatory environment, and more. The pace of innovation across every component of this ecosystem is unprecedented. viaPhoton is excited to partner with you through it.

NATHAN BENTON

Vice President, Sales

TED NIELSEN

Chief Digital Officer

MARK HENDRIX

Vice President,
Engineering

JEFF JONES

Director,
Solution Architecture

CHRIS HASLEY

Director, Cloud & Hyperscale Sales

Thank you for your interest.

Demystifying Gen AI Infrastructure

The current footprint of the cloud is not designed for Gen AI. Alongside new chips and new power and cooling, new fiber connectivity is an essential component of fielding a complete solution.

"The multiple GPUs required to train LLMs don't all operate on the same board or even the same rack. They're spread out due to the power consumption and heat, and we want to network and cluster as many GPUs as we can."

AI Economics