
LAYER 06 SCALE-UP RACK GPUs / pod · NVLinks

Last revised · MAY 13, 2026

How 72 GPUs behave like one machine.

NVL72 combines compute trays, switches, copper links, power, and cooling into a rack-scale training unit.

72 GPUs / rack
5,184 NVLink cables
120 kW rack power
130 TB/s NVLink fabric

Native unit

GPUs / pod · NVLinks

What constrains it

Rack-scale performance depends on how many GPUs can communicate as one coherent system.

FIG. L06 · SIGNATURE (FIG. 08) · SCALE-UP: THREE WAYS TO WIRE A RACK

TOPOLOGIES
NVL72 (NVIDIA): all-to-all · 72 GPUs · ≈1 MW. Every chip connects to every chip.
TPU Pod (Google): 3D torus · 8,000+ chips · 6 neighbors. Traffic bounces through neighbors to reach far chips.
Trainium Pod (AWS): hybrid → dragonfly · converging on NVL-class scale. Clusters of all-to-all, sparser links between.
Three architectures that solve the same problem of dense rack-scale communication — NVL72 (all-to-all NVLink, 72 GPUs), Google TPU pods (3D torus, ~9 k chips), AWS Trainium (fused node + UltraCluster). Each dot is a chip; line density encodes how directly chips can talk before traffic spills into a slower fabric.

What this layer does

Scale-up is the step from a single accelerator package to a rack-level machine. The headline number, 72 GPUs in one rack, is not the trick. The trick is the switched copper fabric between them. NVL72's nine NVSwitch trays sit in the middle of the rack, and each GPU's eighteen NVLink ports fan out across all nine trays, two per tray. The longest GPU-to-GPU path is one hop into a switch and one hop out, both at NVLink speed.

When software issues a memory copy or a collective across all 72 chips, the work runs at fabric bandwidth (~130 TB/s aggregate, ~700 ns end-to-end latency), not at network speed. That latency-and-bandwidth delta is the entire reason the rack behaves as one machine.
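The headline figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this section, plus one labeled assumption: that each NVLink port is carried by four copper cables. That factor does not appear in the text; it is simply the value that makes the cable count come out.

```python
# Back-of-envelope check of the rack figures quoted in the text.
GPUS = 72                # GPUs per rack
PORTS_PER_GPU = 18       # NVLink ports per GPU
SWITCH_TRAYS = 9         # NVSwitch trays per rack
PER_GPU_TB_S = 1.8       # NVLink bandwidth per GPU, TB/s

# Assumption, not stated in the text: four copper cables per NVLink port.
CABLES_PER_PORT = 4

aggregate_tb_s = GPUS * PER_GPU_TB_S                    # fabric total, TB/s
per_port_gb_s = PER_GPU_TB_S * 1000 / PORTS_PER_GPU     # one port, GB/s
ports_per_tray_per_gpu = PORTS_PER_GPU // SWITCH_TRAYS  # links from one GPU to one tray
total_ports = GPUS * PORTS_PER_GPU                      # GPU-side ports in the rack
total_cables = total_ports * CABLES_PER_PORT            # under the assumption above

print(f"aggregate fabric      : {aggregate_tb_s:.1f} TB/s")
print(f"bandwidth per port    : {per_port_gb_s:.0f} GB/s")
print(f"ports per GPU per tray: {ports_per_tray_per_gpu}")
print(f"GPU-side ports        : {total_ports}")
print(f"copper cables         : {total_cables}")
```

The aggregate comes out at 129.6 TB/s, which rounds to the quoted ~130 TB/s, and the cable count reproduces 5,184 only under the four-cables-per-port assumption.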

How 72 GPUs behave like one machine

The trick is the fabric, not the rack

Every GPU is two hops from every other GPU.

NVL72 switched fabric schematic · NVL72 · 72 GPUs · 9 NVSwitches · single fabric domain
Thirty-six GPUs on the left (compute trays 1–9, 4 per tray) and thirty-six on the right (compute trays 10–18, 4 per tray) connect through nine central NVSwitch trays (NVSwitch 1–9); every GPU links to every switch. A highlighted orange path traces a single memory copy from one GPU to another across two switch hops (hop 1 ~350 ns, hop 2 ~350 ns).
Labels: 1.8 TB/s per-GPU NVLink · 130 TB/s aggregate fabric · ~700 ns GPU → GPU round-trip · ~120 kW rack power
Logical view of the NVL72 fabric. The orange path traces a single memory copy: source GPU (lit, left mid-row) → NVSwitch 5 → destination GPU (lit, right mid-row). Every other GPU pair in the rack is exactly the same two-hop distance.
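The two-hop claim can be checked on a toy model of the logical fabric. The sketch below treats the rack as a bipartite graph (72 GPUs, 9 switch trays, every GPU linked to every tray) and measures link hops with a breadth-first search. The node names and the `hops_from` helper are illustrative, and the model ignores the individual switch chips and ports inside each tray.

```python
from collections import deque

GPUS, SWITCHES = 72, 9

# Logical model: every GPU links to every switch tray.
# There are no direct GPU-GPU or switch-switch links.
adj = {}
for g in range(GPUS):
    adj[("gpu", g)] = [("sw", s) for s in range(SWITCHES)]
for s in range(SWITCHES):
    adj[("sw", s)] = [("gpu", g) for g in range(GPUS)]

def hops_from(src):
    """Breadth-first search: link hops from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# Collect the hop count for every ordered pair of distinct GPUs.
gpu_distances = set()
for g in range(GPUS):
    dist = hops_from(("gpu", g))
    gpu_distances.update(dist[("gpu", h)] for h in range(GPUS) if h != g)

print(gpu_distances)  # {2}
```

Every GPU pair sits at the same distance, so no placement of a job inside the rack is closer or farther than any other in this model.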

18 NVLink ports per GPU
9 NVSwitch trays in the rack
2 max hops, GPU → GPU
5,184 copper NVLink cables

INSIDE NVL72

Two hops at fabric speed.

Per-GPU NVLink: 1.8 TB/s
Switch hops, end to end: 2
Round-trip latency: ~700 ns
Aggregate fabric: ~130 TB/s

A memory copy from GPU 7 to GPU 50 leaves the source, crosses one NVSwitch, and lands on the destination. An NCCL all-reduce across all 72 GPUs runs end to end at fabric bandwidth.

ACROSS RACKS

Four-plus hops through the network.

Per-GPU IB / Ethernet: 800 Gb/s
Leaf + spine hops: 4–6
Round-trip latency: ~5–10 µs
Per-pair fabric: 100 Gb/s

Once the work leaves the rack, every collective walks through NICs and an Ethernet or InfiniBand spine. Latency rises by ~10×, per-GPU bandwidth drops by ~18×, and tensor-parallel layers no longer fit within their per-step communication budget.

~10× latency · ~18× per-GPU bandwidth · the gap is what "one machine" buys
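Both ratios follow from the figures in the two panels above. The sketch below redoes the unit conversion (800 Gb/s of network bandwidth is 100 GB/s) and the division. The variable names are illustrative, and the inputs are the approximate values quoted in this section rather than measurements.

```python
# In-rack figures quoted in the text.
NVLINK_PER_GPU_GB_S = 1800            # 1.8 TB/s per GPU
IN_RACK_LATENCY_NS = 700              # ~700 ns

# Cross-rack figures quoted in the text.
NIC_PER_GPU_GBIT_S = 800              # 800 Gb/s per GPU
CROSS_RACK_LATENCY_NS = (5_000, 10_000)   # ~5-10 microseconds

# Convert bits to bytes before comparing bandwidths.
nic_per_gpu_gb_s = NIC_PER_GPU_GBIT_S / 8
bandwidth_ratio = NVLINK_PER_GPU_GB_S / nic_per_gpu_gb_s

# Latency ratio at the low and high end of the quoted cross-rack range.
latency_ratio_low = CROSS_RACK_LATENCY_NS[0] / IN_RACK_LATENCY_NS
latency_ratio_high = CROSS_RACK_LATENCY_NS[1] / IN_RACK_LATENCY_NS

print(f"network bandwidth per GPU: {nic_per_gpu_gb_s:.0f} GB/s")
print(f"bandwidth gap            : {bandwidth_ratio:.0f}x")
print(f"latency gap              : {latency_ratio_low:.1f}x to {latency_ratio_high:.1f}x")
```

The bandwidth gap is exactly 18× with these inputs. The latency gap spans roughly 7× to 14×, so the "~10×" figure is the middle of that range.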

What software sees

One address space

All 72 GPUs see one pooled memory.

CUDA programs can treat the rack as a single set of peer devices. A cudaMemcpy between two GPUs moves data between their HBM over NVLink, at fabric bandwidth rather than network bandwidth.

Collectives, not packets

NCCL runs at fabric speed.

All-reduce, all-gather, and all-to-all run as switch-mediated collectives across the 72-chip domain. Tensor-parallel and expert-parallel communication becomes cheap compared with the same traffic crossing a network.
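A rough cost model shows how much the fabric changes collective time. The sketch below uses the textbook ring all-reduce estimate (2(N-1) steps, each paying one latency plus the transfer of a 1/N chunk). This is a swapped-in simplification: the text describes switch-mediated collectives, and NCCL chooses its own algorithms. The function name, the message sizes, and the use of the headline per-GPU figures as usable link bandwidth are all assumptions, as is the hypothetical 72-GPU group reached only through the network.

```python
def ring_allreduce_seconds(n_gpus, size_bytes, bw_bytes_per_s, step_latency_s):
    """Alpha-beta estimate for a ring all-reduce (reduce-scatter + all-gather).

    Assumed model: 2*(n-1) steps, each costing one latency plus the time
    to move a chunk of size_bytes / n_gpus at the given bandwidth.
    """
    steps = 2 * (n_gpus - 1)
    chunk_bytes = size_bytes / n_gpus
    return steps * (step_latency_s + chunk_bytes / bw_bytes_per_s)

N = 72
IN_RACK = dict(bw_bytes_per_s=1.8e12, step_latency_s=700e-9)   # 1.8 TB/s, ~700 ns
CROSS_RACK = dict(bw_bytes_per_s=100e9, step_latency_s=5e-6)   # 800 Gb/s, ~5 us

def slowdown(size_bytes):
    """How many times slower the network estimate is than the in-rack one."""
    t_in = ring_allreduce_seconds(N, size_bytes, **IN_RACK)
    t_out = ring_allreduce_seconds(N, size_bytes, **CROSS_RACK)
    return t_out / t_in

small = 72 * 1024     # 72 KiB: latency-dominated
large = 1 << 30       # 1 GiB: bandwidth-dominated

print(f"small message slowdown: {slowdown(small):.1f}x")
print(f"large message slowdown: {slowdown(large):.1f}x")
```

Under this model, small messages are limited by the latency gap and large ones approach the bandwidth gap, so the penalty for leaving the rack sits between the two ratios quoted above.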

The domain boundary

Outside the rack is a different stack.

Cross-rack traffic falls back to InfiniBand or RoCE Ethernet plus standard NCCL transports. The rack is the unit of dense AI computation; the data-center fabric is everything beyond it.

How to read the wiring

The same question — how big can the fast-communication domain be? — gets three different answers from NVIDIA, Google, and AWS. The signature figure above lays them out; the rubric below is how to compare them.

Reading rubric

Three questions to ask of any scale-up topology.

Scale-up architectures look different on paper, but each one fixes how many accelerators can stay in one fast conversation before traffic spills into a slower fabric.

Ask first

Who can talk at full speed?

Scale-up is about the size of the fast communication domain before you fall back to slower scale-out networking.

Then ask

What does the workload need?

Dense model parallelism likes all-to-all bandwidth. Larger pods can trade locality for reach when the job can tolerate more routing.

Finally ask

Where does the boundary move?

From GPU package to rack, from rack to pod, or from fused node to cluster. The architecture decides where communication starts getting expensive.
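The locality-for-reach trade can be put in numbers by counting hops. The sketch below computes the worst-case and average hop count in a 3D torus, where each chip has six neighbors and traffic wraps around each axis. The 16 × 24 × 24 shape is a hypothetical layout chosen only because it gives roughly 9,000 chips; it is not taken from any vendor's documentation.

```python
from itertools import product

def ring_distance(a, b, size):
    """Shortest distance between two positions on a ring of the given size."""
    d = abs(a - b)
    return min(d, size - d)

def torus_hops(src, dst, dims):
    """Minimal hop count between two chips in a torus with wraparound links."""
    return sum(ring_distance(a, b, s) for a, b, s in zip(src, dst, dims))

def torus_stats(dims):
    """Return (diameter, average hops to another chip).

    A torus looks the same from every chip, so measuring from the
    origin is enough.
    """
    origin = (0,) * len(dims)
    dists = [torus_hops(origin, node, dims)
             for node in product(*(range(s) for s in dims))]
    return max(dists), sum(dists) / (len(dists) - 1)

# Hypothetical pod shape with 16 * 24 * 24 = 9216 chips (illustrative only).
POD_DIMS = (16, 24, 24)
diameter, average = torus_stats(POD_DIMS)

print(f"chips        : {POD_DIMS[0] * POD_DIMS[1] * POD_DIMS[2]}")
print(f"worst case   : {diameter} hops")
print(f"average case : {average:.1f} hops")
print("NVL72 switched fabric: 2 hops for every pair")
```

In this hypothetical shape the farthest chip is 32 hops away and the average is about 16, against a flat 2 inside NVL72. That is the cost of reaching two orders of magnitude more chips over direct neighbor links.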

The rack is the last place copper wins

Inside NVL72, every GPU-to-GPU link is copper. Step one rack outward and the physics flips: the bits must cross meters, not centimeters, and laser-driven optics are the only thing that survives the trip. Both markets are booming in lockstep, but the money goes to different suppliers.

Credo’s Q3 FY2026 revenue was $407.0 M, up 201.5% year over year — sales dominated by active electrical cables that live entirely inside the rack. Coherent’s Datacenter & Communications segment reported $1.362 B in Q3 FY2026, up 41% year over year, on 800G and 1.6T optical modules that exist precisely because copper cannot leave the rack. One sheet-metal wall, two industries.

One of the two industries is now partly owned by the other

On March 2, 2026 NVIDIA and Coherent announced a multiyear advanced-optics agreement paired with a $2 B private placement — 7,788,161 Coherent shares at $256.80, per the Coherent Q3 FY2026 10-Q filed May 6, 2026. The same filing ties Coherent’s capacity buildout directly to NVIDIA’s multiyear purchase commitment.

The copper-inside / optics-outside split was clean when the companies on each side were independent. They no longer are: the GPU vendor now holds a stake in an optics supplier. Vertical integration into the rest of the data center's wiring is starting at the GPU vendor.

Copper-inside-the-rack has a sell-by date

Celestica’s Q1 2026 release disclosed a co-packaged optics Ethernet switch award with a hyperscaler, ramping in 2027. On April 13, 2026 Credo agreed to acquire DustPhotonics for $750 M cash plus stock, guiding to more than $500 M of combined optical revenue in fiscal 2027 — a cable company buying its way into silicon photonics. At OFC 2026, Eoptolink demoed a 6.4 Tbps near-packaged-optics module built from thirty-two 200G lanes.

CPO moves the optics onto the switch package itself. When the laser starts on the chip, the sheet-metal wall stops being the boundary. The rack is the last place copper wins — for now.

The rack is where chips stop being parts and start behaving like infrastructure.