LAYER 06 · SCALE-UP RACK
How 72 GPUs behave like one machine.
NVL72 combines compute trays, switches, copper links, power, and cooling into a rack-scale training unit.
- 72 GPUs / rack
- 5,184 NVLink cables
- 120 kW rack power
- 130 TB/s NVLink fabric
Rack-scale performance depends on how many GPUs can communicate as one coherent system.
FIG. L06 · SIGNATURE · SCALE-UP: THREE WAYS TO WIRE A RACK
What this layer does
Scale-up is the step from a single accelerator package to a rack-level machine. The headline number, 72 GPUs in one rack, isn’t actually the trick. The trick is the switched copper fabric between them. NVL72’s nine NVSwitch trays sit in the middle of the rack; every GPU has eighteen NVLink ports that fan out across all nine trays, two ports per tray. The longest GPU-to-GPU path is one hop into a switch and one hop out of it, both at NVLink speed.
When software issues a memory copy or a collective across all 72 chips, the work runs at fabric bandwidth (~130 TB/s aggregate, ~700 ns end-to-end latency), not at network speed. That latency-and-bandwidth delta is the entire reason the rack behaves as one machine.
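The headline figures can be cross-checked with simple arithmetic. The sketch below is a back-of-envelope check in Python, using only numbers quoted in this section (72 GPUs, 18 NVLink ports per GPU, 5,184 cables, and the 1.8 TB/s per-GPU NVLink figure). The constant names are illustrative, and the cables-per-port ratio is derived from those numbers rather than taken from a published specification.

```python
# Back-of-envelope check of the rack's headline figures.
# All inputs are numbers quoted in this section; names are illustrative.
GPUS_PER_RACK = 72
NVLINK_PORTS_PER_GPU = 18
PER_GPU_NVLINK_GB_S = 1800      # 1.8 TB/s per GPU, expressed in GB/s
NVLINK_CABLES = 5184

# Aggregate fabric bandwidth: every GPU contributes its full NVLink bandwidth.
aggregate_tb_s = GPUS_PER_RACK * PER_GPU_NVLINK_GB_S / 1000

# Bandwidth carried by each NVLink port.
per_port_gb_s = PER_GPU_NVLINK_GB_S / NVLINK_PORTS_PER_GPU

# Total GPU-side ports, and how many cables each one implies (a derived ratio).
gpu_ports = GPUS_PER_RACK * NVLINK_PORTS_PER_GPU
cables_per_port, leftover = divmod(NVLINK_CABLES, gpu_ports)

print(f"aggregate fabric: {aggregate_tb_s} TB/s")
print(f"per port: {per_port_gb_s} GB/s")
print(f"GPU ports: {gpu_ports}, cables per port: {cables_per_port} (remainder {leftover})")
```

The product lands at 129.6 TB/s, which the section rounds to ~130 TB/s, and the cable count divides evenly across the GPU ports.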
How 72 GPUs behave like one machine
Every GPU is two hops from every other GPU.
- 18 NVLink ports per GPU
- 9 NVSwitch trays
- 2 hops, GPU to GPU
- 5,184 NVLink cables
Two hops at fabric speed.
- Per-GPU NVLink: 1.8 TB/s
- Switch hops, end to end: 2
- Round-trip latency: ~700 ns
- Aggregate fabric: ~130 TB/s
A memory copy from GPU 7 to GPU 50 leaves the source, crosses one NVSwitch, lands on the destination. NCCL all-reduce across 72 GPUs runs end-to-end at fabric bandwidth.
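The two-hop claim can be checked on a toy model of the rack. The sketch below is a minimal Python model, assuming the wiring described in this section: 72 GPUs, 9 switch trays, and every GPU linked to every tray. It treats each tray as a single node and ignores port counts and switch internals; the function names are hypothetical.

```python
from collections import deque

NUM_GPUS = 72
NUM_SWITCH_TRAYS = 9

def build_rack():
    """Adjacency lists for a rack where every GPU links to every switch tray."""
    gpus = [("gpu", i) for i in range(NUM_GPUS)]
    switches = [("switch", j) for j in range(NUM_SWITCH_TRAYS)]
    adjacency = {g: list(switches) for g in gpus}
    adjacency.update({s: list(gpus) for s in switches})
    return adjacency

def shortest_path(adjacency, src, dst):
    """Breadth-first search; returns the node sequence from src to dst."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    # Walk the predecessor chain back from dst to src.
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]

rack = build_rack()
path = shortest_path(rack, ("gpu", 7), ("gpu", 50))
hops = len(path) - 1

# Worst case over every ordered pair of distinct GPUs.
max_hops = max(
    len(shortest_path(rack, ("gpu", a), ("gpu", b))) - 1
    for a in range(NUM_GPUS) for b in range(NUM_GPUS) if a != b
)
print(path, hops, max_hops)
```

In this model the GPU 7 to GPU 50 path passes through exactly one switch node, and no pair of GPUs is more than two hops apart.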
Four-plus hops through the network.
- Per-GPU IB / Ethernet: 800 Gb/s
- Leaf + spine hops: 4–6
- Round-trip latency: ~5–10 µs
- Per-pair fabric: 100 Gb/s
Once the work leaves the rack, every collective passes through NICs and an Ethernet or InfiniBand leaf-and-spine network. Latency rises by ~10×, per-GPU bandwidth drops by ~18×, and tensor-parallel layers no longer fit in the step-time budget.
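Both ratios follow from the figures quoted for each side of the rack boundary. Here is a quick Python check using only numbers from this section; the variable names are illustrative, and the latency ratio is a range because the network latency is quoted as a range.

```python
# In-rack figures quoted in this section.
NVLINK_PER_GPU_GBYTES_S = 1800        # 1.8 TB/s
FABRIC_LATENCY_NS = 700

# Scale-out figures quoted in this section.
NETWORK_PER_GPU_GBITS_S = 800
NETWORK_LATENCY_NS = (5_000, 10_000)  # ~5-10 microseconds

# Convert bytes to bits so both bandwidths share a unit.
nvlink_gbits_s = NVLINK_PER_GPU_GBYTES_S * 8
bandwidth_ratio = nvlink_gbits_s / NETWORK_PER_GPU_GBITS_S

latency_ratio_low = NETWORK_LATENCY_NS[0] / FABRIC_LATENCY_NS
latency_ratio_high = NETWORK_LATENCY_NS[1] / FABRIC_LATENCY_NS

print(f"bandwidth drops by {bandwidth_ratio:.0f}x")
print(f"latency rises by {latency_ratio_low:.1f}x to {latency_ratio_high:.1f}x")
```

The bandwidth ratio comes out at exactly 18. The latency ratio spans roughly 7 to 14, which the text rounds to ~10×.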
All 72 GPUs see one pooled memory.
CUDA programs can treat the rack as a single set of peer devices. A cudaMemcpy across the fabric reads or writes HBM on the remote GPU over NVLink, at fabric bandwidth rather than network bandwidth.
NCCL runs at fabric speed.
All-reduce, all-gather, and all-to-all run as switch-mediated collectives across the 72-chip domain. Tensor-parallel and expert-parallel communication becomes far cheaper than over the scale-out network, close to free by comparison.
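To get a feel for why the fabric matters for collectives, the sketch below applies the textbook latency-plus-bandwidth cost model for a ring all-reduce to the figures quoted in this section. This is an illustrative model only: NCCL chooses among several algorithms, and the text describes switch-mediated collectives rather than a ring. Treating ~700 ns and ~7.5 µs (the midpoint of 5–10 µs) as per-step latencies, the 64 MB message size, and the function name are all assumptions.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, bandwidth_bytes_s, step_latency_s):
    """Textbook ring all-reduce estimate: 2(N-1) steps, each moving 1/N of the message."""
    steps = 2 * (num_gpus - 1)
    latency_term = steps * step_latency_s
    bandwidth_term = steps * (message_bytes / num_gpus) / bandwidth_bytes_s
    return latency_term + bandwidth_term

NUM_GPUS = 72
MESSAGE_BYTES = 64 * 10**6          # assumed message size, for illustration

# In-rack: 1.8 TB/s per GPU, ~700 ns per step (assumed from the quoted latency).
in_rack = ring_allreduce_seconds(NUM_GPUS, MESSAGE_BYTES, 1.8e12, 700e-9)

# Over the network: 800 Gb/s = 100 GB/s per GPU, ~7.5 us per step (assumed midpoint).
cross_rack = ring_allreduce_seconds(NUM_GPUS, MESSAGE_BYTES, 100e9, 7.5e-6)

print(f"in rack:    {in_rack * 1e6:.0f} us")
print(f"cross rack: {cross_rack * 1e6:.0f} us")
print(f"slowdown:   {cross_rack / in_rack:.1f}x")
```

Under these assumptions the same collective is more than ten times slower once it leaves the rack. The exact factor depends heavily on message size and on the algorithm NCCL actually selects.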
Outside the rack is a different stack.
Cross-rack traffic falls back to InfiniBand or RoCE Ethernet plus standard NCCL transports. The rack is the unit of dense AI computation; the data-center fabric is everything beyond it.
How to read the wiring
The same question — how big can the fast-communication domain be? — gets three different answers from NVIDIA, Google, and AWS. The signature figure above lays them out; the rubric below is how to compare them.
Three questions to ask of any scale-up topology.
Scale-up architectures look different on paper but answer the same question: how many accelerators can stay in one fast conversation before traffic spills into a slower fabric? The three diagrams above are alternative answers; the questions below are how to compare them.
Who can talk at full speed?
Scale-up is about the size of the fast communication domain before you fall back to slower scale-out networking.
What does the workload need?
Dense model parallelism likes all-to-all bandwidth. Larger pods can trade locality for reach when the job can tolerate more routing.
Where does the boundary move?
From GPU package to rack, from rack to pod, or from fused node to cluster. The architecture decides where communication starts getting expensive.
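The three questions can be turned into a simple placement rule: nest the most communication-heavy parallelism dimensions innermost, then see which ones still fit inside the fast domain. The Python sketch below is a hypothetical helper, not any framework's API. The function name, the dimension names, and the example degrees are assumptions, and the 8-accelerator domain is included only as a contrast to the 72-GPU rack.

```python
def place_groups(degrees, domain_size):
    """Decide which parallel dimensions fit inside the scale-up domain.

    degrees: list of (name, degree), most communication-heavy first.
    A dimension stays on the fast fabric only if the product of its degree
    and all inner degrees still fits within domain_size accelerators.
    """
    placement = {}
    span = 1
    for name, degree in degrees:
        span *= degree
        placement[name] = "scale-up" if span <= domain_size else "scale-out"
    return placement

job = [("tensor", 8), ("expert", 8), ("data", 16)]   # example degrees (assumed)

print(place_groups(job, domain_size=72))  # rack-sized fast domain
print(place_groups(job, domain_size=8))   # smaller fast domain, for contrast
```

With a 72-GPU domain both the tensor and expert dimensions stay on the fast fabric; with an 8-accelerator domain only the tensor dimension does, which is the boundary shift the third question describes.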
The rack is the last place copper wins
Inside NVL72, every GPU-to-GPU link is copper. Step one rack outward and the physics flips: the bits must cross meters, not centimeters, and laser-driven optics are the only thing that survives the trip. Both markets are booming in lockstep, but the revenue flows to different suppliers.
Credo’s Q3 FY2026 revenue was $407.0 M, up 201.5% year over year — sales dominated by active electrical cables that live entirely inside the rack. Coherent’s Datacenter & Communications segment reported $1.362 B in Q3 FY2026, up 41% year over year, on 800G and 1.6T optical modules that exist precisely because copper cannot leave the rack. One sheet-metal wall, two industries.
One of the two industries is now partly owned by the other
On March 2, 2026 NVIDIA and Coherent announced a multiyear advanced-optics agreement paired with a $2 B private placement — 7,788,161 Coherent shares at $256.80, per the Coherent Q3 FY2026 10-Q filed May 6, 2026. The same filing ties Coherent’s capacity buildout directly to NVIDIA’s multiyear purchase commitment.
The copper-inside / optics-outside split was clean when the two industries were served by separate companies. That separation is eroding. Vertical integration into the rest of the data center’s wiring is starting at the GPU vendor.
Copper-inside-the-rack has a sell-by date
Celestica’s Q1 2026 release disclosed a co-packaged optics Ethernet switch award with a hyperscaler, ramping in 2027. On April 13, 2026 Credo agreed to acquire DustPhotonics for $750 M cash plus stock, guiding to more than $500 M of combined optical revenue in fiscal 2027 — a cable company buying its way into silicon photonics. At OFC 2026, Eoptolink demoed a 6.4 Tbps near-packaged-optics module built from thirty-two 200G lanes.
CPO moves the optics onto the switch package itself. When the laser starts on the chip, the sheet-metal wall stops being the boundary. The rack is the last place copper wins — for now.
The rack is where chips stop being parts and start behaving like infrastructure.