
LAYER 05 ACCELERATOR DIE transistors · tFLOPS

Last revised · MAY 13, 2026

Why modern accelerators are becoming multi-die systems.

The accelerator is no longer just one rectangle of silicon. It is a tightly integrated package of logic, memory interfaces, and die-to-die links.

2+ logic dies
HBM: co-packaged memory
D2D: local links
Reticle: scaling limit

Native unit

transistors · tFLOPS

What constrains it

Large accelerators are constrained by reticle area, package layout, and the yield of multi-die assembly.

FIG. L05 · SIGNATURE ACCELERATOR WHERE COMPUTE LIVES


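FIG. 07 · The accelerators of 2026–27 · 5 entries

| Chip | Maker | Node | Memory | FP4 PFLOPS |
| --- | --- | --- | --- | --- |
| Rubin | NVIDIA | N3 | HBM4 | ~5 |
| TPU v7 | Google | N3 | HBM3e | ~3 |
| Trainium 3 | AWS | N3 | HBM3e | ~2.5 |
| MI400 | AMD | N2 | HBM4 | ~4 |
| Ascend 910D | Huawei | 7 nm | HBM2e | ~0.8 |
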
Huawei is on 7 nm and still in the conversation. Most of the gap to Rubin is networking and packaging — the levers China can keep pulling without ASML.
Use this as a quick comparison plate. The deeper story is not one winning number, but how compute, memory, and packaging rise together.

What this layer does

Modern accelerators combine compute, cache, memory interfaces, and package constraints in one design problem. Multi-die assembly expands what the chip can do, but it also makes layout and yield harder to manage.

That shift matters because the word “chip” now hides a lot. Some leading accelerators are really multiple reticle-scale logic pieces, dense package interconnect, and large memory systems presented to software as one product.

Read the architecture

Read the accelerator as a packaged compute system

Modern accelerators scale by dividing the silicon, then rebuilding the illusion of one chip.

The signature figure compares finished products. This guide explains the architectural move underneath them: bigger AI accelerators are becoming coordinated assemblies of dies, memory, and local interconnect.

Constraint 01

One die hits the reticle wall.

Once the useful accelerator outgrows what fits in one lithography exposure field (roughly 26 mm × 33 mm, about 858 mm², on current scanners), scaling by “just make the die bigger” breaks down.

Constraint 02

Chiplets recover manufacturability.

Splitting logic across multiple dies can improve yield and preserve architecture growth, but only if the package reunites them well.

Constraint 03

Die-to-die links must feel local.

NVIDIA’s Blackwell Ultra uses two reticle-sized dies connected by a 10 TB/s die-to-die interface so software still sees one accelerator.

Constraint 04

Memory and network keep score.

Google’s Ironwood exposes dual chiplets with dedicated HBM, while HBM bandwidth and scale-up links determine whether all that compute stays busy.
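Constraints 01 and 02 can be made concrete with a textbook Poisson yield model. The sketch below is illustrative only: the defect density, the 1,600 mm² logic budget, the assembly-yield values, and the function names are all assumptions, not figures from any foundry or vendor. It compares the wafer area consumed per working product for a hypothetical monolithic die against the same logic split across two dies.

```python
import math

RETICLE_MM2 = 26 * 33   # 858 mm^2: standard full-field scanner exposure
D0_PER_CM2 = 0.1        # hypothetical killer-defect density, not a foundry figure


def poisson_yield(area_mm2: float, d0_per_cm2: float = D0_PER_CM2) -> float:
    """Probability that a die of the given area has zero killer defects."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)


def silicon_per_good_package(total_logic_mm2: float, n_dies: int,
                             assembly_yield: float = 1.0,
                             d0_per_cm2: float = D0_PER_CM2) -> float:
    """Wafer area (mm^2) consumed per working package.

    Assumes dies are tested before assembly, so only good dies are packaged,
    and that a failed assembly scraps every die in the package.
    """
    die_area = total_logic_mm2 / n_dies
    good_die_cost = die_area / poisson_yield(die_area, d0_per_cm2)
    return n_dies * good_die_cost / assembly_yield


if __name__ == "__main__":
    total = 1600.0  # mm^2 of logic (assumed), more than one reticle field holds
    print(f"fits one reticle field: {total <= RETICLE_MM2}")
    print(f"hypothetical monolithic: {silicon_per_good_package(total, 1):.0f} mm^2")
    for ay in (1.0, 0.95, 0.80):
        cost = silicon_per_good_package(total, 2, assembly_yield=ay)
        print(f"two dies, assembly yield {ay:.2f}: {cost:.0f} mm^2")
```

Under these assumptions the two-die version uses less than half the silicon per good product, and it keeps that advantage until assembly yield falls to roughly 45%. That is the sense in which chiplets help "only if the package reunites them well." The monolithic case is a thought experiment, since a 1,600 mm² die could not be exposed in a single field to begin with.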

NVIDIA Blackwell Ultra: 2 dies · 288 GB HBM3E · 10 TB/s D2D · up to 8 TB/s HBM

Google TPU7x (Ironwood): 2 chiplets · 192 GiB HBM · 7.38 TB/s HBM · D2D link about 6× a 1D ICI link

Do not overread

Peak FLOPs alone hide memory, packaging, and software-exposed topology.

Watch instead

Dies per accelerator, HBM bandwidth, and die-to-die links tell a truer story.

Why it matters

The “chip” is now a small compute system packaged as one product.
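One way to see why peak FLOPs alone mislead is the roofline model: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below uses the HBM bandwidths quoted in this guide (8 TB/s and 7.38 TB/s). The 5 PFLOPS peak is a placeholder assumption applied to both parts for illustration, not a published figure for either one, and the function names are hypothetical.

```python
def attainable_flops(peak_flops: float, hbm_bytes_per_s: float,
                     flops_per_byte: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, hbm_bytes_per_s * flops_per_byte)


def ridge_point(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP per byte) above which a kernel is compute-bound."""
    return peak_flops / hbm_bytes_per_s


PEAK_FLOPS = 5e15  # placeholder peak (assumption), same for both rows
HBM_BANDWIDTH = {
    "Blackwell Ultra": 8.0e12,   # "up to 8 TB/s HBM" as quoted in this guide
    "Ironwood": 7.38e12,         # "7.38 TB/s HBM" as quoted in this guide
}

if __name__ == "__main__":
    for name, bw in HBM_BANDWIDTH.items():
        print(f"{name}: ridge at {ridge_point(PEAK_FLOPS, bw):.0f} FLOP/byte")
        for intensity in (10, 100, 1000):
            frac = attainable_flops(PEAK_FLOPS, bw, intensity) / PEAK_FLOPS
            print(f"  intensity {intensity:>4}: {frac:.1%} of peak")
```

With these inputs a kernel needs several hundred FLOPs per byte of HBM traffic before the compute units become the limit. Below that, HBM bandwidth sets the delivered number, not the headline peak.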

The package now carries the whole market

NVIDIA’s Q1 FY2027 data-center revenue was $75.2 billion in a single quarter. That is roughly thirteen times AMD’s $5.775 billion data-center quarter (Q1 FY2026, up 57% year over year) and nearly nine times Broadcom’s $8.4 billion AI semiconductor quarter (Q1 FY2026, up 106% year over year).
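The multiples quoted above follow directly from the reported quarterly figures. A minimal check, using only the numbers in the paragraph (variable names are illustrative):

```python
# Quarterly revenue in billions of US dollars, as quoted in the text.
NVIDIA_DC = 75.2      # NVIDIA data center, Q1 FY2027
AMD_DC = 5.775        # AMD data center, Q1 FY2026
BROADCOM_AI = 8.4     # Broadcom AI semiconductors, Q1 FY2026

nvidia_vs_amd = NVIDIA_DC / AMD_DC
nvidia_vs_broadcom = NVIDIA_DC / BROADCOM_AI

if __name__ == "__main__":
    print(f"NVIDIA / AMD:      {nvidia_vs_amd:.2f}x")
    print(f"NVIDIA / Broadcom: {nvidia_vs_broadcom:.2f}x")
```

The fiscal quarters are not calendar-aligned across the three companies, so the ratios are an order-of-magnitude comparison and not a like-for-like one.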

The competitive set is also narrower than the technical menu suggests. Marvell’s FY2026 data-center revenue was $6.10 billion, or 74% of total company revenue, almost entirely custom silicon for a few hyperscalers.

Reticle limits forced multi-die assembly. Multi-die assembly then concentrated the work in the few firms that can finance interposer supply, HBM allocation, and substrate capacity in lockstep.

One wafer can also be the whole chip

The chapter’s argument is that reticle limits forced multi-die packaging. Cerebras is the live counterexample. The WSE-3 is a single 46,225 mm² die — one chip per wafer, with no interposer and no die-to-die links.
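To put the 46,225 mm² figure in scale, the sketch below compares it with one standard exposure field and with the area of a 300 mm wafer. The 26 mm × 33 mm field and the 300 mm wafer diameter are standard industry values assumed here; they are not stated in the text above.

```python
import math

WSE3_AREA_MM2 = 46_225
RETICLE_FIELD_MM2 = 26 * 33      # 858 mm^2, assumed standard exposure field
WAFER_DIAMETER_MM = 300          # assumed wafer size

# Side length if the die were a perfect square (46,225 = 215 * 215).
side_mm = math.isqrt(WSE3_AREA_MM2)

# How many full exposure fields of area the die is equivalent to.
equivalent_fields = WSE3_AREA_MM2 / RETICLE_FIELD_MM2

# Share of the circular wafer's area that ends up in the finished die.
wafer_area_mm2 = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
wafer_fraction = WSE3_AREA_MM2 / wafer_area_mm2

if __name__ == "__main__":
    print(f"square side: {side_mm} mm")
    print(f"equivalent reticle fields: {equivalent_fields:.1f}")
    print(f"fraction of wafer area: {wafer_fraction:.0%}")
```

The die is equivalent to more than fifty exposure fields of silicon, which is why it is a counterexample to the packaging route and not to the reticle limit itself: the wafer is still patterned field by field, and the fields are left connected on the wafer and not diced apart.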

The public market just priced that bet. On May 15, 2026 Cerebras closed its Nasdaq IPO of 34,500,000 Class A shares at $185.00, for $6.4 billion of gross proceeds, and now trades as CBRS (Cerebras closing 8-K). FY2025 revenue was $510 million, up 76% year over year (Cerebras 424B4).

Multi-die assembly is still the dominant architecture. Wafer-scale is the listed exception.

The “not-NVIDIA” accelerator stack runs through two suppliers

TPU, Trainium, Maia, and MTIA are designed inside hyperscalers but fabricated outside them. Broadcom and Marvell are the public-market proxies for that custom-ASIC layer.

Broadcom’s Q1 FY2026 AI revenue was $8.4 billion, up 106% year over year, and the company guided Q2 AI semiconductor revenue to $10.7 billion (Broadcom Q1 FY2026 release). Marvell’s FY2026 data-center revenue was $6.10 billion, 74% of total company revenue, almost entirely custom silicon for a few hyperscalers (Marvell FY2026 10-K).

Each firm serves only a handful of hyperscaler customers. Two semiconductor companies have each quietly built a multi-billion-dollar AI silicon business that scales with hyperscaler capex rather than with merchant GPU demand.

The accelerator is where many earlier constraints become one physical product.