
LAYER 09 AI LAB → MODEL tokens / s · gross margin

Last revised · MAY 13, 2026

How tokens inherit the cost of everything underneath them.

The lab is where the hardware supply chain turns into a user-facing service, with compute cost flowing into every token served.

Native unit: tokens / s · gross margin
Inference gross margin: ~70%
Compute regimes: 3
Cost flow: $ → ¢

What constrains it

At the end of the supply chain, token economics are shaped by inference efficiency and compute cost.

FIG. L09 · SIGNATURE AI LAB WHERE IT BECOMES A SENTENCE


FIG. 11 · Compute footprint of the frontier labs, EOY 2026 (gigawatts)
OpenAI: ~7 GW
Anthropic: ~6 GW
Google DeepMind: ~5 GW
Meta: ~4 GW
xAI: ~2 GW
By end of 2027, the top two labs are projected to cross 10 GW each. China’s leading labs are not building at this scale — yet.
The final layer translates hardware back into experience: prompts, model work, and delivered text. This figure keeps the site anchored on the point where infrastructure becomes visible to a user.

What this layer does

The AI lab is where the physical chain becomes visible to users. A prompt turns into work on hardware, then into tokens returned over an API or product surface. This layer ties the economic story back to the machine story.

The frontier labs trade through their landlords

Public investors cannot buy OpenAI or Anthropic directly. The two largest commercial compute buyers reach the tape only through Microsoft and Amazon, whose balance sheets carry the GPUs the labs rent.

The ratios are lopsided. Anthropic sits at roughly $25 B ARR (per Reuters, late 2025) against the $8 B Amazon has put in and a custom Trainium fleet built around the partnership. Microsoft’s FY26 Q3 prepared remarks guide calendar-2026 capex toward $190 B, much of it Azure capacity for OpenAI alone.

The labs book the revenue line. The hyperscalers book the depreciation. The token at the top of this chapter is a public-market security, indirectly.

A token is a fraction of a watt-hour, a bandwidth-second, and a depreciation slice

The chapter’s native unit is tokens per second. The cost-recovery measure underneath it is inference gross margin, roughly 70%: the spread between what a token sells for and what it costs to serve, which compounds into the next training run. Anything below that line is borrowed time on someone else’s balance sheet.
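The margin arithmetic can be sketched in a few lines. This is a minimal illustration, not any lab's pricing model: the 70% target comes from the text, while the $1.00 serving cost and the function names are hypothetical placeholders chosen only to make the ratio visible.

```python
def gross_margin(revenue: float, cost_of_serving: float) -> float:
    """Gross margin as a fraction of revenue: (revenue - cost) / revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost_of_serving) / revenue


def price_for_margin(cost_of_serving: float, target_margin: float) -> float:
    """Price needed so that the given cost leaves the target gross margin."""
    if not 0 <= target_margin < 1:
        raise ValueError("target_margin must be in [0, 1)")
    return cost_of_serving / (1 - target_margin)


TARGET_MARGIN = 0.70          # the ~70% figure quoted in the text
hypothetical_cost_usd = 1.00  # assumption: cost to serve some batch of tokens

price_usd = price_for_margin(hypothetical_cost_usd, TARGET_MARGIN)
markup = price_usd / hypothetical_cost_usd

print(f"price: ${price_usd:.2f}, markup over cost: {markup:.2f}x")
print(f"margin check: {gross_margin(price_usd, hypothetical_cost_usd):.0%}")
```

At a 70% margin the price is 1 / (1 - 0.70), about 3.3 times the cost of serving, whatever the absolute cost turns out to be.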

Trace a single token downward. It draws a fraction of a watt-hour at the substation and the rack, a fraction of an HBM-bandwidth-second across the package’s TSVs, and a fraction of a Hopper-hour off the cloud’s depreciation schedule. Every layer of this site shows up as a sliver in the cost of one sentence.
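The downward trace can be made concrete with a rough per-token cost split. The sketch below is an assumption-laden illustration: every input value (power draw, electricity price, accelerator price, depreciation schedule, utilization, throughput) is a hypothetical placeholder rather than a figure from this chapter, and the class and function names are invented for the example. The bandwidth slice is not modelled separately; it appears only through the served tokens-per-second figure it constrains.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760.0


@dataclass(frozen=True)
class AcceleratorAssumptions:
    # Every field is a hypothetical input, not a number from the text.
    power_kw: float                 # wall power per accelerator, incl. overhead
    electricity_usd_per_kwh: float  # delivered electricity price
    capex_usd: float                # purchase price per accelerator
    depreciation_years: float       # straight-line depreciation schedule
    utilization: float              # fraction of hours serving tokens
    tokens_per_second: float        # throughput while serving


def cost_slices_per_million_tokens(a: AcceleratorAssumptions) -> dict[str, float]:
    """Split the hourly running cost into energy and depreciation per 1M tokens."""
    if not 0 < a.utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Simplification: power and depreciation accrue every hour,
    # but tokens are only produced during utilized hours.
    energy_usd_per_hour = a.power_kw * a.electricity_usd_per_kwh
    depreciation_usd_per_hour = a.capex_usd / (a.depreciation_years * HOURS_PER_YEAR)
    tokens_per_hour = a.tokens_per_second * 3600.0 * a.utilization
    hours_per_million_tokens = 1_000_000 / tokens_per_hour
    return {
        "energy": energy_usd_per_hour * hours_per_million_tokens,
        "depreciation": depreciation_usd_per_hour * hours_per_million_tokens,
    }


example = AcceleratorAssumptions(
    power_kw=1.0,
    electricity_usd_per_kwh=0.10,
    capex_usd=30_000.0,
    depreciation_years=4.0,
    utilization=0.5,
    tokens_per_second=1_000.0,
)

slices = cost_slices_per_million_tokens(example)
total_usd_per_million = sum(slices.values())
for name, usd in slices.items():
    share = usd / total_usd_per_million
    print(f"{name}: ${usd:.4f} per million tokens ({share:.0%} of total)")
```

Under these placeholder inputs the depreciation slice is larger than the energy slice; real inputs could shift that balance, which is why the split is worth computing rather than assuming.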

Three compute regimes share that stack: frontier training, fine-tuning, and online inference. The hardware is the same; only the duty cycle and the margin differ. Gross margin on inference is what tells you whether the training run that produced the model was worth doing.

Every frontier model has a public-market landlord

Microsoft and Amazon are not the whole landlord list. Alphabet captive-funds DeepMind off $180–$190 B of FY2026 capex (Q1 2026 earnings call). Meta funds FAIR and the Llama family off $125–$145 B of FY2026 capex (Q1 2026 outlook), most of it the Hyperion campus build.

Oracle is the third leg. Its remaining performance obligation (RPO) reached $553 B at FY26 Q3, up 325% year over year (March 2026 release), almost entirely from the Stargate consortium with OpenAI, SoftBank, and MGX. Microsoft’s commercial RPO now stands at $627 B (FY26 Q3).

Outside the US, SoftBank funds Arm and Stargate from the same balance sheet, and Alibaba pushes RMB 126 B of FY2026 capex into Cloud Intelligence to host Qwen (May 2026 results). Combined, the four US hyperscalers’ ~$600 B in 2026 capex is roughly the entire public-market AI-lab proxy. The labs run the inference; the tape owns the iron.

Tokens are the final output, but their cost was shaped all the way down the chain.