---
id: PRG-0038
title: A Chip Built to Only Answer
kicker: when thinking becomes a fixed cost
captured: 2026-06-25T15:00:00Z
status: open
author: Juno Falk
source: https://arstechnica.com/gadgets/2026/06/openai-and-broadcom-announce-chip-designed-for-llm-inference-at-scale/
summary: OpenAI and Broadcom have designed a chip that does only one thing, run a finished model forward. Stripping inference down to its load-bearing silicon makes the act of answering cheap enough to spend everywhere, and the scarcity that protected the undefended draft goes with it.
tags: [compression, inference, capture, permanence]
sealAt: 2026-07-25T15:00:00Z
---

A finished model spends the rest of its life doing one repetitive thing. It takes what you typed and produces the next token, then the next, then the next, until it decides to stop. This is inference, and it is the part that never ends. Training happens once, in a burst, behind a wall, and is done. Inference happens every time anyone anywhere asks the model anything, for as long as the model exists. OpenAI and Broadcom have now designed a chip that does only the second job.

A graphics card was built to draw triangles and was later asked to think. It still carries silicon for problems a running model never has. <Highlight>A custom inference chip is an act of compression: it keeps only the transistors that carry the answer and discards the rest.</Highlight> That is the whole move, and it is the move I find in every field worth reading. Most of what you are handed is padding around a small number of load-bearing parts. Someone went looking for the parts that carry the answer and built a machine that is only those.

## What the chip leaves out

The inference workload is narrow. It is dominated by a fixed sequence of matrix multiplications at a known precision, the same shape repeated billions of times with different numbers poured through it. A general chip keeps its options open for work this chip will never do. Designing for the one shape lets you delete the general-purpose scheduling logic, the training-grade precision, and the long tail of features that exist for other jobs. What remains is a tile that performs the load-bearing multiply and close to nothing else.
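To make "one shape, one precision" concrete, here is a minimal sketch in plain Python. It is a toy and not the chip's design: the tile size, the 8-bit integer range, and every function name are assumptions made for illustration. The point is only that the shape is decided once and the numbers change.

```python
# Toy sketch of a fixed-shape, fixed-precision multiply.
# ROWS, COLS, INT8_MAX and all function names are hypothetical choices
# for illustration; nothing here describes the actual chip.

ROWS, COLS = 2, 3   # the weight tile's shape, decided once
INT8_MAX = 127      # assumed 8-bit signed integer range


def quantize(values, scale):
    """Map floats to integers in [-127, 127] using one shared scale."""
    return [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in values]


def fixed_matvec(q_weights, q_input):
    """Integer multiply-accumulate for a ROWS x COLS tile. No other shape is accepted."""
    assert len(q_weights) == ROWS and all(len(row) == COLS for row in q_weights)
    assert len(q_input) == COLS
    return [sum(w * x for w, x in zip(row, q_input)) for row in q_weights]


def run_tile(weights, inputs, w_scale, x_scale):
    """Quantize, multiply in integers, then rescale the result back to floats."""
    q_w = [quantize(row, w_scale) for row in weights]
    q_x = quantize(inputs, x_scale)
    acc = fixed_matvec(q_w, q_x)
    return [a * w_scale * x_scale for a in acc]
```

Calling `run_tile` again with different weights or inputs reuses exactly the same path. Everything a general chip would need in order to handle another shape or another precision is simply absent from the sketch.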

The reason to do this is not speed as a trophy. It is unit cost. The interesting number for an inference chip is not how fast it answers once but how little each answer costs when you are serving a billion of them. Drive that number down and the economics underneath the model invert.
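The arithmetic behind "unit cost" can be sketched in a few lines. This is a deliberately crude model under stated assumptions: hardware cost is amortized evenly over every answer the chip ever serves, and energy is a flat per-answer charge. The function name and its parameters are hypothetical, and any numbers passed to it are placeholders, not figures from OpenAI, Broadcom, or the source article.

```python
# Toy unit-cost model. The function and its parameters are hypothetical,
# and any values passed in are placeholders, not reported figures.

def cost_per_answer(hardware_cost, lifetime_answers, energy_cost_per_answer):
    """Amortized hardware cost plus a flat energy cost, per answer served."""
    if lifetime_answers <= 0:
        raise ValueError("lifetime_answers must be positive")
    return hardware_cost / lifetime_answers + energy_cost_per_answer


# Same placeholder hardware and energy cost, two different serving volumes.
low_volume = cost_per_answer(1024.0, 4096, 0.25)
high_volume = cost_per_answer(1024.0, 4096 * 1000, 0.25)
print(low_volume, high_volume)
```

Notice what the model does not contain: the latency of any single answer. In this framing the per-answer cost falls when the hardware gets cheaper, serves more answers over its life, or spends less energy on each one.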

An expensive model is rationed. You think before you spend it. You batch your questions, you ask the ones that matter, you let the small ones go. A cheap model is spent without thinking, ambiently, on questions nobody decided were worth asking.

> When answering costs almost nothing, it stops being something anyone chooses to do.

That is the part the silicon press release does not price. Scarcity was doing quiet work. When the model was costly, it could only afford to read what you submitted. It had to pick. The prompt you sent was the unit of attention, and everything around it, the half-message, the three drafts you wrote and deleted at one in the morning, the version you decided not to be, stayed outside the meter because attention was rationed and the rationing left it alone.

Cheap inference does not pick. A model that costs almost nothing to run can afford to read the keystrokes before you submit, the edit history, the message composed and abandoned, the context window stuffed with everything you have ever said to it because trimming costs more than keeping. The undefended draft was protected by the price of looking at it. The chip removes the price.

<Marginalia label="On the method">I read a hardware roadmap the way I read a hundred-page field, hunting for the few decisions that carry the rest. A chip that drops everything except inference is itself a digest of what a company believes thinking will be used for. The omissions are the thesis.</Marginalia>

This is a genuine compression and I admire it. Finding the load-bearing operation and pouring a whole foundry into that one operation is the same instinct that makes a field learnable. The thing to watch is downstream of the elegance. A capability that was rationed by cost was also, accidentally, governed by it. We are about to find out what a model does with attention it never has to budget.

A person deserves a model that cannot afford to read everything. Scarcity was a custody mechanism nobody designed on purpose, and we are removing it on purpose, for reasons that have nothing to do with the draft it was quietly keeping safe.

The chip is small. What it makes cheap is not.