CPU: The Processor You Can't Route Around

Why the CPU becomes the most important chip in agentic inference — and why even the GPU companies now agree.

A few weeks ago, the company synonymous with the GPU shipped a standalone CPU and called it “the CPU for agents.”

That sentence should stop you. For a decade the prevailing story of AI compute has been that the CPU is plumbing — the thing that boots the box and feeds the accelerator that does the real work. Now the most successful accelerator company on earth is selling a CPU built specifically for AI, and telling its customers that the accelerator was never going to do the job by itself. Arm, in the same season, shipped its first in-house silicon in thirty-five years aimed at the same target. When the incumbents start building the thing that supposedly doesn’t matter, the ground has already moved.

We’ve believed this for longer than it’s been fashionable, and we want to be precise about the claim — because it’s easy to caricature.

We are not saying the CPU is faster than the GPU. It isn’t, and for the workloads the GPU was built for, it never will be. Dense, high-batch token generation belongs on the accelerator; that’s settled, and nothing here disputes it. The CPU and the GPU are complements, not competitors. What we are saying is narrower and, we think, more consequential: in the agentic era the CPU becomes the most central processor in the system — the host that everything else is dispatched from — and that centrality, not raw throughput, is what decides who owns the platform.

Agentic inference is a different shape

The reason is the workload.

Classic model serving — a chatbot answering a prompt — is one big model running one big computation, batched across thousands of users. It is throughput-bound and embarrassingly parallel, which is exactly what a GPU is for. You keep the accelerator saturated and you win.

An agent doesn’t look like that. An agent is a loop: it reasons a little, calls a tool, waits on an API, reads a file, runs some code, checks the result, reasons again. The trajectory is a long sequence of small, branchy, latency-sensitive steps, most of which touch the accelerator briefly or not at all. Between the model calls — and there is a great deal of between — the work is orchestration, data movement, tool execution, retrieval, and long-context state management. That is CPU work. It always was.
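To make the shape of that loop concrete, here is a minimal sketch that tallies where wall-clock time goes across one agent trajectory. The step names follow the loop described above. The durations and the `TRAJECTORY`, `time_by_processor`, and `share` names are hypothetical placeholders chosen for illustration, not measurements from any real system.

```python
from collections import defaultdict

# Hypothetical trajectory: (step, where it runs, duration in ms).
# Durations are invented for illustration only; they are not measurements.
TRAJECTORY = [
    ("reason", "accelerator", 400),
    ("call_tool", "cpu", 150),
    ("wait_on_api", "cpu", 600),   # host-side wait; the accelerator is not involved
    ("read_file", "cpu", 50),
    ("run_code", "cpu", 300),
    ("check_result", "cpu", 100),
    ("reason", "accelerator", 400),
]


def time_by_processor(trajectory):
    """Sum the duration of every step, grouped by the processor that owns it."""
    totals = defaultdict(int)
    for _step, processor, duration_ms in trajectory:
        totals[processor] += duration_ms
    return dict(totals)


def share(totals, processor):
    """Fraction of total wall-clock time attributed to one processor."""
    return totals.get(processor, 0) / sum(totals.values())


if __name__ == "__main__":
    totals = time_by_processor(TRAJECTORY)
    for name in sorted(totals):
        print(f"{name}: {totals[name]} ms ({share(totals, name):.0%})")
```

With real traces in place of the placeholder numbers, the same tally would show how much of a given agent's time actually sits between model calls.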

And the model calls themselves are fragmenting. Agentic systems lean on a growing tier of small models — routers, rerankers, classifiers, guardrails, draft models for speculative decoding, summarizers — that are latency-sensitive but far too small to justify monopolizing a hundred-thousand-dollar accelerator you’d rather keep busy with the frontier model. Those belong on the CPU, next to the orchestration that already lives there.

Add it up and a striking share of an agent’s wall-clock time — and an even larger share of the decisions that govern it — lives on the CPU, even though the GPU still does the heavy matrix math. The accelerator is essential. It is no longer the center.

The host has always commanded the coprocessor

If you’ve worked in graphics, this is familiar to the point of being obvious. The GPU did ninety-nine percent of the pixel math and was never the host. The CPU owned the frame, held the scene, decided what got submitted and when. The GPU was a magnificent coprocessor that did precisely what it was told.

Inference spent a few years forgetting this, because the workload was monolithic enough that the GPU could pretend to be the whole computer. Agentic workflows end the pretense. There is an operating system for inference — something has to schedule across processors, manage memory, place state, and decide what runs where — and that operating system runs on the CPU. The accelerators are guests.

This is what the new silicon is quietly conceding. The agent CPUs now shipping are tuned for the agentic inner loop — compiling generated code, running tool chains, keeping accelerators fed — not for token generation. The pitch is that agentic workloads need the CPU to coordinate, move data, manage memory, and orchestrate the work around the accelerators. The economics are being restated in the same breath: the metric that matters is shifting from cores per dollar to tokens per dollar — from how much CPU you bought to how much useful AI the whole system produced. That is an operating-system metric. It rewards coordination, not any single core.
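One way to see why tokens per dollar rewards coordination is a back-of-the-envelope calculation. The sketch below assumes a simple model in which output scales linearly with accelerator utilization. The function name, its parameters, and every number in the usage example are hypothetical.

```python
def tokens_per_dollar(peak_tokens_per_second, utilization, system_cost_per_hour):
    """Useful tokens produced per dollar of whole-system cost.

    Assumes (as a simplification) that output scales linearly with
    utilization, where utilization is a fraction between 0 and 1.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return tokens_per_hour / system_cost_per_hour


if __name__ == "__main__":
    # Same hypothetical hardware and hourly cost; only utilization changes.
    poorly_fed = tokens_per_dollar(1000, 0.5, 10.0)
    well_fed = tokens_per_dollar(1000, 0.8, 10.0)
    print(poorly_fed, well_fed, well_fed / poorly_fed)
```

Under these assumptions the hardware and the bill are identical in both cases. The only thing that changes is how well the accelerator is kept fed, which is a scheduling outcome rather than a property of any single core.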

They’re proving it vertically. The opportunity is horizontal.

Here is where we part ways with the chipmakers.

Their answer to “the CPU matters again” is to sell you a better node: a bespoke CPU welded to their accelerator over a coherent memory fabric — one vendor, one stack, one rack. Inside those walls it is a beautiful machine. But it optimizes a single vendor’s hardware in a single node, and that is not the shape of the world.

The real agentic estate is heterogeneous and, more to the point, already deployed. It is a mix of accelerators from different vendors, NPUs, and — by an enormous margin — CPUs that are already sitting in racks doing other work. The CPU is the most widely installed AI-capable processor on the planet: every accelerated box has one, and there are millions of boxes that have nothing else. A faster host chip does nothing to schedule across the fleet you already own. Somebody has to.

The Inference OS

That is the layer we build: an operating system for inference that virtualizes models and compute across whatever silicon you have, and decides, per request, where each piece of work should run.

The scheduler is the heart of it. For every request it weighs available memory, the residency of the relevant KV cache, current load and queue depth on each accelerator, and the shape of the work itself — prompt size, latency target, whether this is prefill or decode — and places it on CPU, GPU, NPU, or some mix of them. KV cache is a first-class citizen, not an afterthought that spills to host memory only when you run out of room. Multiple model-serving engines coexist on the same limited hardware, scheduled against one another the way processes are scheduled on a machine, with the OS doing the memory management that keeps overall throughput high. The CPU is not competing with the accelerator for the matmul; it is running the system that keeps the accelerator at maximum useful utilization — and absorbing everything the accelerator should not be doing in the first place.
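A per-request placement decision of this kind can be sketched as a scoring function over candidate devices. This is an illustrative toy under stated assumptions, not our scheduler: the `Device` and `Request` types, the field names, the weights, and the thresholds are all hypothetical, chosen only to show the inputs named above being weighed against one another.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    kind: str                      # "cpu", "gpu", or "npu"
    free_memory_gb: float
    queue_depth: int
    resident_kv: frozenset = frozenset()   # sessions whose KV cache lives here


@dataclass
class Request:
    session_id: str
    prompt_tokens: int
    latency_target_ms: int
    phase: str                     # "prefill" or "decode"
    memory_needed_gb: float


def score(device, request):
    """Return a placement score, or None if the request cannot fit.

    All weights and thresholds are hypothetical illustration values.
    """
    if device.free_memory_gb < request.memory_needed_gb:
        return None
    total = 0.0
    # KV cache residency: avoid moving or recomputing state.
    if request.session_id in device.resident_kv:
        total += 10.0
    # Load: queued work costs more when the latency target is tight.
    queue_weight = 2.0 if request.latency_target_ms < 200 else 1.0
    total -= queue_weight * device.queue_depth
    # Shape of the work: large prefills favor accelerators.
    if (request.phase == "prefill" and request.prompt_tokens >= 4096
            and device.kind != "cpu"):
        total += 5.0
    return total


def place(devices, request):
    """Pick the best-scoring device that can hold the request, or None."""
    scored = [(score(d, request), d.name) for d in devices]
    feasible = [(s, name) for s, name in scored if s is not None]
    if not feasible:
        return None
    return max(feasible)[1]
```

In this toy, a decode step for a session whose KV cache already sits in host memory lands on the CPU, while a large cold prefill lands on the lightly loaded GPU. A real scheduler would also have to handle splitting work across devices and migrating state, which this sketch leaves out.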

Because that OS lives on the processor that is always present, it turns every box — the eight-GPU server and the lone CPU at the edge alike — into a node. The ubiquity of the CPU stops being a footnote and becomes the distribution strategy.

We didn’t theorize this. We lived it.

Belief is cheap; we have paid for ours. At our last company we moved an expensive GPU pipeline off rented cloud accelerators and onto CPUs in our own datacenter. We saved millions of dollars, cut latency, and improved the quality of the service. That experience is why we started this one.

And we did the unglamorous work to earn the right to the argument. On the right hardware, our CPU prefill runs as much as three times faster than llama.cpp, the reference implementation for CPU inference. We mention that not because faster kernels are the business — they are not — but because they are proof that we understand the CPU well enough to know exactly what it should, and should not, be asked to do. The kernels are the credibility. The operating system is the product.

The decade’s compute demand needs an OS

Inference is the largest new compute demand of the decade, and it is turning agentic in front of us. Agentic systems are not single models on single accelerators; they are distributed programs that span heterogeneous hardware and run on an operating system. That OS lives on the CPU — the one processor in the stack you cannot route around.

The chip companies have spent billions proving the first part of that claim: the CPU is the processor you cannot route around. The second part, an operating system that spans all of that silicon, is still open.

That is the part we are building.