The inference OS schedules per request

Incoming request

OpenInfer · the inference OS

What the scheduler weighs

Free memory

KV residency

Queue depth

Prompt size

Latency target

Prefill / decode

Placed on your silicon

CPU: idle host cores

GPU: any vendor

NPU: accelerators

For each request, in microseconds, the scheduler weighs these inputs and places the request on the cheapest silicon that meets the latency target.
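To make the placement rule concrete, here is a minimal sketch in Python of "cheapest silicon that meets the target". It is not OpenInfer's implementation: the `Device` and `Request` types, the `place` function, the linear latency estimate, and every number are hypothetical and chosen only for illustration. The sketch models free memory, KV residency, queue depth, prompt size, and the latency target; it treats a resident KV cache as skipping prefill and does not model decode at all.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Everything below is a hypothetical illustration, not an OpenInfer API.
QUEUE_WAIT_MS = 10.0  # assumed delay added by each request already queued


@dataclass(frozen=True)
class Device:
    name: str                   # e.g. "cpu", "gpu", "npu"
    cost_per_ms: float          # assumed relative cost of occupying the device
    free_memory_mb: int
    queue_depth: int            # requests already waiting on this device
    ms_per_prompt_token: float  # assumed prefill time per prompt token
    kv_resident: bool = False   # is this request's KV cache already here?


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    memory_needed_mb: int
    latency_target_ms: float


def estimate_latency_ms(device: Device, request: Request) -> float:
    """Crude linear estimate: queueing delay plus prefill time."""
    # Assumption: a resident KV cache means no prompt tokens need prefill.
    prefill_tokens = 0 if device.kv_resident else request.prompt_tokens
    queue_ms = device.queue_depth * QUEUE_WAIT_MS
    return queue_ms + prefill_tokens * device.ms_per_prompt_token


def place(request: Request, devices: Sequence[Device]) -> Optional[Device]:
    """Return the cheapest device that fits in memory and meets the target."""
    best: Optional[Device] = None
    best_cost = float("inf")
    for device in devices:
        if device.free_memory_mb < request.memory_needed_mb:
            continue  # not enough free memory
        latency = estimate_latency_ms(device, request)
        if latency > request.latency_target_ms:
            continue  # would miss the latency target
        cost = latency * device.cost_per_ms
        if cost < best_cost:
            best, best_cost = device, cost
    return best  # None when no device qualifies


if __name__ == "__main__":
    fleet = [
        Device("cpu", cost_per_ms=1.0, free_memory_mb=8000,
               queue_depth=0, ms_per_prompt_token=0.5),
        Device("gpu", cost_per_ms=5.0, free_memory_mb=16000,
               queue_depth=2, ms_per_prompt_token=0.02),
    ]
    short = Request(prompt_tokens=100, memory_needed_mb=1000, latency_target_ms=100.0)
    long = Request(prompt_tokens=1000, memory_needed_mb=1000, latency_target_ms=100.0)
    for req in (short, long):
        chosen = place(req, fleet)
        print(req.prompt_tokens, "->", chosen.name if chosen else "no device")
```

With these made-up numbers, the short prompt lands on the idle CPU because it meets the target at lower cost, while the long prompt moves to the GPU because the CPU estimate would miss the target. A real scheduler would need measured latency models rather than the linear guess used here.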