[Figure: two layouts compared, "Chain it all to the GPU" versus "Small models in parallel on the CPU". Frontier model (GPU): one large model, kept at the matmul.]
The small models, in parallel on the CPU
Each turn runs a tier of small models. Run them in parallel on the CPU rather than chaining them to the GPU; otherwise the GPU is pulled away from the large model's matmuls and the big model's throughput collapses.
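The layout described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three "small models" (`detect_language`, `estimate_length`, `flag_question`) and the `frontier_model` stub are hypothetical stand-ins, not anything named in the source. A real system would call actual CPU inference runtimes and a GPU-served model. The sketch shows only the scheduling shape: the small tier fans out over a CPU worker pool, and the large model receives the combined results in a single call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


# Hypothetical small per-turn models; each is a cheap CPU-side function here.
def detect_language(text: str) -> str:
    return "en" if text.isascii() else "other"


def estimate_length(text: str) -> int:
    return len(text.split())


def flag_question(text: str) -> bool:
    return text.rstrip().endswith("?")


SMALL_MODELS: Dict[str, Callable[[str], object]] = {
    "language": detect_language,
    "length": estimate_length,
    "is_question": flag_question,
}


def run_small_tier(text: str, pool: ThreadPoolExecutor) -> Dict[str, object]:
    """Submit every small model at once, then gather the results."""
    futures = {name: pool.submit(fn, text) for name, fn in SMALL_MODELS.items()}
    return {name: fut.result() for name, fut in futures.items()}


def frontier_model(text: str, signals: Dict[str, object]) -> str:
    """Placeholder for the single large-model call that would run on the GPU."""
    return f"reply to {text!r} given signals {sorted(signals)}"


def handle_turn(text: str, pool: ThreadPoolExecutor) -> str:
    # The small tier runs on CPU workers; the GPU sees only the final call.
    signals = run_small_tier(text, pool)
    return frontier_model(text, signals)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(SMALL_MODELS)) as pool:
        print(handle_turn("Is this fast?", pool))
```

A thread pool keeps the sketch short. For small models that are CPU-bound pure Python, `ProcessPoolExecutor` from the same module would likely be the better fit, since threads in CPython share the interpreter lock.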