Performance
Proven at the Edge
Turn Distributed Hardware Into One AI System
Instead of requiring specialized infrastructure, OpenInfer uses the resources already present across your systems and coordinates them to run large models locally.
OpenInfer supports 500+ AI model architectures and works even across unused VMs, without rewriting or shrinking models.
Unlocking Memory Footprint
Existing runtimes slow down significantly when they run on virtual machines (VMs) with limited memory. For this test, we selected 4 VMs on Amazon Web Services (AWS) within a single data center, with ping latency of around 700 microseconds. Each VM ran on Intel Xeon with 8 GB of RAM and 4 cores. A single small batch was used. By virtually meshing these VMs, OpenInfer ran an 8B Q8 Llama4 model at around 4,493 tokens/sec, a major unlock compared to competitors.

Large Models on Fragmented Compute
In this experiment, we ran inference with a larger model (Llama70B, Q4) and a single small batch on Intel Xeon virtual machines. The VMs were selected on AWS, each with 32 GB of RAM and 16 cores, with ping latency of ~700 microseconds. Existing runtimes failed to operate, while OpenInfer used the 4 available VMs to generate tokens at ~1.3 tokens/second.
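A rough back-of-the-envelope check shows why a single VM is too small in both setups above. The sketch below is an illustration only, not OpenInfer's own sizing logic: it assumes nominal quantization widths (8 bits per weight for Q8, 4 bits for Q4), ignores the KV cache, activations, and quantization metadata, and uses a hypothetical `usable_fraction` of 0.8 for the share of RAM left after the OS and runtime. The function names are made up for this example.

```python
import math

# Nominal bits per weight. Real quantized formats add some overhead
# for scales and metadata, so actual files are somewhat larger.
BITS_PER_WEIGHT = {"Q8": 8, "Q4": 4}


def weight_gb(params_billions, quant):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def min_vms(params_billions, quant, ram_gb_per_vm, usable_fraction=0.8):
    """Smallest number of VMs whose combined usable RAM holds the weights.

    usable_fraction is an assumed value, not a measured one.
    """
    usable_gb = ram_gb_per_vm * usable_fraction
    return math.ceil(weight_gb(params_billions, quant) / usable_gb)


# (label, parameters in billions, quantization, RAM per VM in GB)
setups = [
    ("8B Q8 on 8 GB VMs", 8, "Q8", 8),
    ("70B Q4 on 32 GB VMs", 70, "Q4", 32),
]

for label, params, quant, ram in setups:
    size = weight_gb(params, quant)
    needed = min_vms(params, quant, ram)
    print(f"{label}: ~{size:.0f} GB of weights, at least {needed} VMs")
```

Under these assumptions, the 8B Q8 weights alone come to about 8 GB and the 70B Q4 weights to about 35 GB, so neither model fits in one VM of its setup, while the combined memory of 4 VMs leaves headroom in both cases.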

Large Batch Sizes
Batching (including prefill) becomes more challenging on heterogeneous systems, but it is critical to reducing total cost of ownership (TCO) for adopters. In this experiment, we compared OpenInfer's performance against another existing infrastructure provider. Batch sizes of 20 were used on 2 separate machines with GPUs. OpenInfer delivered 2 times higher throughput than existing options.

