
Decode is repetitive: why caching primitives and kernels matters
LLM inference feels slow because decode is expensive at scale. Prefill runs once, but decode runs per token—overhead multiplies across the entire output. We address this by...
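To make the prefill/decode asymmetry concrete, here is a toy sketch (not OpenInfer's implementation; `prefill`, `decode_step`, and the cache are illustrative stand-ins): the prompt is processed once, while every generated token pays the cost of the loop body, so cached context and reusable kernels pay off in proportion to the output length.

```python
# Toy sketch (not OpenInfer code): prefill touches the prompt once, but
# decode runs once per generated token, so any work re-done inside the
# loop (kernel setup, allocations, reprocessing old tokens) is paid
# max_new_tokens times.
import random

def prefill(prompt_ids):
    # Stand-in for one pass over the whole prompt that fills a
    # key/value cache (here just the token history).
    return list(prompt_ids)

def decode_step(kv_cache, last_token):
    # Stand-in for a single decode step: attend over the cached context
    # instead of reprocessing it, then extend the cache by one entry.
    kv_cache.append(last_token)
    return random.randrange(1000)   # pretend this is sampling from logits

def generate(prompt_ids, max_new_tokens=8):
    kv_cache = prefill(prompt_ids)      # runs once
    token = prompt_ids[-1]
    out = []
    for _ in range(max_new_tokens):     # runs once per output token
        token = decode_step(kv_cache, token)
        out.append(token)
    return out

print(generate([1, 2, 3]))
```
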
From memory and compute pipelines to context management and assistant workflows, we design the full AI stack. Here we share our progress to drive the future of local intelligence together.

Follow new releases, engineering breakthroughs, and examples of Local AI in action — all built to run closer to where your product lives.

Today, we’re excited to share a big step forward for OpenInfer: we’ve officially joined the Intel® Partner Alliance and Microsoft’s Pegasus Program. These are two of the most...

Porting desktop code to mobile often breaks because mobile file systems are sandboxed and heavily restricted compared to desktop environments. Operations like using relative paths,...
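As a minimal illustration of the usual fix (assumed, not taken from the post; the function and paths are hypothetical), the sketch below resolves files against a sandbox root supplied by the platform rather than the process working directory, and rejects paths that escape it:

```python
# Illustrative only (not from the post): resolve files against a sandbox
# root handed to you by the platform, never the process working directory,
# and refuse anything that escapes it.
from pathlib import Path

def resolve_in_sandbox(sandbox_root: str, relative_name: str) -> Path:
    root = Path(sandbox_root).resolve()
    path = (root / relative_name).resolve()
    if root != path and root not in path.parents:
        raise ValueError(f"{relative_name!r} escapes the app sandbox")
    return path

# On a real device, `sandbox_root` comes from the platform API
# (e.g. the app's documents directory); this value is just a stand-in.
print(resolve_in_sandbox("/tmp/app-sandbox", "models/model.gguf"))
```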

For enterprises, AI isn’t just an opportunity—it’s a liability. Privacy breaches, security gaps, and loss of control can cost more than any productivity gains. From regulatory...

At OpenInfer, we believe the future of AI will not be defined by a single, all-powerful “superintelligence.” Instead, it will emerge through multiplicity — a society of AI agents,...

Edge devices have limited memory, making Mixture of Experts (MoE) models with active parameter selection the optimal solution for deploying sophisticated AI reasoning...
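For intuition, here is a generic top-k MoE routing sketch (illustrative, not any specific model's architecture; all names and shapes are made up): only the selected experts' weights are touched for a given token, so the active-parameter count, not the total parameter count, sets the per-token working set.

```python
# Generic MoE routing sketch (illustrative, not a specific model): the router
# picks the top-k experts for this token, and only those experts' weights are
# used, so the per-token working set scales with k, not with the expert count.
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    scores = router_w @ x                      # one routing score per expert
    top = np.argsort(scores)[-k:]              # indices of the k active experts
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                       # softmax over the chosen k
    # Only the selected experts run (or even need to be resident in memory).
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

d, num_experts = 8, 16
rng = np.random.default_rng(0)
y = moe_layer(rng.normal(size=d),
              rng.normal(size=(num_experts, d)),
              [rng.normal(size=(d, d)) for _ in range(num_experts)])
print(y.shape)   # (8,): same output shape as a dense layer, at a fraction of the work
```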

In our recent posts, we’ve explored how CPUs deliver impressive results for local LLM inference, even rivaling GPUs, especially when LLMs push against the hardware's memory bandwidth...
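A back-of-the-envelope bound makes the point (illustrative numbers, not measurements from the post): when decode is memory-bandwidth bound, each generated token streams roughly all model weights once, so tokens per second is capped by bandwidth divided by weight bytes.

```python
# Back-of-the-envelope only (illustrative numbers): in the bandwidth-bound
# regime, every decoded token streams roughly all weight bytes once, so
# tokens/sec <= memory bandwidth / weight size, regardless of compute.
def bandwidth_bound_tok_s(params_billion, bytes_per_param, bandwidth_gb_s):
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb

# e.g. an 8B-parameter model at ~0.5 bytes/param (4-bit) on a ~100 GB/s CPU
print(f"{bandwidth_bound_tok_s(8, 0.5, 100):.0f} tok/s upper bound")  # ~25
```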

AI inference workloads on CPUs are mostly memory-bound rather than compute-bound, with performance bottlenecks arising from poor cache utilization, static thread scheduling, and...
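For a rough sense of why this is memory-bound (textbook roofline reasoning, assumed rather than quoted from the post): decode is dominated by matrix-vector products, whose arithmetic intensity is about 1 FLOP per weight byte at fp16, far below the several FLOPs per byte a modern CPU can sustain at peak.

```python
# Roofline-style check (textbook reasoning, illustrative numbers): a
# matrix-vector product does ~2 FLOPs per weight but must move every weight
# byte from memory, so its arithmetic intensity is tiny and runtime is set
# by memory traffic rather than by the CPU's peak FLOPs.
def gemv_intensity(rows, cols, bytes_per_weight):
    flops = 2 * rows * cols                      # one multiply-add per weight
    bytes_moved = rows * cols * bytes_per_weight # weight traffic dominates
    return flops / bytes_moved                   # FLOPs per byte

# fp16 weights: ~1 FLOP/byte, far below the ~10 FLOP/byte ridge of a CPU
# with ~1 TFLOP/s peak and ~100 GB/s of memory bandwidth.
print(gemv_intensity(4096, 4096, 2))
```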

When most people think of AI acceleration for client devices, they think GPUs. Some may nod to NPUs or specialized ASICs. But the CPU, the most ubiquitous compute unit in every...

Client-Side Inference, Reimagined: Llama 4 Scout Goes Local
Deploying large AI models across devices is hard. Llama 4 Scout, which we showcase here, typically wouldn’t fit on...

GPUs are a cornerstone of modern AI workloads, driving both large-scale model training and real-time inference applications. However, achieving full utilization of these powerful...

We are thrilled to announce that VentureBeat has covered our latest $8M funding round, highlighting our mission to redefine AI inference at the edge. OpenInfer is building the...

We’re excited to announce the first preview build of our OpenInfer Engine—a powerful AI runtime designed to make on-device inference simple, seamless, and developer-friendly. This...

At OpenInfer, our primary goal is to make integration effortless. We’ve designed our inference engine to be a drop-in replacement—switching your endpoints is as simple as updating...

At OpenInfer, we strive to redefine the boundaries of Edge AI performance. Our latest update demonstrates a 2-3x increase in tokens per second (tok/s) compared to Ollama/llama.cpp...

At OpenInfer, we're dedicated to pushing the boundaries of what's possible with large language models (LLMs). These models, while immensely powerful, often come with hefty hardware...

OpenInfer is on a mission to help AI agents run on any device. In this video, one of our engineers, Vitali, shares a brief demo of how you can run large models and large context...
OpenInfer is now available! Sign up today to gain access and experience these performance gains for yourself.
