As AI adoption grows and agents multiply inference demands, a reliable inference foundation becomes critical to running AI initiatives profitably.

Today at Red Hat Summit, we are excited to announce that Red Hat AI Inference now runs on any managed Kubernetes service. This expansion enables organizations to leverage a consistent, open inference stack and Kubernetes-native operations wherever they already run their workloads. At launch, we are delivering validated deployment blueprints for 2 platforms: CoreWeave Kubernetes Service (CKS) and Azure Kubernetes Service (AKS).

With this release, Red Hat AI Inference Server becomes Red Hat AI Inference, extending its existing vLLM-based inference server capabilities with llm-d-powered distributed inference orchestration.

The case for an open inference foundation

Organizations scaling AI face a choice that rarely gets surfaced until it's too late: build inference on a proprietary solution that constrains your hardware, model, and deployment options, or assemble your own stack from open source components that lack the enterprise support needed to run reliably in production.

Neither is a foundation. One trades flexibility for dependency. The other trades agility for operational fragility. Both become harder to sustain as AI demands grow, agents multiply inference runs, and the cost of getting it wrong compounds.

What organizations need is an inference foundation that is open enough to run any model on any hardware in any environment, and supported to operate with confidence at production scale, without rebuilding it every time their infrastructure or strategy evolves.

That is what Red Hat AI Inference delivers.

What Red Hat AI Inference brings to any Kubernetes cluster

Red Hat AI Inference delivers on 3 capabilities that matter most at scale: the flexibility to run any model on any hardware accelerator across any environment, the ability to manage token economics by maximizing throughput and reducing cost per token, and the capacity to scale predictably under unpredictable demand, including the compounding inference load that agentic workflows generate. On any Kubernetes environment, this is delivered as a technology preview through a validated deployment of vLLM, KServe, llm-d, a dedicated Istio instance (managed by the Sail Operator), cert-manager, LeaderWorkerSet (LWS), and the Gateway API, working as a cohesive stack.
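To make that concrete, here is a minimal sketch of how a client application might call an endpoint served by this stack. vLLM exposes an OpenAI-compatible API; the gateway hostname, API key handling, and model name below are placeholders for illustration, not values from any specific deployment.

```python
# Minimal sketch: calling an OpenAI-compatible endpoint served by vLLM
# behind a Gateway API route. Hostname, key, and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",  # hypothetical Gateway route
    api_key="placeholder",                        # substitute your cluster's auth
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",    # whichever model the cluster serves
    messages=[{"role": "user", "content": "Summarize why KV-cache reuse matters."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```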

At its core is llm-d, the distributed inference orchestration project Red Hat introduced in 2025 with CoreWeave, Google Cloud, IBM Research, and NVIDIA as founding contributors. Now a CNCF Sandbox project, llm-d solves the problem that emerges when inference workloads outgrow a single server: how to orchestrate distributed serving intelligently, maximize KV-cache reuse across GPU nodes, and route requests to where they can be handled most efficiently.
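The scheduling idea is easier to see in miniature. The sketch below is purely illustrative and is not llm-d's actual implementation: it scores each replica by how much of an incoming prompt's prefix it already holds in cache, balances that against current load, and routes to the highest-scoring replica.

```python
# Illustrative sketch (not llm-d code) of KV-cache-aware routing:
# prefer replicas that already cache a long prefix of the prompt,
# weighed against how busy each replica is.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int                                     # requests currently waiting
    cached_prefixes: set = field(default_factory=set)    # hashes of cached prompt blocks

def prefix_hashes(prompt: str, block: int = 64) -> list:
    """Hash fixed-size prompt prefixes, mimicking block-level prefix caching."""
    return [hash(prompt[: i + block]) for i in range(0, len(prompt), block)]

def score(replica: Replica, prompt: str, cache_weight: float = 2.0) -> float:
    hits = sum(1 for h in prefix_hashes(prompt) if h in replica.cached_prefixes)
    return cache_weight * hits - replica.queue_depth

def route(replicas: list, prompt: str) -> Replica:
    """Pick the replica with the best cache-hit vs. load trade-off."""
    return max(replicas, key=lambda r: score(r, prompt))
```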

The business impact of getting this right is measurable. In production environments running large-scale models, teams using llm-d's intelligent routing have observed a 3x improvement in output throughput and a 2x reduction in time to first token compared to standard round-robin load balancing. These results were documented by Red Hat and Tesla engineers running this stack in production, serving Llama 3.1 70B at scale.1 

Because Red Hat AI Inference is built entirely on standard Kubernetes APIs and open source components, the same deployment, the same configuration, and the same operational practices work across every validated environment. 
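In practice, that consistency means the same manifests can be applied to each validated cluster with nothing changed but the kubeconfig context. The sketch below illustrates the idea with the Kubernetes Python client; the context names and manifest path are hypothetical, and custom resources in a real blueprint would need their own apply logic.

```python
# Sketch of "same manifests everywhere": apply one deployment blueprint
# to each validated cluster by switching kubeconfig contexts.
from kubernetes import client, config, utils

MANIFEST = "inference-stack.yaml"          # hypothetical validated blueprint

for context in ("coreweave-cks-prod", "azure-aks-prod"):   # hypothetical contexts
    config.load_kube_config(context=context)
    k8s = client.ApiClient()
    utils.create_from_yaml(k8s, MANIFEST)  # identical resources, different cluster
```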

2 Kubernetes environments, 1 consistent stack

Red Hat AI Inference runs on CoreWeave Kubernetes Service and Azure Kubernetes Service—2 of the environments where enterprise teams are already running Kubernetes at scale.

On CoreWeave CKS, Red Hat AI Inference runs on bare-metal GPU infrastructure with NVIDIA InfiniBand networking delivering up to 3,200 Gbps per node and first-to-market access to NVIDIA accelerated computing technology. CoreWeave's Tensorizer technology, contributed upstream to vLLM, enables CKS to offer 5x faster model loading than standard approaches through a zero-copy architecture.2

On Azure AKS, Red Hat AI Inference runs within the enterprise-grade platform built for AI workloads at global scale. With 60+ available regions and enterprise networking and governance capabilities, organizations across industries can deploy the same open inference platform within the security boundary and operational framework they already maintain on Azure.

In both environments, Red Hat delivers a consistent deployment experience: the same open source components, Kubernetes-native operations, and configuration. When your stack is consistent, your team's knowledge and operational practices transfer cleanly across environments.

One stack. Any Kubernetes. Consistency everywhere.

As your AI strategy evolves, your inference foundation doesn't need to change. If you start on CoreWeave and need to bring workloads on-premises, the foundation is the same. If you run on AKS today and need the performance headroom of CoreWeave for a new frontier model workload, you can rely on the same consistent, high-performance foundation.

This is what open governance of your infrastructure looks like in practice. The llm-d project is governed in the open, hosted as a CNCF Sandbox project, and built on the same community model that made Linux and Kubernetes the universal enterprise infrastructure standards they are today. Red Hat AI Inference inherits that governance and that portability.

Get started with Red Hat AI Inference today

Visit the Red Hat AI Inference webpage to learn more, take advantage of the no-cost 60-day trial of Red Hat AI Inference, or read the Red Hat AI Inference documentation.

Read CoreWeave's announcement, CoreWeave and Red Hat Join Forces to Rethink Hybrid Inference, and CoreWeave's tutorial on Red Hat AI Inference on CKS.

Hear more from Red Hat and CoreWeave—live on YouTube, May 27 at 1:15 PM ET → Join the conversation

Talk to a Red Hat AI specialist: Contact us

  1. Production-Grade LLM Inference at Scale with KServe, llm-d, and vLLM, Tang, Shaw et al., April 2026.
  2. CoreWeave llm-d founding contributor announcement, May 2025.

Resource

The adaptable enterprise: Why AI readiness is disruption readiness

This e-book, written by Michael Ferris, Red Hat COO and CSO, helps IT leaders navigate the pace of change and technological disruption that AI brings today.

About the authors

Carlos Condado is a Senior Product Marketing Manager for Red Hat AI. He helps organizations navigate the path from AI experimentation to enterprise-scale deployment by guiding the adoption of MLOps practices and integration of AI models into existing hybrid cloud infrastructures. As part of the Red Hat AI team, he works across engineering, product, and go-to-market functions to help shape strategy, messaging, and customer enablement around Red Hat’s open, flexible, and consistent AI portfolio.

With a diverse background spanning data analytics, integration, cybersecurity, and AI, Carlos brings a cross-functional perspective to emerging technologies. He is passionate about technological innovations and helping enterprises unlock the value of their data and gain a competitive advantage through scalable, production-ready AI solutions.

Naina Singh leads AI Inference Product Strategy at Red Hat, where she works with enterprises running LLM inference in production. She focuses on the operational and economic decisions that determine whether inference runs profitably at scale. She holds two patents and an MBA from UNC Kenan-Flagler.
