Kubernetes for AI Workloads: From Infrastructure Challenges to Scalable Solutions

By Nick Schouten

Kubernetes is rapidly evolving to meet the complex demands of AI and data workloads. For professionals responsible for data platforms, AI infrastructure, and enterprise-grade developer productivity, recent innovations mark an important inflection point in how these systems are designed and operated. Many of these advancements were showcased earlier this year at KubeCon 2025 in London, where the community highlighted new patterns, tools, and strategies that continue to shape enterprise adoption.

Below, I outline the most critical trends and innovations that data and AI teams should be aware of to remain competitive and efficient.

Challenges for Kubernetes with Large Language Models

Initially, the rise of large language models (LLMs) presented significant challenges for Kubernetes. Beyond the fact that the most capable models were proprietary, their immense size made self-hosting within Kubernetes impractical, pushing organizations toward external services.

Furthermore, core LLM requirements like maintaining state and "stickiness" (consistent routing to specific instances) were fundamentally at odds with Kubernetes' original design, which prioritized stateless, easily scalable workloads. This mismatch briefly threatened to derail the momentum Kubernetes had gained as the universal platform for modern applications.

Leader Worker Set (LWS)

Part of the problem, the proprietariness of the top models (yes, I had to google that word), was solved by external parties, with DeepSeek's open-weight releases being a notable example. However, this doesn't solve the issue of model size.

To overcome this, Kubernetes introduced the Leader Worker Set (LWS). LWS enables a group of pods to act as a single unit of replication and collectively run inference workloads, effectively allowing large models such as DeepSeek to be distributed and executed across multiple pods. In practice this is combined with distributed inference frameworks such as vLLM, which relies on Ray (the Python distributed computing framework) for multi-node execution. You can read more here:

https://github.com/kubernetes-sigs/lws
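
To make this more concrete, here is a minimal sketch of what an LWS resource might look like. This shows the shape of the resource rather than a working multi-node recipe: the apiVersion, field names, image, and sizes are my assumptions based on the LWS docs, so check the repo above for the actual spec.

```yaml
# Illustrative LeaderWorkerSet: each replica is a group of pods
# (1 leader + 3 workers) that together serve one copy of a large model.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: large-model-serving
spec:
  replicas: 2                 # two independent leader+worker groups
  leaderWorkerTemplate:
    size: 4                   # pods per group (1 leader + 3 workers)
    leaderTemplate:
      spec:
        containers:
          - name: inference-leader
            image: vllm/vllm-openai:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "1"
    workerTemplate:
      spec:
        containers:
          - name: inference-worker
            image: vllm/vllm-openai:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "1"
```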

From a practical perspective, deploying and managing LLMs like DeepSeek directly within a private Kubernetes cluster may only be viable for organizations with significant resources.

For many enterprises, the decision to self-host an LLM to address privacy concerns often clashes with the existing reliance on cloud providers for data storage and other critical services. If sensitive data already resides in the cloud, the incremental privacy benefit of self-hosting only the LLM itself may be limited. A truly comprehensive privacy and security strategy would necessitate a full on-premise or fully self-hosted ecosystem, encompassing not just the LLM but also data storage, networking, and other infrastructure.

Furthermore, even when technically possible, the economic overhead of running and maintaining LLMs on private infrastructure can be substantial. The costs associated with specialized hardware, continuous operational management, and the expertise required to optimize these complex workloads often outweigh the benefits for all but the largest and most specialized organizations.

All this to say that I don’t think I will be using LWS any time soon.

Smaller Models (7B-28B)

While the challenges of self-hosting extremely large language models (LLMs) persist, a significant development has made Kubernetes more viable for LLM workloads: the dramatic improvement of smaller LLMs. Over the past year, models ranging from 7B to 28B parameters have seen substantial advancements in performance and capability.

Crucially, these smaller yet highly effective models can often be run efficiently on a single Kubernetes pod. This aligns perfectly with Kubernetes' strengths, allowing organizations to leverage its orchestration capabilities for deploying, scaling, and managing these models with relative ease. This again makes Kubernetes an increasingly attractive platform for a wide range of LLM applications within enterprises.

Existing tools and frameworks such as KServe, Dapr, Ray, and Kubeflow all provide tooling for fine-tuning and serving small LLMs.
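
As an illustration (not a production recipe), a single-pod deployment of a ~7B model behind an OpenAI-compatible API could look roughly like this; the model name, image, and resource requests are placeholders I picked, and any of the frameworks above would give you a more managed equivalent:

```yaml
# Illustrative single-pod deployment of a small (~7B) model with vLLM;
# model, image tag and resources are placeholders, not recommendations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: small-llm
  template:
    metadata:
      labels:
        app: small-llm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.3"]
          ports:
            - containerPort: 8000   # vLLM's OpenAI-compatible endpoint
          resources:
            limits:
              nvidia.com/gpu: "1"
```

Because this is just a regular Deployment, everything Kubernetes already gives you (autoscaling, rollouts, resource quotas) applies unchanged.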

Implication: Organizations currently relying on external or proprietary AI services should reevaluate their infrastructure strategy. Kubernetes now offers a scalable, flexible, and controllable platform to host AI workloads internally, increasing agility and reducing vendor lock-in.

Kubernetes Gateway API

API management within Kubernetes is evolving. The traditional NGINX Ingress controller is being supplanted by the Kubernetes-native Gateway API (https://blog.nginx.org/blog/kubernetes-networking-ingress-controller-to-gateway-api). This transition was already underway, but Kubernetes is now also becoming more model-hosting friendly through an inference extension that supports advanced routing setups such as model-aware routing and session stickiness (https://gateway-api-inference-extension.sigs.k8s.io/), all of which were possible with NGINX but cumbersome to implement.
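
As a rough sketch of how this looks in practice: an HTTPRoute can send inference traffic to an InferencePool, the CRD the inference extension introduces for model-aware endpoint selection. Treat the exact group, kind, and field names below as assumptions on my part and verify them against the project docs.

```yaml
# Illustrative route sending chat-completion traffic to a pool of
# model-serving pods managed by the Gateway API inference extension.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway          # assumed Gateway name
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool          # model-aware backend, not a plain Service
          name: small-llm-pool         # assumed pool of model pods
```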

Enterprises should anticipate and plan for this transition to better support modern AI and data service APIs.

🔗 https://github.com/kubernetes-sigs/gateway-api

New Paradigms in API Security for AI Agents

As AI-powered automation and “copilot” agents become more prevalent, traditional authentication models based solely on identity are proving inadequate. The industry is moving toward context-aware tokenization, where tokens encode not only user identity but also the intent, scope, and approved workflows associated with API calls.

This approach enhances security by ensuring that AI agents operate within strictly defined boundaries and maintains full traceability of their actions—an essential requirement for enterprise compliance.
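
To give a feel for the idea, the payload of such a token might carry claims along these lines. The claim names below are illustrative rather than copied from the OAuth transaction tokens draft, so treat them as assumptions and check the spec (linked below) for the real structure.

```yaml
# Illustrative (not spec-exact) payload of a context-aware token:
# identity plus the intent, scope, and workflow an AI agent may act within.
iss: https://txn-token-service.internal         # assumed internal issuer
aud: internal-services                          # assumed trust domain
sub: user:alice                                 # human on whose behalf the agent acts
exp: 1767225600
purpose: invoice-approval                       # the intent of this call chain
scope: ["invoices:read", "invoices:approve"]    # strictly bounded actions
context:
  agent: copilot-finance                        # assumed agent identity
  workflow: approve-invoice-under-10k           # approved workflow
  request_id: req-8421                          # traceability for compliance
```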

Explore more 👇

https://datatracker.ietf.org/doc/draft-ietf-oauth-transaction-tokens/

https://tokenetes.io/

https://www.youtube.com/watch?v=CvGbwn5ZrFg

Leveraging LLMs to Automate Kubernetes Controller Development

An exciting talk by some Google engineers at KubeCon 2025 showcased how they used LLMs to write more than 1,000 Kubernetes controllers.

I heard interesting quotes such as:

“Instead of OOP and abstraction, let’s just 10x the codebase”

“Failure rate of 50% is fine, just run it twice”

Now, if my non-technical friends who recently graduated to vibe-coders had told me this, I would have laughed. But this is Google, right? So it raises the question: will the future of DRY be Do Repeat Yourself? 🤷

I won’t repeat their points here; have a look at their talk, it is very thought-provoking 👇

https://cloud.google.com/kubernetes-engine/enterprise/config-controller/docs/overview

https://www.youtube.com/watch?v=_oIoaW5i-xE

Scalable Internal AI Assistants: Lessons from Spotify’s AiKA

Spotify’s internal AI knowledge assistant, AiKA, is something every engineer has wished for: a bot that takes over the hero/point-of-contact/… duties.

What is most interesting here is that they do not just provide a RAG assistant built on internal resources. They provide a framework where every internal team can build its own specialized version of AiKA on top of its own knowledge sources:

  • Scoped to specific teams to ensure relevance
  • Pulling contextually appropriate data to maintain accuracy
  • Easily customizable for different business domains

As always in RAG: the smaller the R, the better the A (I just made that up, but I like it!)

Dive deeper here 👇

https://www.youtube.com/watch?v=FEy2lhe6CM8

https://backstage.spotify.com/discover/blog/aika-data-plugins-coming-to-portal/

AI-Enhanced Observability with OpenTelemetry

Observability frameworks are evolving to incorporate AI-specific metrics such as token usage, inference latency, and retrieval accuracy in retrieval-augmented generation (RAG) pipelines. OpenTelemetry is becoming the de facto standard for capturing and correlating these signals.

This AI-native observability enables proactive troubleshooting and performance optimization in production AI systems.
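
For a sense of what this looks like on a trace, the OpenTelemetry GenAI semantic conventions define attributes for exactly these signals. The values below are made up and the attribute names are worth double-checking against the current conventions before relying on them:

```yaml
# Illustrative attributes an instrumented inference span might carry,
# loosely following the OpenTelemetry GenAI semantic conventions.
gen_ai.system: vllm                          # which serving stack handled the call
gen_ai.request.model: mistral-7b-instruct    # model requested by the client
gen_ai.usage.input_tokens: 412               # prompt tokens consumed
gen_ai.usage.output_tokens: 128              # completion tokens generated
gen_ai.response.finish_reasons: ["stop"]     # why generation ended
```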

AI-Driven CLI Tools Improve Platform Operations

Demonstrations of GPT-powered kubectl wrappers show how AI can assist platform engineers by translating natural-language commands into Kubernetes operations, providing explanations, and keeping audit logs. While still in the early stages, these tools promise to reduce incident response times and empower less experienced operators.

🔗 https://k8sgpt.ai/

Conclusion: Kubernetes will rise once more

While k8s was not part of the LLM hype train last year, the train has made some turns and k8s is now fully on board 🚂 Next year will be crucial, but especially in the small-LLM space, k8s has a big chance of becoming the de facto standard (again?).


If your organization is exploring how to integrate these capabilities or needs guidance, feel free to reach out, always happy to swap notes!

Nick Schouten | Data Engineer at dataroots, a Talan company