Introduction

There’s a version of this story where Kubernetes “came back.” That framing is wrong.

Kubernetes never left—it lost the narrative. Between 2020 and 2023, the conversation was dominated by YAML-hell complaints, cluster complexity horror stories, and a genuine question: is this worth it for most teams? The backlash was real. But it never showed up in the adoption curve. Production usage kept climbing even as the discourse soured.

Now, in 2026, the framing has shifted entirely. Kubernetes is the de facto substrate for running inference workloads at scale. The 2026 CNCF survey puts production adoption at 82%—not “emerging,” not “early majority.” It’s baseline infrastructure, like TCP/IP. You don’t get excited about TCP/IP. You also can’t run a distributed system without it.

The question in 2026 isn’t “should we use Kubernetes?” It’s two more precise questions: What does Kubernetes actually do in the AI stack? And when does the added complexity pay off versus a simpler Docker deployment?

Those are the questions this post answers.


Historical Context: How We Got Here

Kubernetes came out of Google’s internal Borg system in 2014. The core problem Borg solved was operational at scale: Google was running hundreds of thousands of tasks across massive clusters, and manual bin-packing was impossible. You needed a system that could look at a fleet of machines, figure out where to place a job given CPU, memory, and locality constraints, and handle failures automatically.

The open-source version—Kubernetes—inherited that design philosophy but applied it to a world of containers. The early years (2015–2018) were genuinely rough. The API surface was enormous, networking was tribal knowledge, and RBAC was arcane. “Just use Kubernetes” was advice that came with an implicit weeks-long onboarding tax.

What changed by 2020 was managed Kubernetes. EKS, GKE, and AKS absorbed most of the control-plane complexity. You still wrote YAML, but you stopped racking and stacking etcd clusters. The operational floor dropped significantly.

Then came the inference wave.

In 2024–2025, the AI industry crossed a threshold: the hard, expensive problem shifted from training large models to running them at scale. Training a frontier model is something maybe a dozen organizations do. Running inference against that model—serving millions of requests against expensive GPU hardware—is a problem every organization building with AI faces.

This is where Kubernetes’ architecture became decisive.


Core Mechanics: What Kubernetes Actually Does for AI

GPU Scheduling and the Bin-Packing Problem

An NVIDIA H100 costs roughly $30,000 and runs ~$2.60/hour in cloud. If you’re running a fleet of AI agents or inference endpoints, leaving those GPUs idle is not a financial abstraction—it’s a real number on a real invoice.

The challenge is that AI workloads are bursty and heterogeneous. A “Research Agent” might need 40GB of VRAM for complex chain-of-thought reasoning. A “Summarizer Agent” might only need 12GB. Without orchestration, you either over-provision (wasting money) or under-provision (hitting OOM errors).

Kubernetes solves this with the resources block in the pod spec and, more importantly, with support for Multi-Instance GPU (MIG) partitioning:

Single H100 (80GB) → partitioned via MIG:
  ┌──────────────────────────────────────────┐
  │  Agent Pod A    │  Agent Pod B           │
  │  (3g.40gb MIG)  │  (1g.20gb MIG x2)      │
  │  VRAM: 40GB     │  VRAM: 20GB each       │
  └──────────────────────────────────────────┘

Without MIG (single Docker container):
  ┌──────────────────────────────────────────┐
  │  Agent Container A                       │
  │  (entire GPU)   VRAM: 80GB allocated     │
  │                 VRAM: ~40GB actually     │
  │                 used -- 50% waste        │
  └──────────────────────────────────────────┘

With MIG and Kubernetes’ nvidia.com/mig-3g.40gb resource requests, the scheduler can bin-pack multiple agents onto a single GPU. A 2024 NVIDIA benchmarking study showed MIG can improve effective GPU utilization from ~40% to ~85% for mixed inference workloads. At $2.60/hour per GPU, that delta compounds fast across a fleet.
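
As a sketch of what that looks like in practice, here is a minimal pod spec requesting one MIG slice. It assumes the NVIDIA device plugin is installed with the mixed MIG strategy enabled; the pod name and image are illustrative, not from a real deployment.

apiVersion: v1
kind: Pod
metadata:
  name: research-agent                       # illustrative name
spec:
  containers:
    - name: agent
      image: registry.example.com/agent:dev  # hypothetical image
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1          # one 40GB slice, not the whole H100

The scheduler counts nvidia.com/mig-3g.40gb like any other integer resource, which is exactly what makes the bin-packing work.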

The Self-Healing Property for Long-Running Agents

Traditional web services are stateless and short-lived. A request comes in, a response goes out, the container resets. The failure mode is well-understood: a crash loses one request, the pod restarts, traffic routes elsewhere.

AI agents in 2026 don’t fit this model. Agentic pipelines—where a model is given a task, tools, and autonomy to complete a multi-step workflow over hours—are stateful and long-running. An OOM event halfway through a 4-hour research task is not a minor inconvenience; it’s a lost work unit.

Kubernetes’ health probe system handles this differently than a simple docker run --restart=always:

K8s Health Probe Sequence (Long-Running Agent):

Agent Pod                 kubelet                   Scheduler
    |                        |                           |
    | [Running: Step 3/12]   |                           |
    |                        |                           |
    | OOM kill (step 5)      |                           |
    |<--- liveness probe-----|                           |
    | [fails]                |                           |
    |                        |--- mark Unhealthy ------->|
    |                        |                           |
    |<--- terminate ---------|                           |
    |                        |                           |
    |                        |--- schedule replacement ->|
    |                        |                           |
    | [New pod on healthy    |                           |
    |  node, step 1/12]      |                           |

The critical detail is “healthy node.” A liveness failure on its own just restarts the container in place. But if the original node is under memory pressure, the kubelet evicts the pod, the node is tainted, and the Deployment’s controller creates a replacement that the scheduler places on a different node. docker restart restarts on the same machine—if that machine is the problem, you’re in a loop.
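
In manifest form, the relevant pieces are a memory limit and a liveness probe on a Deployment-managed pod. This is a hedged sketch: the /healthz endpoint, port, and image are assumptions about how your agent is built.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: research-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: research-agent
  template:
    metadata:
      labels:
        app: research-agent
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent:dev  # hypothetical image
          resources:
            requests:
              memory: "8Gi"
            limits:
              memory: "12Gi"                     # exceeding this triggers the OOM kill
          livenessProbe:
            httpGet:
              path: /healthz                     # assumes the agent exposes a health endpoint
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 3

The memory limit makes the OOM behavior deterministic; the probe is what surfaces hangs that an exit code never would.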

KEDA and Scale-to-Zero

KEDA (Kubernetes Event-Driven Autoscaling) is the mechanism that makes Kubernetes economically viable for bursty AI workloads. The core idea: instead of keeping agent pods running 24/7, you scale them to zero when no work is queued and spin them up when a trigger fires.

A concrete example: suppose you have a “Document Processor Agent” that runs when a PDF lands in an S3 bucket. The bucket’s event notifications push a message onto an SQS queue, and KEDA watches the queue depth:

Queue Depth → KEDA Decision → Pod Count
    0        →   scale to 0  →     0 pods ($0 GPU cost)
    1-5      →   scale up    →     2 pods
    6-20     →   scale up    →     5 pods
    21+      →   scale up    →    10 pods (max)

In a flat Docker deployment, you keep one container running continuously. If your agent processes 500 PDFs per month—roughly 17/day—that idle container is burning GPU-hours for the other 23 hours daily. At GPU-class pricing, that’s real money for what is fundamentally waiting.

KEDA supports scale triggers from Kafka, SQS, RabbitMQ, Prometheus metrics, and HTTP request rate. The startup latency for a typical inference container (pulling model weights from cache) is 15–30 seconds, which is acceptable for async queue-backed work.
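
For the document-processor example above, the wiring is roughly a ScaledObject pointing at the Deployment with an SQS trigger. This is a sketch: the queue URL and thresholds are placeholders, and a real setup also needs a TriggerAuthentication for AWS credentials.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: document-processor
spec:
  scaleTargetRef:
    name: document-processor        # the Deployment KEDA scales
  minReplicaCount: 0                # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/pdf-jobs  # placeholder
        queueLength: "5"            # target messages per replica
        awsRegion: us-east-1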


The “Invisible Kubernetes” Shift

The reason the complaint volume dropped isn’t that Kubernetes got simpler. It’s that most developers stopped touching it directly.

Platform Engineering is the discipline that emerged to put a usable interface in front of Kubernetes. The pattern: a small platform team (3–6 engineers) runs the cluster, configures security policies, and builds Internal Developer Portals (IDPs) on top of tools like Backstage or Cortex. Application developers interact with a service catalog—they fill out a form with their container image, resource requirements, and environment variables. The IDP generates the Kubernetes manifests and applies them.

The developer never writes a Deployment YAML. They never touch kubectl. The cognitive overhead of Kubernetes is absorbed by the platform layer.

Amazon’s EKS Auto Mode, released at the end of 2024, takes this further at the managed service level. Auto Mode manages node provisioning, patching, and scaling automatically. You submit a pod spec; EKS figures out which EC2 instance type to provision, when to scale in, and how to apply OS patches without disrupting running workloads. It’s not quite “serverless Kubernetes,” but it removes enough of the operational surface that small teams can run production clusters without dedicated SRE headcount.

The practical implication: the Kubernetes skill gap is bifurcating. You either need deep expertise (platform/infra engineers who understand control-plane internals, CNI plugins, and admission controllers) or surface-level familiarity (application engineers who understand how to write a Deployment spec and read a pod status). The “middle” expertise—knowing kubectl well but not deeply—is becoming less relevant.


Technical Re-Engineering: eBPF and Wasm

Two technologies are changing Kubernetes’ internal architecture in ways that matter for resource efficiency.

eBPF and the Sidecar Tax

The traditional service mesh pattern (Istio, Linkerd) works by injecting a sidecar proxy container into every pod. The proxy handles mTLS, traffic policy, observability. The problem: every pod now runs two containers, and the sidecar is not free. Estimates from the Istio project and independent benchmarks put the sidecar overhead at 10–15% of CPU and memory per pod, plus added latency on every network call.

eBPF (Extended Berkeley Packet Filter) eliminates this by moving network policy and observability into the Linux kernel itself. Cilium builds its mesh features directly on eBPF. Istio’s Ambient Mode (production-ready since late 2024) reaches the same end by a different route, replacing per-pod sidecars with a shared per-node proxy. Either way, the sidecar disappears from the pod.

The performance delta is meaningful:

Sidecar Mesh (Istio classic):
  Pod = App Container + Envoy Proxy
  Memory overhead: ~50-100MB per pod
  Latency added: ~2-5ms per request (proxy hops)
  CPU overhead: ~100-200m per pod

Sidecar-less Mesh (Cilium eBPF / Istio Ambient):
  Pod = App Container only
  Memory overhead: ~5-10MB (shared per node)
  Latency added: ~0.1-0.3ms (in-kernel / node-local path)
  CPU overhead: ~10-30m (shared per node)

For a cluster running 200 pods, eliminating sidecars recovers on the order of 10–20GB of RAM and significant compute headroom. That headroom can run more inference workloads.
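
To make the “policy in the kernel” point concrete, here is a sketch of a CiliumNetworkPolicy restricting which pods can reach an inference service; the labels and port are illustrative. Cilium enforces this in its eBPF datapath, with no proxy container in either pod.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: inference-ingress
spec:
  endpointSelector:
    matchLabels:
      app: inference-server          # the pods being protected (illustrative label)
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: agent               # only agent pods may connect
      toPorts:
        - ports:
            - port: "8000"
              protocol: TCP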

WebAssembly and the Container Alternative

WebAssembly (Wasm) modules running in Kubernetes via WASI (WebAssembly System Interface) represent a more speculative but real shift. A Wasm module cold-starts in milliseconds versus the 2–5 seconds for a typical container. Memory footprint can be 10–100x smaller depending on the runtime.

In 2026, Wasm is in production for specific workloads—primarily lightweight pre/post-processing pipelines, not heavy inference. The toolchain (Wasmtime, SpinKube) has matured enough that teams are using it at the edges of their AI pipelines. The pattern is a hybrid: a Wasm module handles request validation and preprocessing, hands off to a container running the actual model, and a Wasm module post-processes the response. The GPU-bound inference step stays in a container; the surrounding logic doesn’t have to.
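
For the edge-of-pipeline piece, SpinKube deploys Wasm workloads through a SpinApp custom resource. This is a sketch based on the SpinKube quickstart; the name and image are placeholders, and the exact fields may differ across operator versions.

apiVersion: core.spinoperator.dev/v1alpha1
kind: SpinApp
metadata:
  name: pdf-preprocessor             # placeholder name
spec:
  image: ghcr.io/example/pdf-preprocessor:latest  # hypothetical Spin (Wasm) app image
  replicas: 2
  executor: containerd-shim-spin     # runs as Wasm via the containerd shim, not as a container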


When Docker Wins

All of the above is real. It’s also operational complexity that many teams don’t need.

Three scenarios where staying on Docker is the correct engineering decision:

Prototyping and early architecture. If you’re still iterating on your agent’s core reasoning loop, RAG pipeline, or tool-calling strategy, Docker Compose gives you a full local environment in minutes:

services:
  agent:
    image: my-agent:dev              # your agent image, built locally
    depends_on: [redis, vectordb]
  redis:
    image: redis:7
  vectordb:
    image: milvusdb/milvus:v2.4.0    # official Milvus images live under milvusdb/

You can iterate on agent behavior, swap vector databases, and change tool implementations without a cluster in the loop. Kubernetes adds nothing here except friction.

Edge and constrained deployments. The Kubernetes control plane (API server, etcd, scheduler, controller-manager) consumes 2–4GB of RAM at minimum even for a single-node cluster. On a factory floor device or edge inference server with 8GB total RAM, that overhead is unacceptable. Docker or containerd with a simple systemd service unit is the right tool.

Small teams with predictable workloads. If you have two engineers, three internal users, and one agent with steady request volume, Kubernetes RBAC configuration, network policies, and upgrade management are liabilities without corresponding benefits. The operational surface is a maintenance burden your team doesn’t have capacity to manage well. Managing a Kubernetes cluster poorly is worse than not using Kubernetes at all.

The 2026 heuristic that holds up empirically: under 5 engineers and under 3 agents, stay on Docker. It’s not a hard line, but it’s a reasonable prior.


Practical Comparison

Feature              Docker / Compose         Kubernetes
─────────────────────────────────────────────────────────────────
Setup time           Minutes                  Days to weeks
GPU efficiency       Poor (full-GPU per ctn)  High (MIG, slicing)
Self-healing         Manual / restart=always  Automated, node-aware
Scale to zero        No                       Yes (KEDA)
Agent networking     Docker bridge / hosts    DNS-based service disc.
Control plane cost   None                     2-4GB RAM min
Operational burden   Low                      Medium to High
Best for             Dev, edge, MVPs          Production, multi-agent
─────────────────────────────────────────────────────────────────

The GPU efficiency row is the decisive one for AI workloads. At scale, poor GPU utilization translates directly to cost. If you’re managing a fleet of inference workloads and aren’t doing MIG or time-slicing, you’re likely paying 30–50% more than you need to.


Real-World Adoption Data

The adoption numbers are not soft marketing metrics. The 2026 CNCF annual survey—2,800+ respondents across engineering organizations—shows:

  • 82% production Kubernetes adoption (up from 71% in 2023)
  • 66% of organizations running generative AI inference use Kubernetes to manage those workloads
  • 54% are using or evaluating Platform Engineering / IDPs to abstract Kubernetes from application developers
  • eBPF-based networking (Cilium or equivalent) is now the default in 44% of new cluster deployments, up from 12% in 2022

The 66% inference number is the one worth sitting with. It means that of all the organizations actually running inference pipelines in production, nearly two-thirds chose Kubernetes to manage it. That’s a strong signal that the GPU orchestration and reliability properties are driving real decisions, not just architectural preferences.


What This Means for Engineers in 2026

If you’re building AI-backed products, here’s the practical framing:

Understand the platform, even if you don’t operate it. Even if your team uses an IDP that generates Kubernetes manifests, understanding the underlying primitives—Deployments, Services, resource limits, liveness probes—lets you debug production issues that the abstraction layer won’t surface clearly. You need to read a pod spec, not write it from scratch.

GPU scheduling is where the money is. If you’re paying for GPU infrastructure, instrument your utilization. If it’s consistently below 60%, investigate MIG partitioning or time-slicing. The engineering time to set this up is often recovered in the first month’s bill.

KEDA is underused. Scale-to-zero for async, queue-backed agent workloads is well-understood technology in 2026 and can cut idle GPU costs significantly. If your agents run on triggers rather than continuous traffic, this is worth evaluating.

Don’t run Kubernetes to run Kubernetes. The worst clusters I’ve seen are the ones that exist because someone decided to “do things the right way” before the workload justified it. If you’re under the heuristic threshold—small team, few agents, predictable load—Docker Compose is not technical debt. It’s appropriate engineering.

The plumbing analogy is accurate. Nobody wins points for having elaborate water infrastructure in a two-room office. But when you’re building a skyscraper, that infrastructure isn’t optional. Know which building you’re in.