Generative AI on Kubernetes

There's a specific kind of frustration that comes from reading a book that's half about what you actually needed.

Most resources on running AI in production assume you're either a data scientist who picked up Kubernetes last week or a platform engineer who treats models as black boxes and hopes for the best. Very few sit in the uncomfortable middle, the place where most production engineering problems actually live.

Generative AI on Kubernetes, published by O'Reilly, sits in that middle. And it does it well enough that I want to write down why.

What the Book Is Actually About

The book is structured around four phases of operationalizing LLMs on Kubernetes: inference, production readiness, tuning, and AI-driven applications. That progression is intentional and reflects how most teams actually adopt AI — starting with serving a pre-trained model, then discovering that keeping it running reliably at scale is an entirely different problem.

It doesn't try to teach you machine learning. It explicitly treats LLMs as operational dark-gray boxes: you manage their infrastructure, resources, and lifecycle without diving into neural network mathematics. That framing is honest and useful, and the book still gives you just enough of the internals to make informed operational decisions.

The primer on LLM fundamentals in the introduction (tokenization, embeddings, the prefill and decode phases, compute-bound versus memory-bound workloads) is one of the better short explanations I've read. It's not dumbed down. It's scoped correctly. You close that section understanding why time to first token (TTFT) and time per output token (TPOT) are the metrics that matter, not just that they matter.

The Model Data Chapter Is Underrated

Most practitioners skip straight to serving configuration. The model data chapter deserves more attention than it gets.

The progression from emptyDir volumes and init containers, through PersistentVolumes with ReadOnlyMany access mode, all the way to OCI image volume mounts and modelcars is the first time I've seen this problem laid out with the actual trade-offs made explicit.
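To make the two ends of that progression concrete, here is a rough sketch of the Pod specs involved, written as Python dictionaries that serialize to manifests. This is my own illustration, not a listing from the book: the image names, URL, and paths are hypothetical placeholders, and the OCI image volume variant assumes a cluster and container runtime where the ImageVolume feature is available.

```python
import json

# Hypothetical names throughout: images and mount paths are placeholders.
SERVER_IMAGE = "registry.example.com/serving/llm-server:latest"
MODEL_MOUNT = {"name": "model", "mountPath": "/models"}


def pod_with_init_container(model_url: str) -> dict:
    """emptyDir + init container: every Pod start downloads the weights again."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "llm-emptydir"},
        "spec": {
            "volumes": [{"name": "model", "emptyDir": {}}],
            "initContainers": [{
                "name": "fetch-model",
                "image": "registry.example.com/tools/downloader:latest",
                "command": ["wget", "-O", "/models/model.safetensors", model_url],
                "volumeMounts": [MODEL_MOUNT],
            }],
            "containers": [{
                "name": "server",
                "image": SERVER_IMAGE,
                "volumeMounts": [{**MODEL_MOUNT, "readOnly": True}],
            }],
        },
    }


def pod_with_image_volume(model_image: str) -> dict:
    """OCI image volume: the model is pulled like an image and mounted read-only."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "llm-image-volume"},
        "spec": {
            "volumes": [{
                "name": "model",
                "image": {"reference": model_image, "pullPolicy": "IfNotPresent"},
            }],
            "containers": [{
                "name": "server",
                "image": SERVER_IMAGE,
                "volumeMounts": [{**MODEL_MOUNT, "readOnly": True}],
            }],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(pod_with_image_volume("registry.example.com/models/demo:1.0"), indent=2))
```

The first variant pays the download cost on every Pod start; the second lets the node cache the model the same way it caches container images, which is where the copy-per-node cost comes from.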

The core tension is this: PersistentVolumes backed by NFS or distributed filesystems give you storage efficiency when running many replicas of the same model — one copy, many readers. But they introduce network latency on every model weight read, and at high replica counts, the storage backend becomes the bottleneck. Local approaches like OCI volumes eliminate that latency but require a copy per node.

The book doesn't pretend there's a universally correct answer. It gives you the framework to make the right call for your workload — replica count, inference throughput requirements, network bandwidth, and whether you're running GPU-based or CPU-based inference. That's more useful than a recommendation.
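A back-of-the-envelope model helps show why replica count dominates that decision. The sketch below is mine, not the book's, and it rests on loud simplifications: all replicas load at the same time, the shared backend splits its throughput evenly between readers, and per-read latency is ignored. Every number in it is hypothetical.

```python
def shared_volume_load_seconds(model_gb: float, replicas: int,
                               backend_gbps: float, nic_gbps: float) -> float:
    """Load time when `replicas` Pods read one shared copy concurrently.

    Assumes the backend's aggregate throughput is divided evenly and that
    each Pod is also capped by its node's network interface.
    """
    per_replica_gbps = min(nic_gbps, backend_gbps / replicas)
    return model_gb * 8 / per_replica_gbps  # GB -> gigabits


def local_copy_load_seconds(model_gb: float, disk_gbps: float) -> float:
    """Load time from a node-local copy (for example, an OCI image volume)."""
    return model_gb * 8 / disk_gbps


def stored_gb(model_gb: float, copies: int) -> float:
    """Total storage consumed by `copies` copies of the model."""
    return model_gb * copies


if __name__ == "__main__":
    MODEL_GB, BACKEND_GBPS, NIC_GBPS, DISK_GBPS = 16, 40, 25, 20  # hypothetical
    for n in (1, 4, 16):
        shared = shared_volume_load_seconds(MODEL_GB, n, BACKEND_GBPS, NIC_GBPS)
        local = local_copy_load_seconds(MODEL_GB, DISK_GBPS)
        print(f"{n:>2} replicas: shared {shared:6.2f}s ({stored_gb(MODEL_GB, 1)} GB stored), "
              f"local {local:5.2f}s ({stored_gb(MODEL_GB, n)} GB stored)")
```

With these made-up numbers, a single replica loads from the shared volume in 5.12 seconds, but 16 concurrent replicas take 51.2 seconds each, while the local copy stays at 6.4 seconds and costs 256 GB of storage instead of 16 GB.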

The section on the Kubeflow Model Registry and OCI artifacts for storing model weights is particularly forward-looking. Treating a model file the same way you treat a container image — versioned, immutable, distributable via a registry — is the direction the ecosystem is heading, and the book explains the mechanics clearly.

Observability Is Where Most Teams Are Underinvested

The observability chapter reframes something that most platform engineers miss until they've already been burned.

Monitoring an LLM workload is not the same as monitoring a microservice. CPU and memory utilization tell an incomplete story. The primary compute resource is the GPU, and the two inference phases — compute-bound prefill and memory-bound decode — have fundamentally different performance characteristics that traditional request-rate metrics won't surface.

The key metrics the book covers:

Time To First Token (TTFT): the user-perceived wait before the first token arrives; maps directly to the prefill phase; exposed in vLLM as vllm:time_to_first_token_seconds
Time Per Output Token (TPOT): the interval between successive output tokens, which the user experiences as generation speed; a TPOT of roughly 200–250 ms or less (4–5 tokens per second or more) keeps pace with human reading speed; exposed as vllm:time_per_output_token_seconds
KV cache utilization: because the decode phase is memory-bound, KV cache pressure is often where throughput collapses before any other metric signals a problem
Request queue depth: vllm:num_requests_waiting tells you when the batch is full and requests are backing up, before the system appears saturated by any other measure
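The first two metrics are easy to reason about once you see how they fall out of raw timestamps. Here is a minimal sketch of that calculation, assuming you have the send time of a request and the arrival time of each streamed token. The RequestTrace structure is hypothetical and is not how vLLM computes its histograms internally; it only illustrates the definitions.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: wait between sending the request and the first token."""
    if not trace.token_times:
        raise ValueError("no tokens received")
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between successive tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def keeps_pace_with_reading(trace: RequestTrace, min_tokens_per_second: float = 4.0) -> bool:
    """True if generation is at least as fast as the reading-speed floor."""
    return 1.0 / tpot(trace) >= min_tokens_per_second


if __name__ == "__main__":
    trace = RequestTrace(sent_at=0.0, token_times=[0.40, 0.50, 0.60, 0.70, 0.80])
    print(f"TTFT: {ttft(trace):.2f}s")
    print(f"TPOT: {tpot(trace):.2f}s ({1.0 / tpot(trace):.0f} tokens/s)")
    print(f"Keeps pace with reading: {keeps_pace_with_reading(trace)}")
```

In this made-up trace the prefill phase accounts for the 0.40 second wait, and decode then produces a token every 0.10 seconds, or 10 tokens per second. The two numbers move independently, which is why a single latency figure hides what is actually happening.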

The book also covers GPU-specific metrics via NVIDIA DCGM, AMD ROCm SMI, and Intel XPU Manager — each exposing hardware metrics through Prometheus exporters. The point worth internalizing: there is no common naming convention across vendors, which means your observability setup needs to account for the specific hardware you're running.

The section on quality metrics and responsible AI — hallucination detection, bias monitoring via TrustyAI, LLM-as-a-judge patterns for asynchronous quality evaluation — is more advanced than I expected from a Kubernetes operations book, and it's the right kind of advanced. These aren't afterthoughts. They're treated as first-class production concerns.

Disaggregated Serving — The Most Forward-Looking Section

If you only read one advanced section, read the disaggregated serving chapter.

The core idea: prefill is compute-bound, decode is memory-bound. They have different hardware requirements, different scaling characteristics, and different failure modes. Running them on the same pool of instances is a compromise that makes neither phase optimal. Disaggregated serving splits them into separate hardware pools so each can be tuned and scaled independently.

The full stack the book describes — Gateway API Inference Extension, distributed KV cache via LMCache, NIXL for point-to-point memory transfer, llm-d for orchestration, KServe LLMInferenceService as the management layer — is genuinely cutting-edge. It's the direction production inference at scale is heading, and the book explains the infrastructure implications clearly.

The network bandwidth requirement alone is worth understanding: traditional pod networking at 10–20 Gbps is insufficient for sharing KV cache blocks between prefill and decode instances at production scale. The requirement is ~500–600 Gbps, roughly 25 to 60 times higher, which means RDMA, RoCE, NVLink, or InfiniBand configurations that were previously only seen in HPC environments.
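Some rough arithmetic shows where numbers like that come from. The sketch below is my own estimate, not a calculation from the book. It assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and ignores protocol overhead, so treat the results as orders of magnitude only.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one token: a key and a value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time over a link, ignoring latency and protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    # Hypothetical model shape and prompt length, chosen for round numbers.
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    prompt_tokens = 8192
    total = per_token * prompt_tokens

    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"KV cache for {prompt_tokens} prompt tokens: {total / 2**30:.0f} GiB")
    for gbps in (10, 500):
        print(f"{gbps:>3} Gbps: {transfer_seconds(total, gbps) * 1000:.0f} ms")
```

Under those assumptions, one 8192-token prompt produces 1 GiB of KV cache. Moving it from a prefill instance to a decode instance takes about 0.86 seconds at 10 Gbps and about 17 milliseconds at 500 Gbps. The first figure lands directly on top of TTFT for every request, and that is before many concurrent requests start sharing the same link.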

This isn't configuration you'll deploy next week. But understanding the architecture now means the decisions you're making today about GPU node topology, networking, and storage don't create unnecessary migration costs later.

Where the Book Has Gaps

The honest gaps, for teams running on AWS:

Karpenter — the book covers cluster autoscaler and GPU node management, but Karpenter's NodePool and EC2NodeClass patterns for GPU workloads, and its interaction with LLM-specific autoscaling signals like TTFT and KV cache utilization, are not covered.

AWS-specific hardware — Trainium and Inferentia2 are absent. Teams running fine-tuning on trn1 instances or inference on inf2 will need to look elsewhere for the operational patterns specific to Neuron-based workloads.

EKS-specific tooling — ACK, KRO, Pod Identity, and the patterns for managing AWS resources alongside Kubernetes workloads are outside scope. That's fair — the book is genuinely cloud-agnostic — but it means AWS practitioners need a second reference.

These gaps don't undermine the book. They define its scope. For the foundational mental model of GenAI infrastructure on Kubernetes — model serving, production operations, tuning, and AI-driven applications — it covers the ground it sets out to cover, and covers it well.

Who Should Read This

Platform engineers building out GPU infrastructure and trying to understand what's different about LLM workloads before something breaks in production
MLOps engineers operationalizing model serving and discovering that Kubernetes adds a layer of complexity that most ML tutorials skip entirely
Cloud architects designing the infrastructure layer for GenAI products who need a structured mental model before making hardware and networking commitments

It's not a quick read. It rewards attention. And it's the kind of book you close thinking about the decisions you're making differently — which is the only standard that matters.
