Inference Engineering — Book Review

Some books explain AI infrastructure with clean diagrams and tidy abstractions. This one pulls you into the engine room and shows you what actually happens between a prompt and a response — the memory bottlenecks, the cache misses, the tradeoffs nobody documents — written by someone who spent four years keeping real inference pipelines alive at scale. The tone stays honest and practical throughout, and it never pretends the hard parts are simple.

What the book actually is

A field guide to the inference stack — from CUDA kernels to Kubernetes autoscaling — that treats model serving as a living system you operate, not an architecture diagram you draw once and hand off.

A plain account of tradeoffs: when to quantize, cache, speculate, parallelize, or disaggregate — and what each decision costs in latency, throughput, quality, and engineering attention.

A reminder that the bottleneck is always specific: prefill and decode fail differently, image and LLM inference fail differently, and fixing the wrong one wastes everything.

How it explains the hard parts in simple words

Prefill and decode are different problems. Prefill is compute-bound — it processes your entire input in parallel and determines time to first token. Decode is memory-bound — it generates one token at a time and determines tokens per second. Optimizing one does nothing for the other.
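
A rough back-of-the-envelope sketch of why the two phases hit different limits. It assumes a dense model costing about 2 FLOPs per parameter per token, with every weight read from memory once per forward pass, and it ignores attention and KV cache traffic. The hardware figures and the function names are hypothetical placeholders, not numbers from the book.

```python
def phase_times(params, tokens_in_pass, bytes_per_param, flops_per_s, bytes_per_s):
    """Return (compute_seconds, memory_seconds) for one forward pass.

    Assumption: ~2 FLOPs per parameter per token, and every weight is read
    from memory once per pass no matter how many tokens the pass contains.
    """
    compute_s = 2 * params * tokens_in_pass / flops_per_s
    memory_s = params * bytes_per_param / bytes_per_s
    return compute_s, memory_s


def bound(compute_s, memory_s):
    """Name the resource that takes longer, i.e. the bottleneck."""
    return "compute-bound" if compute_s > memory_s else "memory-bound"


# Hypothetical numbers: 8B parameters at 2 bytes each, 400 TFLOP/s, 2 TB/s.
PARAMS, BYTES, FLOPS, BANDWIDTH = 8e9, 2, 400e12, 2e12

prefill = phase_times(PARAMS, 2048, BYTES, FLOPS, BANDWIDTH)  # whole prompt in one pass
decode = phase_times(PARAMS, 1, BYTES, FLOPS, BANDWIDTH)      # one new token per pass

print("prefill:", bound(*prefill))
print("decode:", bound(*decode))
```

Under these assumptions the weight-read time is identical in both phases; only the amount of arithmetic per pass changes, which is what flips the bottleneck.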

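The KV cache is your most valuable asset. It stores the attention keys and values for tokens already processed, so each decode step does work proportional to the sequence length instead of recomputing attention over the whole sequence. Prefix caching extends that reuse across requests, but only if your novel tokens arrive as late in the sequence as possible: a cache hit ends at the first token that differs, so anything novel near the start forfeits everything after it.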
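
A minimal sketch of the ordering point, assuming a cache that can reuse only the leading run of tokens two requests share. Real engines typically match in fixed-size blocks of tokens rather than token by token; `shared_prefix_len` and the example prompts are illustrative, not any library's API.

```python
def shared_prefix_len(cached, request):
    """Count leading tokens that match; reuse stops at the first mismatch."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


system = ["You", "are", "a", "helpful", "assistant", "."]
cached = system + ["Question", ":", "What", "is", "prefill", "?"]

# Novel tokens placed last: the shared system prompt is reused.
late = system + ["Question", ":", "What", "is", "decode", "?"]
# Novel tokens placed first (for example a per-request timestamp): nothing is reused.
early = ["2026-01-01"] + system + ["Question", ":", "What", "is", "decode", "?"]

print(shared_prefix_len(cached, late))   # 10
print(shared_prefix_len(cached, early))  # 0
```

The two requests carry almost the same tokens; only their order decides whether prefill work is reused or repeated in full.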

Quantization is a quality contract. Lowering precision from FP16 to INT8 or INT4 frees compute and memory, but you must measure quality impact before shipping it — not after a user complaint.
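
To make "measure first" concrete, here is a small sketch of a symmetric INT8 round trip on a handful of made-up weights, with the worst-case error it introduces. This only shows the mechanics; it is not a substitute for running task-level evals, and the function names and weights are assumptions for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


def max_abs_error(original, restored):
    """Largest per-element difference after the round trip."""
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [0.02, -0.5, 0.31, 1.27, -0.004]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
err = max_abs_error(weights, restored)

print("quantized:", q)
print("max error:", err, "bound:", scale / 2)
```

Note how the smallest weight rounds to zero: the largest value sets the scale, so outliers decide how much resolution everything else gets. That is the kind of effect a baseline eval is meant to catch before shipping.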

Speculative decoding trades throughput for latency. EAGLE drafts tokens using hidden states from the target model. N-gram speculation matches token sequences already present in the input to predict outputs. Both reduce the number of target-model forward passes, but both require smaller batch sizes, which means higher cost per request.
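
The n-gram idea can be sketched in a few lines: look for an earlier copy of the last few tokens and propose whatever followed it as the draft. This is a simplified illustration, not the implementation in any particular runtime; `ngram_draft`, `accept`, and the stand-in target output are hypothetical.

```python
def ngram_draft(tokens, n=2, k=3):
    """Propose up to k draft tokens by finding an earlier copy of the last n tokens."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Search right to left so the most recent earlier match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []


def accept(draft, target_tokens):
    """Keep draft tokens up to the first disagreement with the target model."""
    kept = []
    for d, t in zip(draft, target_tokens):
        if d != t:
            break
        kept.append(d)
    return kept


tokens = ["the", "quick", "brown", "fox", "jumps", ".", "the", "quick"]
draft = ngram_draft(tokens)
# Stand-in for what the target model would emit when checking the draft in one pass.
target = ["brown", "fox", "sleeps"]

print(draft)                  # ['brown', 'fox', 'jumps']
print(accept(draft, target))  # ['brown', 'fox']
```

Every accepted draft token is a decode step saved; every rejected one is verification work spent for nothing, which is part of why the gain depends on workload and batch size.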

Disaggregation is powerful and expensive. Separating prefill and decode onto independently scaling workers unlocks enormous efficiency gains — but only makes sense past 100M–1B tokens per day. Before that, the complexity costs more than it saves.
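
For a sense of scale, the daily figures above convert to average per-second rates as follows. This is plain arithmetic; real traffic peaks well above its daily average, so treat the results as a floor.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_second(tokens_per_day):
    """Average token rate implied by a daily total."""
    return tokens_per_day / SECONDS_PER_DAY


for daily in (100e6, 1e9):
    rate = tokens_per_second(daily)
    print(f"{daily:,.0f} tokens/day is about {rate:,.0f} tokens/s on average")
```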

Model parallelism is for when one GPU isn't enough. Tensor parallelism splits each layer's weights across GPUs for lower latency. Expert parallelism places the experts of an MoE model on different GPUs for higher throughput. Multi-node inference adds network latency, so it is only worth it when the model doesn't fit any other way.
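
A toy sketch of the tensor-parallel idea, using plain Python lists in place of GPUs: split a weight matrix by columns, let each shard compute its slice of the output, then concatenate. The helper names are illustrative, and the final concatenation stands in for the collective communication a real setup would perform between devices.

```python
def matvec(x, w):
    """Multiply row vector x (length d) by matrix w (d rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split w into `parts` column shards of equal width."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

full = matvec(x, w)

# Each shard would live on its own GPU; here they are just lists.
shards = split_columns(w, parts=2)
partials = [matvec(x, shard) for shard in shards]
gathered = [v for part in partials for v in part]  # stands in for an all-gather

print(full)      # [1.0, 2.0, 4.0, 7.0]
print(gathered)  # [1.0, 2.0, 4.0, 7.0]
```

Each shard holds half the weights and does half the arithmetic, which is where the latency gain comes from; the gather step is the communication cost that grows once the shards sit on different nodes.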

Production is a systems problem, not a CUDA problem. Autoscaling, cold starts, routing, observability, zero-downtime deployments, and cost estimation all live above the runtime layer — and they determine whether your optimized model actually serves users reliably.

The infrastructure stories that teach

The book doesn't hide the real tensions: a decode phase throttled by memory bandwidth while compute sits idle, a KV cache that fills and evicts constantly because prefix ordering was never considered, a speculative decoding setup that improves p50 latency while quietly destroying throughput at scale. Each scenario ends with a concrete decision, a changed configuration, or a new constraint — and those stack into a system that degrades gracefully instead of collapsing suddenly.

Two ideas that quietly run through the book

The bottleneck always moves. You don't optimize inference once. You find the constraint — compute, memory bandwidth, cache hit rate, cold start time — fix it, and immediately find the next one. The work is iterative and never finished.

Know your use case before you touch the stack. Online or offline. Consumer or B2B. Latency-sensitive or throughput-sensitive. Every optimization decision flows downstream from those answers. Getting them wrong means tuning the wrong thing confidently.

What this looks like in practice

Define your latency budget before touching runtime — end-to-end, not just GPU time, so you know what you're actually optimizing for
Set prefix caching up early and protect it — keep novel tokens as late in your context as possible, or you pay full prefill cost every time
Measure quality before quantizing — establish a baseline with evals before lowering precision, so you know what you accepted and why
Watch batch size when enabling speculation — latency improves, throughput drops, cost rises; make that tradeoff consciously
Fix health probes and autoscaling before adding GPUs — cold start behavior and concurrency limits determine real availability more than raw compute
Only disaggregate at scale — separate prefill and decode workers when the traffic justifies the operational complexity, not before
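
The first item in this list can be sketched as simple arithmetic: add up every stage a request passes through and compare the total with the budget. The stage breakdown and all numbers below are hypothetical; substitute measurements from your own system.

```python
def end_to_end_ms(network_ms, queue_ms, prefill_ms, output_tokens, tokens_per_s):
    """Estimate total request latency in milliseconds.

    Assumption: stages run one after another and decode speed is constant.
    """
    decode_ms = 1000 * output_tokens / tokens_per_s
    return network_ms + queue_ms + prefill_ms + decode_ms


budget_ms = 3000  # hypothetical product requirement
estimate = end_to_end_ms(
    network_ms=80,
    queue_ms=50,
    prefill_ms=300,
    output_tokens=200,
    tokens_per_s=100,
)

print(f"estimate: {estimate:.0f} ms, budget: {budget_ms} ms")
print("within budget" if estimate <= budget_ms else "over budget")
```

With these made-up numbers decode accounts for 2,000 of the 2,430 ms, so shaving prefill or network time would barely move the total. Seeing that breakdown first is the point of setting the budget end to end.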

Who will get the most out of it

Engineers who've deployed models via API and now need to understand what's happening underneath when they own the serving layer.

ML engineers and SREs who want explanations that match production reality — not benchmark presentations from a conference slide.

Builders working with open models on AWS, GCP, or self-hosted clusters who need to close the gap between "it works" and "it works under load."

What felt different to read

The writing is direct: here's the bottleneck, here's why it exists, here's what you can do about it, here's what that costs. It never pretends you can win on latency, throughput, and cost simultaneously — it shows you how to pick which one matters most for your product right now. It also doesn't romanticize any single tool. vLLM, SGLang, and TensorRT-LLM each get honest assessments of where they fit and where they don't.

What this book pushed me to change

Think about prefill and decode separately when something feels slow — they break in completely different ways
Set KV cache allocation explicitly instead of leaving it at defaults, and understand where overflow goes
Treat quantization as a quality contract — measure first, ship second
Watch batch size whenever speculative decoding is enabled, because the throughput cost is real
Stop reaching for disaggregation until the traffic numbers actually justify it
Review the constraint weekly, because it moved while I was busy with something else

Verdict

If you want universal rules and clean architecture blueprints, look elsewhere. If you want a clear, honest view of how inference actually works — the hardware, the software, the techniques, and the production systems — and how real teams make real tradeoffs under pressure, this is worth your full day.

It's free. It treats the inference layer seriously. And the inference layer is where AI products are actually won or lost.

📖 Free PDF at baseten.com/inference-engineering
