The Inference Era
Large language models (LLMs) are the reasoning engines of modern AI. Today, a major inflection point has arrived: as the world races to deploy AI at scale, model inference has moved to the center of the stack. Welcome to the inference era.
Without proper optimization, however, LLMs can be expensive and slow to serve. Hands-On LLM Serving and Optimization is a comprehensive guide to the complexities of deploying and optimizing LLMs at scale.
In this hands-on, engineering-focused book, authors Chi Wang and Peiheng Hu combine practical examples, code, and strategies for building robust, performant, and cost-efficient AI token factories. Whether you’re building LLM inference infrastructure or the applications that consume it, a deep understanding of LLM serving will make you a more effective, future-ready engineer as AI transforms how we work and build.
Praise for the Book
Most AI teams treat LLM serving as an afterthought until real traffic hits. This book provides the technical intuition and hands-on understanding every developer needs, from KV caching to speculative decoding, to tackle the hardest questions that arise while configuring the serving infrastructure needed to deploy agents in production.
Enterprise AI Architect; former engineer at Google Cloud TPU and SambaNova Systems
The missing manual for LLM serving and inference — comprehensive coverage of LLM serving challenges and optimization techniques such as scaling attention, multi-node inferencing, and disaggregation, with real-world examples. Essential reading for anyone scaling AI infrastructure.
Engineering Manager, Broadcom
Hands-On LLM Serving and Optimization delivers real-world insight into the model serving architectures and optimization techniques required to build scalable, efficient LLM inference systems. Its hands-on approach makes complex LLM serving concepts accessible for anyone.
Engineering leader in LLM inference
As LLMs become core to modern AI platforms, this book offers a practical roadmap for serving and optimizing LLMs at scale — handling large workloads with reliability and cost efficiency.
Vice President (VP) of AI Cloud, Salesforce
What You’ll Learn
Foundations
Learn the foundations of model serving with core concepts, design paradigms, and industry best practices.
Challenges at Scale
Understand the common challenges of hosting LLMs at scale.
Latency & Throughput
Balance latency and throughput to meet the demands of AI applications and business requirements.
Cost Efficiency
Host LLMs cost-effectively with practical, code-backed techniques.
Book Contents
- Introduction to Model Serving and Optimization
  Anatomy of a model, the model lifecycle, what model serving is and why we optimize it, serving paradigms (edge, single-model, multi-model), and serving platforms.
- Large Language Model Serving
  Inside the transformer, autoregressive generation, decoder-only architecture, attention, the KV cache, prefill and decode, and running LLMs with a serving framework (vLLM).
- Model-Serving System Design: A Deep Dive
  Building online single-model and multi-model serving services from scratch, batching and streaming, NVIDIA Triton, and cost- vs. latency-optimized designs.
- Model Serving Best Practices
  Serving in an agentic world (agents, RAG, CAG), enterprise serving architecture, building with an open source stack or a cloud vendor, build-or-buy strategy, and performance measurement.
- Challenges When Serving LLMs
  Why optimization matters, accelerator chips and GPU specs, model-loading bottlenecks, KV cache sizing, and arithmetic-intensity analysis of prefill and decode.
- Essential LLM Optimization Techniques
  Request batching and scheduling (dynamic, continuous, chunked prefill), scalable attention and kernel fusion, model compression (quantization, distillation, pruning), and prefix caching / RadixAttention.
- Advanced LLM Optimization Techniques
  Speculative decoding, multi-GPU and multi-node inference (data/tensor/pipeline/expert parallelism), prefill–decode disaggregation, and advanced KV caching for long-context serving.
- LLM Serving Frameworks
  vLLM internals (architecture, scheduler, layered optimization), TensorRT-LLM, SGLang, llama.cpp, and how to select the right framework.
- LLM Optimization in Practice
  A step-by-step optimization plan for Qwen3-14B with vLLM, from hardware inspection and benchmarking to quantization, distributed serving, and common trade-offs.
- Advancements in LLM Serving
  Semantic caching, performance profiling, multimodal serving, edge AI, multi-LoRA serving, and model serving in reinforcement learning.
Code & Notebooks
All companion source code, notebooks, and examples are organized by chapter in the GitHub repository. They cover serving services built from scratch, vLLM, TensorRT-LLM, SGLang, quantization, speculative decoding, and end-to-end optimization benchmarks.