The Inference Era

Large language models (LLMs) are the reasoning engines of modern AI. Today, a major inflection point has arrived: as the world races to deploy AI at scale, model inference has moved to the center of the stack. Welcome to the inference era.

Without proper optimization, however, LLMs can be expensive and slow to serve. Hands-On LLM Serving and Optimization is a comprehensive guide to the complexities of deploying and optimizing LLMs at scale.

In this hands-on, engineering-focused book, authors Chi Wang and Peiheng Hu combine practical examples, code, and strategies for building robust, performant, and cost-efficient AI token factories. Whether you’re building the LLM inference infrastructure or the applications that consume it, a deep understanding of LLM serving will make you a more effective, future-ready engineer as AI transforms how we work and build.

Praise for the Book

Most AI teams treat LLM serving as an afterthought until real traffic hits. This book provides the technical intuition and hands-on understanding every developer needs, from KV caching to speculative decoding, to tackle the hardest questions that arise while configuring serving infrastructure needed to deploy agents in production.

Vineeth Kalluru
Enterprise AI Architect; former engineer at Google Cloud TPU and SambaNova Systems

The missing manual for LLM serving and inference — comprehensive coverage of LLM serving challenges and optimization techniques such as scaling attention, multi-node inferencing, and disaggregation, with real-world examples. Essential reading for anyone scaling AI infrastructure.

Winnie Kwon
Engineering Manager, Broadcom

Hands-On LLM Serving and Optimization delivers real-world insight into the model serving architectures and optimization techniques required to build scalable, efficient LLM inference systems. Its hands-on approach makes complex LLM serving concepts accessible for anyone.

Patrice Castonguay
Engineering leader in LLM inference

As LLMs become core to modern AI platforms, this book offers a practical roadmap for serving and optimizing LLMs at scale — handling large workloads with reliability and cost efficiency.

Bhavesh Doshi
Vice President (VP) of AI Cloud, Salesforce

What You’ll Learn

Foundations

Learn the foundations of model serving with core concepts, design paradigms, and industry best practices.

Challenges at Scale

Understand the common challenges of hosting LLMs at scale.

Latency & Throughput

Balance latency and throughput to meet the demands of AI applications and business requirements.

Cost Efficiency

Host LLMs cost-effectively with practical, code-backed techniques.

Book Contents

Introduction to Model Serving and Optimization

Anatomy of a model, the model lifecycle, what model serving is and why we optimize it, serving paradigms (edge, single-model, multi-model), and serving platforms.

Large Language Model Serving

Inside the transformer, autoregressive generation, decoder-only architecture, attention, the KV cache, prefill and decode, and running LLMs with a serving framework (vLLM).

Model-Serving System Design: A Deep Dive

Building online single-model and multi-model serving services from scratch, batching and streaming, NVIDIA Triton, and cost- vs. latency-optimized designs.

Model Serving Best Practices

Serving in an agentic world (agents, RAG, CAG), enterprise serving architecture, building with an open source stack or a cloud vendor, build-or-buy strategy, and performance measurement.

Challenges When Serving LLMs

Why optimization matters, accelerator chips and GPU specs, model-loading bottlenecks, KV cache sizing, and arithmetic-intensity analysis of prefill and decode.

Essential LLM Optimization Techniques

Request batching and scheduling (dynamic, continuous, chunked prefill), scalable attention and kernel fusion, model compression (quantization, distillation, pruning), and prefix caching / RadixAttention.

Advanced LLM Optimization Techniques

Speculative decoding, multi-GPU and multinode inference (data/tensor/pipeline/expert parallelism), prefill–decode disaggregation, and advanced KV caching for long-context serving.

LLM Serving Frameworks

vLLM internals (architecture, scheduler, layered optimization), TensorRT-LLM, SGLang, llama.cpp, and how to select the right framework.

LLM Optimization in Practice

A step-by-step optimization plan for Qwen3-14B with vLLM, from hardware inspection and benchmarking to quantization, distributed serving, and common trade-offs.

Advancements in LLM Serving

Semantic caching, performance profiling, multimodal serving, edge AI, multi-LoRA serving, and model serving in reinforcement learning.
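To give a flavor of the KV cache sizing topic listed above, here is a minimal sketch of the standard back-of-the-envelope estimate: each token stores one key and one value vector per layer. The function names and the model dimensions below are hypothetical, chosen for round numbers; they are not taken from the book and do not describe any specific model.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value tensor per layer.
    bytes_per_elem=2 assumes a 16-bit cache dtype such as FP16 or BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Total KV cache bytes for batch_size sequences of seq_len tokens each."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_elem
    )
    return per_token * seq_len * batch_size


# Hypothetical model shape (illustrative only): 40 layers, 8 KV heads, head size 128.
per_token = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128)
total = kv_cache_bytes(num_layers=40, num_kv_heads=8, head_dim=128,
                       seq_len=4096, batch_size=8)

print(f"{per_token / 1024:.0f} KiB per token")            # 160 KiB per token
print(f"{total / 2**30:.0f} GiB for 8 x 4096-token sequences")  # 5 GiB
```

Under these assumed dimensions, the cache alone takes about 5 GiB for eight 4,096-token sequences, which is why cache sizing sits alongside model weights when planning GPU memory.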

Code & Notebooks

All companion source code, notebooks, and examples are organized by chapter in the GitHub repository. They range from serving services built from scratch to vLLM, TensorRT-LLM, SGLang, quantization, speculative decoding, and end-to-end optimization benchmarks.

⚡ GPU recommended. Many of the notebooks and examples (vLLM, TensorRT-LLM, speculative decoding, distributed serving, quantization) require an NVIDIA GPU to run. If you don’t have local GPU access, we recommend Google Colab, which offers great GPU support on a convenient pay-as-you-go basis — just upload a notebook, select a GPU runtime, and go.
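Before running a GPU-dependent notebook, it can help to confirm that an NVIDIA GPU is actually visible to the runtime. The sketch below is one way to do that, assuming the `nvidia-smi` command-line tool that ships with the NVIDIA driver is on the PATH; the helper name `nvidia_gpu_names` is hypothetical and is not part of the companion repository.

```python
import shutil
import subprocess


def nvidia_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # No nvidia-smi on PATH usually means no NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        # The tool exists but failed or timed out (for example, driver not loaded).
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = nvidia_gpu_names()
if gpus:
    print("NVIDIA GPU(s) detected:", ", ".join(gpus))
else:
    print("No NVIDIA GPU detected; consider switching to a GPU runtime.")
```

In a notebook, running this in the first cell gives a quick signal on whether the selected runtime has a GPU attached.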

About the Authors

Chi Wang

Co-author of Hands-On LLM Serving and Optimization.

Peiheng Hu

Co-author of Hands-On LLM Serving and Optimization.