From first principles
Start at nvidia-smi. End at disaggregated serving. Every primitive is derived, not declared — you'll know why each technique exists before you use it.
120 lessons across 20 phases. Every layer of the inference stack, built up from the math, taken down to production.
Trusted engines & frameworks covered in depth
Most ML courses end where inference engineering begins. This one starts there: a curriculum that takes you from understanding a model to serving one at scale.
Every chapter ends in a real engineering decision. Quantize or not? Speculate or batch? Scale up or out? Each answer is wired to the load profile that demands it.
vLLM, SGLang, TensorRT-LLM, Dynamo: read them the way a senior engineer would. Know what each one optimizes for and when you'd reach for it.
Text, image-gen, video, ASR. Inference engineering isn't an LLM-only practice — and this course doesn't treat it like one.
Hopper, Ada, Blackwell, Rubin. TPUs, Trainium, Groq. You'll know the silicon you're writing against — not just the API surface above it.
No proofs, no derivations of attention from scratch. Concepts arrive when a system call demands them. The pace is set by what ships.
Each chapter is a self-contained module — read sequentially for the full arc, or drop in on a topic when work demands it.
This isn't a survey. Every lesson is the result of a specific question a working engineer asked their inference team — and the answer they wished they'd had.
every chapter starts with the math, then the code, then the production constraint that bends both. you'll write the naive version, profile it, and watch each optimization peel off a layer of cost.
the curriculum assumes you can read python and have run a model on a gpu. it does not assume you've written a kernel, sized a kv cache, or argued with an autoscaler at 3am. those skills land here.
no frameworks until phase 8. you'll build a working decoder, a kv cache, a sampler, and a batch scheduler from scratch — then read vllm and tensorrt-llm with the right vocabulary in your head.
every phase ends with a production case: a real number you'd hit, a real bug you'd file, a real call you'd make. the goal isn't to know the trick. the goal is to know when to reach for it.
No tiers, no checkout. Create a free account and read every lesson, top to bottom.
Sign in once. Read everything.