New · Chapters 0–7 available

Inference Engineering

120 lessons across 20 phases. Every layer of the inference stack, built up from the math, taken down to production.

Prefill vs decode bottlenecks · 10m · INF-101

FlashAttention from first principles · 12m · INF-100

KV-cache layout and reuse · 10m · INF-099

Speculative decoding: draft + verify · 11m · INF-098

Quantization: FP8, INT4 trade-offs · 11m · INF-097

Tensor & pipeline parallelism · 9m · INF-096

Trusted engines & frameworks covered in depth

PyTorch

vLLM

SGLang

TensorRT-LLM

CUDA

Triton

Why this course

Built for engineers who ship inference.

Most ML courses end where inference engineering begins. This one starts there: a curriculum that takes you from understanding a model to serving one at scale.

From first principles

Start at nvidia-smi. End at disaggregated serving. Every primitive is derived, not declared — you'll know why each technique exists before you use it.

Production-shaped

Every chapter ends in a real engineering decision. Quantize or not? Speculate or batch? Scale up or out? Each answer is wired to the load profile that demands it.

Engine-aware, not engine-locked

vLLM, SGLang, TensorRT-LLM, Dynamo: read them the way a senior engineer would. Know what each one optimizes for and when you'd reach for it.

Multimodal from day one

Text, image-gen, video, ASR. Inference engineering isn't an LLM-only practice — and this course doesn't treat it like one.

Hardware honest

Hopper, Ada, Blackwell, Rubin. TPUs, Trainium, Groq. You'll know the silicon you're writing against — not just the API surface above it.

Engineer, not researcher

No proofs, no derivations of attention from scratch. Concepts arrive when a system call demands them. The pace is set by what ships.

Curriculum

Eight chapters. One inference pipeline, end to end.

Each chapter is a self-contained module — read sequentially for the full arc, or drop in on a topic when work demands it.

Chapter 0 · Inference

Larry's first signpost. What inference is, what this course covers, and what he needs to know to take the road ahead.

3 lessons · 21 min

Chapter 1 · Prerequisites

Before Larry leaves the village. Scale, app shape, model selection, the latency-throughput language.

10 lessons · 75 min

Chapter 2 · Models

Larry takes apart the engine. Neural nets, transformer blocks, attention, MoE, image-gen — and the bottlenecks that bite each.

15 lessons · 128 min

Chapter 3 · Hardware

The silicon Larry will read about for the rest of his career. GPU generations, instances, accelerators, local inference.

10 lessons · 72 min

Chapter 4 · Software

Larry opens his terminal. CUDA, PyTorch, the engines, the benchmarks.

13 lessons · 104 min

Chapter 5 · Techniques

The gripping chapter. Quantization, speculative decoding, caching, parallelism, disaggregation. Larry's toolbox.

17 lessons · 156 min

Chapter 6 · Modalities

Beyond text. Vision, embeddings, ASR, TTS, image gen, video gen — every modality bolted on, every shape change documented.

14 lessons · 107 min

Chapter 7 · Production

Larry ships. Containers, autoscalers, multi-cloud, observability, the client. The layer that decides whether 3 AM wakes you up.

19 lessons · 141 min

How we teach this

Engineering pace, editorial care.

This isn't a survey. Every lesson is the result of a specific question a working engineer asked their inference team — and the answer they wished they'd had.

Every chapter starts with the math, then the code, then the production constraint that bends both. You'll write the naive version, profile it, and watch each optimisation peel off a layer of cost.

The curriculum assumes you can read Python and have run a model on a GPU. It does not assume you've written a kernel, sized a KV cache, or argued with an autoscaler at 3 AM. Those skills land here.

No frameworks until phase 8. You'll build a working decoder, a KV cache, a sampler, and a batch scheduler from scratch, then read vLLM and TensorRT-LLM with the right vocabulary in your head.

Every phase ends with a production case: a real number you'd hit, a real bug you'd file, a real call you'd make. The goal isn't to know the trick. The goal is to know when to reach for it.

Access

The whole course is free. Just sign in.

No tiers, no checkout. Create a free account and read every lesson, top to bottom.

Free

$0 forever

Every chapter and all 101 lessons
Every interactive widget and kernel walkthrough
The glossary and reference
New lessons as they ship — no extra cost

Start with Lesson 1.
