In December 2025 the SGLang team did something rare in open source: they wrote a second, smaller implementation of their own system — not to replace it, but to explain it. mini-SGLang is roughly 5,000 lines of Python, ~140× smaller than its parent. It preserves the four-process architecture, RadixAttention, chunked prefill, overlap scheduling, and tensor parallelism. Everything else — 27 attention backends, hierarchical caches, disaggregation, speculative decoding, EAGLE, multimodal — is gone. The result is a system you can read end-to-end in an afternoon, and which makes the production engine's 728k lines suddenly legible.
The previous entry in this series read full SGLang end-to-end. I came away admiring the engineering but uneasy about the reading experience — a 4,006-line single-class Scheduler, a 3,607-line ModelRunner, 27 attention backends. mini-SGLang is the project I wish I'd read first.
This entry is structured around that wish. Each section reads a piece of mini-SGLang, and where the design choice illuminates something about the full SGLang — or about software pedagogy in general — there's a margin note or a numbered Reflection callout.
§ 01 · Accounting · The 140× shrink, accounted.
To know what was kept, count what was cut. Here's the file-level comparison for the modules that appear in both:
| Module / file | full SGLang | mini | ratio |
|---|---|---|---|
| scheduler.py | 4,006 lines + Mixins | 280 lines + 6 helpers | 14× |
| model_runner.py / engine.py | 3,607 lines | 253 lines | 14× |
| radix_cache.py | 828 lines | 253 lines | 3.3× |
| attention backends | 27 backends | 2 backends | 13× |
| models | ~50 architectures | 2 (Llama, Qwen3) | 25× |
| hardware backends | 4 (gpu/mlx/musa/npu) | 1 (gpu) | 4× |
| whole package | 728,969 LOC | ~5,000 LOC | 140× |
The scheduler's 14× compression is the most informative. Full SGLang's scheduler is a single 4,006-line class with Mixins (SchedulerMlxOverlapMixin, scheduler_dp_attn_mixin, scheduler_output_processor_mixin, scheduler_input_blocker). The mini version splits the same core responsibilities across 7 small files: scheduler.py (280 lines, the loop) plus six helpers, cache.py, config.py, decode.py, io.py, prefill.py, and table.py. The responsibilities are the same; the cognitive surface is completely different.
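The ratios in the table are easy to recompute. The short check below uses only the counts quoted above; the dictionary layout and the function name are illustrative. With mini rounded to 5,000 lines, the whole-package figure comes out nearer 146×, so "140×" is best read as a round number.

```python
# Line counts and item counts copied from the comparison table above.
COUNTS = {
    "scheduler.py": (4006, 280),
    "model_runner.py / engine.py": (3607, 253),
    "radix_cache.py": (828, 253),
    "attention backends": (27, 2),
    "whole package": (728_969, 5_000),  # the mini figure is approximate
}


def shrink_ratio(full: int, mini: int) -> float:
    """How many times larger the full-SGLang figure is than the mini one."""
    return full / mini


ratios = {name: shrink_ratio(full, mini) for name, (full, mini) in COUNTS.items()}

for name, ratio in ratios.items():
    print(f"{name:30s} {ratio:6.1f}x")
```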
§ 02 · Skeleton · What's preserved: the bones.
Despite the 140× shrink, mini-SGLang is the same shape as full SGLang. Four cooperating processes communicate over ZeroMQ for control and NCCL (via torch.distributed) for tensors:
- API Server — HTTP entry, OpenAI-compatible
- Tokenizer Worker — text → token ids
- Scheduler Worker — request batching + forward dispatch (one per GPU)
- Detokenizer Worker — token ids → streamed text
The mini version has one structural simplification: in full SGLang the scheduler is on every GPU (N schedulers for TP=N, fully symmetric). In mini, the scheduler is the only "manager" process and the workers are pure execution endpoints. This is a real architectural decision, not just code shrinkage — and arguably the cleaner one for understanding.
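To make the four-role shape concrete, here is a toy pipeline with the same stages. It is a sketch under loud assumptions: threads and `queue.Queue` stand in for separate processes and ZeroMQ sockets, the "tokenizer" assigns one id per word, and the "scheduler" simply reverses the ids instead of running a model. None of the function names come from mini-SGLang.

```python
import queue
import threading

STOP = None  # sentinel that tells each stage to shut down


def tokenizer_worker(inbox, outbox):
    """Text -> token ids (toy: one id per whitespace-separated word)."""
    vocab = {}
    while (msg := inbox.get()) is not STOP:
        req_id, text = msg
        ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]
        outbox.put((req_id, ids))
    outbox.put(STOP)  # propagate shutdown downstream


def scheduler_worker(inbox, outbox):
    """Stand-in for batching + forward: 'generates' the prompt reversed."""
    while (msg := inbox.get()) is not STOP:
        req_id, ids = msg
        outbox.put((req_id, ids[::-1]))
    outbox.put(STOP)


def detokenizer_worker(inbox, results):
    """Token ids -> text (toy: render the ids as strings)."""
    while (msg := inbox.get()) is not STOP:
        req_id, ids = msg
        results[req_id] = " ".join(str(i) for i in ids)


def serve(prompts):
    """Plays the API-server role: submit prompts, wait, return results."""
    q_tok, q_sched, q_detok = queue.Queue(), queue.Queue(), queue.Queue()
    results = {}
    stages = [
        threading.Thread(target=tokenizer_worker, args=(q_tok, q_sched)),
        threading.Thread(target=scheduler_worker, args=(q_sched, q_detok)),
        threading.Thread(target=detokenizer_worker, args=(q_detok, results)),
    ]
    for stage in stages:
        stage.start()
    for req_id, text in enumerate(prompts):
        q_tok.put((req_id, text))
    q_tok.put(STOP)
    for stage in stages:
        stage.join()
    return results


outputs = serve(["a b c", "c a"])
```

Each stage only knows its inbox and outbox, which is the property that makes the real workers readable in isolation.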
Small is a lens, not a constraint.
It's tempting to read mini-SGLang as a "stepping stone toward the real thing." That framing undersells it. The 5,000-line version isn't lesser — it's a different artifact serving a different purpose. The production engine optimizes for throughput, model coverage, hardware portability, and feature breadth. The teaching engine optimizes for read time per concept. Both are correct; both are necessary.
The cultural innovation in publishing mini-SGLang isn't the code; it's the decision to write code whose primary user is a reader, not an operator. That decision is rare in open source and has historically come from individuals (Karpathy's nanoGPT, Kaiming He's MAE) rather than from institutional teams. LMSYS doing this at scale is, quietly, a precedent.
§ 03 · Path · A guided reading order.
Because mini is small enough to read sequentially, there's a real "correct order." Here's the path I'd recommend, in eight stops:
1. __main__.py (3 lines): just calls launch_server(). The thinnest possible entry.
2. server/: spawns the four processes, wires their ZMQ sockets.
3. tokenizer/: wraps the HF tokenizer; the simplest worker, and it helps you read the ZMQ patterns.
4. core.py: defines Req, Batch, SamplingParams. These are the value objects that flow through all later modules.
5. kvcache/radix_cache.py (253 lines): the algorithmic substance and the RAII handle pattern. Read this carefully.
6. scheduler/scheduler.py (280 lines) + the 6 helpers: the loop. Once kvcache is clear, the scheduling decisions are obvious.
7. engine/engine.py (253 lines): the GPU side. forward_batch dispatches to attention + model.
8. attention/ + models/ + layers/: the bottom of the stack. Read after you know who calls what.
§ 04 · Scheduler · 280 lines vs four thousand.
The full SGLang scheduler is the system's heart and also its largest single file. The mini version is the same heart, drawn in 280 lines plus six focused helpers:
    scheduler.py   # 280 lines: class Scheduler(SchedulerIOMixin), run_forever()
    cache.py       # CacheManager: coordinates with RadixCache
    config.py      # SchedulerConfig: knobs
    decode.py      # DecodeManager: Phase A, continue decoding
    prefill.py     # PrefillManager + ChunkedReq: Phase B, admit + chunked prefill
    io.py          # SchedulerIOMixin: ZMQ I/O
    table.py       # TableManager: per-request block tables
Each file has one concern. decode.py handles "what to do with already-running requests"; prefill.py handles "what new requests can we admit, possibly chunking"; cache.py mediates with the radix cache; table.py tracks block tables. The main scheduler.py just orchestrates these. This is what 4,006 lines look like when you stop optimizing for production and start optimizing for the reader.
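The division of labour between the decode and prefill helpers can be sketched in a few dozen lines. This is a minimal illustration, not mini-SGLang's code: the class names echo the listing above, but every field and method signature is hypothetical. It also assumes both phases share one per-step token budget and that a request starts decoding on the step after its prompt finishes, which the source does not confirm.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    rid: int
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0  # prompt tokens already processed
    generated: int = 0  # output tokens produced so far


class DecodeManager:
    """Phase A: requests whose prompt is fully prefilled."""

    def __init__(self):
        self.running: list[Req] = []

    def schedule(self) -> list[tuple[int, int]]:
        # Every running request advances by exactly one token per step.
        return [(r.rid, 1) for r in self.running]

    def finish_step(self) -> None:
        for r in self.running:
            r.generated += 1
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]


class PrefillManager:
    """Phase B: admit waiting requests, chunking prompts to fit the budget."""

    def __init__(self):
        self.waiting: deque[Req] = deque()

    def schedule(self, budget: int) -> tuple[list[tuple[int, int]], list[Req]]:
        plan, done = [], []
        while self.waiting and budget > 0:
            req = self.waiting[0]
            chunk = min(budget, req.prompt_len - req.prefilled)
            req.prefilled += chunk
            budget -= chunk
            plan.append((req.rid, chunk))
            if req.prefilled == req.prompt_len:
                done.append(self.waiting.popleft())  # prompt fully processed
        return plan, done


def step(decode: DecodeManager, prefill: PrefillManager, token_budget: int):
    """One scheduler iteration: decode first, then spend what is left on prefill."""
    decode_plan = decode.schedule()
    prefill_plan, ready = prefill.schedule(token_budget - len(decode_plan))
    decode.finish_step()
    decode.running.extend(ready)  # newly prefilled requests decode from the next step
    return decode_plan, prefill_plan


decode, prefill = DecodeManager(), PrefillManager()
prefill.waiting.extend(
    [Req(0, prompt_len=10, max_new_tokens=2), Req(1, prompt_len=3, max_new_tokens=1)]
)
trace = [step(decode, prefill, token_budget=8) for _ in range(4)]
```

In the trace, request 0's ten-token prompt is split into chunks of 8 and 2, and request 1 is admitted in the same step as the second chunk because budget remains.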
Splitting is the lesson.
The single most teaching-rich move in mini-SGLang is that scheduler split. It shows what the full SGLang's scheduler could look like if cognitive load were optimized over locality. Mixins are clever because they avoid file-explosion, but they also scatter behavior across inheritance chains in ways grep can't easily reveal. Reading a 4,006-line class with four Mixins requires the reader to mentally maintain a partial method-resolution order — a tax that compounds with codebase age.
For your own systems: when a class crosses roughly 800 lines, ask whether you're really building one thing or whether you're papering over a missed decomposition. The teaching version of your future system will probably split it — so the production version should consider splitting it sooner.
§ 05 · RadixCache · RadixCache, quietly refactored.
The 253-line kvcache/radix_cache.py in mini-SGLang isn't just shorter than its 828-line parent — it has a quietly better API. Where full SGLang requires callers to manually pair inc_lock_ref() and dec_lock_ref(), mini introduces a single lock_handle() that returns a RadixCacheHandle dataclass:
    # API outline; method bodies are elided.
    from __future__ import annotations
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class RadixCacheHandle:
        node: RadixTreeNode
        matched_indices: list[int]
        # released automatically when the handle goes out of scope

    class RadixPrefixCache:
        def lock_handle(self, key) -> RadixCacheHandle: ...
        def match_prefix(self, key) -> list[int]: ...
        def insert_prefix(self, key, value) -> None: ...
        def evict(self, num_tokens: int) -> int: ...
        def reset(self) -> None: ...            # raises NotImplementedError
        def check_integrity(self) -> None: ...  # no-op
The handle is a small, idiomatic Python pattern (frozen dataclass + explicit lifecycle). It makes leakage essentially impossible. The full SGLang couldn't introduce this without breaking many internal callers; mini, having no callers to protect, ships the cleaner design.
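A runnable miniature shows why a handle is harder to misuse than paired increment and decrement calls. Everything here is an assumption made for illustration: the tree is an uncompressed trie with one token per node rather than a true radix tree, the handle is scoped with a context manager, and eviction takes the first unlocked leaf it finds instead of following a recency policy. The class and method names are hypothetical, not mini-SGLang's.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    parent: Optional["Node"] = None
    token: Optional[int] = None
    children: dict = field(default_factory=dict)
    ref_count: int = 0  # > 0 means an in-flight request depends on this node


@dataclass(frozen=True)
class Handle:
    node: Node        # deepest node matched by the lookup
    matched_len: int  # number of prompt tokens already cached


class PrefixCache:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node(parent=node, token=t))

    def match_prefix(self, tokens) -> Handle:
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, matched = node.children[t], matched + 1
        return Handle(node, matched)

    @contextmanager
    def lock(self, tokens):
        """Pin the matched path for the duration of the with-block."""
        handle = self.match_prefix(tokens)
        self._adjust(handle.node, +1)
        try:
            yield handle
        finally:
            self._adjust(handle.node, -1)  # always released, even on error

    def _adjust(self, node, delta) -> None:
        while node is not None:
            node.ref_count += delta
            node = node.parent

    def evict(self, num_tokens: int) -> int:
        """Remove up to num_tokens unlocked leaves; return how many were freed."""
        freed = 0
        while freed < num_tokens:
            leaf = self._find_unlocked_leaf(self.root)
            if leaf is None:
                break
            del leaf.parent.children[leaf.token]
            freed += 1
        return freed

    def _find_unlocked_leaf(self, node):
        for child in node.children.values():
            found = self._find_unlocked_leaf(child)
            if found is not None:
                return found
        if node is not self.root and not node.children and node.ref_count == 0:
            return node
        return None


cache = PrefixCache()
cache.insert([1, 2, 3])
cache.insert([1, 2, 4, 5])

with cache.lock([1, 2, 3]) as handle:
    freed_while_locked = cache.evict(10)  # only the unlocked 4 -> 5 branch can go
freed_after_release = cache.evict(10)     # now the 1 -> 2 -> 3 path is evictable
```

Because the release sits in a `finally` clause, a caller cannot forget the decrement, which is the property the handle design is after.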
The teaching version is sometimes better.
You'd expect a minimal version to be strictly a subset of its parent: fewer features, otherwise the same. But mini-SGLang occasionally improves on full SGLang: the RAII handle for KV cache references, the seven-file scheduler split, the reduction to a single I/O Mixin, the single overlap loop instead of dual normal/overlap loops. These are refactors the production version can't take cheaply, because they'd break callers.
So mini-SGLang functions as a shadow design — a working sketch of what full SGLang could become given a green field. That's another reason to read it: not just to learn the current system, but to glimpse the next version of it.
§ 06 · Engine · The Engine, in 253 lines.
engine/engine.py is mini's equivalent of full SGLang's 3,607-line model_runner.py. The shrinkage ratio is the same as the scheduler (14×), and the technique is the same: pull related concerns into sibling files.
    class Engine:
        def __init__(self, config: EngineConfig): ...
        def _init_communication(self): ...       # NCCL groups
        def _load_weight_state_dict(self): ...   # HF or generated
        def _determine_num_pages(self): ...      # KV page budget
        def _sync_get_memory(self): ...          # all-reduce free mem

        def forward_batch(self, batch) -> ForwardOutput:   # ★ the hot path
            if self.graph_runner.can_replay(batch):
                return self.graph_runner.replay(batch)
            return self.model.forward(batch)

        def shutdown(self): ...

    # Sibling files in engine/:
    # graph.py: class GraphRunner (CUDA graph capture per shape bucket)
    # sample.py: class Sampler (top-k, top-p, temperature)
    # config.py: EngineConfig
Compare this to full SGLang's model_runner.py: 3,607 lines coordinating model forward, attention backend selection (out of 27), KV cache writes, multiple CUDA graph capture strategies (graph/piecewise/breakable/cpu), LoRA, speculative decoding, EP MoE, mixed-precision paths. Every additional concern is real and necessary in production — but it makes the file unreadable as a learning artifact.
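The replay-or-eager branch in forward_batch is easier to follow with a stand-in that needs no GPU. In the sketch below, plain Python callables play the role of captured CUDA graphs. It assumes that only decode batches are replayed and that a batch is padded up to the nearest captured size, which is common practice in inference engines but is not confirmed here for mini-SGLang. The dictionary-based batch and all helper names are hypothetical.

```python
import bisect


def toy_model(tokens):
    """Pretend forward pass: one output per input token."""
    return [t + 1 for t in tokens]


class GraphRunner:
    """One pre-built callable per batch-size bucket, standing in for a CUDA graph."""

    def __init__(self, buckets, make_graph):
        self.buckets = sorted(buckets)
        # "Capture" once per bucket, ahead of time.
        self.graphs = {b: make_graph(b) for b in self.buckets}

    def bucket_for(self, batch_size):
        i = bisect.bisect_left(self.buckets, batch_size)
        return self.buckets[i] if i < len(self.buckets) else None

    def can_replay(self, batch) -> bool:
        return batch["is_decode"] and self.bucket_for(len(batch["tokens"])) is not None

    def replay(self, batch):
        n = len(batch["tokens"])
        bucket = self.bucket_for(n)
        padded = batch["tokens"] + [0] * (bucket - n)  # pad up to the captured shape
        return self.graphs[bucket](padded)[:n]         # drop the padding outputs


class Engine:
    def __init__(self, model, graph_runner):
        self.model = model
        self.graph_runner = graph_runner

    def forward_batch(self, batch):
        if self.graph_runner.can_replay(batch):
            return self.graph_runner.replay(batch)
        return self.model(batch["tokens"])


calls = []  # records which path served each batch


def make_graph(bucket):
    def graph(tokens):
        assert len(tokens) == bucket  # a captured graph only accepts its own shape
        calls.append(("graph", bucket))
        return toy_model(tokens)
    return graph


def eager(tokens):
    calls.append(("eager", len(tokens)))
    return toy_model(tokens)


engine = Engine(eager, GraphRunner([1, 2, 4, 8], make_graph))
out_decode = engine.forward_batch({"is_decode": True, "tokens": [10, 20, 30]})
out_prefill = engine.forward_batch({"is_decode": False, "tokens": [1, 2, 3, 4, 5]})
out_big = engine.forward_batch({"is_decode": True, "tokens": list(range(9))})
```

The three calls exercise the three cases: a decode batch of 3 is padded into the size-4 bucket, a prefill batch runs eagerly, and a decode batch larger than every bucket falls back to eager.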
§ 07 · Cut · What's missing, on purpose.
A list of what mini-SGLang deliberately leaves out, with brief notes on why each is "not architecture":
| Feature | Why it's cut |
|---|---|
| 25 of 27 attention backends | backends are kernel selections, not architecture — once you understand one backend interface, the others are variants |
| ~50 model architectures | model files are mostly weight loading + forward; pattern transfers from Llama to others |
| AITER / ROCm / MUSA / NPU paths | hardware portability is a layered concern atop a working CUDA path |
| Speculative decoding (EAGLE) | throughput optimization, not foundational |
| Disaggregated prefill/decode (P/D) | distributed-system optimization, not core inference |
| Structured generation (JSON, regex, EBNF) | logits-mask feature, orthogonal to scheduling |
| Multimodal (vision / audio) | preprocessor + adapter, doesn't change the serving core |
| Hierarchical / sparse / Mamba caches | specialized memory pools for specific models |
| Connectors to vLLM / TensorRT-LLM | interop, not architecture |
| sgl-kernel as a separate wheel | release-engineering concern, not runtime |
| Web dashboard / observability stack | operations layer |
None of these are unimportant — they're the reasons production SGLang exists. They're just not foundational to "how an LLM inference engine works." mini-SGLang's curatorial line falls exactly at the boundary between architecture and features that ride on top of architecture.
What's missing teaches what's optional.
For your own work — multi-agent kernel optimization, AMD inference, anywhere — the question "what can I cut from this system and still call it the system?" is the most powerful design audit you can run. mini-SGLang is that audit performed on full SGLang in public.
When you build your own inference or orchestration system, mentally write the mini version first. Identify what cannot be removed without breaking the central claim. Build that first; layer the rest as optional features. Production systems that grew this way (designed-mini-first, even if never published) are conspicuously more maintainable than ones that grew by accretion.
§ 08 · Method · How to actually use both.
A concrete reading-and-reference workflow that combines mini-SGLang and full SGLang. This is the practical takeaway for anyone who needs to work in either codebase:
- First pass: read all of mini-SGLang sequentially, in the order given in § 03. ~5 hours. Goal: form vocabulary.
- Indexing: for each mini file, note which full-SGLang file/directory is its parent. Keep this as a private cheat sheet.
- Targeted reads: when you need to touch full SGLang, first find the mini counterpart of the target module. Refresh the simple version (~15 min), then dive into the full version with the simple version's structure in mind.
- Diff reading: when full SGLang adds something mini doesn't have (e.g., disaggregation), read it as an add-on, not a re-learn. The add-on attaches to a part of mini that you already understand.
- Contribution direction: if you find a cleaner pattern in mini that full doesn't have, it's a potential refactor PR for full SGLang. The RAII handle pattern is one such candidate.
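The cheat sheet from the indexing step can be as small as a dictionary. The sketch below includes only the file pairs named in this entry; the function name is made up, and the full-SGLang side is given as bare file names because this entry does not state their directories.

```python
# mini-SGLang file -> full-SGLang counterpart, using only pairs named in this entry.
MINI_TO_FULL = {
    "scheduler/scheduler.py": "scheduler.py",
    "engine/engine.py": "model_runner.py",
    "kvcache/radix_cache.py": "radix_cache.py",
}


def warmup_for(full_file: str) -> list[str]:
    """Return the mini files worth re-reading before opening `full_file`."""
    return [mini for mini, full in MINI_TO_FULL.items() if full == full_file]


print(warmup_for("model_runner.py"))
```

Extending the mapping as you read is the point; an empty result is a hint that the target is one of the add-ons mini leaves out.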
§ 09 · Epilogue · Pedagogy, as code.
What LMSYS did with mini-SGLang is, in a small way, an example for the open-source community at large. Big systems become impenetrable. Documentation tries to bridge the gap and often fails because docs decay faster than code. A second smaller implementation, maintained in lockstep with the main one, is a third path — neither just-code nor just-docs, but code that exists to be read.
If you're an MLSys engineer in 2026, the most valuable hour you can spend this week is reading mini-SGLang. It will give you the vocabulary to read every modern inference engine — vLLM, TensorRT-LLM, your own future system — with confidence. And it will quietly shape how you write your own code: smaller, more decomposed, more readable. That's the real shrink.
— Fin.