mini-SGLang — Source Reading No. 004 (with reflections)

In December 2025 the SGLang team did something rare in open source: they wrote a second, smaller implementation of their own system, not to replace it but to explain it. mini-SGLang is roughly 5,000 lines of Python, ~140× smaller than its parent. It preserves the four-process architecture, RadixAttention, chunked prefill, overlap scheduling, and tensor parallelism. Most of the rest is gone: 25 of the 27 attention backends, hierarchical caches, disaggregation, speculative decoding (EAGLE), and multimodal support. The result is a system you can read end-to-end in an afternoon, and one that makes the production engine's roughly 729k lines suddenly legible.

The previous entry in this series read full SGLang end-to-end. I came away admiring the engineering but uneasy about the reading experience — a 4,006-line single-class Scheduler, a 3,607-line ModelRunner, 27 attention backends. mini-SGLang is the project I wish I'd read first.

This entry is structured around that wish. Each section reads a piece of mini-SGLang, and where the design choice illuminates something about the full SGLang — or about software pedagogy in general — there's a margin note or a numbered Reflection callout.

5,000 mini-SGLang lines · 728,969 full SGLang lines · 140× ratio · 4 cooperating processes · 2 attention backends · 2 models (Llama, Qwen3)

§ 01 · Accounting · The 140× shrink, accounted.

To know what was kept, count what was cut. Here's the file-level comparison for the modules that appear in both:

Module / file	full SGLang	mini	ratio
`scheduler.py`	4,006 lines + Mixins	280 lines + 6 helpers	14×
`model_runner.py` / `engine.py`	3,607 lines	253 lines	14×
`radix_cache.py`	828 lines	253 lines	3.3×
attention backends	27 backends	2 backends	13.5×
models	~50 architectures	2 (Llama, Qwen3)	25×
hardware backends	4 (gpu/mlx/musa/npu)	1 (gpu)	4×
whole package	728,969 LOC	~5,000 LOC	140×
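
As a quick check on the table, the per-row ratios can be recomputed from the counts it lists. This is a minimal sketch: the dictionary labels are mine, and the whole-package row is left out because the mini line count is only approximate (~5,000).

```python
# Counts copied from the comparison table: (full SGLang, mini-SGLang).
COUNTS = {
    "scheduler.py (lines)": (4006, 280),
    "model_runner.py / engine.py (lines)": (3607, 253),
    "radix_cache.py (lines)": (828, 253),
    "attention backends": (27, 2),
    "model architectures (approx.)": (50, 2),
    "hardware backends": (4, 1),
}


def shrink_ratio(full: int, mini: int) -> float:
    """How many times larger the full-SGLang count is than the mini one."""
    return full / mini


RATIOS = {name: shrink_ratio(full, mini) for name, (full, mini) in COUNTS.items()}

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name:40s} {ratio:5.1f}x")
```

Running it prints one ratio per row, which makes it easy to update the table if either codebase's counts change.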

The scheduler's 14× compression is the most informative. Full SGLang's scheduler is a single 4,006-line class with Mixins (SchedulerMlxOverlapMixin, scheduler_dp_attn_mixin, scheduler_output_processor_mixin, scheduler_input_blocker). The mini version splits the same responsibilities into 7 small files: scheduler.py (280 lines, the loop), cache.py, config.py, decode.py, io.py, prefill.py, table.py. The responsibilities are the same; the cognitive surface is completely different.

Plate I · The shrink, to scale · file-by-file, proportional rectangles · comparative

Width is proportional to line count, clipped to fit. The mini bars don't look that small in absolute terms because each rectangle is drawn at a minimum legible size, but the 140× total-package ratio is no exaggeration. Among the files compared, the scheduler and the runner/engine shrink the most (about 14× each). The radix cache shrinks least (3.3×) because it's algorithmic substance: there's a floor on how short a correct radix tree can be.

§ 02 · Skeleton · What's preserved: the bones.

Despite the 140× shrink, mini-SGLang is the same shape as full SGLang. Four cooperating processes communicate over ZeroMQ for control and NCCL (via torch.distributed) for tensors:

API Server — HTTP entry, OpenAI-compatible
Tokenizer Worker — text → token ids
Scheduler Worker — request batching + forward dispatch (one per GPU)
Detokenizer Worker — token ids → streamed text

The mini version has one structural simplification: in full SGLang the scheduler is on every GPU (N schedulers for TP=N, fully symmetric). In mini, the scheduler is the only "manager" process and the workers are pure execution endpoints. This is a real architectural decision, not just code shrinkage — and arguably the cleaner one for understanding.
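
To see the four roles as a pipeline, here is a toy sketch that swaps ZeroMQ sockets for in-process `queue.Queue` objects and processes for threads. Everything in it is an illustrative assumption rather than mini-SGLang's code: the function names are made up, the "tokenizer" maps characters to code points, and the "scheduler" simply echoes the prompt ids back where a real one would batch requests and run the model.

```python
import queue
import threading

STOP = None  # sentinel passed down the pipeline to shut each stage down in turn


def tokenizer_worker(inbox, outbox):
    """text -> token ids (toy: one id per character)."""
    while (msg := inbox.get()) is not STOP:
        req_id, text = msg
        outbox.put((req_id, [ord(ch) for ch in text]))
    outbox.put(STOP)


def scheduler_worker(inbox, outbox):
    """Stand-in for batching + forward: echoes the prompt ids as the 'output'."""
    while (msg := inbox.get()) is not STOP:
        outbox.put(msg)
    outbox.put(STOP)


def detokenizer_worker(inbox, outbox):
    """token ids -> text."""
    while (msg := inbox.get()) is not STOP:
        req_id, ids = msg
        outbox.put((req_id, "".join(chr(i) for i in ids)))
    outbox.put(STOP)


def serve(prompts):
    """Plays the API-server role: submit prompts, then collect the results."""
    q_tok, q_sched, q_detok, q_out = (queue.Queue() for _ in range(4))
    workers = [
        threading.Thread(target=tokenizer_worker, args=(q_tok, q_sched)),
        threading.Thread(target=scheduler_worker, args=(q_sched, q_detok)),
        threading.Thread(target=detokenizer_worker, args=(q_detok, q_out)),
    ]
    for worker in workers:
        worker.start()
    for req_id, text in enumerate(prompts):
        q_tok.put((req_id, text))
    q_tok.put(STOP)

    results = {}
    while (msg := q_out.get()) is not STOP:
        req_id, text = msg
        results[req_id] = text
    for worker in workers:
        worker.join()
    return results
```

Each stage only knows its inbox and outbox, which is the property that lets the real system put the stages in separate processes.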

Plate II · Mirroring topology · four processes in both engines · structural

Same four-role topology, same wire protocols, same conceptual roles. The mini version pushes the scheduler to single-manager rather than per-GPU — a real simplification, but the responsibilities map. This is the architectural skeleton you carry from mini into reading the full code.

N° 1

Small is a lens, not a constraint.

It's tempting to read mini-SGLang as a "stepping stone toward the real thing." That framing undersells it. The 5,000-line version isn't lesser — it's a different artifact serving a different purpose. The production engine optimizes for throughput, model coverage, hardware portability, and feature breadth. The teaching engine optimizes for read time per concept. Both are correct; both are necessary.

The cultural innovation in publishing mini-SGLang isn't the code; it's the decision to write code whose primary user is a reader, not an operator. That decision is rare in open source, and historically it has come from individuals (Karpathy's nanoGPT, Kaiming He's MAE) rather than from institutional teams. LMSYS doing this at scale is, quietly, a precedent.

§ 03 · Path · A guided reading order.

Because mini is small enough to read sequentially, there's a real "correct order." Here's the path I'd recommend, in eight stops:

__main__.py (3 lines) — just calls launch_server(). The thinnest possible entry.
server/ — spawns the four processes, wires their ZMQ sockets.
tokenizer/ — wrap HF tokenizer; simplest worker, helps you read ZMQ patterns.
core.py — defines Req, Batch, SamplingParams. These are the value objects that flow through all later modules.
kvcache/radix_cache.py (253 lines) — algorithmic substance. RAII handle pattern. Read this carefully.
scheduler/scheduler.py (280 lines) + the 6 helpers — the loop. Once kvcache is clear, the schedule decisions are obvious.
engine/engine.py (253 lines) — the GPU side. forward_batch dispatches to attention + model.
attention/ + models/ + layers/ — the bottom of the stack. Read after you know who calls what.

Plate III · A reading walk · 8 stops through the package · narrative

The walk is roughly 4-5 hours from start to finish. kvcache and scheduler are the two stops where you should slow down — these contain the algorithmic substance that the rest of the system arranges around.

§ 04 · Scheduler · 280 lines vs four thousand.

The full SGLang scheduler is the system's heart and also its largest single file. The mini version is the same heart, drawn in 280 lines plus six focused helpers:

scheduler.py    # 280 — class Scheduler(SchedulerIOMixin), run_forever()
cache.py        # CacheManager — coordinates with RadixCache
config.py       # SchedulerConfig — knobs
decode.py       # DecodeManager — Phase A: continue decoding
prefill.py      # PrefillManager + ChunkedReq — Phase B: admit + chunked prefill
io.py           # SchedulerIOMixin — ZMQ I/O
table.py        # TableManager — per-request block tables

Each file has one concern. decode.py handles "what to do with already-running requests"; prefill.py handles "what new requests can we admit, possibly chunking"; cache.py mediates with the radix cache; table.py tracks block tables. The main scheduler.py just orchestrates these. This is what 4,006 lines look like when you stop optimizing for production and start optimizing for the reader.
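
To make that division of labor concrete, here is a toy sketch of the loop that scheduler.py orchestrates. Only the names Req, DecodeManager, and PrefillManager come from the text above; the fields, the method names, and the per-step token budget are my assumptions, and the forward pass is reduced to counter increments. One further simplification: a real prefill also yields the first output token, which this sketch ignores.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    rid: int
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0   # prompt tokens already processed
    generated: int = 0   # output tokens produced so far


class DecodeManager:
    """Phase A: advance requests whose prompt is fully prefilled."""

    def __init__(self):
        self.running = []

    def step(self):
        for req in self.running:          # one new token per running request
            req.generated += 1
        finished = [r for r in self.running if r.generated >= r.max_new_tokens]
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return finished


class PrefillManager:
    """Phase B: admit waiting requests, chunking prompts that exceed the budget."""

    def __init__(self, token_budget):
        self.token_budget = token_budget
        self.waiting = deque()

    def step(self):
        budget, ready = self.token_budget, []
        while self.waiting and budget > 0:
            req = self.waiting[0]
            take = min(req.prompt_len - req.prefilled, budget)  # one chunk
            req.prefilled += take
            budget -= take
            if req.prefilled == req.prompt_len:
                ready.append(self.waiting.popleft())
        return ready


class Scheduler:
    def __init__(self, token_budget):
        self.decode = DecodeManager()
        self.prefill = PrefillManager(token_budget)

    def run(self, reqs):
        self.prefill.waiting.extend(reqs)
        finished_order, steps = [], 0
        while self.prefill.waiting or self.decode.running:
            finished_order += [r.rid for r in self.decode.step()]  # Phase A
            self.decode.running += self.prefill.step()             # Phase B
            steps += 1
        return finished_order, steps


def demo():
    reqs = [Req(0, prompt_len=10, max_new_tokens=2),
            Req(1, prompt_len=3, max_new_tokens=1)]
    return Scheduler(token_budget=4).run(reqs)
```

With a budget of 4 tokens per step, the 10-token prompt is prefilled over three steps, and the second request starts its prefill with the budget left over from the first one's final chunk.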

Plate IV · Same concerns, different shape · Scheduler structure: full vs mini · organizational

Both engines need to manage prefill, decode, cache, block tables, IO, config. The production engine puts them in one class with Mixins; the teaching engine puts each in its own file. For new contributors, the right-hand layout is meaningfully easier to onboard onto — even if both ultimately do the same work.

N° 2

Splitting is the lesson.

The single most teaching-rich move in mini-SGLang is that scheduler split. It shows what the full SGLang's scheduler could look like if cognitive load were optimized over locality. Mixins are clever because they avoid file-explosion, but they also scatter behavior across inheritance chains in ways grep can't easily reveal. Reading a 4,006-line class with four Mixins requires the reader to mentally maintain a partial method-resolution order — a tax that compounds with codebase age.

For your own systems: when a class crosses roughly 800 lines, ask whether you're really building one thing or whether you're papering over a missed decomposition. The teaching version of your future system will probably split it — so the production version should consider splitting it sooner.
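
The 800-line question is easy to automate. Below is a small sketch that uses the standard-library `ast` module to list classes longer than a threshold; the function name and the default threshold are mine, and the line count includes docstrings, comments, and blank lines inside the class body.

```python
import ast


def long_classes(source: str, threshold: int = 800):
    """Return (class_name, line_count) for every class longer than `threshold`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            # lineno is the `class` line; end_lineno is the last line of the body.
            length = node.end_lineno - node.lineno + 1
            if length > threshold:
                found.append((node.name, length))
    return found


def demo():
    # A synthetic module: one 11-line class and one 2-line class.
    source = "class Big:\n" + "    x = 1\n" * 10 + "class Small:\n    pass\n"
    return long_classes(source, threshold=5)
```

Pointing it at a real file is a one-liner: `long_classes(open(path).read())`.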

§ 05 · RadixCache · RadixCache, quietly refactored.

The 253-line kvcache/radix_cache.py in mini-SGLang isn't just shorter than its 828-line parent — it has a quietly better API. Where full SGLang requires callers to manually pair inc_lock_ref() and dec_lock_ref(), mini introduces a single lock_handle() that returns a RadixCacheHandle dataclass:

class RadixCacheHandle:        # frozen dataclass
    node: RadixTreeNode
    matched_indices: list[int]
    # → released automatically when handle goes out of scope

class RadixPrefixCache:
    def lock_handle(self, key) -> RadixCacheHandle: ...
    def match_prefix(self, key) -> list[int]: ...
    def insert_prefix(self, key, value) -> None: ...
    def evict(self, num_tokens: int) -> int: ...
    def reset(self) -> None: ...                          # NotImplementedError
    def check_integrity(self) -> None: ...                # no-op

The handle is a small, idiomatic Python pattern: a frozen dataclass that stands for the lock it represents. Because acquiring and releasing a reference go through one object instead of two loosely paired calls, a leaked reference becomes much harder to write. Full SGLang couldn't introduce this without breaking many internal callers; mini, having no callers to protect, ships the cleaner design.
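
To illustrate the idea of tying a lock to a handle, here is a hedged sketch, not mini-SGLang's implementation. It makes three assumptions of its own: it uses an uncompressed token-level trie instead of a true radix tree, it releases the lock through a context manager so the release point is deterministic, and every name in it (Node, Handle, PrefixCache, lock, evict) is hypothetical.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class Node:
    parent: Optional["Node"] = None
    children: Dict[int, "Node"] = field(default_factory=dict)
    ref_count: int = 0            # number of live handles at or below this node


@dataclass(frozen=True)
class Handle:
    node: Node                    # deepest node matched by the locked prefix
    matched_len: int              # how many leading tokens were found in the cache


class PrefixCache:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node(parent=node)
            node = node.children[tok]

    def _match(self, tokens):
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node, matched = node.children[tok], matched + 1
        return node, matched

    def _add_ref(self, node, delta):
        while node is not None:   # pin the whole path from node up to the root
            node.ref_count += delta
            node = node.parent

    @contextmanager
    def lock(self, tokens):
        node, matched = self._match(tokens)
        self._add_ref(node, +1)
        try:
            yield Handle(node, matched)
        finally:
            self._add_ref(node, -1)   # runs even if the body raises

    def evict(self) -> int:
        """Drop every unlocked subtree; return the number of nodes freed."""
        def count(node):
            return 1 + sum(count(child) for child in node.children.values())

        def prune(node):
            freed = 0
            for tok, child in list(node.children.items()):
                if child.ref_count == 0:
                    freed += count(child)
                    del node.children[tok]
                else:
                    freed += prune(child)
            return freed

        return prune(self.root)


def demo():
    cache = PrefixCache()
    cache.insert([1, 2, 3, 4])
    cache.insert([1, 2, 9])
    with cache.lock([1, 2, 3, 7]) as handle:
        freed_while_locked = cache.evict()   # the locked path 1-2-3 survives
    freed_after_release = cache.evict()      # nothing is pinned any more
    return handle.matched_len, freed_while_locked, freed_after_release
```

The point of the pattern is that there is no separate "decrement" call for a caller to forget: leaving the `with` block is the release.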

N° 3

The teaching version is sometimes better.

You'd expect a minimal version to be strictly a subset of its parent: fewer features, otherwise the same. But mini-SGLang occasionally improves on full SGLang: the RAII handle for KV cache references, the seven-file scheduler split, the reduction of the Mixins to a single I/O Mixin, and the single overlap loop instead of dual normal/overlap loops. These are refactors the production version can't take cheaply, because they'd break callers.

So mini-SGLang functions as a shadow design — a working sketch of what full SGLang could become given a green field. That's another reason to read it: not just to learn the current system, but to glimpse the next version of it.

§ 06 · Engine · The Engine, in 253 lines.

engine/engine.py is mini's equivalent of full SGLang's 3,607-line model_runner.py. The shrinkage ratio is the same as the scheduler (14×), and the technique is the same: pull related concerns into sibling files.

class Engine:
    def __init__(self, config: EngineConfig): ...
    def _init_communication(self): ...          # NCCL groups
    def _load_weight_state_dict(self): ...      # HF or generated
    def _determine_num_pages(self): ...         # KV page budget
    def _sync_get_memory(self): ...             # all-reduce free mem
    def forward_batch(self, batch) -> ForwardOutput:  # ★ the hot path
        if self.graph_runner.can_replay(batch):
            return self.graph_runner.replay(batch)
        return self.model.forward(batch)
    def shutdown(self): ...

# graph.py — class GraphRunner (CUDA graph capture per shape bucket)
# sample.py — class Sampler  (top-k, top-p, temperature)
# config.py — EngineConfig

Compare this to full SGLang's model_runner.py: 3,607 lines coordinating model forward, attention backend selection (out of 27), KV cache writes, multiple CUDA graph capture strategies (graph/piecewise/breakable/cpu), LoRA, speculative decoding, EP MoE, mixed-precision paths. Every additional concern is real and necessary in production — but it makes the file unreadable as a learning artifact.
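
The replay-or-eager dispatch in forward_batch can be imitated without a GPU. In this sketch, "capturing a graph per shape bucket" is replaced by picking the smallest batch-size bucket that fits and padding the batch up to it. The bucket sizes, the padding value, and the `model_fn` callable are illustrative assumptions, not mini-SGLang's code, and no actual CUDA graph is involved.

```python
import bisect


class GraphRunner:
    """Toy stand-in for CUDA graph replay: fixed batch-size buckets plus padding."""

    def __init__(self, model_fn, buckets):
        self.model_fn = model_fn
        self.buckets = sorted(buckets)
        self.replays = 0

    def can_replay(self, batch):
        # Only batches that fit the largest captured bucket can be replayed.
        return 0 < len(batch) <= self.buckets[-1]

    def replay(self, batch):
        # Smallest bucket whose size is >= the batch size.
        bucket = self.buckets[bisect.bisect_left(self.buckets, len(batch))]
        padded = batch + [0] * (bucket - len(batch))
        self.replays += 1
        return self.model_fn(padded)[: len(batch)]   # drop the padding outputs


class Engine:
    def __init__(self, model_fn, buckets=(1, 2, 4, 8)):
        self.model_fn = model_fn
        self.graph_runner = GraphRunner(model_fn, buckets)

    def forward_batch(self, batch):
        if self.graph_runner.can_replay(batch):
            return self.graph_runner.replay(batch)
        return self.model_fn(batch)                  # eager fallback


def double(xs):
    """Hypothetical 'model': doubles every input."""
    return [2 * x for x in xs]


def demo():
    engine = Engine(double)
    small = engine.forward_batch([1, 2, 3])          # padded to bucket 4, replayed
    large = engine.forward_batch(list(range(10)))    # too big for any bucket
    return small, large, engine.graph_runner.replays
```

The design choice worth noticing is that the caller never sees the padding: the bucket is an internal detail of the runner, which is why the engine's hot path stays three lines long.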

§ 07 · Cut · What's missing, on purpose.

A list of what mini-SGLang deliberately leaves out, with brief notes on why each is "not architecture":

Feature	Why it's cut
25 of 27 attention backends	backends are kernel selections, not architecture — once you understand one backend interface, the others are variants
~50 model architectures	model files are mostly weight loading + forward; pattern transfers from Llama to others
AITER / ROCm / MUSA / NPU paths	hardware portability is a layered concern atop a working CUDA path
Speculative decoding (EAGLE)	throughput optimization, not foundational
Disaggregated prefill/decode (P/D)	distributed-system optimization, not core inference
Structured generation (JSON, regex, EBNF)	logits-mask feature, orthogonal to scheduling
Multimodal (vision / audio)	preprocessor + adapter, doesn't change the serving core
Hierarchical / sparse / Mamba caches	specialized memory pools for specific models
Connectors to vLLM / TensorRT-LLM	interop, not architecture
sgl-kernel as a separate wheel	release-engineering concern, not runtime
Web dashboard / observability stack	operations layer

None of these are unimportant — they're the reasons production SGLang exists. They're just not foundational to "how an LLM inference engine works." mini-SGLang's curatorial line falls exactly at the boundary between architecture and features that ride on top of architecture.

N° 4

What's missing teaches what's optional.

For your own work — multi-agent kernel optimization, AMD inference, anywhere — the question "what can I cut from this system and still call it the system?" is the most powerful design audit you can run. mini-SGLang is that audit performed on full SGLang in public.

When you build your own inference or orchestration system, mentally write the mini version first. Identify what cannot be removed without breaking the central claim. Build that first; layer the rest as optional features. In my experience, production systems that grew this way (designed-mini-first, even if never published) tend to be more maintainable than ones that grew by accretion.

§ 08 · Method · How to actually use both.

A concrete reading-and-reference workflow that combines mini-SGLang and full SGLang. This is the practical takeaway for anyone who needs to work in either codebase:

First pass: read all of mini-SGLang sequentially, in the order from Plate III. ~4-5 hours. Goal: form vocabulary.
Indexing: for each mini file, note which full-SGLang file/directory is its parent. Keep this as a private cheat sheet.
Targeted reads: when you need to touch full SGLang, find the mini counterpart of the target module first. Refresh the simple version (~15 min), then dive into the full version with the simple version's structure in mind.
Diff reading: when full SGLang adds something mini doesn't have (e.g., disaggregation), read it as an add-on, not a re-learn. The add-on attaches to a part of mini that you already understand.
Contribution direction: if you find a cleaner pattern in mini that full doesn't have, it's a potential refactor PR for full SGLang. The RAII handle pattern is one such candidate.
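
The cheat sheet from the indexing step can be as simple as a dictionary. The sketch below seeds it with the three file pairs named in § 01; the full-SGLang entries are bare file names because this entry doesn't record their directories, and the helper names are hypothetical.

```python
from typing import Optional

# mini-SGLang path -> full-SGLang file (pairs from the § 01 comparison table).
MINI_TO_FULL = {
    "scheduler/scheduler.py": "scheduler.py",
    "engine/engine.py": "model_runner.py",
    "kvcache/radix_cache.py": "radix_cache.py",
}


def full_parent(mini_path: str) -> Optional[str]:
    """Full-SGLang file to open after refreshing the given mini file."""
    return MINI_TO_FULL.get(mini_path)


def mini_counterpart(full_file: str) -> Optional[str]:
    """Mini file to re-read before touching the given full-SGLang file."""
    for mini_path, full in MINI_TO_FULL.items():
        if full == full_file:
            return mini_path
    return None   # no mini counterpart: treat the module as an add-on
```

A `None` result from `mini_counterpart` is itself informative: it marks a module to read as an add-on, per the diff-reading step.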

§ 09 · Epilogue · Pedagogy, as code.

What LMSYS did with mini-SGLang is, in a small way, an example for the open-source community at large. Big systems become impenetrable. Documentation tries to bridge the gap and often fails because docs decay faster than code. A second smaller implementation, maintained in lockstep with the main one, is a third path — neither just-code nor just-docs, but code that exists to be read.

If you're an MLSys engineer in 2026, the most valuable afternoon you can spend this week is one spent reading mini-SGLang. It will give you the vocabulary to read other modern inference engines (vLLM, TensorRT-LLM, your own future system) with confidence. And it will quietly shape how you write your own code: smaller, more decomposed, more readable. That's the real shrink.

— Fin.