File N° 004 · Source Reading · with reflections

mini-SGLang.

140× smaller than its parent, by design. A reading of the 5,000-line teaching implementation that maps cleanly onto the core concepts of the 728,000-line production engine, and of what that mapping itself teaches.

2026.05.15 · Jhin Pan · AMD Spring 2026 Intern · Source Reading series, entry № 004

In December 2025 the SGLang team did something rare in open source: they wrote a second, smaller implementation of their own system, not to replace it but to explain it. mini-SGLang is roughly 5,000 lines of Python, ~140× smaller than its parent. It preserves the four-process architecture, RadixAttention, chunked prefill, overlap scheduling, and tensor parallelism. Everything else is gone: 25 of the 27 attention backends, hierarchical caches, disaggregation, speculative decoding (EAGLE), multimodal. The result is a system you can read end-to-end in an afternoon, and one that makes the production engine's 728k lines suddenly legible.

The previous entry in this series read full SGLang end-to-end. I came away admiring the engineering but uneasy about the reading experience — a 4,006-line single-class Scheduler, a 3,607-line ModelRunner, 27 attention backends. mini-SGLang is the project I wish I'd read first.

This entry is structured around that wish. Each section reads a piece of mini-SGLang, and where the design choice illuminates something about the full SGLang — or about software pedagogy in general — there's a margin note or a numbered Reflection callout.

5,000 mini-SGLang lines · 728,969 full SGLang lines · 140× ratio
4 cooperating processes · 2 attention backends · 2 models (Llama, Qwen3)

§ 01 · Accounting · The 140× shrink, accounted.

To know what was kept, count what was cut. Here's the file-level comparison for the modules that appear in both:

| Module / file | full SGLang | mini | ratio |
| --- | --- | --- | --- |
| scheduler.py | 4,006 lines + Mixins | 280 lines + 6 helpers | 14× |
| model_runner.py / engine.py | 3,607 lines | 253 lines | 14× |
| radix_cache.py | 828 lines | 253 lines | 3.3× |
| attention backends | 27 backends | 2 backends | 13× |
| models | ~50 architectures | 2 (Llama, Qwen3) | 25× |
| hardware backends | 4 (gpu/mlx/musa/npu) | 1 (gpu) | |
| whole package | 728,969 LOC | ~5,000 LOC | 140× |

The scheduler's 14× compression is the most informative. Full SGLang's scheduler is a single 4,006-line class with Mixins (SchedulerMlxOverlapMixin, scheduler_dp_attn_mixin, scheduler_output_processor_mixin, scheduler_input_blocker). The mini version splits the same responsibilities into 7 small files: scheduler.py (280 lines, the loop) plus six helpers, cache.py, config.py, decode.py, io.py, prefill.py, table.py. The responsibilities are the same; the cognitive surface is completely different.
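The ratios in the table are plain division, and it is worth redoing them once to see how much rounding they carry. The sketch below uses only the line counts quoted in this entry; the helper name `shrink_ratio` is mine, and the whole-package row uses the rounded ~5,000 figure, so it lands nearer 146 than the headline 140×.

```python
# Line counts copied from the comparison table: (full SGLang, mini-SGLang).
# The whole-package mini figure is the rounded "~5,000", not an exact count.
COUNTS = {
    "scheduler.py": (4006, 280),
    "model_runner.py / engine.py": (3607, 253),
    "radix_cache.py": (828, 253),
    "whole package": (728_969, 5_000),
}


def shrink_ratio(full_lines: int, mini_lines: int) -> float:
    """How many times larger the full file is than its mini counterpart."""
    return full_lines / mini_lines


if __name__ == "__main__":
    for name, (full, mini) in COUNTS.items():
        print(f"{name:30s} {shrink_ratio(full, mini):6.1f}x")
```

The per-file ratios ignore the helper files on the mini side and the Mixins on the full side, so they are best read as order-of-magnitude figures.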

Plate I · The shrink, to scale (file-by-file, proportional rectangles)
[Figure: paired bars, full SGLang vs mini-SGLang, for the same modules drawn to scale: scheduler (4,006 lines, one monolith + Mixins, vs 280 lines + helpers), runner/engine (3,607 vs 253), radix_cache (828 vs 253), attention backends (27 backends, 18,000+ lines, vs 2), models (~50 architectures vs 2), whole package (728,969 vs ~5,000 lines, a 140× ratio).]
Width is proportional to line count, clipped to fit. The mini bars don't look that small in absolute terms because of the necessary minimum-rectangle rendering — but the 140× total-package ratio is no exaggeration. The scheduler and runner/engine shrink the most. The radix cache shrinks least, because it's algorithmic substance — there's a floor on how short a correct radix tree can be.

§ 02 · Skeleton · What's preserved: the bones.

Despite the 140× shrink, mini-SGLang is the same shape as full SGLang. Four cooperating roles (tokenizer, detokenizer, scheduler, engine workers) communicate over ZeroMQ for control and NCCL (via torch.distributed) for tensors.

The mini version has one structural simplification: in full SGLang the scheduler is on every GPU (N schedulers for TP=N, fully symmetric). In mini, the scheduler is the only "manager" process and the workers are pure execution endpoints. This is a real architectural decision, not just code shrinkage — and arguably the cleaner one for understanding.
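To make the message flow concrete, here is a toy version of the tokenizer → scheduler → detokenizer path. It is a sketch under loud assumptions: threads and `queue.Queue` stand in for the real processes and ZeroMQ sockets, the "engine" is a pure-Python function instead of GPU workers, and every name in it (`run_pipeline`, `toy_engine`, `STOP`) is hypothetical rather than taken from either codebase.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def tokenizer(inbox, to_scheduler):
    """Text -> token ids (toy: one id per character)."""
    while (text := inbox.get()) is not STOP:
        to_scheduler.put([ord(c) for c in text])
    to_scheduler.put(STOP)


def scheduler(from_tokenizer, to_detokenizer, engine):
    """The single manager: hands each request to the engine, forwards the result."""
    while (tokens := from_tokenizer.get()) is not STOP:
        to_detokenizer.put(engine(tokens))
    to_detokenizer.put(STOP)


def detokenizer(from_scheduler, outbox):
    """Token ids -> text."""
    while (tokens := from_scheduler.get()) is not STOP:
        outbox.put("".join(chr(t) for t in tokens))
    outbox.put(STOP)


def toy_engine(tokens):
    """Stand-in for the forward pass: upper-cases every character."""
    return [ord(chr(t).upper()) for t in tokens]


def run_pipeline(prompts):
    q_in, q_tok, q_out, q_text = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=tokenizer, args=(q_in, q_tok)),
        threading.Thread(target=scheduler, args=(q_tok, q_out, toy_engine)),
        threading.Thread(target=detokenizer, args=(q_out, q_text)),
    ]
    for stage in stages:
        stage.start()
    for prompt in prompts:
        q_in.put(prompt)
    q_in.put(STOP)
    results = []
    while (text := q_text.get()) is not STOP:
        results.append(text)
    for stage in stages:
        stage.join()
    return results
```

Calling `run_pipeline(["hello", "mini"])` pushes two requests through all three stages in order. The point of the toy is the shape: each role owns one inbox and one outbox, and only the scheduler talks to the engine.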

Plate II · Mirroring topology (four processes in both engines)
[Figure: the two engines side by side. full SGLang: TokenizerManager (HTTP → tokens), DetokenizerMgr (tokens → text), Scheduler × N, one per GPU (4,006 lines, monolith + Mixins; event_loop_normal, event_loop_overlap), ModelRunner workers (TP/EP; 3,607 lines, 27 attention backends). mini-SGLang: Tokenizer worker (HTTP → tokens), Detokenizer (tokens → text), Scheduler × 1 manager (280 lines plus helper files; run_forever(), single overlap loop), Engine workers (TP; 253 lines, 2 attention backends). Both sides use ZMQ + NCCL.]
Same four-role topology, same wire protocols, same conceptual roles. The mini version pushes the scheduler to single-manager rather than per-GPU — a real simplification, but the responsibilities map. This is the architectural skeleton you carry from mini into reading the full code.
N° 1

Small is a lens, not a constraint.

It's tempting to read mini-SGLang as a "stepping stone toward the real thing." That framing undersells it. The 5,000-line version isn't lesser — it's a different artifact serving a different purpose. The production engine optimizes for throughput, model coverage, hardware portability, and feature breadth. The teaching engine optimizes for read time per concept. Both are correct; both are necessary.

The cultural innovation in publishing mini-SGLang isn't the code; it's the decision to write code whose primary user is a reader, not an operator. That decision is rare in open source, and historically it has come from individuals (Karpathy's nanoGPT, Kaiming He's MAE) rather than from institutional teams. LMSYS doing this at scale is, quietly, a precedent.

§ 03 · Path · A guided reading order.

Because mini is small enough to read sequentially, there's a real "correct order." Here's the path I'd recommend, in eight stops:

  1. __main__.py (3 lines) — just calls launch_server(). The thinnest possible entry.
  2. server/ — spawns the four processes, wires their ZMQ sockets.
  3. tokenizer/ — wrap HF tokenizer; simplest worker, helps you read ZMQ patterns.
  4. core.py — defines Req, Batch, SamplingParams. These are the value objects that flow through all later modules.
  5. kvcache/radix_cache.py (253 lines) — algorithmic substance. RAII handle pattern. Read this carefully.
  6. scheduler/scheduler.py (280 lines) + the 6 helpers — the loop. Once kvcache is clear, the schedule decisions are obvious.
  7. engine/engine.py (253 lines) — the GPU side. forward_batch dispatches to attention + model.
  8. attention/ + models/ + layers/ — the bottom of the stack. Read after you know who calls what.
Plate III · A reading walk (8 stops through the package)
[Figure: the eight stops in order, each sized to fit in working memory, with rough reading times: 1 __main__ (3 lines) ~10 min · 2 server/ ~20 min · 3 tokenizer/ ~15 min · 4 core.py ~20 min · 5 kvcache ~60 min ★ · 6 scheduler ~60 min ★ · 7 engine/ ~40 min · 8 attn/models ~45 min. About 4-5 hours in total; stops 5 and 6 are where the substance lives.]
The walk is roughly 4-5 hours from start to finish. kvcache and scheduler are the two stops where you should slow down — these contain the algorithmic substance that the rest of the system arranges around.

§ 04 · Scheduler · 280 lines vs four thousand.

The full SGLang scheduler is the system's heart and also its largest single file. The mini version is the same heart, drawn in 280 lines plus six focused helpers:

scheduler.py    # 280 — class Scheduler(SchedulerIOMixin), run_forever()
cache.py        # CacheManager — coordinates with RadixCache
config.py       # SchedulerConfig — knobs
decode.py       # DecodeManager — Phase A: continue decoding
prefill.py      # PrefillManager + ChunkedReq — Phase B: admit + chunked prefill
io.py           # SchedulerIOMixin — ZMQ I/O
table.py        # TableManager — per-request block tables

Each file has one concern. decode.py handles "what to do with already-running requests"; prefill.py handles "what new requests can we admit, possibly chunking"; cache.py mediates with the radix cache; table.py tracks block tables. The main scheduler.py just orchestrates these. This is what 4,006 lines look like when you stop optimizing for production and start optimizing for the reader.
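The division of labor above can be sketched in a few dozen lines. This is a toy, not mini-SGLang's code: the class names `PrefillManager`, `DecodeManager` and `Req` are borrowed from the file listing and the reading order, but every field, method and policy here is an assumption made for illustration. In particular, the choice to run prefill work before decode work in `step()` is my simplification, and real chunking works on KV pages rather than bare token counts.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    rid: int
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0   # prompt tokens already processed
    generated: int = 0   # output tokens produced so far


class PrefillManager:
    """Admits waiting requests, chunking prompts to fit a per-step token budget."""

    def __init__(self, chunk_budget: int):
        self.chunk_budget = chunk_budget
        self.pending = deque()

    def add(self, req: Req) -> None:
        self.pending.append(req)

    def next_batch(self):
        budget, batch = self.chunk_budget, []
        while self.pending and budget > 0:
            req = self.pending[0]
            take = min(budget, req.prompt_len - req.prefilled)
            req.prefilled += take
            budget -= take
            batch.append((req, take))
            if req.prefilled == req.prompt_len:
                self.pending.popleft()  # fully prefilled, leaves the queue
        return batch


class DecodeManager:
    """Tracks requests that are already running and advances them one token."""

    def __init__(self):
        self.running = []

    def add(self, req: Req) -> None:
        self.running.append(req)

    def next_batch(self):
        return list(self.running)

    def finish_step(self) -> None:
        for req in self.running:
            req.generated += 1
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]


class Scheduler:
    """Only orchestrates: the decisions live in the two managers."""

    def __init__(self, chunk_budget: int):
        self.prefill = PrefillManager(chunk_budget)
        self.decode = DecodeManager()

    def step(self):
        batch = self.prefill.next_batch()
        if batch:
            for req, _ in batch:
                if req.prefilled == req.prompt_len:
                    self.decode.add(req)
            return ("prefill", [(req.rid, n) for req, n in batch])
        running = self.decode.next_batch()
        if running:
            self.decode.finish_step()
            return ("decode", [req.rid for req in running])
        return ("idle", [])
```

With a budget of 8 tokens, a 10-token prompt is split across two prefill steps, and the second step has room left to admit a short request alongside the tail of the first. That is the whole idea of chunked prefill in miniature.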

Plate IV · Same concerns, different shape (Scheduler structure: full vs mini)
[Figure: left, full SGLang as one class: class Scheduler in scheduler.py (4,006 lines) plus SchedulerMlxOverlapMixin, scheduler_dp_attn_mixin, scheduler_output_processor_mixin and scheduler_input_blocker, holding event_loop_normal(), event_loop_overlap(), process_input_requests(), get_next_batch_to_run(), update_running_batch() and ~80 other methods. Right, mini-SGLang as seven files under scheduler/, one per concern: scheduler.py (280 lines, the loop), io.py (ZMQ in/out), prefill.py (admit + chunked), decode.py (continue running), cache.py (radix interface), table.py (block tables), config.py (all knobs).]
Both engines need to manage prefill, decode, cache, block tables, IO, config. The production engine puts them in one class with Mixins; the teaching engine puts each in its own file. For new contributors, the right-hand layout is meaningfully easier to onboard onto — even if both ultimately do the same work.
N° 2

Splitting is the lesson.

The single most teaching-rich move in mini-SGLang is that scheduler split. It shows what the full SGLang's scheduler could look like if cognitive load were optimized over locality. Mixins are clever because they avoid file-explosion, but they also scatter behavior across inheritance chains in ways grep can't easily reveal. Reading a 4,006-line class with four Mixins requires the reader to mentally maintain a partial method-resolution order — a tax that compounds with codebase age.
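The method-resolution tax is easy to demonstrate in isolation. The classes below are invented for this example and have nothing to do with SGLang's real Mixins; the sketch only shows why "which `handle` runs?" needs the MRO in the mixin layout and needs nothing but the file in front of you in the composed layout.

```python
class OutputMixin:
    def handle(self):
        return "output mixin"


class InputMixin:
    def handle(self):
        return "input mixin"


class MixinScheduler(InputMixin, OutputMixin):
    """No handle() defined here: the reader must walk the MRO to find it."""


class IOHandler:
    def handle(self):
        return "io handler"


class ComposedScheduler:
    """Same behavior by composition: the delegation is visible at the call site."""

    def __init__(self):
        self.io = IOHandler()

    def handle(self):
        return self.io.handle()


def defining_class(cls, method_name):
    """Name of the first class in the MRO that actually defines method_name."""
    for klass in cls.__mro__:
        if method_name in vars(klass):
            return klass.__name__
    return None
```

`defining_class(MixinScheduler, "handle")` has to search the inheritance chain and returns `InputMixin`, which is decided by base-class order alone. For `ComposedScheduler` the answer is the class itself, and the hop to `IOHandler` is an explicit attribute access that grep can follow.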

For your own systems: when a class crosses roughly 800 lines, ask whether you're really building one thing or whether you're papering over a missed decomposition. The teaching version of your future system will probably split it — so the production version should consider splitting it sooner.

§ 05 · RadixCache · RadixCache, quietly refactored.

The 253-line kvcache/radix_cache.py in mini-SGLang isn't just shorter than its 828-line parent — it has a quietly better API. Where full SGLang requires callers to manually pair inc_lock_ref() and dec_lock_ref(), mini introduces a single lock_handle() that returns a RadixCacheHandle dataclass:

from __future__ import annotations   # lets the annotations name RadixTreeNode without importing it
from dataclasses import dataclass

@dataclass(frozen=True)
class RadixCacheHandle:
    node: RadixTreeNode
    matched_indices: list[int]
    # → released automatically when handle goes out of scope

class RadixPrefixCache:
    def lock_handle(self, key) -> RadixCacheHandle: ...
    def match_prefix(self, key) -> list[int]: ...
    def insert_prefix(self, key, value) -> None: ...
    def evict(self, num_tokens: int) -> int: ...
    def reset(self) -> None: ...                          # NotImplementedError
    def check_integrity(self) -> None: ...                # no-op

The handle is a small, idiomatic Python pattern: a frozen dataclass that ties the lock's lifetime to one object instead of to a pair of calls the caller must remember to balance. That makes a leaked reference much harder to write. Full SGLang couldn't introduce this without breaking many internal callers; mini, having no callers to protect, ships the cleaner design.
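Python has no destructors you can rely on for timing, so one common way to get scope-bound release is a context manager. The sketch below is my own illustration of that idea, not mini-SGLang's implementation: `ToyCache`, `Node` and the `lock_ref` counter are hypothetical, and in the listing above `lock_handle` returns the handle directly rather than acting as a context manager.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Node:
    lock_ref: int = 0  # how many live handles pin this node


@dataclass(frozen=True)
class Handle:
    node: Node
    matched_indices: tuple


class ToyCache:
    def __init__(self):
        self.node = Node()

    @contextmanager
    def lock_handle(self, matched_indices):
        """Pin the node for the duration of the with-block, then unpin it."""
        self.node.lock_ref += 1
        try:
            yield Handle(self.node, tuple(matched_indices))
        finally:
            # Runs on normal exit and on exceptions alike, so the
            # increment above can never be left unbalanced.
            self.node.lock_ref -= 1
```

Inside `with cache.lock_handle([1, 2]) as h:` the count is 1; after the block it is back to 0, even if the body raised. Compare that with a manual increment/decrement pair, where an early return or exception between the two calls leaves the node pinned forever.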

N° 3

The teaching version is sometimes better.

You'd expect a minimal version to be strictly a subset of its parent: fewer features, otherwise the same. But mini-SGLang occasionally improves on full SGLang: the RAII handle for KV cache references, the seven-file scheduler split, the reduction from four Mixins to a single I/O mixin, the single overlap loop instead of dual normal/overlap loops. These are refactors the production version can't take cheaply, because they'd break callers.

So mini-SGLang functions as a shadow design — a working sketch of what full SGLang could become given a green field. That's another reason to read it: not just to learn the current system, but to glimpse the next version of it.
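For readers who have not met a prefix cache before, here is what the `match_prefix` / `insert_prefix` pair from the listing earlier in this section does, reduced to a toy. Assumptions, stated plainly: this is an uncompressed trie with one token per edge rather than a real radix tree, there is no eviction or locking, and the class and field names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # token id -> TrieNode
        self.kv_index = None  # where this token's KV entry lives


class ToyPrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert_prefix(self, tokens, kv_indices):
        """Record the KV index of each token, reusing nodes that already exist."""
        node = self.root
        for tok, idx in zip(tokens, kv_indices):
            node = node.children.setdefault(tok, TrieNode())
            if node.kv_index is None:
                node.kv_index = idx  # first writer wins; later inserts reuse it

    def match_prefix(self, tokens):
        """Return the KV indices for the longest cached prefix of tokens."""
        node, matched = self.root, []
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            matched.append(node.kv_index)
        return matched
```

After inserting `[1, 2, 3]`, a request that starts `[1, 2, 9]` matches two tokens and only needs to prefill from the third. A real radix tree stores whole token runs on each edge and splits them on divergence, which is where most of the 253 lines go.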

§ 06 · Engine · The Engine, in 253 lines.

engine/engine.py is mini's equivalent of full SGLang's 3,607-line model_runner.py. The shrinkage ratio is the same as the scheduler (14×), and the technique is the same: pull related concerns into sibling files.

from __future__ import annotations   # EngineConfig / ForwardOutput are named, not imported, in this outline

class Engine:
    def __init__(self, config: EngineConfig): ...
    def _init_communication(self): ...              # NCCL groups
    def _load_weight_state_dict(self): ...          # HF or generated
    def _determine_num_pages(self): ...             # KV page budget
    def _sync_get_memory(self): ...                 # all-reduce free mem
    def forward_batch(self, batch) -> ForwardOutput:  # ★ the hot path
        if self.graph_runner.can_replay(batch):
            return self.graph_runner.replay(batch)
        return self.model.forward(batch)
    def shutdown(self): ...

# graph.py — class GraphRunner (CUDA graph capture per shape bucket)
# sample.py — class Sampler  (top-k, top-p, temperature)
# config.py — EngineConfig

Compare this to full SGLang's model_runner.py: 3,607 lines coordinating model forward, attention backend selection (out of 27), KV cache writes, multiple CUDA graph capture strategies (graph/piecewise/breakable/cpu), LoRA, speculative decoding, EP MoE, mixed-precision paths. Every additional concern is real and necessary in production — but it makes the file unreadable as a learning artifact.
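The hot path in the Engine outline, replay a captured graph when the batch fits and otherwise fall back to an eager forward, can be modelled without a GPU. The sketch below is an assumption-heavy toy: "graphs" are just a sorted list of batch-size buckets, the forward pass adds 1 to each input, and `ToyGraphRunner`, `ToyEngine` and `pick_bucket` are hypothetical names. Only `can_replay`, `replay` and `forward_batch` echo names from the outline, and their bodies here are mine.

```python
import bisect


class ToyGraphRunner:
    """Pretends one graph was captured per batch-size bucket."""

    def __init__(self, bucket_sizes):
        self.buckets = sorted(bucket_sizes)
        self.replays = 0

    def pick_bucket(self, batch_size):
        """Smallest captured bucket that can hold batch_size, or None."""
        i = bisect.bisect_left(self.buckets, batch_size)
        return self.buckets[i] if i < len(self.buckets) else None

    def can_replay(self, batch):
        return bool(batch) and self.pick_bucket(len(batch)) is not None

    def replay(self, batch):
        bucket = self.pick_bucket(len(batch))
        padded = batch + [0] * (bucket - len(batch))  # pad up to the captured shape
        self.replays += 1
        return [x + 1 for x in padded][: len(batch)]  # drop the padding rows


class ToyEngine:
    def __init__(self, graph_runner):
        self.graph_runner = graph_runner
        self.eager_calls = 0

    def forward_batch(self, batch):
        if self.graph_runner.can_replay(batch):
            return self.graph_runner.replay(batch)
        self.eager_calls += 1
        return [x + 1 for x in batch]  # eager fallback for shapes never captured
```

With buckets `[1, 2, 4, 8]`, a batch of 3 is padded to 4 and replayed, while a batch of 9 exceeds every bucket and takes the eager path. Both paths must return the same values for the real rows, which is the invariant worth testing in any graph-replay scheme.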

§ 07 · Cut · What's missing, on purpose.

A list of what mini-SGLang deliberately leaves out, with brief notes on why each is "not architecture":

| Feature | Why it's cut |
| --- | --- |
| 25 of 27 attention backends | backends are kernel selections, not architecture; once you understand one backend interface, the others are variants |
| ~50 model architectures | model files are mostly weight loading + forward; pattern transfers from Llama to others |
| AITER / ROCm / MUSA / NPU paths | hardware portability is a layered concern atop a working CUDA path |
| Speculative decoding (EAGLE) | throughput optimization, not foundational |
| Disaggregated prefill/decode (P/D) | distributed-system optimization, not core inference |
| Structured generation (JSON, regex, EBNF) | logits-mask feature, orthogonal to scheduling |
| Multimodal (vision / audio) | preprocessor + adapter, doesn't change the serving core |
| Hierarchical / sparse / Mamba caches | specialized memory pools for specific models |
| Connectors to vLLM / TensorRT-LLM | interop, not architecture |
| sgl-kernel as a separate wheel | release-engineering concern, not runtime |
| Web dashboard / observability stack | operations layer |

None of these are unimportant — they're the reasons production SGLang exists. They're just not foundational to "how an LLM inference engine works." mini-SGLang's curatorial line falls exactly at the boundary between architecture and features that ride on top of architecture.

N° 4

What's missing teaches what's optional.

For your own work — multi-agent kernel optimization, AMD inference, anywhere — the question "what can I cut from this system and still call it the system?" is the most powerful design audit you can run. mini-SGLang is that audit performed on full SGLang in public.

When you build your own inference or orchestration system, mentally write the mini version first. Identify what cannot be removed without breaking the central claim. Build that first; layer the rest as optional features. Production systems that grew this way (designed-mini-first, even if the mini version was never published) tend to be more maintainable than ones that grew by accretion.

§ 08 · Method · How to actually use both.

A concrete reading-and-reference workflow that combines mini-SGLang and full SGLang. This is the practical takeaway for anyone who needs to work in either codebase:

  1. First pass: read all of mini-SGLang sequentially, in the order from Plate III. ~5 hours. Goal: form vocabulary.
  2. Indexing: for each mini file, note which full-SGLang file/directory is its parent. Keep this as a private cheat sheet.
  3. Targeted reads: when you need to touch full SGLang, find the mini counterpart of the target module first. Refresh the simple version (~15 min), then dive into the full version with the simple version's structure in mind.
  4. Diff reading: when full SGLang adds something mini doesn't have (e.g., disaggregation), read it as an add-on, not a re-learn. The add-on attaches to a part of mini that you already understand.
  5. Contribution direction: if you find a cleaner pattern in mini that full doesn't have, it's a potential refactor PR for full SGLang. The RAII handle pattern is one such candidate.
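The cheat sheet from step 2 can be as small as a dictionary. The version below contains only the three pairings this entry names explicitly; the file names are written as they appear in the text (full-SGLang directories are not given here, so none are invented), and the helper `mini_counterpart` is a hypothetical convenience for step 3.

```python
from typing import Optional

# mini file -> full-SGLang file, restricted to pairings stated in this entry.
MINI_TO_FULL = {
    "scheduler/scheduler.py": "scheduler.py",
    "engine/engine.py": "model_runner.py",
    "kvcache/radix_cache.py": "radix_cache.py",
}

# Reverse index for step 3: start from the full file you need to touch.
FULL_TO_MINI = {full: mini for mini, full in MINI_TO_FULL.items()}


def mini_counterpart(full_file: str) -> Optional[str]:
    """Which mini file to re-read before opening full_file, if one is known."""
    return FULL_TO_MINI.get(full_file)
```

A `None` result is itself useful: it marks a full-SGLang module with no mini counterpart, which by step 4 should be read as an add-on.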

§ 09 · Epilogue · Pedagogy, as code.

What LMSYS did with mini-SGLang is, in a small way, an example for the open-source community at large. Big systems become impenetrable. Documentation tries to bridge the gap and often fails because docs decay faster than code. A second smaller implementation, maintained in lockstep with the main one, is a third path — neither just-code nor just-docs, but code that exists to be read.

If you're an MLSys engineer in 2026, the most valuable afternoon you can spend this week is reading mini-SGLang. It will give you the vocabulary to read other modern inference engines (vLLM, TensorRT-LLM, your own future system) with confidence. And it will quietly shape how you write your own code: smaller, more decomposed, more readable. That's the real shrink.

— Fin.