The Library

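A categorized index of everything I have written. Code readings and paper readings are self-contained HTML deep dives under /sources; tutorials and blog posts are long-form bilingual (Chinese and English) blog articles. Four shelves, one map.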

9 code readings

3 paper readings

6 tutorials

4 blog posts

Code

Code readings · source-level readings of real codebases, each a self-contained HTML deep dive

9 deep dives

Source Reading No. 001

SkyPilot

Cloud orchestration as an optimization problem: resources, regions, spot instances, Kubernetes, jobs, and the control loops that make them practical.

cloud · infra · scheduler

Deep dive →

Source Reading No. 002

SGLang

A serving-system read focused on runtime structure, request scheduling, cache management, routing, and the boundary between Python control and fast inference paths.

LLM serving · runtime · cache

Deep dive →

Source Reading No. 003

vLLM

The wrap-up of the initial serving trilogy: scheduler pressure, PagedAttention, KV memory, batching, and how design choices compare against SGLang.

PagedAttention · KV cache · scheduler

Deep dive →

Source Reading No. 004

mini-SGLang

A smaller codebase read as a teaching artifact: what a minimal implementation makes explicit, what it hides, and how to learn from that compression.

minimal · teaching · serving

Deep dive →

Source Reading No. 005

GCNasm

A descent into hand-written AMD GPU assembly: CDNA3 idioms, occupancy, memory movement, instruction selection, and the optimization patterns behind fast kernels.

AMD · assembly · kernel

Deep dive →

Source Reading No. 006

FlyDSL

A Python DSL with typed MLIR underneath: layout algebra, copy and MMA atoms, compiler boundaries, and what it takes to express production GEMM from Python.

DSL · MLIR · layout

Deep dive →

Source Reading No. 007

AITER MoE Tuner

A tuner-first reading of MoE GEMM search: config space, benchmarking discipline, hardware assumptions, and why tuning code is often kernel knowledge in disguise.

MoE · GEMM · autotune

Deep dive →

Source Reading No. 008

rocprof Viewer

A field guide to AMD instruction-level profiling: rocprofv3 capture, Advanced Thread Trace, source mapping, and how to read the viewer panels without fooling yourself.

profiling · ATT · ROCm

Deep dive →

Source Reading No. 009

Codex Goal Mode

A source-level reading of goal mode as a thread-scoped state machine: persisted goals, model tools, runtime continuation, token budget accounting, and authority boundaries.

Codex · state machine · runtime

Deep dive →

Paper

Paper readings · close readings of research papers, rebuilt as HTML deep dives

3 deep dives

Paper Reading No. 001

Polar

Agentic RL without rewriting the harness: proxying LLM API calls, asynchronous staging, prefix merging, and what SWE-Bench tells us about scalable agent training.

agent RL · proxy · SWE-Bench

Deep dive →

Paper Reading No. 002

Kernel Design Agents

A close read of agentic GPU kernel development: plan-execute-verify loops, KernelWiki, ncu-guided debugging, autotuning, and reward-hacking failure modes.

agents · GPU kernels · autotune

Deep dive →

Paper Reading No. 003

Linear Layouts

One binary matrix over GF(2) as the organizing principle for tensor layouts: conversion, broadcast, swizzling, slicing, and robust code generation.

compiler · layout · GF(2)

Deep dive →

Tutorial

Tutorials · first-principles primers and guides: one rich HTML primer, the rest bilingual blog posts

6 pieces

Primer · HTML · compiler stack

From Python to Silicon

A systems primer for the path from Python to GPU execution: compiler layers, kernel boundaries, IR, runtime dispatch, and what each layer is responsible for.

compiler · MLIR/LLVM · ISA

Deep dive → EN / ZH

Primer · attention

Attention Mechanisms

Full, sparse, and linear attention from first principles, up through DeepSeek NSA and Gated Linear Attention, with the tradeoffs that decide each one.

attention · sparse · linear

EN / ZH

Primer · inference

KV Cache & Model Weights

The first thing to understand before optimizing inference: what KV cache is, how it differs from model weights, and how each scales with sequence and batch.

inference · KV cache · LLM

EN / ZH

Primer · GPU memory

LLM GPU Memory Calculation

How to actually compute LLM memory on a GPU: the components, worked 7B/70B examples, and how DP / TP / PP / EP and ZeRO change the arithmetic.

GPU · memory · parallelism

EN / ZH

Guide · post-training

SFT & RL Training Guide

A first-principles guide to SFT and RL post-training: loss and label masking, dataset construction, hyperparameters, RLHF, and the common pitfalls.

SFT · RL · RLHF

EN / ZH

Primer · transformer

Transformer Deep Dive

The Transformer rebuilt from three angles at once: the math, runnable PyTorch, and the design rationale behind self-attention, LayerNorm, and the MLP.

transformer · math · code

EN / ZH

Blog

Blog posts · original writing: benchmarks, framework comparisons, and project notes

4 posts

Benchmark · spec decoding

Qwen3-Coder × EAGLE3

A measured benchmark of EAGLE3 speculative decoding on Qwen3-Coder-30B-A3B: where the 1.87× speedup comes from and why code generation benefits most.

spec decoding · benchmark · SGLang

EN / ZH

Comparison · RL frameworks

NeMo-RL vs slime

A working comparison of two RL post-training frameworks (algorithms, engineering quality, MoE support, and ROCm fit), with a reasoned pick for MI300X / MI355X.

RL · training · framework

EN / ZH

Project · RL · kernels

TritonForge

Building a server-based, multi-turn RL system that generates Triton kernels across NVIDIA and AMD: architecture, SFT+RL methodology, results, and roadmap.

RL · Triton · kernel

EN / ZH

Note · FlyDSL

BasisAttr · beneath Layout

A follow-up note beneath the FlyDSL layout algebra: what BasisAttr and Fly_Basis are, why layouts need them, and where to start completing the surface.

FlyDSL · MLIR · layout

EN / ZH

Code & Paper open as self-contained HTML deep dives (each carries its own EN / ZH toggle). Tutorial & Blog open as bilingual blog pages.