The Library

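A categorized index that gathers everything I have written. Code and Paper readings are self-contained HTML deep dives under /sources; Tutorial and Blog pieces are long-form blog posts in both Chinese and English. Four shelves, one map.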

[Figure: Four shelves. A diagram of four labeled shelves (code, paper, tutorial, and blog) feeding into one index, "/sources · the library". Code: 9, Paper: 3, Tutorial: 6, Blog: 4.]
9 code readings
3 paper readings
6 tutorials
4 blog posts
01

Code

Intensive code readings · source-level readings of real codebases, each a self-contained HTML deep dive
9 deep dives
Source Reading · No. 001

SkyPilot

Cloud orchestration as an optimization problem: resources, regions, spot instances, Kubernetes, jobs, and the control loops that make them practical.

cloud · infra · scheduler
Source Reading · No. 002

SGLang

A serving-system read focused on runtime structure, request scheduling, cache management, routing, and the boundary between Python control and fast inference paths.

LLM serving · runtime · cache
Source Reading · No. 003

vLLM

The wrap-up of the initial serving trilogy: scheduler pressure, PagedAttention, KV memory, batching, and how design choices compare against SGLang.

PagedAttention · KV cache · scheduler
Source Reading · No. 004

mini-SGLang

A smaller codebase read as a teaching artifact: what a minimal implementation makes explicit, what it hides, and how to learn from that compression.

minimal · teaching · serving
Source Reading · No. 005

GCNasm

A descent into hand-written AMD GPU assembly: CDNA3 idioms, occupancy, memory movement, instruction selection, and the optimization patterns behind fast kernels.

AMD · assembly · kernel
Source Reading · No. 006

FlyDSL

A Python DSL with typed MLIR underneath: layout algebra, copy and MMA atoms, compiler boundaries, and what it takes to express production GEMM from Python.

DSL · MLIR · layout
Source Reading · No. 007

AITER MoE Tuner

A tuner-first reading of MoE GEMM search: config space, benchmarking discipline, hardware assumptions, and why tuning code is often kernel knowledge in disguise.

MoE · GEMM · autotune
Source Reading · No. 008

rocprof Viewer

A field guide to AMD instruction-level profiling: rocprofv3 capture, Advanced Thread Trace, source mapping, and how to read the viewer panels without fooling yourself.

profiling · ATT · ROCm
Source Reading · No. 009

Codex Goal Mode

A source-level reading of goal mode as a thread-scoped state machine: persisted goals, model tools, runtime continuation, token budget accounting, and authority boundaries.

Codex · state machine · runtime
02

Paper

Intensive paper readings · close readings of research papers, rebuilt as HTML deep dives
3 deep dives
Paper Reading · No. 001

Polar

Agentic RL without rewriting the harness: proxying LLM API calls, asynchronous staging, prefix merging, and what SWE-Bench tells us about scalable agent training.

agent RL · proxy · SWE-Bench
Paper Reading · No. 002

Kernel Design Agents

A close read of agentic GPU kernel development: plan-execute-verify loops, KernelWiki, ncu-guided debugging, autotuning, and reward-hacking failure modes.

agents · GPU kernels · autotune
Paper Reading · No. 003

Linear Layouts

One binary matrix over GF(2) as the organizing principle for tensor layouts: conversion, broadcast, swizzling, slicing, and robust code generation.

compiler · layout · GF(2)
03

Tutorial

Tutorials · first-principles primers and guides: one rich HTML primer, the rest bilingual blog posts
6 pieces
Primer · HTML · compiler stack

From Python to Silicon

A systems primer for the path from Python to GPU execution: compiler layers, kernel boundaries, IR, runtime dispatch, and what each layer is responsible for.

compiler · MLIR/LLVM · ISA
Primer · attention

Attention Mechanisms

Full, Sparse, and Linear attention from first principles — up through DeepSeek NSA and Gated Linear Attention, with the tradeoffs that decide each one.

attention · sparse · linear
Primer · inference

KV Cache & Model Weights

The first thing to understand before optimizing inference: what KV cache is, how it differs from model weights, and how each scales with sequence and batch.

inference · KV cache · LLM
Primer · GPU memory

LLM GPU Memory Calculation

How to actually compute LLM memory on a GPU — the components, worked 7B/70B examples, and how DP / TP / PP / EP and ZeRO change the arithmetic.

GPU · memory · parallelism
Guide · post-training

SFT & RL Training Guide

A first-principles guide to SFT and RL post-training: loss and label masking, dataset construction, hyperparameters, RLHF, and the common pitfalls.

SFT · RL · RLHF
Primer · transformer

Transformer Deep Dive

The Transformer rebuilt from three angles at once — the math, runnable PyTorch, and the design rationale behind self-attention, LayerNorm, and the MLP.

transformer · math · code
04

Blog

Blog · original writing: benchmarks, framework comparisons, and project notes
4 posts
Benchmark · spec decoding

Qwen3-Coder × EAGLE3

A measured benchmark of EAGLE3 speculative decoding on Qwen3-Coder-30B-A3B — where the 1.87× speedup comes from and why code generation benefits most.

spec decoding · benchmark · SGLang
Comparison · RL frameworks

NeMo-RL vs slime

A working comparison of two RL post-training frameworks — algorithms, engineering quality, MoE support, and ROCm fit — with a reasoned pick for MI300X / MI355X.

RL · training · framework
Project · RL · kernels

TritonForge

Building a server-based, multi-turn RL system that generates Triton kernels across NVIDIA and AMD — architecture, SFT+RL methodology, results, and roadmap.

RL · Triton · kernel
Note · FlyDSL

BasisAttr · beneath Layout

A follow-up note beneath the FlyDSL layout algebra: what BasisAttr and Fly_Basis are, why layouts need them, and where to start completing the surface.

FlyDSL · MLIR · layout

Code & Paper open as self-contained HTML deep dives (each carries its own EN / ZH toggle). Tutorial & Blog open as bilingual blog pages. Search across everything, or filter by shelf. Try: MLIR, ATT, MoE, PagedAttention, agent RL, GF(2), RLHF, Codex.