SkyPilot
Cloud orchestration as an optimization problem: resources, regions, spot instances, Kubernetes, jobs, and the control loops that make them practical.
The Library · 文库
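A categorized index that collects everything I have written. Code reads and paper reads are self-contained HTML deep dives under /sources; tutorials and blog posts are long-form bilingual (Chinese and English) blog articles. Four shelves, one map.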
A serving-system read focused on runtime structure, request scheduling, cache management, routing, and the boundary between Python control and fast inference paths.
The wrap-up of the initial serving trilogy: scheduler pressure, PagedAttention, KV memory, batching, and how design choices compare against SGLang.
A smaller codebase read as a teaching artifact: what a minimal implementation makes explicit, what it hides, and how to learn from that compression.
A descent into hand-written AMD GPU assembly: CDNA3 idioms, occupancy, memory movement, instruction selection, and the optimization patterns behind fast kernels.
A Python DSL with typed MLIR underneath: layout algebra, copy and MMA atoms, compiler boundaries, and what it takes to express production GEMM from Python.
A tuner-first reading of MoE GEMM search: config space, benchmarking discipline, hardware assumptions, and why tuning code is often kernel knowledge in disguise.
A field guide to AMD instruction-level profiling: rocprofv3 capture, Advanced Thread Trace, source mapping, and how to read the viewer panels without fooling yourself.
A source-level reading of goal mode as a thread-scoped state machine: persisted goals, model tools, runtime continuation, token budget accounting, and authority boundaries.
Agentic RL without rewriting the harness: proxying LLM API calls, asynchronous staging, prefix merging, and what SWE-Bench tells us about scalable agent training.
A close read of agentic GPU kernel development: plan-execute-verify loops, KernelWiki, ncu-guided debugging, autotuning, and reward-hacking failure modes.
One binary matrix over GF(2) as the organizing principle for tensor layouts: conversion, broadcast, swizzling, slicing, and robust code generation.
A systems primer for the path from Python to GPU execution: compiler layers, kernel boundaries, IR, runtime dispatch, and what each layer is responsible for.
Full, Sparse, and Linear attention from first principles — up through DeepSeek NSA and Gated Linear Attention, with the tradeoffs that decide each one.
The first thing to understand before optimizing inference: what KV cache is, how it differs from model weights, and how each scales with sequence and batch.
How to actually compute LLM memory on a GPU — the components, worked 7B/70B examples, and how DP / TP / PP / EP and ZeRO change the arithmetic.
A first-principles guide to SFT and RL post-training: loss and label masking, dataset construction, hyperparameters, RLHF, and the common pitfalls.
The Transformer rebuilt from three angles at once — the math, runnable PyTorch, and the design rationale behind self-attention, LayerNorm, and the MLP.
A measured benchmark of EAGLE3 speculative decoding on Qwen3-Coder-30B-A3B — where the 1.87× speedup comes from and why code generation benefits most.
A working comparison of two RL post-training frameworks — algorithms, engineering quality, MoE support, and ROCm fit — with a reasoned pick for MI300X / MI355X.
Building a server-based, multi-turn RL system that generates Triton kernels across NVIDIA and AMD — architecture, SFT+RL methodology, results, and roadmap.
A follow-up note beneath the FlyDSL layout algebra: what BasisAttr and Fly_Basis are, why layouts need them, and where to start completing the surface.
Code & Paper open as self-contained HTML deep dives (each carries its own EN / ZH toggle). Tutorial & Blog open as bilingual blog pages. Search across everything, or filter by shelf. Try: MLIR, ATT, MoE, PagedAttention, agent RL, GF(2), RLHF, Codex.