You can write production ML systems for years without ever quite knowing what an IR is. You import torch, you call torch.matmul, the GPU lights up, and a few milliseconds later there is a tensor on the other side. Somewhere between your Python file and that tensor sit at least eight layers of software, three forms of intermediate representation, two compiler frameworks, one device driver, and a chip that speaks an instruction set someone in Bangalore or Sunnyvale spent two years specifying. Most of the time you do not need to know any of this. But every once in a while something underneath leaks — a kernel runs slower than it should, a new accelerator does not have a backend, a wheel built for CUDA 12.4 will not load on CUDA 12.6 — and the abstraction stops being free.
This essay is for the working ML engineer who hit one of those leaks and realized they could not quite name the pieces. It is not a textbook — Hennessy & Patterson is still on the shelf — but it is the explanation I wish someone had given me before I had to read Hennessy & Patterson. Each chapter takes one word from the stack (IR, MLIR, LLVM, ISA, FFI), shows you the smallest concrete example I could find, and explains why people in this field keep talking about it.
If you only have twenty minutes, read § 1 (the stack) and § 7 (one matmul, all the way down). Those two together give the mental model. The middle five chapters expand each layer in turn — you can skim them and return on demand.
A note on bias. I work on AMD GPU kernels through MLIR, so when I need a concrete example I will reach for FlyDSL, ROCDL, MFMA, CDNA — the AMD vocabulary I touch every day. The shape of the argument is identical for NVIDIA, Intel, ARM, RISC-V. Wherever vendor-specific names appear I have tried to flag them.
§ 1 · The map before the territory: The vertical stack
Start with the question: what actually happens between torch.matmul(A, B) and a transistor switching? The answer is a vertical descent through layers, each translating from a slightly more abstract description into a slightly more concrete one. The plate on the next page draws the full descent on one page; this section is the verbal walk-through.
A walking tour of the layers
Python source. What you write. Human-readable, dynamic, garbage-collected.
CPython bytecode. Python compiles each function to bytecode (dis.dis(fn) shows it). A stack machine interprets the bytecode at runtime. This is the layer where overhead lives — every Python call passes through it.
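To look at this layer directly, the standard-library dis module can list the instructions CPython generates. The sketch below uses math.hypot as a stand-in for torch.matmul (an assumption, made only so the example runs without PyTorch installed); the exact opcode names differ between CPython versions.

```python
import dis
import math

def call_site(a, b):
    # Same shape as `torch.matmul(A, B)`: look up a module global,
    # fetch an attribute from it, then call the result.
    return math.hypot(a, b)

# Collect opcode names instead of printing the full listing,
# so the sequence is easy to inspect.
opnames = [ins.opname for ins in dis.get_instructions(call_site)]
print(opnames)
```

Whatever the version, the list contains a global load, an attribute or method load, and a call instruction. That sequence is interpreted on every call.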
PyTorch dispatcher. torch.matmul resolves to an ATen operator, which the dispatcher routes by device, dtype, and layout. The routing logic is itself a graph of registrations. After the dispatcher, you are no longer in Python — you are in C++.
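The idea of "registrations plus a lookup" can be sketched in a few lines. This is a toy, not PyTorch's dispatcher: the names register, dispatch, and _REGISTRY are hypothetical, and the key here is only (operator, device), whereas the real dispatcher is C++ and keys on more than that.

```python
# Toy dispatch table: (operator name, device) -> implementation.
_REGISTRY = {}

def register(op, device):
    """Decorator that records an implementation under (op, device)."""
    def decorator(fn):
        _REGISTRY[(op, device)] = fn
        return fn
    return decorator

@register("matmul", "cpu")
def matmul_cpu(a, b):
    # Naive matrix product over nested lists, standing in for a CPU kernel.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

@register("matmul", "gpu")
def matmul_gpu(a, b):
    # Placeholder: a real backend would launch a device kernel here.
    raise NotImplementedError("no GPU backend in this sketch")

def dispatch(op, device, *args):
    """Look up the registered kernel and call it, or fail loudly."""
    try:
        kernel = _REGISTRY[(op, device)]
    except KeyError:
        raise RuntimeError(f"no kernel registered for {op!r} on {device!r}") from None
    return kernel(*args)

result = dispatch("matmul", "cpu", [[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

A missing entry in the table is the toy version of "a new accelerator does not have a backend".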
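ATen kernel (FFI boundary). The C++ side of the bridge. PyTorch's Python bindings (pybind11 plus generated CPython C-API code) handle the Python ⇄ C++ marshalling. A Python torch.Tensor is a thin wrapper around the C++ tensor object, so crossing this boundary copies no tensor data; DLPack plays the same zero-copy role when a tensor crosses between frameworks. This is the first place the word FFI earns its keep.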
DSL / framework (where IR appears). If the kernel is hand-rolled CUDA / HIP, you jump straight to the assembler. If it is a DSL — Triton, FlyDSL, JAX-via-XLA, Inductor — the framework traces the Python into an in-memory typed graph. That graph is its IR. You have now left "code" and entered "data the compiler can reason about."
MLIR pass pipeline. The IR is rewritten by a sequence of passes, each a function (IR) → (IR). The pipeline progressively lowers from "linalg / linear algebra ops" through "affine loops" through "explicit memrefs" down to "LLVM dialect" — at which point you are one step from machine code.
LLVM IR. The lingua franca of modern compiler backends. SSA-form, typed, and largely target-independent. LLVM ships backends for x86, ARM, RISC-V, AMDGPU, NVPTX, and WebAssembly that all consume the same IR language (a given module still records a target triple and data layout). Most modern compilers that target more than one chip use LLVM, or have imitated it.
Target backend / ASM. The LLVM target backend selects real instructions, allocates registers, schedules, and emits assembly (or directly machine code). For AMD GPUs this is the AMDGPU backend; for NVIDIA the NVPTX backend (which targets PTX, not SASS).
Machine code & device. The driver loads the binary onto the device, the device's command processor schedules wavefronts / warps, and the actual ISA stream feeds the execution units. From here it is microcode and transistors, and you cannot watch the program with a debugger anymore.
Every layer rewrites the program from "easier to write" toward "easier to execute." IR is the data type the compiler uses to do those rewrites. MLIR is a framework for building IRs at many levels. LLVM is one specific IR plus a battle-tested set of backends. The ISA is the chip's spec — what the backend must emit. FFI is how higher and lower layers talk across language and runtime boundaries. Everything else is engineering.
§ 2 · The word that means too many things: IR · the controlled vocabulary
"IR" stands for intermediate representation. The phrase is unhelpful in the way that "data structure" is unhelpful: technically accurate, but it does not tell you what the thing is for. A useful definition: an IR is a controlled vocabulary for describing what a program does, in a form that is easier to analyze and transform than the surface language and more structured than machine code.
"More analyzable than source" matters because source code has too much sugar: generics, exceptions, syntactic shorthand, undefined ordering, late binding. "More structured than machine code" matters because machine code has too little: no variable names, no types, no control flow markers, just registers and branches. IR sits in between by design.
A minimal example
Take a trivial Python function:
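A function that fits the discussion below could be as small as this one. The name add and the variable names a, b, r are assumptions, chosen to match the IR names %a, %b, %r used in the explanation.

```python
def add(a: int, b: int) -> int:
    # One intermediate name, so the IR has something to call %r.
    r = a + b
    return r

print(add(2, 3))
```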
Lowered to LLVM IR by an LLVM-based compiler such as Numba (with 32-bit integers assumed for brevity), the body becomes:
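The IR text below is hand-written to match the names %a, %b, %r used in the explanation; it is not captured from a real compiler run. It is held in a Python string so that a small, hypothetical helper (ssa_definitions) can check the single-assignment property mechanically.

```python
import re

# Hand-written LLVM IR for add(a, b); an illustration, not real compiler output.
LLVM_IR = """
define i32 @add(i32 %a, i32 %b) {
entry:
  %r = add i32 %a, %b
  ret i32 %r
}
"""

def ssa_definitions(ir: str) -> list:
    """Return every %name defined in the IR: parameters plus instruction results."""
    header = re.search(r"define .*?\((.*?)\)", ir).group(1)
    params = re.findall(r"%[\w.]+", header)
    results = re.findall(r"^\s*(%[\w.]+)\s*=", ir, flags=re.MULTILINE)
    return params + results

defs = ssa_definitions(LLVM_IR)
# SSA means no name appears twice on a defining side.
is_ssa = len(defs) == len(set(defs))
print(defs, is_ssa)
```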
Three things changed. (1) Types became explicit: i32, not "int." (2) Each name is assigned exactly once — %a, %b, %r never get reassigned. That is SSA form: Static Single Assignment. (3) The function body is a flat list of operations connected by data flow, not a tree of expressions. There are no implicit conversions, no operator overloading, no surprises. The compiler can now do interesting things — fold constants, eliminate dead code, prove that %r never overflows — because the IR makes the analysis obvious.
Why everyone has their own IR
Once you accept that IR is just "the data type the compiler reasons about," it becomes obvious why every compiler invents one. Each level of abstraction wants a different vocabulary:
| System | IR | What it talks about |
|---|---|---|
| LLVM | LLVM IR | SSA scalars, integers, floats, pointers, basic blocks |
| MLIR | fly.layout, linalg.matmul, ... | whatever each dialect needs: tensors, layouts, hardware atoms |
| JVM | Java bytecode | stack-machine ops, typed locals, class references |
| .NET | CIL / MSIL | same idea as JVM, different bytecode set |
| Wasm | WebAssembly | portable bytecode for sandboxed execution |
| GraalVM | Graal IR | SSA "sea-of-nodes" with explicit control/data edges |
| PyTorch 2 | FX Graph, Inductor IR | tensor ops, then loop nests |
| Triton | Triton IR | block-level tensor ops with layouts |
| JAX / XLA | HLO | high-level operations on whole arrays |
| SPIR-V | SPIR-V | portable IR for graphics & compute shaders |
The word IR meaning a dozen different things at once is part of why this terminology feels slippery. When someone says "the compiler lowers the IR to LLVM IR," they mean "the compiler rewrites its own IR into LLVM's IR." Two different IRs, same word.
An IR is the file format the compiler talks to itself in. A pass is a function that takes one IR and returns another, often the same kind but simpler. A pipeline is a sequence of passes. That is it.
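To make "a pass is a function from IR to IR" concrete, here is a minimal sketch over a toy IR. Everything in it is an assumption made for illustration: the tuple encoding, the opcode names, and the two passes. No real compiler stores its IR this way.

```python
# A toy SSA IR: each instruction is (result_name, opcode, operands).
# Operands are names of earlier results, argument names, or integer constants.

def fold_constants(ir):
    """Pass 1: turn add/mul of two known constants into a 'const' instruction."""
    known, out = {}, []
    for name, op, args in ir:
        vals = tuple(known.get(a, a) for a in args)  # substitute known constants
        if op == "const":
            known[name] = vals[0]
            out.append((name, "const", vals))
        elif op in ("add", "mul") and all(isinstance(v, int) for v in vals):
            value = vals[0] + vals[1] if op == "add" else vals[0] * vals[1]
            known[name] = value
            out.append((name, "const", (value,)))
        else:
            out.append((name, op, vals))
    return out

def eliminate_dead_code(ir):
    """Pass 2: drop instructions whose result is never used; 'ret' is always live."""
    live, out = set(), []
    for name, op, args in reversed(ir):
        if op == "ret" or name in live:
            out.append((name, op, args))
            live.update(a for a in args if isinstance(a, str))
    return list(reversed(out))

def run_pipeline(ir, passes):
    """A pipeline is just the passes applied in order."""
    for p in passes:
        ir = p(ir)
    return ir

program = [
    ("x", "const", (2,)),
    ("y", "const", (3,)),
    ("t", "add", ("x", "y")),
    ("u", "mul", ("t", "a")),      # "a" is a function argument, unknown at compile time
    ("dead", "add", ("x", "x")),   # result is never used
    ("", "ret", ("u",)),
]
optimized = run_pipeline(program, [fold_constants, eliminate_dead_code])
print(optimized)
```

Six instructions go in and two come out: the multiply by the folded constant 5, and the return. Both passes take and return the same kind of list, which is why they compose.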
§ 3 · The IR you build IRs with: MLIR · IR for building IRs
If every project keeps inventing its own IR, the obvious next question is: can we share the infrastructure? Each IR needs a parser, a printer, a verifier, a pass manager, a type system, an attribute system, a graph data structure with use-def chains, debugging facilities. That stuff is engineering, not insight. MLIR is the answer: Multi-Level Intermediate Representation. A framework for building IRs that all share the same infrastructure, but can specialize the operations and types to any level of abstraction.
MLIR was born at Google in 2019 (Chris Lattner and Tatiana Shpeisman led the early work), then upstreamed into the LLVM project where it lives today. The core abstraction is the dialect: a named bundle of ops, types, and attributes. The standard distribution ships a few dozen dialects covering everything from low-level LLVM IR to high-level array math.
A peek at three dialects
To make this less abstract, here is the same idea — "a 4×8 matrix" — written in three MLIR dialects at three different levels:
Same shape, three vocabularies. The top one is what an autotuner wants to optimize, the middle one is what a polyhedral compiler wants to tile and fuse, the bottom one is what a kernel author wants to reason about hardware lane layout in. MLIR is the framework that lets all three live in the same module and lower into each other.
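The lowest of those levels, reasoning about layout, can be illustrated without any MLIR syntax. The sketch below is not FlyDSL or any dialect's notation: Layout is a hypothetical Python class that treats a layout as a function from coordinates to a linear offset, which is the general idea behind layout-oriented IRs.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Layout:
    """A layout as a function from (row, col) coordinates to a linear offset."""
    shape: Tuple[int, int]
    stride: Tuple[int, int]

    def offset(self, row: int, col: int) -> int:
        rows, cols = self.shape
        if not (0 <= row < rows and 0 <= col < cols):
            raise IndexError((row, col))
        return row * self.stride[0] + col * self.stride[1]

# The same 4x8 shape under two different memory orders.
row_major = Layout(shape=(4, 8), stride=(8, 1))
col_major = Layout(shape=(4, 8), stride=(1, 4))

print(row_major.offset(1, 2))  # 1*8 + 2*1
print(col_major.offset(1, 2))  # 1*1 + 2*4
```

The shape is identical in both cases; only the mapping to memory differs, and that mapping is what a kernel author tunes.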
Before MLIR, every ML compiler reinvented the IR infrastructure from scratch: TensorFlow had its own GraphDef + grappler, Glow had its own IR, ONNX had its own IR, Halide had its own IR, TVM had its own Relay + Tensor IR. After MLIR, most new compilers (Triton, FlyDSL, Torch-MLIR, IREE, OpenXLA) share a parser, a verifier, a pass manager. The infrastructure stopped being a tax.
§ 4 · The compiler infrastructure that ate the world: LLVM · the workhorse
LLVM is, depending on how you count, either the most successful compiler infrastructure project in history, or tied with GCC. It started in 2000 as Chris Lattner's master's thesis at Illinois ("Low Level Virtual Machine"), grew into the technical backbone of Apple's clang in 2007, and has since become the default backend for most new compilers outside GCC and Microsoft's MSVC. If you write Rust, Swift, Julia, Zig, modern Fortran (via Flang), or compile Python via Numba, you are using LLVM. (Cython is different: it emits C, so LLVM is involved only if that C is built with clang.) If you write GPU kernels in CUDA, HIP, OpenCL, SYCL, or any MLIR-based DSL, you are almost certainly using LLVM. If you flash a microcontroller with a Rust firmware, you are using LLVM.
"LLVM" the brand covers three different things people sometimes confuse:
| Name | What it is |
|---|---|
| LLVM Project | The umbrella repo. Holds clang, LLVM core, LLDB, libc++, compiler-rt, MLIR, BOLT, Flang, and a dozen more sub-projects. |
| LLVM IR | The intermediate representation. SSA-form, typed, target-agnostic. Has a textual form (.ll) and a binary form (.bc, "bitcode"). |
| LLVM Core | The library that contains the optimizer (~100 passes), the legalizer, the instruction selector, the register allocator, and the per-target code generators. |
What MLIR adds on top
LLVM IR is wonderful at register-level reasoning — instruction selection, scheduling, register allocation — but it knows nothing about tensors, layouts, or loop tiling. By the time your program is in LLVM IR, all the high-level structure has been lost. MLIR's job is to let the compiler do tensor-shaped optimizations before dropping to LLVM IR, so the LLVM backend gets a program that is already shaped for the target.
In practice this is why ML compilers stack MLIR on top of LLVM rather than replacing it. MLIR handles the part where "a matmul with M = N = K = 4096" turns into "a doubly-tiled loop nest with a software pipeline." LLVM handles the part where "this loop body needs eight MFMA instructions and four buffer loads, scheduled so the next iteration overlaps with the current one's accumulator write."
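The "tiled loop nest" half of that division of labor can be shown in plain Python. This is a sketch with a single level of tiling and a tiny size; the function names and the tile parameter are hypothetical, and a real compiler would tile at two or more levels and add pipelining.

```python
def matmul_naive(a, b, n):
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_tiled(a, b, n, tile):
    """Same arithmetic, reordered so each step touches only tile-sized blocks
    of A, B, and C (the blocks that would live in fast memory)."""
    assert n % tile == 0, "sketch assumes the tile size divides n"
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):              # outer loops walk over tiles
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, i0 + tile):      # inner loops stay inside one tile
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c

n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j for j in range(n)] for i in range(n)]
same = matmul_tiled(a, b, n, tile=4) == matmul_naive(a, b, n)
print(same)
```

The result is unchanged; only the order of memory accesses differs. That reordering is the kind of transformation that is easy to express on loops and tensors and very hard to recover once the program is flat LLVM IR.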
§ 5 · The chip's contract with the world: ISA · where the rubber meets silicon
An ISA (Instruction Set Architecture) is the contract between a chip and any compiler that wants to use it. It specifies: the legal instructions, what they do, how they are encoded, the registers the program can address, the memory model, the exception/interrupt mechanisms. Two chips that implement the same ISA can run the same machine code (given a compatible OS and ABI). Two chips with different ISAs cannot, which is why a binary built for Apple Silicon does not run natively on an Intel machine, and an Intel binary runs on Apple Silicon only through a translator such as Rosetta 2.
The major ISA families
| Family | Owner | Where you meet it |
|---|---|---|
| x86 / x86-64 | Intel · AMD | desktops, servers, laptops, most cloud VMs |
| ARMv8 / v9 · AArch64 | Arm Holdings | phones, Apple Silicon Macs, Graviton, NVIDIA Grace |
| RISC-V | open ISA · UC Berkeley origin | embedded, accelerators, increasingly servers |
| NVIDIA PTX / SASS | NVIDIA | PTX is a portable virtual ISA; SASS is the real on-chip one (undocumented) |
| AMD GCN · RDNA · CDNA | AMD | Radeon (RDNA), Instinct/MI series (CDNA), Steam Deck APU |
| PowerPC · Power ISA | IBM | IBM Power servers, some HPC, recent Power10/11 |
| MIPS · LoongArch | historic · Loongson (China) | routers, embedded, China domestic CPUs |
Why ISA matters more for ML than it used to
For most of CPU history, the ISA was a stable substrate — you didn't really care which x86 generation you were on, because the compiler handled it. ML changed that. The performance of a modern ML workload depends almost entirely on whether the compiler reaches the chip's specialized matrix-math instructions. Those instructions are the difference between 50 TFLOPS and 500 TFLOPS on the same silicon area. Every major vendor added them in the last decade, and they are all different.
ISA = what instructions exist and what they do. ABI (Application Binary Interface) = the conventions layered on top: how arguments are passed, which registers are caller-saved, how the stack is aligned, how exceptions propagate. Two compilers targeting the same ISA can still disagree at the ABI level. The "x86-64 System V ABI" and the "Microsoft x64 ABI" both target x86-64, yet code compiled for one cannot freely call code compiled for the other.
§ 6 · The bridge between worlds: FFI · the boundary
FFI stands for Foreign Function Interface. It is the mechanism a program written in one language uses to call a function written in another. Every modern ML system has at least one FFI inside it: Python ↔ C++, C++ ↔ CUDA, JIT-compiled binary ↔ runtime, Rust ↔ C, JavaScript ↔ Wasm. If a system feels stitched together, the seams are usually FFI calls.
An FFI has to solve three problems simultaneously:
- ABI mismatch. The two languages may not even agree on what a function call looks like at the machine level.
- Data representation. Python lists, C arrays, NumPy arrays, PyTorch tensors, JAX arrays — they all describe "a chunk of numbers" but with different metadata, layout, and ownership.
- Lifetime & ownership. A GC'd object passed into C++ must not be freed while the C++ side is still using it. A C++ object handed to Python must release its memory when Python loses the last reference.
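All three problems show up even in the smallest FFI call. The sketch below uses the standard-library ctypes module to call the C function strlen. It assumes a Unix-like system where the C library can be located; on other platforms the library lookup would need to change.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If the lookup fails, passing None falls
# back to symbols already loaded into the process (works on Linux and macOS).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# ABI: declare the C signature `size_t strlen(const char *)` so ctypes
# knows how to pass the argument and how to interpret the return value.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Data representation: a Python bytes object is handed over as a
# NUL-terminated char*.
# Lifetime: `message` stays referenced for the duration of the call,
# so its buffer cannot be freed while C is reading it.
message = b"hello, ffi"
length = libc.strlen(message)
print(length)
```

Leaving out the argtypes and restype declarations is the miniature version of an ABI mismatch: the call may still appear to work, but nothing checks that both sides agree.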
A closer look at TVM-FFI
Apache TVM-FFI is the cleanest contemporary example of "stable C ABI for ML kernels." Its pitch: a kernel library — FlashInfer, TileLang, NVIDIA's cuteDSL — should be able to ship one wheel that PyTorch / JAX / CuPy / Paddle can all load. The way it gets there is:
- A minimal C ABI for calls, errors, types — about a thousand lines of header.
- Tensor interop via DLPack — the cross-framework "I am a tensor in device memory at address X with strides Y" protocol.
- Language bindings for Python, C++, Rust — same C ABI underneath.
If you have ever debugged a kernel wheel that says undefined symbol: _ZN3c106detail... after a PyTorch upgrade, you have hit the problem that stable cross-framework ABIs are trying to solve.
DLPack is a small struct describing a tensor: pointer, device, dtype, shape, strides, byte offset. Every major framework knows how to import and export this struct, so you can torch.from_dlpack(jax_array) and pass tensors across frameworks without copying, provided both sides can address the same device memory. It is the unsexy plumbing that holds the modern ML ecosystem together.
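The struct is small enough to mirror in ctypes. The sketch below follows the field order of DLTensor as I read it in dlpack.h. It only describes a CPU buffer and reads one element back through the description; the helper element is hypothetical, and no real framework exchange takes place.

```python
import ctypes

class DLDevice(ctypes.Structure):
    _fields_ = [("device_type", ctypes.c_int32), ("device_id", ctypes.c_int32)]

class DLDataType(ctypes.Structure):
    _fields_ = [("code", ctypes.c_uint8), ("bits", ctypes.c_uint8),
                ("lanes", ctypes.c_uint16)]

class DLTensor(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),                     # raw pointer to the storage
        ("device", DLDevice),
        ("ndim", ctypes.c_int32),
        ("dtype", DLDataType),
        ("shape", ctypes.POINTER(ctypes.c_int64)),
        ("strides", ctypes.POINTER(ctypes.c_int64)),   # in elements, not bytes
        ("byte_offset", ctypes.c_uint64),
    ]

KDL_CPU, KDL_FLOAT = 1, 2   # enum values from dlpack.h

storage = (ctypes.c_float * 32)(*range(32))   # 4x8 float32 values, row-major
shape = (ctypes.c_int64 * 2)(4, 8)
strides = (ctypes.c_int64 * 2)(8, 1)

t = DLTensor(
    data=ctypes.addressof(storage),
    device=DLDevice(KDL_CPU, 0),
    ndim=2,
    dtype=DLDataType(KDL_FLOAT, 32, 1),
    shape=shape,
    strides=strides,
    byte_offset=0,
)

def element(tensor, i, j):
    """Read element (i, j) using only what the struct says about the buffer."""
    base = ctypes.cast(tensor.data + tensor.byte_offset, ctypes.POINTER(ctypes.c_float))
    return base[i * tensor.strides[0] + j * tensor.strides[1]]

print(element(t, 1, 2))
```

Nothing was copied: the struct points at storage, and any consumer that understands these fields can read the same memory.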
§ 7 · The whole stack on one example: One matmul, all the way down
Time to spend the vocabulary. Trace a single call, C = torch.matmul(A, B) for FP16 4096×4096 matrices (M = N = K = 4096) on an AMD MI300X, from Python entry to the actual MFMA instructions retiring on the wavefront. We will skip nothing important and add nothing extra.
Step 1 · Python → CPython bytecode
CPython parses your .py file once into bytecode. torch.matmul resolves to a LOAD_ATTR · CALL sequence on the stack machine. Total cost: a couple of microseconds, paid every call.
Step 2 · PyTorch dispatcher → ATen kernel
The dispatcher inspects the tensors' device (cuda/hip), dtype (f16), and layout, then looks up the registered implementation. We land in aten::matmul, which decides this is a 2D-by-2D case and delegates to aten::mm, which on a ROCm build delegates to at::native::mm_out_hip. That function asks hipBLAS / hipBLASLt for a GEMM, or — if you have flipped a config — calls into a fused custom kernel.
Step 3 · FFI · Python ⇄ C++ ⇄ HIP runtime
Tensors cross the Python/C++ boundary through libtorch's pybind11 layer (or, for kernel libraries built on it, via TVM-FFI). The HIP runtime, AMD's counterpart to NVIDIA's CUDA runtime, accepts a launch request and enqueues it on the current stream for the tensor's device.
Step 4 · Compiled kernel binary (HSACO)
The kernel is not compiled at this moment; it was compiled when the library shipped (or the first time you ran it, if JIT). The binary is an HSACO file — AMD's equivalent of NVIDIA's .cubin — containing CDNA3 machine code for the MI300X's gfx942 ISA. It was produced earlier by: a DSL (FlyDSL, Triton) → MLIR → LLVM IR → LLVM AMDGPU backend → assembler → HSACO.
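The "first time you ran it, if JIT" case usually means a cache keyed by shape. The sketch below imitates that pattern in pure Python. compile_kernel and matmul_jit are hypothetical names, and the "compilation" is only a closure that bakes the loop bounds in; no real code generation happens.

```python
compile_count = 0
_kernel_cache = {}

def compile_kernel(m, n, k):
    """Stand-in for an expensive compile step that bakes the shape into the code."""
    global compile_count
    compile_count += 1
    def kernel(a, b):
        # Loop bounds are constants captured at "compile" time.
        return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
                for i in range(m)]
    return kernel

def matmul_jit(a, b):
    shape = (len(a), len(b[0]), len(b))       # (M, N, K)
    kernel = _kernel_cache.get(shape)
    if kernel is None:                         # first call with this shape: compile
        kernel = _kernel_cache[shape] = compile_kernel(*shape)
    return kernel(a, b)

x = [[1, 2], [3, 4]]
first = matmul_jit(x, x)                       # compiles for shape (2, 2, 2)
second = matmul_jit(x, x)                      # cache hit, no recompile
third = matmul_jit([[1, 2, 3]], [[1], [2], [3]])   # new shape (1, 1, 3): compiles again
print(compile_count)
```

Two compiles for three calls: the cost is paid once per distinct shape, which is why a JIT-compiled kernel is often slow on its first call and fast afterwards.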
Step 5 · Wavefront execution · MFMA instructions
The MI300X's command processor schedules the kernel onto Compute Units. Each Compute Unit runs wavefronts of 64 threads. Inside the hot loop, the machine code is a tight sequence of buffer_load_dwordx4 (load 16 bytes of A or B from global memory), ds_write_b128 / ds_read_b128 (stage through LDS), and v_mfma_f32_16x16x16_f16 (the matrix-multiply-accumulate). Get this loop's instruction mix right and you push the chip to 90%+ of peak FP16 FLOPs. Get it wrong and you sit at 30%.
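A little arithmetic shows how much work that hot loop represents. The sketch counts FLOPs and MFMA instructions for M = N = K = 4096. The peak-throughput figure is left as a parameter (peak_tflops is a placeholder, not a quoted spec), so only the ratio between efficiencies is computed.

```python
M = N = K = 4096

# One multiply and one add per (i, j, k) triple.
total_flops = 2 * M * N * K

# v_mfma_f32_16x16x16_f16 computes a 16x16 output tile over a K-depth of 16.
flops_per_mfma = 2 * 16 * 16 * 16
mfma_count = total_flops // flops_per_mfma

def kernel_seconds(peak_tflops, efficiency):
    """Ideal runtime if the kernel sustains `efficiency` of `peak_tflops`."""
    return total_flops / (peak_tflops * 1e12 * efficiency)

print(total_flops, mfma_count)

# Whatever the peak is, 30% efficiency takes three times as long as 90%.
ratio = kernel_seconds(1.0, 0.30) / kernel_seconds(1.0, 0.90)
print(round(ratio, 6))
```

About 137 GFLOP of work is spread over roughly 16.8 million MFMA instructions across all wavefronts, so how those instructions are interleaved with loads decides which side of that 3× gap a kernel lands on.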
Step 6 · Result returns · GC
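Kernel launches are asynchronous: the runtime returns control to Python as soon as the launch is enqueued, and the host waits for completion (via a stream synchronization or event) only when something needs the values, such as .item(), .cpu(), or an explicit torch.cuda.synchronize(). The result tensor, a thin C++ object wrapping a device pointer, shape, and strides (and exportable as DLPack), is handed back to Python through the binding layer. Python reference counting owns the lifetime: when the last reference drops, the device memory is released back to PyTorch's caching allocator, which can reuse it for later tensors.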
Python source → bytecode → PyTorch dispatcher → ATen C++ → pybind11/TVM-FFI → HIP runtime → HSACO machine code → CDNA3 wavefront executing MFMAs → result back through the same bridges in reverse. Every layer is replaceable; the joints are what make the system feel either elegant or brittle.
§ 8 · The places people run aground: Reefs · common confusions
A few patterns of confusion that keep recurring in conversations:
- "IR" overloaded. A compiler can have a frontend IR, a middle IR, and a backend IR: three different IRs, same compiler. "Lowering the IR to LLVM IR" usually means "lowering my IR to LLVM's IR." Always pin down whose IR.
- LLVM IR ≠ MLIR. LLVM IR is one specific IR. MLIR is a framework for building (potentially many) IRs. MLIR happens to live in the LLVM repo and includes a dialect that mirrors LLVM IR, but the two are not the same thing.
- PTX is virtual; SASS is real. On NVIDIA GPUs, "the ISA" usually means PTX in conversation. PTX is a portable virtual ISA: ptxas (ahead of time) or the driver's JIT translates it to SASS, the real, largely undocumented per-architecture ISA. You write PTX (or have a compiler emit it); you almost never see SASS unless you are debugging at the assembly level.
- ISA ≠ ABI. ISA = what the chip executes. ABI = how compilers agree on calling conventions on top of that ISA. Same ISA, different ABIs is real: System V x86-64 and Microsoft x64 are both x86-64 but have different rules for which registers carry arguments.
- "Tensor Core" is a brand, not a category. "Tensor Core" is NVIDIA's marketing name for their matrix instructions. The category is matrix-multiply-accumulate, and every vendor has one: AMD MFMA (CDNA) / WMMA (RDNA), Intel AMX, ARM SME, Apple AMX (different from Intel's, confusingly). Talking about "Tensor Cores" on AMD hardware is fine in casual conversation but technically the instructions are MFMAs.
- "Compiler" means at least three things in ML. There is the AOT compiler (clang / rustc) that produces a binary from source. There is the framework compiler (torch.compile, JAX jit, XLA) that produces a graph from a Python function. There is the kernel compiler (Triton, FlyDSL) that produces a GPU binary from a kernel DSL. They share infrastructure but solve different problems.
- "AOT" vs "JIT". AOT (ahead-of-time): compile once, ship a binary. C/C++, Rust, Swift, Go work this way. JIT (just-in-time): compile at runtime, often shape-specialized. JVM, V8, Julia, PyTorch's torch.compile, and most GPU kernel DSLs work this way by default. Most ML systems blend both: the framework ships AOT, the kernels JIT per shape.
Coda: Where to read next
If this primer convinced you that the territory is worth a closer look, here are the books and references that paid for themselves for me:
| For | Read |
|---|---|
| Computer architecture, the canonical text | Hennessy & Patterson · Computer Architecture: A Quantitative Approach |
| A friendlier compiler textbook | Cooper & Torczon · Engineering a Compiler |
| LLVM, the design retrospective | The Architecture of Open Source Applications, Vol. 1, chapter on LLVM (Chris Lattner, free online) |
| MLIR, the original paper | Lattner et al. · MLIR: A Compiler Infrastructure for the End of Moore's Law (2020) |
| AMD GPU ISA reference | AMD CDNA3 ISA Reference Guide (PDF on AMD's site) |
| NVIDIA PTX reference | PTX ISA Programming Guide (NVIDIA Docs) |
| Real kernel code to learn from | CUTLASS · FlashAttention · vLLM · FlyDSL · TileLang |
And — meta — the other entries in this Source Reading series cover concrete repos in this stack: SkyPilot (orchestration), SGLang and vLLM (inference engines), mini-SGLang (teaching version), gcnasm (AMD CDNA3 assembly), FlyDSL (layout-algebra Python DSL). Each is a worked example of one of the boxes in Plate I.
The point of writing this primer was not to make you a compiler engineer. It was to give you back the vocabulary so that the next time something underneath leaks — the next time the abstraction stops being free — you can name what just happened. Naming things is most of debugging.