N.007 A Primer 2026-05-26

From Python to Silicon

A field guide to IR, MLIR, LLVM, ISA, and FFI — for the working ML engineer who never quite finished the compilers course.

Length: 10 chapters
Audience: curious & terminology-shy
Map: Python to ASM
Compiled for: share-ready primer

You can write production ML systems for years without ever quite knowing what an IR is. You import torch, you call torch.matmul, the GPU lights up, and a few milliseconds later there is a tensor on the other side. Somewhere between your Python file and that tensor sit at least eight layers of software, three forms of intermediate representation, two compiler frameworks, one device driver, and a chip that speaks an instruction set someone in Bangalore or Sunnyvale spent two years specifying. Most of the time you do not need to know any of this. But every once in a while something underneath leaks — a kernel runs slower than it should, a new accelerator does not have a backend, a wheel built for CUDA 12.4 will not load on CUDA 12.6 — and the abstraction stops being free.

This essay is for the working ML engineer who hit one of those leaks and realized they could not quite name the pieces. It is not a textbook — Hennessy & Patterson is still on the shelf — but it is the explanation I wish someone had given me before I had to read Hennessy & Patterson. Each chapter takes one word from the stack (IR, MLIR, LLVM, ISA, FFI), shows you the smallest concrete example I could find, and explains why people in this field keep talking about it.

How to read this

If you only have twenty minutes, read § 1 (the stack) and § 7 (one matmul, all the way down). Those two together give the mental model. The middle five chapters expand each layer in turn — you can skim them and return on demand.

A note on bias. I work on AMD GPU kernels through MLIR, so when I need a concrete example I will reach for FlyDSL, ROCDL, MFMA, CDNA — the AMD vocabulary I touch every day. The shape of the argument is identical for NVIDIA, Intel, ARM, RISC-V. Wherever vendor-specific names appear I have tried to flag them.

§ 1 · The map before the territory
The vertical stack

Start with the question: what actually happens between torch.matmul(A, B) and a transistor switching? The answer is a vertical descent through layers, each translating from a slightly more abstract description into a slightly more concrete one. Plate I below draws the full descent on one page; this section is the verbal walk-through.

Plate I · The vertical descent from Python to silicon (scale: conceptual)
[Figure: nine stacked layers in lowering order: Python source, CPython bytecode, PyTorch dispatcher, ATen kernel (FFI hop), DSL / framework (IR appears), MLIR pipeline, LLVM IR, target backend / ASM, machine code on the device. Each row collapses sub-layers: the dispatcher alone is a small library, and the MLIR pipeline can be 30+ passes. The descent is not strictly linear in time, but the layered picture is the right starting mental model.]
Plate I: the same vertical descent, drawn once. Three boundaries matter most: the FFI hop (Python ⇄ C++), the IR boundary (where the program stops being an AST and starts being a typed graph), and the ISA boundary (where a CPU/GPU finally executes). Everything else is detail.

A walking tour of the layers

Python source. What you write. Human-readable, dynamic, garbage-collected.

CPython bytecode. Python compiles each function to bytecode (dis.dis(fn) shows it). A stack machine interprets the bytecode at runtime. This is the layer where overhead lives — every Python call passes through it.

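To see this layer directly, the standard-library `dis` module can list the bytecode of a small function. This is a minimal sketch: `hypot_of` is a hypothetical stand-in with the same shape as a `torch.matmul` call (an attribute lookup on a module, then a call), and because opcode names differ between CPython versions, none are hard-coded.

```python
import dis
import math


def hypot_of(a, b):
    # Same shape as torch.matmul(A, B): look up an attribute on a module, then call it.
    return math.hypot(a, b)


# Collect the opcode names the interpreter will step through, in order.
opnames = [ins.opname for ins in dis.get_instructions(hypot_of)]
print(opnames)
```

Whatever the version, the list contains an attribute or method load followed by some flavor of call instruction; that handful of interpreter steps is the fixed cost every Python-level call pays.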

PyTorch dispatcher. torch.matmul resolves to an ATen operator, which the dispatcher routes by device, dtype, and layout. The routing logic is itself a graph of registrations. After the dispatcher, you are no longer in Python — you are in C++.

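The routing idea can be sketched in a few lines. This is a toy model, not PyTorch's actual dispatcher: the registry, the `register` decorator, and the kernel names are all hypothetical, and the key here is just a tuple of op name, device, and dtype.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: (op name, device, dtype) -> kernel implementation.
Key = Tuple[str, str, str]
REGISTRY: Dict[Key, Callable] = {}


def register(op: str, device: str, dtype: str):
    """Decorator that files a kernel under its dispatch key."""
    def wrap(fn: Callable) -> Callable:
        REGISTRY[(op, device, dtype)] = fn
        return fn
    return wrap


@register("matmul", "cpu", "float32")
def matmul_cpu_f32(a, b):
    return "cpu-f32 kernel"


@register("matmul", "hip", "float16")
def matmul_hip_f16(a, b):
    return "hip-f16 kernel"


def dispatch(op: str, device: str, dtype: str, *args):
    """Route a call to the kernel registered for this device and dtype."""
    try:
        kernel = REGISTRY[(op, device, dtype)]
    except KeyError:
        raise NotImplementedError(f"no {op} kernel for {device}/{dtype}") from None
    return kernel(*args)


print(dispatch("matmul", "hip", "float16", None, None))  # hip-f16 kernel
```

The failure branch is the interesting one: "a new accelerator does not have a backend" is, at this layer, simply a missing registry entry.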

ATen kernel (FFI boundary). The C++ side of the bridge. PyTorch's bindings (pybind11 plus hand-written CPython C API code) handle the Python ⇄ C++ marshalling; a tensor crosses as a thin Python wrapper around the C++ tensor object, so no data is copied. DLPack plays the same no-copy role when tensors cross between frameworks. This is the first place the word FFI earns its keep.

DSL / framework (where IR appears). If the kernel is hand-rolled CUDA / HIP, it was compiled ahead of time by the vendor toolchain, and you skip the DSL and MLIR layers entirely. If it is a DSL (Triton, FlyDSL, JAX-via-XLA, Inductor), the framework traces the Python into an in-memory typed graph. That graph is its IR. You have now left "code" and entered "data the compiler can reason about."
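Tracing sounds more mysterious than it is. The sketch below shows one common way to do it, assuming operator overloading on proxy objects; `Graph`, `Proxy`, and `trace` are hypothetical names, and the graph is left untyped for brevity (a real framework would also record shapes and dtypes).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Graph:
    """The recorded program: a flat list of (result, op, operands) nodes."""
    nodes: List[Tuple[str, str, Tuple[str, ...]]] = field(default_factory=list)

    def emit(self, op: str, *operands: str) -> str:
        name = f"%{len(self.nodes)}"
        self.nodes.append((name, op, operands))
        return name


class Proxy:
    """Stands in for a tensor; operators record nodes instead of computing."""

    def __init__(self, graph: Graph, name: str):
        self.graph, self.name = graph, name

    def __matmul__(self, other: "Proxy") -> "Proxy":
        return Proxy(self.graph, self.graph.emit("matmul", self.name, other.name))

    def __add__(self, other: "Proxy") -> "Proxy":
        return Proxy(self.graph, self.graph.emit("add", self.name, other.name))


def trace(fn, *arg_names: str) -> Graph:
    """Run fn on proxies so that its operations land in a Graph."""
    graph = Graph()
    fn(*(Proxy(graph, name) for name in arg_names))
    return graph


def model(a, b, bias):
    return a @ b + bias


graph = trace(model, "%a", "%b", "%bias")
for node in graph.nodes:
    print(node)
```

The Python function ran once, but no arithmetic happened; what is left behind is a list of nodes that later passes can inspect and rewrite.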

MLIR pass pipeline. The IR is rewritten by a sequence of passes, each a function (IR) → (IR). The pipeline progressively lowers from "linalg / linear algebra ops" through "affine loops" through "explicit memrefs" down to "LLVM dialect" — at which point you are one step from machine code.

LLVM IR. The lingua franca of modern compiler backends. SSA-form, typed, target-agnostic. The same LLVM IR can be sent to x86, ARM, RISC-V, AMDGPU, NVPTX, WebAssembly. Nearly every modern compiler that targets more than one chip uses LLVM, or has imitated it.

Target backend / ASM. The LLVM target backend selects real instructions, allocates registers, schedules, and emits assembly (or directly machine code). For AMD GPUs this is the AMDGPU backend; for NVIDIA the NVPTX backend (which targets PTX, not SASS).

Machine code & device. The driver loads the binary onto the device, the device's command processor schedules wavefronts / warps, and the actual ISA stream feeds the execution units. From here it is microcode and transistors, and you cannot watch the program with a debugger anymore.

The one-paragraph summary

Every layer rewrites the program from "easier to write" toward "easier to execute." IR is the data type the compiler uses to do those rewrites. MLIR is a framework for building IRs at many levels. LLVM is one specific IR plus a battle-tested set of backends. The ISA is the chip's spec — what the backend must emit. FFI is how higher and lower layers talk across language and runtime boundaries. Everything else is engineering.

§ 2 · The word that means too many things
IR · the controlled vocabulary

"IR" stands for intermediate representation. The phrase is unhelpful in the way that "data structure" is unhelpful: technically accurate, but it does not tell you what the thing is for. A useful definition: an IR is a controlled vocabulary for describing what a program does, in a form that is easier to analyze and transform than the surface language and more structured than machine code.

"More analyzable than source" matters because source code has too much sugar: generics, exceptions, syntactic shorthand, undefined ordering, late binding. "More structured than machine code" matters because machine code has too little: no variable names, no types, no control flow markers, just registers and branches. IR sits in between by design.

A minimal example

Take a trivial Python function:

    # Python source
    def add(a: int, b: int) -> int:
        return a + b

Compiled to LLVM IR (via, say, Numba, which lowers Python through LLVM), the body becomes, in idealized form:

    ; LLVM IR · target-agnostic, SSA-form
    define i32 @add(i32 %a, i32 %b) {
      %r = add i32 %a, %b
      ret i32 %r
    }

Three things changed. (1) Types became explicit: i32, not "int." (2) Each name is assigned exactly once — %a, %b, %r never get reassigned. That is SSA form: Static Single Assignment. (3) The function body is a flat list of operations connected by data flow, not a tree of expressions. There are no implicit conversions, no operator overloading, no surprises. The compiler can now do interesting things — fold constants, eliminate dead code, prove that %r never overflows — because the IR makes the analysis obvious.

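To make "the IR makes the analysis obvious" concrete, here is a sketch of two classic passes over a toy SSA block. The tuple-based instruction format and both function names are assumptions made for illustration; they are not LLVM's data structures. Note how single assignment does the heavy lifting: once a name is known to be constant, it stays constant.

```python
from typing import Dict, List, Tuple, Union

Operand = Union[str, int]                      # an SSA name like "%x", or an integer constant
Instr = Tuple[str, str, Tuple[Operand, ...]]   # (dest, op, operands)


def fold_constants(block: List[Instr]) -> List[Instr]:
    """Replace ops whose operands are all known constants with their value."""
    known: Dict[str, int] = {}
    out: List[Instr] = []
    for dest, op, args in block:
        # Substitute any operand already proven constant.
        args = tuple(known.get(a, a) if isinstance(a, str) else a for a in args)
        if op in ("add", "mul") and all(isinstance(a, int) for a in args):
            known[dest] = args[0] + args[1] if op == "add" else args[0] * args[1]
            continue  # dest is now a constant; no instruction needed
        out.append((dest, op, args))
    return out


def eliminate_dead_code(block: List[Instr]) -> List[Instr]:
    """Drop instructions whose result is never read (ret is always kept)."""
    live = set()
    out: List[Instr] = []
    for dest, op, args in reversed(block):
        if op == "ret" or dest in live:
            live.update(a for a in args if isinstance(a, str))
            out.append((dest, op, args))
    return list(reversed(out))


block = [
    ("%c", "add", (2, 3)),         # constant: folds to 5
    ("%t", "mul", ("%x", "%c")),   # becomes mul %x, 5
    ("%dead", "add", ("%x", 1)),   # result never used
    ("", "ret", ("%t",)),
]
optimized = eliminate_dead_code(fold_constants(block))
for instr in optimized:
    print(instr)
```

Four instructions go in and two come out: `mul %x, 5` and the `ret`. Neither pass needed to reason about reassignment, aliasing, or evaluation order, which is exactly what the SSA form buys.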

Plate II · SSA: what the compiler actually sees (scale: function-local)
[Figure: the function f(x, y), which computes t = x * 2 and returns t - y if t > y, else y - t, drawn as a control-flow graph in SSA form: an entry block (%t1 = mul %x, 2; %c = icmp gt %t1, %y; br %c, then, else), a then block (%z1 = sub %t1, %y), an else block (%z2 = sub %y, %t1), and a merge block (%z = phi %z1, %z2). Each name is assigned exactly once; at the merge, a φ-node picks the right value based on the predecessor.]
Plate II: a single Python function viewed as a control-flow graph in SSA form. The φ-node at the merge is the IR's way of saying "pick whichever predecessor we came from"; it has no runtime cost and is purely a bookkeeping device for the optimizer.
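The control-flow graph of Plate II is small enough to execute by hand, or with a few lines of interpreter. This is a sketch under assumptions: the tuple encoding, the explicit `jmp` instructions (the plate draws those edges implicitly), and the `run` function are all illustrative, not any real compiler's format. The point is the `phi` case, which selects a value based solely on which block control arrived from.

```python
import operator

BINOPS = {"mul": operator.mul, "sub": operator.sub, "gt": operator.gt}


def value(env, operand):
    """Look up an SSA name, or pass an integer constant through."""
    return env[operand] if isinstance(operand, str) else operand


def run(blocks, entry, env):
    """Interpret a tiny SSA control-flow graph until it returns."""
    prev, label = None, entry
    while True:
        for op, *rest in blocks[label]:
            if op in BINOPS:
                dest, a, b = rest
                env[dest] = BINOPS[op](value(env, a), value(env, b))
            elif op == "phi":
                dest, incoming = rest          # incoming: {predecessor label: SSA name}
                env[dest] = env[incoming[prev]]
            elif op == "br":
                cond, if_true, if_false = rest
                prev, label = label, (if_true if env[cond] else if_false)
                break
            elif op == "jmp":
                prev, label = label, rest[0]
                break
            elif op == "ret":
                return env[rest[0]]


CFG = {
    "entry": [("mul", "%t1", "%x", 2), ("gt", "%c", "%t1", "%y"), ("br", "%c", "then", "else")],
    "then":  [("sub", "%z1", "%t1", "%y"), ("jmp", "merge")],
    "else":  [("sub", "%z2", "%y", "%t1"), ("jmp", "merge")],
    "merge": [("phi", "%z", {"then": "%z1", "else": "%z2"}), ("ret", "%z")],
}


def f(x, y):
    return run(CFG, "entry", {"%x": x, "%y": y})


print(f(3, 4))  # 2  (t = 6 > 4, so z = 6 - 4)
print(f(1, 5))  # 3  (t = 2 <= 5, so z = 5 - 2)
```

In a real backend the φ-node never executes like this; it is resolved into register assignments on the incoming edges. The interpreter only makes its meaning visible.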

Why everyone has their own IR

Once you accept that IR is just "the data type the compiler reasons about," it becomes obvious why every compiler invents one. Each level of abstraction wants a different vocabulary:

| System | IR | What it talks about |
|---|---|---|
| LLVM | LLVM IR | SSA scalars, integers, floats, pointers, basic blocks |
| MLIR | fly.layout, linalg.matmul, ... | whatever each dialect needs: tensors, layouts, hardware atoms |
| JVM | Java bytecode | stack-machine ops, typed locals, class references |
| .NET | CIL / MSIL | same idea as JVM, different bytecode set |
| Wasm | WebAssembly | portable bytecode for sandboxed execution |
| GraalVM | Graal IR | SSA "sea-of-nodes" with explicit control/data edges |
| PyTorch 2 | FX Graph, Inductor IR | tensor ops, then loop nests |
| Triton | Triton IR | block-level tensor ops with layouts |
| JAX / XLA | HLO | high-level operations on whole arrays |
| SPIR-V | SPIR-V | portable IR for graphics & compute shaders |

The word IR meaning a dozen different things at once is part of why this terminology feels slippery. When someone says "the compiler lowers the IR to LLVM IR," they mean "the compiler rewrites its own IR into LLVM's IR." Two different IRs, same word.

Mental model

An IR is the file format the compiler talks to itself in. A pass is a function that takes one IR and returns another, often the same kind but simpler. A pipeline is a sequence of passes. That is it.

§ 3 · The IR you build IRs with
MLIR · IR for building IRs

If every project keeps inventing its own IR, the obvious next question is: can we share the infrastructure? Each IR needs a parser, a printer, a verifier, a pass manager, a type system, an attribute system, a graph data structure with use-def chains, debugging facilities. That stuff is engineering, not insight. MLIR is the answer: Multi-Level Intermediate Representation. A framework for building IRs that all share the same infrastructure, but can specialize the operations and types to any level of abstraction.

MLIR started at Google and was open-sourced in 2019 (Chris Lattner and Tatiana Shpeisman led the early work); it was then upstreamed into the LLVM project, where it lives today. The core abstraction is the dialect: a named bundle of ops, types, and attributes. The standard distribution ships a few dozen dialects covering everything from low-level LLVM IR to high-level array math.

Plate III · The MLIR dialect tower (scale: floors of one building)
[Figure: dialects drawn as floors, upper floors specializing in tensors and lower floors in addresses. From the top: linalg · tosa · tensor · shape (whole-array operations); affine · scf (structured loops & polyhedral); vector · memref · arith (explicit memory & vectors); gpu · async · transform (device-aware ops & rewrites); nvgpu · rocdl · nvvm · spv (vendor-specific intrinsics); llvm dialect (LLVM IR in MLIR clothing), then the bridge to real LLVM IR. Custom dialects such as fly (FlyDSL), tt (Triton), torch (Torch-MLIR), tf (TensorFlow), and stablehlo (JAX / XLA) enter at the top; lowering passes, each an (IR) → (IR) function, move a program between floors.]
Plate III: the MLIR dialect tower. Programs enter at whatever floor matches their abstraction (a tensor-graph IR enters near the top; a hand-written LLVM-dialect program enters near the bottom), and a pipeline of lowering passes walks them down to the LLVM dialect, where the bridge to real LLVM IR (and from there to assembly) is a short step.

A peek at three dialects

To make this less abstract, here is the same idea — "a 4×8 matrix" — written in three MLIR dialects at three different levels:

    // linalg: whole tensor with named semantics
    %C = linalg.matmul
           ins(%A, %B : tensor<4x8xf32>, tensor<8x4xf32>)
           outs(%init : tensor<4x4xf32>) -> tensor<4x4xf32>

    // memref + affine: explicit loops, explicit memory
    affine.for %i = 0 to 4 {
      affine.for %j = 0 to 4 {
        %acc = arith.constant 0.0 : f32
        %c = affine.for %k = 0 to 8 iter_args(%a = %acc) -> f32 {
          // ... fused multiply-add into %a ...
        }
        memref.store %c, %Cmem[%i, %j] : memref<4x4xf32>
      }
    }

    // fly (FlyDSL custom dialect): layout-aware identity
    %layout = fly.make_identity_layout(%shape)
        : (!fly.int_tuple<(4, 8)>) -> !fly.layout<(4, 8) : (1E0, 1E1)>

Same shape, three vocabularies. The top one is what an autotuner wants to optimize, the middle one is what a polyhedral compiler wants to tile and fuse, the bottom one is what a kernel author wants to reason about hardware lane layout in. MLIR is the framework that lets all three live in the same module and lower into each other.

Why MLIR is a big deal

Before MLIR, every ML compiler reinvented the IR infrastructure from scratch: TensorFlow had its own GraphDef + grappler, Glow had its own IR, ONNX had its own IR, Halide had its own IR, TVM had its own Relay + Tensor IR. After MLIR, most new compilers (Triton, FlyDSL, Torch-MLIR, IREE, OpenXLA) share a parser, a verifier, a pass manager. The infrastructure stopped being a tax.

§ 4 · The compiler infrastructure that ate the world
LLVM · the workhorse

LLVM is, depending on how you count, either the most successful compiler infrastructure project in history or tied with GCC. It started in 2000 as Chris Lattner's master's thesis at Illinois ("Low Level Virtual Machine"), grew into the technical backbone of Apple's clang in 2007, and has since become the default backend for almost every new compiler outside GCC and Microsoft's own MSVC. If you write Rust, Swift, Julia, Zig, or modern Fortran (via Flang), or compile Python via Numba, you are using LLVM. (Cython is different: it emits C, so LLVM is involved only when that C is built with clang.) If you write GPU kernels in CUDA, HIP, OpenCL, SYCL, or any MLIR-based DSL, you are using LLVM. If you flash a microcontroller with a Rust firmware, you are using LLVM.

"LLVM" the brand covers three different things people sometimes confuse:

| Name | What it is |
|---|---|
| LLVM Project | The umbrella repo. Holds clang, LLVM core, LLDB, libc++, compiler-rt, MLIR, BOLT, Flang, and a dozen more sub-projects. |
| LLVM IR | The intermediate representation. SSA-form, typed, target-agnostic. Has a textual form (.ll) and a binary form (.bc, "bitcode"). |
| LLVM Core | The library that contains the optimizer (~100 passes), the legalizer, the instruction selector, the register allocator, and the per-target code generators. |

Plate IV · LLVM as universal backend: one IR, many targets (scale: hub-and-spoke)
[Figure: frontends in, LLVM IR in the middle, backends out. Frontends feeding LLVM IR: clang (C / C++ / ObjC), rustc, swiftc, julia (JIT), flang, numba (JIT), zig, MLIR via the llvm dialect. Targets emitted by backends: x86 / x86-64, AArch64, riscv32 / riscv64, AMDGPU (CDNA / RDNA), NVPTX → PTX → SASS, WebAssembly, SPIR-V, embedded (AVR, Xtensa, ...). Each new frontend writes once and each new chip writes one backend, so the cost is O(frontends + targets), not O(frontends × targets).]
Plate IV: the LLVM bargain. Without a shared IR, every language would need a code generator for every chip. With a shared IR, frontends only have to emit LLVM IR, and chip vendors only have to write one backend. This is the trick that made it economical to support, e.g., Rust on RISC-V six months after RISC-V landed.

What MLIR adds on top

LLVM IR is wonderful at register-level reasoning — instruction selection, scheduling, register allocation — but it knows nothing about tensors, layouts, or loop tiling. By the time your program is in LLVM IR, all the high-level structure has been lost. MLIR's job is to let the compiler do tensor-shaped optimizations before dropping to LLVM IR, so the LLVM backend gets a program that is already shaped for the target.

In practice this is why ML compilers stack MLIR on top of LLVM rather than replacing it. MLIR handles the part where "matmul of a 4096-cube tensor" turns into "a doubly-tiled loop nest with a software pipeline." LLVM handles the part where "this loop body needs eight MFMA instructions and four buffer loads, scheduled so the next iteration overlaps with the current one's accumulator write."

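The "tiled loop nest" half of that division of labor is easy to show at toy scale. The sketch below uses a single level of tiling on plain Python lists (the paragraph above describes two levels plus software pipelining, which are omitted); the function names and the tile size are assumptions for illustration.

```python
def matmul_naive(A, B):
    """Reference result: one dot product per output element."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(A, B, tile=2):
    """Same result, iterated tile by tile; assumes every dimension divides evenly by tile."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # outer loops walk over tiles
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops: one tile-sized multiply-accumulate, the part
                # a matrix instruction would perform in hardware.
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        for p in range(p0, p0 + tile):
                            C[i][j] += A[i][p] * B[p][j]
    return C


A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
print(matmul_tiled(A, B) == matmul_naive(A, B))  # True
```

Choosing the tile shape and loop order is the MLIR-level decision; turning the three innermost loops into a handful of matrix instructions and scheduling them is the LLVM-level one.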

§ 5 · The chip's contract with the world
ISA · where the rubber meets silicon

An ISA (Instruction Set Architecture) is the contract between a chip and any compiler that wants to use it. It specifies the legal instructions, what they do, how they are encoded, the registers the program can address, the memory model, and the exception/interrupt mechanisms. Two chips that implement the same ISA can run the same binary. Two chips with different ISAs cannot, at least not without a translation layer; that is why a binary built for Apple Silicon does not run natively on Intel, and vice versa.
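"How they are encoded" is the most tangible clause of that contract. As a small worked example, here is the RISC-V R-type instruction format packed by hand; the helper name `encode_r_type` is hypothetical, while the field positions and the opcode for `add` follow the base RV32I specification.

```python
def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack the six fields of a RISC-V R-type instruction into one 32-bit word."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


# add x3, x1, x2   (RV32I: opcode 0b0110011, funct3 = 0, funct7 = 0)
word = encode_r_type(funct7=0, rs2=2, rs1=1, funct3=0, rd=3, opcode=0b0110011)
print(f"{word:#010x}")  # 0x002081b3
```

Those 32 bits are what the backend must emit and what any conforming RISC-V core must execute as "x3 = x1 + x2"; everything above the ISA exists to arrive at words like this one.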

The major ISA families

| Family | Owner | Where you meet it |
|---|---|---|
| x86 / x86-64 | Intel · AMD | desktops, servers, laptops, most cloud VMs |
| ARMv8 / v9 · AArch64 | Arm Holdings | phones, Apple Silicon Macs, Graviton, NVIDIA Grace |
| RISC-V | open ISA · UC Berkeley origin | embedded, accelerators, increasingly servers |
| NVIDIA PTX / SASS | NVIDIA | PTX is a portable virtual ISA; SASS is the real on-chip one (undocumented) |
| AMD GCN · RDNA · CDNA | AMD | Radeon (RDNA), Instinct/MI series (CDNA), Steam Deck APU |
| PowerPC · Power ISA | IBM | IBM Power servers, some HPC, recent Power10/11 |
| MIPS · LoongArch | historic · Loongson (China) | routers, embedded, China domestic CPUs |

Why ISA matters more for ML than it used to

For most of CPU history, the ISA was a stable substrate — you didn't really care which x86 generation you were on, because the compiler handled it. ML changed that. The performance of a modern ML workload depends almost entirely on whether the compiler reaches the chip's specialized matrix-math instructions. Those instructions are the difference between 50 TFLOPS and 500 TFLOPS on the same silicon area. Every major vendor added them in the last decade, and they are all different.

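A little arithmetic shows why one instruction family can move throughput by an order of magnitude. The sketch assumes a tile instruction with a 16×16×16 shape (the shape encoded in the name of AMD's v_mfma_f32_16x16x16_f16) and a square matmul with M = N = K = 4096; it counts work only and says nothing about cycles or clocks.

```python
# One 16x16x16 tile instruction: a 16x16 output tile, each element a 16-term dot product.
TILE_M = TILE_N = TILE_K = 16
macs_per_instr = TILE_M * TILE_N * TILE_K      # 4096 multiply-accumulates
flops_per_instr = 2 * macs_per_instr           # 8192 (one multiply plus one add each)

# A square matmul with M = N = K = 4096.
M = N = K = 4096
total_macs = M * N * K                         # 68_719_476_736
instrs_needed = total_macs // macs_per_instr   # 16_777_216 tile instructions

print(macs_per_instr, flops_per_instr, instrs_needed)  # 4096 8192 16777216
```

A kernel that misses the matrix instructions has to issue those 68.7 billion multiply-accumulates through ordinary vector FMAs instead, which is where the gap between a respectable kernel and a fast one comes from.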

Plate V · Matrix instructions across vendors: the modern ML ISA war (scale: same operation, different opcodes)
[Figure: six panels, each one vendor's version of the small-matrix multiply-accumulate D := A · B + C. Intel AMX (tdpbf16ps, tdpbssd; tile registers); Arm SME (smopa, sumopa, usmopa; ZA tile register); NVIDIA Tensor Core (mma, wgmma, tcgen05; warp-level, accessed via PTX); AMD MFMA on CDNA (v_mfma_f32_16x16x16_f16; wavefront-level, 64 threads); AMD WMMA on RDNA (v_wmma_f32_16x16x16_f16; wave32-level); RISC-V (RVV is stable, matrix extensions are still moving and vendor-specific). The same matmul lowers to a different opcode per vendor, and when the tile shape does not fit, the kernel falls off the fast path.]
Plate V: six instruction families, six different ways to say "multiply two small matrices and accumulate." Whether your kernel uses these instructions is exactly the difference between a respectable kernel and a state-of-the-art one. This is also why ML compilers are a per-vendor problem: the matrix instruction surface is the most fragmented part of modern computing.

ISA vs ABI · don't conflate them

ISA = what instructions exist and what they do. ABI (Application Binary Interface) = the conventions layered on top: how arguments are passed, which registers are caller-saved, how the stack is aligned, how exceptions propagate. Two compilers targeting the same ISA can still disagree at the ABI level. The "x86-64 System V ABI" and the "Microsoft x64 ABI" both produce x86-64 code, yet functions compiled under one cannot freely call functions compiled under the other.
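The disagreement is visible in something as basic as where the first few integer arguments go. The register orders below are the ones the two ABI documents specify; the lookup table and the `place_int_args` helper are a hypothetical illustration that ignores floating-point and struct arguments.

```python
# Integer/pointer argument registers, in order, for the two common x86-64 ABIs.
INT_ARG_REGISTERS = {
    "sysv": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],   # System V (Linux, macOS)
    "ms_x64": ["rcx", "rdx", "r8", "r9"],               # Microsoft x64 (Windows)
}


def place_int_args(abi: str, n_args: int):
    """Return where each of the first n_args integer arguments lives."""
    regs = INT_ARG_REGISTERS[abi]
    return [regs[i] if i < len(regs) else "stack" for i in range(n_args)]


print(place_int_args("sysv", 5))    # ['rdi', 'rsi', 'rdx', 'rcx', 'r8']
print(place_int_args("ms_x64", 5))  # ['rcx', 'rdx', 'r8', 'r9', 'stack']
```

Same instructions, same registers, different agreement about who puts what where: a caller following one table and a callee following the other will read garbage without a single illegal instruction being executed.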

§ 6 · The bridge between worlds
FFI · the boundary

FFI stands for Foreign Function Interface. It is the mechanism a program written in one language uses to call a function written in another. Every modern ML system has at least one FFI inside it: Python ↔ C++, C++ ↔ CUDA, JIT-compiled binary ↔ runtime, Rust ↔ C, JavaScript ↔ Wasm. If a system feels stitched together, the seams are usually FFI calls.

An FFI has to solve three problems simultaneously:

  1. ABI mismatch. The two languages may not even agree on what a function call looks like at the machine level.
  2. Data representation. Python lists, C arrays, NumPy arrays, PyTorch tensors, JAX arrays: they all describe "a chunk of numbers" but with different metadata, layout, and ownership.
  3. Lifetime & ownership. A GC'd object passed into C++ must not be freed while the C++ side is still using it. A C++ object handed to Python must release its memory when Python loses the last reference.

Plate VI · FFI bridges in a typical ML stack (scale: language boundary)
[Figure: four columns joined by three bridges. Python world (PyTorch / JAX / NumPy) ⇄ C++ runtime (ATen · libtorch · TVM-FFI) via pybind11 or TVM-FFI, with tensors exchanged through DLPack (zero-copy when the device matches); C++ runtime ⇄ GPU runtime (CUDA / HIP driver) via driver calls for allocation, copies, module loading, and kernel launch; GPU runtime ⇄ device kernel (HSACO · cubin) via dispatch onto the command queue. Three bridges, three failure modes: Python ⇄ C++ fails on version skew, the GIL, and refcounts; C++ ⇄ driver fails on driver/runtime and ABI mismatch; driver ⇄ device fails on out-of-memory and lost devices.]
Plate VI: three FFI bridges sit between your Python and an MFMA instruction firing. Each bridge has a stable ABI that both sides agree on. The hottest area right now is the leftmost bridge, where projects like Apache TVM-FFI are trying to define a single C ABI that PyTorch, JAX, CuPy, and Paddle can all consume without per-framework glue.
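The three problems an FFI has to solve show up even in the smallest possible foreign call. The sketch below uses the standard-library `ctypes` module to call C's `strlen`; it assumes a POSIX system, where `ctypes.CDLL(None)` exposes the C library already loaded into the Python process (on Windows the library would have to be loaded differently).

```python
import ctypes

# Assumption: POSIX. CDLL(None) gives access to symbols already loaded
# into this process, which includes the C library.
libc = ctypes.CDLL(None)

# 1. ABI: declare the C signature so ctypes marshals the call correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# 2. Data representation: a Python str is not a C string. Encoding it to
#    bytes yields a NUL-terminated buffer that ctypes can pass by pointer.
message = "matmul".encode("utf-8")

# 3. Lifetime: `message` stays referenced for the duration of the call,
#    so the buffer cannot be freed while C is still reading it.
length = libc.strlen(message)
print(length)  # 6
```

pybind11 and TVM-FFI automate exactly these three steps at much larger scale; when one of them goes wrong, the symptoms are the failure modes listed under Plate VI.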

A closer look at TVM-FFI

细看 TVM-FFI

Apache TVM-FFI is the cleanest contemporary example of "stable C ABI for ML kernels." Its pitch: a kernel library (FlashInfer, TileLang, NVIDIA's CuTe DSL) should be able to ship one wheel that PyTorch / JAX / CuPy / Paddle can all load. The way it gets there is:

Apache TVM-FFI 是当前"为 ML kernel 提供稳定 C ABI"最干净的例子。 卖点: 一个 kernel 库(FlashInfer、 TileLang、 NVIDIA 的 CuTe DSL)应该能发一个 wheel, 让 PyTorch / JAX / CuPy / Paddle 都装得上。 它的做法是:

  • A minimal C ABI for calls, errors, types — about a thousand lines of header.
  • 一份极小的 C ABI, 覆盖调用、 错误、 类型 —— 大约一千行头文件。
  • Tensor interop via DLPack — the cross-framework "I am a tensor in device memory at address X with strides Y" protocol.
  • tensor 互通用 DLPack —— 跨框架的"我是一块设备内存上 X 地址、 stride 是 Y 的 tensor"协议。
  • Language bindings for Python, C++, Rust — same C ABI underneath.
  • Python、 C++、 Rust 三种语言绑定 —— 底下都是同一个 C ABI。
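
To make "a C ABI that any language can call" concrete, here is a minimal sketch using `ctypes` from the standard library. It does not use TVM-FFI or any of its APIs; it calls a C function exported by the running CPython interpreter, which is simply the one C library guaranteed to be loaded. The point is how little the caller needs: a symbol name, the argument types, and the return type.

```python
import ctypes
import sys

# ctypes.pythonapi exposes the C functions exported by the running
# interpreter. Calling one needs only what a C ABI promises: a symbol
# name, argument types, and a return type.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.argtypes = []               # C signature: const char *Py_GetVersion(void)
get_version.restype = ctypes.c_char_p   # NUL-terminated C string

version_from_c = get_version().decode("utf-8")
print(version_from_c)
print(sys.version.split()[0])  # the same version number, seen from Python
```

Nothing here depends on C++ name mangling, templates, or a particular compiler version, which is why a plain C ABI is the usual choice when one binary has to be loaded by several frameworks.
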

If you have ever debugged a kernel wheel that says undefined symbol: _ZN3c106detail... after a PyTorch upgrade, you have hit the problem that stable cross-framework ABIs are trying to solve.

如果你曾经在 PyTorch 升级后 debug 过 undefined symbol: _ZN3c106detail... 的 kernel wheel, 你就撞过"跨框架稳定 ABI"想解决的那个问题。

DLPack · the tensor passport

DLPack · tensor 的护照

DLPack is a small struct describing a tensor: pointer, device, dtype, shape, strides, byte offset. Every major framework knows how to import and export this struct, so you can call torch.from_dlpack(jax_array) and pass a tensor across frameworks with zero copies, provided both sides can address the device the data already lives on. It is the unsexy plumbing that holds the modern ML ecosystem together.

DLPack 是一个小 struct, 描述一个 tensor: pointer、 device、 dtype、 shape、 strides、 byte offset。 主流框架都知道如何导入导出这个 struct, 所以你可以调用 torch.from_dlpack(jax_array), 只要两边都能访问数据所在的那块设备内存, tensor 就能跨框架零拷贝传递。 它是把现代 ML 生态粘起来的不起眼的水管。
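
The struct is small enough to mirror by hand. The sketch below rebuilds it with `ctypes` for a CPU buffer. The field order and the constants (`kDLCPU = 1`, `kDLFloat = 2`) follow my reading of `dlpack.h` and should be checked against the header version you target; `describe` and `element` are hypothetical helpers, not part of any DLPack API.

```python
import ctypes

class DLDevice(ctypes.Structure):
    _fields_ = [("device_type", ctypes.c_int32), ("device_id", ctypes.c_int32)]

class DLDataType(ctypes.Structure):
    _fields_ = [("code", ctypes.c_uint8), ("bits", ctypes.c_uint8),
                ("lanes", ctypes.c_uint16)]

class DLTensor(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),                   # raw pointer to the bytes
        ("device", DLDevice),                        # where those bytes live
        ("ndim", ctypes.c_int32),
        ("dtype", DLDataType),
        ("shape", ctypes.POINTER(ctypes.c_int64)),
        ("strides", ctypes.POINTER(ctypes.c_int64)), # in elements, not bytes
        ("byte_offset", ctypes.c_uint64),
    ]

K_DL_CPU = 1    # assumed value of kDLCPU
K_DL_FLOAT = 2  # assumed value of kDLFloat

def describe(buf, shape, strides):
    """Hypothetical helper: wrap an existing float32 buffer without copying."""
    shape_arr = (ctypes.c_int64 * len(shape))(*shape)
    strides_arr = (ctypes.c_int64 * len(strides))(*strides)
    tensor = DLTensor(
        data=ctypes.addressof(buf),
        device=DLDevice(K_DL_CPU, 0),
        ndim=len(shape),
        dtype=DLDataType(K_DL_FLOAT, 32, 1),
        shape=shape_arr,
        strides=strides_arr,
        byte_offset=0,
    )
    # The descriptor only borrows memory, so the caller must keep these alive.
    return tensor, (buf, shape_arr, strides_arr)

def element(tensor, i, j):
    """Hypothetical helper: read element [i, j] of a 2-D float32 DLTensor."""
    base = ctypes.cast(tensor.data + tensor.byte_offset,
                       ctypes.POINTER(ctypes.c_float))
    return base[i * tensor.strides[0] + j * tensor.strides[1]]

buf = (ctypes.c_float * 6)(0, 1, 2, 3, 4, 5)

# Row-major 2 x 3 view, and a transposed 3 x 2 view over the same bytes.
row_major, keep_a = describe(buf, shape=(2, 3), strides=(3, 1))
transposed, keep_b = describe(buf, shape=(3, 2), strides=(1, 3))
```

Both descriptors point at the same six floats; only the shape and strides differ. That is all "zero copy" means here: the consumer receives a description of memory it does not own, which is also why the keep-alive tuple matters.
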

§ 7 · The whole stack on one example
One matmul, all the way down

§ 7 · 一个例子穿透整个栈
一个 matmul, 一路下去

Time to spend the vocabulary. Trace a single call, C = torch.matmul(A, B) on two FP16 4096 × 4096 matrices (so M = N = K = 4096) on an AMD MI300X, from Python entry to the actual MFMA instructions retiring on the wavefront. We will skip nothing important and add nothing extra.

把前面学的词花掉。 追一个调用: 在 AMD MI300X 上对两个 FP16 4096 × 4096 矩阵(M = N = K = 4096)做 C = torch.matmul(A, B), 从 Python 入口到 MFMA 指令在 wavefront 上发完为止。 重点都不会跳过, 也不会加多余的细节。

Step 1 · Python → CPython bytecode

Step 1 · Python → CPython 字节码

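CPython compiles your .py file to bytecode once, when the module is first loaded. At the call site, torch.matmul becomes a short sequence on the stack machine: load the name torch, LOAD_ATTR matmul, push the arguments, CALL. Total cost: a couple of microseconds, paid on every call.

CPython 在模块首次加载时把 .py 文件一次性编译成字节码。 在调用点, torch.matmul 变成栈式机器上的一小串指令: 加载名字 torch, LOAD_ATTR matmul, 压入参数, CALL。 总开销几微秒, 每次调用都付。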
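
You can look at this bytecode yourself with the standard `dis` module. The sketch below uses `math.hypot` as a stand-in for `torch.matmul` so that it runs without PyTorch installed; the call site has the same shape (module global, attribute lookup, two arguments, call). Exact opcode names differ between CPython versions (for example `LOAD_METHOD` or `CALL_METHOD` on older releases), so treat the printed list as version-dependent.

```python
import dis
import math

def call_site(a, b):
    # Same shape as `torch.matmul(A, B)`: look up a module global,
    # fetch an attribute from it, then call the result with two arguments.
    return math.hypot(a, b)

# Collect the opcode names CPython generated for the function body.
opnames = [ins.opname for ins in dis.get_instructions(call_site)]
print(opnames)
```
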

Step 2 · PyTorch dispatcher → ATen kernel

Step 2 · PyTorch dispatcher → ATen kernel

The dispatcher inspects the tensors' device (cuda/hip), dtype (f16), and layout, then looks up the registered implementation. We land in aten::matmul, which decides this is a 2D-by-2D case and delegates to aten::mm, which on a ROCm build delegates to at::native::mm_out_hip. That function asks hipBLAS / hipBLASLt for a GEMM, or — if you have flipped a config — calls into a fused custom kernel.

dispatcher 检查 tensor 的 device(cuda / hip)、 dtype(f16)、 layout, 然后查注册的实现。 落到 aten::matmul, 它判断这是 2D × 2D, 转交给 aten::mm, 在 ROCm 构建里又转交给 at::native::mm_out_hip。 这个函数请 hipBLAS / hipBLASLt 做一个 GEMM —— 或者在你切换了 config 时, 走进某个 fused 自定义 kernel。
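
The lookup described above is, at heart, a table keyed by operator, device, and dtype. The following toy version is a sketch of that idea only; it is not PyTorch's dispatcher, and every name in it (`register`, `dispatch`, `ToyTensor`) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str, str]  # (operator, device, dtype)
_REGISTRY: Dict[Key, Callable] = {}

def register(op: str, device: str, dtype: str):
    """Decorator that files an implementation under its dispatch key."""
    def decorator(fn):
        _REGISTRY[(op, device, dtype)] = fn
        return fn
    return decorator

class ToyTensor:
    def __init__(self, rows: List[List[float]], device="cpu", dtype="f32"):
        self.rows, self.device, self.dtype = rows, device, dtype

@register("mm", "cpu", "f32")
def mm_cpu_f32(a: ToyTensor, b: ToyTensor) -> ToyTensor:
    # Naive 2-D matrix multiply: each output entry is a row-by-column dot product.
    cols = list(zip(*b.rows))
    out = [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a.rows]
    return ToyTensor(out, a.device, a.dtype)

def dispatch(op: str, a: ToyTensor, b: ToyTensor) -> ToyTensor:
    """Inspect the operands, build the key, and call whatever is registered."""
    if (a.device, a.dtype) != (b.device, b.dtype):
        raise TypeError("operands must share device and dtype")
    key = (op, a.device, a.dtype)
    if key not in _REGISTRY:
        raise NotImplementedError(f"no kernel registered for {key}")
    return _REGISTRY[key](a, b)

result = dispatch("mm", ToyTensor([[1.0, 2.0], [3.0, 4.0]]),
                  ToyTensor([[5.0, 6.0], [7.0, 8.0]]))
```

A new accelerator "not having a backend" is the `NotImplementedError` branch here: the key exists conceptually, but nobody has registered a function under it.
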

Step 3 · FFI · Python ⇄ C++ ⇄ HIP runtime

Step 3 · FFI · Python ⇄ C++ ⇄ HIP runtime

Tensors cross the Python/C++ boundary through PyTorch's pybind11 layer (or, in newer toolchains, via TVM-FFI); strictly speaking, that crossing already happened on the way into the dispatcher in Step 2. Further down, the HIP runtime, AMD's counterpart to the CUDA runtime and driver, accepts a launch request and enqueues it on the current stream for the tensors' device.

tensor 通过 PyTorch 的 pybind11 层穿过 Python / C++ 边界(新工具链里可能走 TVM-FFI); 严格说, 这次穿越在进入 Step 2 的 dispatcher 之前就已经发生了。 再往下, HIP 运行时(AMD 对应 CUDA runtime 和 driver 的那一层)接收 launch 请求, 把它入到 tensor 所在设备的当前 stream。
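
"Enqueue on a stream" has a simple mental model: a FIFO queue drained by something other than the calling thread. The sketch below imitates that with a worker thread. `ToyStream` is hypothetical and only mirrors the ordering and synchronization behaviour, not any HIP or CUDA API.

```python
import queue
import threading

class ToyStream:
    """Hypothetical stand-in for a GPU stream: in-order, asynchronous work."""

    def __init__(self):
        self._work = queue.Queue()
        self.error = None
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # The "device" side: execute launches strictly in submission order.
        while True:
            fn, args = self._work.get()
            try:
                fn(*args)
            except Exception as exc:  # remember the failure, keep draining
                self.error = exc
            finally:
                self._work.task_done()

    def launch(self, fn, *args):
        # The "host" side: enqueue and return immediately.
        self._work.put((fn, args))

    def synchronize(self):
        # Block the caller until everything enqueued so far has finished.
        self._work.join()

stream = ToyStream()
log = []
stream.launch(log.append, "kernel A")
stream.launch(log.append, "kernel B")
stream.synchronize()
print(log)
```

Two properties carry over to the real thing: work on one stream runs in the order it was submitted, and the host only knows the work is done after it explicitly synchronizes.
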

Step 4 · Compiled kernel binary (HSACO)

Step 4 · 编译好的 kernel 二进制 (HSACO)

The kernel is not compiled at this moment; it was compiled when the library shipped (or the first time you ran it, if JIT). The binary is an HSACO file, AMD's equivalent of NVIDIA's .cubin, containing CDNA3 machine code for the MI300X's gfx942 ISA. If the kernel came from a DSL, it was produced earlier by: a DSL (FlyDSL, Triton) → MLIR → LLVM IR → LLVM AMDGPU backend → assembler → HSACO. A vendor library such as hipBLASLt ships kernels built ahead of time by its own toolchain, but they arrive in the same code-object format.

kernel 不是此时编译的; 是库发布时(或第一次跑时, 如果是 JIT)就编好了。 二进制是 HSACO 文件, 即 AMD 对应 NVIDIA .cubin 的格式, 含 MI300X 上 gfx942 ISA 的 CDNA3 机器码。 如果 kernel 来自 DSL, 它早先这样产生: DSL(FlyDSL、 Triton)→ MLIR → LLVM IR → LLVM AMDGPU backend → assembler → HSACO。 hipBLASLt 这类厂商库里的 kernel 由库自己的工具链提前编好, 但最终也是同一种 code object 格式。

Step 5 · Wavefront execution · MFMA instructions

Step 5 · Wavefront 执行 · MFMA 指令

The MI300X's command processor schedules the kernel onto Compute Units. Each Compute Unit runs wavefronts of 64 threads. Inside the hot loop, the machine code is a tight sequence of buffer_load_dwordx4 (each lane loads 16 bytes of A or B from global memory), ds_write_b128 / ds_read_b128 (stage through LDS), and v_mfma_f32_16x16x16_f16 (the matrix-multiply-accumulate). Get this loop's instruction mix right and you push the chip to 90%+ of peak FP16 FLOPs. Get it wrong and you sit at 30%.

MI300X 的 command processor 把 kernel 调度到 Compute Unit。 每个 Compute Unit 跑 64 线程的 wavefront。 hot loop 里的机器码是一串紧凑的 buffer_load_dwordx4(每个 lane 从 global 内存加载 16 字节的 A / B)、 ds_write_b128 / ds_read_b128(过 LDS 中转)、 v_mfma_f32_16x16x16_f16(矩阵乘加)。 这个循环的指令配比调对, 整片芯片能跑到 FP16 峰值 FLOPs 的 90%+。 调错, 就停在 30%。
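
It helps to put numbers on that loop. The arithmetic below counts the floating-point work in the 4096-cube matmul and how many 16x16x16 MFMA instructions it takes to cover it. The `utilization` helper is hypothetical, and the peak rate is left as a parameter on purpose: take it from the vendor's datasheet for your part rather than from this sketch.

```python
M = N = K = 4096

# One multiply plus one add per (m, n, k) triple.
total_flops = 2 * M * N * K

# A single v_mfma_f32_16x16x16_f16 updates a 16 x 16 output tile with a
# depth-16 slice of the reduction: 16 * 16 * 16 multiply-accumulates.
macs_per_mfma = 16 * 16 * 16
flops_per_mfma = 2 * macs_per_mfma
mfma_count = (M * N * K) // macs_per_mfma

def utilization(elapsed_seconds: float, peak_flops_per_second: float) -> float:
    """Fraction of peak achieved, given a measured kernel time (hypothetical helper)."""
    return total_flops / elapsed_seconds / peak_flops_per_second

print(f"{total_flops:,} FLOPs")          # 137,438,953,472 FLOPs
print(f"{mfma_count:,} MFMA issues")     # 16,777,216 MFMA issues
```

Roughly 137 GFLOP spread across about 16.8 million wavefront-level MFMA issues: every load, LDS round trip, or wait that delays one of those issues shows up directly in the utilization figure.
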

Step 6 · Result returns · GC

Step 6 · 结果返回 · GC

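The launch is asynchronous: the runtime returns control to Python as soon as the kernel is enqueued, and Python only waits for completion (tracked through stream events) when something forces a synchronization, such as copying the result to the host. The result tensor, a thin C++ object wrapping a device pointer + shape + strides + dtype (the same metadata DLPack exports), gets handed back to Python through pybind11. Python's reference counting owns the lifetime. When the last reference drops, the C++ tensor is destroyed and its device memory goes back to PyTorch's caching allocator for reuse.

launch 是异步的: kernel 一入队, 运行时就把控制权还给 Python; 只有遇到强制同步的操作(比如把结果拷回 host)时, Python 才会等待完成(完成状态由 stream 的 event 跟踪)。 结果 tensor, 一个包着 device pointer + shape + strides + dtype(也就是 DLPack 导出的那些元数据)的薄 C++ 对象, 通过 pybind11 回到 Python。 生命周期由 Python 的引用计数管; 最后一个引用消失时, C++ tensor 被销毁, 设备内存还给 PyTorch 的 caching allocator 以便复用。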
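
The hand-off between Python's object lifetime and device memory can be sketched with `weakref.finalize`. Everything here is hypothetical (`ToyDevicePool`, `ToyDeviceTensor`, the fake addresses); it models a caching pool that takes blocks back when the owning Python object dies, and it relies on CPython's reference counting to make that happen promptly.

```python
import weakref

class ToyDevicePool:
    """Hypothetical caching allocator: released blocks are kept for reuse."""

    def __init__(self):
        self.free_blocks = []        # list of (address, nbytes)
        self._next_address = 0x1000  # fake device address space

    def allocate(self, nbytes):
        # Prefer a cached block that is large enough.
        for i, (_, size) in enumerate(self.free_blocks):
            if size >= nbytes:
                return self.free_blocks.pop(i)
        block = (self._next_address, nbytes)
        self._next_address += nbytes
        return block

    def release(self, block):
        self.free_blocks.append(block)

class ToyDeviceTensor:
    def __init__(self, pool, nbytes):
        self.block = pool.allocate(nbytes)
        # Runs once, when the last reference to this object disappears.
        weakref.finalize(self, pool.release, self.block)

pool = ToyDevicePool()
c = ToyDeviceTensor(pool, 4096 * 4096 * 2)  # one FP16 4096 x 4096 result
first_address = c.block[0]

del c                                        # last reference dropped
reused = ToyDeviceTensor(pool, 1024)         # served from the cached block
```

On CPython the block returns to the pool the moment `del c` runs. On an interpreter with a tracing collector the release could be delayed, which is one reason frameworks expose explicit ways to free or empty their caches.
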

The whole journey in one sentence

整段旅程一句话

Python source → bytecode → PyTorch dispatcher → ATen C++ → pybind11/TVM-FFI → HIP runtime → HSACO machine code → CDNA3 wavefront executing MFMAs → result back through the same bridges in reverse. Every layer is replaceable; the joints are what make the system feel either elegant or brittle.

Python 源码 → 字节码 → PyTorch dispatcher → ATen C++ → pybind11 / TVM-FFI → HIP runtime → HSACO 机器码 → CDNA3 wavefront 跑 MFMA → 结果沿同一条路反向回来。 每一层都可替换; 真正决定系统优雅还是脆弱的, 是关节。

§ 8 · The places people run aground
Reefs · common confusions

§ 8 · 大家容易搁浅的地方
易混淆礁石 合集

A few patterns of confusion that keep recurring in conversations:

几个反复出现的混淆模式:

  • "IR" overloaded
    "IR"被严重重载
    A compiler can have a frontend IR, a middle IR, and a backend IR — three different IRs, same compiler. "Lowering the IR to LLVM IR" usually means "lowering my IR to LLVM's IR." Always pin down whose IR.
    一个编译器可以同时有 frontend IR、 middle IR、 backend IR —— 三种不同的 IR, 同一个编译器。 "把 IR lower 到 LLVM IR"通常是说"把我的 IR lower 到 LLVM 的 IR"。 先问清楚谁的 IR。
  • LLVM IR ≠ MLIR
    LLVM IR ≠ MLIR
    LLVM IR is one specific IR. MLIR is a framework for building (potentially many) IRs. MLIR happens to live in the LLVM repo and includes a dialect that mirrors LLVM IR, but the two are not the same thing.
    LLVM IR 是一种具体的 IR。 MLIR 是一个用于造(很多种) IR 的框架。 MLIR 恰好住在 LLVM 仓库里, 也包含一个镜像 LLVM IR 的 dialect, 但二者不是一回事。
  • PTX is virtual; SASS is real
    PTX 是虚拟的; SASS 才是真实的
    On NVIDIA GPUs, "the ISA" usually means PTX in conversation. PTX is a portable virtual ISA: ptxas ahead of time, or the driver's JIT at load time, translates it to SASS, the real, largely undocumented per-architecture ISA. You write PTX (or have a compiler emit it); you almost never see SASS unless you are debugging at the assembly level.
    说"NVIDIA GPU 的 ISA"通常指 PTX。 PTX 是可移植的虚拟 ISA: 由 ptxas 提前编译, 或由驱动在加载时 JIT, 翻译成 SASS, 即每代架构上真实的、 基本未公开的 ISA。 你写 PTX 或让编译器吐 PTX; 除非你在汇编级 debug, 否则几乎看不到 SASS。
  • ISA ≠ ABI
    ISA ≠ ABI
    ISA = what the chip executes. ABI = how compilers agree on calling conventions on top of that ISA. Same ISA, different ABIs is real — System V x86-64 and Microsoft x64 are both x86-64 but have different rules for which registers carry arguments.
    ISA = 芯片执行的指令。 ABI = 在 ISA 之上编译器们对调用约定的协议。 同一 ISA、 不同 ABI 是常态 —— System V x86-64 和 Microsoft x64 都是 x86-64, 但参数走哪些寄存器规则不同。
  • "Tensor Core" is a brand, not a category
    "Tensor Core"是品牌, 不是品类
    "Tensor Core" is NVIDIA's marketing name for their matrix instructions. The category is matrix-multiply-accumulate, and every vendor has one: AMD MFMA (CDNA) / WMMA (RDNA), Intel AMX, ARM SME, Apple AMX (different from Intel's, confusingly). Talking about "Tensor Cores" on AMD hardware is fine in casual conversation but technically the instructions are MFMAs.
    "Tensor Core"是 NVIDIA 给自家矩阵指令的营销名。 品类叫 matrix-multiply-accumulate, 每家都有: AMD MFMA (CDNA) / WMMA (RDNA)、 Intel AMX、 ARM SME、 Apple AMX(和 Intel 那个同名但不同, 经常搞混)。 在 AMD 硬件上口头说"Tensor Core"在非正式场合没事, 但严格说应叫 MFMA。
  • "Compiler" means at least three things in ML
    ML 里"编译器"至少指三种东西
    There is the AOT compiler (clang / rustc) that produces a binary from source. There is the framework compiler (torch.compile, JAX jit, XLA) that produces a graph from a Python function. There is the kernel compiler (Triton, FlyDSL) that produces a GPU binary from a kernel DSL. They share infrastructure but solve different problems.
    有 AOT 编译器(clang / rustc)把源码编成二进制; 有框架编译器(torch.compile、 JAX jit、 XLA)把一个 Python 函数变成一张图; 有 kernel 编译器(Triton、 FlyDSL)把一段 kernel DSL 编成 GPU 二进制。 三者共享基建, 但解决的问题不同。
  • "AOT" vs "JIT"
    "AOT" vs "JIT"
    AOT (ahead-of-time): compile once, ship a binary. C/C++, Rust, Swift, Go work this way. JIT (just-in-time): compile at runtime, often shape-specialized. JVM, V8, Julia, PyTorch's torch.compile, and most GPU kernel DSLs work this way by default. Most ML systems blend both: the framework ships AOT, the kernels JIT per shape.
    AOT(ahead-of-time): 编一次, 发二进制。 C / C++、 Rust、 Swift、 Go 走这条路。 JIT(just-in-time): 运行时编, 常按 shape 特化。 JVM、 V8、 Julia、 PyTorch 的 torch.compile、 以及大多数 GPU kernel DSL 默认都走 JIT。 大多数 ML 系统两者并用: 框架 AOT, kernel 按 shape JIT。
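
The "JIT per shape" pattern in the last item is mostly a cache keyed by shape. Here is a minimal sketch, assuming nothing beyond the standard library: "compiling" just means building a closure specialized to one length, and `compile_scale_kernel` and `scale` are hypothetical names.

```python
import functools

compile_log = []  # records every time the "compiler" actually runs

@functools.lru_cache(maxsize=None)
def compile_scale_kernel(length: int):
    """Pretend compiler: build a kernel specialized to one input length."""
    compile_log.append(length)

    def kernel(values, factor):
        # The trip count is baked in at "compile" time, as a real
        # shape-specialized kernel would bake in its loop bounds.
        return [values[i] * factor for i in range(length)]

    return kernel

def scale(values, factor):
    # JIT behaviour: compile on first use of a shape, reuse afterwards.
    kernel = compile_scale_kernel(len(values))
    return kernel(values, factor)

first = scale([1.0, 2.0, 3.0], 2.0)   # compiles for length 3
second = scale([4.0, 5.0, 6.0], 0.5)  # cache hit, no compile
third = scale([1.0, 2.0], 2.0)        # new shape, compiles again
print(compile_log)
```

This is also why a workload with constantly changing shapes can feel slow under a JIT: every new key in the cache pays the compile cost once.
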

Coda
Where to read next

余话
接下来读什么

If this primer convinced you that the territory is worth a closer look, here are the books and references that paid for themselves for me:

如果这篇科普让你觉得这片地值得再深看一眼, 这些是我自己花时间读过、 觉得值的书与文档:

For | 想学 | Read · 读这个
Computer architecture, the canonical text | 计算机体系结构, 经典教材 | Hennessy & Patterson · Computer Architecture: A Quantitative Approach
A friendlier compiler textbook | 更友好的编译原理教材 | Cooper & Torczon · Engineering a Compiler
LLVM, the design retrospective | LLVM 的设计回顾 | The Architecture of Open Source Applications, Vol. 1, chapter on LLVM (Chris Lattner, free online)
MLIR, the original paper | MLIR 原始论文 | Lattner et al. · MLIR: A Compiler Infrastructure for the End of Moore's Law (2020)
AMD GPU ISA reference | AMD GPU ISA 参考 | AMD CDNA3 ISA Reference Guide (PDF on AMD's site)
NVIDIA PTX reference | NVIDIA PTX 参考 | PTX ISA Programming Guide (NVIDIA Docs)
Real kernel code to learn from | 可以学的 kernel 代码 | CUTLASS · FlashAttention · vLLM · FlyDSL · TileLang

And — meta — the other entries in this Source Reading series cover concrete repos in this stack: SkyPilot (orchestration), SGLang and vLLM (inference engines), mini-SGLang (teaching version), gcnasm (AMD CDNA3 assembly), FlyDSL (layout-algebra Python DSL). Each is a worked example of one of the boxes in Plate I.

另外(meta): 这个 Source Reading 系列里其他几集, 各自精读了这个栈里的具体仓库: SkyPilot(编排)、 SGLang 和 vLLM(推理引擎)、 mini-SGLang(教学版)、 gcnasm(AMD CDNA3 汇编)、 FlyDSL(layout 代数 Python DSL)。 每一集都是 Plate I 中某一格的"动手版"。

The point of writing this primer was not to make you a compiler engineer. It was to give you back the vocabulary so that the next time something underneath leaks — the next time the abstraction stops being free — you can name what just happened. Naming things is most of debugging.

写这篇科普的目的不是要把你变成编译器工程师。 是把这套词汇还给你 —— 下一次有什么东西从下面漏出来、 抽象不再免费的时候, 你能给刚才发生的事起个名字。 而 debug 这件事大半就是命名。