§ 00 · Why this paper, why nowPrologue
§ 00 · 为什么读这篇, 为什么是现在序章
A tensor layout is the map from a logical tensor element — say A[row, col] — to the physical resource that holds it: which register, in which thread, in which warp, or which byte of shared memory. Get it wrong and the Tensor Core reads garbage; get it suboptimal and you pay for data movement you never needed. This paper makes one bet: that every layout a GPU compiler will ever need is a linear function over the two-element field — and that this single fact dissolves most of the layout problem.
张量布局, 就是从一个逻辑张量元素 —— 比如 A[row, col] —— 到承载它的物理资源的映射: 哪个寄存器、 在哪个线程、 哪个 warp, 或者共享内存的哪个字节。 搞错了, Tensor Core 读到的就是垃圾; 搞得不够优, 你就要为本不必要的数据搬运买单。 这篇论文押了一个赌注: GPU 编译器需要的每一种布局, 都是 二元域上的一个线性函数 —— 而这一个事实, 几乎溶解掉了整个布局问题。
Why now, and why I'm reading it for AMD. Our second goal is a multi-agent system that writes fast AMD kernels, and the broad goal is porting CUDA-only repos to ROCm. Both run straight into layouts: a kernel ported from NVIDIA to AMD is not "translated" until its data is re-tiled for the MI300X's 64-wide wavefront and its LDS banks. The paper proves AMD's mfma layouts are linear too, and its two flagship algorithms — optimal swizzling and warp-shuffle generation — are hardware-agnostic. That makes this not a Triton curiosity but a blueprint for the layout layer of any kernel agent we build.
为什么是现在, 以及我为什么从 AMD 的角度读它。 我们的第二个目标是一个会写高性能 AMD kernel 的多 agent 系统, 而宽目标是把 CUDA-only 的仓库移植到 ROCm。 两者都会一头撞进布局问题: 一个从 NVIDIA 移植到 AMD 的 kernel, 在它的数据被重新 tile 成 MI300X 的 64 宽 wavefront、 重新对齐到 LDS bank 之前, 根本算不上"翻译完成"。 这篇论文证明了 AMD 的 mfma 布局同样是线性的, 而它的两个旗舰算法 —— 最优 swizzling 和 warp-shuffle 生成 —— 都是硬件无关的。 这让它不只是一个 Triton 的趣闻, 而是我们要造的任何 kernel agent 中, 布局那一层的蓝图。
Stop writing per-layout interface methods and per-pair conversion code. Model each layout as a binary matrix M whose columns are hardware-index bits and whose rows are logical-tensor-coordinate bits, with all arithmetic over GF(2) (add = XOR, multiply = AND). Then converting layout A to B is B⁻¹A; broadcast is a zero column; a contiguous vectorizable run is an identity block; bank-conflict-free swizzling is a subspace you can solve for. The matrix is the abstraction.
别再为每种布局写一套接口方法、 为每一对布局写一段转换代码。 把每个布局建模成一个二元矩阵 M —— 列是硬件索引的位, 行是逻辑张量坐标的位, 所有运算都在 GF(2) 上(加法 = XOR, 乘法 = AND)。 于是布局 A 转到 B 就是 B⁻¹A; 广播是一个零列; 一段可向量化的连续元素是一个单位块; 无 bank 冲突的 swizzling 是一个可以解出来的子空间。 矩阵就是那个抽象。
§ 01 · The ground floorGroup, Ring, Field
§ 01 · 地基群、 环、 域
The hard part of layout algebra, for most of us with a CS background, is the algebra. So before the layouts, the floor under them. Think of it as toys (a set of elements) plus ways to play with them (operations). The more "harmonious" the playset, the more you can do — and the ladder from group to ring to field is exactly that, three rungs of increasing structure.
对我们大多数 CS 背景的人来说, 布局代数难的不是布局, 是代数。 所以在讲布局之前, 先把它脚下的地基铺好。 把它想成一堆玩具(元素的集合)加上几种玩法(运算)。 玩具组越"和谐", 能做的事就越多 —— 而从群到环到域这道阶梯, 正是如此: 结构一级比一级强的三个台阶。
{0, 1} — and its add/multiply are exactly the hardware's XOR and AND. A quirk that matters everywhere downstream: in F₂, a + a = 0, so every element is its own additive inverse and subtraction equals addition.{0, 1} —— 它的加法/乘法恰好就是硬件的 XOR 和 AND。 一个会贯穿后文的怪癖: 在 F₂ 里 a + a = 0, 所以每个元素都是自己的加法逆元, 减法等于加法。Why F₂ specifically, and not the reals? Because GPU indices are bits. A warp has 32 or 64 threads, an mma tile is a power of two on each side, and Triton constrains every tensor dimension and every layout subdivision (registers per thread, threads, warps) to powers of two. The moment your coordinates are exactly log₂(N) bits, an XOR of bit-vectors is a linear combination, and a matrix over F₂ is the most natural description of how those bits recombine. Theory maps to the cheapest instructions the hardware has.
为什么偏偏是 F₂, 而不是实数? 因为 GPU 的索引本来 就是 位。 一个 warp 有 32 或 64 个线程, 一个 mma tile 每条边都是 2 的幂, Triton 把每个张量维度、 每个布局细分(每线程寄存器数、 线程数、 warp 数)都限制成 2 的幂。 一旦你的坐标恰好是 log₂(N) 位, 位向量的 XOR 就是 一次线性组合, 而 F₂ 上的矩阵 就是 描述这些位如何重组的最自然的语言。 理论直接落到硬件最便宜的那几条指令上。
A fair objection: F₂ aligns with powers of two, so what about a tensor whose shape isn't a power of two? In practice it's a non-issue — GPU hardware is built in powers of two from the ground up: 32/64-wide warps, power-of-two mma/wgmma tiles. The padding to a power of two happens at the tile boundary regardless, so the layout algebra never sees a non-power-of-two extent. The real limitation is elsewhere: a pure linear map y = Mx can't express an affine layout with a translation (y = Mx + b).
一个合理的质疑: F₂ 天然对齐 2 的幂, 那形状不是 2 的幂的张量怎么办? 实践中这不是问题 —— GPU 硬件从底层就是按 2 的幂搭起来的: 32/64 宽的 warp、 2 的幂的 mma/wgmma tile。 补齐到 2 的幂这件事无论如何都发生在 tile 边界上, 所以布局代数根本看不到非 2 次幂的尺寸。 真正的局限在别处: 纯线性映射 y = Mx 表达不了带平移的仿射布局(y = Mx + b)。
§ 02 · The central ideaA layout is a matrix
§ 02 · 核心思想布局就是一个矩阵
Here is the whole abstraction in one line. Take the hardware index — concatenate the bits of (register id, thread id, warp id) into one bit-vector x. Take the logical tensor coordinate — write (i, j) as a bit-vector y. A layout is the linear map y = M·x over F₂, where M is a 0/1 matrix. Computing y means: for each output row of M, AND it with x bitwise, then XOR the surviving bits. Each row of M decides how one output bit is built from the input bits; a 1 means "this input bit participates in the XOR," a 0 means it doesn't.
整个抽象一行就能说完。 拿硬件索引 —— 把(寄存器号、 线程号、 warp 号)的位拼接成一个位向量 x。 拿逻辑张量坐标 —— 把 (i, j) 写成位向量 y。 一个布局就是 F₂ 上的线性映射 y = M·x, 其中 M 是一个 0/1 矩阵。 算 y 的方法是: 对 M 的每一个输出行, 跟 x 按位 AND, 再把存活的位 XOR 起来。 M 的每一行决定一个输出位如何由输入位拼出来; 1 表示"这个输入位参与 XOR", 0 表示不参与。
x (here 8 bits: 2 register, 5 thread, 1 warp) is multiplied by the 0/1 matrix M over F₂ to produce the tensor coordinate y = (i, j). Multiply is AND, sum is XOR. A row with a single 1 just copies an input bit; a row with two 1s XORs two inputs (the seed of swizzling). A whole column of zeros means that hardware bit never affects the coordinate — which is exactly broadcasting.x(这里 8 位: 2 位寄存器、 5 位线程、 1 位 warp)在 F₂ 上乘以 0/1 矩阵 M, 得到张量坐标 y = (i, j)。 乘法是 AND, 求和是 XOR。 只有一个 1 的行只是拷贝一个输入位; 有两个 1 的行把两个输入 XOR 起来(这正是 swizzling 的种子)。 而整整一 列 全是 0, 意味着那个硬件位永远不影响坐标 —— 这恰好就是广播。That single picture is the entire reframing. Layout conversion stops being a wall of if/else and bit-twiddling and becomes a matrix product. A layout property — is it broadcasting? is it contiguous? — becomes a matrix property: does it have a zero column? does it contain an identity block? An optimization — minimize bank conflicts — becomes a search for an optimal subspace. The authors carry this through the entire Triton GPU backend without special-casing a single layout.
这一张图就是整个重构。 布局转换不再是一堵 if/else 加位操作的墙, 而变成一次矩阵乘法。 一个布局的性质 —— 它在广播吗? 它连续吗? —— 变成矩阵的性质: 它有零列吗? 它含一个单位块吗? 一个优化 —— 最小化 bank 冲突 —— 变成寻找一个最优子空间。 作者把这一条线贯穿整个 Triton GPU 后端, 没有为任何单一布局开特例。
§ 03 · Making it concreteThe 16×16 example
§ 03 · 把它落地16×16 的例子
The paper's motivating example (§4.1) is worth working by hand, because once you've traced one element through the matrix the abstraction stops feeling abstract. Tile a 16×16 tensor with 2×2 registers per thread, 4×8 threads (one 32-thread warp), and 2×1 warps. Each thread owns a contiguous 2×2 block. Since every count is a power of two, you can read the layout straight off the binary coordinates.
论文的引子例子(§4.1)值得手算一遍, 因为一旦你把一个元素穿过矩阵走一遍, 抽象就不再抽象了。 用 每线程 2×2 个寄存器、 4×8 个线程(一个 32 线程的 warp)、 2×1 个 warp 来切一个 16×16 的张量。 每个线程拥有一个连续的 2×2 块。 因为每个数量都是 2 的幂, 你可以直接从二进制坐标把布局读出来。
r1 shifts t9 one column right inside its 2×2 block (0,1); the thread index places that block at (2,2); the warp adds (0,0). XOR them: (2,3). The same answer falls out of one matrix-vector product y = L₈·x — no branches, no special cases.r1 把 t9 在它的 2×2 块内右移一列 (0,1); 线程索引把这个块放到 (2,2); warp 加上 (0,0)。 XOR 起来: (2,3)。 同样的答案也从一次矩阵-向量乘法 y = L₈·x 里掉出来 —— 没有分支, 没有特例。The labels matter as much as the bits. The input lives in a labeled vector space Reg × Thr × Wrp; the output in dim0 × dim1. Keeping the labels attached is what lets every later operation — composition, product, division — stay physically meaningful instead of becoming anonymous matrix shuffling. When the paper says it operates "label-wise," that's the bookkeeping that keeps "a register bit" from being confused with "a warp bit."
标签和位一样重要。 输入活在一个 带标签的 向量空间 Reg × Thr × Wrp 里; 输出在 dim0 × dim1 里。 把标签一直挂着, 才让后面每一个操作 —— 复合、 积、 除法 —— 都保持物理意义, 而不是退化成匿名的矩阵搬运。 论文说它按"label-wise"运作时, 指的就是这套记账法, 它让"一个寄存器位"不会跟"一个 warp 位"搞混。
§ 04 · The toolboxThe four operators
§ 04 · 工具箱四个算子
Once layouts are matrices, the operations you need to build, combine, decompose, and invert them are just linear algebra — but each one earns its place by solving a concrete code-generation problem. Four carry most of the weight.
一旦布局是矩阵, 你用来构建、 组合、 分解、 反转它们的操作就只是线性代数 —— 但每一个都靠解决一个具体的代码生成问题来挣得自己的位置。 四个算子扛起了大部分重量。
B⁻¹∘A. Product is the block-diagonal direct sum that builds a big layout from register/thread/warp pieces. Left-division asks "does this small layout sit inside the big one?" — the test for whether an ldmatrix can lower a transfer. Right-inverse runs the map backwards, and under broadcast picks the minimum-Hamming-weight solution so duplicated data all reads from one source.B⁻¹∘A 的原因。 积是分块对角的直和, 用寄存器/线程/warp 的碎片拼出大布局。 左除问的是"这个小布局嵌在大布局里吗?"—— 就是判断一条 ldmatrix 能不能降级这次传输。 右逆把映射倒着跑, 在广播下挑汉明权重最小的解, 让重复的数据都从同一个源读。Why convert = B⁻¹A is the whole game
为什么 转换 = B⁻¹A 就是整盘棋
Picture the old way. Triton supported a handful of layouts, and moving data from one to another meant a dedicated routine per pair: Blocked→MMA, MMA→Shared, Sliced→Blocked, each hand-written, each only covering a subset, each a place where a complex program could silently produce wrong results. The GitHub record is blunt about the cost — 12% of all bugs filed against Triton were layout-related.
想想旧的做法。 Triton 支持一小撮布局, 把数据从一种搬到另一种意味着每一对都要一个专门的例程: Blocked→MMA、 MMA→Shared、 Sliced→Blocked, 每个都手写、 每个只覆盖一个子集、 每个都是复杂程序里可能悄悄算错的地方。 GitHub 的记录把代价说得很直白 —— Triton 上提的所有 bug 里有 12% 跟布局相关。
With linear layouts, A and B are both matrices mapping hardware indices to the same logical tensor. To send data from A's layout to B's, you map A's hardware index up to the logical coordinate, then map that coordinate back down to B's hardware index — that's B⁻¹∘A, computed once by a generic F₂ Gaussian-elimination routine. The quadratic family of converters collapses into a single algorithm. Better still, decomposing the conversion matrix by resource (registers, threads, warps) tells the compiler where the data has to move: if the warp sub-block is the identity, no inter-warp movement is needed, which is the green light to use a warp shuffle instead of a round-trip through shared memory.
有了线性布局, A 和 B 都是把硬件索引映射到同一个逻辑张量的矩阵。 要把数据从 A 的布局送到 B 的布局, 你把 A 的硬件索引向上映射到逻辑坐标, 再把那个坐标向下映射回 B 的硬件索引 —— 这就是 B⁻¹∘A, 由一个通用的 F₂ 高斯消元例程算一次。 那一整族二次方数量的转换器, 坍缩成单个算法。 更妙的是, 把转换矩阵按资源(寄存器、 线程、 warp)分解, 会告诉编译器数据 需要 往哪搬: 如果 warp 子块是单位阵, 就不需要 warp 间搬运, 这就是用 warp shuffle 取代往返共享内存的绿灯。
§ 05 · The completeness claimEvery layout is linear
§ 05 · 完备性主张所有布局都是线性的
A framework is only as good as its coverage, so the paper grinds through every layout family Triton has and proves each is linear. The two big families split by where the data lives: distributed layouts spread a tensor across registers/threads/warps, and memory layouts store it in a programmable memory like shared memory.
一个框架的价值取决于它的覆盖面, 所以论文把 Triton 拥有的每一个布局族都啃了一遍, 证明每一个都是线性的。 两大族按数据住在哪里划分: 分布式布局 把张量铺在寄存器/线程/warp 上, 内存布局 把它存进像共享内存这样的可编程内存里。
I + C where C mixes coordinate bits to scatter consecutive rows across banks. That shear is mma swizzling.I + C, 其中 C 混入坐标位, 把连续的行打散到不同 bank 上。 那个剪切 就是 mma swizzling。The reframing pays off twice. First, broadcasting — long a source of subtle Triton bugs because finding which threads hold duplicate data in an arbitrary layout is genuinely fiddly — becomes "look for a zero column." Done. Second, swizzling, which used to be a magic incantation of per_phase / max_phase / vec constants, is revealed as an ordinary linear shear transformation: the diagonal I copies offset bits straight through, and the off-diagonal C tilts one coordinate by mixing in bits of another. Nothing mystical, just a matrix.
这个重构有两重回报。 第一, 广播 —— 长期是 Triton 微妙 bug 的来源, 因为在任意布局里找出哪些线程持有重复数据是真的麻烦 —— 变成了"找一个零列"。 完事。 第二, swizzling, 以前是一串 per_phase / max_phase / vec 常数的魔法咒语, 现在被揭示为一个普通的线性剪切变换: 对角的 I 把偏移位直接拷过去, 非对角的 C 通过混入另一个坐标的位来"倾斜"一个坐标。 不神秘, 就是个矩阵。
The completeness isn't academic. On a micro-benchmark enumerating every pair of common Triton dtypes for a simple matmul — 784 cases — old Triton passed 46.6%, because it never correctly handled small, low-precision MMA tiles (it can't support any MMA layout with more than 32 bits of contiguous elements in the tile's last dimension). Linear layouts passed 100%. Same hardware, same kernels; the difference is having a definition that covers all valid distributed layouts instead of a hand-picked subset.
这个完备性不是学术摆设。 在一个枚举 Triton 所有常见 dtype 两两组合做简单 matmul 的 micro-benchmark 上 —— 784 个 case —— 旧 Triton 只通过 46.6%, 因为它从来没正确处理过小尺寸、 低精度的 MMA tile(它支持不了任何在 tile 最后一维上有超过 32 位连续元素的 MMA 布局)。 线性布局通过 100%。 同样的硬件、 同样的 kernel; 区别在于有一个覆盖所有合法分布式布局的定义, 而不是手挑的一个子集。
§ 06 · The flagship algorithmOptimal swizzling
§ 06 · 旗舰算法最优 swizzling
Bank conflicts are the oldest tax in GPU shared memory: when two threads in a warp hit the same bank on different addresses, the access serializes. For decades the fix has been fixed swizzle patterns lifted from a hardware manual and a lot of trial and error. The paper's most important contribution turns this into an algorithm that derives the optimal swizzle for an arbitrary layout — provably maximizing vectorization while minimizing conflicts.
Bank 冲突是 GPU 共享内存里最古老的税: 当一个 warp 里的两个线程在不同地址上撞到同一个 bank, 访问就被串行化。 几十年来的解法是从硬件手册里抄固定的 swizzle 模式, 外加大量试错。 论文最重要的贡献, 是把这件事变成一个能为任意布局推导 最优 swizzle 的算法 —— 可证明地最大化向量化、 同时最小化冲突。
Vec × Bank × Seg. A conflict is exactly when two threads land on the same bank in different segments — i.e. span(Seg) ∩ span(Thr) ≠ {0}. The algorithm builds the segment basis from the complement of the thread-access subspace, so distinct segments map to distinct banks; it only dips into the conflicting space when the safe space is exhausted, giving a provably minimal-conflict swizzle. Below: a read of 16 elements splits into four conflict-free transactions.Vec × Bank × Seg。 冲突恰好发生在两个线程在不同段上落到同一个 bank 时 —— 也就是 span(Seg) ∩ span(Thr) ≠ {0}。 算法从线程访问子空间的 补空间 里构造段的基, 让不同的段映射到不同的 bank; 只有当安全空间耗尽时才借用冲突空间里的向量, 给出一个可证明冲突最小的 swizzle。 下方: 16 个元素的读取拆成四次无冲突事务。The same machinery powers warp-shuffle generation. When the conversion matrix's warp block is the identity (no inter-warp movement) and its register block can be made to vectorize, the data exchange can ride shfl.sync instead of bouncing through shared memory. The algorithm constructs a basis G from the source and target thread indices whose vectors are the XOR shuffle patterns the hardware wants, then tiles the exchange into a minimal number of rounds. Previously this was hand-written PTX; now it falls out of the linear-algebra crank, and the layout-conversion micro-benchmark shows up to 3.93× over shared-memory-always Triton, while gather hits 14.20×.
同一套机器也驱动 warp-shuffle 生成。 当转换矩阵的 warp 块是单位阵(没有 warp 间搬运)、 而它的寄存器块能被凑成可向量化时, 数据交换就能骑 shfl.sync, 而不必在共享内存里弹来弹去。 算法从源线程索引和目标线程索引构造一个基 G, 它的向量 就是 硬件想要的 XOR shuffle 模式, 然后把交换 tile 成最少的轮数。 这以前是手写 PTX; 现在它从线性代数的摇柄里掉出来, 布局转换的 micro-benchmark 显示相对"永远走共享内存"的 Triton 最高 3.93×, 而 gather 达到 14.20×。
§ 07 · Does it pay offResults & bug fixes
§ 07 · 它划算吗结果与修 bug
The honest headline is modest and the subplots are large. Across 265 real-world cases from TritonBench on three platforms, Triton-Linear lands between 0.96× and 1.40×, averaging 1.07×. The sub-1.0 cases are mostly runtime noise on tiny inputs. The wins concentrate where efficient primitives like ldmatrix / stmatrix get used in layout conversion and shared-memory load/store: int4_gemm, gemm, flex_attention.
诚实的标题很朴素, 副线却很大。 在三个平台上来自 TritonBench 的 265 个真实 case 中, Triton-Linear 落在 0.96× 到 1.40× 之间, 平均 1.07×。 低于 1.0 的 case 大多是小输入上的运行时噪音。 收益集中在高效原语如 ldmatrix / stmatrix 被用于布局转换和共享内存读写的地方: int4_gemm、 gemm、 flex_attention。
| Benchmark / dimension | Old Triton | Triton-Linear | |
|---|---|---|---|
| 基准 / 维度 | 旧 Triton | Triton-Linear | |
| mixed-precision matmul (784 cases) | 46.6% pass | 100% pass | |
| LD/ST vectorization width | misses cross-dim runs | 漏掉跨维连续段 | up to 7× wider |
| broadcast / reduce shared stores | redundant stores | 冗余 store | −76% instructions |
| layout conversion (warp shuffle) | always shared mem | 总走共享内存 | up to 3.93× |
| tl.gather (warp shuffle) | always shared mem | 总走共享内存 | up to 14.20× |
| real TritonBench (265 cases) | 1.00× | 1.07× avg · 1.40× max |
But raw speed undersells the contribution. The bigger story is robustness: layouts accounted for 12% of all Triton bugs, and linear layouts retire a swath of them by construction — a single verified matrix routine can't disagree with itself the way a dozen hand-written converters could. Some wins are purely structural: for the welford operator, Triton-Linear detects that two "equivalent" layouts (a Blocked and a Sliced) are actually the same map and lowers the conversion to a no-op — something the old system literally couldn't express, because it had no way to compare layouts of different types.
但只看裸速度, 低估了这个贡献。 更大的故事是鲁棒性: 布局占了 Triton 所有 bug 的 12%, 而线性布局靠构造方式退役掉其中一大片 —— 一个经过验证的矩阵例程, 不会像十几个手写转换器那样自相矛盾。 有些收益纯粹是结构性的: 对 welford 算子, Triton-Linear 检测到两个"等价"布局(一个 Blocked 和一个 Sliced)其实是同一个映射, 把转换降级成 no-op —— 这是旧系统根本表达不出来的事, 因为它没办法比较不同类型的布局。
On the AMD MI250, gains were only 1.00×–1.03× — and the reason is the interesting part. The big NVIDIA wins ride ldmatrix / stmatrix, hardware primitives AMD doesn't have. So the framework is doing its job (correctness, no regressions), but the headroom on AMD is gated by missing primitives, not by the layout algebra. That gap is exactly an opportunity: derive the AMD-native ds_read / ds_write swizzles from the same subspace search.
在 AMD MI250 上, 收益只有 1.00×–1.03× —— 而原因才是有意思的部分。 NVIDIA 上的大头收益骑的是 ldmatrix / stmatrix, 这是 AMD 没有的硬件原语。 所以框架在做它该做的事(正确、 无回退), 但 AMD 上的提升空间被缺失的原语卡住, 而不是被布局代数卡住。 这个缺口恰恰是机会: 用同一套子空间搜索, 推导出 AMD 原生的 ds_read / ds_write swizzle。
§ 08 · Carrying it homeFor our AMD work
§ 08 · 带回家对我们 AMD 的意义
Reading this with our two goals in mind, three things transfer directly.
带着我们两个目标读它, 有三件事可以直接迁移。
1 · The porting goal needs a layout layer, and this is its shape. A real NV→AMD port is not a syntax translation — a CUDA kernel only runs well on MI300X once its data is re-tiled for 64-wide wavefronts and re-swizzled for LDS banks. The paper proves mfma input/output layouts are linear, so the same matrix machinery that re-targets a Triton kernel from mma to wgmma can, in principle, re-target it to mfma. A porting agent that carries a layout matrix instead of pattern-matching on layout names has a far smaller surface to get wrong.
1 · 移植目标需要一个布局层, 而这就是它的形状。 真正的 NV→AMD 移植不是语法翻译 —— 一个 CUDA kernel 只有在它的数据被重新 tile 成 64 宽 wavefront、 重新 swizzle 对齐到 LDS bank 之后, 才能在 MI300X 上跑得 好。 论文证明了 mfma 的输入/输出布局是线性的, 所以那套把 Triton kernel 从 mma 重定向到 wgmma 的矩阵机器, 原则上也能把它重定向到 mfma。 一个携带布局矩阵、 而不是对布局名字做模式匹配的移植 agent, 出错的表面积要小得多。
2 · The kernel-agent goal wants the optimal-swizzle algorithm as a tool. Our second goal is a multi-agent system that writes fast AMD kernels through a generate → benchmark → analyze loop. Bank-conflict tuning is one of the most tedious, error-prone steps in that loop — and the paper gives a closed-form, provably optimal swizzle solver that doesn't care whether the hardware is NVIDIA or AMD. Wiring that solver in as an agent tool means the loop spends its iteration budget on the genuinely open questions (tiling, fusion, occupancy) instead of rediscovering swizzle patterns by trial and error.
2 · kernel-agent 目标想把最优 swizzle 算法当工具。 我们的第二个目标是一个通过 生成 → benchmark → 分析 循环来写高性能 AMD kernel 的多 agent 系统。 bank 冲突调优是这个循环里最枯燥、 最易错的步骤之一 —— 而论文给了一个闭式、 可证明最优的 swizzle 求解器, 它不在乎硬件是 NVIDIA 还是 AMD。 把这个求解器接成一个 agent 工具, 意味着循环可以把迭代预算花在真正开放的问题上(tiling、 融合、 occupancy), 而不是靠试错重新发现 swizzle 模式。
3 · The missing-primitive gap is a concrete first kernel to win. The MI250 result says the layout algebra is ready but AMD lacks the ldmatrix/stmatrix fast paths NVIDIA exploits. Per our "win one kernel completely before generalizing" rule, that points at a sharp target: take one GEMM or attention kernel on MI300X, use the linear-layout swizzle solver to drive the ds_read/ds_write path to conflict-free, and measure against the CK hand-written baseline. A clean win there validates the whole layout-as-matrix approach on our hardware before we scale it across the porting pipeline.
3 · 缺失原语的缺口, 是一个具体的"先赢下来"的 kernel。 MI250 的结果说布局代数已经就绪, 但 AMD 缺少 NVIDIA 利用的 ldmatrix/stmatrix 快路径。 按我们"先把一个 kernel 完全赢下来再泛化"的规则, 这指向一个清晰的目标: 拿 MI300X 上一个 GEMM 或 attention kernel, 用线性布局的 swizzle 求解器把 ds_read/ds_write 路径推到无冲突, 对着 CK 手写 baseline 测。 在那里干净地赢一把, 就在我们把这套方法铺到整条移植流水线之前, 先在自己的硬件上验证了"布局即矩阵"这整个思路。
The AMD–NVIDIA gap here is not physics. The swizzle math is identical over F₂; what NVIDIA has and AMD lacks is a decade of primitives like ldmatrix that turn the optimal layout into one instruction. That's a software-maturity gap, and software-maturity gaps are exactly what systematic search closes. The paper hands us the search.
这里的 AMD–NVIDIA 差距不是物理。 swizzle 的数学在 F₂ 上完全一样; NVIDIA 有而 AMD 没有的, 是十年积累的像 ldmatrix 这样、 能把最优布局变成一条指令的原语。 那是软件成熟度的差距, 而软件成熟度的差距, 恰恰是系统化搜索能补上的。 论文把搜索交到了我们手里。
§ 09 · ClosingEpilogue
§ 09 · 收尾尾声
What I keep coming back to is how little the paper invents and how much it reorganizes. There's no new hardware, no exotic IR, no learned model. There's one observation — GPU indices are bits, so layouts are linear maps over F₂ — pushed relentlessly through an entire compiler backend until the quadratic mess of per-pair converters collapses into B⁻¹A, broadcasting becomes a zero column, and bank-conflict-free swizzling becomes a subspace you solve for. That's the kind of move worth studying: not a bigger hammer, but the right coordinate system, in which the hard problems turn out to be linear.
我反复回味的, 是这篇论文发明得多么少、 重组得多么多。 没有新硬件、 没有奇异的 IR、 没有学出来的模型。 只有一个观察 —— GPU 索引是位, 所以布局是 F₂ 上的线性映射 —— 被不依不饶地推过整个编译器后端, 直到那一堆二次方数量的逐对转换器坍缩成 B⁻¹A, 广播变成一个零列, 无 bank 冲突的 swizzling 变成一个你解出来的子空间。 这才是值得研究的招式: 不是更大的锤子, 而是对的坐标系 —— 在那个坐标系里, 难题原来都是线性的。
For us, the takeaway is operational. When we build the layout layer of an NV→AMD porting agent or a kernel-optimization loop, the data structure should be a labeled F₂ matrix, and the operations should be composition, division, and the swizzle solver — not a registry of named layouts and a pile of conversion special cases. The paper already paid the cost of proving that's enough. We get to start from the abstraction.
对我们而言, 收获是可操作的。 当我们去造一个 NV→AMD 移植 agent 或一个 kernel 优化循环的布局层时, 数据结构应该是一个带标签的 F₂ 矩阵, 操作应该是复合、 除法、 和 swizzle 求解器 —— 而不是一张命名布局的注册表加一堆转换特例。 论文已经替我们付了"证明这够用"的代价。 我们可以直接从这个抽象出发。
§ 10 · SourcesReferences & citation
§ 10 · 出处参考与引用
- Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F2Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F2
The paper this reading is built on. 本文所依据的论文。 arXiv:2505.23819 - A note on the algebra of CuTe Layouts
Colfax's writeup of the closest prior art — CuTe's layout algebra, the library-side analogue to this compiler-side framework. Colfax 对最接近的先前工作的整理 —— CuTe 的 layout 代数, 是这套编译器侧框架在库侧的对应物。 layout_algebra.pdf - Triton GPU backend (MLIR / TritonGPU dialect)Triton GPU 后端(MLIR / TritonGPU 方言)
Where linear layouts are integrated, as the GPU code-generation layer. 线性布局被集成进去的地方, 即 GPU 代码生成层。 github.com/triton-lang/triton