A tuning script looks boring until you ask what it protects. This one protects production MoE inference from a dangerous assumption: that a fast kernel in isolation is the same thing as a valid fused MoE configuration.
The target file is not a small benchmark harness. It is a 4,259-line tuner that reads MoE shapes from CSV, generates candidates across ASM, CK, CK-Tile, and FlyDSL paths, benchmarks each candidate through a common multiprocessing executor, validates output against torch or operator references, then post-processes the results into a row that fused_moe can consume in production.
The script's load-bearing abstraction is the task tuple. Once every backend can be expressed as "generate data, run candidate, run reference, compare, record time," the rest of the tuner becomes a table-building and ranking problem.
01 · The file map
The script lives under csrc/ck_gemm_moe_2stages_codegen/, next to the CK two-stage codegen helpers and common C++ headers. Its Python neighbors matter: gemm_moe_ck2stages_common.py exposes generated CK kernel manifests, while aiter/utility/base_tuner.py and aiter/utility/mp_tuner.py supply the generic CLI, batching, multiprocessing, timeout, and comparison machinery.
| File | Role in the tuner |
|---|---|
| gemm_moe_tune.py | MoE-specific task generation, reference selection, post-processing, config writeback. |
| base_tuner.py | CLI defaults, --run_config, --compare, batching, CSV merge and update rules. |
| mp_tuner.py | Process-per-task isolation, GPU assignment, timing, correctness comparison, timeout handling. |
| gemm_moe_ck2stages_common.py | CK stage1/stage2 kernel lists consumed by the candidate generator. |
The tuner is MoE-specific, but the execution skeleton is shared with the rest of AITER's tuning infrastructure.
02 · The entrance is small; the inherited run loop is large
The file's own entry point is tiny: define the key columns, define result columns, instantiate FmoeTuner, parse args, run. The real control flow is inherited from TunerCommon.run(). That generic run loop first calls pre_process(), then either exits through run_config() or enters the tune/post-process/write loop.
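The control flow described above can be sketched in a few lines. This is a minimal illustration, not the real TunerCommon: the class name TunerSketch, the tune() and write() method names, the dict-based args, and the trace list are all assumptions made for the example. Only pre_process(), run_config(), post_process(), and the branch order come from the description.

```python
class TunerSketch:
    """Hypothetical stand-in for the inherited run loop; not the real TunerCommon."""

    def __init__(self):
        self.trace = []  # records which steps ran, for illustration only

    def pre_process(self, args):
        self.trace.append("pre_process")

    def run_config(self, args):
        self.trace.append("run_config")

    def tune(self, args):
        self.trace.append("tune")
        return [("shape0", 12.5, 0.0)]  # placeholder (info, us, err) rows

    def post_process(self, rows, args):
        self.trace.append("post_process")
        return rows

    def write(self, rows, args):
        self.trace.append("write")

    def run(self, args):
        self.pre_process(args)
        if args.get("run_config"):
            # Benchmark an existing config, then exit without tuning.
            self.run_config(args)
            return self.trace
        rows = self.tune(args)
        rows = self.post_process(rows, args)
        self.write(rows, args)
        return self.trace
```

Calling TunerSketch().run({"run_config": True}) stops after the benchmark branch, while an empty args dict walks the full tune, post-process, write path.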
# csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py:4223
key = ["cu_num", "token", "model_dim", "inter_dim", "expert", "topk", ...]
resultList = ["block_m", "ksplit", "us1", "kernelName1", "err1", ...]
tuner = FmoeTuner("fmoeTuner", key, resultList, "fmoe tuner")
args = tuner.parse_args()
tuner.run(args, False)
The split is important. gemm_moe_tune.py owns MoE semantics; base_tuner.py owns orchestration. This keeps the shape filtering, compare mode, and CSV merging consistent across tuners, while letting MoE define its own candidate grammar.
The run loop has two branches: the run_config() branch benchmarks an existing config, and the tuning branch generates new candidates and writes a tuned config.
03 · The data factory normalizes a messy MoE world
The function generate_data() is the atom underneath every candidate. It builds random hidden states, expert weights, top-k routing, quantized weights, shuffled weights, sorted token IDs, sorted expert IDs, sorted weights, and a reusable MoE output buffer. Later helpers specialize that base dictionary for ASM stage1, CK two-stage, CK-Tile A8W4, FlyDSL FP4, and FlyDSL int4.
The core reason this factory is so large is quantization. The script has to support no-quant, per-token fp8, per-1x128 blockscale, per-1x32 MXFP4, and int4 variants. Each variant changes not just dtype, but scale layout, inter-stage activation format, weight shuffling, and sometimes the reference path.
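To make "scale layout" concrete, here is a small sketch of how the scale-tensor shape for an activation could depend on the quantization variant. The function name, the string labels, and the exact shapes are assumptions for illustration (one scale per row for per-token, one scale per 1x128 or 1x32 group along the hidden dimension for the block variants); the tuner's real layouts, especially for weights, may differ.

```python
import math


def activation_scale_shape(q_type, tokens, hidden):
    """Return the assumed scale-tensor shape for a (tokens, hidden) activation."""
    if q_type == "no":
        return None  # unquantized: no scale tensor at all
    if q_type == "per_token":
        return (tokens, 1)  # one scale per token row
    if q_type == "per_1x128":
        return (tokens, math.ceil(hidden / 128))  # one scale per 1x128 group
    if q_type == "per_1x32":
        return (tokens, math.ceil(hidden / 32))  # one scale per 1x32 group
    raise ValueError(f"unknown quant type: {q_type}")
```

Even in this toy form, switching the variant changes the shape of a side tensor that every downstream kernel wrapper has to accept, which is why the data factory cannot treat quantization as a dtype flag.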
# gemm_moe_tune.py:734
input -> fused_topk -> moe_sorting
w1/w2 -> weight_quant -> shuffle_weight / shuffle_weight_a16w4
q_type decides: a1_qt, a1_scale, w1_scale, w2_scale
return dict consumed by task arg keys
Why sorted IDs matter
MoE GEMM is grouped by expert. Sorting routes token-expert pairs into blocks so each kernel invocation sees dense expert-local work.
Why blockM is everywhere
blockM is both a scheduling knob and a sorting contract. It changes padding, candidate eligibility, and fairness costs.
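The two notes above can be illustrated together with a pure-Python sketch of expert-grouped sorting with block padding. It is not the production moe_sorting kernel: the function name sort_and_pad, the pad_id sentinel, and the output layout (one expert ID per block) are assumptions chosen to show the idea.

```python
from collections import defaultdict


def sort_and_pad(topk_ids, block_m, pad_id=-1):
    """Group (token, expert) pairs by expert and pad each group to a multiple of block_m."""
    by_expert = defaultdict(list)
    for token, experts in enumerate(topk_ids):
        for expert in experts:
            by_expert[expert].append(token)

    sorted_token_ids, sorted_expert_ids = [], []
    for expert in sorted(by_expert):
        tokens = by_expert[expert]
        padding = (-len(tokens)) % block_m  # slots needed to fill the last block
        padded = tokens + [pad_id] * padding
        sorted_token_ids.extend(padded)
        # One expert id per block, so each block maps to a single expert's weights.
        sorted_expert_ids.extend([expert] * (len(padded) // block_m))
    return sorted_token_ids, sorted_expert_ids


# Three tokens, top-2 routing over three experts.
routing = [[0, 2], [0, 1], [2, 0]]
ids_m2, experts_m2 = sort_and_pad(routing, block_m=2)
ids_m4, experts_m4 = sort_and_pad(routing, block_m=4)
```

With block_m=2 the sorted list has 8 slots, 2 of them padding; with block_m=4 it has 12 slots, 6 of them padding. The same routing produces a different amount of wasted work depending on blockM, which is why blockM acts as both a tuning knob and part of the sorting contract.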
04 · The task tuple is the grammar
The script's repeated shape is a tuple that says: here is the tag, here is how to generate data, here is the candidate function, here are the keys to pull from the data dictionary, here is the reference function, here are tolerances and optional compare functions. mp_tuner then turns that tuple into real work.
| Tuple slot | Meaning |
|---|---|
| tag | (info, stage, kernelName, blockM, flat?); this survives into result rows. |
| gen_data | Usually generate_data_2stages or generate_data_1stage. |
| func | The candidate kernel wrapper: CK, ASM, CK-Tile, or FlyDSL. |
| ref_func | Torch or operator reference used for correctness comparison. |
| compare_fn | Optional relaxed compare, such as cosine difference for low-precision paths. |
Every backend becomes the same language before it enters the executor.
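A toy version of the task tuple and its executor shows why this shape is enough to unify backends. Everything here is a sketch under assumptions: the Task field names and order, the arg_keys and tol slots, the max_abs_err default, and the single-process run_task are illustrative, and the real mp_tuner adds process isolation, GPU assignment, warmup, and timeouts.

```python
import time
from typing import Any, Callable, NamedTuple, Optional


class Task(NamedTuple):
    tag: tuple                      # identifies the candidate in result rows
    gen_data: Callable[[], dict]    # builds the input dictionary
    func: Callable[..., Any]        # candidate kernel wrapper
    arg_keys: tuple                 # which keys of the data dict to pass
    ref_func: Callable[..., Any]    # reference implementation
    tol: float                      # acceptable error
    compare_fn: Optional[Callable[[Any, Any], float]] = None


def max_abs_err(out, ref):
    """Default comparison: largest elementwise absolute difference."""
    return max(abs(a - b) for a, b in zip(out, ref))


def run_task(task):
    """Generate data, run candidate, run reference, compare, record time."""
    data = task.gen_data()
    args = [data[k] for k in task.arg_keys]
    start = time.perf_counter()
    out = task.func(*args)
    us = (time.perf_counter() - start) * 1e6
    ref = task.ref_func(*args)
    err = (task.compare_fn or max_abs_err)(out, ref)
    return task.tag, us, err


toy_task = Task(
    tag=("shape0", "stage1", "toy_kernel", 32, False),
    gen_data=lambda: {"x": [1.0, 2.0, 3.0], "scale": 2.0},
    func=lambda x, scale: [v * scale for v in x],
    arg_keys=("x", "scale"),
    ref_func=lambda x, scale: [scale * v for v in x],
    tol=1e-6,
)
tag, us, err = run_task(toy_task)
```

The executor never needs to know which backend produced func; it only needs the tuple to be complete.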
05 · Four backend families compete under one scoreboard
Candidate generation happens in four main families. The script first tries ASM stage1 candidates, then CK or CK-Tile two-stage candidates, then FlyDSL two-stage candidates, then a FlyDSL int4-specific path. Each path has hard eligibility rules because not every dtype/quantization/activation combination exists in every kernel family.
| Generator | Candidate family | Key constraints |
|---|---|---|
| gen_1stage_asm_task | ASM fused 1-stage MoE. | Manifest-driven; may include FLAT kernels and xbf16 blockscale variants. |
| gen_2stages_task | CK stage1/stage2 or CK-Tile A8W4. | Skips unsupported int4 and some SwiGLU MXFP4 cases. |
| gen_flydsl_2stages_task | FlyDSL FP4/FP8 two-stage kernels. | Requires QuantType.per_1x32 and fp4 weights; blockM in 32/64/128. |
| gen_flydsl_i4_2stages_task | FlyDSL int4-bf16 two-stage kernels. | Requires matching block_m == tile_m for both stages. |
This is also where the script's engineering personality shows. It does not simply try everything. It encodes lessons from past failures: mismatched int4 tile sizes break correctness; A8W4 should route to CK-Tile; split-k has model-dimension divisibility constraints; FlyDSL stage2 can try tile_m == blockM or one smaller tile, but the search goes no wider than that.
A shape row fans out into backend-specific candidates only when the dtype and quantization contract is legal.
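The eligibility gates can be pictured as early returns in each generator. The sketch below mirrors two constraints from the table (FlyDSL requires per_1x32 with fp4 weights and blockM in 32/64/128; the int4 path requires tile_m == block_m for both stages). The function names, string labels, and tuple outputs are hypothetical and much simpler than the real generators.

```python
def gen_flydsl_candidates(q_type, w_dtype, block_ms=(16, 32, 64, 128)):
    """Emit FlyDSL-style candidate tags only for a legal dtype/quant contract."""
    if q_type != "per_1x32" or w_dtype != "fp4":
        return []  # unsupported contract: emit nothing instead of failing later
    return [("flydsl", block_m) for block_m in block_ms if block_m in (32, 64, 128)]


def gen_int4_candidates(block_m, stage1_tiles, stage2_tiles):
    """Emit (stage1 tile, stage2 tile) pairs only when both equal block_m."""
    return [
        (t1, t2)
        for t1 in stage1_tiles
        for t2 in stage2_tiles
        if t1 == block_m and t2 == block_m
    ]
```

Returning an empty list for an illegal contract keeps the executor from spending GPU time on candidates that can only fail or produce wrong results.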
06 · Selection is not just min(us)
After mp_tuner returns (info, us, err) rows, post_process() groups rows by shape, filters invalid candidates, deduplicates by stage and block size, renames stage1/stage2 columns, merges stage1 and stage2 on shape plus block_m, and appends 1-stage rows as alternatives. Only then can it choose the minimum total time.
The tricky part is fairness. Some candidates fuse quantization or sorting internally; others require a separate quant/sort or activation-cast step. The post-processor benchmarks those missing costs and adds them to the appropriate rows. FLAT kernels get even more care: because FLAT handles sorting differently, the script can run a head-to-head e2e comparison before accepting the additive fairness result.
Fairness costs are added before the final row is chosen, because different candidates hide different pieces of the end-to-end path.
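A reduced model of the selection step makes the fairness argument concrete. This sketch assumes rows are plain dicts with kernel, block_m, and us fields, and that fairness costs are supplied as a per-kernel lookup; the real post_process() works on DataFrames, measures the missing costs itself, and handles many more cases, including the FLAT head-to-head comparison.

```python
def pick_best(stage1_rows, stage2_rows, one_stage_rows, fairness_us=None):
    """Pair stages on block_m, charge fairness costs, return the cheapest row."""
    fairness_us = fairness_us or {}
    candidates = []
    for s1 in stage1_rows:
        for s2 in stage2_rows:
            if s1["block_m"] != s2["block_m"]:
                continue  # both stages must share the same sorting contract
            extra = fairness_us.get(s1["kernel"], 0.0) + fairness_us.get(s2["kernel"], 0.0)
            candidates.append({
                "block_m": s1["block_m"],
                "kernels": (s1["kernel"], s2["kernel"]),
                "us": s1["us"] + s2["us"] + extra,
            })
    for row in one_stage_rows:
        extra = fairness_us.get(row["kernel"], 0.0)
        candidates.append({
            "block_m": row["block_m"],
            "kernels": (row["kernel"],),
            "us": row["us"] + extra,
        })
    return min(candidates, key=lambda r: r["us"]) if candidates else None


stage1 = [{"kernel": "s1_a", "block_m": 32, "us": 10.0},
          {"kernel": "s1_b", "block_m": 64, "us": 8.0}]
stage2 = [{"kernel": "s2_a", "block_m": 32, "us": 9.0},
          {"kernel": "s2_b", "block_m": 64, "us": 12.0}]
one_stage = [{"kernel": "asm_1stage", "block_m": 32, "us": 17.0}]

naive = pick_best(stage1, stage2, one_stage)
fair = pick_best(stage1, stage2, one_stage, fairness_us={"asm_1stage": 3.0})
```

In this made-up data the 1-stage kernel wins on raw time (17 us against 19 us), but once a 3 us cost that it does not cover is charged, the two-stage pair at block_m=32 wins. The ranking flips purely because of what each candidate leaves outside its own measurement.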
07 · --run_config is the production sanity check
The tuner has a second mode that does not generate candidates. --run_config reads shapes from a tuned CSV or an input CSV, points AITER_CONFIG_FMOE at the selected config when needed, clears operator caches, rebuilds JIT modules, then calls production fused_moe. This is the mode you use when you want to know whether the selected config actually works through the public operator path.
python3 csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py \
  --run_config aiter/configs/tuned_fmoe.csv
python3 csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py \
  -i aiter/configs/untuned_fmoe.csv --run_config
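The "point AITER_CONFIG_FMOE at the selected config" step is the kind of thing worth scoping tightly in your own scripts. The sketch below is one way to do it, not how the tuner itself does it: the helper name fmoe_config is hypothetical, and it does not model the cache clearing or JIT rebuild that the tuner performs afterwards.

```python
import os
from contextlib import contextmanager


@contextmanager
def fmoe_config(path):
    """Temporarily point AITER_CONFIG_FMOE at `path`, restoring the old value on exit."""
    previous = os.environ.get("AITER_CONFIG_FMOE")
    os.environ["AITER_CONFIG_FMOE"] = path
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("AITER_CONFIG_FMOE", None)
        else:
            os.environ["AITER_CONFIG_FMOE"] = previous
```

Wrapping a benchmark call in `with fmoe_config("candidate.csv"):` keeps a candidate config from leaking into later runs in the same process.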
--compare builds on this idea: run production before tuning, tune into a candidate CSV, run production after tuning, then update the final config only if --update_improved is set and the improvement threshold is met. That separation is conservative and correct.
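The update rule reduces to a small predicate. This is a sketch of the decision only: the function name should_update and the min_improvement parameter (a relative fraction) are assumptions, and the source does not say how the real threshold is expressed.

```python
def should_update(before_us, after_us, update_improved, min_improvement):
    """Decide whether the tuned config may replace the final one.

    before_us / after_us: production timings measured before and after tuning.
    min_improvement: required relative gain, e.g. 0.05 for 5 percent (assumed form).
    """
    if not update_improved or before_us <= 0:
        return False  # never overwrite without explicit opt-in and a valid baseline
    improvement = (before_us - after_us) / before_us
    return improvement >= min_improvement
```

The important property is that both timings come from the production path, so a candidate that only wins inside the tuner's own benchmark cannot overwrite the final config.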
08 · Hazards worth fixing or remembering
The script is practical, but it is not small, and several details are easy to misread. The most concrete issue I found is in calculate(): it unpacks stage, then resets it to an empty string before the stage-specific FLOP/BW branches. That means stage1 and stage2 reporting always falls through to the combined estimate. It does not change the winning kernel, because selection uses the us (microseconds) timing column, but it does make per-stage TFLOPS/BW misleading.
# gemm_moe_tune.py:1712
key, stage, kernelName, block_m, us, err = results
...
stage = "" # suspicious: stage1/stage2 branches below never run
if stage == "stage1":
...
elif stage == "stage2":
...
When using this tuner, trust the us timing column, the correctness status, and the final config fields first. Treat TFLOPS/BW as derived metadata until the calculate() stage reset is fixed.
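For reference, a version of the per-stage estimate without the reset might look like the following. The function name, signature, and FLOP formulas are assumptions (a gated stage1 with gate and up projections, a stage2 down projection, 2 FLOPs per multiply-add); the real calculate() also reports bandwidth and may count differently. The only point is that stage keeps the value it was unpacked with.

```python
def estimate_tflops(stage, token, topk, model_dim, inter_dim, us):
    """Rough TFLOPS for one result row; the FLOP formulas are assumptions."""
    rows = token * topk
    stage1_flops = 2 * rows * model_dim * (2 * inter_dim)  # gate and up projections
    stage2_flops = 2 * rows * inter_dim * model_dim        # down projection
    # Note: `stage` is used as unpacked; it is never overwritten before branching.
    if stage == "stage1":
        flops = stage1_flops
    elif stage == "stage2":
        flops = stage2_flops
    else:
        flops = stage1_flops + stage2_flops  # 1-stage or merged rows
    return flops / us / 1e6  # FLOPs per microsecond, scaled to TFLOPS
```

With the reset in place, every row would take the final branch and report the combined figure regardless of which stage was measured.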
There is also a broader engineering lesson. The tuner encodes many case-specific constraints: FLAT sorting, xbf16 internal quantization, a16wi4 tile matching, split-k divisibility, FlyDSL stage2 sort block size, and relaxed cosine comparison for low precision. If a new kernel family is added, copying the tuple shape is not enough. The new family must also declare which parts of the end-to-end path it fuses and which fairness costs still need to be charged.
The output row is a dispatch contract consumed by production operators, so tuning and validation cannot be separated.
The shortest useful mental model
Read the script as a compiler for tuning experiments. The input language is untuned_fmoe.csv. The intermediate representation is a list of task tuples. The runtime is mp_tuner. The optimizer is post_process(). The output artifact is tuned_fmoe.csv, which production fused_moe later reads as a dispatch table.
Once this model is in place, the long file becomes navigable. You no longer read 4,259 lines as one script. You read five contracts: shape contract, quantization contract, task contract, fairness contract, production dispatch contract.