Source Reading 006 — FlyDSL, A Layout-Algebra Python DSL with an MLIR Spine

A sixth entry in the source-reading series. After gcnasm’s descent into hand-written CDNA3 assembly, FlyDSL goes in the opposite direction — a Python DSL with a proper typed MLIR IR underneath, where layout algebra becomes a first-class concept and copy / MMA atoms compose into production GEMMs without leaving the Python editor. Full HTML deep dive at /sources/flydsl.html.

Why this repo, why now

There is a particular vertigo that comes from reading a modern GPU kernel and realizing how much of it is layout, not arithmetic. A 4096-cube FP16 GEMM is, viewed line-by-line, mostly bookkeeping — which thread loads which 8 elements, where in shared memory they land, in what order the MFMA instructions consume them, when the next K-block’s prefetch overlaps with the current MFMA tail. The actual multiplication is two lines. Everything else is layout.

FlyDSL (Flexible Layout pYthon DSL) is AMD’s response to that observation. It is a Python DSL where you author kernels using @flyc.kernel + @flyc.jit, but underneath sits a real MLIR dialect (the Fly dialect) with a !fly.layout type, layout algebra ops, and a pass pipeline that lowers everything to ROCDL and an HSA fatbin. The intellectual parent is NVIDIA CuTe; the AMD-specific contribution is making the algebra a typed IR rather than C++ templates.

I spent four hours reading the four examples under examples/ — vectorAdd, tiledCopy, tiledMma, preshuffle GEMM — and they turn out to be a strict pedagogical ladder. Each one adds exactly one concept to the previous, and the fourth example is essentially a compact version of every optimization that production CDNA3/CDNA4 GEMMs use. Reading them in order is the most efficient path into the codebase.

Five findings worth carrying

1. Layout is a typed IR concept, not a templated string. FlyDSL’s defining choice is that !fly.layout<(8,16):(1,8)> is a real MLIR type. Operations like fly.logical_divide and fly.partition_S consume and produce values of this type. The pass fly-layout-lowering at stage 3 of the pipeline materializes the algebra into concrete address arithmetic. This means any MLIR-aware tool (an autotuner, an analysis pass, a verifier) can reason about layouts as data, not as opaque template parameters.
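To make "materializes the algebra into concrete address arithmetic" tangible, here is a minimal plain-Python sketch of what the layout (8,16):(1,8) from finding 1 evaluates to. It is a model of the arithmetic only: layout_index, SHAPE and STRIDE are names I made up for illustration, and none of this is FlyDSL API or the output of fly-layout-lowering.

```python
# Plain-Python model of the layout (8,16):(1,8). Illustrative only, not FlyDSL API.
SHAPE = (8, 16)
STRIDE = (1, 8)


def layout_index(coord, shape=SHAPE, stride=STRIDE):
    """Map a coordinate to a linear offset: sum(coord[i] * stride[i])."""
    assert len(coord) == len(shape) == len(stride)
    for c, extent in zip(coord, shape):
        assert 0 <= c < extent, "coordinate out of bounds"
    return sum(c * d for c, d in zip(coord, stride))


# Walk the whole domain with the first mode varying fastest.
offsets = [layout_index((i, j)) for j in range(SHAPE[1]) for i in range(SHAPE[0])]

print(layout_index((3, 5)))  # 3*1 + 5*8 = 43
```

With these strides the 128 coordinates cover offsets 0 through 127 exactly once, so this particular layout is a compact column-major 8×16 tile. A different stride pair over the same shape would give a different map, which is the sense in which the layout is data a tool can inspect.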

2. partition_S + retile is the abstraction that pays off. In example 03, thr_copy_A.partition_S(bA) hands you the per-thread fragment with shape (V, VM, VN) — already correctly indexed for the underlying MFMA atom’s lane layout. Then retile gives you the same registers viewed under the MMA layout instead of the copy layout, at zero cost. Without retile you’d need a second fragment and explicit register-to-register copies. The MLIR pass pipeline collapses both views into one VGPR allocation.
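The "same registers, two views" idea in finding 2 can be sketched in plain Python. This is a toy under stated assumptions: the View class, its retile method, and the two (shape, stride) pairs are hypothetical stand-ins I chose for illustration, not FlyDSL's retile or any real MFMA lane layout. The only point it shows is that a second view can re-index the same storage without moving data.

```python
class View:
    """A (shape, stride) window onto shared storage. Hypothetical toy, not FlyDSL."""

    def __init__(self, storage, shape, stride):
        self.storage, self.shape, self.stride = storage, shape, stride

    def _offset(self, coord):
        return sum(c * d for c, d in zip(coord, self.stride))

    def __getitem__(self, coord):
        return self.storage[self._offset(coord)]

    def __setitem__(self, coord, value):
        self.storage[self._offset(coord)] = value

    def retile(self, shape, stride):
        """Return another view on the same storage; no data is copied."""
        return View(self.storage, shape, stride)


regs = [0] * 8                               # stand-in for 8 registers
copy_view = View(regs, (4, 2), (1, 4))       # assumed "copy" indexing
mma_view = copy_view.retile((2, 4), (1, 2))  # assumed "MMA" indexing

# Fill through the copy view ...
for m in range(2):
    for v in range(4):
        copy_view[v, m] = 10 * m + v

# ... and read the same storage back through the MMA view.
print(mma_view[1, 2])  # offset 1 + 2*2 = 5, which holds 11
```

Both views share one list, so there is exactly one allocation and no register-to-register copy between them.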

3. Preshuffle is a recurring pattern across the production kernels. Example 04’s shuffle_weight trick — reshape B on the host so a plain buffer_load_dwordx4 already lands MFMA-lane-correct in VGPRs — is not a one-off. kernels/preshuffle_gemm.py, blockscale_preshuffle_gemm.py, moe_gemm_2stage.py all use the same idea against different MFMA shapes and dtypes. For inference weights that never change, the trade saves the entire LDS round-trip on B, eliminating both the ds_write traffic and the swizzle that would otherwise be needed for bank-conflict-free B reads.
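A rough sketch of the host-side idea in finding 3, in plain Python. The reorder here is a deliberately simple one I picked for illustration (group VEC consecutive K-elements per column so one contiguous load returns them); the actual shuffle_weight layout depends on the MFMA shape and dtype and is not reproduced. preshuffle, lane_load, K, N and VEC are all hypothetical names.

```python
def preshuffle(b, k, n, vec):
    """Reorder row-major (k, n) data into [k // vec][col][k % vec] order."""
    assert len(b) == k * n and k % vec == 0
    out = []
    for kb in range(k // vec):
        for col in range(n):
            for kv in range(vec):
                out.append(b[(kb * vec + kv) * n + col])
    return out


def lane_load(shuffled, kb, col, n, vec):
    """One contiguous vec-wide read: what a lane would fetch for (kb, col)."""
    start = (kb * n + col) * vec
    return shuffled[start:start + vec]


K, N, VEC = 8, 4, 4
B = list(range(K * N))                # B[k][col] == k * N + col
shuffled = preshuffle(B, K, N, VEC)   # done once on the host

# Column 2, K-rows 4..7 now sit next to each other in memory.
print(lane_load(shuffled, 1, 2, N, VEC))  # [18, 22, 26, 30]
```

In the original row-major buffer those four values are N elements apart, so a lane would need four scalar loads or a trip through shared memory to gather them. After the one-time reorder they come back from a single contiguous read, which is the trade the finding describes for weights that never change.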

4. Schedulers are where the last 30% lives. A FlyDSL kernel without fx.rocdl.sched_* hints will land around 60–70% of peak FLOPs. With a tuned scheduler — count of MFMAs, count of ds_reads, count of buffer_loads, and the exact interleaving inside the hot loop — the same kernel can hit 90%+. The schedulers in kernels/preshuffle_gemm.py are tuned per (BM, BN, BK, MFMA-shape) tuple, typically with ATT traces from rocprofv3, and break silently when you change the tile size. The 30-line hot_loop_scheduler in example 04 is the minimum viable shape of this artifact.

5. CUDA Graph capture works out of the box, by design. The launch path goes through fly-gpu-stream-inject, an MLIR pass that threads the user-provided stream into the actual launch instead of consulting a thread-local variable. For an inference engine that batches kernels into a captured graph for replay (vLLM, SGLang), this is the difference between FlyDSL kernels being usable and being a corner case requiring special handling. Example 01 demonstrates this with a second test that captures the kernel into torch.cuda.CUDAGraph and replays it correctly.

★ The one insight that reframed my mental model

“Layout” is not a description of memory; it is a function. make_layout(shape, stride) defines a map coord ↦ index. composition, logical_divide, product are function composition / partition / extension on that map. Once you read FlyDSL through this lens, the gap between “what the code says” and “what the kernel does” closes by an order of magnitude. The function-on-functions framing also explains why the algebra survives compilation — every pass operates on the layout-as-function representation until the final lowering stage materializes it into address arithmetic. The whole pass pipeline is a sequence of layout-function transformations, not a sequence of code-template substitutions.
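The layout-as-function reading is easy to model in a few lines of plain Python. This is a sketch under assumptions: Layout and compose are toy helpers I wrote for illustration, they handle only flat (non-nested) shapes, and compose returns an ordinary callable rather than a new (shape, stride) pair the way a real layout algebra would.

```python
from math import prod


class Layout:
    """Toy model of a layout as a function coord -> index.
    Hypothetical helper for illustration; not the FlyDSL or CuTe API."""

    def __init__(self, shape, stride):
        assert len(shape) == len(stride)
        self.shape, self.stride = tuple(shape), tuple(stride)

    def size(self):
        return prod(self.shape)

    def coord_of(self, linear):
        """Split a linear index into a coordinate, leftmost mode fastest."""
        coord = []
        for extent in self.shape:
            coord.append(linear % extent)
            linear //= extent
        return tuple(coord)

    def __call__(self, coord):
        """Evaluate the map. An int is first unpacked into a coordinate."""
        if isinstance(coord, int):
            coord = self.coord_of(coord)
        return sum(c * d for c, d in zip(coord, self.stride))


def compose(a, b):
    """Function composition: (a o b)(c) = a(b(c))."""
    return lambda coord: a(b(coord))


a = Layout((4, 8), (8, 1))   # a 4x8 tile stored row-major
b = Layout((4,), (2,))       # "take every second element"
ab = compose(a, b)

print([ab(k) for k in range(4)])  # [0, 16, 1, 17]
```

Here b selects linear positions 0, 2, 4, 6 of a's domain, and a turns each of those into a memory offset. Nothing about memory was described anywhere; two maps were chained, and the addresses fell out at the end.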

What’s in the full reading

The HTML deep dive walks through seven modules (M0 to M6), followed by a Reefs section and AMD notes:

  • M0 — Compass: the four-example ladder, line counts, concept progression.
  • M1 — Layout algebra in ten minutes: shape, stride, layout, divide, slice, TV layout.
  • M2 — Example 01 vectorAdd: minimum viable kernel; BufferCopy vs UniversalCopy; @flyc.jit / Constexpr.
  • M3 — Example 02 tiledCopy: TV layout in full; partition_S/D; the (V, VM, VN) fragment shape.
  • M4 — Example 03 tiledMma: MFMA atoms; make_tiled_copy_A/B/C; retile’s two-view trick.
  • M5 — Example 04 preshuffle GEMM: host preshuffle, LDS XOR swizzle, two-stage pipeline, hot_loop_scheduler.
  • M6 — Compile pipeline: Python → MLIR → ROCDL → fatbin; the JIT cache; FLYDSL_DUMP_IR workflow.
  • Reefs: six traps from real debugging — branch-only values, SmemPtr._view_cache, stale schedulers, &c.
  • AMD notes: where FlyDSL sits relative to Triton-ROCm and Composable Kernel; a practical kernel-tuning workflow.

Plus six hand-coded SVG plates: the compilation pipeline, the two-stage divide cascade, the TV layout grid for tiledCopy, the MFMA wave-tiling of a 64×64 C tile, the preshuffle B before/after diagram, and the software-pipeline timing diagram showing how prefetch overlaps with MFMA.

→ Full deep dive at /sources/flydsl.html — rendered in a cyanotype-blueprint aesthetic (deep navy ground with chalk-white ink and rust / brass / jade / teal annotation marks, EB Garamond + IBM Plex Sans + JetBrains Mono), with all diagrams hand-coded inline SVG.

Primary references — what to read alongside

  1. FlyDSL repo — start with docs/layout_system_guide.md for the complete Quick Reference, then docs/kernel_authoring_guide.md for practical patterns. The production kernels under kernels/ are the dictionary that the four examples are the alphabet for.

  2. NVIDIA CUTLASS CuTe — the intellectual parent. Layout algebra, copy/MMA atom design, and the partition_S/D idiom are all CuTe ideas, ported to AMD with an MLIR backbone. Reading CuTe docs alongside FlyDSL clarifies which choices are universal and which are AMD-specific.

  3. Categorical Foundations for CuTe Layouts (Colfax Research) — formal treatment of layout algebra as a category. Sufficient to derive every algebraic identity FlyDSL relies on. Read this if you want to extend the algebra, propose custom product variants, or verify that two layouts are equivalent.

  4. AMD Instinct MI300 CDNA3 ISA Reference — authoritative on every instruction FlyDSL’s lowering emits. § 8 (MUBUF) for buffer_load, § 10 (MFMA) for the matrix core, § 6 (s_waitcnt) for the vmcnt / lgkmcnt mechanics that scheduling controls.

  5. MLIR documentation — for reading FLYDSL_DUMP_IR output. The gpu, arith, scf, memref, vector, and rocdl dialects that FlyDSL composes with are documented here; the fly dialect itself is documented in-repo under include/flydsl/Dialect/Fly/IR/.

  6. Triton-ROCm — the alternative AMD kernel-DSL most readers will know. FlyDSL trades Triton’s opacity-around-scheduling for explicit control via fx.rocdl.sched_*. Reading them side-by-side clarifies the design space: same target hardware, different control surfaces.

The right reading order if you are new to AMD kernel programming is roughly: № 4 (CDNA3 § 2-3 for hardware intuition) → this writeup → № 2 (CuTe for the algebra) → FlyDSL examples 01-04 → № 3 (categorical paper, optional) → production kernels in kernels/. The MLIR doc and Triton are lookups, not sequential reads.


Previous: Source Reading 005 — GCNasm. The next entry will likely cover aiter — the production AMD kernel library that FlyDSL’s tests reference, and a natural next layer up from the layout algebra explored here.