FILE № 005 2026.05.17 · gfx942 / gfx950 / gfx1030

GCNasm

A field reading of carlushuang's kata book — sixty-four self-contained kernels that teach what the CDNA3 ISA manual refuses to spell out, from buffer_load…lds tricks the assembler will not accept to the vmcnt(3) arithmetic that turns a vector add into a pipelined loom.

Source lines
72,878.s/.cc/.cu/.py
Kata folders
64independent
Roofline ceiling
4.56TB/s · MI308X
Reading time
~6hours, careful

§ 00 · Why this repo, why nowPrologue

There is a well-documented chasm in AMD GPU programming. On one bank stands the HIP-level tutorial: __global__, blockIdx, __shfl_xor. Friendly, portable, and almost entirely unable to express the half-dozen tricks that separate a 30% kernel from a 95% kernel. On the other bank stands the CDNA3 ISA manual: an authoritative 1,200-page artifact that catalogs every bit of every opcode, never once showing what a complete, optimized, hand-written kernel actually looks like.

carlushuang's gcnasm is a bridge built of plank-by-plank examples — 64 short folders, each one a complete program demonstrating one technique. The repo has no library, no central abstraction, no build system that ties anything together. Every folder has a build.sh, an .s or .cc or both, and (if you're lucky) a README. That structure is the point: each kata is something you can read, compile, run, and modify in an hour.

For someone whose work involves generating AMD kernels — whether by hand, by codegen, or by autonomous agent — this repo is a calibration target. It answers a question the official docs do not: what should a good kernel look like when a fluent AMD engineer writes it? The first time you see the vmcnt(3) incantation in vector_add_asm/vector_add_kernel.s:237 and trace through the FIFO logic in README.md:410-452, the chasm closes by a step.

Reading order

The cleanest path is roughly: M1 (CDNA3 background, so the rest makes sense) → M2 (vector_add_asm, the canonical example) → M3 (bandwidth_memread, your roofline ceiling) → M4 (MFMA, the only path to peak FLOPS) → M5/M6/M7 in any order. Skip the rest of this prologue if you already know why you are here.

Three things this writeup is not: it is not a build guide (the READMEs handle that), it is not a tutorial for HIP itself (assume you can write a basic kernel), and it is not a substitute for the ISA manual (read sections 8–10 of the AMD Instinct MI300 CDNA3 Instruction Set Architecture document alongside this — full link list in § References below). It is a reader's report: what each example teaches, what assumptions it leaves unstated, and where it fits in the larger geography of AMD optimization.

M0 · The cartographyTopography of 64 katas

Before walking through any single example, it helps to see the shape of the whole. gcnasm has no top-level layering — every folder is a peer — but the names cluster naturally into seven orthogonal categories. The README gives a tagged index; the table below distills it into something readable.

GroupCountAnchor examplesWhat you learn
Vector / element-wise3vector_add_asm, absdiffThe whole stack — persistent kernels, LDS, OOB, pipelining
GEMM + MFMA11hgemm_mfma, matrix_core_asm, opus_gemmTile layouts, AGPR scheduling, swap-and-swizzle
Bandwidth / memcpy10bandwidth_memread, membenchThe ceiling — peak HBM bandwidth on every part you own
Warp / wave primitives6wave_reduce_dpp, warp_sort_bitonicDPP control codes, ds_permute, readlane
Atomics / sync3cmpswap_atomic, cross-wg-syncbf16x2 atomic reduction, cross-workgroup ordering
Type conversion6cvt_fp8, pk_cvt, lqqFP8, INT4, packed conversions on the cheap
Integer arithmetic3magic_integer_divisionMultiply-shift replacement for division in address math
HW probing / IPS4measure_ips, hwregInstruction throughput, HW_ID register decoding
Python / PyTorch glue7co-exec, opus_fmm, warp_histogramHow to make a hand-written kernel callable from PyTorch

The folder names tell a small story by themselves. opus_* directories use the aiter / opus library — these are higher-level building blocks that look more like a kernel you would actually ship. The bare names (hgemm, sgemm, matrix_core) are the pedagogical minimums. The _asm suffix flags hand-written assembly. The _bench suffix means there's a measurement harness around the kernel.

Plate I Repo compass — sixty-four kernels on a six-axis grid Scale: 1 = one folder
LOAD COMPUTE REDUCE COMMUNICATE PROBE GLUE SCALAR VECTOR MATRIX WAVE META buffer_ld_oob OOB → 0 demo mubuf semantics int_divide_mod magic_div macros s_mul_hi shift vector_add_asm 5 patterns in 1 ★ canonical example bandwidth_memread 4.56 TB/s ceiling float4 + nt_load absdiff __builtin_amdgcn_sad_u16 SAD 8x8 memcpy_async async copy via LDS global → lds → global measure_ips IPS per instruction tight asm loop triton_ memread triton sanity hgemm_mfma production-ish GEMM 11 files, codegen matrix_core_asm hand MFMA loop AGPR scheduling matrix_core 3 layout strategies 32x32x8 fp16 transpose-lds avoid bank conflict swizzled LDS opus_attn attention via opus 7 files opus_fmm PyTorch ext wave_reduce_dpp 6-stage tree v_mov_b32_dpp warp_sort_bitonic med3 + DPP shuffle 31 files (largest) ds_permute cross-lane shuffle LDS-based xchg cmpswap_atomic bf16x2 atomic add compare-and-swap hwreg · hwreg_mask · smid · test_compiler · test_exp2 · test_tanh membench · ubench · measure_ips_bench (the meters) co-exec ⟵ standalone Python compile+launch loop
A six-by-five floor plan of the repo. Green-bordered tiles are the kernels you should read first; magenta marks cross-lane work; cyan marks data-movement; amber marks integer or probing utilities; crimson is atomics. The empty cells (e.g. scalar-reduce, scalar-communicate) are gaps the repo does not cover — those are problems CDNA solves with hardware features (s_barrier, kernarg loads) and so do not need standalone examples.
Steal this taxonomy

If you maintain your own AMD kernel knowledge base — Obsidian, Notion, internal wiki — this six-axis grid is a better organizational principle than alphabetical or chronological. Group by data shape × operation and the same kernels will be findable from either direction. The empty cells are also informative: when you find yourself wanting to add something there, you have probably found a problem that does not need a kernel.

M1 · The hardware behind the assemblyCDNA3 in ten minutes

Every line of vector_add_kernel.s assumes you know things this section will state outright. If you've written CUDA, two-thirds of this maps cleanly onto NVIDIA equivalents; the differences are concentrated and worth memorizing.

Wavefront, CU, SIMD — the execution unit

An AMD wavefront is 64 lanes, not 32. This single fact ripples through everything: every cross-lane primitive operates over 64-bit masks (exec, vcc), DPP shuffles cover 64 lanes in 6 tree stages instead of 5, and LDS bank conflicts behave differently because the bank arbiter sees twice as many threads per cycle.

A Compute Unit (CU) on CDNA3 holds 4 SIMDs, each 16 lanes wide. A wavefront occupies one SIMD, and the wave's instructions issue 16 lanes per cycle, taking 4 cycles to fully execute one 64-lane instruction. The wave scheduler interleaves up to 10 waves per SIMD (40 waves per CU); this is why occupancy is measured in "waves per SIMD" not "warps per SM."

Plate II CU exploded view — where a wavefront actually lives CDNA3 / MI300X · gfx942
MI300X — 304 CUs ONE COMPUTE UNIT SIMD 0 16 ALU · 10 waves 256 KB VGPR SIMD 1 16 ALU · 10 waves 256 KB VGPR SIMD 2 16 ALU · 10 waves 256 KB VGPR SIMD 3 16 ALU · 10 waves 256 KB VGPR LDS 64 KB · 32 banks · 128 B/cycle addr via M0 for buffer_load…lds MFMA matrix core (gfx942) fp16 · bf16 · fp8 · int8 · 32x32x8 / 16x16x16 ONE WAVEFRONT 64 lanes total · issued 16/cycle × 4 cycles cycle 0 cycle 1 cycle 2 cycle 3 exec mask 64-bit · vcc 64-bit · m0 per-wave UNIFIED REGISTER FILE VGPR [0 .. accum_offset) AGPR [accum_offset .. end) CDNA3 fuses VGPR+AGPR in one file. .amdhsa_accum_offset declares the split. MFMA destinations must be AGPR ("a" constraint), MFMA sources can be either.
A CU is four SIMDs sharing one 64 KB LDS bank and one matrix-core unit. Each wavefront pins itself to one SIMD and its 64 lanes execute in groups of 16 over 4 cycles — that's why vmcnt arithmetic counts operations, not cycles. The unified VGPR/AGPR register file is the CDNA-specific oddity: .amdhsa_accum_offset is the directive that draws the line.

The two register classes — VGPR vs AGPR

CDNA3 has one physical register file but two logical register classes. VGPRs are general-purpose vector registers. AGPRs are accumulator registers; only the MFMA matrix-core instructions can write them as destinations, and only MFMA inputs (and a handful of v_accvgpr_* moves) can read them. The directive .amdhsa_accum_offset declares where in the unified file the AGPR range begins. If you allocate 8 VGPRs and no AGPRs, you write .amdhsa_accum_offset 8 (vector_add_kernel.s:290). If you allocate 128 VGPRs and 128 AGPRs, you write .amdhsa_accum_offset 128 (matrix_core_asm/kernel.s:62).

Omitting the directive on a CDNA target causes the assembler to error out. This is not just a metadata field — it actually affects register allocation and forms part of the kernel descriptor that the GPU reads at launch.

The two wait counters — vmcnt and lgkmcnt

AMD memory operations are asynchronous. Each one increments a counter; software is responsible for waiting until the counter drains before reading the result. There are two independent FIFOs:

CounterTracksWait instructionDepth (CDNA3)
vmcntGlobal / buffer memory ops — including buffer_load…lds and buffer_stores_waitcnt vmcnt(N)up to 63
lgkmcntLDS reads/writes (ds_read, ds_write) plus scalar memory ops (s_load_dword)s_waitcnt lgkmcnt(N)up to 15

The semantics is "wait until at most N entries remain outstanding" — so vmcnt(0) drains everything, vmcnt(3) means "let three operations stay in flight." This is the foundation under everything in M2.

One trap worth flagging now: buffer_load…lds increments vmcnt (it's a memory op from the FIFO's perspective), but the final write to LDS can only be observed after the wait completes. So you need vmcnt(0) before the corresponding ds_read, even though ds_read reads LDS not global memory. vector_add_asm threads this needle by letting two loop iterations of latency hide between the load and the read.

The buffer resource descriptor — SRD

Most memory instructions in CDNA come in two flavors: flat (one big virtual address space, like CUDA pointers) and buffer (addressed through a 128-bit resource descriptor — the "SRD"). Hand-written kernels overwhelmingly prefer buffer instructions because the SRD gives you out-of-bounds protection for free, which lets you eliminate the exec mask gymnastics that flat-pointer kernels need.

WordBitsContentsWhat it controls
00..31base_address[31:0]Low 32 bits of the pointer
10..15base_address[47:32]High 16 bits; upper bits 0 for stride=0
20..31num_recordsByte count — anything past this triggers OOB
30..31format / type0x00020000 for raw buffer on gfx942

The OOB semantics is the lever: loads past num_records return zero, stores past it are dropped silently. Set num_records = N * sizeof(float) and your kernel is automatically safe at the array boundary — no if (idx < N) needed, no s_cbranch_execz needed. This single feature reshapes how hand-written AMD kernels look compared to their CUDA equivalents.

What this section deliberately skips

This is not a complete ISA primer. Three substantial topics are deferred to their own sections: MFMA instruction families (M4), DPP cross-lane operations (M5), and M0 register handling for LDS addressing (M2, since vector_add_asm is where it first matters). The hardware basics above are enough to read the next section without panic.

M2 · The five-pattern primervector_add_asm

vector_add_asm/ is the example to read first and read carefully. It does the simplest possible work — C[i] = A[i] + B[i] on float32 — but stacks five orthogonal optimization patterns on top of it. Read the patterns separately and the kernel becomes legible; try to read it bottom-up and you will spend an hour confused by what vmcnt(3) means.

The files are tiny: vector_add_kernel.s is 319 lines (link), main.cpp is 230, the README is 452. Total reading time, end to end, is about 90 minutes the first pass. The five patterns:

  1. Pattern 1 — Persistent kernel
    Grid size equals the number of CUs (main.cpp queries multiProcessorCount). Each workgroup runs a grid-stride loop. One launch covers the whole array regardless of size; every CU stays occupied for the kernel's duration.
  2. Pattern 2 — Double LDS buffer
    4 KB of LDS split into buf0 and buf1, each holding 256 threads' worth of A and B values. The loop body reads one buffer while prefetching into the other. Hides global-memory latency behind compute.
  3. Pattern 3 — buffer_load…lds direct path
    Data flows from HBM straight into LDS, bypassing VGPRs entirely. Reduces register pressure (which raises occupancy) and eliminates a write-back step. The catch: the LLVM assembler does not accept this instruction encoding — you emit the MUBUF bits manually.
  4. Pattern 4 — OOB-based control flow
    SRDs set num_records = N * 4. Out-of-bounds loads return zero; out-of-bounds stores are dropped. The kernel never touches exec — all 64 lanes execute every instruction unconditionally.
  5. Pattern 5 — vmcnt(3) precise pipelining
    A careful arithmetic argument lets exactly 3 memory ops stay in flight at all times — two prefetches and one store from the previous half-iteration. Deeper than vmcnt(0) (no pipelining) and tighter than the conservative bound.

Pattern 3 in detail — the assembler's blind spot

The flagship trick is buffer_load_dword … offen lds. On gfx942 this instruction reads data from global memory and writes it directly to LDS, never going through a VGPR. It exists in the hardware encoding and the disassembler will print it back to you, but the LLVM MC assembler (as of ROCm 7.1 / clang 20) rejects the lds text modifier on any GFX9/CDNA target. carlushuang's workaround, in vector_add_kernel.s:147-158:

vector_add_kernel.s · line 147; ==================================================================== ; Macro: buffer_load_dword ... offen lds (manual MUBUF encoding) ; ==================================================================== .macro buffer_load_dword_offen_lds vdata, vaddr, srsrc_base ; DWORD 0: MUBUF major opcode (0x38), OP=0x14 (buffer_load_dword), ; LDS=1 (bit 16), OFFEN=1 (bit 12) .long 0xE0511000 ; DWORD 1: SOFFSET=0x80 (literal 0), SRSRC, VDATA, VADDR .long (0x80 << 24) | ((\srsrc_base / 4) << 16) | (\vdata << 8) | \vaddr .endm

The constant 0xE0511000 was derived by assembling a regular buffer_load_dword … offen (which yields 0xE0501000) and flipping bit 16. A clean macro hides this from the calling code; the rest of the kernel calls buffer_load_dword_offen_lds v_a, v_buf_off, s_res_a as if it were a real instruction.

Reef · agent ceiling

If you plan to have an agent emit .s source for clang to assemble, this kind of trick is structurally invisible — the agent's vocabulary is bounded by the assembler's grammar. The same is true for several other tricks in this repo: any optimization that requires hand-rolling MUBUF bits, manipulating amdhsa_* directives the assembler does not validate, or using internal-only encodings will be out of reach. For kernel-agent design, this argues for emitting at one level lower (raw object) or one level higher (intrinsics + post-processing).

Pattern 4 in detail — OOB as control flow

Set num_records = N * sizeof(element) when building each SRD (vector_add_kernel.s:124-132):

vector_add_kernel.s · line 124; =========================================================== ; Step 4: Build SRDs with num_records = N * 4 ; =========================================================== s_lshl_b32 s[s_res_a+2], s[s_n], 2 s_mov_b32 s[s_res_a+0], s[s_ptr_a] s_and_b32 s[s_res_a+1], s[s_ptr_a+1], 0xFFFF s_mov_b32 s[s_res_a+3], SRD_CONFIG_GFX942

Now any lane whose byte offset exceeds N * 4 gets safe hardware behavior on its load (zero return) and on its store (dropped). Loop termination is a single scalar comparison against the wave's current idx — no s_and_saveexec_b64, no per-lane masking, no execution-mask juggling:

vector_add_kernel.s · line 235v_cmp_gt_u32 vcc, s[s_n], v[v_idx] s_cbranch_vccz L_done

This produces shorter, branchier code than a CUDA equivalent — and the absence of exec manipulation means the optimizer (mental or compiler) is freer to reorder things. The trade-off: the buffer-instruction path is slightly slower per access than flat loads on small buffers (because the SRD has to be constructed and is read on every issue). For loops, the amortized cost is invisible.

Pattern 5 — the vmcnt(3) arithmetic

This is the part that takes the longest to internalize. The README spells out the accounting in section 14 (README.md:410-452); I will not reproduce it line by line. The key visual is this: at any point in the steady-state loop, the vmcnt FIFO contains exactly three operations from the half-iteration you just finished, while the half-iteration you are currently in issues three more operations (two prefetches plus one store). At the bottom of each half, s_waitcnt vmcnt(3) drains the older three and leaves the newer three in flight to overlap with the next half's compute.

Plate III vmcnt FIFO timeline — what's outstanding at each program point vector_add_asm steady state
prologue start after vmcnt(2) end of buf0 half after vmcnt(3) end of buf1 half after vmcnt(3) FIFO DEPTH 6 4 3 2 1 0 A[0] B[0] A[1] B[1] wait(2) pfA pfB stC wait(3) pfA pfB stC wait(3) pfA pfB stC wait(3) pfA pfB stC wait(3) prologue load prefetch (buffer_load) store (buffer_store) s_waitcnt drain steady state: FIFO oscillates 3 ↔ 6 forever
Each bar height is the vmcnt FIFO depth (outstanding memory ops) at that instruction. Prologue climbs to 4 and drains to 2; the first half-iteration climbs to 5 then drains to 3 via the vmcnt(3) wait. From the second iteration on, the FIFO oscillates between 3 and 6 — three slots are always devoted to the previous half's traffic, three to the current half's. The magenta blocks are the only places the GPU actually stalls; everywhere else memory and compute overlap.

What it does not show you

The kernel is a vector add — pure stream, no compute density. Therefore nothing in this example exercises the matrix core, the AGPRs, the DPP shuffles, or the LDS bank arbiter. Those are the next four sections. The point of vector_add_asm is to put the scaffold in place so you can read more complex kernels without getting distracted by the boilerplate.

M3 · The number you measure everything againstbandwidth_memread

You cannot meaningfully say a kernel is "slow" until you know the ceiling it is supposed to hit. bandwidth_memread/ is a portable, two-kernel microbenchmark — read-only and read+write — designed to push HBM as hard as physically possible. On an MI308X (80 CUs, gfx942) the README reports 4.56 TB/s read-only and 3.36 TB/s read+write (README.md:57-83). These are the numbers any memory-bound kernel on the same hardware should be benchmarked against.

What the kernel does — six tricks compressed

The whole kernel is ~30 lines (bandwidth_kernel.cu:68-92). The tricks are stacked:

bandwidth_kernel.cu · line 68template<typename T, int UNROLL = 8> __global__ void memread_kernel(T* p_src, T* p_dst, int issues_per_block, int iters) { auto current = blockIdx.x * issues_per_block; T v {}; for(auto i = 0; i < iters; i++) { auto offs = UNROLL * BLOCK_SIZE * i + threadIdx.x; #pragma unroll for(auto j = 0; j < UNROLL; j++) { acc(v, nt_load(p_src[current + offs])); offs += BLOCK_SIZE; } } // dead store -- prevents compiler from eliding loads if(v.x == 10000 && v.y == 10000 && v.z == 10000 && v.w == 10000) *p_dst = v; }
TrickWhat it gets you
float4 vector loads128-bit transactions per load — saturates the L2-to-CU bus in one instruction
UNROLL=8Eight independent loads in flight per iteration. ILP hides memory latency without needing more waves.
nt_load — non-temporal__builtin_nontemporal_load bypasses L2 on the way in. For streaming reads, L2 pollution costs more than it saves.
BLOCK_SIZE = 1024Maximum threads per workgroup. With occupancy = 1, each CU runs one workgroup of 1024 threads — 16 waves per CU.
Persistent launchGrid size = num_CUs; each workgroup loops over issues_per_block chunks. One launch per measurement.
Dead-store trickThe unreachable if branch prevents the compiler from concluding that v is unused and deleting all the loads. Subtle but essential.
Plate IV Roofline — MI300 ceilings and where the kata lives log–log, MI308X observed values
ARITHMETIC INTENSITY FLOPS / byte (log) 0.01 0.1 1 10 100 1000 PERFORMANCE TFLOPS (log) 0.1 1 10 100 1000 HBM ceiling · 4.56 TB/s · MI308X measured MFMA fp16 · 1306 TFLOPS peak MFMA fp32 · 163 TFLOPS vector_add AI ≈ 0.08 bandwidth_memread attention (BS=1) GEMM 8K · MFMA peak GEMM 512 (mem bound) ridge: AI ≈ 286 where each gcnasm kata sits on the roofline
The cyan line is the memory-bandwidth roof: perf = AI × bandwidth. The green line is the matrix-core fp16 ceiling. Below the ridge point, your kernel is memory-bound; above it, compute-bound. bandwidth_memread sits at the extreme left (AI = 0); matrix_core's large-N GEMM sits at the extreme right. Most real workloads — attention at small batch, GEMM at moderate size, normalization — live on the steep cyan slope, which is why getting bandwidth_memread's number right is the precondition for analyzing anything else.
Practical move

Before running any performance experiment on a new node, run bandwidth_memread once. Record the read-only number. That is your ceiling for the day. If your "production" kernel's effective bandwidth is 30% of that number, you know you have a memory-access problem; if it is 90%, you know to look elsewhere (compute imbalance, sync overhead, launch cost).

The dead-store trick, explained

Without the unreachable if (v == magic) *p_dst = v block, the compiler reasons: "the function computes v but never writes it; therefore the loads can all be eliminated." The whole kernel collapses to an empty loop, and the benchmark measures launch overhead instead of bandwidth. The magic-number comparison is opaque to the compiler — it cannot prove the condition is always false — so it must keep the loads in to compute v for the comparison. This trick generalizes: any time you write a microbenchmark whose result is not actually used downstream, you need a similar guard.

M4 · The instruction that pays your rentMatrix Core · MFMA layouts

Everything in deep learning that is not memory-bound is paying for time on the matrix core. On CDNA3 these instructions are the MFMA family — Matrix Fused Multiply-Add — and they are not optional. A CDNA3 CU's peak fp16 throughput comes entirely from MFMA; the regular v_fma_f16 instructions are roughly two orders of magnitude slower in aggregate. If you want to compete with NVIDIA tensor cores you must reason in MFMA terms.

matrix_core/matrix_core.cc is the cleanest pedagogical example. One workgroup, one wavefront, one __builtin_amdgcn_mfma_f32_32x32x8f16 instruction. M = N = 32, K = 8. Three layout variants demonstrate that the C-matrix layout is determined by the A/B layout you feed in:

  1. Standard — columns of C
    Each thread holds four fp16 values along a column of C. Store-back is element-wise. This is the layout the instruction "wants" by default.
  2. Swap A/B — rows of C, vector store
    Swap the A and B pointers when calling MFMA. C is transposed; now each thread holds four fp16 along a row. Store-back can use buffer_store_dwordx2 (4 fp16 in one 64-bit transaction).
  3. Swap + swizzle B — wider vector store
    Permute the B-side thread-to-row mapping so groups of 8 lanes hold contiguous fp16 values. Now store-back uses buffer_store_dwordx4 (8 fp16 per transaction). The swizzle is computed in matrix_core.cc:364-367.
Plate V MFMA 32×32×8 fp16 — the canonical tile shape v_mfma_f32_32x32x8f16 · one wave64
A · 32×8 fp16 rows = M = 32 cols = K = 8 L0 · fp16x4 · k0..3 of row 0 L1 · fp16x4 · k0..3 of row 1 L2 · fp16x4 · k0..3 of row 2 ⋮ L0..L31 → k0..3 (cycle group 0) L32..L63 → k4..7 (cycle group 1) each lane holds VGPR pair (v0, v1) = 4 × fp16 B · 8×32 fp16 rows = K = 8 (split: L0..31 hold k0..3; L32..63 hold k4..7) cols = N = 32 (each lane = 1 col) L0 L1 L2 L3 L31 L32..63 cover k4..7 in the same column positions each lane holds (v0, v1) = 4 × fp16 C · 32×32 fp32 rows = M, cols = N, stride = 4 acc cols per lane 16 fp32 in AGPRs per lane (= 32×32 / 64) a0..a3 a4..a7 a8..a11 a12..a15 each lane owns 16 fp32 spread in a 4-stride pattern across C. → destination is AGPR ("a" constraint) → needs s_nop 16 after MFMA v_mfma_f32_32x32x8f16 v_C, v_A, v_B, v_C one instruction · ~64 cycles · 32 768 fp16 multiplies
A single 64-lane wavefront issues one MFMA. The A and B tiles together hold all 256+256 = 512 fp16 values spread across the lanes' VGPRs; the resulting C tile occupies 16 fp32 AGPRs per lane. The thread-to-element mapping shown here is the "standard" layout (column-major in C). Swapping A and B at the call site transposes C; swizzling B further lets you store contiguous fp16x8 chunks.

AGPR scheduling — the cost the demo hides

The pedagogical kernel passes v_a, v_b as plain VGPRs and writes into v_c declared as a normal fp32 array. The compiler inserts v_accvgpr_write/read moves to shuffle values between VGPR and AGPR around each MFMA. matrix_core/matrix_core.cc:139-212 shows the next step: declare the destination with the "a" ASM constraint so the compiler lets the value live in AGPR directly:

matrix_core/matrix_core.cc · line 192asm volatile("s_waitcnt vmcnt(0)" : : : "memory"); asm volatile("v_mfma_f32_32x32x8f16 %0, %1, %2, %3\n" "s_nop 16" // resolve data dependency : "+v"(v_c) : "a"(v_a), "a"(v_b), "v"(v_c) : );

Two details worth noting. First, s_nop 16 after the MFMA is the recommended dead-time before any subsequent instruction can read the accumulator — MFMA has a deep pipeline (16 cycles for 32×32 on gfx942). Skip the nop and you may get stale results. Second, the constraint "a" requests AGPR allocation; "v" requests VGPR. For the destination of MFMA, you must use "a".

What matrix_core_asm adds

The _asm variant uses v_mfma_f32_16x16x1f32 — a different shape (16×16 with K=1) chosen because it lets the kernel fit 128 MFMA instructions in a tight loop with single-VGPR A and B (kernel.s:33-45). The kernel is an IPS benchmark, not a real GEMM: it computes garbage into AGPRs for the sole purpose of measuring sustained MFMA throughput. The harness measures total instruction count divided by wall time. Use this when you want to know "how fast can this CU actually run MFMA back-to-back" — that number is the upper bound for any real kernel that uses the same shape.

If you remember one thing

The MFMA tile shape determines how every other piece of the kernel must be structured. Pick the tile first (32x32x8 fp16 is the standard for large GEMM on CDNA3; 16x16x32 fp16 is better for FlashAttention-shaped work); then design A's load pattern to match the lane-to-fragment mapping; then design B's; then plan the C store-back. Trying to design the LDS layout first and bolt MFMA on after is how most novice AMD GEMMs end up at 30% of peak.

M5 · Lanes talking to each otherCross-lane · DPP wavefront

If you want to compute a reduction across the 64 lanes of a wavefront — softmax denominator, layer-norm variance, attention max — the naive approach is to write to LDS, synchronize, and read back. DPP (Data Parallel Primitives) is a much cheaper alternative: lane-to-lane communication baked directly into the operand of an ALU instruction. A v_add_f32 v[res], v[result], v[result] dpp_ctrl:0xb1 reads v[result] from a different lane in addition to the current one, and adds them in one cycle.

wave_reduce_dpp/wave_reduce_dpp.hpp is the canonical implementation. It builds a 6-stage tree reduction over wave64 using six DPP control codes (wave_reduce_dpp.hpp:14-77). Each stage halves the active lane count:

StageDPP codeOperationActive lanes after
10xb1quad_perm:[1,0,3,2] — swap pairs within quad32
20x4equad_perm:[2,3,0,1] — swap halves within quad16
30x114row_shr:4 — shift by 4 within 16-lane row8
40x118row_shr:8 — shift by 8 within 16-lane row4
50x142row_bcast:15 — broadcast lane 15 to lanes 16-312
60x143row_bcast:31 — broadcast lane 31 to lanes 32-631

After all six stages the reduced value lives in the highest lane (lane 63). One final __builtin_amdgcn_readlane(result, 63) broadcasts it to every lane. Total cost: 6 ALU instructions + 1 readlane. Compare to LDS: 1 write + 1 barrier + 6 reads + 6 ALU = roughly 4–5× slower in steady state.

Plate VI DPP butterfly — six stages over 64 lanes wave_reduce_dpp.hpp · wave_reduce<float, 64>
STAGE 1 · 0xb1 STAGE 2 · 0x4e STAGE 3 · 0x114 STAGE 4 · 0x118 STAGE 5 · 0x142 STAGE 6 · 0x143 RESULT lane 0 lane 16 lane 32 lane 48 lane 63 readlane(result, 63) broadcast to every lane
Six stages, each halving the count of active lanes (green = pair-quad shuffles; amber = row shuffles; magenta = cross-row broadcast). The result accumulates into lane 63, then a single readlane publishes it back to every lane. Compare against NVIDIA's __shfl_* tree: same shape, one extra stage (because wave64 vs warp32). On AMD the DPP code is fused into the consumer instruction's operand, so the "shuffle" doesn't even cost a separate cycle.

The bpermute escape hatch

DPP control codes are limited to a fixed set of patterns (quad swaps, row shifts, row broadcasts). For arbitrary cross-lane permutations — bitonic sort, bit-reversal, scan operations — wave_reduce_dpp.hpp:135-145 falls back to __builtin_amdgcn_ds_bpermute. This uses LDS as a one-cycle shuffle network: each lane writes its value at its own lane-id index, then reads at another lane's id. The LDS is purely a routing fabric — no actual storage is consumed.

Why this matters for attention

FlashAttention's online softmax needs three wave-reductions per tile (max, sum, normalize). On NVIDIA those are __shfl_xor trees. On AMD they are exactly the wave_reduce template you see here. If your attention kernel is using LDS reductions instead of DPP, you are paying a ~4× overhead on the inner loop's hottest path. wave_reduce_dpp/ is the reference implementation to copy.

M6 · The compiler is not your friendIteration speed · co-exec, measure_ips

Two folders in this repo do not contain "real" kernels. They are infrastructure — tools for closing the feedback loop between writing an assembly snippet and seeing it run. They are the most under-appreciated examples in the repo for anyone designing a kernel-generation system.

co-exec — escape from hipcc

co-exec/test_coexec.py is a single 320-line Python script that, on every invocation, regenerates both the C++ host launcher and the .s assembly kernel as inline strings, compiles them with clang++ directly (no hipcc wrapper), and runs the resulting executable (test_coexec.py:300-306). The point is that everything is rebuildable in one Python call:

test_coexec.py · line 117class asm_src_t: def compile(self, src, target, working_dir): cmd = "/opt/rocm/llvm/bin/clang++ " \ "-x assembler -target amdgcn--amdhsa -mcpu={} ".format(self.arch) cmd += src + " -o {}".format(target) # ... def disassemble(self, hsaco, output, working_dir): cmd = "/opt/rocm/llvm/bin/llvm-objdump " \ "--disassemble --mcpu={} ".format(self.arch) + hsaco # ... pipe to output file

For a kernel-generation agent, this template is the right shape. The agent's output is a string of .s; the executor's job is (1) write it to disk, (2) compile it (~200 ms), (3) load and run it (~10 ms), (4) read the profiling counters or output, (5) hand the result back to the agent. The whole loop closes in well under a second per variant.

Implication for kernel agents

If your agent invokes hipcc (which calls clang + lld + multiple ROCm linking passes), each variant compile is roughly 8-15 seconds. With co-exec's direct clang++ -x assembler path, the same step is roughly 200 ms — a 40× speedup. At 500 variants per day, that is the difference between a 70-minute job and a 28-hour job. Iteration speed compounds: faster loops let agents try more variants and stay in cached context.

measure_ips — what does this instruction actually cost?

measure_ips answers the basic empirical question: if I write back-to-back instances of instruction X, what throughput can a CU sustain? The technique is brute-force — a kernel that does nothing but issue 256 copies of the target instruction in a tight loop (measure_ips/kernel.s), wrapped in a host benchmark that times num_loops × num_blocks × num_cu × bdx total instructions divided by wall time (main.cpp:71-75).

The variants in measure_ips_bench/ push this further with Python-generated assembly for many instruction kinds. The output is a table: for each (instruction, allocation_class) pair, sustained instructions-per-second per CU. This is the data you need to make informed scheduling decisions — should I rewrite that v_mac_f32 as v_fma_f32? measure_ips_bench answers it empirically rather than asking the ISA manual.

Plate VII Buffer SRD anatomy — the 128 bits that govern every buffer instruction gfx942 raw buffer · DATA_FORMAT = 32
WORD 0 · bits 31..0 WORD 1 · bits 31..0 WORD 2 · bits 31..0 WORD 3 · bits 31..0 base_address[31:0] low 32 bits of GPU pointer copied verbatim from s_load_dwordx2 result base_address[47:32] high 16 bits (top 16 must be 0) s_and_b32 ptr_hi, 0xFFFF ← required num_records byte count — controls OOB behavior N * sizeof(element) → branchless loops 0x00020000 DATA_FORMAT = 32 (raw) TYPE = 0 → safe OOB CONSEQUENCE · what happens at offset ≥ num_records buffer_load … offen returns 0 to VGPR (harmless) buffer_load … offen lds returns 0 to LDS (harmless) buffer_store … offen silently dropped (harmless) buffer_atomic_add … offen no-op (harmless, but check returns) The OOB-by-construction property is why hand-written CDNA kernels skip exec mask manipulation entirely. Set num_records correctly and the buffer SRD does the conditional-execution work the assembly would otherwise spell out.
The 128-bit Buffer SRD sits in 4 consecutive SGPRs. Its num_records field is the lever — set it tight and your loads/stores get branchless OOB protection from the hardware. This is the mechanism behind Pattern 4 in vector_add_asm and behind every opus_* kernel that uses make_buffer_resource.

M7 · The other forty katasOddments · integer, atomics, type cvt

This section is a fast tour of the rest of the repo. None of these get a deep dive, but most are useful references when you reach for the corresponding pattern.

Integer arithmetic — magic_integer_division

Division is expensive on every GPU; AMD is no exception. magic_integer_division/magic_div.s implements the classic Granlund-Montgomery substitution: replace x / d with mul_hi(x, magic) + x >> shift, where magic and shift are precomputed on the host. The macros are clean:

magic_div.s · line 17; v_numer, s_denom only support uint32 within 31bit .macro .mdiv_u32_vs v_quot, v_numer, s_magic, s_shift, v_tmp v_mul_hi_u32 v[\v_tmp], s[\s_magic], v[\v_numer] v_add_u32 v[\v_tmp], v[\v_tmp], v[\v_numer] v_lshrrev_b32 v[\v_quot], s[\s_shift], v[\v_tmp] .endm

Use this any time you have a loop with i / N or i % N where N is loop-invariant (common in index calculations: row = idx / cols, col = idx % cols). Compute the magic/shift on host once, pass as kernarg, save 20+ cycles per access.

Type conversion — cvt_fp8, pk_cvt, lqq

CDNA3 introduced first-class fp8 support (E4M3 and E5M2). The conversion intrinsics are __builtin_amdgcn_cvt_pk_fp8_f32 and friends. cvt_fp8/ shows the basic round-trip; pk_cvt/ shows packed conversions (two fp8 values per 16-bit lane); lqq/ shows int8 → int4 quantization with scale/zero subtraction. If you are doing W4A16 inference (Kimi-K2.5, Qwen3-Coder-Next), the patterns here are exactly the kernels you need.

Atomics — cmpswap_atomic

bf16 does not have native atomic-add on most CDNA parts. cmpswap_atomic/ implements it via the compare-and-swap loop: read current value, compute new value with the desired bf16 addition, attempt CAS, retry on failure. The bf16x2 variant packs two values per atomic to halve the contention. Useful for accumulation kernels — embedding gradient updates, histogram-style reductions across workgroups.

Hardware probing — hwreg, smid

hwreg/ reads the HW_ID SGPR to figure out which physical CU and SIMD a wave is running on. This sounds esoteric but is invaluable for debugging: if you suspect bank conflicts, log HW_ID per workgroup and look for hotspots. The smid/ example is similar but uses a simpler intrinsic. Both are read-only and have no performance impact.

Production patterns — opus_attn, opus_gemm

The opus_* folders use the aiter library — a higher-level set of building blocks (buffer descriptors, MFMA dispatch, LDS allocation) that production AMD kernels actually use. opus_attn/ is a flash-attention-shaped kernel; opus_gemm/ is a more polished GEMM than matrix_core/. These are heavier reads (7 files each) and assume you have already internalized the patterns from M2–M5.

§ R · Six places to crashReefs

Things in this repo that will silently produce wrong output, or noisily refuse to compile, in ways that take a long time to diagnose. Calling them out here so you do not have to learn them the hard way.

  1. Reef 1 — buffer_load…lds rejected by the assembler
    The LLVM MC assembler does not accept the lds modifier on buffer_load_dword for GFX9/CDNA. You must hand-encode the MUBUF bits via .long. The disassembler will print the instruction back to you correctly, which makes it easy to think you wrote it correctly. vector_add_kernel.s:147-158 shows the workaround.
  2. Reef 2 — missing .amdhsa_accum_offset
    On gfx90a/gfx940/gfx942, omitting this directive produces a cryptic assembler error. Even if you use zero AGPRs, set it equal to .amdhsa_next_free_vgpr. The directive declares where the AGPR range begins in the unified register file; without it, the allocator cannot lay out registers.
  3. Reef 3 — vmcnt FIFO includes stores
    It is tempting to think vmcnt tracks "loads in flight." It does not — buffer_store pushes onto the same FIFO. If you reason about your prefetch pipeline ignoring stores, you will pick the wrong wait value. vector_add_asm's vmcnt(3) works precisely because it accounts for store traffic; vmcnt(2) would be wrong by exactly one.
  4. Reef 4 — code object v3 vs v5 metadata
    gfx942 with ROCm 6+ requires amdhsa.version: [1, 2] (CO v5). The older [1, 0] (v3) format causes invalid HSA metadata errors at assembly time. Several examples in this repo (e.g. matrix_core_asm/kernel.s) still use v3 metadata — they assemble on older ROCm but will need updating. Always check the version before copying a snippet.
  5. Reef 5 — MFMA dest must be AGPR ("a")
    Declaring the C accumulator as a "+v" (VGPR) instead of "+a" (AGPR) does not cause a compile error, but the compiler inserts v_accvgpr_write/read moves around every MFMA call — destroying the back-to-back MFMA throughput you were trying to measure. Always use "a" for the MFMA destination and let it stay in AGPR across MFMA iterations.
  6. Reef 6 — s_nop after MFMA
    MFMA has a deep pipeline. The recommended dead-time between an MFMA and the next instruction that reads its result is s_nop 16 (for 32×32 shapes). Skip it and you get either stale data or pipeline stalls invisible to the profiler. matrix_core/matrix_core.cc:194 shows the canonical pattern.

§ RL · Questions that cut across all eight sectionsRed lines for kernel agents

Three structural questions surfaced during this read that bear directly on the design of an autonomous kernel-optimization system. They are stated as questions because each has multiple defensible answers; the choice is engineering judgment, not deduction.

Red line 1 — what level does the agent emit at?

Choices: (a) raw .s assembly for clang to assemble, (b) HIP source with intrinsics (__builtin_amdgcn_*), (c) a higher-level DSL (aiter, opus, CK) where MFMA dispatch and SRD construction are abstracted. Each pushes the optimization frontier differently. Emitting .s gives maximum control but makes Reef 1 (the assembler's blind spot for buffer_load…lds) into a structural ceiling. Emitting HIP gives compiler optimizations for free but loses the ability to specify exact vmcnt values. Emitting DSL is the fastest iteration cycle but locks the agent into the DSL's expressive range.

The honest answer is probably all three, layered: agents that emit DSL for fast exploration, escalate to HIP+intrinsics for promising candidates, and to .s for the last 5% of squeezing. gcnasm itself is structured this way — it has matrix_core/ (HIP+intrinsic), matrix_core_asm/ (raw assembly), and opus_gemm/ (DSL) as three views of the same problem.

Red line 2 — how does the agent learn the vmcnt(3) heuristic?

The vmcnt(3) choice in vector_add_asm is not derivable from the ISA manual. It is a piece of engineering folklore — "count the number of operations from the previous half-iteration that are still useful to keep in flight." Three approaches: (a) hand-curate a knowledge base of such patterns and feed it to the agent as in-context examples, (b) let the agent search the integer space [0, FIFO_max] and measure performance, (c) teach the agent to read rocprof stall counters and infer the correct value from them.

(b) is the cheapest to implement but wastes most variants. (c) is the most principled but requires that the agent already understands rocprof's output schema. (a) is the right starting point: a static curated set of "pattern templates" (persistent kernel + double buffer + vmcnt-tuned prefetch) the agent can recognize and instantiate, then refine with measurement.

Red line 3 — what is the right unit of search granularity?

"Generate a kernel variant" is too coarse — most random variants will be worse than the seed. "Change one instruction" is too fine — the search space is enormous and most single changes are uninteresting. The sweet spot, judging from gcnasm, is somewhere around "swap one optimization pattern for another": replace flat loads with buffer loads (pattern swap), change vmcnt(0) to vmcnt(3) (parameter sweep within pattern), swap the standard MFMA layout for swap+swizzle (algorithmic restructure).

Each pattern in vector_add_asm's five-pattern list is the right granularity. If your agent's action space is "add or remove one pattern," you have a tractable search problem and a vocabulary that maps cleanly onto what experienced AMD engineers actually think about.

A concrete proposal

Build the agent's prompt template around gcnasm's pattern vocabulary directly. Seed each task with two examples: "here is a slow version (uses pattern X)" and "here is a fast version (uses pattern Y)". The agent's job is to propose which pattern swap applies to the target kernel, then the system measures. The five patterns from vector_add_asm are a good initial action set; the three MFMA layouts from matrix_core/ are a good second.

§ End · What to do with thisEpilogue

This repo will not make you a CDNA3 expert. The ISA manual will, painfully and slowly. What gcnasm does is provide worked examples — the equivalent of solved problems in a math textbook — that let you check your understanding of the manual against actual running code. The two should be read alongside each other; either alone is incomplete.

For kernel-agent work specifically, the most actionable artifacts in the repo are:

  • Run bandwidth_memread first, on every machine
    Establishes the roofline ceiling. Every subsequent benchmark is meaningless without this number.
  • Fork co-exec as the agent's executor template
    200ms compile-and-run vs 8-15s for hipcc. This is the iteration-speed unlock.
  • Use wave_reduce_dpp, magic_integer_division, transpose-lds as building-block library
    These are the patterns your generated kernels will assemble out of. Treat them as the agent's "vocabulary" of optimization moves.
  • Treat vector_add_asm's five patterns as the canonical action space
    Persistent, double-LDS, async-to-LDS, OOB control flow, vmcnt pipelining. Any kernel the agent generates should be classifiable along these five axes.

The repo is small enough that re-reading every six months is feasible. New examples appear occasionally — mla_agentic_v0, wmma_opus_rdna4, matrix_core_gfx950 are recent additions that hint at where AMD optimization is moving (MLA for inference, RDNA4 WMMA for client-side, gfx950 as the imminent successor to MI300). It is worth a bookmark.

The next entry in this series will likely cover either aiter (the production kernel library that opus_* examples build on) or Triton-ROCm's code generation pipeline. Both extend the chain started here in different directions — aiter pushes toward production patterns, Triton-ROCm pushes toward automated codegen. Either way, gcnasm is the foundation that makes the next two readable.

§ Refs · The six documents to keep openPrimary references

gcnasm makes sense only when read alongside the AMD specifications it implements against. The six documents below are the load-bearing references for every claim in this writeup — line numbers like vector_add_kernel.s:147 are the where, and these PDFs are the why. Bookmark all of them; the gap between "the instruction exists in the ISA manual" and "the LLVM assembler accepts it" is exactly the territory gcnasm maps.

  1. № 1 — AMD Instinct MI300 CDNA3 ISA Reference Guide · Aug 2025
    The 1,200-page authority for gfx942. Sections to read first: § 8 (Vector Memory · MUBUF/MTBUF/FLAT) for the buffer-load patterns in M2, § 7 (Vector ALU) for DPP control codes in M5, § 10 (MFMA) for the matrix-core instruction families in M4, and § 6 (Scalar Memory · s_waitcnt encoding) for the vmcnt mechanics this repo lives on. → amd.com · CDNA3 ISA PDF
  2. № 2 — AMD CDNA3 Architecture White Paper
    Architecture-level overview: XCD layout, AID/IOD chiplet topology, Infinity Fabric, HBM3 hierarchy, the unified VGPR+AGPR register file. Read this before the ISA reference so the instructions make sense in context — why MUBUF exists at all, why MFMA writes to AGPRs, what the 304-CU figure means physically. → amd.com · CDNA3 white paper
  3. № 3 — AMD Instinct CDNA4 ISA Reference Guide · Aug 2025
    For MI355X (gfx950). CDNA4 is a strict superset of CDNA3 — every pattern in gcnasm transfers, with added MFMA shapes (fp8/fp6/fp4) and an expanded MTBUF. Diff this against № 1 to see what's new and what to retune. → amd.com · CDNA4 ISA PDF
  4. № 4 — LLVM AMDGPU Backend documentation
    The definitive source for what the assembler actually accepts — which is sometimes a strict subset of what the ISA defines (Reef 1). When the LLVM source contradicts the AMD ISA manual, your code compiles against LLVM, not against AMD. Also documents intrinsics (__builtin_amdgcn_*) used in M2 and M4. → llvm.org · AMDGPUUsage
  5. № 5 — ROCm GPU architecture specifications
    Quick reference: CU count, peak FP/INT throughput, HBM bandwidth, LDS size per SKU. Fills in the constants for the roofline numbers in M3 (the 4.56 TB/s measured on MI308X versus the ~5.3 TB/s peak comes from this table). → rocm.docs.amd.com · gpu-arch-specs
  6. № 6 — AMD GPUOpen ISA documentation hub
    Index of every public AMD GPU ISA reference: GCN3-5, CDNA1-4, RDNA1-4. Bookmark for cross-architecture work — porting a CDNA kernel to RDNA, or comparing how MFMA evolves across generations. The same team maintains it; the structure of each PDF is consistent. → gpuopen.com · ISA hub
A reading sequence

If you are coming to AMD GPU programming for the first time, the right order is № 2 → § 2-3 of № 1 → this writeup → § 6-10 of № 1 → № 4 alongside as you start to assemble. The white paper builds intuition for the hardware shape; the early ISA chapters establish the register and memory model; this writeup gives you running examples to anchor everything against; the deep ISA chapters become readable once you have those anchors; the LLVM doc is your reference for what the toolchain actually accepts. № 3, № 5, № 6 are lookups, not sequential reads.