This post starts from the basics of Full Attention, breaks down the Sparse and Linear escape routes, and follows them all the way to DeepSeek’s NSA (Native Sparse Attention) and Gated Linear Attention (GLA). For each mechanism, I’ll cover the math, computational complexity, engineering tradeoffs, and real-world deployment performance.
1. Standard Full Attention
Scaled Dot-Product Attention is where everything starts:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:
- $Q \in \mathbb{R}^{N \times d_k}$, $K \in \mathbb{R}^{N \times d_k}$, $V \in \mathbb{R}^{N \times d_v}$
- $N$ is the sequence length, $d_k$ is the key dimension
Complexity Analysis:
| Dimension | Complexity |
|---|---|
| Time complexity | $O(N^2 d)$ |
| Space complexity (attention matrix) | $O(N^2)$ |
| KV Cache (inference) | $O(Nd)$ per layer |
The core problem: $O(N^2)$ means doubling the sequence length quadruples the compute. At 128K or 1M tokens, this quadratic cost becomes unbearable.
Multi-Head Attention splits $Q$, $K$, $V$ into $h$ heads, each performing independent attention before concatenation:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O, \qquad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
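A minimal NumPy sketch of the formula above; the sequence length and feature dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (N, N): the O(N^2) attention matrix
    return softmax(scores) @ V               # (N, d_v)

rng = np.random.default_rng(0)
N, d_k, d_v = 8, 4, 4
Q = rng.normal(size=(N, d_k))
K = rng.normal(size=(N, d_k))
V = rng.normal(size=(N, d_v))
out = attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The `(N, N)` score matrix materialized in `attention` is exactly the object that Sparse and Linear Attention try to avoid.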
2. Two Escape Routes: Sparse vs Linear
Faced with $O(N^2)$, the community has taken two fundamentally different paths:
| Dimension | Sparse Attention | Linear Attention |
|---|---|---|
| Core idea | Compute only selected token pairs | Use kernel functions to bypass softmax |
| Attention matrix | Sparse | Never explicitly constructed |
| Time complexity | $O(N)$ to $O(N\sqrt{N})$, pattern-dependent | $O(N)$ |
| Information loss | Only sees partial tokens | No explicit forgetting |
| Representative work | Longformer, BigBird, NSA | Linear Transformer, GLA, RetNet |
| Hardware friendliness | Medium (irregular sparse access) | High (matrix multiply dominant) |
| Long context performance | Depends on sparsity pattern design | Theoretically good, needs forget gates in practice |
Sparse or Linear? It’s not either/or. DeepSeek’s NSA proved that hybrid approaches (Sparse + compression + sliding window) yield the best engineering tradeoffs. GLA proved that Linear Attention with forget gates can replace Full Attention in specific scenarios.
3. Sparse Attention
3.1 Basic Sparsity Patterns
The core idea behind Sparse Attention: not every token needs to attend to every other token. Keep only the “important” token pairs, turning the attention matrix from dense to sparse.
Sliding Window
Each token attends only to neighbors within a window of size $w$, i.e., positions $[i-w,\, i+w]$:
- Complexity: $O(Nw)$, linear in $N$
- Problem: purely local, cannot capture long-range dependencies
Global Tokens
Select a few tokens (e.g., [CLS], segment starters) that can attend to all positions and be attended to by all positions:
- The number of global tokens is typically a small fixed constant (a handful of special tokens)
- Provides information aggregation points for the entire sequence
Random Connections
Randomly select token pairs to keep connections. The theoretical basis is the small-world property of random graphs — any two nodes can be connected through very few hops.
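The three basic patterns compose into a single boolean attention mask. A small sketch; the window size, global-token count, and random-pair count are illustrative:

```python
import numpy as np

def sparse_mask(N, window=2, n_global=1, n_random=4, seed=0):
    """Boolean mask combining sliding window, global tokens, and random pairs."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((N, N), dtype=bool)
    idx = np.arange(N)
    # Sliding window: keep pairs with |i - j| <= window
    mask |= np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens: the first n_global tokens attend everywhere, and vice versa
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # Random connections: a few symmetric (i, j) shortcuts
    for _ in range(n_random):
        i, j = rng.integers(0, N, size=2)
        mask[i, j] = mask[j, i] = True
    return mask

m = sparse_mask(16)
print(f"density: {m.mean():.2f}")  # fraction of kept pairs, well below 1.0
```

As $N$ grows with `window`, `n_global`, and `n_random` held fixed, the kept fraction shrinks toward $O(1/N)$ of the full matrix.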
3.2 Longformer & BigBird
Longformer combines three patterns:
- Sliding window (local) — present in every layer
- Dilated sliding window — sampled at intervals, expands receptive field
- Global attention — task-relevant tokens marked as global
Complexity drops from $O(N^2)$ to $O(Nw)$, where $w$ is the window size.
BigBird adds random attention on top of Longformer:
Theoretical proof: BigBird’s sparse attention pattern is Turing complete and a universal approximator of sequence-to-sequence functions.
3.3 Engineering Challenges of Sparse Attention
Sparsity patterns look great in theory, but present several hard engineering problems:
Hardware unfriendly: GPU Tensor Cores are optimized for dense matrix multiplication. Sparse attention patterns mean irregular memory access, making it hard to utilize full hardware bandwidth. FlashAttention’s block-sparse variant is the best compromise so far, but still requires sparsity patterns aligned at the block level.
- Sparsity patterns need to be determined at compile/initialization time; dynamic sparsity is harder to optimize
- Different layers may need different sparsity patterns
- Inference KV Cache is still $O(Nd)$ — sparsity helps compute but not storage
4. Linear Attention
4.1 The Kernel Trick
The bottleneck in standard Attention is softmax — it forces computing $QK^\top$ (an $N \times N$ matrix) before normalization. Linear Attention uses kernel functions to bypass this step.

Let $\phi(\cdot)$ be a feature map satisfying:

$$\text{sim}(q_i, k_j) = \phi(q_i)^\top \phi(k_j)$$

Then attention can be rewritten as:

$$o_i = \frac{\sum_{j=1}^{N} \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_{j=1}^{N} \phi(q_i)^\top \phi(k_j)}$$
4.2 Computation Order Magic
The key is associativity of matrix multiplication. The formula above, computed as $(\phi(Q)\phi(K)^\top)V$, costs $O(N^2 d)$. But if we compute $\phi(K)^\top V$ first:

$$o_i = \frac{\phi(q_i)^\top \left(\sum_j \phi(k_j) v_j^\top\right)}{\phi(q_i)^\top \left(\sum_j \phi(k_j)\right)}$$

Let $S = \sum_j \phi(k_j) v_j^\top \in \mathbb{R}^{d \times d_v}$ and $z = \sum_j \phi(k_j) \in \mathbb{R}^{d}$, then:
- Computing $S$ and $z$: $O(N d d_v)$
- Computing each $o_i$: $O(d d_v)$
- Total complexity: $O(N d d_v)$, linear in $N$ (when $d, d_v \ll N$)

From $O(N^2 d)$ to $O(N d^2)$: when $N \gg d$ (long sequences), this is a qualitative leap. For a 128K sequence with $d = 128$, that’s roughly a $1000\times$ speedup.
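The reordering is easy to verify numerically. Here a ReLU stands in for $\phi$ (any positivity-preserving feature map works); only the unnormalized numerator is shown, since the normalizer factors the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_v = 1024, 16, 16
phi = lambda x: np.maximum(x, 0.0)   # stand-in positive feature map

Q = phi(rng.normal(size=(N, d)))
K = phi(rng.normal(size=(N, d)))
V = rng.normal(size=(N, d_v))

left  = (Q @ K.T) @ V   # O(N^2 d): materializes the N x N matrix
right = Q @ (K.T @ V)   # O(N d d_v): only a d x d_v intermediate
assert np.allclose(left, right)
```

Same result, but `right` never touches an $N \times N$ object.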
4.3 RNN Equivalence
Linear Attention has an elegant RNN-equivalent form:

$$S_t = S_{t-1} + \phi(k_t) v_t^\top, \qquad z_t = z_{t-1} + \phi(k_t), \qquad o_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top z_t}$$

- $S_t \in \mathbb{R}^{d \times d_v}$ is the “state matrix” — a compressed version of the KV Cache
- During inference, only $S_t$ and $z_t$ need to be maintained, with $O(d\, d_v)$ space complexity independent of sequence length
- This means Linear Attention models can do constant-memory inference like RNNs
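The recurrence can be checked against the parallel causal form. Features are assumed already positive (i.e., $\phi$ pre-applied); the names are illustrative:

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """O(1)-memory decoding: maintain only S (d x d_v) and z (d)."""
    N, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # state matrix: running sum of k_t v_t^T
    z = np.zeros(d)                  # running sum of k_t, for normalization
    out = np.empty((N, V.shape[-1]))
    for t in range(N):
        S = S + np.outer(K[t], V[t])
        z = z + K[t]
        out[t] = (Q[t] @ S) / (Q[t] @ z)
    return out

rng = np.random.default_rng(1)
N, d, d_v = 64, 8, 8
Q = np.abs(rng.normal(size=(N, d)))   # positive, as if phi already applied
K = np.abs(rng.normal(size=(N, d)))
V = rng.normal(size=(N, d_v))

# Parallel causal form: lower-triangular mask instead of a recurrence
A = np.tril(Q @ K.T)
parallel = (A @ V) / A.sum(axis=-1, keepdims=True)
assert np.allclose(linear_attention_recurrent(Q, K, V), parallel)
```

The loop touches each token once and never grows its state, which is exactly the constant-memory inference property.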
4.4 The Fatal Flaw: Never Forgetting
The RNN form reveals a fatal problem: $S_t = S_{t-1} + \phi(k_t) v_t^\top$.
Information only accumulates, never decays. As the sequence grows:
- Any individual token’s contribution gets “drowned” by ever-growing historical information
- New token signals get diluted
- Practical outcome: quality degrades sharply beyond a certain sequence length
This is why pure Linear Attention underperforms Full Attention in practice. Subsequent improvements (RetNet, GLA, Mamba) are all fundamentally solving this forgetting problem.
5. DeepSeek NSA (Native Sparse Attention)
DeepSeek’s NSA is one of the best-engineered Sparse Attention implementations to date. It’s not a new academic idea — it combines existing sparse patterns in a hardware-aware manner.
5.1 Three-Branch Architecture
NSA’s core design uses three parallel branches:
Branch 1: Compressed Attention
Groups every $l$ contiguous tokens into a block, compressing each block into a representative token:

$$\tilde{k}_j = \varphi\big(k_{(j-1)l+1\,:\,jl}\big)$$

Where $l$ is the block size (e.g., 32), $j$ is the block index, and $\varphi$ is a learned block-wise compression function. Sequence length drops from $N$ to $N/l$.
- Purpose: capture coarse-grained global information
- Complexity: $O(N^2 / l)$
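A minimal sketch of the compression step. Mean-pooling is used as a stand-in for NSA’s learned block MLP:

```python
import numpy as np

def compress_blocks(K, block_size=32):
    """Pool each run of `block_size` keys into one representative key.
    NSA compresses with a learned block-wise MLP; mean-pooling is a stand-in."""
    N, d = K.shape
    n_blocks = N // block_size               # drop any ragged tail for simplicity
    blocks = K[:n_blocks * block_size].reshape(n_blocks, block_size, d)
    return blocks.mean(axis=1)               # (N / l, d)

K = np.random.default_rng(0).normal(size=(128, 16))
print(compress_blocks(K).shape)  # (4, 16)
```

Attention over the compressed keys then scores $N$ queries against only $N/l$ positions.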
Branch 2: Selected Attention
Uses Compressed Attention scores to select the Top-$n$ important token blocks, then runs full attention only on those blocks:
- Purpose: fine-grained attention on critical positions
- Complexity: $O(N \cdot n l)$, where $n l \ll N$
Branch 3: Sliding Window Attention
Standard sliding window with window size $w$:
- Purpose: capture local context
- Complexity: $O(Nw)$
5.2 Gated Fusion
Outputs from the three branches are mixed via sigmoid gating:

$$o = g^{\text{cmp}} \cdot o^{\text{cmp}} + g^{\text{slc}} \cdot o^{\text{slc}} + g^{\text{win}} \cdot o^{\text{win}}, \qquad g = \sigma(\text{MLP}(q))$$
Gating is query-dependent: different queries can learn different branch preferences.
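In code, the fusion is three sigmoid gates projected from the query. The single-matrix gate projection `W_g` is an illustrative simplification of NSA’s gating MLP:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nsa_gated_fusion(q, o_cmp, o_slc, o_win, W_g):
    """Mix the three branch outputs with query-dependent sigmoid gates.
    q: (N, d) queries; o_*: (N, d_v) branch outputs; W_g: (d, 3) gate projection."""
    g = sigmoid(q @ W_g)                     # (N, 3): one gate per branch, per query
    return g[:, 0:1] * o_cmp + g[:, 1:2] * o_slc + g[:, 2:3] * o_win

rng = np.random.default_rng(0)
N, d, d_v = 16, 8, 8
q = rng.normal(size=(N, d))
o_cmp, o_slc, o_win = (rng.normal(size=(N, d_v)) for _ in range(3))
out = nsa_gated_fusion(q, o_cmp, o_slc, o_win, rng.normal(size=(d, 3)))
print(out.shape)  # (16, 8)
```

Because the gates depend on `q`, a query dominated by local context can downweight the compressed branch, and vice versa.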
5.3 Hardware-Aligned Design
A key design decision in NSA: all operations work on contiguous blocks:
- Compressed branch: block-wise MLP
- Selected branch: selects contiguous blocks, not scattered tokens
- Sliding window: naturally contiguous
Why does this matter? Data transfer efficiency from GPU HBM to SRAM depends on contiguity. FlashAttention already proved that block-wise computation avoids $O(N^2)$ memory usage. All NSA branches naturally fit block-wise kernels without special memory layouts for sparsity patterns.
5.4 Performance Numbers
DeepSeek’s reported numbers (27B model, 64K context):
| Metric | Full Attention | NSA |
|---|---|---|
| Training speed | 1x | ~1.6x |
| Inference latency (64K context) | 1x | ~3x faster |
| Needle-in-a-Haystack | 97.5% | 97.2% |
| Long-text perplexity | baseline | near baseline |
Key observations:
- Quality is nearly lossless
- Training also gets a speedup (total compute across three branches < full $O(N^2)$)
- Inference speedup is larger (KV Cache can store only compressed + sliding window)
6. GLA (Gated Linear Attention)
GLA’s core contribution: adding a forget gate to Linear Attention, solving the “never forgetting” fatal flaw.
6.1 The Forget Gate
Recall Linear Attention’s RNN form: $S_t = S_{t-1} + \phi(k_t) v_t^\top$. GLA drops the feature map and adds a forget gate (simplified to a diagonal matrix $\text{diag}(\alpha_t)$):

$$S_t = \text{diag}(\alpha_t)\, S_{t-1} + k_t v_t^\top$$

When $\alpha_t = 1$, it degrades to standard Linear Attention; when $\alpha_t < 1$, historical information decays exponentially.

$\alpha_t$ is data-dependent (in simplified form):

$$\alpha_t = \sigma(W_\alpha x_t)$$
The model can learn different forgetting rates for different types of tokens.
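The inference-time recurrence is a one-line change to the Linear Attention loop. A sketch in unnormalized form, with illustrative names:

```python
import numpy as np

def gla_recurrent(Q, K, V, alpha):
    """GLA decoding: S_t = diag(alpha_t) S_{t-1} + k_t v_t^T,  o_t = q_t^T S_t."""
    N, d = Q.shape
    S = np.zeros((d, V.shape[-1]))
    out = np.empty((N, V.shape[-1]))
    for t in range(N):
        S = alpha[t][:, None] * S + np.outer(K[t], V[t])  # per-channel forgetting
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(0)
N, d, d_v = 32, 8, 8
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d_v))

# alpha = 1 recovers the never-forgetting Linear Attention recurrence
no_forget = gla_recurrent(Q, K, V, np.ones((N, d)))
# alpha < 1 decays old state exponentially
forget = gla_recurrent(Q, K, V, np.full((N, d), 0.9))
```

With `alpha = 1`, the output matches masked causal linear attention exactly; with `alpha < 1`, distant tokens fade instead of diluting the state forever.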
6.2 Chunk-wise Parallel Training
GLA’s RNN form is good for inference but bad for training (serial). The paper proposes a chunk-wise parallel algorithm:
- Split the sequence into chunks of size $C$
- Within each chunk, use parallel matrix multiplication (similar to Full Attention)
- Between chunks, use RNN recurrence (passing state)
```python
# GLA chunk-wise training forward pass (naive reference, ignoring numerical stability)
import torch

def gla_chunk_forward(Q, K, V, alpha, chunk_size=64):
    B, N, d = Q.shape
    d_v = V.shape[-1]
    S = torch.zeros(B, d, d_v)            # inter-chunk state matrix
    outputs = []
    for i in range(0, N, chunk_size):
        q = Q[:, i:i+chunk_size]          # (B, C, d)
        k = K[:, i:i+chunk_size]          # (B, C, d)
        v = V[:, i:i+chunk_size]          # (B, C, d_v)
        a = alpha[:, i:i+chunk_size]      # (B, C, d) per-channel forget gates in (0, 1)
        C = q.shape[1]
        b = torch.cumprod(a, dim=1)       # cumulative decay within the chunk
        mask = torch.tril(torch.ones(C, C))
        # Intra-chunk: parallel causal attention with relative decay b_t / b_s
        intra = ((q * b) @ (k / b).transpose(-1, -2)) * mask   # (B, C, C)
        o_intra = intra @ v               # (B, C, d_v)
        # Inter-chunk: queries read the decayed state carried over from earlier chunks
        o_inter = (q * b) @ S             # (B, C, d_v)
        outputs.append(o_intra + o_inter)
        # State update: decay old state by the chunk's total decay, absorb new KV
        b_C = b[:, -1]                    # (B, d): total decay over this chunk
        S = b_C.unsqueeze(-1) * S + (k * (b_C.unsqueeze(1) / b)).transpose(-1, -2) @ v
    return torch.cat(outputs, dim=1)
```
Training complexity: $O(N C d + N d\, d_v)$, where $C$ is the chunk size, typically 64-256. When $C \ll N$, this is effectively linear in $N$.
6.3 The GLA Family
| Model | Forget Gate | State Size | Training Parallelism | Year |
|---|---|---|---|---|
| Linear Transformer | None | $d \times d_v$ | Fully parallel | 2020 |
| RetNet | Fixed decay (per head) | $d \times d_v$ | Chunk-wise | 2023 |
| GLA | Data-dependent (diagonal) | $d \times d_v$ | Chunk-wise | 2024 |
| Mamba (S6) | Data-dependent | $d \times d_{\text{state}}$ | Parallel scan | 2024 |
| Mamba2 | Data-dependent (scalar) | $d \times d_{\text{state}}$ | Chunk-wise | 2024 |
Trend: From fixed decay to data-dependent forget gates, from pure RNN training to chunk-wise parallelism, the Linear Attention family is gradually approaching the ideal of “train like a Transformer, infer like an RNN.”
6.4 Deployment in Qwen3-Coder-Next
Qwen3-Coder-Next uses a hybrid architecture:
- Some layers use Full Attention (ensuring global information flow)
- Other layers use GLA (reducing inference KV Cache pressure)
- Interleaved arrangement, typical ratio around 1:3 (1 Full + 3 GLA per 4 layers)
Engineering implications:
- KV Cache reduced significantly (GLA layers only maintain a state matrix, not full KV cache)
- Faster inference (GLA layers do constant work per token, independent of context length)
- Training speed close to pure Transformer (chunk-wise parallel training ensures GPU utilization)
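The interleaving itself is trivial to express. A hypothetical schedule helper for the 1:3 pattern described above (the helper name and layout are illustrative, not Qwen3-Coder-Next’s actual config):

```python
def layer_schedule(n_layers, period=4):
    """One Full Attention layer per `period` layers, the rest GLA (illustrative)."""
    return ["full" if i % period == 0 else "gla" for i in range(n_layers)]

print(layer_schedule(8))
# ['full', 'gla', 'gla', 'gla', 'full', 'gla', 'gla', 'gla']
```

An inference engine can walk this schedule to decide, per layer, whether to read a full KV cache or a fixed-size GLA state.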
7. Full Comparison
| Mechanism | Time Complexity | Space Complexity | Long-Context Quality | Inference KV Cache | Hardware Friendliness | Training Difficulty |
|---|---|---|---|---|---|---|
| Full Attention | $O(N^2 d)$ | $O(N^2)$, $O(N)$ with FlashAttn | Best | $O(Nd)$ per layer | High (FlashAttn) | Simple |
| Longformer | $O(Nw)$ | $O(Nw)$ | Good | $O(Nd)$ | Medium | Simple |
| BigBird | $O(Nw)$ | $O(Nw)$ | Good | $O(Nd)$ | Medium | Simple |
| NSA | $O(N^2/l + Nnl + Nw)$ | sub-quadratic | Near Full | compressed + window only | High (block-wise) | Medium |
| Linear Attention | $O(N d d_v)$ | $O(d d_v)$ | Poor (no forgetting) | $O(d d_v)$ state | High | Simple |
| GLA | $O(NCd)$ | $O(d d_v)$ | Good | $O(d d_v)$ state | High | Medium |
| Mamba2 | $O(NCd)$ | $O(d d_{\text{state}})$ | Good | $O(d d_{\text{state}})$ state | High | Medium |
8. Architecture Evolution
9. Engineering Takeaways
The most important conclusions for kernel engineers:
- FlashAttention is the ultimate optimization for Full Attention, but it doesn’t change the fundamental $O(N^2)$ compute. It reduces memory from $O(N^2)$ to $O(N)$, but computation stays quadratic. Long-context scenarios still need Sparse or Linear.
- NSA’s three-branch design naturally fits block-wise kernels. The compressed branch is block reduce + MLP, the selected branch is block gather + FlashAttention, and the sliding window is FlashAttention’s standard window variant. All three can be implemented with existing efficient kernels.
- GLA’s chunk-wise training is essentially “small FlashAttention + RNN recurrence.” The intra-chunk attention matrix is $C \times C$ (e.g., $64 \times 64$), which fits entirely in SRAM. Inter-chunk state transfer is dense matrix multiplication, perfect for Tensor Cores.
- Hybrid architectures (Full + GLA/Mamba) are the current engineering optimum. Use a few Full Attention layers to ensure information completeness, and many Linear layers to reduce inference cost. Kernel development for these architectures needs to support both attention modes.
- AMD MI300X/MI355X deployment considerations:
- FlashAttention: CK has highly optimized implementations
- NSA: Needs custom kernels (block-wise compression + selection)
- GLA: Triton for ROCm can write chunk-wise training kernels, but inference kernels need hand-written CK for optimal performance
- Hybrid dispatch: inference engines need to dynamically select kernel paths based on layer type
References
- Vaswani et al., “Attention Is All You Need”, NeurIPS 2017
- Beltagy et al., “Longformer: The Long-Document Transformer”, 2020
- Zaheer et al., “Big Bird: Transformers for Longer Sequences”, NeurIPS 2020
- Katharopoulos et al., “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”, ICML 2020
- Sun et al., “Retentive Network: A Successor to Transformer for Large Language Models”, 2023
- Yang et al., “Gated Linear Attention Transformers with Hardware-Efficient Training”, ICML 2024
- Gu & Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces”, 2024
- Dao & Gu, “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality”, ICML 2024
- DeepSeek, “Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention”, 2025