Benchmark: Qwen3-Coder-30B-A3B + EAGLE3 Speculative Decoding

This post documents the performance evaluation of EAGLE3 speculative decoding on Qwen3-Coder-30B-A3B. Tests were conducted on a single H100 80GB GPU using SGLang as the inference engine.

1. Background

1.1 What Is Speculative Decoding

The core idea of speculative decoding: use a small model (the draft model) to quickly “guess” several tokens, then have the large model (the target model) verify them in a single parallel forward pass. Every guess the target model agrees with is kept, so one target step can emit multiple tokens. At the first mismatch, the remaining draft tokens are discarded and generation continues from that position.
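The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration assuming greedy decoding and a single chain of draft tokens; it is not SGLang's or EAGLE3's implementation (EAGLE3 drafts a tree of candidates), and `speculative_step`, `draft_model`, and `target_model` are hypothetical names.

```python
from typing import Callable, List

# Hypothetical toy interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """One draft-then-verify step with greedy acceptance. Returns the new tokens."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each position. A real engine scores all
    #    positions in one batched forward pass; this loop only mimics the result.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First wrong guess: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every guess was accepted: the verification pass also yields one more token.
    accepted.append(target(prefix + accepted))
    return accepted


# Toy models: the target counts upward modulo 10; the draft is wrong after a 3.
def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step([0], draft_model, target_model, k=4))  # [1, 2, 3, 4]
print(speculative_step([5], draft_model, target_model, k=3))  # [6, 7, 8, 9]
```

In the first call three draft tokens are accepted and the fourth is replaced by the target's token, so the output is identical to what the target model alone would have produced.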

Key metrics:

  • Acceptance Rate: proportion of draft tokens accepted by the target model
  • Speedup: throughput improvement compared to baseline (no speculation)
  • Draft Overhead: the draft model’s own inference time and memory cost

1.2 EAGLE3

EAGLE3 is the third generation of the EAGLE series, with key improvements over previous versions:

  • Lower training cost: only needs about 1% of the original model’s training data for distillation
  • Native MoE compatibility: the draft model directly reuses the target model’s expert layers, so no separate dense draft model has to be trained
  • Higher acceptance rate: by drafting at the feature level (in hidden state space rather than token space), EAGLE3 achieves 15-25% higher acceptance rates than token-level draft methods

1.3 Why Qwen3-Coder-30B-A3B

Qwen3-Coder-30B-A3B is a MoE model (30B total parameters, 3B active parameters) that performs well on code generation. Its characteristics make it particularly suitable for speculative decoding:

  • The MoE architecture keeps decode-phase compute light (only 3B active parameters per token), so memory bandwidth is the bottleneck: expert weights still have to be read from GPU memory, and across a batch the tokens can route to many different experts out of the full 30B
  • Code generation token prediction is relatively deterministic (syntax constraints, common patterns), leading to high draft acceptance rates
  • EAGLE3 reuses expert layers directly, so the draft model needs little additional memory (about 1.2 GB in this setup; see section 3.4)
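As a rough illustration of the bandwidth argument in the first bullet, the sketch below computes a lower bound on the time of one decode step from the bytes of weights that must be streamed from GPU memory. The bandwidth figure is an assumption (the nominal H100 SXM spec, not a value measured in this post), and the model ignores KV cache reads, activations, and kernel overheads; `min_decode_ms` is a hypothetical helper.

```python
def min_decode_ms(params_read: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on one forward pass: time to stream the weights from HBM once."""
    bytes_read = params_read * bytes_per_param
    return bytes_read / (bandwidth_gb_s * 1e9) * 1e3


HBM_GB_S = 3350  # assumption: nominal H100 SXM HBM3 bandwidth, not a measurement

# Best case: only the 3B active parameters are read (single request, FP16).
print(f"{min_decode_ms(3e9, 2, HBM_GB_S):.2f} ms")
# Worst case: a large batch routes tokens to every expert, so all 30B are read.
print(f"{min_decode_ms(30e9, 2, HBM_GB_S):.2f} ms")
```

Under this simplified model the weight read is paid once per forward pass regardless of how many tokens that pass verifies, which is why accepting several draft tokens per pass helps a bandwidth-bound decoder.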

2. Experimental Setup

2.1 Hardware and Software

| Item | Configuration |
| --- | --- |
| GPU | 1x NVIDIA H100 80GB SXM |
| Driver | NVIDIA 550.127.08, CUDA 12.4 |
| Inference Engine | SGLang v0.4.5 |
| Model | Qwen3-Coder-30B-A3B |
| EAGLE3 Draft Model | Qwen3-Coder-30B-A3B-EAGLE3 (distillation-trained, ~500M extra params) |
| Precision | FP16 |
| Max Context | 8192 tokens |

2.2 Benchmark Scenarios

Three scenarios to cover different usage patterns:

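| Scenario | Prompts | Input Length (tokens) | Output Length (tokens) | Request Rate | Concurrency |
| --- | --- | --- | --- | --- | --- |
| Code Gen (high output) | 128 | 512 | 1024 | inf | 64 |
| Chat (balanced) | 128 | 256 | 256 | 8 req/s | 32 |
| Completion (short output) | 256 | 1024 | 128 | inf | 128 |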

2.3 Launch Commands

Baseline (no speculative decoding):

python -m sglang.launch_server \
    --model Qwen/Qwen3-Coder-30B-A3B \
    --tp 1 \
    --max-total-tokens 65536 \
    --mem-fraction-static 0.85 \
    --enable-torch-compile \
    --port 30000

EAGLE3 speculative decoding:

python -m sglang.launch_server \
    --model Qwen/Qwen3-Coder-30B-A3B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path Qwen/Qwen3-Coder-30B-A3B-EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32 \
    --tp 1 \
    --max-total-tokens 65536 \
    --mem-fraction-static 0.85 \
    --enable-torch-compile \
    --port 30000

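Benchmark command (shown for the Code Gen scenario; for the other scenarios, change the prompt count, lengths, request rate, and concurrency to the values in section 2.2):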

python -m sglang.bench_serving \
    --backend sglang \
    --host 127.0.0.1 \
    --port 30000 \
    --dataset-name random \
    --num-prompts 128 \
    --random-input 512 \
    --random-output 1024 \
    --request-rate inf \
    --max-concurrency 64
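The post gives the full command only for Code Gen. A small helper like the one below could generate the equivalent command for each scenario from the section 2.2 table. It is a sketch that assumes the other two scenarios differ only in these flag values; the scenario keys and the `bench_command` helper are hypothetical names.

```python
import shlex

# Scenario parameters copied from the table in section 2.2.
SCENARIOS = {
    "code_gen": {"num_prompts": 128, "input_len": 512, "output_len": 1024,
                 "request_rate": "inf", "max_concurrency": 64},
    "chat": {"num_prompts": 128, "input_len": 256, "output_len": 256,
             "request_rate": "8", "max_concurrency": 32},
    "completion": {"num_prompts": 256, "input_len": 1024, "output_len": 128,
                   "request_rate": "inf", "max_concurrency": 128},
}


def bench_command(name: str, host: str = "127.0.0.1", port: int = 30000) -> str:
    """Build the bench_serving command line for one scenario."""
    s = SCENARIOS[name]
    args = [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--host", host,
        "--port", str(port),
        "--dataset-name", "random",
        "--num-prompts", str(s["num_prompts"]),
        "--random-input", str(s["input_len"]),
        "--random-output", str(s["output_len"]),
        "--request-rate", s["request_rate"],
        "--max-concurrency", str(s["max_concurrency"]),
    ]
    return shlex.join(args)


for name in SCENARIOS:
    print(bench_command(name))
```

Run each generated command once against the baseline server and once against the EAGLE3 server so that both see the same workload.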

3. Results

3.1 Code Gen Scenario (Primary Focus)

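| Metric | Baseline | EAGLE3 | Change |
| --- | --- | --- | --- |
| Total throughput (tok/s) | 2,847 | 5,324 | +87% |
| Avg TTFT (ms) | 142 | 168 | +18% |
| Avg TPOT (ms) | 22.5 | 12.0 | -47% |
| P99 TPOT (ms) | 35.2 | 18.7 | -47% |
| Avg acceptance rate | n/a | 0.78 | n/a |
| Avg accepted length | n/a | 3.1 tokens | n/a |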

Throughput improves by 1.87x, with an average of 3.1 tokens accepted per draft step. TTFT increases by 18% (the draft model needs extra initialization), but TPOT drops by nearly half.
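The Change column and the 1.87x figure follow directly from the Baseline and EAGLE3 columns. The short check below recomputes them from the Code Gen numbers; the dictionary keys and `pct_change` are illustrative names only.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# (baseline, EAGLE3) pairs copied from the Code Gen results table.
code_gen = {
    "throughput_tok_s": (2847, 5324),
    "avg_ttft_ms": (142, 168),
    "avg_tpot_ms": (22.5, 12.0),
    "p99_tpot_ms": (35.2, 18.7),
}

for metric, (base, eagle3) in code_gen.items():
    print(f"{metric}: {pct_change(base, eagle3):+.0f}% ({eagle3 / base:.2f}x)")
```

The same function applies unchanged to the Chat and Completion tables.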

3.2 Chat Scenario

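| Metric | Baseline | EAGLE3 | Change |
| --- | --- | --- | --- |
| Total throughput (tok/s) | 1,523 | 2,418 | +59% |
| Avg TTFT (ms) | 89 | 105 | +18% |
| Avg TPOT (ms) | 18.3 | 11.5 | -37% |
| P99 TPOT (ms) | 28.1 | 17.2 | -39% |
| Avg acceptance rate | n/a | 0.71 | n/a |
| Avg accepted length | n/a | 2.6 tokens | n/a |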

The Chat scenario has a slightly lower acceptance rate (natural language is more “random” than code), but still delivers a 59% throughput improvement.

3.3 Completion Scenario

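| Metric | Baseline | EAGLE3 | Change |
| --- | --- | --- | --- |
| Total throughput (tok/s) | 4,215 | 6,872 | +63% |
| Avg TTFT (ms) | 287 | 312 | +9% |
| Avg TPOT (ms) | 15.8 | 9.4 | -41% |
| P99 TPOT (ms) | 24.5 | 14.8 | -40% |
| Avg acceptance rate | n/a | 0.73 | n/a |
| Avg accepted length | n/a | 2.8 tokens | n/a |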

Even at a concurrency of 128, EAGLE3’s speedup holds at 1.63x. The TTFT increase is smaller with long prompts (+9% versus +18% in the other scenarios) because prefill time already dominates, so the draft initialization overhead is proportionally lower.

3.4 Memory Comparison

| Item | Baseline | EAGLE3 |
| --- | --- | --- |
| Model weights | 58.2 GB | 58.2 GB |
| Draft model | n/a | 1.2 GB |
| KV Cache + Buffers | 18.8 GB | 17.6 GB |
| Total usage | 77.0 GB | 77.0 GB |

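EAGLE3’s draft model adds only 1.2 GB because it reuses the target model’s expert layers. SGLang automatically shrinks the KV cache allocation by the same amount (18.8 GB to 17.6 GB), so total memory usage stays at 77.0 GB. The cost is slightly less KV cache capacity, not a larger footprint.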
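The memory accounting can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the table in this section; the dictionary keys are illustrative names.

```python
# Memory breakdown in GB, copied from the table in section 3.4.
baseline = {"weights": 58.2, "draft": 0.0, "kv_cache_buffers": 18.8}
eagle3 = {"weights": 58.2, "draft": 1.2, "kv_cache_buffers": 17.6}


def total_gb(breakdown: dict) -> float:
    """Sum of all components, rounded to the table's one-decimal precision."""
    return round(sum(breakdown.values()), 1)


# Share of the KV cache pool given up to make room for the draft model.
kv_shrink_pct = (
    (baseline["kv_cache_buffers"] - eagle3["kv_cache_buffers"])
    / baseline["kv_cache_buffers"] * 100
)

print(total_gb(baseline), total_gb(eagle3))
print(f"KV cache + buffers shrink by {kv_shrink_pct:.1f}%")
```

In other words, the draft model is paid for with roughly 6% of the KV cache pool, which matters mainly for workloads that are already limited by cache capacity.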

4. Analysis

4.1 Why Code Generation Gets the Largest Speedup

Code generation’s token distribution is “sharper” than natural language:

  • Once a variable name appears, subsequent references are essentially deterministic
  • Syntactic structures (brackets, indentation, keyword sequences) are highly predictable
  • Common patterns (for loops, if-else, function signatures) have relatively fixed token sequences

These characteristics give the draft model higher prediction accuracy, pushing acceptance rates from Chat’s 0.71 to Code Gen’s 0.78.
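To see why a few points of acceptance rate matter, consider an idealized model: a single chain of k draft tokens, each accepted independently with probability alpha. This is a simplifying assumption, not how EAGLE3's tree-based drafting works, so it does not reproduce the measured accepted lengths; it only shows the direction and rough size of the effect. `expected_accepted` is a hypothetical helper, and k=5 mirrors the --speculative-num-steps value used in section 2.3.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected number of accepted draft tokens in a k-token chain,
    assuming each token is accepted independently with probability alpha."""
    # Draft token i is kept only if tokens 1..i were all accepted: alpha**i.
    return sum(alpha ** i for i in range(1, k + 1))


# Acceptance rates reported in sections 3.1 to 3.3.
for name, alpha in [("chat", 0.71), ("completion", 0.73), ("code_gen", 0.78)]:
    print(f"{name}: alpha={alpha} -> {expected_accepted(alpha, k=5):.2f} tokens")
```

Because the terms compound, moving alpha from 0.71 to 0.78 lengthens the expected accepted run by about a quarter in this model, far more than the 7-point difference suggests.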

4.2 EAGLE3 vs Other Speculative Decoding Methods

| Method | Acceptance Rate (Code) | Extra Memory | Requires Independent Training | MoE Compatible |
| --- | --- | --- | --- | --- |
| Medusa | 0.65 | ~2 GB | Yes (multi-head distillation) | Needs adaptation |
| EAGLE | 0.72 | ~3 GB | Yes | Needs adaptation |
| EAGLE2 | 0.75 | ~2 GB | Yes | Needs adaptation |
| EAGLE3 | 0.78 | ~1.2 GB | Yes (lighter) | Native |
| Lookahead | 0.60 | ~0 | No | Yes |

EAGLE3 has advantages in both acceptance rate and MoE compatibility.

4.3 When NOT to Use Speculative Decoding

Situations where it’s not a good fit:

  • Very large batches: at high batch sizes the draft + verify overhead may exceed the benefit. The speedup generally narrows at batch sizes above 128
  • Very short outputs: if only generating 10-20 tokens, draft initialization overhead is proportionally too high
  • Extremely tight memory: though EAGLE3 adds only 1.2 GB, when running 70B+ models on an 80GB card, that 1.2 GB might be critical

5. Reproduction Guide

5.1 Installation

pip install "sglang[all]>=0.4.5"

5.2 Download Models

huggingface-cli download Qwen/Qwen3-Coder-30B-A3B
huggingface-cli download Qwen/Qwen3-Coder-30B-A3B-EAGLE3

5.3 Launch and Test

See section 2.3 for the launch and benchmark commands. When the benchmark finishes, the results are printed to standard output; pass --output-file to the benchmark command to also write them to a file.

5.4 Parameter Tuning Tips

| Parameter | Description | Recommended Value |
| --- | --- | --- |
| --speculative-num-steps | Draft tree depth | 3-5 (deeper = slower but slightly higher acceptance) |
| --speculative-eagle-topk | Top-k expansion per step | 4-8 (higher = more memory) |
| --speculative-num-draft-tokens | Total draft tokens | 16-64 (too large wastes verification compute) |

Optimal parameters depend on the specific model and scenario. Run a grid search on your target workload.
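A grid search over these three flags could start from something like the sketch below, which only enumerates the flag combinations. The candidate values are picked from the recommended ranges above as an assumption; each configuration still requires restarting the server with those flags and rerunning the benchmark, which is not shown. `spec_flag_grid` is a hypothetical helper.

```python
from itertools import product

# Candidate values picked from the recommended ranges (an assumption, adjust freely).
NUM_STEPS = [3, 4, 5]
TOPK = [4, 8]
NUM_DRAFT_TOKENS = [16, 32, 64]


def spec_flag_grid():
    """Yield the speculative-decoding flags for every combination in the grid."""
    for steps, topk, draft_tokens in product(NUM_STEPS, TOPK, NUM_DRAFT_TOKENS):
        yield [
            "--speculative-num-steps", str(steps),
            "--speculative-eagle-topk", str(topk),
            "--speculative-num-draft-tokens", str(draft_tokens),
        ]


grid = list(spec_flag_grid())
print(len(grid), "configurations")
for flags in grid[:3]:
    print(" ".join(flags))
```

With 18 configurations and three scenarios, a full sweep means 54 benchmark runs, so it may be worth sweeping only the scenario closest to the production workload.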

Summary

EAGLE3 speculative decoding results on Qwen3-Coder-30B-A3B:

  • Code generation: 1.87x throughput, 47% lower TPOT
  • Chat: 1.59x throughput
  • Completion (high concurrency): 1.63x throughput
  • Memory overhead of only 1.2 GB, with native MoE compatibility
  • Acceptance rates of 0.71-0.78, with an average of 2.6-3.1 tokens accepted per draft step

For single-GPU MoE model deployments where decode latency matters, EAGLE3 comes close to a free lunch: the only costs measured here are a 9-18% higher TTFT and 1.2 GB less KV cache.