
NeMo-RL vs slime: RL Training Framework Comparison


This post compares two RL training frameworks for LLMs: NVIDIA’s NeMo-RL and the community-driven slime. I cover algorithm support, engineering quality, MoE readiness, and ROCm compatibility, then give selection recommendations.

1. Background

1.1 Why We Need RL Training Frameworks

SFT training works fine with HuggingFace Transformers or DeepSpeed alone. But RL training (RLHF/GRPO/PPO) involves coordinating multiple roles:

  • Actor: generates rollouts (needs inference capability)
  • Critic / Reward Model: scores outputs (another model or rule-based)
  • Reference Model: computes KL constraint (prevents drifting too far)
  • Trainer: updates actor parameters based on rewards and KL

The scheduling, communication, and memory management of these roles are far more complex than in SFT, which is why dedicated frameworks exist.
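To make the role interaction concrete, here is an illustrative pure-Python sketch of a single RLHF-style step. The KL-shaped reward is the standard formulation; the function names and toy log-probs are stand-ins, not any framework's actual API:

```python
# Illustrative sketch of one RL step coordinating the four roles.
# In real frameworks these are sharded models behind inference and
# training engines; here they are toy callables.

def actor_generate(prompt):            # Actor: rollout generation
    return prompt + " -> response"

def reward_model(response):            # Critic / Reward Model: scalar score
    return 1.0 if "response" in response else 0.0

def ref_logprob(response):             # Reference Model: log-prob under frozen policy
    return -2.0

def actor_logprob(response):           # Actor's own log-prob (needed for the KL term)
    return -1.5

def rl_step(prompt, kl_coef=0.1):
    response = actor_generate(prompt)
    # Per-token KL is approximated here by a single log-prob difference.
    kl = actor_logprob(response) - ref_logprob(response)
    shaped_reward = reward_model(response) - kl_coef * kl
    # Trainer: in practice, a policy-gradient update on the actor follows.
    return shaped_reward

print(rl_step("Explain KL"))  # 0.95: reward 1.0 minus 0.1 * KL of 0.5
```

The KL penalty is what keeps the actor from drifting too far from the reference model while still chasing reward.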

1.2 Candidate Frameworks

| Framework | Maintainer | First Release | Stars | Core Positioning |
|---|---|---|---|---|
| NeMo-RL | NVIDIA | 2024-Q3 | ~2.5K | Production-grade RL training, deep NeMo/Megatron integration |
| slime | Community (ByteDance & academia) | 2024-Q4 | ~1.8K | Lightweight, flexible RL training, research-friendly |
| OpenRLHF | Community | 2023-Q2 | ~5K | Early framework, PPO/DPO |
| TRL | HuggingFace | 2022-Q4 | ~10K | Entry-level, Transformers ecosystem |

This post focuses on NeMo-RL and slime, as they represent the best current options for engineering quality and MoE support.

2. Feature Matrix

| Feature | NeMo-RL | slime |
|---|---|---|
| PPO | Complete (GAE, clipping) | Complete |
| GRPO | Supported | Supported |
| DPO / SimPO | Supported | Supported |
| REINFORCE (w/ baseline) | Supported | Supported |
| Custom reward function | Yes, via config | Yes, via Python callable |
| Rule-based reward (code exec, math verify) | Built-in sandbox | Built-in + external API |
| Online RL (generate + train) | Supported | Supported |
| Offline RL (pre-generated rollouts) | Supported | Supported |
| Multi-turn RL | Limited | Full (conversation tree) |

Key difference: NeMo-RL’s algorithm implementations are more mature (validated at scale internally at NVIDIA), but less customizable (config-driven). slime is more research-friendly (transparent code, easy to fork and modify), but less validated at very large scale (1000+ GPUs).
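A Python-callable reward interface lends itself to rule-based rewards such as math verification. A minimal sketch of the idea; the callable signature here is an assumption for illustration, not slime's exact interface:

```python
import re

def math_verify_reward(prompt: str, completion: str, answer: str) -> float:
    """Rule-based reward: 1.0 if the last number in the completion
    matches the reference answer, else 0.0. Signature is illustrative,
    not slime's actual interface; check its examples for the real one."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == answer else 0.0

print(math_verify_reward("2+2=?", "The answer is 4", "4"))   # 1.0
print(math_verify_reward("2+2=?", "I think it's 5", "4"))    # 0.0
```

Because the reward is just a function, forking it for a new verification rule is a code edit rather than a framework change, which is exactly the research-friendliness trade-off described above.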

3. Architecture Comparison

3.1 NeMo-RL

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              NeMo-RL Controller          β”‚
β”‚  (Hydra config β†’ DAG of tasks)          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Actor   β”‚  β”‚ Critic  β”‚  β”‚ Ref    β”‚  β”‚
β”‚  β”‚(Megatronβ”‚  β”‚(Megatronβ”‚  β”‚(vLLM / β”‚  β”‚
β”‚  β”‚ + vLLM) β”‚  β”‚  Core)  β”‚  β”‚ static)β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚        ↕ NCCL / Gloo ↕                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚     Megatron-Core Distributed    β”‚   β”‚
β”‚  β”‚     (TP/PP/DP/EP sharding)       β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Characteristics:

  • Built on Megatron-Core for distribution, native TP/PP/EP support
  • Actor inference (rollout generation) uses vLLM engine
  • Training updates use Megatron’s optimizer
  • Config-driven (Hydra YAML); changing algorithms means modifying config, not code

3.2 slime

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           slime Orchestrator            β”‚
β”‚  (Python script β†’ Ray actors)           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚  β”‚ Rollout  β”‚  β”‚ Trainer  β”‚             β”‚
β”‚  β”‚ Workers  β”‚  β”‚ Workers  β”‚             β”‚
β”‚  β”‚ (SGLang /β”‚  β”‚(DeepSpeedβ”‚             β”‚
β”‚  β”‚  vLLM)   β”‚  β”‚  ZeRO)   β”‚             β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
β”‚        ↕ Ray Object Store ↕             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚       Ray Cluster + NCCL         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Characteristics:

  • Ray-based scheduling; Rollout and Training are decoupled into independent Ray actor groups
  • Rollout workers can use SGLang or vLLM (swappable engines)
  • Training workers use DeepSpeed ZeRO (simple but sufficient)
  • Code-driven (Python scripts); modify algorithms by editing code directly

4. MoE Support

This is particularly important for our project, since both of our target models (Kimi-K2.5, Qwen3-Coder-Next) are MoE architectures.

| Dimension | NeMo-RL | slime |
|---|---|---|
| MoE training | Native (Megatron-Core MoE) | Via DeepSpeed MoE |
| Expert Parallelism | Native EP | Relies on DeepSpeed EP |
| MoE + TP + PP | Full combination | EP + TP available, PP limited |
| Expert load balancing loss | Built-in | Manual addition needed |
| Token drop policy | Configurable (capacity factor) | Manual implementation needed |
| MoE rollout inference | vLLM MoE | SGLang MoE |

NeMo-RL is more mature for MoE. Megatron-Core’s MoE implementation is extensively validated; EP + TP + PP three-dimensional parallelism works out of the box. slime’s DeepSpeed MoE support is newer, and reliability at large scale (384 experts, TP=8, EP=8) needs your own validation.
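The capacity-factor knob mentioned above follows the standard MoE capacity formula: each expert accepts at most a fixed share of the batch, scaled by the factor, and overflow tokens are dropped or rerouted. The numbers below are illustrative:

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float) -> int:
    """Max tokens each expert accepts per batch; tokens routed beyond
    this capacity are dropped (or rerouted, depending on policy)."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

# Illustrative: 8192 tokens routed across 384 experts (the scale cited above).
cap = expert_capacity(8192, 384, capacity_factor=1.25)
print(cap)  # ceil(21.33 * 1.25) = 27
```

A larger capacity factor drops fewer tokens but wastes more padded compute, which is why having it configurable (rather than hand-rolled) matters at 384-expert scale.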

5. ROCm Compatibility

This is another critical dimension. Our hardware is MI300X/MI355X running ROCm 7.0.

| Dimension | NeMo-RL | slime |
|---|---|---|
| Official ROCm support | No (NVIDIA framework) | Partial (community PRs) |
| Dependency ROCm compatibility | Megatron-Core: has forks | DeepSpeed: official support |
| NCCL vs RCCL | NCCL only | Configurable for RCCL |
| Triton kernels | CUDA Triton | Can swap Triton-ROCm |
| FlashAttention | CUDA FA2 | Can swap CK FA |
| Practical usability | Needs extensive patching | Moderate patching |

Neither framework works out-of-the-box on ROCm. But slime is easier to port because: (1) shorter dependency chain (Ray + DeepSpeed vs the entire Megatron-Core suite); (2) DeepSpeed officially supports ROCm; (3) the inference engine can be SGLang (good ROCm support). NeMo-RL’s deep dependency on Megatron-Core makes ROCm adaptation significantly more work.

6. DX and Reproducibility

| Dimension | NeMo-RL | slime |
|---|---|---|
| Installation | `pip install nemo-rl` (but Megatron-Core needs separate install) | `pip install slime-rl` |
| Minimal runnable example | ~50 lines YAML | ~30 lines Python |
| Documentation | Good (NVIDIA-style, complete but dense) | Moderate (README + examples) |
| Wandb/TensorBoard | Built-in | Built-in |
| Checkpoint format | Megatron format (needs conversion) | HuggingFace format (use directly) |
| Paper reproduction | Has benchmark suite | Has recipe scripts |
| Debug friendliness | Hard (many Megatron layers) | Good (simple code) |

slime has better DX. HuggingFace-format checkpoints mean no format conversion needed to interface with inference engines. NeMo-RL’s Megatron checkpoints need conversion to HF format for SGLang/vLLM inference, which is sometimes painful (especially for MoE models).
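A quick heuristic for telling the two checkpoint layouts apart, useful when wiring an inference engine to a training output directory. The file names here are typical HF and Megatron layouts, not guaranteed in every version:

```python
import os
import tempfile

def looks_like_hf_checkpoint(path: str) -> bool:
    """Heuristic: HF checkpoints carry config.json plus *.safetensors
    (or pytorch_model*.bin) at the top level; Megatron checkpoints are
    instead sharded into rank/iteration subdirectories. Layouts are
    typical, not exhaustive."""
    files = os.listdir(path)
    has_config = "config.json" in files
    has_weights = any(
        f.endswith(".safetensors") or f.startswith("pytorch_model")
        for f in files
    )
    return has_config and has_weights

# Demo with a fake HF-style directory.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "config.json"), "w").close()
    open(os.path.join(d, "model.safetensors"), "w").close()
    print(looks_like_hf_checkpoint(d))  # True
```

With slime this check passes on training output directly; with NeMo-RL it only passes after a Megatron-to-HF conversion step.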

7. Migration Plan

Starting from scratch, here’s a selection flowchart:

*Figure: RL framework decision tree, choosing based on hardware, MoE model, scale, and PP requirements.*

For our scenario (AMD MI300X/MI355X, MoE models, research iteration focus):

  1. Short-term (1-2 months): Use slime + SGLang + DeepSpeed on ROCm
  2. Mid-term (3-6 months): Contribute ROCm + CK optimization patches to slime, establish benchmark baselines
  3. Long-term: If needing very large scale (1000+ GPU), evaluate feasibility of a NeMo-RL ROCm fork

8. Verdict

| Dimension | Winner | Reason |
|---|---|---|
| Algorithm maturity | NeMo-RL | Large-scale validation inside NVIDIA |
| MoE support | NeMo-RL | Megatron-Core MoE is more complete |
| ROCm compatibility | slime | Shorter dependency chain, official DeepSpeed ROCm support |
| Developer experience | slime | HF checkpoints, Python-driven, debug-friendly |
| Customizability | slime | Transparent code, easy to fork |
| Large-scale reliability | NeMo-RL | Validated on 1000+ GPUs |

No absolute winner. NeMo-RL suits production-grade large-scale training on NVIDIA hardware; slime suits research iteration and AMD hardware. For us, slime is the right choice today, but thorough validation on MoE + ROCm scenarios is needed.