§ 00 · Why this paper, why nowPrologue
§ 00 · 为什么读这篇, 为什么是现在序章
There is a quiet contradiction at the center of agentic RL. We want to train models to be good agents — to fix real bugs, drive a terminal, orchestrate sub-agents over tens of thousands of tokens. But the things that make an agent good live inside a harness: Codex, Claude Code, a homegrown CLI. And the way we usually train means tearing that harness apart and rebuilding it inside the RL framework's environment API.
Agentic RL 的核心藏着一个不太被说破的矛盾。 我们想把模型训练成好的 agent —— 能修真实的 bug、 能驱动终端、 能在几万 token 的跨度上协调子 agent。 但让一个 agent 变强的那些东西, 都长在 harness 里面: Codex、 Claude Code、 自家的 CLI。 而我们惯常的训练方式, 意味着把这个 harness 拆开, 在 RL 框架的环境 API 里重建一遍。
Polar's argument is that you should not have to. The paper opens with one sentence that is really the whole thesis: "Can we train agents with RL without opening the box?" Its answer is a systems answer, not an algorithm one. Every LLM-based agent, no matter how baroque its internals, has to talk to a model. That model API call is a common interface that sits outside the agent. Put a provider-compatible proxy there, capture the tokens flowing through, and the harness becomes trainable while it runs completely unmodified.
Polar 的主张是: 你不必这么干。 论文开篇那一句其实就是全部论点: "能不能不打开盒子就用 RL 训练 agent?" 它给的是一个系统层面的答案, 而不是算法层面的。 任何基于 LLM 的 agent, 不管内部多繁复, 总得跟模型对话。 那个模型 API 调用是一个落在 agent 之外的公共接口。 在那里放一个 provider 兼容的代理, 抓住流过的 token, harness 就能在完全不改的前提下变得可训练。
I am reading this for a specific reason. Our own setup at AMD points open-source harnesses — kimi-cli and others — at locally-hosted models through SGLang's OpenAI-compatible endpoint. That endpoint is the seam Polar exploits. So this paper is not abstract for us: it is a blueprint for how a rollout-to-training loop could wrap the exact harnesses we already run, without us reimplementing any of them. The diagrams below are hand-drawn from the paper; every number is cross-checked against the text.
我读它有个很具体的理由。 我们在 AMD 的这套配置, 本来就通过 SGLang 的 OpenAI 兼容 endpoint, 把开源 harness(kimi-cli 这些)指向本地托管的模型。 那个 endpoint 正就是 Polar 利用的那道缝。 所以这篇论文对我们不抽象: 它是一张蓝图 —— 一个 rollout 到训练的闭环, 怎样把我们已经在跑的那些 harness 直接包住, 而不用我们重写其中任何一个。 下面的图都是照着论文手绘的; 每一个数字都跟正文逐一核对过。
Move the RL integration boundary from the harness to the model endpoint. The agent stays a black box; a proxy at the API boundary records prompt tokens, sampled tokens, and log-probabilities, and a reconstruction step turns those captured calls into token-faithful training trajectories. Rollout then scales as a service, decoupled from the GPU trainer.
把 RL 的集成边界从 harness 挪到 模型 endpoint。 agent 仍是黑盒; 一个位于 API 边界的代理记录 prompt token、 采样出的 token 和 log 概率, 一个重建步骤再把这些抓到的调用变成 token 级忠实的训练轨迹。 于是 rollout 以服务的形式扩展, 与 GPU trainer 解耦。
§ 01 · The integration burdenThe target is the system
§ 01 · 集成的负担目标就是那个系统
In classical RL the environment hides behind a tiny, standardized interface — reset(), step(action), a reward. The whole point of agentic RL is that this no longer holds. The training target is now "a complex software system with heterogeneous environments, various external tools, long-running workflows, possibly different languages, or even a closed-source binary." The interface stopped being simple, and that is the systems problem the paper is actually about.
在经典 RL 里, 环境藏在一个极小的标准接口背后 —— reset()、 step(action)、 一个 reward。 而 agentic RL 的要害恰恰是: 这一套不再成立。 现在的训练目标是"一个复杂的软件系统, 带着异构环境、 各种外部工具、 长时运行的工作流、 可能还有不同语言、 甚至是一个闭源二进制"。 接口不再简单了 —— 这正是这篇论文真正要解决的系统问题。
Prior work splits into two camps, and Polar positions itself against both:
先前的工作分成两个阵营, Polar 把自己摆在两者的对面:
- Bake the agent into the pipeline. SkyRL-Agent and PRIME-RL integrate agent execution directly into the RL loop. You adapt your agent to the infrastructure, not the other way round. Every new harness needs a framework-specific integration.
- Standardize a tracing interface. Agent Lightning and rLLM lower the cost with tracked clients, decorators, and workflow abstractions — but the agent still has to conform to a prescribed SDK shape. That breaks down exactly when harnesses get complex or ship as a binary you can't instrument.
- 把 agent 焊进流水线。 SkyRL-Agent 和 PRIME-RL 把 agent 执行直接集成进 RL 循环。 是你的 agent 去迁就基础设施, 而不是反过来。 每来一个新 harness, 就要做一次框架特定的集成。
- 标准化一个 tracing 接口。 Agent Lightning 和 rLLM 用 tracked client、 装饰器、 工作流抽象把成本压下来 —— 但 agent 仍然必须遵从一套规定好的 SDK 形态。 而这恰恰会在 harness 变复杂、 或者以你无法插桩的二进制形式发布时失效。
Polar's move is to choose the minimum integration point: not an SDK callback graph, but the provider API endpoint the harness already calls. The proxy that sits there becomes the observation device. It accepts Anthropic, OpenAI Chat, OpenAI Responses, and Google-style requests; translates them to the local backend; and records the token-level fields a trainer needs. As the paper puts it, this is "narrower than general observability instrumentation, but it is robust to harnesses implemented as command-line programs, package-managed tools, or binaries."
Polar 的做法是选一个最小集成点: 不是 SDK 的回调图, 而是 harness 本来就在调用的那个 provider API endpoint。 坐在那里的代理成了观测装置。 它接收 Anthropic、 OpenAI Chat、 OpenAI Responses、 Google 风格的请求; 把它们翻译给本地后端; 并记录 trainer 需要的 token 级字段。 用论文的话说, 这"比通用的可观测性插桩更窄, 但它对那些以命令行程序、 包管理工具、 或二进制形式实现的 harness 都很稳健"。
reset/step/reward, so it couples tightly to the trainer and every new agent is a fresh port. Right, Polar's contract: the harness runs as shipped, points its model base URL at the gateway, and the proxy at that single seam captures everything the trainer needs while the local inference server quietly serves the policy being trained.reset/step/reward, 于是它和 trainer 紧耦合, 每来一个新 agent 都得重新移植一遍。 右边是 Polar 的契约: harness 按出厂状态运行, 把它的模型 base URL 指向 gateway, 位于那一道缝上的代理就捕获 trainer 需要的一切, 而本地 inference 服务器悄悄提供着正在被训练的那个 policy。The name is a small tell about lineage. Polar rewrites the group's earlier ProRL Agent server, and the authors describe it as connecting the two "poles" of an agent's life — the training environment and the product harness — through one interface. It is also registered as a NeMo Gym environment.
这个名字透露了一点来历。 Polar 重写了该团队早先的 ProRL Agent server, 作者把它描述为用一个接口连接起 agent 一生的两"极" —— 训练环境与产品 harness。 它同时也被注册为一个 NeMo Gym 环境。
§ 02 · Rollout server + gateway nodeTwo components, one boundary
§ 02 · rollout server + gateway node两个组件, 一道边界
Polar is deliberately small in its parts. There are exactly two: a rollout server and gateway nodes. The split "keeps durable task management separate from per-session execution and capture."
Polar 在组件上刻意保持精简。 一共就两个: 一个 rollout server 和若干 gateway node。 这个拆分"让持久的任务管理, 与每个 session 的执行和捕获分离开"。
The rollout server is the coordinator. It accepts a TaskRequest and expands it into num_samples independent sessions — the scheduling unit. A session carries a session ID, a task ID, a timeout budget, a runtime specification, an agent specification, a trajectory builder, an evaluator, and a callback URL. The server dispatches sessions to gateways, persists compact terminal results, exposes task status by polling, and accepts gateway callbacks when sessions finish.
rollout server 是协调者。 它接收一个 TaskRequest, 把它展开成 num_samples 个独立的 session —— 也就是调度单元。 每个 session 带着 session ID、 task ID、 超时预算、 runtime 规格、 agent 规格、 一个 trajectory builder、 一个 evaluator、 以及一个 callback URL。 server 把 session 分派给各 gateway, 持久化紧凑的终态结果, 通过轮询暴露任务状态, 并在 session 结束时接收 gateway 回调。
A gateway node owns the full lifecycle of a session: it starts the runtime, prepares the harness, runs the harness commands, builds trajectories from captured completions, evaluates the output, tears everything down, and returns the result. Crucially, the same gateway also hosts the proxy endpoint that the harness calls for its model traffic. Co-locating the proxy with the session keeps capture tied to the session registry and avoids standing up a separate trace-collection service.
一个 gateway node 掌管一个 session 的完整生命周期: 启动 runtime、 准备 harness、 运行 harness 命令、 从捕获到的 completion 构建轨迹、 评估输出、 拆掉一切、 返回结果。 关键在于, 同一个 gateway 还托管着 harness 用来发模型流量的那个代理 endpoint。 把代理和 session 放在一起, 让捕获紧贴 session 注册表, 也省掉了单独再起一个 trace 收集服务。
TaskRequest into num_samples sessions and hands them to gateway nodes. Each gateway runs the harness inside an isolated runtime, intercepts its model calls through a co-located proxy, forwards normalized requests to the local inference servers, and builds + evaluates the trajectory. The trainer is a separate process: it consumes finished trajectories as a service and syncs fresh weights back to inference. Nothing in the trainer knows which harness produced the tokens.TaskRequest 扇出成 num_samples 个 session, 交给各 gateway node。 每个 gateway 在隔离的 runtime 里跑 harness, 通过同处一地的代理拦截它的模型调用, 把归一化后的请求转发给本地 inference 服务器, 再构建并评估轨迹。 trainer 是独立进程: 以服务的方式消费跑完的轨迹, 并把新权重同步回 inference。 trainer 里没有任何东西知道这些 token 是哪个 harness 产出的。Because the trainer only ever sees trajectories — not harness code — Polar is "agnostic to agent harnesses, training infrastructure, and RL algorithms." Swap GRPO for something else, swap Codex for Pi, swap Docker for Apptainer: none of those choices reach across the service boundary. That is what lets slow, long-tailed agent rollouts scale independently from the GPU trainer.
因为 trainer 永远只看到轨迹 —— 而非 harness 代码 —— Polar 因此"对 agent harness、 训练基础设施、 RL 算法都不可知"。 把 GRPO 换成别的、 把 Codex 换成 Pi、 把 Docker 换成 Apptainer: 这些选择没有一个会跨过服务边界。 这正是让缓慢、 长尾的 agent rollout 能与 GPU trainer 独立扩展的原因。
§ 03 · Harness & proxy captureThe proxy, in four steps
§ 03 · harness 与代理捕获代理的四个步骤
The proxy is where the cleverness concentrates, and it is intentionally dumb about agents. It "does not need to understand how the harness plans, manages tools, or decides when to stop. It only needs to preserve API compatibility and record enough information to reconstruct training samples." For each incoming model request it runs four steps.
代理是巧思集中的地方, 而它对 agent 故意保持"无知"。 它"不需要理解 harness 如何规划、 如何管理工具、 或如何决定何时停止。 它只需要保持 API 兼容, 并记录足够重建训练样本的信息"。 对每一个进来的模型请求, 它跑四个步骤。
logprobs=true, the field training needs. Step 3 stores a completion record: prompt token IDs, sampled response token IDs, finish reason, and log-probabilities. Step 4 transforms the answer back into the provider shape the harness expects; for streaming clients it fakes a server-sent-event stream from a non-streaming upstream response, which keeps token capture exact.logprobs=true, 这是训练需要的字段。 第 3 步存下一条 completion 记录: prompt token ID、 采样出的 response token ID、 finish reason、 以及 log 概率。 第 4 步把答复变回 harness 预期的 provider 形态; 对流式客户端, 它从一个非流式的上游响应伪造出一条 server-sent-event 流, 从而让 token 捕获保持精确。The four provider dialects
四种 provider 方言
Step 1 distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent-style calls. Step 2's provider transformer converts roles, content parts, tool definitions, tool choices, stop controls, and generation parameters into one canonical shape. This is the unglamorous, load-bearing engineering: an agent built for Claude and an agent built for GPT both end up as the same normalized request hitting the same local model.
第 1 步区分 Anthropic Messages、 OpenAI Chat Completions、 OpenAI Responses、 以及 Google generateContent 风格的调用。 第 2 步的 provider transformer 把 role、 content 片段、 工具定义、 工具选择、 停止控制、 生成参数统一转成一种规范形态。 这是不起眼但承重的工程: 一个为 Claude 写的 agent 和一个为 GPT 写的 agent, 最后都变成打到同一个本地模型上的同一种归一化请求。
Harness adapter — small on purpose
harness adapter —— 刻意做小
To onboard a harness you write a harness adapter, which the paper stresses "is small by design." It may install configuration, register MCP servers or skills, write provider settings, and return the shell commands that launch the agent. There are built-in shortcuts for the obvious suspects:
要接入一个 harness, 你写一个 harness adapter, 论文强调它"在设计上就很小"。 它可以安装配置、 注册 MCP server 或 skill、 写入 provider 设置、 并返回启动 agent 的 shell 命令。 对那几个明显的常客, 有内置的快捷集成:
Runtime interface — swap the box without friction
runtime 接口 —— 无摩擦换底座
Runtimes implement a common interface: start, stop, exec, upload, download, and cancellation. The first release supports Docker and rootless Apptainer (the latter for HPC clusters where you can't run a Docker daemon). Because the gateway only depends on this interface, a task can change isolation backend without touching anything else — a detail that matters a lot for those of us living on shared GPU clusters.
runtime 实现一个公共接口: start、 stop、 exec、 upload、 download、 以及取消。 第一个版本支持 Docker 和 rootless Apptainer(后者用于那种跑不起 Docker daemon 的 HPC 集群)。 因为 gateway 只依赖这个接口, 一个任务可以换隔离后端而不动其他任何东西 —— 这个细节对我们这些生活在共享 GPU 集群上的人特别重要。
§ 04 · Asynchronous rollout stagingKeeping the GPU fed
§ 04 · 异步 rollout staging别让 GPU 闲着
A single SWE-style rollout is a pile of costs with wildly different shapes: runtime startup, dependency preparation, harness execution, evaluator setup, test execution, patch application, teardown. Some are CPU-and-IO heavy (spinning up a container, installing a repo's deps); the expensive one — actually running the agent — is GPU-bound. If you run these serially per session, the GPU sits idle while a container boots.
一次 SWE 风格的 rollout, 是一堆形态迥异的成本: runtime 启动、 依赖准备、 harness 执行、 evaluator 搭建、 测试运行、 打补丁、 拆环境。 有些是 CPU 和 IO 密集的(起容器、 装一个 repo 的依赖); 而最贵的那个 —— 真正跑 agent —— 是 GPU 密集的。 如果你按 session 串行地跑这些, GPU 就会在容器开机时干等着。
Polar's answer is stage-isolated execution inside each gateway. Three isolated worker pools — INIT, RUNNING, POSTRUN — plus a bounded READY buffer between INIT and RUNNING. The buffer is the trick: CPU-heavy runtime preparation runs ahead in the background and parks finished runtimes in READY, so when a run slot frees, a warm runtime is already waiting. The GPU never waits on apt install.
Polar 的答案是每个 gateway 内部的阶段隔离执行。 三个隔离的 worker pool —— INIT、 RUNNING、 POSTRUN —— 外加 INIT 与 RUNNING 之间一个有界的 READY buffer。 buffer 是关键所在: CPU 重的 runtime 准备在后台提前跑, 把准备好的 runtime 停在 READY 里, 这样当一个运行 slot 空出来时, 一个热好的 runtime 已经在等着了。 GPU 永远不用等 apt install。
Two details I want to flag because they show real operational scar tissue. First, evaluator prewarm: when an evaluator needs a clean runtime (think: a fresh repo checkout to apply and test a patch against), the gateway starts preparing it during the agent run, not after — overlapping setup with the thing it will eventually grade. Second, timeout handling: a single shared deadline per session, and on timeout it does not just throw the work away. If model calls were already captured, the gateway still enters POSTRUN so the partial trajectory survives, tagged with a terminal-timeout status. For sparse-reward, long-tail SWE tasks, throwing away partial rollouts would be expensive; keeping them is the difference between usable and wasteful.
有两个细节我想拎出来, 因为它们透着真实的运维伤疤。 第一, evaluator 预热: 当一个 evaluator 需要干净 runtime 时(想象一下: 一份新 checkout 出来的 repo, 用来打补丁并跑测试), gateway 在 agent 运行期间就开始准备它, 而不是之后 —— 把搭建和它最终要评判的那件事重叠起来。 第二, 超时处理: 每个 session 一个共享 deadline, 而超时时它并不直接把活儿丢掉。 如果模型调用已经被捕获了, gateway 仍然进入 POSTRUN, 让这条部分轨迹存活下来, 标上 terminal-timeout 状态。 对稀疏奖励、 长尾的 SWE 任务来说, 丢掉部分 rollout 是很贵的; 留下它们, 就是"可用"和"浪费"之间的差别。
§ 05 · Trajectory reconstructionToken-faithful prefix merging
§ 05 · 轨迹重建token 级忠实的前缀合并
This is the part of the paper I find most elegant, because it is where "treat the harness as a black box" could have quietly broken the math — and the authors noticed. The trajectory builder turns an ordered list of captured CompletionSession objects into a Trajectory made of one or more Trace objects, each holding prompt/response token IDs, loss masks, messages, log-probabilities, rewards, and metadata. There are two ways to do it.
这是我觉得论文里最优雅的一段, 因为这里正是"把 harness 当黑盒"本可能悄悄把数学搞坏的地方 —— 而作者注意到了。 trajectory builder 把一串有序的、 捕获到的 CompletionSession 对象, 变成一个由一个或多个 Trace 对象组成的 Trajectory, 每个 Trace 装着 prompt / response 的 token ID、 loss mask、 message、 log 概率、 reward、 以及元数据。 有两种做法。
Per-request: correct but fragmented
逐请求: 正确但破碎
The conservative option (§3.4.1): every captured completion becomes its own trace. It is lossless per call — but it shatters a coherent multi-turn session into dozens of short, disconnected samples. For an agent that took 51 turns to fix a bug, that is 51 tiny training examples that have lost the thread of the episode, and a much heavier stream for the trainer to chew through.
保守的做法(§3.4.1): 每个捕获到的 completion 各自成为一条 trace。 对单次调用来说它无损 —— 但它把一段连贯的多轮 session 砸成几十个短小、 互不相连的样本。 对一个用了 51 轮才修好 bug 的 agent, 那就是 51 个丢掉了整段剧情线索的小训练样本, 以及一个让 trainer 啃起来重得多的数据流。
Prefix merging: stitch the episode back together
前缀合并: 把整段重新缝起来
The better option (§3.4.2) reconstructs longer traces by exploiting a structural fact about most harnesses: their conversation grows append-only. Turn N+1's prompt is usually just turn N's prompt with more appended. So if you process completions C₁…C_T in order, you can greedily partition them into chains, where a completion joins a chain only if two conditions hold:
更好的做法(§3.4.2)通过利用大多数 harness 的一个结构性事实来重建更长的 trace: 它们的对话是只追加地增长的。 第 N+1 轮的 prompt, 通常就是第 N 轮的 prompt 后面再追加一些。 所以如果你按顺序处理 completion C₁…C_T, 就能贪心地把它们划分成若干链, 而一个 completion 只有在两个条件都成立时才加入某条链:
- A normalized message-level grouping key marks it as a candidate continuation (a cheap bucketing filter on conversation context), and
- a strict token-prefix relation holds against the chain's last prompt: the first
|p_m|tokens of the next prompt must equal the entire previous prompt —p_{m+1}[1:|p_m|] = p_m.
- 一个归一化的 message 级分组键把它标记为候选续接(一个对对话上下文做的廉价分桶过滤), 并且
- 对链上最后一个 prompt, 严格的 token 前缀关系成立: 下一个 prompt 的前
|p_m|个 token 必须等于整个上一个 prompt ——p_{m+1}[1:|p_m|] = p_m。
When the prefix breaks — a context compaction rewrites history, or a sub-agent spawns with a fresh prompt — a new chain simply starts. Each chain becomes one long trace.
当前缀断裂时 —— 一次上下文压缩重写了历史, 或者一个子 agent 带着全新 prompt 启动 —— 就干脆开一条新链。 每条链成为一条长 trace。
aᾢ (orange, trainable) with the canonical interstitial tokens uᾢ the harness inserted between turns (gray, masked). The loss mask is 1 only on tokens the model actually generated. Their log-probabilities are the genuine behavior-policy values from rollout; the masked slots get synthetic logprob entries purely to keep response_logprobs aligned with response_ids. The result reads as one coherent multi-turn episode while remaining mathematically honest about which tokens are on-policy.aᾢ(橙色, 可训练)与 harness 在轮次之间插入的规范 interstitial token uᾢ(灰色, 被 mask)交错排列。 loss mask 只在模型真正生成的 token 上为 1。 它们的 log 概率是来自 rollout 的真实 behavior-policy 值; 被 mask 的位置拿到合成的 logprob 条目, 纯粹是为了让 response_logprobs 与 response_ids 对齐。 结果读起来像一段连贯的多轮剧情, 同时在"哪些 token 是 on-policy"这件事上保持数学上的诚实。Why the masking matters: retokenization drift
为什么 mask 很重要: retokenization drift
The subtlety the paper calls out (§2.4) is retokenization drift: if you reconstruct a trajectory by decoding the transcript back to text and re-encoding it, the token IDs you get can differ from the ones the model actually sampled. Train on those and you are training on tokens the behavior policy never emitted. Polar avoids this by construction: generated assistant tokens are copied verbatim from the inference responses, the non-generated interstitial tokens come from canonical prompt tokenization, and only the behavior-policy tokens are marked trainable. No round-trip, no drift.
论文点出的微妙之处(§2.4)是 retokenization drift: 如果你通过把对话解码回文本、 再重新编码来重建轨迹, 你拿到的 token ID 可能和模型当初真正采样出的不一样。 拿那些去训练, 你就是在用 behavior policy 从未吐出过的 token 训练。 Polar 从构造上就避开了这点: 生成的 assistant token 逐字从 inference 响应里复制, 没被生成的 interstitial token 来自规范的 prompt tokenization, 而且只有 behavior-policy token 被标为可训练。 没有往返, 也就没有 drift。
§ 06 · Evaluation & reward propagationGrading the episode
§ 06 · 评估与奖励传播给整段轨迹打分
Evaluators are registry-backed and receive the trajectory, the session's artifacts, and optionally a freshly prepared runtime context (this is what the prewarm in §04 was for). Three are built in:
evaluator 由注册表支撑, 接收轨迹、 session 的产物、 以及可选的一份新准备好的 runtime 上下文(§04 里的预热就是为它服务的)。 内置三个:
- a session-completion reward (did the agent finish cleanly?),
- a configurable test-on-output evaluator, and
- a SWE-Bench / SWE-Gym harness evaluator that applies the patch and runs the repo's tests.
- 一个 session-completion reward(agent 是否干净地收尾了?),
- 一个可配置的 test-on-output evaluator, 以及
- 一个 SWE-Bench / SWE-Gym harness evaluator, 它打上补丁并跑 repo 的测试。
On propagation, the paper draws a clean line: an outcome reward — one scalar for the whole episode, like "did the tests pass" — is broadcast to every trace in the session. Tasks with process rewards (per-step signal) instead need per-trace assignment. The registry is the extension point: custom rule-based verifiers, agent-as-judge scoring, and task-specific reward shaping all plug in here. (One thing the paper does not pin down is the within-trace token placement of the reward — whether it lands on the final token or spreads across the trace. Worth noting rather than guessing.)
在传播上, 论文划了一条干净的界线: 一个 outcome reward —— 整段轨迹一个标量, 比如"测试过没过" —— 会广播到该 session 的每一条 trace。 而带 process reward(逐步信号)的任务则需要逐 trace 分配。 注册表是扩展点: 自定义的基于规则的 verifier、 agent-as-judge 打分、 以及任务特定的 reward shaping 都从这里接入。 (论文没有钉死的一点, 是 reward 在一条 trace 内部具体落在哪个 token 上 —— 是落在最后一个 token, 还是摊到整条 trace。 这点值得标注, 而不是去猜。)
§ 07 · ExperimentsDoes it actually train?
§ 07 · 实验它真的能训出来吗?
The headline experiment (§4.1) is deliberately unflashy on the algorithm side: plain GRPO, run asynchronously through Slime, training the same Qwen3.5-4B base checkpoint four separate times — once inside each of four real coding harnesses — and scoring pass@1 on the full SWE-Bench Verified benchmark. The point isn't a fancy objective; it's that you can point RL at four genuinely different harnesses without reimplementing any of them.
招牌实验(§4.1)在算法一侧刻意不花哨: 朴素的 GRPO, 通过 Slime 异步地跑, 把同一个 Qwen3.5-4B 基座 checkpoint 分别训练四次 —— 每次在四个真实编程 harness 中的一个里面 —— 并在完整的 SWE-Bench Verified 上算 pass@1。 重点不在于精巧的目标函数; 而在于你能把 RL 对准四个真正不同的 harness, 却不用重写其中任何一个。
The ablation that justifies prefix merging
那个为前缀合并正名的消融
The second experiment is the one that earned the masthead number. Holding everything else fixed and running the same three training steps, the team compares the prefix_merging trajectory builder against per_request:
第二个实验, 就是为标题那个数字赢得资格的那个。 在其他一切不变、 跑同样三个训练 step 的前提下, 团队把 prefix_merging 这个 trajectory builder 与 per_request 做了对比:
Offline data generation
离线数据生成
Polar isn't only for online RL. Section 4.2 uses it as an offline SFT-data factory with a bigger model, Qwen3.5-122B-A10B (TP=8, max_model_len=32,768), generating trajectories over 1,638 instances drawn from seven SWE-Gym repositories, each in its own Apptainer container with a fresh checkout. After filtering, 504 trajectories were accepted (30.8%) at roughly 64 GPU-hours, averaging 104 messages and 51 assistant turns per accepted session. Acceptance varied widely by repo — highest on getmoto/moto (~54%), lowest on dask/dask (~18%) — a reminder that "how hard is this repo to make tests pass on" is itself most of the signal.
Polar 不只为在线 RL 服务。 §4.2 把它当作一个离线 SFT 数据工厂, 用一个更大的模型 Qwen3.5-122B-A10B(TP=8, max_model_len=32,768), 在取自七个 SWE-Gym 仓库的 1,638 个实例上生成轨迹, 每个都在自己的 Apptainer 容器里、 用一份新 checkout 跑。 过滤之后, 504 条轨迹被接受(30.8%), 总开销约 64 GPU 小时, 被接受的 session 平均 104 条 message、 51 个 assistant 轮次。 接受率在不同 repo 之间差异很大 —— 最高的是 getmoto/moto(约 54%), 最低的是 dask/dask(约 18%)—— 这提醒我们:"这个 repo 要让测试通过有多难"本身就是大部分信号。
The training recipe, for the record
训练配方, 留个记录
| Hyperparameter | Value | Hyperparameter | Value | ||
|---|---|---|---|---|---|
| 超参 | 取值 | 超参 | 取值 | ||
| Algorithm | 算法 | GRPO (async, Slime) | Samples / prompt | 每 prompt 采样 | 16 |
| Epochs | Epoch | 1 | Learning rate | 学习率 | 1×10⁻⁶ |
| Rollout batch size | Rollout batch | 4 | Weight decay | 权重衰减 | 0.1 |
| Training data | 训练数据 | NovaSky-AI/SkyRL-v0-293-data · 293 tasks · TIS enabled |
|||
Where Polar sits vs the field
Polar 在领域里的位置
Appendix A.1 lines Polar up against ProRL Agent, SkyRL-Agent, PRIME-RL, Agent Lightning, rLLM, and OpenClaw-RL on four axes — async RL support, async rollout staging, rollout-as-a-service, and native-harness agnosticism. Polar is the row that checks all four; the differentiator is that last column, native-harness agnosticism, which is the whole reason the proxy lives at the model boundary.
附录 A.1 把 Polar 与 ProRL Agent、 SkyRL-Agent、 PRIME-RL、 Agent Lightning、 rLLM、 OpenClaw-RL 在四个维度上排了一行 —— 异步 RL 支持、 异步 rollout staging、 rollout 即服务、 以及原生 harness 不可知。 Polar 是四项全勾的那一行; 真正的区分点是最后一列, 原生 harness 不可知 —— 这正是代理为什么要活在模型边界上的全部理由。
§ 08 · For our workWhat this means for AMD & kimi-cli
§ 08 · 对我们的意义这对 AMD 与 kimi-cli 意味着什么
I read papers through one lens these days: does it move our two goals — porting the open-source GPU ecosystem to AMD, and building a multi-agent system that writes fast AMD kernels. Polar lands squarely on the second, and here is why I think it matters.
这阵子我读论文只用一个视角: 它有没有推进我们的两个目标 —— 把开源 GPU 生态移植到 AMD, 以及搭一个会写高性能 AMD kernel 的多 agent 系统。 Polar 正好落在第二个上, 下面是我觉得它要紧的原因。
Our harnesses (kimi-cli and friends) already point at locally-hosted models through SGLang's OpenAI-compatible endpoint. Polar's whole architecture assumes exactly that endpoint exists and puts the RL observation point there. We would not be retrofitting our stack to fit a trainer — the trainer would attach to the seam we already run. A gateway proxy in front of our SGLang endpoint is a small, well-scoped thing to build.
我们的 harness(kimi-cli 这些)本来就通过 SGLang 的 OpenAI 兼容 endpoint 指向本地托管的模型。 Polar 的整个架构恰恰假设这个 endpoint 存在, 并把 RL 观测点放在那里。 我们不需要改造自己的栈去迁就 trainer —— trainer 会接到我们已经在跑的那道缝上。 在我们的 SGLang endpoint 前面加一个 gateway 代理, 是一件小而边界清晰的工程。
Our kernel work is a long-horizon, sparse-reward, expensive-to-evaluate loop: generate a kernel variant, compile, benchmark, profile, iterate. That is structurally the same shape as a SWE-Bench rollout — slow runtime setup, long tails, patch-level rewards. Polar's staging (READY buffer, evaluator prewarm, partial-trace recovery on timeout) is a direct answer to "the GPU sits idle while a container builds," which is precisely our iteration-speed bottleneck. Swap "apply patch + run tests" for "compile + run rocprof against roofline" and the machinery transfers.
我们的 kernel 工作是一个长跨度、 稀疏奖励、 评估昂贵的循环: 生成一个 kernel 变体、 编译、 benchmark、 profile、 迭代。 这和 SWE-Bench 的一次 rollout 在结构上是同一个形状 —— 慢的 runtime 搭建、 长尾、 补丁级奖励。 Polar 的 staging(READY buffer、 evaluator 预热、 超时时的部分轨迹回收)正是对"容器在构建、 GPU 在空转"这件事的直接回答, 而那恰恰是我们迭代速度的瓶颈。 把"打补丁 + 跑测试"换成"编译 + 对着 roofline 跑 rocprof", 这套机器就能迁过来。
If we ever train a kernel agent with RL, retokenization drift is a landmine — kernel code is full of tokens (hex constants, intrinsics like __builtin_amdgcn_*, register names) that re-encode unpredictably. Polar's copy-the-sampled-tokens-verbatim discipline, with interstitials masked, is the correct default and worth stealing wholesale.
如果我们哪天用 RL 训练一个 kernel agent, retokenization drift 是个地雷 —— kernel 代码里满是会被不可预测地重新编码的 token(十六进制常数、 像 __builtin_amdgcn_* 这样的 intrinsic、 寄存器名)。 Polar 那套"采样出的 token 逐字复制、 interstitial 一律 mask"的纪律是正确的默认值, 值得整套照搬。
The honest caveat: Polar's results are on coding/SWE tasks with a relatively small model and plain GRPO. Kernel optimization has a harder reward surface (correctness and performance against a roofline, not just "tests pass") and a much smaller pool of training tasks. The systems contribution transfers cleanly; the RL-converges-and-helps part we would have to earn ourselves. But "wrap the harness you already run, don't rewrite it" is exactly the kind of leverage we want.
老实说的保留: Polar 的结果是在编程 / SWE 任务上、 用一个相对小的模型和朴素 GRPO 拿到的。 kernel 优化的奖励面更难(既要正确又要对着 roofline 比性能, 不只是"测试过了"), 训练任务池也小得多。 系统层面的贡献能干净地迁移过来; 而"RL 会收敛并且有帮助"这一半, 得靠我们自己去挣。 但"包住你已经在跑的 harness, 别重写它", 正是我们想要的那种杠杆。
§ 09 · EpilogueThe boundary is the idea
§ 09 · 尾声边界本身就是那个想法
If you strip Polar down to one sentence, it is this: the hardest part of agentic RL is not the algorithm, it is the integration, and the cheapest place to integrate is the one interface every agent already exposes. The model API call. Everything else in the paper — the two-component split, the four-step proxy, the staging pools, prefix merging — is engineering in service of that single relocation of the boundary.
如果把 Polar 压成一句话, 就是: agentic RL 最难的不是算法, 是集成; 而最省事的集成位置, 是每个 agent 都已经暴露的那一个接口 —— 模型 API 调用。 论文里其余的一切 —— 两组件的拆分、 四步代理、 staging 的那几个 pool、 前缀合并 —— 都是为这一次边界挪移服务的工程。
What I appreciate is the restraint. There is no new objective, no exotic credit assignment. Plain GRPO, a 4B model, and a careful systems design that lets the same training loop wrap Codex, Claude Code, Qwen Code, and Pi without caring which is which. For a team like ours that wants agents pointed at our own models and our own hardware, "don't open the box" is not a limitation — it is the feature.
我欣赏的是它的克制。 没有新目标函数, 没有奇异的 credit assignment。 朴素 GRPO、 一个 4B 模型、 加上一套用心的系统设计, 让同一个训练循环去包住 Codex、 Claude Code、 Qwen Code、 Pi, 而不在乎哪个是哪个。 对我们这样一支想把 agent 对准自家模型、 自家硬件的团队来说, "别打开盒子"不是限制 —— 它就是那个特性。
§ 10 · SourcesReferences & citation
§ 10 · 来源参考文献与引用
This reading is based entirely on the paper text. If you cite this writeup, cite the paper:
这篇精读完全基于论文正文。 如果你要引用本文, 请引用原论文:
- Polar: Agentic RL on Any Harness at Scale. Xu, Zhang, Zhang, Han, Liu, Hu, Diao, Jin, Zou, Demoret, Kautz, Dong. arXiv:2605.24220 [cs.DC], 22 May 2026. arxiv.org/abs/2605.24220 — the paper this entry reads.
- Polar: Agentic RL on Any Harness at Scale. Xu、 Zhang、 Zhang、 Han、 Liu、 Hu、 Diao、 Jin、 Zou、 Demoret、 Kautz、 Dong。 arXiv:2605.24220 [cs.DC], 2026 年 5 月 22 日。 arxiv.org/abs/2605.24220 —— 本文精读的论文。
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez et al. The benchmark Polar trains and evaluates on (SWE-Bench Verified is the human-validated subset). arxiv.org/abs/2310.06770
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez 等。 Polar 训练与评估所用的基准(SWE-Bench Verified 是其人工校验子集)。 arxiv.org/abs/2310.06770
- SWE-Gym: Training Environments for Software Engineering Agents. The training-environment extension of SWE-bench that supplies Polar's tasks, verifiers, and trajectories. arxiv.org/abs/2412.21139
- SWE-Gym: Training Environments for Software Engineering Agents. SWE-bench 的训练环境扩展, 为 Polar 提供任务、 verifier 和轨迹。 arxiv.org/abs/2412.21139
- DeepSeekMath / GRPO. Group Relative Policy Optimization — the (deliberately plain) RL algorithm Polar uses. arxiv.org/abs/2402.03300
- DeepSeekMath / GRPO. Group Relative Policy Optimization —— Polar 所用的(刻意朴素的)RL 算法。 arxiv.org/abs/2402.03300
- Slime. The SGLang-native async rollout + Megatron training framework Polar runs GRPO through. github.com/THUDM/slime
- Slime. Polar 用来跑 GRPO 的 SGLang 原生异步 rollout + Megatron 训练框架。 github.com/THUDM/slime
- Agent Lightning. Training-agent disaggregation + a unified tracing interface — the low-intrusion approach Polar contrasts itself against. arxiv.org/abs/2508.03680
- Agent Lightning. 训练-agent 解耦 + 统一 tracing 接口 —— Polar 用来作对照的低侵入方案。 arxiv.org/abs/2508.03680
- SkyRL-Agent. Full-stack RL training/eval for long-horizon agents (with SkyRL-Gym), representing the "bake the agent into the pipeline" camp. github.com/NovaSky-AI/SkyRL
- SkyRL-Agent. 面向长跨度 agent 的全栈 RL 训练 / 评估(含 SkyRL-Gym), 代表"把 agent 焊进流水线"的阵营。 github.com/NovaSky-AI/SkyRL