SkyPilot: A Close Reading of the Source

§ 00 · Prologue · Why we cracked this open.

SkyPilot is the de facto standard for open-source multi-cloud orchestration. On the surface it is a single sky launch; underneath sits a three-tier client / server / node-daemon system, a hybrid DP + ILP solver, a set of polymorphic abstractions spanning 24 cloud vendors, and a 6,562-line single-file backend. We read it to fully understand how one unified control plane schedules heterogeneous infrastructure — because what we're building right now is a close cousin of it.

This reading spans 11 modules (M0 → M9, plus an M2.5 wedged in mid-way), 6.5 hours total. This file is the post-reading overview: the architecture map, the 9-stage execution pipeline, the Optimizer's DP / ILP routing, the two-level job state machine, and the file-change checklist for adding a new cloud backend — all rendered as hand-drawn SVG, no runtime dependencies, readable offline.

File	Lines
`task.py`	2,142
`resources.py`	2,816
`dag.py`	228
`cli/command.py`	7,954
`client/sdk.py`	3,237
`server.py`	3,524
`execution.py`	939
`optimizer.py`	1,805
`cloud_vm_ray_backend.py`	6,562
`backend_utils.py`	4,440
`clouds/cloud.py`	1,041
`skylet/skylet.py`	107

— Line counts of the core files on master (commit f8bb042); the four largest alone already make up more than half of SkyPilot's actual "heart."

Plate I · Architecture, in repose · whole system, single view · 1 : ∞

Three zones, one program. Zone A (Client) packages a Task and POSTs to Zone B. Zone B (FastAPI server) spawns a process per request, runs the 9-stage execution pipeline, and reaches into Zone C (provisioned VMs) over SSH / Ray / gRPC. The head-node skylet daemon is small (107 lines) — it is mostly an event-loop wrapper around four gRPC services and six recurring jobs. Special controllers (Managed Jobs · SkyServe · Load Balancer) are themselves ordinary SkyPilot tasks running in Zone C.

§ M0 · 20 min · The repository, in repose.

Take in the 211K-line repo as a whole first. It is called sky; the main package is just sky/, and the root has three interesting companions: agent/ (SkyPilot Agent Skills — a GPU-access wrapper for LLM agents, new in v0.12), llm/ (45+ LLM training/serving examples, from DeepSeek-R1 to Kimi-K2), and examples/amd/ — four official YAML samples aimed directly at MI-series hardware.

OBSERVATION

Main package + companion directories is common in large ML systems, but SkyPilot makes both llm/ and agent/ top-level directories (instead of burying them under examples/), which is a declaration that "these are first-class product surfaces, not mere samples." Their versioning and release cadence can decouple from the main package — a posture worth stealing.

The public API surface · `sky/__init__.py`

A thin 258 lines. It does exactly three things: (1) at import time, force-normalizes the HTTP/HTTPS proxy environment variables (because the GCP SDK has a case-sensitivity bug), (2) re-exports all the verb APIs from sky.client.sdk into the top-level namespace, and (3) aliases the 24 cloud types. sky.launch doesn't live in sky/launch.py because SkyPilot uses the facade pattern — calling sky.launch() feels like a local function, but behind it is the full client → HTTP → server → execution pipeline chain.

# sky/__init__.py · line 83-84
# Keep this order to avoid cyclic imports
# pylint: disable=wrong-import-position
from sky import backends
from sky import batch  # noqa: F401
from sky import clouds
from sky.client.sdk import launch       # ← the facade re-exports live here
from sky.client.sdk import status
from sky.client.sdk import exec
...
AWS = clouds.AWS                          # ← the 24 cloud aliases
GCP = clouds.GCP
Kubernetes = clouds.Kubernetes
K8s = Kubernetes                          # ← an alias of an alias

That # Keep this order to avoid cyclic imports comment is real engineering debt — internal modules have circular dependencies, so import order is load-bearing. Don't be surprised when you see task.py and resources.py importing each other.

§ M1 · 35 min · Three objects, one universe.

Three files, three data classes — the periodic table of SkyPilot's entire universe:

File	Lines	Classes	Role
`sky/task.py`	2,142	2	The central request object (what to run)
`sky/resources.py`	2,816	2	Resource spec (where and how to run)
`sky/dag.py`	228	3	Task DAG (how multiple Tasks are orchestrated)

Counterintuitively — Resources is bigger than Task. The reason is that Resources exposes 30+ @property views: infra / cloud / region / zone / instance_type / cpus / memory / accelerators / use_spot / disk_size / disk_tier / network_tier / image_id / ports / labels / autostop_config / priority / docker_login_config / ... Each field maps to a real configuration dimension on the cloud, and all 24 clouds have to be covered — that's just how many fields there are.

DESIGN PATTERN · IMMUTABLE VALUE OBJECT

All 30+ fields are read-only @property; writes go through private _set_* methods plus the public functional-update Resources.copy(**override). This immutable design is critical in the Optimizer pipeline — it has to enumerate N candidates from the user's "constraint Resources" (AWS-spot vs GCP-on-demand vs ...), and if Resources were mutable, concurrent enumeration would cross-contaminate.
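The functional-update idea can be sketched in a few lines. This is a minimal illustration, not SkyPilot's code: `MiniResources` is a hypothetical three-field stand-in, and it uses a frozen dataclass where the real `Resources` class uses hand-written properties and private setters.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MiniResources:
    """Hypothetical stand-in for sky.Resources with only three fields."""
    cloud: Optional[str] = None
    accelerators: Optional[str] = None
    use_spot: bool = False

    def copy(self, **override) -> 'MiniResources':
        # Functional update: build a new object, never mutate self.
        return replace(self, **override)


# The user's constraint stays untouched while candidates are enumerated.
constraint = MiniResources(accelerators='A100:8')
candidates = [
    constraint.copy(cloud=cloud, use_spot=spot)
    for cloud in ('aws', 'gcp')
    for spot in (True, False)
]
```

Because every candidate is a fresh object, enumerating them never alters `constraint`, which is the property the Optimizer relies on.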

The cost is DX: the Resources class runs 2,816 lines, with every field requiring a property + private setter + handling in copy/__init__. A textbook "trade developer experience for correctness" engineering call.

Translating YAML → Python: the JSON round-trip trick in `_fill_in_env_vars`

The user writes file_mounts: { /model/llama-${SIZE}b: s3://llama-weights/${SIZE}b } — how does SkyPilot substitute ${SIZE} with the real value? The intuitive approach is to recursively walk every string field of the nested dict. SkyPilot uses a clever trick instead:

import json
import re


def _fill_in_env_vars(yaml_field, task_envs):
    yaml_field_str = json.dumps(yaml_field)           # 1) nested dict → JSON string

    def replace_var(match):
        var_name = match.group(1)
        # Unknown variables are returned unchanged.
        return task_envs.get(var_name, match.group(0))

    pattern = r'\$\{?\b([a-zA-Z_][a-zA-Z0-9_]*)\b\}?' # 2) regex matches every $VAR / ${VAR}
    yaml_field_str = re.sub(pattern, replace_var, yaml_field_str)
    return json.loads(yaml_field_str)                 # 3) deserialize the JSON back into a dict

Three lines of core code replace dozens of lines of "recursive nested-structure walker." The preconditions that make it work: (1) the schema has already been validated upstream by validate_schema, and (2) there's no need to support bash default-value syntax ${VAR:-default} (a TODO that's been open for years, line 115). Running a JSON serialize/deserialize each time isn't optimal, but for "small, one-shot" config loading, brevity wins over performance.
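To see the round trip at work, here is a small self-contained run. The function body repeats the three steps from the excerpt under the name `fill_in_env_vars`; the sample mounts and the `UNSET` variable are made up for illustration.

```python
import json
import re


def fill_in_env_vars(yaml_field, task_envs):
    """Same three steps as the excerpt: dumps, re.sub, loads."""
    text = json.dumps(yaml_field)

    def replace_var(match):
        # Variables missing from task_envs are left exactly as written.
        return task_envs.get(match.group(1), match.group(0))

    pattern = r'\$\{?\b([a-zA-Z_][a-zA-Z0-9_]*)\b\}?'
    return json.loads(re.sub(pattern, replace_var, text))


mounts = {
    '/model/llama-${SIZE}b': 's3://llama-weights/${SIZE}b',
    '/data': 's3://bucket/${UNSET}',
}
filled = fill_in_env_vars(mounts, {'SIZE': '70'})

# The unsupported bash default syntax is only partially matched.
partial = fill_in_env_vars({'k': '${SIZE:-7}b'}, {'SIZE': '70'})
```

Both keys and values are substituted in one pass because the regex runs over the whole serialized string. The last call shows why `${VAR:-default}` is listed as unsupported: the pattern matches only `${SIZE`, so the remainder `:-7}` is left behind in the output.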

STEAL THIS

This pattern drops straight into your AMD agent config system — whenever you need template substitution like ${MI300X_HOSTS} / ${TRITON_VERSION} inside nested YAML/JSON, a JSON round-trip + regex is 10× more compact than a recursive walker, and far easier for an LLM to review.

Two more things worth noting

ManagedSecretRef (line 279) — a dataclass with three fields (name / mount_path / scope_override). It lets YAML write named references like secrets: [secrets:HF_TOKEN, secrets:workspace:GH_PAT] instead of inline values; the actual token lives in the server-side vault, and the YAML never exposes it. scope_override decides which namespace to look in (personal / workspace / global).

register_task_validator (line 36) — a model of SkyPilot's plugin mechanism. A global list _task_validators, iterated over during Task.validate(). Who registers is implicit, triggered by module-import side effects. If you ever want to add "AMD-specific task validation" to SkyPilot (e.g. warn when the run command contains the cuda keyword but resources is AMD), you don't fork the main code — just write a plugin and register it here.

§ M2 / M2.5 · 60 min · CLI · SDK · the FastAPI in the middle.

Take the user's sky launch foo.yaml apart and it passes through five layers: terminal shell → Click CLI → Python SDK → FastAPI server → subprocess executor. The CLI and the SDK share one Python process, so that is four processes in total. Every hop is worth unpacking.

Click CLI · `sky/client/cli/command.py` · 7,954 lines

We'd estimated 2,900 lines; it's nearly triple that — it lumps the handler logic for 30+ subcommands (launch / exec / status / queue / cost_report / logs / down / stop / autostop / start / check / show-gpus / jobs / serve ...) all together. The launch subcommand handler is at line 1235 and takes 30+ Click options — note that infra, cloud, region, and zone coexist: the old interface (cloud/region/zone) and the new one (infra) live side by side during a gradual migration.

task_or_dag = _make_task_or_dag_from_entrypoint_with_overrides(
    entrypoint=entrypoint,        # path to foo.yaml
    name=name, workdir=workdir,
    cloud=cloud, region=region, zone=zone,
    gpus=gpus, cpus=cpus, memory=memory,
    instance_type=instance_type, num_nodes=num_nodes,
    use_spot=use_spot, image_id=image_id,
    env=env, secret=secret,
    ...
)
# ↑ Internally this call goes through M1's Task.from_yaml_config,
#   merges the CLI overrides into envs/secrets, runs _fill_in_env_vars,
#   and returns a clean Task object

request_id = sdk.launch(task, dryrun=dryrun, ...)
# ↑ This call goes over HTTP and returns a request_id (not the result!)

Python SDK · `sky/client/sdk.py` · 3,237 lines

The key signature tells the whole story: def launch(task, ...) -> RequestId[Tuple[Optional[int], Optional[ResourceHandle]]]. It returns a RequestId, not a result. The SDK is async — once you have the ID, you call sky.stream_and_get(request_id) to fetch the actual result (or the CLI does it for you). The reason: a launch may run for 30 minutes (provision the VM + install dependencies + run setup), and the HTTP request must never block.

The SDK also carries internal flags like _is_launched_by_jobs_controller / _is_launched_by_sky_serve_controller — a hint that Managed Jobs and SkyServe recursively call SDK launch for sub-tasks internally (spinning up worker VMs from the controller VM).

FastAPI server · `sky/server/server.py` · 3,524 lines

One line at line 926 reveals the skeleton:

app = fastapi.FastAPI(prefix='/api/v1', debug=True, lifespan=lifespan)

Then come 10+ middlewares: RBAC / RequestID / BasicAuth / BearerToken / AuthProxy / SecurityHeaders / InternalDashboardPrefix / CacheControlStatic / PathClean / GracefulShutdown / APIVersion. From the routes, the endpoints include /token / /api/v1/auth/* / /check / /enabled_clouds plus a long list of async tasks (cleanup_upload_ids / cleanup_unreferenced_file_mounts / loop_lag_monitor and more), all running inside FastAPI's lifespan context.

Request executor · `sky/server/requests/executor.py`

This is the most critical piece — the whole truth of the client/server decoupling lives here:

multiprocessing.set_start_method('spawn', force=True)
# On macOS, the default start method for multiprocessing is 'fork', which...

Each launch request is dropped into its own subprocess (spawn, not fork) to run. That's exactly why launch doesn't block other requests: it runs in a separate Python interpreter, and the FastAPI thread returns the request_id immediately. Paired with BurstableExecutor, processes can be added dynamically based on load. There's also an OnDemandThreadExecutor (line 100) dedicated to "running lightweight synchronous requests inside coroutines."

WHY THIS MATTERS

The "calling sky.launch() feels like a local function" experience is the result of 4 processes + 1 HTTP protocol + a spawn-based multiprocess executor working together. The cost is that the whole system is eventually consistent — the moment you get the request_id, the VM may not have started provisioning yet. Every subsequent "wait / status query / cancel" has to go back to the server and check the state of that request_id.

This "submit → poll / stream" pattern maps directly onto agent task submission in your multi-agent kernel optimization system — the same request_id design is worth borrowing.
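A toy version of the submit → poll pattern is sketched below. Everything in it is hypothetical: `MiniServer` keeps requests in memory and runs them on threads, whereas the text describes SkyPilot spawning a process per request behind an HTTP API.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple


class MiniServer:
    """Hypothetical in-process stand-in for the API server."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._requests: Dict[str, Future] = {}

    def submit(self, fn: Callable[..., Any], *args: Any) -> str:
        request_id = uuid.uuid4().hex
        self._requests[request_id] = self._pool.submit(fn, *args)
        return request_id  # handed back before fn has finished

    def status(self, request_id: str) -> str:
        return 'DONE' if self._requests[request_id].done() else 'RUNNING'

    def get(self, request_id: str, timeout: float = 10.0) -> Any:
        # Blocks until the work is finished, like a stream-and-get call.
        return self._requests[request_id].result(timeout=timeout)


def fake_launch(cluster: str) -> Tuple[int, str]:
    # Shaped like the SDK's (job_id, handle) result tuple.
    return (1, f'handle-for-{cluster}')


server = MiniServer()
rid = server.submit(fake_launch, 'dev')
result = server.get(rid)
```

The caller holds only `rid` between the two calls, so waiting, status checks, and cancellation all have to go through the server's request table.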

Plate II · One launch, dissected · CLI keystroke to spawned subprocess · temporal

Four processes / one RPC / one subprocess spawn. The instant the server returns the request_id, the executor subprocess has only just started running the real launch pipeline (the next 30 minutes of work). By default the CLI auto-streams the subsequent logs, but in --async mode it exits immediately, leaving you with just the ID, which is precisely what lets a controller "resume" a prior launch after spot recovery.

§ M3 · 40 min · Nine stages, one pipeline.

Once a request lands in the executor subprocess, sky/execution.py (939 lines) starts advancing through 9 stages. While planning we'd assumed 7 — the actual enum has 9, the extras being CLONE_DISK (experimental, clones a disk from another cluster into a new one) and OPTIMIZE (which we'd previously missed counting).

class Stage(enum.Enum):
    CLONE_DISK = enum.auto()
    OPTIMIZE = enum.auto()
    PROVISION = enum.auto()
    SYNC_WORKDIR = enum.auto()
    SYNC_FILE_MOUNTS = enum.auto()
    SETUP = enum.auto()
    PRE_EXEC = enum.auto()
    EXEC = enum.auto()
    DOWN = enum.auto()

Most stages map directly to a backend.method() call (CLONE_DISK and OPTIMIZE are the two that run outside the backend), so the backend abstraction is essentially the contract for the remaining seven:

Stage	Maps to	What happens
CLONE_DISK	`_maybe_clone_disk_from_cluster`	copy a disk image from another cluster
OPTIMIZE	`Optimizer.optimize(dag)`	pick cloud × region × instance, write to `task.best_resources`
PROVISION	`backend.provision(...)`	actually pull VMs from the cloud, return a ResourceHandle
SYNC_WORKDIR	`backend.sync_workdir(...)`	rsync the local workdir to the VM
SYNC_FILE_MOUNTS	`backend.sync_file_mounts(...)`	storage mounts (S3/GCS) + file mounts
SETUP	`backend.setup(...)`	run the user setup command (install deps)
PRE_EXEC	`backend.set_autostop(...)`	configure idle autostop
EXEC	`backend.execute(...)`	run the user run command, return a job_id
DOWN	`backend.teardown(...)`	tear down the cluster when `--down`

HIDDEN CONCURRENCY CONTROL · OPTIMISTIC + LOCK-INTERIOR FALLBACK

The comment at execution.py:474-505 reveals a subtle design: OPTIMIZE runs outside the per-cluster lock (because optimize is slow and can't block other requests), but once the backend grabs the lock it may find "the decision could be stale" (e.g. the cluster was just deleted by someone else). The fix is to pass the backend a planner callback — if it detects inside the lock that a re-plan is needed, it calls this to run the optimizer again.

This is the "optimistic optimization + critical-section fallback" concurrency pattern — a classic way to handle "long decision + state may drift inside the critical section."
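A compact sketch of this pattern, under stated assumptions: the cluster table, the `optimize` stand-in, and the staleness check (compare the plan against what the table now holds) are all invented for the example and are much simpler than what execution.py does.

```python
import threading
from typing import Callable, Dict, Optional

cluster_lock = threading.Lock()
# Hypothetical shared state: cluster name -> where it currently lives.
existing_clusters: Dict[str, str] = {'dev': 'aws/us-east-1'}


def optimize(cluster: str) -> str:
    """Stand-in for the slow planning step; runs without the lock."""
    return existing_clusters.get(cluster, 'gcp/us-central1')


def provision(cluster: str, plan: str, planner: Callable[[str], str]) -> str:
    with cluster_lock:
        # Re-validate inside the critical section; state may have drifted.
        current: Optional[str] = existing_clusters.get(cluster)
        if current != plan:
            plan = planner(cluster)  # re-plan while holding the lock
        existing_clusters[cluster] = plan
        return plan


plan = optimize('dev')            # optimistic decision: reuse aws/us-east-1
del existing_clusters['dev']      # meanwhile, someone else deletes the cluster
final = provision('dev', plan, planner=optimize)
```

The expensive call normally happens once, outside the lock; the second call under the lock is paid only when the first decision turns out to be stale.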

Plate III · The 9-stage pipeline · execution.py, in one elevation · temporal

Figure annotations: the exec command only fires SYNC_WORKDIR + EXEC; OPTIMIZE runs outside the per-cluster lock; the planner callback re-fires inside the lock if the cached decision is stale (see execution.py lines 474-505).

Every stage is conditionally fired (if Stage.X in stages:), so sky exec (running a command on an existing cluster) only triggers two stages: SYNC_WORKDIR + EXEC. The three stages provision / setup / exec are the "expensive" ones — each can take minutes.
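The membership guard can be illustrated with a trimmed-down enum. This sketch keeps only five of the nine stages and records stage names instead of calling a backend; it also loops over the enum rather than spelling out one `if` per stage, which is a simplification of the source.

```python
import enum
from typing import Iterable, List


class Stage(enum.Enum):
    OPTIMIZE = enum.auto()
    PROVISION = enum.auto()
    SYNC_WORKDIR = enum.auto()
    SETUP = enum.auto()
    EXEC = enum.auto()


def run_pipeline(stages: Iterable[Stage]) -> List[str]:
    wanted = set(stages)
    trace: List[str] = []
    for stage in Stage:               # definition order doubles as pipeline order
        if stage in wanted:           # same guard as `if Stage.X in stages:`
            trace.append(stage.name)  # a real pipeline would call the backend here
    return trace


launch_trace = run_pipeline(list(Stage))
exec_trace = run_pipeline([Stage.SYNC_WORKDIR, Stage.EXEC])
```

One pipeline function serves both commands; the caller decides which subset of stages to pass in.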

§ M4 · 45 min · The Optimizer, DP × ILP.

The Optimizer is SkyPilot's technical jewel — 1,805 lines doing one thing: across the Cartesian product of 24 clouds × N regions × M instance types × spot/on-demand × egress, pick the resource allocation with the lowest cost (or shortest time).

The key branch happens in the first step: given the Dag, ask it dag.is_chain():

DAG topology	Algorithm	Complexity	File location
Chain (linear pipeline)	Dynamic Programming (DP)	O(N · R²)	`_optimize_by_dp · line 429`
General DAG (branch/merge)	Integer Linear Programming (ILP), PuLP + CBC	NP-hard, but tractable in practice	`_optimize_by_ilp · line 490`

Dynamic programming · shortest path on a chain DAG

State definition: dp_best_objective[node][resources] = the minimum cumulative cost to finish node using the resources config. State transition:

# dp_best_objective[node][resources]
#     = my_execution_cost(node, resources)
#     + min over parent_resources of (
#           dp_best_objective[parent][parent_resources]
#         + egress_cost(parent → node, parent_resources, resources)
#       )

Add a dp_point_backs[node][resources] = best_parent_resources for backtracking the path. This is Bellman shortest-path applied to a "resource-candidate graph" — each (node, resources) is a vertex, and the edge weight between adjacent vertices is execution_cost + egress_cost.
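The recurrence and the back-pointer table can be written out as a short runnable sketch. The function name, the two-task chain, and all prices are invented; only the recurrence itself follows the state transition quoted above.

```python
from typing import Callable, Dict, List, Tuple


def optimize_chain(
    nodes: List[str],
    candidates: Dict[str, List[str]],
    exec_cost: Callable[[str, str], float],
    egress_cost: Callable[[str, str], float],
) -> Tuple[float, List[str]]:
    best: List[Dict[str, float]] = []  # best[i][r]: min cumulative cost
    back: List[Dict[str, str]] = []    # back[i][r]: best parent resources
    for i, node in enumerate(nodes):
        best.append({})
        back.append({})
        for r in candidates[node]:
            cost = exec_cost(node, r)
            if i > 0:
                parent = min(candidates[nodes[i - 1]],
                             key=lambda p: best[i - 1][p] + egress_cost(p, r))
                cost += best[i - 1][parent] + egress_cost(parent, r)
                back[i][r] = parent
            best[i][r] = cost
    # Backtrack from the cheapest choice at the last node.
    last = min(best[-1], key=best[-1].get)
    plan = [last]
    for i in range(len(nodes) - 1, 0, -1):
        plan.append(back[i][plan[-1]])
    return best[-1][last], plan[::-1]


prices = {('preprocess', 'aws'): 1.0, ('preprocess', 'gcp'): 2.0,
          ('train', 'aws'): 10.0, ('train', 'gcp'): 6.0}
total, plan = optimize_chain(
    nodes=['preprocess', 'train'],
    candidates={'preprocess': ['aws', 'gcp'], 'train': ['aws', 'gcp']},
    exec_cost=lambda node, r: prices[(node, r)],
    egress_cost=lambda src, dst: 0.0 if src == dst else 5.0,
)
```

In this toy instance the cheapest first step (aws at 1.0) is not part of the optimum: moving data across clouds costs 5.0, so the DP keeps both tasks on gcp for a total of 8.0.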

Integer linear programming · bilinear optimization on a general DAG

The docstring for the ILP formulation is textbook-clear; the key part is reproduced verbatim:

For cost optimization (after linearization):
  minimize_{c, e}  Σ c[v]ᵀ · k[v]     # execution costs at each node v
                 + Σ e[u,v]ᵀ · F[u,v] # egress costs on each edge (u,v)
  subject to:
    Σ c[v] == 1          for each v in V    # one-hot: pick one resource
    Σ e[u,v] == 1        for each (u,v) in E
    e[u,v] = flatten(c[u] @ c[v]ᵀ)           # linearize the bilinear term

For time (makespan) optimization:
  minimize finish_time[sink]
  subject to:
    finish_time[v] >= c[v]ᵀ·k[v] + finish_time[u] + e[u,v]ᵀ·F[u,v]
    for each (u,v) in E
    plus the same one-hot constraints

MATH TRICK · LINEARIZE THE BILINEAR

The raw quadratic decision c[u] · c[v]ᵀ ("u picks r₁ and v picks r₂") makes the problem not an ILP but a harder-to-solve quadratic IP. SkyPilot uses the classic McCormick linearization: introduce an auxiliary variable e[u,v] for "whether the edge-endpoint configs (r₁, r₂) are both selected," add one-hot constraints so that e stays consistent with c, and the whole problem becomes an ILP that the PuLP + CBC solver can handle.

This trick is directly usable for your "multi-agent resource scheduling" — any problem where "two objects each pick a config and the pairing carries a cost" can be flattened into an ILP this way.
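The claim that the flattened product is a one-hot vector whose dot product with F picks out a single pairing cost can be checked by brute force. The cost matrix below is made up, and the row/column-sum linking constraints are one standard way to tie e to c with linear equalities; whether optimizer.py uses exactly these constraints is an assumption here.

```python
from itertools import product
from typing import List


def one_hot(index: int, size: int) -> List[int]:
    return [1 if i == index else 0 for i in range(size)]


def flatten_outer(cu: List[int], cv: List[int]) -> List[int]:
    """e[u,v] = flatten(c[u] @ c[v]^T), in row-major order."""
    return [a * b for a in cu for b in cv]


# Hypothetical egress matrix F[u,v]: 2 choices at u, 3 choices at v.
F = [[0.0, 4.0, 9.0],
     [4.0, 0.0, 7.0]]
F_flat = [x for row in F for x in row]

checks = []
for r1, r2 in product(range(2), range(3)):
    cu, cv = one_hot(r1, 2), one_hot(r2, 3)
    e = flatten_outer(cu, cv)
    linear_cost = sum(ei * fi for ei, fi in zip(e, F_flat))
    # Linear equalities that force e to agree with c[u] and c[v]:
    rows_ok = all(sum(e[i * 3 + j] for j in range(3)) == cu[i] for i in range(2))
    cols_ok = all(sum(e[i * 3 + j] for i in range(2)) == cv[j] for j in range(3))
    checks.append(linear_cost == F[r1][r2] and sum(e) == 1
                  and rows_ok and cols_ok)
```

For every one of the six pairings, the linear expression e·F equals the intended pairwise cost, so a solver that treats e as its own binary variable under these equalities never has to multiply two decision variables.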

Three "special" paths

Beyond the main path, optimize_job_group (line 1037) handles the "job group" semantics introduced in v0.12 — multiple tasks that must run on the same cloud / region (typical case: in RL training the actor and replay-buffer must share an availability zone to cut latency). It takes a special path through _optimize_same_infra + _find_common_infras + _select_best_infra, treating "infrastructure consistency" as a hard constraint.
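As a rough illustration of the "find common infra, then pick the best" shape, consider the sketch below. The function name, the cost tables, and the rule of minimising the summed hourly cost are assumptions made for the example; the real helpers may weigh candidates differently.

```python
from typing import Dict, Optional, Set, Tuple

Infra = Tuple[str, str]  # (cloud, region)


def select_common_infra(
    feasible: Dict[str, Dict[Infra, float]],
) -> Optional[Tuple[Infra, float]]:
    """feasible[task][infra] is an hourly cost; returns the cheapest shared infra."""
    common: Optional[Set[Infra]] = None
    for costs in feasible.values():
        common = set(costs) if common is None else common & set(costs)
    if not common:
        return None  # hard constraint cannot be met
    best = min(common,
               key=lambda infra: sum(c[infra] for c in feasible.values()))
    return best, sum(c[best] for c in feasible.values())


choice = select_common_infra({
    'actor': {('aws', 'us-east-1'): 3.0, ('gcp', 'us-central1'): 2.0},
    'replay_buffer': {('gcp', 'us-central1'): 1.5, ('azure', 'eastus'): 0.5},
})
no_overlap = select_common_infra({
    'actor': {('aws', 'us-east-1'): 3.0},
    'replay_buffer': {('azure', 'eastus'): 0.5},
})
```

Note that the replay buffer's individually cheapest option (azure) is discarded because the actor cannot run there; the intersection comes first, cost second.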

Plate IV · How the Optimizer decides · chain → DP, general → ILP · conceptual

Every chain DAG (including the most common "single-task" special case) takes the DP branch and finishes almost instantly. Multi-branch / multi-merge DAGs take ILP, flattening the Cartesian product space into an integer linear program. Does the OPTIMIZE stage run on the client or the server? On the server: the catalog data and PuLP both live in the server container, so the client doesn't carry that weight.

§ M5 / M6 · 75 min · Backend × Cloud, the polymorphic dance.

These two layers are mutually defining and must be read together. backend.py (212 lines) defines the generic abstract base class Backend[ResourceHandle], listing the methods for the 9 stages. cloud.py (1,041 lines) defines the abstract base class Cloud, listing the 14 methods every cloud must implement.

The key insight: Backend is the "caller," Cloud is the "callee." Read backend first and you'll understand why the cloud interface looks the way it does — the shape of the abstract methods is dictated by the call pattern.

The shape of the Backend abstraction

class Backend(Generic[_ResourceHandleType]):
    def provision(self, task, to_provision_config, dryrun, ...) -> Tuple[ResourceHandle, bool]: ...
    def sync_workdir(self, handle, workdir, envs_and_secrets): ...
    def sync_file_mounts(self, handle, all_file_mounts, storage_mounts): ...
    def setup(self, handle, task, detach_setup): ...
    def execute(self, handle, task, dryrun) -> Optional[int]: ...     # → job_id
    def teardown(self, handle, terminate): ...
    def register_info(self, **kwargs): ...
    # ... plus a few internal _method hooks

The main implementation CloudVmRayBackend lives in cloud_vm_ray_backend.py, a 6,562-line single file. The "main class" CloudVmRayBackend itself starts at line 3038 and runs to line 6562: a single class of roughly 3,500 lines. The file also holds eight helper classes, listed in seven rows below (the two failover handlers share a row):

Class	Line	Role
`GangSchedulingStatus`	385	state enum for multi-node gang scheduling
`FailoverCloudErrorHandlerV1 / V2`	402 / 529	failover decisions on provision failure (V1 old interface, V2 new)
`RetryingVmProvisioner`	796	the key one — retries provision across three levels: zone → region → cloud
`SSHTunnelInfo`	1938	SSH tunnel connection info
`CloudVmRayResourceHandle`	1955	the public ResourceHandle implementation — holds cluster state
`LocalResourcesHandle`	2795	local-mode special case
`SkyletClient`	2843	★ the backend → skylet gRPC client
`CloudVmRayBackend`	3038	the main class
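The three-level retry attributed to RetryingVmProvisioner can be pictured as nested loops. The sketch below is hypothetical throughout: the catalog, the capacity set, and the function name are invented, and the real class also consults the failover error handlers to decide what to skip.

```python
from typing import Callable, Dict, List, Optional, Tuple

Placement = Tuple[str, str, str]  # (cloud, region, zone)


def provision_with_failover(
    catalog: Dict[str, Dict[str, List[str]]],
    try_provision: Callable[[Placement], bool],
) -> Tuple[Optional[Placement], List[Placement]]:
    """Walk clouds, then regions, then zones; stop at the first success."""
    attempts: List[Placement] = []
    for cloud, regions in catalog.items():
        for region, zones in regions.items():
            for zone in zones:
                placement = (cloud, region, zone)
                attempts.append(placement)
                if try_provision(placement):
                    return placement, attempts
    return None, attempts  # every zone of every cloud was exhausted


catalog = {
    'aws': {'us-east-1': ['us-east-1a', 'us-east-1b'],
            'us-west-2': ['us-west-2a']},
    'gcp': {'us-central1': ['us-central1-a']},
}
has_capacity = {('gcp', 'us-central1', 'us-central1-a')}
chosen, attempts = provision_with_failover(
    catalog, lambda placement: placement in has_capacity)
```

Zones are exhausted before the region changes, and regions before the cloud changes, which keeps the retry as close as possible to the Optimizer's first choice.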

Which "Ray" is the Ray in the name

The "Ray" in CloudVmRayBackend refers to the ray.io project — but it isn't using Ray for distributed training. It borrows Ray's cluster launcher (ray up / ray attach) to manage VM lifecycles. SkyPilot started as a Ray extension and later spun off, so this is part legacy baggage, part design choice. The sky/skylet/ray_patches/ subdirectory shows their private patches to the Ray cluster launcher.

The Cloud abstraction's 14 required methods

Every cloud (AWS / GCP / Kubernetes / SSH / Slurm / RunPod / ...) must tell SkyPilot:

regions_with_offering(instance_type, accel, use_spot, region, zone)
zones_provision_loop(...)               # iterate zones for provision attempts
get_zone_shell_cmd()                    # shell command that reports the zone from inside the VM
instance_type_to_hourly_cost(it, spot)  # feeds prices to the Optimizer
accelerators_to_hourly_cost(accel)
get_egress_cost(num_gigabytes)          # feeds the egress matrix to the ILP
make_deploy_resources_variables(...)    # translates Resources into fill-in variables for the Ray YAML template
get_vcpus_mem_from_instance_type(it)
get_accelerators_from_instance_type(it)
get_default_instance_type(...)
_get_feasible_launchable_resources(r)   # pre-filters candidates to lighten the Optimizer's load
get_credential_file_mounts()
_unsupported_features_for_resources(r)
query_status(name_id_filter)            # pulls the cluster's current state from the cloud

The concrete implementations cover all 24 clouds, and the combined code volume is staggering (aws.py is 1,712 lines, kubernetes.py 1,491). This is exactly the heart of "adding a new cloud backend," detailed later in Red Line 1.

§ M7 · 40 min · Skylet, the satellite.

The most pleasant surprise: sky/skylet/skylet.py is only 107 lines. This node daemon — dubbed "SkyPilot's kubelet" — has an absurdly lean core. It does just two things: (1) start a gRPC server, and (2) run a 6-event polling loop. All the heavy logic lives in companion modules.

import time

from sky.skylet import events

EVENTS = [
    events.AutostopEvent(),                         # shut down automatically when idle
    events.JobSchedulerEvent(),                     # node-local job scheduling
    events.ManagedJobEvent(),                       # managed job status sync
    events.ServiceUpdateEvent(pool=False),          # serving controller health
    events.ServiceUpdateEvent(pool=True),           # pool status refresh
    events.UsageHeartbeatReportEvent(),             # usage telemetry (every 10 min)
]

def run_event_loop():
    for event in EVENTS:
        event.start()
    while True:
        time.sleep(events.EVENT_CHECKING_INTERVAL_SECONDS)
        for event in EVENTS:
            event.run()
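
The loop ticks every event on every wake-up, so each event has to decide for itself whether it is due. A minimal sketch of one way to do that, under assumptions: `PeriodicEvent`, `interval_ticks` and the bounded `run_event_loop` below are hypothetical names for illustration, not the contents of SkyPilot's events module.

```python
from typing import Callable, List


class PeriodicEvent:
    """Runs `action` once every `interval_ticks` loop wake-ups (hypothetical)."""

    def __init__(self, name: str, interval_ticks: int,
                 action: Callable[[], None]) -> None:
        self.name = name
        self.interval_ticks = interval_ticks
        self._action = action
        self._ticks = 0

    def run(self) -> None:
        self._ticks += 1
        if self._ticks % self.interval_ticks != 0:
            return  # not due yet
        try:
            self._action()
        except Exception as e:  # one bad event must not take the daemon down
            print(f'{self.name} failed: {e!r}')


def run_event_loop(events: List[PeriodicEvent], wakeups: int,
                   sleep: Callable[[], None] = lambda: None) -> None:
    """Bounded version of the daemon loop; a real daemon would loop forever."""
    for _ in range(wakeups):
        sleep()
        for event in events:
            event.run()
```

Injecting `sleep` keeps the loop testable without real waiting, which is the same "daemon is wiring" idea applied to the loop itself.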

gRPC server 暴露 4 个服务，全部用 protobuf 生成的 stub：

The gRPC server exposes 4 services, all using protobuf-generated stubs:

Service	Proto	谁是 client
`AutostopServiceImpl`	`autostopv1`	backend, 配置 autostop
`JobsServiceImpl`	`jobsv1`	backend, 提交/查 job
`ServeServiceImpl`	`servev1`	serve controller
`ManagedJobsServiceImpl`	`managed_jobsv1`	jobs controller

Service	Proto	Who's the client
`AutostopServiceImpl`	`autostopv1`	backend, configures autostop
`JobsServiceImpl`	`jobsv1`	backend, submits/queries jobs
`ServeServiceImpl`	`servev1`	serve controller
`ManagedJobsServiceImpl`	`managed_jobsv1`	jobs controller

SkyletClient（cloud_vm_ray_backend.py:2843）就是这些 gRPC service 的客户端 — backend 通过它跟节点上的 skylet 通信。

SkyletClient (cloud_vm_ray_backend.py:2843) is the client for these gRPC services — the backend talks to the skylet on the node through it.

值得偷的设计模式 · MICRO-DAEMON + FAT MODULES

A PATTERN WORTH STEALING · MICRO-DAEMON + FAT MODULES

守护进程本体保持极薄，重逻辑全部下沉到独立模块（job_lib.py 1459 行 · log_lib.py 909 行 · autostop_lib.py 382 行 · events.py 442 行 · services.py 634 行）。这种"小核心 + 大附件"的拆法让你可以单独测试每个 lib，而 daemon 本身只是 wiring。

The daemon proper stays razor-thin, with all the heavy logic pushed down into separate modules (job_lib.py 1,459 lines · log_lib.py 909 · autostop_lib.py 382 · events.py 442 · services.py 634). This "small core + large attachments" split lets you test each lib in isolation, while the daemon itself is just wiring.

对你 multi-agent kernel optimization 系统的 worker daemon 设计完全适用 — 把"agent 本体"做成 100 行的 event loop + gRPC server，把"kernel benchmark / profile / analysis"做成独立 lib。

It fits the worker-daemon design of your multi-agent kernel optimization system perfectly — make the "agent proper" a 100-line event loop + gRPC server, and make "kernel benchmark / profile / analysis" standalone libs.

Plate V — Skylet 107 lines, exploded 1 : 1

左臂（gRPC）应答外部调用；右臂（event loop）巡视本地状态。两臂共享同一组下层库，这正是"daemon 即 wiring，logic 在 lib"的范式。

The left arm (gRPC) answers external calls; the right arm (event loop) patrols local state. Both arms share the same lower-level libraries: this is the "daemon is wiring, logic lives in libs" pattern.

§ M8 · 35 min · Managed Jobs, two state machines.

Managed Jobs 是 SkyPilot 的杀手特性 — 在 spot 实例上跑长跑作业不丢进度。它的核心设计是两层状态机，一个隔离另一个的瞬态故障：

Managed Jobs is SkyPilot's killer feature — run long jobs on spot instances without losing progress. Its core design is a two-level state machine, one isolating the transient failures of the other:

class ManagedJobStatus(enum.Enum):
    """
    The ManagedJobStatus is a higher level status than the JobStatus.
    Each managed job submitted to a cluster will have a JobStatus
    associated with it:
        JobStatus = [INIT, SETTING_UP, PENDING, RUNNING, ...]
    Whenever the cluster is preempted and recovered, the JobStatus
    transitions multiple times.

    However, a managed job only has one ManagedJobStatus on the jobs controller.
        ManagedJobStatus = [PENDING, STARTING, RUNNING, ...]
    """

翻译成人话：

In plain terms:

JobStatus（在 worker cluster 的节点上）：被 spot 抢占时会经历 RUNNING → FAILED；恢复后新 cluster 起来又走 INIT → SETTING_UP → PENDING → RUNNING。一个 managed job 的生命周期里这个底层状态可能反复 RUNNING → FAILED → RUNNING。
ManagedJobStatus（在独立的 jobs controller VM 上）：从用户视角看到的状态。底下被抢 N 次，这里都不会出现失败：状态保持 RUNNING，最多在新 cluster 起来之前短暂进入 RECOVERING。只有当 controller 决定"不再 recover"了，才会进入 FAILED_NO_RESOURCE 或 FAILED_CONTROLLER。

JobStatus (on the nodes of the worker cluster): when a spot instance is preempted, it goes RUNNING → FAILED; after recovery a new cluster comes up and walks INIT → SETTING_UP → PENDING → RUNNING again. Over a managed job's lifetime this low-level status may go RUNNING → FAILED → RUNNING repeatedly.
ManagedJobStatus (on the separate jobs controller VM): the status as the user sees it. No matter how many times the cluster is preempted underneath, no failure surfaces here: the status holds at RUNNING, at most dipping into RECOVERING while a new cluster comes up. Only when the controller decides "no more recovery" does it enter FAILED_NO_RESOURCE or FAILED_CONTROLLER.
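
A toy model of that absorption, as a sketch under assumptions: `LowStatus`, `HighStatus`, `supervise` and the `max_recoveries` cutoff are hypothetical simplifications, far smaller than the real enums, and the retry limit is invented for the example.

```python
import enum
from typing import Iterable, List


class LowStatus(enum.Enum):  # per-cluster, analogous to JobStatus
    RUNNING = 'RUNNING'
    FAILED = 'FAILED'
    SUCCEEDED = 'SUCCEEDED'


class HighStatus(enum.Enum):  # per-job, analogous to ManagedJobStatus
    RUNNING = 'RUNNING'
    RECOVERING = 'RECOVERING'
    SUCCEEDED = 'SUCCEEDED'
    FAILED_NO_RESOURCE = 'FAILED_NO_RESOURCE'


def supervise(low_events: Iterable[LowStatus],
              max_recoveries: int) -> List[HighStatus]:
    """Fold a stream of low-level statuses into the user-visible history."""
    history = [HighStatus.RUNNING]
    recoveries = 0
    for low in low_events:
        if low is LowStatus.SUCCEEDED:
            history.append(HighStatus.SUCCEEDED)
            break
        if low is LowStatus.FAILED:
            if recoveries == max_recoveries:
                # Only now does a failure become visible upstairs.
                history.append(HighStatus.FAILED_NO_RESOURCE)
                break
            recoveries += 1
            history.append(HighStatus.RECOVERING)
        elif history[-1] is HighStatus.RECOVERING:
            history.append(HighStatus.RUNNING)  # the new cluster is up again
    return history
```

Feeding it RUNNING, FAILED, RUNNING, FAILED, RUNNING, SUCCEEDED with a budget of three recoveries yields a history that never contains a failure state, which is the whole point of the upper layer.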

这种"上层状态吸收下层抖动"的设计同样反映在 recovery_strategy.py（1024 行）里：

This "the upper status absorbs the lower jitter" design is also reflected in recovery_strategy.py (1,024 lines):

Strategy class	Line	Behavior on preemption
`StrategyExecutor`	61	抽象基类，定义 `recover()` 钩子
`FailoverStrategyExecutor`	815	等所有 zones/regions 都试一遍，再换 cloud
`EagerFailoverStrategyExecutor`	936	抢占后立即跳到下一个 region，不死磕

Strategy class	Line	Behavior on preemption
`StrategyExecutor`	61	abstract base class, defines the `recover()` hook
`FailoverStrategyExecutor`	815	exhaust all zones/regions, then switch clouds
`EagerFailoverStrategyExecutor`	936	jump to the next region immediately on preemption, no hammering
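
The "pluggable strategy" idea behind that table can be sketched in a few lines. This is an illustration under assumptions, not the code in recovery_strategy.py: the `Strategy` base, the registry keyed by name, the names `'FAILOVER'` / `'EAGER_FAILOVER'`, and the region-ordering rules are all simplified stand-ins.

```python
from typing import Callable, Dict, List, Optional, Type


class Strategy:
    """Hypothetical base: choose the order of regions to try after a preemption."""
    registry: Dict[str, Type['Strategy']] = {}

    def __init_subclass__(cls, name: str, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        Strategy.registry[name] = cls  # subclasses register themselves

    def candidates(self, regions: List[str], preempted: str) -> List[str]:
        raise NotImplementedError

    def recover(self, regions: List[str], preempted: str,
                try_launch: Callable[[str], bool]) -> Optional[str]:
        for region in self.candidates(regions, preempted):
            if try_launch(region):
                return region
        return None


class Failover(Strategy, name='FAILOVER'):
    def candidates(self, regions, preempted):
        # Retry where we were first, then everything else.
        return [preempted] + [r for r in regions if r != preempted]


class EagerFailover(Strategy, name='EAGER_FAILOVER'):
    def candidates(self, regions, preempted):
        # Leave the preempted region immediately; come back to it last.
        return [r for r in regions if r != preempted] + [preempted]
```

A caller looks a strategy up by name and calls `recover()`, so adding a third policy means adding one subclass and touching nothing else.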

scheduler.py（466 行）单独控制"controller 进程"的并行度 — 因为每个 managed job 在 controller VM 上是一个独立 Python 进程，太多 controller 会撑爆 VM 内存。

scheduler.py (466 lines) separately controls the concurrency of the "controller processes" — because each managed job is a separate Python process on the controller VM, too many controllers will blow out the VM's memory.
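
A rough sketch of that kind of admission control, assuming a memory-derived cap. The names `max_parallel_controllers` and `AdmissionQueue` and all the numbers are hypothetical; this is not how scheduler.py is actually written.

```python
from collections import deque
from typing import Deque, List


def max_parallel_controllers(total_mem_mb: int, per_controller_mb: int,
                             reserved_mb: int = 0) -> int:
    """How many controller processes fit in memory (all inputs are assumptions)."""
    return max(1, (total_mem_mb - reserved_mb) // per_controller_mb)


class AdmissionQueue:
    """Admit waiting jobs only while live controllers are under the cap."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.waiting: Deque[int] = deque()
        self.alive: List[int] = []

    def submit(self, job_id: int) -> None:
        self.waiting.append(job_id)
        self._schedule()

    def finish(self, job_id: int) -> None:
        self.alive.remove(job_id)
        self._schedule()  # a slot just freed up

    def _schedule(self) -> None:
        while self.waiting and len(self.alive) < self.limit:
            self.alive.append(self.waiting.popleft())
```

Jobs beyond the cap simply wait in order; nothing is rejected, which matches the "controller count is the scarce resource" framing above.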

直接相关 · DIRECTLY APPLICABLE

DIRECTLY APPLICABLE

"两层状态机吸收瞬态故障"的设计是你的 multi-agent kernel optimization 系统应该照搬的 — agent 跑 kernel benchmark 时 ROCm 偶发 OOM / driver hang / 节点掉线，底层 task 状态反复抖动，但上层 agent 任务状态应该稳定保持 RUNNING，让 agent 不感知底层故障。recovery_strategy.py 那种"策略可插拔"的接口直接抄。

The "two-level state machine absorbs transient failures" design is one your multi-agent kernel optimization system should copy outright. When an agent runs a kernel benchmark and ROCm hits an intermittent OOM, a driver hang, or a dropped node, the lower task status jitters back and forth, but the upper agent task status should hold steady at RUNNING so the agent never notices the underlying failure. The "pluggable strategy" interface of recovery_strategy.py is worth copying directly.

Plate VI — The two-level state machine Managed Jobs · spot resilience temporal

用户看到的"一个 job 跑了 6 小时"是上层平直的红线；底下其实是三段独立的 cluster 生命周期，每次抢占触发 recovery_strategy.recover() 重新 provision + sync + setup + exec。两层状态机的隔离让"长跑作业"变成可能。

What the user sees — "one job ran for 6 hours" — is the flat red line on the upper lane; underneath are actually three separate cluster lifecycles, each preemption triggering recovery_strategy.recover() to re-provision + sync + setup + exec. The isolation of the two-level state machine is what makes "long-running jobs" possible.

§ M9 · 35 min · SkyServe, scaling out.

SkyServe 是模型 serving 子系统 — 把 N 个 replica 跑在多云多区，自动扩缩，路由请求。结构上跟 Managed Jobs 镜像对称：每个 service 有独立的 controller VM，下面挂 K 个 replica VM，外加一个独立的 load balancer 进程负责把流量分发到健康的 replica。

SkyServe is the model-serving subsystem — run N replicas across multiple clouds and regions, autoscale, and route requests. Structurally it mirrors Managed Jobs: each service has its own controller VM, with K replica VMs hanging beneath it, plus a separate load balancer process that distributes traffic to the healthy replicas.

File	Lines	Role
`controller.py`	297	service 生命周期
`replica_managers.py`	1,564	★ replica 健康监控 + 自愈
`autoscalers.py`	1,288	QueueLengthAutoscaler 等扩缩策略
`load_balancer.py`	342	FastAPI 进程，路由 HTTP
`load_balancing_policies.py`	262	round_robin / least_load / 等
`spot_placer.py`	281	replica 在 spot 上的放置策略
`service_spec.py`	661	YAML schema
`serve_state.py`	835	state DB
`serve_utils.py`	1,934	helpers

File	Lines	Role
`controller.py`	297	service lifecycle
`replica_managers.py`	1,564	★ replica health monitoring + self-healing
`autoscalers.py`	1,288	scaling policies like QueueLengthAutoscaler
`load_balancer.py`	342	FastAPI process, routes HTTP
`load_balancing_policies.py`	262	round_robin / least_load / etc.
`spot_placer.py`	281	placement strategy for replicas on spot
`service_spec.py`	661	YAML schema
`serve_state.py`	835	state DB
`serve_utils.py`	1,934	helpers

关键发现：load balancer 自己就是一个 FastAPI 进程（load_balancer.py:53: self._app = fastapi.FastAPI()），跑在 controller VM 上。意味着 user request 的 hot path 是 client → LB FastAPI → replica HTTP server，多一跳但代价可控（LB 与 controller 同在一台 VM 上，不需要额外机器）。

Key finding: the load balancer is itself a FastAPI process (load_balancer.py:53: self._app = fastapi.FastAPI()) running on the controller VM. So the hot path for a user request is client → LB FastAPI → replica HTTP server: one extra hop, but at a bounded cost, since the LB shares the controller VM instead of needing a machine of its own.
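
The routing decision the load balancer makes per request is small. Here is a sketch of the two policies named in the table, round_robin and least_load, written as plain Python with no web framework; the class names and the `select` / `done` interface are assumptions, not the API in load_balancing_policies.py.

```python
import itertools
from typing import Dict, Iterator, List


class RoundRobin:
    """Hand out replicas in a fixed rotation."""

    def __init__(self, replicas: List[str]) -> None:
        self._cycle: Iterator[str] = itertools.cycle(replicas)

    def select(self) -> str:
        return next(self._cycle)


class LeastLoad:
    """Send each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas: List[str]) -> None:
        self._in_flight: Dict[str, int] = {r: 0 for r in replicas}

    def select(self) -> str:
        # Ties resolve to the earliest replica in insertion order.
        replica = min(self._in_flight, key=self._in_flight.get)
        self._in_flight[replica] += 1
        return replica

    def done(self, replica: str) -> None:
        self._in_flight[replica] -= 1
```

`LeastLoad` needs the `done()` callback because it tracks state across requests, while `RoundRobin` is stateless apart from its cursor; that difference is why a policy interface usually has both a pre-request and a post-request hook.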

spot_placer.py 是个有趣的独立模块 — 它不复用 Optimizer 的 ILP 逻辑，而是有专门策略：把 replica 分散在不同 region / zone，降低"同一时刻全军覆没"的概率。这是 serving 特有的可用性考虑。

spot_placer.py is an interesting standalone module — it does not reuse the Optimizer's ILP logic, but has a dedicated strategy: spread replicas across different regions / zones to lower the chance of "the whole fleet going down at once." This is an availability concern specific to serving.
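
The spreading idea can be sketched as a single function. This is an assumption-laden illustration, not the logic in spot_placer.py: `place_replica` and its "least crowded zone, skipping recently preempted ones" rule are invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Set


def place_replica(zones: List[str], active: Iterable[str],
                  recently_preempted: Set[str]) -> str:
    """Pick the zone for the next replica.

    `active` lists the zone of every replica currently running.
    """
    counts = Counter(active)
    preferred = [z for z in zones if z not in recently_preempted]
    pool = preferred or zones  # if every zone was preempted, fall back to all
    return min(pool, key=lambda z: counts[z])  # ties go to the earliest zone
```

Called repeatedly, it fills zones evenly before doubling up anywhere, so a single zone-wide preemption takes out at most a fraction of the fleet.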

Autoscaler 用 AutoscalerDecisionOperator 二元决策（SCALE_UP / SCALE_DOWN）+ AutoscalerDecision(operator, target) 数据类。QueueLengthAutoscaler 是一个具体实现 — 看队列深度做扩缩。这种"decision + operator"结构方便插入新策略（QPS / latency / GPU util）。

The autoscaler pairs a two-valued AutoscalerDecisionOperator (SCALE_UP / SCALE_DOWN) with an AutoscalerDecision(operator, target) dataclass. QueueLengthAutoscaler is one concrete implementation: it scales based on queue depth. This "decision + operator" structure makes it easy to plug in new policies (QPS / latency / GPU util).
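
A minimal sketch of that "decision + operator" shape, under assumptions: `Operator`, `Decision` and `decide_by_queue_length` are hypothetical names, and treating `target` as a replica count delta is a choice made for this example rather than a statement about what the real dataclass carries.

```python
import dataclasses
import enum
import math
from typing import List


class Operator(enum.Enum):
    SCALE_UP = 'scale_up'
    SCALE_DOWN = 'scale_down'


@dataclasses.dataclass(frozen=True)
class Decision:
    operator: Operator
    target: int  # in this sketch: how many replicas to add or remove


def decide_by_queue_length(queue_length: int, num_replicas: int,
                           per_replica: int, min_replicas: int = 1,
                           max_replicas: int = 10) -> List[Decision]:
    """Aim for at most `per_replica` queued requests per replica."""
    desired = math.ceil(queue_length / per_replica)
    desired = max(min_replicas, min(max_replicas, desired))
    if desired > num_replicas:
        return [Decision(Operator.SCALE_UP, desired - num_replicas)]
    if desired < num_replicas:
        return [Decision(Operator.SCALE_DOWN, num_replicas - desired)]
    return []  # already at the desired size
```

Because the policy only returns data, a QPS- or latency-based policy can reuse the same executor that applies the decisions; only the function computing `desired` changes.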

§ Traps · five of them · What I wish I'd known.

新人最容易栽的五个坑 — 如果你将来读源码、做改动、或者跟 SkyPilot 团队提 PR，先记住这几条。

The five traps newcomers fall into most — if you're going to read the source, make changes, or send the SkyPilot team a PR, memorize these first.

Trap № 1 · The "Ray" pun

cloud_vm_ray_backend.py 里的 "Ray" 不是分布式训练 Ray，是借 ray.io 的 cluster launcher 管 VM。sky/skylet/ray_patches/ 子目录可以确认 — 那是他们对 Ray cluster launcher 的私有 patch。SkyPilot 早期是 Ray 的扩展，独立后还沿用了 Ray 的 VM 编排底层。

The "Ray" in cloud_vm_ray_backend.py is not distributed-training Ray; it borrows ray.io's cluster launcher to manage VMs. The sky/skylet/ray_patches/ subdirectory confirms it — those are their private patches to the Ray cluster launcher. SkyPilot started as a Ray extension, and after spinning off it kept using Ray's VM-orchestration layer.

Trap № 2 · Four state files

这是新人必混的：

A guaranteed source of confusion for newcomers:

sky/skylet/job_lib.py — 节点本地 job DB
sky/jobs/state.py — managed job 全局 DB（在 jobs controller 上）
sky/serve/serve_state.py — serving 状态
sky/server/state.py — API server 状态

sky/skylet/job_lib.py — node-local job DB
sky/jobs/state.py — global managed-job DB (on the jobs controller)
sky/serve/serve_state.py — serving state
sky/server/state.py — API server state

四个不同 FSM，对应 4 个不同层级的 job 概念。读 M7 / M8 / M9 时画一张对照表别忘了。

Four different FSMs, mapping to 4 different levels of the job concept. When reading M7 / M8 / M9, don't forget to draw a cross-reference table.

Trap № 3 · _fill_in_env_vars 不只是字符串替换

Trap № 3 · _fill_in_env_vars is more than string substitution

看似只是 ${VAR} 替换，实际上还涉及 secrets 不写日志、Docker 登录凭证不进 envs、和 file_mounts/storage/service/volumes 的多场景调用。它是个安全边界，不是无脑替换 — 加 plugin 时如果你引入新的"需要替换 env vars 的字段"，记得调它。

It looks like mere ${VAR} substitution, but it also handles keeping secrets out of logs, keeping Docker login credentials out of envs, and multi-scenario calls across file_mounts/storage/service/volumes. It's a security boundary, not a dumb replace — when adding a plugin, if you introduce a new "field that needs env-var substitution," remember to call it.
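
The substitution half of this can be sketched with a JSON round trip: serialize the whole nested config once, substitute in the flat string, and parse it back. This is an illustration only. The `fill_in_env_vars` below is a hypothetical free function, not SkyPilot's `_fill_in_env_vars`, and it deliberately leaves out the secrets and log-redaction side described above.

```python
import json
import re
from typing import Any, Dict

_VAR = re.compile(r'\$\{(\w+)\}')


def fill_in_env_vars(config: Any, env: Dict[str, str]) -> Any:
    """Substitute ${VAR} everywhere in a nested config via one JSON round trip."""
    text = json.dumps(config)

    def substitute(match):
        name = match.group(1)
        if name not in env:
            return match.group(0)  # leave unknown variables untouched
        # JSON-escape the value so quotes or backslashes cannot break the parse.
        return json.dumps(env[name])[1:-1]

    return json.loads(_VAR.sub(substitute, text))
```

The escaping step is the part that is easy to forget: a naive `str.replace` on the serialized text breaks as soon as a value contains a double quote.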

Trap № 4 · Optimizer 在 server 端跑（不在 client）

Trap № 4 · The Optimizer runs on the server (not the client)

直觉很容易反过来想 — client 上有 PuLP 也能跑啊。但 catalog 数据（pricing / availability CSV）在 server 端容器里，client 不背这个重，所以 OPTIMIZE 是 server 端 executor 子进程里跑的。这点决定了：(1) catalog 同步策略只发生在 server 上，(2) client 想"先看看代价"得发请求到 server。

Intuition easily gets this backwards — the client has PuLP too, so why not run it there? But the catalog data (pricing / availability CSV) lives in the server-side container, and the client doesn't carry that weight, so OPTIMIZE runs in the server-side executor subprocess. This determines that: (1) the catalog sync strategy only happens on the server, and (2) a client that wants to "preview the cost" has to send a request to the server.

Trap № 5 · Plugin 机制隐式调用

Trap № 5 · The plugin mechanism is invoked implicitly

register_task_validator · sky/server/plugins.py · plugin_hooks.py — SkyPilot 有完整 plugin 体系，但不在主流程显式调用，靠装饰器 + entry_points + module import 副作用触发。读代码时看到"这函数貌似没人调"，先怀疑是 plugin hook。

register_task_validator · sky/server/plugins.py · plugin_hooks.py — SkyPilot has a full plugin system, but it isn't called explicitly in the main flow; it's triggered by decorators + entry_points + module-import side effects. When you read the code and see "this function seems to have no callers," suspect a plugin hook first.

§ Red lines · three · The big questions.

读完整套源码，你应该能口头回答这三个大问题。每个红线问题贯穿多个模块，是真正的"懂了"指标。

After reading the whole source, you should be able to answer these three big questions out loud. Each red-line question cuts across multiple modules — the real measure of having "got it."

如果要给 SkyPilot 加一个新云后端（比如 AMD ROCm 集群、或 Nebius / Crusoe 这种 neocloud），要改哪些文件？按什么顺序？

To add a new cloud backend to SkyPilot (say an AMD ROCm cluster, or a neocloud like Nebius / Crusoe), which files do you change, and in what order?

这是对你 AMD 工作直接相关的红线。读完 M5 / M6 / M7 你应该能列出：

1. sky/clouds/<new>.py — 实现 Cloud 抽象基类的 14 个抽象方法
2. sky/catalog/<new>_catalog.py — 提供 pricing / availability（给 Optimizer 喂价格矩阵）
3. sky/provision/<new>/ — 实现节点起停的低层逻辑
4. sky/clouds/__init__.py — 注册新云别名（让 sky.MyCloud 能用）
5. sky/check.py — 加上凭证检查（sky check 能识别新云）
6. （可选）sky/serve/spot_placer.py — 如果新云有 spot，注册放置策略
7. （可选）sky/dashboard/ — Web UI 里显示新云的图标和元数据

CLAUDE.md 官方的"Adding a new cloud provider"流程是 4 步（上面第 1–3 项加第 4 项），实际工作量通常落在第 1、3、5 项，见下方 Plate VII。

This is the red line most directly relevant to your AMD work. After M5 / M6 / M7 you should be able to list:

1. sky/clouds/<new>.py: implement the 14 abstract methods of the Cloud base class
2. sky/catalog/<new>_catalog.py: provide pricing / availability (feed the Optimizer its price matrix)
3. sky/provision/<new>/: implement the low-level node start/stop logic
4. sky/clouds/__init__.py: register the new cloud alias (so sky.MyCloud works)
5. sky/check.py: add the credential check (so sky check recognizes the new cloud)
6. (optional) sky/serve/spot_placer.py: register a placement strategy if the new cloud has spot
7. (optional) sky/dashboard/: show the new cloud's icon and metadata in the Web UI

The official "Adding a new cloud provider" flow in CLAUDE.md has 4 steps (items 1 to 3 plus item 4 above), but the real work usually lands in items 1, 3, and 5; see Plate VII below.

一条 sky launch foo.yaml 命令，从按下回车到 VM 上 run 起来，跨越多少个进程、几次 RPC、几次 SSH？

For one sky launch foo.yaml command, from pressing Enter to the run starting on the VM — how many processes, RPCs, and SSH connections does it cross?

读完整个 repo 你应该能画一张时序图：

进程：terminal → CLI Python 进程 → SDK 同进程 → server FastAPI 进程 → executor spawn 的子进程 → ray launcher 在 client VM 起子进程 → 云 SDK 调云 API 起 VM → cloud-init 在 VM 上拉 skylet → skylet 起 4 个 gRPC service。共 9 个环节，但并非 9 个独立的 Python 进程：例如 SDK 与 CLI 就在同一进程内。

RPC / 网络：CLI → server 1 次 HTTP POST + N 次 stream/poll；server → 云 API 数十次；server SSH → VM 数次（rsync workdir、装依赖、提交 job）；backend → skylet gRPC 数次。核心路径约 5 类网络协议：HTTP、云厂商 REST API、SSH、gRPC、object storage upload/download。

After reading the whole repo you should be able to draw a sequence diagram:

Processes: terminal → CLI Python process → SDK (same process) → server FastAPI process → executor-spawned subprocess → ray launcher spinning up a subprocess on the client VM → cloud SDK calling the cloud API to start the VM → cloud-init pulling skylet onto the VM → skylet starting 4 gRPC services. That is 9 links in the chain, though not 9 separate Python processes: the SDK, for one, runs inside the CLI process.

RPC / network: CLI → server, 1 HTTP POST + N stream/polls; server → cloud API, dozens; server SSH → VM, several (rsync workdir, install deps, submit job); backend → skylet gRPC, several. Roughly 5 classes of network protocol on the core path: HTTP, cloud-vendor REST API, SSH, gRPC, and object-storage upload/download.

SkyPilot 的核心抽象到底是什么？Task / Resources / Cluster — 设计者把"状态"放在了哪一层？

What is SkyPilot's core abstraction, really? Task / Resources / Cluster — at which layer did the designers put "state"?

三分法明显：

Task 是无状态请求（值对象）— 一次 launch 对应一个 Task，请求结束生命周期结束
Resources 是规格不可变值对象（functional update via .copy(**override)）— 描述"想要什么"
Cluster 是带状态实体（DB 在 sky/server/state.py 和 sky/skylet/job_lib.py）— 描述"现在是什么"

这种"请求 / 规格 / 实体"三分法是后续可以借鉴到你的 multi-agent kernel optimization 系统的核心设计原则。你的 kernel optimization 任务（Task）应该是无状态的、可重放的；硬件 + 软件配置（Resources）应该是不可变值对象，可以快速枚举变体；实际跑起来的 agent worker（Cluster 类比）才有状态、需要 DB。

The three-way split is clear:

Task is a stateless request (value object) — one launch corresponds to one Task, and its lifecycle ends when the request ends
Resources is an immutable spec value object (functional update via .copy(**override)) — describes "what you want"
Cluster is a stateful entity (DBs in sky/server/state.py and sky/skylet/job_lib.py) — describes "what is right now"

This "request / spec / entity" trichotomy is a core design principle worth borrowing for your multi-agent kernel optimization system. Your kernel optimization task (Task) should be stateless and replayable; the hardware + software config (Resources) should be an immutable value object you can enumerate variants of quickly; and only the actually-running agent worker (the Cluster analogue) carries state and needs a DB.
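
The "immutable spec with functional update" part is easy to reproduce with a frozen dataclass. A minimal sketch, assuming a hypothetical `Spec` class with invented fields; it mirrors the `.copy(**override)` idea but is not SkyPilot's Resources class.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass(frozen=True)
class Spec:
    """Hypothetical immutable resource spec: describes what you want."""
    accelerator: str
    count: int = 1
    use_spot: bool = False
    region: Optional[str] = None

    def copy(self, **override) -> 'Spec':
        # Functional update: returns a new object, the original is untouched.
        return dataclasses.replace(self, **override)


def variants(base: Spec, regions: List[str]) -> List[Spec]:
    """Enumerate candidate specs without ever mutating `base`."""
    return [base.copy(region=r, use_spot=s)
            for r in regions for s in (False, True)]
```

Because frozen dataclasses are hashable and compare by value, the variants can go straight into sets or dict keys, which is convenient when an optimizer needs to deduplicate candidates.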

Plate VII — Adding an AMD ROCm cloud file-by-file for Jhin

第 1、2、3 步是新文件，第 4、5 步是修改现有文件，第 6、7 步可选。如果你的 AMD 集群是 K8s 上跑（CSP 风格），最简方式是直接用 SkyPilot 已有的 kubernetes 后端，不需要新写 cloud —— 只需在 K8s 上 deploy SkyPilot helm chart，集群就能被 sky launch --infra k8s 调度。

Steps 1, 2, 3 are new files; steps 4, 5 edit existing files; steps 6, 7 are optional. If your AMD cluster runs on K8s (CSP-style), the simplest path is to just use SkyPilot's existing kubernetes backend — no new cloud needed. Deploy the SkyPilot helm chart on K8s and the cluster can be scheduled by sky launch --infra k8s.


§ Epilogue · What we found, and what's next.

SkyPilot 不是一个云抽象层；它是一个三段式分布式系统（client / API server / VM agent），中间用一个 9-stage 流水线把"用户意图"翻译成"VM 上跑的进程"。它的工程价值不是哪一个具体算法（虽然 Optimizer 的 DP+ILP 混合很漂亮），而是边界清晰 — Task 是请求、Resources 是规格、Backend 是执行契约、Cloud 是多态被调用方、Skylet 是节点上的事件循环、Managed Jobs 是两层状态机吸收瞬态故障、SkyServe 是多副本拓扑。每一层的职责锐利分明，加新功能（新云、新调度策略、新 serving 拓扑）有清晰的接入点。

SkyPilot is not a cloud abstraction layer; it's a three-tier distributed system (client / API server / VM agent), with a 9-stage pipeline in the middle translating "user intent" into "processes running on a VM." Its engineering value isn't any single algorithm (though the Optimizer's DP+ILP hybrid is beautiful), but its clean boundaries — Task is the request, Resources is the spec, Backend is the execution contract, Cloud is the polymorphic callee, Skylet is the node's event loop, Managed Jobs is a two-level state machine absorbing transient failures, and SkyServe is the multi-replica topology. Each layer's responsibility is sharply defined, and adding new features (a new cloud, a new scheduling policy, a new serving topology) has a clear entry point.

对你 multi-agent kernel optimization 系统的几个直接借鉴：

A few things to borrow directly for your multi-agent kernel optimization system:

异步 request_id 模式 — agent 任务投递返回 ID，长跑用 stream/poll 拿结果。直接照搬 sky/server/requests/executor.py 的 spawn 多进程模式。
JSON 中转字符串替换 — agent 配置里的 ${MI300X_HOSTS} 模板替换，三行代码搞定，参考 M1 的 _fill_in_env_vars。
两层状态机吸收抖动 — kernel benchmark 偶发 OOM / driver hang 应该被下层状态机吸收，agent 任务的上层状态保持稳定。参考 M8 的 JobStatus vs ManagedJobStatus。
"daemon 是 wiring, logic 在 lib" — agent worker 本体保持 100 行级别，把 benchmark / profile / analysis 拆成独立 lib。参考 M7 的 skylet 107 行。
插件化 validator / strategy — recovery strategy 可插拔，task validator 可注册。给你的 kernel optimization 加新策略时不要 fork 主代码，靠 plugin 注册。参考 register_task_validator 和 recovery_strategy.py。

The async request_id pattern — agent task submission returns an ID; long runs use stream/poll to fetch results. Copy the spawn-based multiprocess pattern of sky/server/requests/executor.py outright.
JSON round-trip string substitution — template substitution like ${MI300X_HOSTS} in agent configs, done in three lines; see _fill_in_env_vars from M1.
A two-level state machine absorbs jitter — intermittent OOM / driver hang during a kernel benchmark should be absorbed by the lower state machine while the agent task's upper status stays stable. See JobStatus vs ManagedJobStatus from M8.
"The daemon is wiring, the logic is in libs" — keep the agent worker proper at the 100-line scale and split benchmark / profile / analysis into standalone libs. See skylet's 107 lines from M7.
Pluggable validators / strategies — the recovery strategy is pluggable, the task validator is registerable. When adding a new strategy to your kernel optimization, don't fork the main code — register a plugin. See register_task_validator and recovery_strategy.py.
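
As a starting point for the first item, here is a sketch of the request_id pattern: submit returns an ID at once, and the caller polls or blocks for the result later. It is an illustration under assumptions. `RequestServer` and its method names are hypothetical, and a thread pool stands in for the spawned worker processes to keep the example short.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional


class RequestServer:
    """Hypothetical in-process model of 'submit now, fetch later'."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._requests: Dict[str, Future] = {}

    def submit(self, fn: Callable[..., Any], *args: Any) -> str:
        request_id = uuid.uuid4().hex
        self._requests[request_id] = self._pool.submit(fn, *args)
        return request_id  # returned before the work finishes

    def status(self, request_id: str) -> str:
        fut = self._requests[request_id]
        if not fut.done():
            return 'RUNNING'
        return 'FAILED' if fut.exception() else 'SUCCEEDED'

    def get(self, request_id: str, timeout: Optional[float] = None) -> Any:
        # Blocks until the result is ready; re-raises the worker's exception.
        return self._requests[request_id].result(timeout=timeout)
```

Swapping the thread pool for a process pool changes isolation and memory behavior but not the interface, which is the property that makes the pattern worth copying.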

— Fin.

