rl

rl is the online reinforcement learning stage. Starting from the actor checkpoint produced by sft, it trains with GRPO/GSPO on SWE-bench tasks using Harbor, vLLM, and verl. It runs on a multi-GPU node, with task execution backed by Kubernetes.
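For readers new to GRPO, the core idea is that each rollout is scored relative to the other rollouts sampled for the same task, not against a learned value function. The sketch below illustrates that group-relative normalization in plain Python. It is not the verl implementation: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, and real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same task).

    Hypothetical helper for illustration; not part of verl or this block.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps keeps the division finite when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two passed its tests (reward 1.0), two failed (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing rollouts end up with a positive advantage and failing ones with a negative advantage. When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.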

Full docs: coming soon. Paste the rl docs URL here when published (e.g. https://swe-rl-docs.pages.dev/docs).

Docs site not published yet

The rl block does not yet have a standalone documentation site. Once it ships, replace the placeholder above with the live URL. For now, the agent contract in subblock/rl/CLAUDE.md is the source of truth.

At a glance

Inputs: model (typically wired from sft.output.checkpoint_path), infrastructure, training, data, experiment, credentials.
Output: actor checkpoints under repos/harbor-verl-train/outputs/.
Runs: Local (8× GPU + K8s). Long-running.

How to run

/rl:setup        # build venv, sync submodules, apply verl patch
/rl:check        # preflight: GPUs, K8s reachability, vLLM env, base model
/rl:run          # launch training (long-running)
/rl:dashboard    # launch the trajectory + metric webui
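As a rough picture of what a preflight step can cover, the sketch below checks that GPU and Kubernetes tooling is on the PATH and that a base model path exists. It is a hypothetical stand-in, not the actual /rl:check implementation: the function name, the tool list, and the example path are assumptions, and it does not verify GPU count, cluster reachability, or the vLLM environment.

```python
import os
import shutil


def preflight(model_path, tools=("nvidia-smi", "kubectl")):
    """Return a {check_name: passed} dict.

    Hypothetical helper for illustration; /rl:check may do something different.
    """
    # shutil.which returns None when the executable is not found on PATH
    results = {f"{tool} on PATH": shutil.which(tool) is not None for tool in tools}
    results["base model path exists"] = os.path.exists(model_path)
    return results


if __name__ == "__main__":
    # "path/to/checkpoint" is a placeholder, not a real location in this repo
    for name, ok in preflight("path/to/checkpoint").items():
        print("OK  " if ok else "FAIL", name)
```

Returning a dictionary rather than raising on the first failure lets a single run report every missing prerequisite, which helps before committing to a long-running job.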

Reference

Block contract: subblock/rl/CLAUDE.md
Full docs site: coming soon
