[rl] refactor grader and trainer generator actor #2244
base: gh/wwwjn/7/base
Conversation
[ghstack-poisoned]
@dataclass
class TrajectoryData:
I thought we deprecated the name "trajectory", which is intrinsically ambiguous, but I don't remember what we replaced it with. Episode?
    rewards: torch.Tensor


class Scorer(Actor):
I thought you chose to use Grader. Not sure what the difference is, but let's stay aligned.
def _load_initial_weights(self, model: torch.nn.Module, model_path: str) -> None:
    """Load initial weights from HuggingFace checkpoint."""
    from torchtitan.experiments.rl.vllm_compat.weights.converter import (
        vllm_to_torchtitan,
Why use this function instead of our utils like `from_hf`?
Also, this comment does not seem to have been addressed.
q = q.transpose(1, 2)
k = k.transpose(1, 2)
v = v.transpose(1, 2)
# vLLM attention expects bfloat16 inputs
I don't think this should happen only for attention. In torchtitan the default dtype is fp32, and mixed precision is handled by FSDP, so under pure TP the forward dtype is fp32. If vLLM uses bf16 overall by default, we should match it; otherwise this is another place where the torchtitan-native vLLM forward would be slow.
Yes, the dtype difference could be a reason we are 40% slower when TP is not enabled.
So I don't think we should do it ad hoc for just the attention layer. I imagine we should init a bf16 model and load the state dict in bf16 precision.
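For illustration, a minimal sketch of that suggestion, assuming a generic model constructor and HF checkpoint loader (`build_model` and `load_hf_state_dict` are hypothetical placeholders, not torchtitan APIs):

```python
import torch

# Sketch only: construct the model in bf16 and cast the checkpoint once up front,
# instead of casting q/k/v inside the attention layer on every forward.
# `build_model` and `load_hf_state_dict` are hypothetical placeholders.

def build_bf16_model(model_args, checkpoint_path: str) -> torch.nn.Module:
    # Build parameters on the meta device, then materialize them directly in bf16.
    with torch.device("meta"):
        model = build_model(model_args)  # placeholder model constructor
    model = model.to_empty(device="cuda").to(torch.bfloat16)

    # Load the HF checkpoint and convert it to bf16 in one place.
    state_dict = load_hf_state_dict(checkpoint_path)  # placeholder HF loader
    state_dict = {k: v.to(torch.bfloat16) for k, v in state_dict.items()}
    model.load_state_dict(state_dict)
    return model
```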
This demonstrates:
- 1. Distributed actor architecture with Generator (vLLM) and Trainer (TorchTitan) components
+ 1. Distributed actor architecture with Generator (vLLM), Scorer, and Trainer (TorchTitan) components
  2. File based weight synchronization between trainer and generator
is this still true?
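(For reference, a file-based scheme like the one that line describes usually amounts to something like the sketch below; the paths and function names are illustrative, not this PR's code, and whether the PR still works this way is exactly the open question.)

```python
import os
import torch

# Illustrative sketch of file-based weight synchronization (not this PR's actual code):
# the trainer publishes a versioned state dict to a shared path, and the generator
# reloads the newest version before its next generation step.

SYNC_DIR = "/tmp/rl_weight_sync"  # assumed shared filesystem path

def trainer_publish(model: torch.nn.Module, step: int) -> None:
    os.makedirs(SYNC_DIR, exist_ok=True)
    tmp = os.path.join(SYNC_DIR, f"step_{step}.pt.tmp")
    torch.save(model.state_dict(), tmp)
    os.rename(tmp, os.path.join(SYNC_DIR, f"step_{step}.pt"))  # atomic publish

def generator_maybe_reload(model: torch.nn.Module, last_step: int) -> int:
    def step_of(name: str) -> int:
        return int(name.removeprefix("step_").removesuffix(".pt"))
    ckpts = [f for f in os.listdir(SYNC_DIR) if f.startswith("step_") and f.endswith(".pt")]
    if not ckpts:
        return last_step
    newest = max(ckpts, key=step_of)
    if step_of(newest) > last_step:
        model.load_state_dict(torch.load(os.path.join(SYNC_DIR, newest)))
        return step_of(newest)
    return last_step
```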
    job_config,  # Pass full job_config
)

# Spawn scorer on trainer mesh (can share resources with trainer)
Would like to learn more about how the Scorer/Grader works with the trainer/generator.
Naively I would think it should be put on the generator mesh, not the trainer mesh, although the two meshes may be the same and you are only using gpus=0 right now.
https://github.com/meta-pytorch/monarch/blob/main/docs/source/examples/grpo_actor.py#L505
I followed the practice here: the scorer is spawned on the trainer mesh. My intuition is that the main bottleneck is the generator (generation takes longer), so we want to put more work (e.g., computing rewards + advantages) on the trainer side instead.
If we only consider the algorithm, it could live on either the trainer or the generator. If we put it on the generator, the generated episode will be a scored episode; if we put it on the trainer, the generator can just pass an unscored episode to the trainer.
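To make the two placements concrete, here is a framework-free sketch (plain Python, no monarch actors; all names are illustrative). The only difference is which side runs the reward function before the episode crosses the generator-to-trainer boundary:

```python
import torch

def generate() -> dict:
    # Stand-in for the vLLM generator: returns an episode without rewards.
    return {"response_ids": torch.randint(0, 100, (16,)), "rewards": None}

def score(episode: dict) -> dict:
    # Stand-in for the Scorer/Grader: fills in rewards (and possibly advantages).
    episode["rewards"] = torch.rand(1)
    return episode

def trainer_step(episode: dict) -> None:
    assert episode["rewards"] is not None, "trainer expects a scored episode"

# Scorer on the trainer side (this PR): the generator hands over an unscored episode,
# and scoring runs next to the trainer, off the generation bottleneck.
unscored_episode = generate()          # crosses the generator -> trainer boundary
trainer_step(score(unscored_episode))  # scoring happens on the trainer side

# Scorer on the generator side (the alternative raised above): scoring runs before the
# hand-off, so the object crossing the boundary is already a scored episode.
scored_episode = score(generate())
trainer_step(scored_episode)
```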
[ghstack-poisoned]
    rewards: torch.Tensor


class Grader(Actor):
Usage of score / grade / reward is a bit arbitrary right now. Claude tells me:
- For general RL work: stick with "reward function" or "reward model".
- For RLHF pipelines: "reward model" is standard.
- If choosing between scorer/grader: "Scorer" is slightly more aligned with ML conventions, as it emphasizes the quantitative nature of the output.

Comparing scorer and grader, Scorer sounds better to me because the latter (more or less) suggests discrete reward values only.
# 1. Generator produces episode (without rewards)
episode = generator.generate.call().get().item(gpus=0)
# 2. Grader computes rewards
episode = grader.score.call(episode).get().item(gpus=0)
Suggested change:
- episode = grader.score.call(episode).get().item(gpus=0)
+ scored_episode = grader.score.call(episode).get().item(gpus=0)
metrics = trainer.step.call(batch).get().item(gpus=0)

# Fully sync RL loop with separate scoring step
# 1. Generator produces episode (without rewards)
episode = generator.generate.call().get().item(gpus=0)
There should be a distinction between episode vs. episodes. The generate call seems to return an Episodes object?
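As a hypothetical sketch of the distinction being asked about (these types are illustrative, not the PR's actual classes):

```python
from dataclasses import dataclass
from typing import List, Optional
import torch

# Hypothetical sketch only: one Episode per prompt/response pair, and an Episodes
# container for the batch that a single generate() call would return.

@dataclass
class Episode:
    response_ids: torch.Tensor
    rewards: Optional[torch.Tensor] = None  # filled in by the grader/scorer

@dataclass
class Episodes:
    episodes: List[Episode]

    def __len__(self) -> int:
        return len(self.episodes)

# If generate() returns Episodes (plural), naming the variable `episodes` at the call
# site makes the batched shape explicit, e.g.:
#   episodes = generator.generate.call().get().item(gpus=0)
```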
Stack from ghstack (oldest at bottom):