
Conversation


@Sunny-bot1 Sunny-bot1 commented Jan 22, 2026

Motivation

not_need_stop and sampled_token_ids are to be copied asynchronously in the sample_and_propress stage, with synchronization performed via a CUDA event in the save_output stage.

Modifications

Use pin_memory to prepare for the subsequent asynchronous copies:

  • Change shared_inputs["not_need_stop"] to pinned memory. Because a pin_memory tensor turns into a GPU tensor once it is accessed from the Python side (why?), add two custom ops, get_stop and set_stop, that access the pinned tensor through its raw pointer
  • Add shared_inputs["sampled_token_ids"].pin_memory()
  • Add shared_inputs["not_need_stop_device"]
  • Move the CPU-GPU copies of not_need_stop and sampled_token_ids to the Python side, so the copy timing can be adjusted flexibly later
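The sequencing the bullets above describe (launch the copy early, synchronize only when the output is saved) can be sketched without any GPU. In this illustrative stand-in, a background thread plays the role of the CUDA copy stream and threading.Event plays the role of the CUDA event; all names here are hypothetical and not taken from the PR:

```python
import threading
import time

# Schematic sketch of the intended flow (illustrative only): a background
# thread stands in for the CUDA copy stream, and threading.Event stands in
# for the CUDA event recorded once the copy has landed.
class AsyncCopySketch:
    def __init__(self):
        self.copied = threading.Event()  # plays the role of the CUDA event
        self.host_buffer = None          # plays the role of the pinned CPU tensor

    def launch_copy(self, device_data):
        # "sample" stage: kick off the device-to-host copy without blocking
        def do_copy():
            time.sleep(0.01)             # simulated copy latency
            self.host_buffer = list(device_data)
            self.copied.set()            # "record" the event after the copy
        threading.Thread(target=do_copy).start()

    def read_after_sync(self):
        # "save_output" stage: wait on the event, then the read is safe
        self.copied.wait()
        return self.host_buffer

sketch = AsyncCopySketch()
sketch.launch_copy([7, 8, 9])            # returns immediately
tokens = sketch.read_after_sync()        # blocks until the copy has finished
print(tokens)                            # → [7, 8, 9]
```

The real implementation would record a CUDA event on the copy stream during sampling and wait on it in save_output; the sketch only shows the launch-then-sync ordering.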

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Jan 22, 2026

Thanks for your contribution!

@Sunny-bot1 Sunny-bot1 changed the title from "[Model Runner] Support not_need_gpu async memcopy" to "[Model Runner] Support not_need_stop async memcopy" on Jan 22, 2026
@Jiang-Jia-Jun Jiang-Jia-Jun requested a review from Copilot January 23, 2026 07:54

Copilot AI left a comment


Pull request overview

This PR optimizes the memory-copy handling of the not_need_stop flag by introducing asynchronous memory copies to improve performance. The changes maintain a separate copy of the not_need_stop flag on the GPU, and use pinned memory together with custom CUDA ops to manage data transfer between the CPU and the GPU.

Changes:

  • Added new CUDA ops get_stop and set_stop to manage the stop flag on the GPU
  • Added a not_need_stop_gpu field to ModelOutputData to support a GPU version of the stop flag
  • Modified update_inputs_v1.cu to remove the synchronous memory-copy logic
  • Updated the memory-allocation strategy to use pinned memory for asynchronous operations

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

File summaries:
  • custom_ops/gpu_ops/set_stop.cu: New CUDA op file implementing the get_stop and set_stop functions used to read and set the stop flag on the GPU
  • custom_ops/gpu_ops/update_inputs_v1.cu: Removed the synchronous GPU-CPU memory copy; the GPU-side not_need_stop is now used directly
  • custom_ops/setup_ops.py: Added the new set_stop.cu file to the build list
  • fastdeploy/worker/output.py: Added the not_need_stop_gpu field to the ModelOutputData class
  • fastdeploy/worker/gpu_model_runner.py: Imports and uses the new get_stop and set_stop ops; initializes the pinned-memory and GPU tensors
  • fastdeploy/model_executor/pre_and_post_process.py: Updated the post-processing logic to use the GPU version of the stop flag and added an asynchronous memory copy

Comment on lines +27 to +30
void SetStop(paddle::Tensor& not_need_stop, bool flag) {
bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
not_need_stop_data[0] = flag;
}
Copilot AI Jan 23, 2026


The SetStop function modifies GPU memory directly, with no synchronization of any kind. In an asynchronous GPU execution environment this can cause data races. Use Paddle's asynchronous-operation APIs, or add appropriate synchronization, to ensure correctness.

def not_need_stop(self) -> bool:
"""Stop decoding if the tensor meets the termination condition"""
return self.share_inputs["not_need_stop"][0]
return get_stop(self.share_inputs["not_need_stop"])
Copilot AI Jan 23, 2026


The get_stop function returns a CPU tensor, but it is returned here directly as a boolean. This should be changed to return get_stop(self.share_inputs["not_need_stop"])[0] or return get_stop(self.share_inputs["not_need_stop"]).item() so that the boolean value is extracted correctly. The other model runners (e.g. gcu_model_runner.py:1189, metax_model_runner.py:2647) all use [0] to access the value in the tensor.

Suggested change
return get_stop(self.share_inputs["not_need_stop"])
# NOTE: get_stop returns a CPU tensor; extract the scalar value as bool
return bool(get_stop(self.share_inputs["not_need_stop"])[0])
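
The scalar-extraction point can be illustrated with NumPy as a stand-in for Paddle's tensor API (an assumption: Paddle's one-element bool tensors behave analogously):

```python
import numpy as np

# NumPy used as a stand-in for Paddle tensor semantics (assumption:
# Paddle behaves analogously for one-element bool tensors).
flag = np.array([True])

indexed = flag[0]        # a NumPy bool scalar, truthy but not a plain bool
extracted = flag.item()  # .item() yields a genuine Python bool

print(type(indexed) is bool)    # → False
print(type(extracted) is bool)  # → True
print(bool(indexed))            # → True
```

Hence the suggestion to index with [0] (and optionally wrap in bool(...)) or call .item() before returning the value from not_need_stop.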

"""
Indicates if stopping conditions should be ignored
"""
not_need_stop: bool
Copilot AI Jan 23, 2026


The newly added not_need_stop_gpu field has no docstring. To stay consistent with the rest of the codebase (every field has a matching docstring), add similar documentation for the new field, for example:

"""
GPU version of not_need_stop flag for async operations
"""
not_need_stop_gpu: paddle.Tensor

Suggested change
-    not_need_stop: bool
+    not_need_stop: bool
+    """
+    GPU version of not_need_stop flag for async operations
+    """
self.share_inputs["step_idx"] = paddle.full([max_num_seqs, 1], 0, dtype="int64")
self.share_inputs["not_need_stop"] = paddle.full([1], False, dtype="bool").cpu()
self.share_inputs["not_need_stop"] = paddle.full([1], False, dtype="bool").pin_memory()
self.share_inputs["not_need_stop_gpu"] = paddle.full([1], False, dtype="bool")
Copilot AI Jan 23, 2026


The naming not_need_stop_gpu may cause confusion: not_need_stop is now allocated in pinned memory via .pin_memory(), while not_need_stop_gpu lives on the GPU. Consider clearer names, for example renaming not_need_stop to not_need_stop_pinned, or not_need_stop_gpu to not_need_stop_device, to make their memory locations and purposes explicit.

Suggested change
self.share_inputs["not_need_stop_gpu"] = paddle.full([1], False, dtype="bool")
# not_need_stop_device is on device (e.g., GPU), while not_need_stop is pinned on CPU
not_need_stop_device = paddle.full([1], False, dtype="bool")
self.share_inputs["not_need_stop_device"] = not_need_stop_device
# Keep old key for backward compatibility
self.share_inputs["not_need_stop_gpu"] = not_need_stop_device

Comment on lines +18 to +29
bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
auto not_need_stop_cpu =
GetEmptyTensor({1}, paddle::DataType::BOOL, paddle::CPUPlace());
bool* not_need_stop_cpu_data =
const_cast<bool*>(not_need_stop_cpu.data<bool>());
not_need_stop_cpu_data[0] = not_need_stop_data[0];
return {not_need_stop_cpu};
}

void SetStop(paddle::Tensor& not_need_stop, bool flag) {
bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
not_need_stop_data[0] = flag;
Copilot AI Jan 23, 2026


The GetStop function reads data directly from GPU memory (not_need_stop.data()) and copies it into CPU memory without any synchronization, which can lead to data races and inconsistent results. Add an appropriate synchronization step (such as cudaStreamSynchronize, or the equivalent Paddle API) before accessing the GPU data, to ensure pending GPU operations have completed.

Suggested change
-  bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
-  auto not_need_stop_cpu =
-      GetEmptyTensor({1}, paddle::DataType::BOOL, paddle::CPUPlace());
-  bool* not_need_stop_cpu_data =
-      const_cast<bool*>(not_need_stop_cpu.data<bool>());
-  not_need_stop_cpu_data[0] = not_need_stop_data[0];
-  return {not_need_stop_cpu};
-}
-
-void SetStop(paddle::Tensor& not_need_stop, bool flag) {
-  bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
-  not_need_stop_data[0] = flag;
+  // Copy tensor to CPU with blocking to ensure all device operations are finished
+  auto not_need_stop_cpu = not_need_stop.copy_to(paddle::CPUPlace(), true);
+  return {not_need_stop_cpu};
+}
+
+void SetStop(paddle::Tensor& not_need_stop, bool flag) {
+  // Modify the value on CPU and then copy back to the original place
+  auto cpu_tensor = not_need_stop.copy_to(paddle::CPUPlace(), true);
+  bool* cpu_data = cpu_tensor.mutable_data<bool>(paddle::CPUPlace());
+  cpu_data[0] = flag;
+  auto updated_tensor = cpu_tensor.copy_to(not_need_stop.place(), true);
+  not_need_stop = updated_tensor;
model_output.is_block_step,
block_size,
)
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False)
Copilot AI Jan 23, 2026


The asynchronous copy code (not_need_stop_gpu.copy_to) was removed from update_inputs_v1.cu, while line 405 of pre_and_post_process.py now adds model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False). The second argument False of this copy_ call means a blocking, synchronous copy, which may hurt performance. Check whether an asynchronous copy (True) should be used instead, to stay consistent with the "async memcopy" in the PR title.

Suggested change
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False)
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, True)

async_output_queue.put(output)
else:
if sampler_output.logprobs_tensors is None:
share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids)
Copilot AI Jan 23, 2026


A share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids) call was added before save_output. This copy does not specify whether it is asynchronous, and may introduce extra synchronization overhead. Specify the copy mode (synchronous or asynchronous) explicitly, and make sure it is consistent with the overall asynchronous memory-copy strategy.

Suggested change
share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids)
share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids, False)

Indicates if stopping conditions should be ignored
"""
not_need_stop: bool
not_need_stop_gpu: paddle.Tensor
Copilot AI Jan 23, 2026


The PR description is incomplete. According to the repository's custom coding guidelines, a PR description should at least explain why the modifications are being made and what problem they solve. Please fill in the following sections:

  1. Motivation: explain the purpose and goals of this PR
  2. Modifications: describe in detail the changes made in this PR
  3. Usage or Command: if this is a new feature or a performance optimization, provide usage instructions or run commands
  4. Accuracy Tests: if this PR affects model output, provide accuracy test results

This information is essential for reviewers to understand the context and impact of the changes.

Copilot generated this review using guidance from repository custom instructions.
