
Conversation


@Sunny-bot1 Sunny-bot1 commented Jan 22, 2026

Motivation

not_need_stop and sampled_token_ids are to be copied asynchronously in the sample_and_propress stage, with synchronization performed via a CUDA event in the save_output stage.

Modifications

Use pin_memory to prepare for the subsequent asynchronous copies:

  • Change shared_inputs["not_need_stop"] to pinned memory. Because a pin_memory tensor turns into a GPU tensor once it is accessed from the Python side (why?), add two custom ops, get_stop and set_stop, that access the pinned tensor through its raw pointer
  • Add shared_inputs["sampled_token_ids"].pin_memory()
  • Add shared_inputs["not_need_stop_device"]
  • Move the CPU-GPU copies of not_need_stop and sampled_token_ids to the Python side, so the copy timing can be adjusted flexibly later
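The sequencing the bullets above describe (launch the copy early, synchronize only when the output is saved) can be sketched without any GPU. In this illustrative stand-in, a background thread plays the role of the CUDA copy stream and threading.Event plays the role of the CUDA event; all names here are hypothetical and not taken from the PR:

```python
import threading
import time

# Schematic sketch of the intended flow (illustrative only): a background
# thread stands in for the CUDA copy stream, and threading.Event stands in
# for the CUDA event recorded once the copy has landed.
class AsyncCopySketch:
    def __init__(self):
        self.copied = threading.Event()  # plays the role of the CUDA event
        self.host_buffer = None          # plays the role of the pinned CPU tensor

    def launch_copy(self, device_data):
        # "sample" stage: kick off the device-to-host copy without blocking
        def do_copy():
            time.sleep(0.01)             # simulated copy latency
            self.host_buffer = list(device_data)
            self.copied.set()            # "record" the event after the copy
        threading.Thread(target=do_copy).start()

    def read_after_sync(self):
        # "save_output" stage: wait on the event, then the read is safe
        self.copied.wait()
        return self.host_buffer

sketch = AsyncCopySketch()
sketch.launch_copy([7, 8, 9])            # returns immediately
tokens = sketch.read_after_sync()        # blocks until the copy has finished
print(tokens)                            # → [7, 8, 9]
```

The real implementation would record a CUDA event on the copy stream during sampling and wait on it in save_output; the sketch only shows the launch-then-sync ordering.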

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Jan 22, 2026

Thanks for your contribution!

@Sunny-bot1 Sunny-bot1 changed the title from "[Model Runner] Support not_need_gpu async memcopy" to "[Model Runner] Support not_need_stop async memcopy" on Jan 22, 2026
@Jiang-Jia-Jun Jiang-Jia-Jun requested a review from Copilot January 23, 2026 07:54

Copilot AI left a comment


Pull request overview

This PR optimizes the memory-copy handling of the not_need_stop flag by introducing asynchronous memory copies to improve performance. The changes maintain a separate copy of the not_need_stop flag on the GPU, and use pinned memory together with custom CUDA ops to manage data transfer between the CPU and the GPU.

Changes:

  • Added new CUDA ops get_stop and set_stop to manage the stop flag on the GPU
  • Added a not_need_stop_gpu field to ModelOutputData to support a GPU version of the stop flag
  • Modified update_inputs_v1.cu to remove the synchronous memory-copy logic
  • Updated the memory-allocation strategy to use pinned memory for asynchronous operations

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

File summaries:
  • custom_ops/gpu_ops/set_stop.cu: New CUDA op file implementing the get_stop and set_stop functions used to read and set the stop flag on the GPU
  • custom_ops/gpu_ops/update_inputs_v1.cu: Removed the synchronous GPU-CPU memory copy; the GPU-side not_need_stop is now used directly
  • custom_ops/setup_ops.py: Added the new set_stop.cu file to the build list
  • fastdeploy/worker/output.py: Added the not_need_stop_gpu field to the ModelOutputData class
  • fastdeploy/worker/gpu_model_runner.py: Imports and uses the new get_stop and set_stop ops; initializes the pinned-memory and GPU tensors
  • fastdeploy/model_executor/pre_and_post_process.py: Updated the post-processing logic to use the GPU version of the stop flag and added an asynchronous memory copy

Comment on lines +27 to +30
void SetStop(paddle::Tensor& not_need_stop, bool flag) {
bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
not_need_stop_data[0] = flag;
}
Copilot AI Jan 23, 2026


The SetStop function modifies GPU memory directly, with no synchronization of any kind. In an asynchronous GPU execution environment this can cause data races. Use Paddle's asynchronous-operation APIs, or add appropriate synchronization, to ensure correctness.

def not_need_stop(self) -> bool:
"""Stop decoding if the tensor meets the termination condition"""
return self.share_inputs["not_need_stop"][0]
return get_stop(self.share_inputs["not_need_stop"])
Copilot AI Jan 23, 2026


The get_stop function returns a CPU tensor, but it is returned here directly as a boolean. This should be changed to return get_stop(self.share_inputs["not_need_stop"])[0] or return get_stop(self.share_inputs["not_need_stop"]).item() so that the boolean value is extracted correctly. The other model runners (e.g. gcu_model_runner.py:1189, metax_model_runner.py:2647) all use [0] to access the value in the tensor.

Suggested change
return get_stop(self.share_inputs["not_need_stop"])
# NOTE: get_stop returns a CPU tensor; extract the scalar value as bool
return bool(get_stop(self.share_inputs["not_need_stop"])[0])
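
The scalar-extraction point can be illustrated with NumPy as a stand-in for Paddle's tensor API (an assumption: Paddle's one-element bool tensors behave analogously):

```python
import numpy as np

# NumPy used as a stand-in for Paddle tensor semantics (assumption:
# Paddle behaves analogously for one-element bool tensors).
flag = np.array([True])

indexed = flag[0]        # a NumPy bool scalar, truthy but not a plain bool
extracted = flag.item()  # .item() yields a genuine Python bool

print(type(indexed) is bool)    # → False
print(type(extracted) is bool)  # → True
print(bool(indexed))            # → True
```

Hence the suggestion to index with [0] (and optionally wrap in bool(...)) or call .item() before returning the value from not_need_stop.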

"""
Indicates if stopping conditions should be ignored
"""
not_need_stop: bool
Copilot AI Jan 23, 2026


The newly added not_need_stop_gpu field has no docstring. To stay consistent with the rest of the codebase (every field has a matching docstring), add similar documentation for the new field, for example:

"""
GPU version of not_need_stop flag for async operations
"""
not_need_stop_gpu: paddle.Tensor

Suggested change
-    not_need_stop: bool
+    not_need_stop: bool
+    """
+    GPU version of not_need_stop flag for async operations
+    """
self.share_inputs["step_idx"] = paddle.full([max_num_seqs, 1], 0, dtype="int64")
self.share_inputs["not_need_stop"] = paddle.full([1], False, dtype="bool").cpu()
self.share_inputs["not_need_stop"] = paddle.full([1], False, dtype="bool").pin_memory()
self.share_inputs["not_need_stop_gpu"] = paddle.full([1], False, dtype="bool")
Copilot AI Jan 23, 2026


The naming not_need_stop_gpu may cause confusion: not_need_stop is now allocated in pinned memory via .pin_memory(), while not_need_stop_gpu lives on the GPU. Consider clearer names, for example renaming not_need_stop to not_need_stop_pinned, or not_need_stop_gpu to not_need_stop_device, to make their memory locations and purposes explicit.

Suggested change
self.share_inputs["not_need_stop_gpu"] = paddle.full([1], False, dtype="bool")
# not_need_stop_device is on device (e.g., GPU), while not_need_stop is pinned on CPU
not_need_stop_device = paddle.full([1], False, dtype="bool")
self.share_inputs["not_need_stop_device"] = not_need_stop_device
# Keep old key for backward compatibility
self.share_inputs["not_need_stop_gpu"] = not_need_stop_device

Comment on lines +18 to +29
bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
auto not_need_stop_cpu =
GetEmptyTensor({1}, paddle::DataType::BOOL, paddle::CPUPlace());
bool* not_need_stop_cpu_data =
const_cast<bool*>(not_need_stop_cpu.data<bool>());
not_need_stop_cpu_data[0] = not_need_stop_data[0];
return {not_need_stop_cpu};
}

void SetStop(paddle::Tensor& not_need_stop, bool flag) {
bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
not_need_stop_data[0] = flag;
Copilot AI Jan 23, 2026


The GetStop function reads data directly from GPU memory (not_need_stop.data()) and copies it into CPU memory without any synchronization, which can lead to data races and inconsistent results. Add an appropriate synchronization step (such as cudaStreamSynchronize, or the equivalent Paddle API) before accessing the GPU data, to ensure pending GPU operations have completed.

Suggested change
-  bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
-  auto not_need_stop_cpu =
-      GetEmptyTensor({1}, paddle::DataType::BOOL, paddle::CPUPlace());
-  bool* not_need_stop_cpu_data =
-      const_cast<bool*>(not_need_stop_cpu.data<bool>());
-  not_need_stop_cpu_data[0] = not_need_stop_data[0];
-  return {not_need_stop_cpu};
-}
-
-void SetStop(paddle::Tensor& not_need_stop, bool flag) {
-  bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
-  not_need_stop_data[0] = flag;
+  // Copy tensor to CPU with blocking to ensure all device operations are finished
+  auto not_need_stop_cpu = not_need_stop.copy_to(paddle::CPUPlace(), true);
+  return {not_need_stop_cpu};
+}
+
+void SetStop(paddle::Tensor& not_need_stop, bool flag) {
+  // Modify the value on CPU and then copy back to the original place
+  auto cpu_tensor = not_need_stop.copy_to(paddle::CPUPlace(), true);
+  bool* cpu_data = cpu_tensor.mutable_data<bool>(paddle::CPUPlace());
+  cpu_data[0] = flag;
+  auto updated_tensor = cpu_tensor.copy_to(not_need_stop.place(), true);
+  not_need_stop = updated_tensor;
model_output.is_block_step,
block_size,
)
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False)
Copilot AI Jan 23, 2026


The asynchronous copy code (not_need_stop_gpu.copy_to) was removed from update_inputs_v1.cu, while line 405 of pre_and_post_process.py now adds model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False). The second argument False of this copy_ call means a blocking, synchronous copy, which may hurt performance. Check whether an asynchronous copy (True) should be used instead, to stay consistent with the "async memcopy" in the PR title.

Suggested change
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False)
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, True)

async_output_queue.put(output)
else:
if sampler_output.logprobs_tensors is None:
share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids)
Copilot AI Jan 23, 2026


A share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids) call was added before save_output. This copy does not specify whether it is asynchronous, and may introduce extra synchronization overhead. Specify the copy mode (synchronous or asynchronous) explicitly, and make sure it is consistent with the overall asynchronous memory-copy strategy.

Suggested change
share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids)
share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids, False)

Indicates if stopping conditions should be ignored
"""
not_need_stop: bool
not_need_stop_gpu: paddle.Tensor
Copilot AI Jan 23, 2026


The PR description is incomplete. According to the repository's custom coding guidelines, a PR description should at least explain why the modifications are being made and what problem they solve. Please fill in the following sections:

  1. Motivation: explain the purpose and goals of this PR
  2. Modifications: describe in detail the changes made in this PR
  3. Usage or Command: if this is a new feature or a performance optimization, provide usage instructions or run commands
  4. Accuracy Tests: if this PR affects model output, provide accuracy test results

This information is essential for reviewers to understand the context and impact of the changes.

Copilot generated this review using guidance from repository custom instructions.
