[Model Runner] Support not_need_stop async memcopy #6176
Conversation
Thanks for your contribution!
Pull request overview
This PR optimizes the memory-copy handling of the not_need_stop flag by introducing asynchronous memory copies. It keeps a separate copy of the not_need_stop flag on the GPU and uses pinned memory plus custom CUDA ops to manage data transfer between CPU and GPU (a sketch of the resulting pattern follows the change list below).
Changes:
- Added new CUDA ops get_stop and set_stop for managing the stop flag on the GPU
- Added a not_need_stop_gpu field to ModelOutputData to hold the GPU-side stop flag
- Modified update_inputs_v1.cu to remove the synchronous memory-copy logic
- Updated the memory-allocation strategy to use pinned memory for asynchronous transfers
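For orientation, here is a minimal sketch of the pattern these changes set up, built only from Paddle calls that appear in the diff; the variable names are illustrative, not the PR's actual code:

```python
import paddle

# Host-side flag in pinned (page-locked) memory so device-to-host copies can
# run asynchronously; mirrors the .pin_memory() allocation in the diff.
not_need_stop_pinned = paddle.full([1], False, dtype="bool").pin_memory()

# Device-side flag that GPU kernels (e.g. update_inputs_v1) update in place.
not_need_stop_device = paddle.full([1], False, dtype="bool")

# Enqueue a device-to-host copy of the flag; the second argument is the
# blocking flag, passed as False at the PR's call site.
not_need_stop_pinned.copy_(not_need_stop_device, False)
```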
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| custom_ops/gpu_ops/set_stop.cu | New CUDA op file implementing the get_stop and set_stop functions for reading and setting the stop flag on the GPU |
| custom_ops/gpu_ops/update_inputs_v1.cu | Removed the synchronous GPU-to-CPU memory copy; now uses not_need_stop directly on the GPU |
| custom_ops/setup_ops.py | Added the new set_stop.cu file to the build list |
| fastdeploy/worker/output.py | Added the not_need_stop_gpu field to the ModelOutputData class |
| fastdeploy/worker/gpu_model_runner.py | Imports and uses the new get_stop and set_stop ops; initializes the pinned-memory and GPU tensors |
| fastdeploy/model_executor/pre_and_post_process.py | Updated post-processing to use the GPU-side stop flag and added an asynchronous memory copy |
custom_ops/gpu_ops/set_stop.cu

```cpp
void SetStop(paddle::Tensor& not_need_stop, bool flag) {
  bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
  not_need_stop_data[0] = flag;
}
```
Copilot AI · Jan 23, 2026
The SetStop function modifies GPU memory directly without any synchronization, which can cause a data race in an asynchronous GPU execution environment. Use Paddle's asynchronous operation APIs or add appropriate synchronization to guarantee correctness; one possible call-site guard is sketched below.
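A coarse but race-free workaround at the Python call site, sketched under the assumption that a full device synchronization is affordable there; the guard itself is hypothetical and not part of this PR, and the import path is assumed:

```python
import paddle
from fastdeploy.model_executor.ops.gpu import set_stop  # import path assumed

# Device-side stop flag, as allocated in the PR's gpu_model_runner.py.
not_need_stop_gpu = paddle.full([1], False, dtype="bool")

# Hypothetical guard: drain outstanding GPU work so no in-flight kernel is
# still writing the flag when SetStop's host-side store touches device memory.
paddle.device.cuda.synchronize()
set_stop(not_need_stop_gpu, True)
```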
fastdeploy/worker/gpu_model_runner.py

```diff
 def not_need_stop(self) -> bool:
     """Stop decoding if the tensor meets the termination condition"""
-    return self.share_inputs["not_need_stop"][0]
+    return get_stop(self.share_inputs["not_need_stop"])
```
Copilot AI · Jan 23, 2026
get_stop returns a CPU tensor, but here it is returned directly as a boolean. Change this to `return get_stop(self.share_inputs["not_need_stop"])[0]` or `return get_stop(self.share_inputs["not_need_stop"]).item()` to extract the boolean correctly. See the implementations in the other model runners (e.g. gcu_model_runner.py:1189, metax_model_runner.py:2647), which all use `[0]` to access the value in the tensor.
```suggestion
# NOTE: get_stop returns a CPU tensor; extract the scalar value as bool
return bool(get_stop(self.share_inputs["not_need_stop"])[0])
```
| """ | ||
| Indicates if stopping conditions should be ignored | ||
| """ | ||
| not_need_stop: bool |
Copilot AI · Jan 23, 2026
The newly added not_need_stop_gpu field is missing a docstring. For consistency with the rest of the codebase (every field has a matching docstring), add similar documentation for the new field, for example:

```python
"""
GPU version of not_need_stop flag for async operations
"""
not_need_stop_gpu: paddle.Tensor
```
```suggestion
not_need_stop: bool
"""
GPU version of not_need_stop flag for async operations
"""
```
| self.share_inputs["step_idx"] = paddle.full([max_num_seqs, 1], 0, dtype="int64") | ||
| self.share_inputs["not_need_stop"] = paddle.full([1], False, dtype="bool").cpu() | ||
| self.share_inputs["not_need_stop"] = paddle.full([1], False, dtype="bool").pin_memory() | ||
| self.share_inputs["not_need_stop_gpu"] = paddle.full([1], False, dtype="bool") |
Copilot AI · Jan 23, 2026
The name not_need_stop_gpu may cause confusion: not_need_stop is now allocated in pinned memory via .pin_memory(), while not_need_stop_gpu lives on the GPU. Consider clearer names, e.g. rename not_need_stop to not_need_stop_pinned, or not_need_stop_gpu to not_need_stop_device, to make their memory placement and purpose explicit.
| self.share_inputs["not_need_stop_gpu"] = paddle.full([1], False, dtype="bool") | |
| # not_need_stop_device is on device (e.g., GPU), while not_need_stop is pinned on CPU | |
| not_need_stop_device = paddle.full([1], False, dtype="bool") | |
| self.share_inputs["not_need_stop_device"] = not_need_stop_device | |
| # Keep old key for backward compatibility | |
| self.share_inputs["not_need_stop_gpu"] = not_need_stop_device |
custom_ops/gpu_ops/set_stop.cu

```cpp
  bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
  auto not_need_stop_cpu =
      GetEmptyTensor({1}, paddle::DataType::BOOL, paddle::CPUPlace());
  bool* not_need_stop_cpu_data =
      const_cast<bool*>(not_need_stop_cpu.data<bool>());
  not_need_stop_cpu_data[0] = not_need_stop_data[0];
  return {not_need_stop_cpu};
}

void SetStop(paddle::Tensor& not_need_stop, bool flag) {
  bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>());
  not_need_stop_data[0] = flag;
```
Copilot AI · Jan 23, 2026
The GetStop function reads data directly from GPU memory (not_need_stop.data()) and copies it into CPU memory without any synchronization, which can lead to data races and inconsistent results. Add appropriate synchronization (e.g. cudaStreamSynchronize or the equivalent Paddle API) before accessing the GPU data, to ensure in-flight GPU operations have completed.
```suggestion
  // Copy tensor to CPU with blocking to ensure all device operations are finished
  auto not_need_stop_cpu = not_need_stop.copy_to(paddle::CPUPlace(), true);
  return {not_need_stop_cpu};
}

void SetStop(paddle::Tensor& not_need_stop, bool flag) {
  // Modify the value on CPU and then copy back to the original place
  auto cpu_tensor = not_need_stop.copy_to(paddle::CPUPlace(), true);
  bool* cpu_data =
      cpu_tensor.mutable_data<bool>(paddle::CPUPlace());
  cpu_data[0] = flag;
  auto updated_tensor = cpu_tensor.copy_to(not_need_stop.place(), true);
  not_need_stop = updated_tensor;
```
fastdeploy/model_executor/pre_and_post_process.py

```python
    model_output.is_block_step,
    block_size,
)
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False)
```
Copilot AI · Jan 23, 2026
The asynchronous memory-copy code (not_need_stop_gpu.copy_to) was removed from update_inputs_v1.cu, but line 405 of pre_and_post_process.py now calls model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, False). The second argument False of this copy_ call denotes a blocking, synchronous copy, which may hurt performance. Check whether an asynchronous copy (True) should be used instead, to stay consistent with the "async memcopy" in the PR title.
```suggestion
model_output.not_need_stop.copy_(model_output.not_need_stop_gpu, True)
```
fastdeploy/model_executor/pre_and_post_process.py

```python
    async_output_queue.put(output)
else:
    if sampler_output.logprobs_tensors is None:
        share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids)
```
Copilot AI · Jan 23, 2026
A share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids) call was added before save_output. This copy does not specify whether it is asynchronous and may introduce extra synchronization overhead. Specify the copy mode (synchronous or asynchronous) explicitly, and keep it consistent with the overall asynchronous memory-copy strategy.
| share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids) | |
| share_inputs["sampled_token_ids"].copy_(sampler_output.sampled_token_ids, False) |
fastdeploy/worker/output.py (Outdated)
```python
Indicates if stopping conditions should be ignored
"""
not_need_stop: bool
not_need_stop_gpu: paddle.Tensor
```
Copilot AI · Jan 23, 2026
The PR description is incomplete. Per the custom coding guidelines, a PR description should at minimum explain why the changes are being made and what problem they solve. Please fill in the following sections:
- Motivation: the purpose and goals of this PR
- Modifications: a detailed account of the changes made in this PR
- Usage or Command: if this involves a new feature or a performance optimization, provide usage instructions or run commands
- Accuracy Tests: if this PR affects model output, provide accuracy test results
This information is essential for reviewers to understand the context and impact of the changes.
Motivation
not_need_stop and sampled_token_ids are to be copied asynchronously in the sample_and_propress stage, with a CUDA event used for synchronization in the save_output stage; a sketch of this pattern follows.
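A minimal sketch of that planned event handshake, assuming Paddle's paddle.device.cuda.Event API; the names are illustrative, and the event wiring is the stated plan rather than code in this diff:

```python
import paddle

# Illustrative tensors matching the PR's pinned-host / device pair.
not_need_stop_pinned = paddle.full([1], False, dtype="bool").pin_memory()
not_need_stop_device = paddle.full([1], False, dtype="bool")
copy_done = paddle.device.cuda.Event()

# Sampling/post-processing stage: enqueue the async device-to-host copy and
# record an event on the current stream so a later stage can wait on it.
not_need_stop_pinned.copy_(not_need_stop_device, False)
copy_done.record(paddle.device.cuda.current_stream())

# save_output stage: wait only for this copy, then read the host-side flag.
copy_done.synchronize()
should_stop = not bool(not_need_stop_pinned[0])
```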
Modifications
Use pin_memory to prepare for the subsequent asynchronous copies, as shown in the snippet below:
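The allocations in question, as they appear in the gpu_model_runner.py diff above:

```python
self.share_inputs["not_need_stop"] = paddle.full([1], False, dtype="bool").pin_memory()
self.share_inputs["not_need_stop_gpu"] = paddle.full([1], False, dtype="bool")
```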
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR is submitted to the release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.