
@mitu626
Contributor

@mitu626 mitu626 commented Jan 22, 2026

Motivation

By compiling each sm_version separately, every custom_ops package stays under 2 GB, which makes it possible to build sm80, 86, 89, and 90 into a single whl package.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

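  1. build.sh: when the FD_UNIFY_BUILD environment variable is set, compile custom_ops separately for each SM version.
  2. fastdeploy/model_executor/ops/gpu/__init__.py: load the custom_ops build that matches the SM version of the current system.
  3. setup.py: package the separately compiled per-SM custom_ops builds into the final whl package.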
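
The runtime selection in item 2 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `select_ops_module` is a hypothetical helper that takes the SM version and the package directory as arguments so that it runs without a GPU, whereas the PR's `decide_module` obtains the SM version from `paddle.device.cuda.get_device_properties()`.

```python
import os
import tempfile


def select_ops_module(sm_version: int, base_dir: str) -> str:
    """Return the relative module path of the custom ops build to load.

    Hypothetical helper that mirrors the directory convention in this PR:
    an SM-specific build lives in `fastdeploy_ops_<sm>/`, and the default
    build lives directly under `base_dir`.
    """
    sm_dir = os.path.join(base_dir, f"fastdeploy_ops_{sm_version}")
    if os.path.isdir(sm_dir):
        return f".fastdeploy_ops_{sm_version}.fastdeploy_ops"
    # No dedicated build for this SM version: use the default module.
    return ".fastdeploy_ops"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Simulate a wheel that ships a dedicated sm86 build.
        os.mkdir(os.path.join(base, "fastdeploy_ops_86"))
        print(select_ops_module(86, base))  # SM-specific module path
        print(select_ops_module(90, base))  # default module path
```

Passing the SM version in as a parameter keeps the directory lookup separate from the device query, which makes the lookup easy to exercise on machines without CUDA.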

Usage or Command

export FD_UNIFY_BUILD="true"
bash build.sh 1 python false

(In this mode the build always targets sm80, 90, 86, and 89. In all other cases, when FD_UNIFY_BUILD is not set, the build behaves as it did before this PR.)

Accuracy Tests

Not applicable.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Jan 22, 2026

Thanks for your contribution!

Jiang-Jia-Jun
Jiang-Jia-Jun previously approved these changes Jan 23, 2026
Contributor

Copilot AI left a comment


Pull request overview

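This PR adds support for building multiple CUDA SM versions (80, 86, 89, 90) into a single wheel. The custom_ops for each SM version are compiled separately into their own subdirectories so that every custom_ops package stays under 2 GB, which allows them to be packaged together.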

Changes:

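  • Adds an FD_UNIFY_BUILD mode that builds multiple SM versions into one wheel
  • Selects the matching custom_ops module at runtime based on the GPU's SM version
  • Extends package_data in setup.py to include the SM-specific directories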

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 17 comments.

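| File | Description |
| --- | --- |
| build.sh | Adds a build_custom_ops function to support the unified build mode, and refactors build_and_install_ops to take the target architectures and the output directory as parameters |
| fastdeploy/model_executor/ops/gpu/__init__.py | Implements decide_module, which selects the matching custom_ops module at runtime based on the current GPU's SM version |
| fastdeploy/import_ops.py | Improves error logging by emitting detailed exception information when an import fails |
| setup.py | Extends the package_data configuration to include the SM-specific fastdeploy_ops subdirectories and their contents |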

Comment on lines +24 to +35
    import paddle

    prop = paddle.device.cuda.get_device_properties()
    sm_version = prop.major * 10 + prop.minor
    print(f"current sm_version={sm_version}")

    import os

    curdir = os.path.dirname(os.path.abspath(__file__))
    sm_version_path = os.path.join(curdir, f"fastdeploy_ops_{sm_version}")
    if os.path.exists(sm_version_path):
        return f".fastdeploy_ops_{sm_version}.fastdeploy_ops"

Copilot AI Jan 23, 2026

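decide_module has no error handling. If CUDA is unavailable or the paddle.device.cuda.get_device_properties() call fails (for example, in a CPU-only environment), an uncaught exception is raised. Consider adding a try-except block to handle this case and returning the default fastdeploy_ops module when CUDA is unavailable.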

Suggested change
-    import paddle
-    prop = paddle.device.cuda.get_device_properties()
-    sm_version = prop.major * 10 + prop.minor
-    print(f"current sm_version={sm_version}")
-    import os
-    curdir = os.path.dirname(os.path.abspath(__file__))
-    sm_version_path = os.path.join(curdir, f"fastdeploy_ops_{sm_version}")
-    if os.path.exists(sm_version_path):
-        return f".fastdeploy_ops_{sm_version}.fastdeploy_ops"
+    """Decide which GPU ops module to load based on CUDA SM version.
+    If CUDA is unavailable or querying device properties fails (e.g., on CPU-only
+    environments), fall back to the default fastdeploy_ops module.
+    """
+    try:
+        import paddle
+    except Exception:
+        # Paddle is not available, fall back to default ops.
+        return ".fastdeploy_ops"
+    try:
+        # If Paddle is compiled without CUDA, fall back directly.
+        if hasattr(paddle.device, "is_compiled_with_cuda") and not paddle.device.is_compiled_with_cuda():
+            return ".fastdeploy_ops"
+        prop = paddle.device.cuda.get_device_properties()
+        sm_version = prop.major * 10 + prop.minor
+        print(f"current sm_version={sm_version}")
+        import os
+        curdir = os.path.dirname(os.path.abspath(__file__))
+        sm_version_path = os.path.join(curdir, f"fastdeploy_ops_{sm_version}")
+        if os.path.exists(sm_version_path):
+            return f".fastdeploy_ops_{sm_version}.fastdeploy_ops"
+    except Exception:
+        # Any failure when querying CUDA properties should not break import;
+        # fall back to the default ops implementation.
+        return ".fastdeploy_ops"

Comment on lines +24 to +31
import paddle

prop = paddle.device.cuda.get_device_properties()
sm_version = prop.major * 10 + prop.minor
print(f"current sm_version={sm_version}")

import os

Copilot AI Jan 23, 2026

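decide_module does not account for heterogeneous multi-GPU environments. If the system has several GPUs with different SM versions, paddle.device.cuda.get_device_properties() returns the properties of the current device only (usually device 0), so the selected custom_ops may not suit the other GPUs. Consider documenting that this feature assumes all GPUs share the same SM version, or detecting mixed SM versions in multi-GPU environments and emitting a warning.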

Suggested change
-    import paddle
-    prop = paddle.device.cuda.get_device_properties()
-    sm_version = prop.major * 10 + prop.minor
-    print(f"current sm_version={sm_version}")
-    import os
+    import os
+    import warnings
+    import paddle
+    # Collect SM versions of all visible CUDA devices
+    device_count = paddle.device.cuda.device_count()
+    sm_versions = set()
+    for device_id in range(device_count):
+        prop = paddle.device.cuda.get_device_properties(device_id)
+        sm_versions.add(prop.major * 10 + prop.minor)
+    # Warn user if multiple different SM versions are detected.
+    # Custom ops are still selected based on device 0 and are assumed
+    # to be compatible with all GPUs in the current environment.
+    if len(sm_versions) > 1:
+        warnings.warn(
+            "Detected multiple GPUs with different SM versions. "
+            "Custom GPU ops are selected based on the SM version of device 0; "
+            "please ensure that all GPUs used with these ops have compatible "
+            "SM versions.",
+            UserWarning,
+        )
+    # Use device 0 as the canonical device for selecting SM-specific ops.
+    prop0 = paddle.device.cuda.get_device_properties(0)
+    sm_version = prop0.major * 10 + prop0.minor
+    print(f"current sm_version={sm_version}")

Comment on lines +23 to +36
def decide_module():
    import paddle

    prop = paddle.device.cuda.get_device_properties()
    sm_version = prop.major * 10 + prop.minor
    print(f"current sm_version={sm_version}")

    import os

    curdir = os.path.dirname(os.path.abspath(__file__))
    sm_version_path = os.path.join(curdir, f"fastdeploy_ops_{sm_version}")
    if os.path.exists(sm_version_path):
        return f".fastdeploy_ops_{sm_version}.fastdeploy_ops"
    return ".fastdeploy_ops"

Copilot AI Jan 23, 2026

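The new decide_module function and the automatic SM-version selection logic have no unit tests. Given that the project has a large number of test cases under tests/operators/, consider adding tests that verify: 1) the correct module path is returned when the SM-specific directory exists; 2) the default module is used as a fallback when the SM-specific directory does not exist; 3) the case where CUDA is unavailable is handled. This is critical for ensuring that multi-SM packaging works correctly.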
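
One possible shape for the first two test cases is sketched below. It is an illustration under stated assumptions rather than a test from the repository: `decide_module` here is an adapted copy that takes `curdir` as a parameter (the PR derives it from `__file__`) and omits the print call, and `fake_paddle` is a hypothetical stub injected through `sys.modules` so that no GPU or Paddle install is needed. The third case (CUDA unavailable) depends on which error-handling approach is adopted, so it is left out.

```python
import os
import sys
import types
import unittest
from unittest import mock


def decide_module(curdir):
    """Adapted copy of the PR's function; `curdir` is a parameter here
    (an assumption made for testability) instead of coming from __file__."""
    import paddle  # resolved from sys.modules, so tests can inject a stub

    prop = paddle.device.cuda.get_device_properties()
    sm_version = prop.major * 10 + prop.minor
    if os.path.exists(os.path.join(curdir, f"fastdeploy_ops_{sm_version}")):
        return f".fastdeploy_ops_{sm_version}.fastdeploy_ops"
    return ".fastdeploy_ops"


def fake_paddle(major, minor):
    """Build a stub `paddle` object that reports the given compute capability."""
    prop = types.SimpleNamespace(major=major, minor=minor)
    cuda = types.SimpleNamespace(get_device_properties=lambda *args: prop)
    return types.SimpleNamespace(device=types.SimpleNamespace(cuda=cuda))


class DecideModuleTest(unittest.TestCase):
    def test_sm_specific_directory_exists(self):
        with mock.patch.dict(sys.modules, {"paddle": fake_paddle(8, 6)}):
            with mock.patch("os.path.exists", return_value=True):
                result = decide_module("/pkg")
        self.assertEqual(result, ".fastdeploy_ops_86.fastdeploy_ops")

    def test_falls_back_to_default_module(self):
        with mock.patch.dict(sys.modules, {"paddle": fake_paddle(9, 0)}):
            with mock.patch("os.path.exists", return_value=False):
                result = decide_module("/pkg")
        self.assertEqual(result, ".fastdeploy_ops")
```

Saved to a file, the cases can be run with `python -m unittest <file>`.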


prop = paddle.device.cuda.get_device_properties()
sm_version = prop.major * 10 + prop.minor
print(f"current sm_version={sm_version}")

Copilot AI Jan 23, 2026

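The print statement should be replaced with a logger call. Printing the sm_version on line 28 does not follow the logging conventions of the codebase. Following the logger pattern used in fastdeploy/import_ops.py, use logger.info or logger.debug for this kind of output.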
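
As a rough sketch of that change, the snippet below uses Python's standard `logging` module. The logger name `fastdeploy.ops` and the helper `report_sm_version` are assumptions made for illustration; the project's own logger object in fastdeploy/import_ops.py is not shown in this conversation and may be set up differently.

```python
import logging

# Hypothetical logger name; the real code base may expose its own logger.
logger = logging.getLogger("fastdeploy.ops")


def report_sm_version(major: int, minor: int) -> int:
    """Compute the SM version and log it instead of printing it."""
    sm_version = major * 10 + minor
    # Lazy %-style formatting: the message is only built if DEBUG is enabled.
    logger.debug("current sm_version=%d", sm_version)
    return sm_version
```

Logging at debug level keeps the message out of normal import output while leaving it available when diagnosing which build was selected.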

build_and_install_ops "[89]" "$custom_ops_dir"

build_and_install_ops "[80, 90]" "${OPS_TMP_DIR}"
cp -r $OPS_SRC_DIR/$OPS_TMP_DIR/* ./fastdeploy/model_executor/ops/gpu

Copilot AI Jan 23, 2026

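In the FD_UNIFY_BUILD mode of build_custom_ops, the return value of the cp command on line 282 is not checked. If the copy fails, the script continues and may produce an incomplete package. Consider adding an error check after the cp command, or using 'set -e' so that the script terminates when a command fails.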

Suggested change
-        cp -r $OPS_SRC_DIR/$OPS_TMP_DIR/* ./fastdeploy/model_executor/ops/gpu
+        cp -r "$OPS_SRC_DIR/$OPS_TMP_DIR/"* ./fastdeploy/model_executor/ops/gpu || {
+            echo -e "${RED}[FAIL]${NONE} failed to copy unified custom ops to target directory"
+            exit 1
+        }

Comment on lines +271 to +282
function build_custom_ops() {
    if [ "$FD_UNIFY_BUILD" ]; then
        mkdir -p ${OPS_SRC_DIR}/${OPS_TMP_DIR}

        custom_ops_dir=${OPS_TMP_DIR}/fastdeploy_ops_86
        build_and_install_ops "[86]" "$custom_ops_dir"

        custom_ops_dir=${OPS_TMP_DIR}/fastdeploy_ops_89
        build_and_install_ops "[89]" "$custom_ops_dir"

        build_and_install_ops "[80, 90]" "${OPS_TMP_DIR}"
        cp -r $OPS_SRC_DIR/$OPS_TMP_DIR/* ./fastdeploy/model_executor/ops/gpu

Copilot AI Jan 23, 2026

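The new unified build significantly increases the size of the wheel, because it contains a copy of custom_ops for several SM versions. The PR description states that each custom_ops package is under 2 GB but does not give the total size of the final wheel. Consider adding an estimate of the final package size to the PR description or documentation, along with the impact on storage and download time, to help users decide whether to use the unified build mode.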
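
A size estimate of this kind could be gathered with a small script along the following lines. This is only a sketch: `dir_size_bytes`, `report_ops_sizes`, and `SIZE_LIMIT_BYTES` are hypothetical names, the limit is assumed to mean 2 GiB, and the script measures the unpacked SM-specific directories rather than the compressed wheel.

```python
import os

# Assumed interpretation of the "under 2 GB" bound from the PR description.
SIZE_LIMIT_BYTES = 2 * 1024**3


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def report_ops_sizes(gpu_ops_dir: str) -> dict:
    """Map each fastdeploy_ops_<sm> subdirectory name to its size in bytes."""
    sizes = {}
    for entry in sorted(os.listdir(gpu_ops_dir)):
        full_path = os.path.join(gpu_ops_dir, entry)
        if entry.startswith("fastdeploy_ops_") and os.path.isdir(full_path):
            sizes[entry] = dir_size_bytes(full_path)
    return sizes


def oversized(sizes: dict) -> list:
    """Return the names of builds that reach or exceed the assumed limit."""
    return [name for name, size in sizes.items() if size >= SIZE_LIMIT_BYTES]
```

For example, `report_ops_sizes("fastdeploy/model_executor/ops/gpu")` after a unified build would give per-SM figures that could be quoted in the PR description.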

function build_and_install_ops() {
    local building_arcs=${1:-$FD_BUILDING_ARCS}
    local tmp_dir=${2:-$OPS_TMP_DIR}
    echo "BUILD CUSTOM OPS: ${building_arcs}, ${tmp_dir}"

Copilot AI Jan 23, 2026

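The log output on line 294 is inconsistent with the rest of the script. Elsewhere the script uses color-coded, formatted log output (such as "${BLUE}[build]${NONE}" on line 297), but line 294 uses a plain echo. For consistency and readability, consider a format like "echo -e "${BLUE}[build]${NONE} Building custom ops: ${building_arcs} to ${tmp_dir}"".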

Suggested change
-    echo "BUILD CUSTOM OPS: ${building_arcs}, ${tmp_dir}"
+    echo -e "${BLUE}[build]${NONE} Building custom ops: ${building_arcs} to ${tmp_dir}"


function build_custom_ops() {
    if [ "$FD_UNIFY_BUILD" ]; then
        mkdir -p ${OPS_SRC_DIR}/${OPS_TMP_DIR}

Copilot AI Jan 23, 2026

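build_custom_ops does not check for errors when creating the directory in FD_UNIFY_BUILD mode. If mkdir -p fails, or a permission problem on the directory causes later steps to fail, the script gives no clear error message. Consider checking that the directory was created successfully after mkdir.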

Suggested change
-        mkdir -p ${OPS_SRC_DIR}/${OPS_TMP_DIR}
+        # Ensure ops temporary directory exists and is writable before building
+        local ops_build_dir="${OPS_SRC_DIR}/${OPS_TMP_DIR}"
+        mkdir -p "${ops_build_dir}"
+        if [ $? -ne 0 ] || [ ! -d "${ops_build_dir}" ] || [ ! -w "${ops_build_dir}" ]; then
+            echo "[ERROR] Failed to create or access directory: ${ops_build_dir}" >&2
+            exit 1
+        fi

Comment on lines +275 to +284
        custom_ops_dir=${OPS_TMP_DIR}/fastdeploy_ops_86
        build_and_install_ops "[86]" "$custom_ops_dir"

        custom_ops_dir=${OPS_TMP_DIR}/fastdeploy_ops_89
        build_and_install_ops "[89]" "$custom_ops_dir"

        build_and_install_ops "[80, 90]" "${OPS_TMP_DIR}"
        cp -r $OPS_SRC_DIR/$OPS_TMP_DIR/* ./fastdeploy/model_executor/ops/gpu
    else
        build_and_install_ops "$FD_BUILDING_ARCS" "$OPS_TMP_DIR"

Copilot AI Jan 23, 2026

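The relative paths used in build_custom_ops may cause problems. On lines 275 and 278, custom_ops_dir is set to the relative path ${OPS_TMP_DIR}/fastdeploy_ops_XX, which is resolved against the current directory. Because build_and_install_ops changes into OPS_SRC_DIR, the path may resolve incorrectly. Although line 298 converts it to an absolute path with readlink -f, consider passing absolute paths to build_and_install_ops explicitly, or making sure the call happens in the correct directory context, to avoid potential path-resolution problems.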

Suggested change
-        custom_ops_dir=${OPS_TMP_DIR}/fastdeploy_ops_86
-        build_and_install_ops "[86]" "$custom_ops_dir"
-        custom_ops_dir=${OPS_TMP_DIR}/fastdeploy_ops_89
-        build_and_install_ops "[89]" "$custom_ops_dir"
-        build_and_install_ops "[80, 90]" "${OPS_TMP_DIR}"
-        cp -r $OPS_SRC_DIR/$OPS_TMP_DIR/* ./fastdeploy/model_executor/ops/gpu
-    else
-        build_and_install_ops "$FD_BUILDING_ARCS" "$OPS_TMP_DIR"
+        custom_ops_dir=${OPS_SRC_DIR}/${OPS_TMP_DIR}/fastdeploy_ops_86
+        build_and_install_ops "[86]" "$custom_ops_dir"
+        custom_ops_dir=${OPS_SRC_DIR}/${OPS_TMP_DIR}/fastdeploy_ops_89
+        build_and_install_ops "[89]" "$custom_ops_dir"
+        build_and_install_ops "[80, 90]" "${OPS_SRC_DIR}/${OPS_TMP_DIR}"
+        cp -r $OPS_SRC_DIR/$OPS_TMP_DIR/* ./fastdeploy/model_executor/ops/gpu
+    else
+        build_and_install_ops "$FD_BUILDING_ARCS" "${OPS_SRC_DIR}/${OPS_TMP_DIR}"

import paddle

prop = paddle.device.cuda.get_device_properties()
sm_version = prop.major * 10 + prop.minor

Copilot AI Jan 23, 2026

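sm_version is computed as prop.major * 10 + prop.minor, but the result is never validated. For future GPU architectures, a minor version greater than 9 would make this computation produce unexpected results. NVIDIA's current naming convention keeps the minor version at 9 or below, but for robustness consider adding an assertion or check that the computed sm_version falls within the expected range (for example, 80-100).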
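
A check along these lines might look like the sketch below. The helper names are hypothetical, and the set of SM versions is taken from the four targets this PR builds in unified mode. Instead of asserting an 80-100 range, the sketch rejects only an ambiguous minor version and treats an unknown SM version as "no dedicated build", which is a swapped-in design choice rather than what the comment literally proposes.

```python
# SM versions built in unified mode according to this PR's description.
UNIFIED_BUILD_SM_VERSIONS = frozenset({80, 86, 89, 90})


def compute_sm_version(major: int, minor: int) -> int:
    """Combine a compute capability (major, minor) into a two-part SM number.

    Raises ValueError when `minor` is outside 0-9, because major * 10 + minor
    would then collide with a different capability (8.10 would also give 90).
    """
    if major < 0 or not 0 <= minor <= 9:
        raise ValueError(f"unexpected compute capability {major}.{minor}")
    return major * 10 + minor


def has_unified_build(sm_version: int) -> bool:
    """Tell whether the unified wheel was compiled for this SM version."""
    return sm_version in UNIFIED_BUILD_SM_VERSIONS
```

With this split, a caller can fall back to the default module (or emit a warning) when `has_unified_build` returns False, rather than failing on newer hardware.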

@codecov-commenter

codecov-commenter commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@17866c0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/model_executor/ops/gpu/__init__.py | 72.22% | 4 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6173   +/-   ##
==========================================
  Coverage           ?   67.03%           
==========================================
  Files              ?      383           
  Lines              ?    50543           
  Branches           ?     7894           
==========================================
  Hits               ?    33882           
  Misses             ?    14188           
  Partials           ?     2473           
Flag Coverage Δ
GPU 67.03% <75.00%> (?)


@EmmonsCurse EmmonsCurse merged commit 84a1780 into PaddlePaddle:develop Jan 26, 2026
22 of 24 checks passed
