-
Notifications
You must be signed in to change notification settings - Fork 265
refactor: unify interwave and intrawave pipeline implementation #3552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the GEMM pipeline implementation to unify the Intrawave and Interwave scheduler implementations at the work-group level, eliminating code duplication while maintaining performance neutrality.
Changes:
- Unified PipelineImpl template that works for both Intrawave and Interwave schedulers by moving scheduler-specific behavior into BlockGemm specializations
- Added DoLocalPrefetch template parameter to BlockGemm operator() for pipeline-controlled prefetch behavior
- Added block_sync_lds() at the end of Intrawave BlockGemmImpl to encapsulate scheduler-specific synchronization
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_mem.hpp | Unified PipelineImpl template replacing separate Intrawave and Interwave specializations; scheduler-specific behavior now handled by BlockGemm implementations |
| include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_comp_v3.hpp | Updated block_gemm calls to explicitly control LocalPrefetch via template parameter; replaced automatic prefetch with explicit control |
| include/ck_tile/ops/gemm/block/block_universal_gemm_as_bs_cr.hpp | Added DoLocalPrefetch template parameter to operator(); added block_sync_lds() to Intrawave implementation to enable unified pipeline |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| template <GemmPipelineScheduler Scheduler> | ||
| // Unified pipeline implementation for both Intrawave and Interwave schedulers | ||
| // Scheduler-specific behavior is encapsulated in BlockGemm (via BlockGemmImpl specializations) | ||
| template <GemmPipelineScheduler Scheduler_> |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The template parameter 'Scheduler_' is declared but never used within the PipelineImpl struct. The scheduler-specific behavior is achieved through the BlockGemm specializations rather than this template parameter. Consider either removing this unused template parameter or adding a static_assert to validate that it matches the scheduler from the Problem definition for consistency.
| block_sync_lds(); | ||
| block_gemm.template operator()<false>( | ||
| c_block_tile, a_lds_gemm_window, b_lds_gemm_window); | ||
| // block_sync_lds(); |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider removing the commented-out block_sync_lds() calls rather than leaving them in the code. If these are no longer needed due to the refactoring (where synchronization is now handled within the BlockGemm operator), they should be removed to avoid confusion. If they might be needed for debugging or future reference, add a comment explaining why they are commented out.
| // block_sync_lds(); |
| block_sync_lds(); | ||
| block_gemm.template operator()<false>( | ||
| c_block_tile, a_lds_gemm_window, b_lds_gemm_window); | ||
| // block_sync_lds(); |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider removing the commented-out block_sync_lds() calls rather than leaving them in the code. If these are no longer needed due to the refactoring (where synchronization is now handled within the BlockGemm operator), they should be removed to avoid confusion. If they might be needed for debugging or future reference, add a comment explaining why they are commented out.
| // block_sync_lds(); |
| // Note: Interwave scheduler requires LocalPrefetch as part of its design | ||
| // The DoLocalPrefetch flag is provided for API consistency but should typically be | ||
| // true | ||
| if constexpr(DoLocalPrefetch) | ||
| { | ||
| LocalPrefetch<kIter.value>( | ||
| a_block_window, b_block_window, a_load_tr, b_load_tr); | ||
| } |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states that LocalPrefetch is required for Interwave scheduler design, but the code still allows it to be disabled via DoLocalPrefetch=false. Consider either enforcing this requirement with a static_assert or updating the comment to clarify what happens if DoLocalPrefetch is false for Interwave.
at work-group level.
I benchmarked the code after refactoring against current develop on MI300 for all supported data_types.
There does not seem to be any significant performance difference. The average % change is neutral.
Problem shapes with large %change seem to be outliers that disappear when benchmarking is run for a second time.
Checklist
Please put an
xinto the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.clang-formaton all changed filesDiscussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered