refactor: unify interwave and intrawave pipeline implementation #3552

AviralGoelAMD · 2026-01-12T16:30:35Z

at work-group level.

I benchmarked the refactored code against the current develop branch on MI300 for all supported data types.

There does not appear to be any significant performance difference. The average percent change per scheduler/data-type combination ranges from -1.79% to +0.49%.
Problem shapes with a large percent change appear to be outliers that disappear when the benchmark is run a second time.
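
One common way to damp run-to-run outliers of this kind is to repeat each measurement and compare medians rather than single samples. The following is a minimal sketch of that idea; `median_tflops` is a hypothetical helper and is not part of `compare_results.py` or the benchmark harness used for this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: median throughput over repeated runs of one
// problem shape. The vector is taken by value so the caller's data
// is left unsorted.
inline double median_tflops(std::vector<double> samples)
{
    if(samples.empty())
    {
        throw std::invalid_argument("median_tflops: no samples");
    }
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    // Odd count: middle element. Even count: mean of the two middle elements.
    return samples.size() % 2 == 1 ? samples[mid]
                                   : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

A single slow or fast run then shifts the reported number only if it lands in the middle of the sorted samples, which matters most for the tiny shapes where absolute throughput is well below 1 TFlops.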

(rocprof-venv) root@banff-cyxtera-s73-2:~/workspace/build# ./compare_results.py --develop-pattern "develop_*.csv" --refactor-pattern "refactor_*.csv"
Found 10 develop files and 10 refactor files

Found 10 matching pairs to compare

================================================================================
Comparing: interwave_bf16
  Develop: develop_interwave_bf16.csv
  Refactor: refactor_interwave_bf16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_bf16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 25 (46.3%)
Regressions (refactor < develop): 27 (50.0%)
Neutral (no change): 2 (3.7%)

Average percent change: -0.08%

Largest improvement: +2.19%
  M=16, N=64, K=256, DataType=bf16
  0.137 -> 0.140 TFlops

Largest regression: -2.99%
  M=2048, N=5120, K=1024, DataType=bf16
  262.119 -> 254.280 TFlops

Average change by data type:
  bf16: -0.08% (n=54)
================================================================================

================================================================================
Comparing: interwave_bf8
  Develop: develop_interwave_bf8.csv
  Refactor: refactor_interwave_bf8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_bf8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 24 (44.4%)
Regressions (refactor < develop): 30 (55.6%)
Neutral (no change): 0 (0.0%)

Average percent change: -0.09%

Largest improvement: +3.62%
  M=1024, N=4096, K=2048, DataType=bf8
  402.355 -> 416.931 TFlops

Largest regression: -4.46%
  M=1536, N=2816, K=1152, DataType=bf8
  389.701 -> 372.326 TFlops

Average change by data type:
  bf8: -0.09% (n=54)
================================================================================

================================================================================
Comparing: interwave_fp16
  Develop: develop_interwave_fp16.csv
  Refactor: refactor_interwave_fp16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_fp16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 29 (53.7%)
Regressions (refactor < develop): 22 (40.7%)
Neutral (no change): 3 (5.6%)

Average percent change: +0.49%

Largest improvement: +13.77%
  M=16, N=64, K=256, DataType=fp16
  0.138 -> 0.157 TFlops

Largest regression: -2.81%
  M=1024, N=2304, K=4096, DataType=fp16
  252.705 -> 245.602 TFlops

Average change by data type:
  fp16: +0.49% (n=54)
================================================================================

================================================================================
Comparing: interwave_fp8
  Develop: develop_interwave_fp8.csv
  Refactor: refactor_interwave_fp8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_fp8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 24 (44.4%)
Regressions (refactor < develop): 29 (53.7%)
Neutral (no change): 1 (1.9%)

Average percent change: -0.22%

Largest improvement: +12.90%
  M=16, N=64, K=128, DataType=fp8
  0.124 -> 0.140 TFlops

Largest regression: -13.87%
  M=16, N=64, K=256, DataType=fp8
  0.238 -> 0.205 TFlops

Average change by data type:
  fp8: -0.22% (n=54)
================================================================================

================================================================================
Comparing: interwave_int8
  Develop: develop_interwave_int8.csv
  Refactor: refactor_interwave_int8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_int8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 27 (50.0%)
Regressions (refactor < develop): 27 (50.0%)
Neutral (no change): 0 (0.0%)

Average percent change: -0.09%

Largest improvement: +8.94%
  M=16, N=64, K=128, DataType=int8
  0.123 -> 0.134 TFlops

Largest regression: -9.12%
  M=512, N=768, K=640, DataType=int8
  103.993 -> 94.504 TFlops

Average change by data type:
  int8: -0.09% (n=54)
================================================================================

================================================================================
Comparing: intrawave_bf16
  Develop: develop_intrawave_bf16.csv
  Refactor: refactor_intrawave_bf16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_bf16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 30 (55.6%)
Regressions (refactor < develop): 24 (44.4%)
Neutral (no change): 0 (0.0%)

Average percent change: -0.11%

Largest improvement: +2.72%
  M=1536, N=2816, K=1152, DataType=bf16
  215.054 -> 220.908 TFlops

Largest regression: -12.93%
  M=16, N=64, K=256, DataType=bf16
  0.147 -> 0.128 TFlops

Average change by data type:
  bf16: -0.11% (n=54)
================================================================================

================================================================================
Comparing: intrawave_bf8
  Develop: develop_intrawave_bf8.csv
  Refactor: refactor_intrawave_bf8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_bf8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 12 (22.2%)
Regressions (refactor < develop): 42 (77.8%)
Neutral (no change): 0 (0.0%)

Average percent change: -1.66%

Largest improvement: +1.16%
  M=16795, N=2304, K=4096, DataType=bf8
  565.446 -> 572.001 TFlops

Largest regression: -10.57%
  M=16, N=64, K=256, DataType=bf8
  0.227 -> 0.203 TFlops

Average change by data type:
  bf8: -1.66% (n=54)
================================================================================

================================================================================
Comparing: intrawave_fp16
  Develop: develop_intrawave_fp16.csv
  Refactor: refactor_intrawave_fp16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_fp16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 21 (38.9%)
Regressions (refactor < develop): 32 (59.3%)
Neutral (no change): 1 (1.9%)

Average percent change: -0.13%

Largest improvement: +2.46%
  M=16, N=5120, K=1024, DataType=fp16
  15.963 -> 16.356 TFlops

Largest regression: -5.26%
  M=16, N=64, K=256, DataType=fp16
  0.152 -> 0.144 TFlops

Average change by data type:
  fp16: -0.13% (n=54)
================================================================================

================================================================================
Comparing: intrawave_fp8
  Develop: develop_intrawave_fp8.csv
  Refactor: refactor_intrawave_fp8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_fp8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 17 (31.5%)
Regressions (refactor < develop): 37 (68.5%)
Neutral (no change): 0 (0.0%)

Average percent change: +0.23%

Largest improvement: +23.96%
  M=16, N=64, K=128, DataType=fp8
  0.096 -> 0.119 TFlops

Largest regression: -10.21%
  M=512, N=768, K=640, DataType=fp8
  105.923 -> 95.105 TFlops

Average change by data type:
  fp8: +0.23% (n=54)
================================================================================

================================================================================
Comparing: intrawave_int8
  Develop: develop_intrawave_int8.csv
  Refactor: refactor_intrawave_int8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_int8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 8 (14.8%)
Regressions (refactor < develop): 46 (85.2%)
Neutral (no change): 0 (0.0%)

Average percent change: -1.79%

Largest improvement: +13.40%
  M=16, N=64, K=128, DataType=int8
  0.097 -> 0.110 TFlops

Largest regression: -9.51%
  M=9, N=5120, K=1024, DataType=int8
  15.002 -> 13.576 TFlops

Average change by data type:
  int8: -1.79% (n=54)
================================================================================
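
The percentages in the summaries above are consistent with a plain relative change of refactor throughput against develop throughput. A minimal sketch of that arithmetic follows; the function names and the `tolerance_pct` knob are assumptions for illustration, since the source of `compare_results.py` is not shown here.

```cpp
#include <cmath>
#include <string>

// Relative throughput change of the refactor versus develop, in percent.
// Positive values mean the refactor reports higher TFlops.
inline double percent_change(double develop_tflops, double refactor_tflops)
{
    return (refactor_tflops - develop_tflops) / develop_tflops * 100.0;
}

// Classify one configuration the way the summary buckets do.
// tolerance_pct is a hypothetical parameter: with the default of 0.0,
// only an exactly unchanged value counts as neutral.
inline std::string classify(double develop_tflops,
                            double refactor_tflops,
                            double tolerance_pct = 0.0)
{
    const double change = percent_change(develop_tflops, refactor_tflops);
    if(std::fabs(change) <= tolerance_pct)
    {
        return "neutral";
    }
    return change > 0.0 ? "improvement" : "regression";
}
```

For the interwave bf16 rows, `percent_change(0.137, 0.140)` is about +2.19 and `percent_change(262.119, 254.280)` is about -2.99, matching the reported values. The first case also shows why the smallest shapes dominate the extremes: a difference of 0.003 TFlops is already a 2% swing.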

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation that helps the maintainers understand the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

AviralGoelAMD · 2026-01-13T18:17:24Z

================================================================================
Comparing: compute_v3_intrawave_bf16
  Develop: develop_compute_v3_intrawave_bf16.csv
  Refactor: refactor_compute_v3_intrawave_bf16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_bf16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 32 (59.3%)
Regressions (refactor < develop): 18 (33.3%)
Neutral (no change): 4 (7.4%)

Average percent change: +0.06%

Largest improvement: +7.39%
  M=15433, N=2304, K=4096, DataType=bf16
  479.700 -> 515.133 TFlops

Largest regression: -6.66%
  M=32768, N=2304, K=4096, DataType=bf16
  516.073 -> 481.704 TFlops

Average change by data type:
  bf16: +0.06% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_bf8
  Develop: develop_compute_v3_intrawave_bf8.csv
  Refactor: refactor_compute_v3_intrawave_bf8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_bf8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 19 (35.2%)
Regressions (refactor < develop): 31 (57.4%)
Neutral (no change): 4 (7.4%)

Average percent change: -0.01%

Largest improvement: +4.19%
  M=16795, N=2304, K=4096, DataType=bf8
  871.555 -> 908.042 TFlops

Largest regression: -2.77%
  M=1536, N=2816, K=1152, DataType=bf8
  556.408 -> 541.000 TFlops

Average change by data type:
  bf8: -0.01% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_fp16
  Develop: develop_compute_v3_intrawave_fp16.csv
  Refactor: refactor_compute_v3_intrawave_fp16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_fp16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 18 (33.3%)
Regressions (refactor < develop): 32 (59.3%)
Neutral (no change): 4 (7.4%)

Average percent change: -0.05%

Largest improvement: +3.89%
  M=16795, N=2304, K=4096, DataType=fp16
  487.191 -> 506.147 TFlops

Largest regression: -4.12%
  M=1024, N=4096, K=2048, DataType=fp16
  361.818 -> 346.915 TFlops

Average change by data type:
  fp16: -0.05% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_fp8
  Develop: develop_compute_v3_intrawave_fp8.csv
  Refactor: refactor_compute_v3_intrawave_fp8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_fp8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 8 (14.8%)
Regressions (refactor < develop): 44 (81.5%)
Neutral (no change): 2 (3.7%)

Average percent change: -0.66%

Largest improvement: +2.35%
  M=512, N=768, K=640, DataType=fp8
  62.417 -> 63.881 TFlops

Largest regression: -4.73%
  M=15433, N=4096, K=2048, DataType=fp8
  854.258 -> 813.891 TFlops

Average change by data type:
  fp8: -0.66% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_int8
  Develop: develop_compute_v3_intrawave_int8.csv
  Refactor: refactor_compute_v3_intrawave_int8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_int8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 15 (27.8%)
Regressions (refactor < develop): 37 (68.5%)
Neutral (no change): 2 (3.7%)

Average percent change: -0.27%

Largest improvement: +8.24%
  M=16384, N=8192, K=3584, DataType=int8
  925.985 -> 1002.240 TFlops

Largest regression: -6.43%
  M=16384, N=7168, K=8192, DataType=int8
  1046.800 -> 979.494 TFlops

Average change by data type:
  int8: -0.27% (n=54)
================================================================================


Copilot

Pull request overview

This PR refactors the GEMM pipeline implementation to unify the Intrawave and Interwave scheduler implementations at the work-group level, eliminating code duplication while maintaining performance neutrality.

Changes:

Unified PipelineImpl template that works for both Intrawave and Interwave schedulers by moving scheduler-specific behavior into BlockGemm specializations
Added DoLocalPrefetch template parameter to BlockGemm operator() for pipeline-controlled prefetch behavior
Added block_sync_lds() at the end of Intrawave BlockGemmImpl to encapsulate scheduler-specific synchronization
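
The structure described in the list above can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the ck_tile code: `Scheduler`, `BlockGemmSketch`, `run_pipeline_iteration`, and the step names recorded in the trace are hypothetical, and the Interwave body is a placeholder because its internals are not shown in this review.

```cpp
#include <string>
#include <vector>

enum class Scheduler { Intrawave, Interwave };

// Records the order of steps so the sketch can be inspected on the host.
using Trace = std::vector<std::string>;

template <Scheduler S>
struct BlockGemmSketch;

// Intrawave: the trailing LDS sync lives inside the block GEMM,
// so the pipeline does not need to know about it.
template <>
struct BlockGemmSketch<Scheduler::Intrawave>
{
    template <bool DoLocalPrefetch = true>
    void operator()(Trace& trace) const
    {
        if constexpr(DoLocalPrefetch)
        {
            trace.push_back("local_prefetch");
        }
        trace.push_back("compute");
        trace.push_back("block_sync_lds");
    }
};

// Interwave: placeholder body (assumption), no trailing sync recorded.
template <>
struct BlockGemmSketch<Scheduler::Interwave>
{
    template <bool DoLocalPrefetch = true>
    void operator()(Trace& trace) const
    {
        if constexpr(DoLocalPrefetch)
        {
            trace.push_back("local_prefetch");
        }
        trace.push_back("compute");
    }
};

// One pipeline body for both schedulers: all scheduler-specific
// behavior comes from the BlockGemm specialization passed in.
template <typename BlockGemm>
Trace run_pipeline_iteration(const BlockGemm& block_gemm)
{
    Trace trace;
    trace.push_back("global_load");
    block_gemm.template operator()<true>(trace);
    return trace;
}
```

Calling `run_pipeline_iteration` with either specialization runs the same pipeline code; only the steps contributed by the block GEMM differ.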

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_mem.hpp | Unified PipelineImpl template replacing separate Intrawave and Interwave specializations; scheduler-specific behavior now handled by BlockGemm implementations |
| include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_comp_v3.hpp | Updated block_gemm calls to explicitly control LocalPrefetch via template parameter; replaced automatic prefetch with explicit control |
| include/ck_tile/ops/gemm/block/block_universal_gemm_as_bs_cr.hpp | Added DoLocalPrefetch template parameter to operator(); added block_sync_lds() to Intrawave implementation to enable unified pipeline |


Copilot · 2026-01-13T18:24:59Z

include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_mem.hpp

-    template <GemmPipelineScheduler Scheduler>
+    // Unified pipeline implementation for both Intrawave and Interwave schedulers
+    // Scheduler-specific behavior is encapsulated in BlockGemm (via BlockGemmImpl specializations)
+    template <GemmPipelineScheduler Scheduler_>


The template parameter 'Scheduler_' is declared but never used within the PipelineImpl struct. The scheduler-specific behavior is achieved through the BlockGemm specializations rather than this template parameter. Consider either removing this unused template parameter or adding a static_assert to validate that it matches the scheduler from the Problem definition for consistency.
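
The static_assert option could look roughly like the sketch below. It assumes the problem type exposes its scheduler as a static member named `Scheduler`; `ProblemSketch`, `PipelineSketch`, and the `scheduler` member are hypothetical stand-ins rather than the actual ck_tile types.

```cpp
enum class GemmPipelineScheduler { Intrawave, Interwave };

// Hypothetical problem definition carrying the scheduler choice.
struct ProblemSketch
{
    static constexpr GemmPipelineScheduler Scheduler =
        GemmPipelineScheduler::Intrawave;
};

template <typename Problem>
struct PipelineSketch
{
    template <GemmPipelineScheduler Scheduler_>
    struct PipelineImpl
    {
        // Fails at compile time if the pipeline is instantiated with a
        // scheduler that disagrees with the one in the problem definition.
        static_assert(Scheduler_ == Problem::Scheduler,
                      "PipelineImpl scheduler must match Problem::Scheduler");

        static constexpr GemmPipelineScheduler scheduler = Scheduler_;
    };
};

// Compiles: the template argument matches ProblemSketch::Scheduler.
using IntrawaveImpl =
    PipelineSketch<ProblemSketch>::PipelineImpl<GemmPipelineScheduler::Intrawave>;
```

Instantiating `PipelineImpl<GemmPipelineScheduler::Interwave>` for the same problem would trigger the assertion, which keeps the parameter meaningful without reintroducing scheduler-specific pipeline code.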

Copilot · 2026-01-13T18:24:59Z

include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_comp_v3.hpp

-                    block_sync_lds();
+                    block_gemm.template operator()<false>(
+                        c_block_tile, a_lds_gemm_window, b_lds_gemm_window);
+                    // block_sync_lds();


Consider removing the commented-out block_sync_lds() calls rather than leaving them in the code. If these are no longer needed due to the refactoring (where synchronization is now handled within the BlockGemm operator), they should be removed to avoid confusion. If they might be needed for debugging or future reference, add a comment explaining why they are commented out.

Suggested change

// block_sync_lds();

Copilot · 2026-01-13T18:24:59Z

include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_comp_v3.hpp

-                block_sync_lds();
+                block_gemm.template operator()<false>(
+                    c_block_tile, a_lds_gemm_window, b_lds_gemm_window);
+                // block_sync_lds();


Consider removing the commented-out block_sync_lds() calls rather than leaving them in the code. If these are no longer needed due to the refactoring (where synchronization is now handled within the BlockGemm operator), they should be removed to avoid confusion. If they might be needed for debugging or future reference, add a comment explaining why they are commented out.

Suggested change

// block_sync_lds();

Copilot · 2026-01-13T18:25:00Z

include/ck_tile/ops/gemm/block/block_universal_gemm_as_bs_cr.hpp

+                // Note: Interwave scheduler requires LocalPrefetch as part of its design
+                // The DoLocalPrefetch flag is provided for API consistency but should typically be
+                // true
+                if constexpr(DoLocalPrefetch)
+                {
+                    LocalPrefetch<kIter.value>(
+                        a_block_window, b_block_window, a_load_tr, b_load_tr);
+                }


The comment states that LocalPrefetch is required for Interwave scheduler design, but the code still allows it to be disabled via DoLocalPrefetch=false. Consider either enforcing this requirement with a static_assert or updating the comment to clarify what happens if DoLocalPrefetch is false for Interwave.
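
A minimal sketch of the static_assert variant follows, assuming the requirement really is that Interwave must always prefetch. `BlockGemmImplSketch` and its integer return value are hypothetical and only make the behavior observable on the host; the real operator() takes tile and window arguments.

```cpp
enum class GemmPipelineScheduler { Intrawave, Interwave };

template <GemmPipelineScheduler Scheduler>
struct BlockGemmImplSketch
{
    // Returns how many local prefetches were issued (0 or 1).
    template <bool DoLocalPrefetch = true>
    int operator()() const
    {
        // Reject DoLocalPrefetch = false for Interwave at compile time
        // instead of relying on a comment.
        static_assert(Scheduler != GemmPipelineScheduler::Interwave || DoLocalPrefetch,
                      "Interwave requires LocalPrefetch");

        int prefetches = 0;
        if constexpr(DoLocalPrefetch)
        {
            ++prefetches; // stands in for the LocalPrefetch<...>(...) call
        }
        return prefetches;
    }
};
```

With this guard, the explicit `operator()<false>` form remains available to Intrawave callers, while the same call on an Interwave instance fails to compile.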

AviralGoelAMD added 5 commits January 10, 2026 18:09

refactor: remove Default scheduler implementation as it not used anymore

b06f6f5

refactor: remove dead code from gemm universal kernel

09aa536

chore: add descriptive comments about amd intrinsic hardware sync instructions

958c99e

fix: label existing memory pipeline for aquant as intrawave

2240aa5

refactor: unify interwave and intrawave pipeline implementation at work-group level

0f2c913

Base automatically changed from aviralgoel/memory_pipeline_refactor to develop January 12, 2026 17:51

AviralGoelAMD added 2 commits January 13, 2026 12:42

Merge branch 'develop' into aviralgoel/memory_pipeline_refactor_2

7c78c99

Merge branch 'develop' into aviralgoel/memory_pipeline_refactor_2

d45b830

AviralGoelAMD marked this pull request as ready for review January 13, 2026 17:39

AviralGoelAMD requested review from ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent and vidyasagar-amd as code owners January 13, 2026 17:40

AviralGoelAMD requested a review from Copilot January 13, 2026 18:18

Copilot started reviewing on behalf of AviralGoelAMD January 13, 2026 18:20

Copilot AI reviewed Jan 13, 2026
