@AviralGoelAMD AviralGoelAMD commented Jan 12, 2026

at work-group level.

I benchmarked the refactored code against the current develop branch on MI300 for all supported data types.

There is no significant performance difference: the average percent change is close to neutral.
Problem shapes with a large percent change appear to be outliers that disappear when the benchmark is run a second time.
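
For reference, the percent-change figures reported in the summaries are consistent with a plain relative difference, (refactor - develop) / develop * 100. For example, 0.137 -> 0.140 TFlops gives +2.19%. The following is a minimal sketch of that arithmetic, not the actual compare_results.py logic: the `Sample` and `Summary` types are hypothetical, and taking the average as the mean of per-shape percent changes is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: throughput of one problem shape on each branch, in TFlops.
struct Sample {
    double develop_tflops;
    double refactor_tflops;
};

// Relative change of refactor versus develop, in percent.
inline double percent_change(const Sample& s) {
    assert(s.develop_tflops > 0.0);
    return (s.refactor_tflops - s.develop_tflops) / s.develop_tflops * 100.0;
}

// Hypothetical aggregate mirroring the fields printed in the summaries.
struct Summary {
    int improvements = 0;
    int regressions = 0;
    int neutral = 0;
    double average_percent = 0.0;
};

// Classify each shape and average the per-shape percent changes
// (assumption: the reported average is an unweighted mean).
inline Summary summarize(const std::vector<Sample>& samples) {
    Summary out;
    double total = 0.0;
    for (const Sample& s : samples) {
        if (s.refactor_tflops > s.develop_tflops) {
            ++out.improvements;
        } else if (s.refactor_tflops < s.develop_tflops) {
            ++out.regressions;
        } else {
            ++out.neutral;
        }
        total += percent_change(s);
    }
    if (!samples.empty()) {
        out.average_percent = total / static_cast<double>(samples.size());
    }
    return out;
}
```

At very small shapes a shift of a few thousandths of a TFlop already moves the percentage by several points, which may be part of why the tiny M=16 shapes dominate the largest swings.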

(rocprof-venv) root@banff-cyxtera-s73-2:~/workspace/build# ./compare_results.py --develop-pattern "develop_*.csv" --refactor-pattern "refactor_*.csv"
Found 10 develop files and 10 refactor files

Found 10 matching pairs to compare

================================================================================
Comparing: interwave_bf16
  Develop: develop_interwave_bf16.csv
  Refactor: refactor_interwave_bf16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_bf16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 25 (46.3%)
Regressions (refactor < develop): 27 (50.0%)
Neutral (no change): 2 (3.7%)

Average percent change: -0.08%

Largest improvement: +2.19%
  M=16, N=64, K=256, DataType=bf16
  0.137 -> 0.140 TFlops

Largest regression: -2.99%
  M=2048, N=5120, K=1024, DataType=bf16
  262.119 -> 254.280 TFlops

Average change by data type:
  bf16: -0.08% (n=54)
================================================================================

================================================================================
Comparing: interwave_bf8
  Develop: develop_interwave_bf8.csv
  Refactor: refactor_interwave_bf8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_bf8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 24 (44.4%)
Regressions (refactor < develop): 30 (55.6%)
Neutral (no change): 0 (0.0%)

Average percent change: -0.09%

Largest improvement: +3.62%
  M=1024, N=4096, K=2048, DataType=bf8
  402.355 -> 416.931 TFlops

Largest regression: -4.46%
  M=1536, N=2816, K=1152, DataType=bf8
  389.701 -> 372.326 TFlops

Average change by data type:
  bf8: -0.09% (n=54)
================================================================================

================================================================================
Comparing: interwave_fp16
  Develop: develop_interwave_fp16.csv
  Refactor: refactor_interwave_fp16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_fp16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 29 (53.7%)
Regressions (refactor < develop): 22 (40.7%)
Neutral (no change): 3 (5.6%)

Average percent change: +0.49%

Largest improvement: +13.77%
  M=16, N=64, K=256, DataType=fp16
  0.138 -> 0.157 TFlops

Largest regression: -2.81%
  M=1024, N=2304, K=4096, DataType=fp16
  252.705 -> 245.602 TFlops

Average change by data type:
  fp16: +0.49% (n=54)
================================================================================

================================================================================
Comparing: interwave_fp8
  Develop: develop_interwave_fp8.csv
  Refactor: refactor_interwave_fp8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_fp8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 24 (44.4%)
Regressions (refactor < develop): 29 (53.7%)
Neutral (no change): 1 (1.9%)

Average percent change: -0.22%

Largest improvement: +12.90%
  M=16, N=64, K=128, DataType=fp8
  0.124 -> 0.140 TFlops

Largest regression: -13.87%
  M=16, N=64, K=256, DataType=fp8
  0.238 -> 0.205 TFlops

Average change by data type:
  fp8: -0.22% (n=54)
================================================================================

================================================================================
Comparing: interwave_int8
  Develop: develop_interwave_int8.csv
  Refactor: refactor_interwave_int8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_interwave_int8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 27 (50.0%)
Regressions (refactor < develop): 27 (50.0%)
Neutral (no change): 0 (0.0%)

Average percent change: -0.09%

Largest improvement: +8.94%
  M=16, N=64, K=128, DataType=int8
  0.123 -> 0.134 TFlops

Largest regression: -9.12%
  M=512, N=768, K=640, DataType=int8
  103.993 -> 94.504 TFlops

Average change by data type:
  int8: -0.09% (n=54)
================================================================================

================================================================================
Comparing: intrawave_bf16
  Develop: develop_intrawave_bf16.csv
  Refactor: refactor_intrawave_bf16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_bf16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 30 (55.6%)
Regressions (refactor < develop): 24 (44.4%)
Neutral (no change): 0 (0.0%)

Average percent change: -0.11%

Largest improvement: +2.72%
  M=1536, N=2816, K=1152, DataType=bf16
  215.054 -> 220.908 TFlops

Largest regression: -12.93%
  M=16, N=64, K=256, DataType=bf16
  0.147 -> 0.128 TFlops

Average change by data type:
  bf16: -0.11% (n=54)
================================================================================

================================================================================
Comparing: intrawave_bf8
  Develop: develop_intrawave_bf8.csv
  Refactor: refactor_intrawave_bf8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_bf8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 12 (22.2%)
Regressions (refactor < develop): 42 (77.8%)
Neutral (no change): 0 (0.0%)

Average percent change: -1.66%

Largest improvement: +1.16%
  M=16795, N=2304, K=4096, DataType=bf8
  565.446 -> 572.001 TFlops

Largest regression: -10.57%
  M=16, N=64, K=256, DataType=bf8
  0.227 -> 0.203 TFlops

Average change by data type:
  bf8: -1.66% (n=54)
================================================================================

================================================================================
Comparing: intrawave_fp16
  Develop: develop_intrawave_fp16.csv
  Refactor: refactor_intrawave_fp16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_fp16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 21 (38.9%)
Regressions (refactor < develop): 32 (59.3%)
Neutral (no change): 1 (1.9%)

Average percent change: -0.13%

Largest improvement: +2.46%
  M=16, N=5120, K=1024, DataType=fp16
  15.963 -> 16.356 TFlops

Largest regression: -5.26%
  M=16, N=64, K=256, DataType=fp16
  0.152 -> 0.144 TFlops

Average change by data type:
  fp16: -0.13% (n=54)
================================================================================

================================================================================
Comparing: intrawave_fp8
  Develop: develop_intrawave_fp8.csv
  Refactor: refactor_intrawave_fp8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_fp8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 17 (31.5%)
Regressions (refactor < develop): 37 (68.5%)
Neutral (no change): 0 (0.0%)

Average percent change: +0.23%

Largest improvement: +23.96%
  M=16, N=64, K=128, DataType=fp8
  0.096 -> 0.119 TFlops

Largest regression: -10.21%
  M=512, N=768, K=640, DataType=fp8
  105.923 -> 95.105 TFlops

Average change by data type:
  fp8: +0.23% (n=54)
================================================================================

================================================================================
Comparing: intrawave_int8
  Develop: develop_intrawave_int8.csv
  Refactor: refactor_intrawave_int8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_intrawave_int8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 8 (14.8%)
Regressions (refactor < develop): 46 (85.2%)
Neutral (no change): 0 (0.0%)

Average percent change: -1.79%

Largest improvement: +13.40%
  M=16, N=64, K=128, DataType=int8
  0.097 -> 0.110 TFlops

Largest regression: -9.51%
  M=9, N=5120, K=1024, DataType=int8
  15.002 -> 13.576 TFlops

Average change by data type:
  int8: -1.79% (n=54)
================================================================================

Base automatically changed from aviralgoel/memory_pipeline_refactor to develop January 12, 2026 17:51
@AviralGoelAMD
Collaborator Author

================================================================================
Comparing: compute_v3_intrawave_bf16
  Develop: develop_compute_v3_intrawave_bf16.csv
  Refactor: refactor_compute_v3_intrawave_bf16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_bf16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 32 (59.3%)
Regressions (refactor < develop): 18 (33.3%)
Neutral (no change): 4 (7.4%)

Average percent change: +0.06%

Largest improvement: +7.39%
  M=15433, N=2304, K=4096, DataType=bf16
  479.700 -> 515.133 TFlops

Largest regression: -6.66%
  M=32768, N=2304, K=4096, DataType=bf16
  516.073 -> 481.704 TFlops

Average change by data type:
  bf16: +0.06% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_bf8
  Develop: develop_compute_v3_intrawave_bf8.csv
  Refactor: refactor_compute_v3_intrawave_bf8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_bf8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 19 (35.2%)
Regressions (refactor < develop): 31 (57.4%)
Neutral (no change): 4 (7.4%)

Average percent change: -0.01%

Largest improvement: +4.19%
  M=16795, N=2304, K=4096, DataType=bf8
  871.555 -> 908.042 TFlops

Largest regression: -2.77%
  M=1536, N=2816, K=1152, DataType=bf8
  556.408 -> 541.000 TFlops

Average change by data type:
  bf8: -0.01% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_fp16
  Develop: develop_compute_v3_intrawave_fp16.csv
  Refactor: refactor_compute_v3_intrawave_fp16.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_fp16.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 18 (33.3%)
Regressions (refactor < develop): 32 (59.3%)
Neutral (no change): 4 (7.4%)

Average percent change: -0.05%

Largest improvement: +3.89%
  M=16795, N=2304, K=4096, DataType=fp16
  487.191 -> 506.147 TFlops

Largest regression: -4.12%
  M=1024, N=4096, K=2048, DataType=fp16
  361.818 -> 346.915 TFlops

Average change by data type:
  fp16: -0.05% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_fp8
  Develop: develop_compute_v3_intrawave_fp8.csv
  Refactor: refactor_compute_v3_intrawave_fp8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_fp8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 8 (14.8%)
Regressions (refactor < develop): 44 (81.5%)
Neutral (no change): 2 (3.7%)

Average percent change: -0.66%

Largest improvement: +2.35%
  M=512, N=768, K=640, DataType=fp8
  62.417 -> 63.881 TFlops

Largest regression: -4.73%
  M=15433, N=4096, K=2048, DataType=fp8
  854.258 -> 813.891 TFlops

Average change by data type:
  fp8: -0.66% (n=54)
================================================================================

================================================================================
Comparing: compute_v3_intrawave_int8
  Develop: develop_compute_v3_intrawave_int8.csv
  Refactor: refactor_compute_v3_intrawave_int8.csv
  Develop: 54 configurations
  Refactor: 54 configurations

Comparison results written to: comparison_compute_v3_intrawave_int8.csv

================================================================================
COMPARISON SUMMARY
================================================================================

Total comparisons: 54
Improvements (refactor > develop): 15 (27.8%)
Regressions (refactor < develop): 37 (68.5%)
Neutral (no change): 2 (3.7%)

Average percent change: -0.27%

Largest improvement: +8.24%
  M=16384, N=8192, K=3584, DataType=int8
  925.985 -> 1002.240 TFlops

Largest regression: -6.43%
  M=16384, N=7168, K=8192, DataType=int8
  1046.800 -> 979.494 TFlops

Average change by data type:
  int8: -0.27% (n=54)
================================================================================

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the GEMM pipeline implementation to unify the Intrawave and Interwave scheduler implementations at the work-group level, eliminating code duplication while maintaining performance neutrality.

Changes:

  • Unified PipelineImpl template that works for both Intrawave and Interwave schedulers by moving scheduler-specific behavior into BlockGemm specializations
  • Added DoLocalPrefetch template parameter to BlockGemm operator() for pipeline-controlled prefetch behavior
  • Added block_sync_lds() at the end of Intrawave BlockGemmImpl to encapsulate scheduler-specific synchronization
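
As a rough illustration of the second change, a compile-time flag on `operator()` lets the caller decide whether the functor performs its own local prefetch. The sketch below uses a simplified stand-in (`MockBlockGemm`, `run_without_prefetch`, and `demo` are hypothetical names, not ck_tile code); only the `DoLocalPrefetch` parameter and the `.template operator()<false>` call form come from this PR.

```cpp
#include <cassert>

// Simplified stand-in for a block-level GEMM functor; NOT the ck_tile class.
struct MockBlockGemm {
    int prefetch_calls = 0;
    int gemm_calls = 0;

    void LocalPrefetch() { ++prefetch_calls; }

    // DoLocalPrefetch is resolved at compile time, so the disabled branch
    // is discarded rather than tested at run time.
    template <bool DoLocalPrefetch = true>
    void operator()() {
        if constexpr (DoLocalPrefetch) {
            LocalPrefetch();
        }
        ++gemm_calls;
    }
};

// Pipeline-side helper: the caller has already prefetched, so it opts out.
template <typename BlockGemm>
void run_without_prefetch(BlockGemm& block_gemm) {
    // "template" is needed because block_gemm has a dependent type here.
    block_gemm.template operator()<false>();
}

// Usage: one default call (prefetches) and one opt-out call (does not).
inline void demo() {
    MockBlockGemm g;
    g();
    run_without_prefetch(g);
    assert(g.prefetch_calls == 1);
    assert(g.gemm_calls == 2);
}
```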

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_mem.hpp | Unified PipelineImpl template replacing separate Intrawave and Interwave specializations; scheduler-specific behavior now handled by BlockGemm implementations |
| include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_comp_v3.hpp | Updated block_gemm calls to explicitly control LocalPrefetch via template parameter; replaced automatic prefetch with explicit control |
| include/ck_tile/ops/gemm/block/block_universal_gemm_as_bs_cr.hpp | Added DoLocalPrefetch template parameter to operator(); added block_sync_lds() to Intrawave implementation to enable unified pipeline |

template <GemmPipelineScheduler Scheduler>
// Unified pipeline implementation for both Intrawave and Interwave schedulers
// Scheduler-specific behavior is encapsulated in BlockGemm (via BlockGemmImpl specializations)
template <GemmPipelineScheduler Scheduler_>

Copilot AI Jan 13, 2026

The template parameter 'Scheduler_' is declared but never used within the PipelineImpl struct. The scheduler-specific behavior is achieved through the BlockGemm specializations rather than this template parameter. Consider either removing this unused template parameter or adding a static_assert to validate that it matches the scheduler from the Problem definition for consistency.
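
One way the static_assert option could look is sketched below. It is a self-contained mock, not the PR's code: `MockProblem`, `MockPipeline`, and the two-value enum are simplified stand-ins, and the existence of a `Problem::Scheduler` member is an assumption based on the wording of this comment.

```cpp
// Simplified stand-in for the scheduler enum used in the PR.
enum class GemmPipelineScheduler { Intrawave, Interwave };

// Hypothetical problem descriptor that carries the scheduler choice.
template <GemmPipelineScheduler S>
struct MockProblem {
    static constexpr GemmPipelineScheduler Scheduler = S;
};

template <typename Problem>
struct MockPipeline {
    template <GemmPipelineScheduler Scheduler_>
    struct PipelineImpl {
        // Turns a mismatched instantiation into a compile error, which also
        // gives the otherwise unused parameter a documented purpose.
        static_assert(Scheduler_ == Problem::Scheduler,
                      "PipelineImpl scheduler must match Problem::Scheduler");
        static constexpr GemmPipelineScheduler scheduler = Scheduler_;
    };

    // The pipeline instantiates the unified implementation from the problem.
    using Impl = PipelineImpl<Problem::Scheduler>;
};
```

With this in place, something like `MockPipeline<MockProblem<GemmPipelineScheduler::Interwave>>::PipelineImpl<GemmPipelineScheduler::Intrawave>` would fail to compile once the nested type is instantiated.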

block_sync_lds();
block_gemm.template operator()<false>(
c_block_tile, a_lds_gemm_window, b_lds_gemm_window);
// block_sync_lds();

Copilot AI Jan 13, 2026

Consider removing the commented-out block_sync_lds() calls rather than leaving them in the code. If these are no longer needed due to the refactoring (where synchronization is now handled within the BlockGemm operator), they should be removed to avoid confusion. If they might be needed for debugging or future reference, add a comment explaining why they are commented out.

Suggested change
// block_sync_lds();

block_sync_lds();
block_gemm.template operator()<false>(
c_block_tile, a_lds_gemm_window, b_lds_gemm_window);
// block_sync_lds();

Copilot AI Jan 13, 2026

Consider removing the commented-out block_sync_lds() calls rather than leaving them in the code. If these are no longer needed due to the refactoring (where synchronization is now handled within the BlockGemm operator), they should be removed to avoid confusion. If they might be needed for debugging or future reference, add a comment explaining why they are commented out.

Suggested change
// block_sync_lds();

Comment on lines +384 to +391
// Note: Interwave scheduler requires LocalPrefetch as part of its design
// The DoLocalPrefetch flag is provided for API consistency but should typically be
// true
if constexpr(DoLocalPrefetch)
{
    LocalPrefetch<kIter.value>(
        a_block_window, b_block_window, a_load_tr, b_load_tr);
}

Copilot AI Jan 13, 2026

The comment states that LocalPrefetch is required for Interwave scheduler design, but the code still allows it to be disabled via DoLocalPrefetch=false. Consider either enforcing this requirement with a static_assert or updating the comment to clarify what happens if DoLocalPrefetch is false for Interwave.
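
If the requirement is meant to be strict, the enforcement option might look roughly like the sketch below. It uses a hypothetical `MockBlockGemmImpl` with a simplified enum, and it assumes that Interwave can never run correctly with `DoLocalPrefetch=false`, which the source comment ("should typically be true") does not confirm.

```cpp
// Simplified stand-in for the scheduler enum used in the PR.
enum class GemmPipelineScheduler { Intrawave, Interwave };

// Hypothetical mock of the per-scheduler BlockGemm specializations.
template <GemmPipelineScheduler Scheduler>
struct MockBlockGemmImpl;

template <>
struct MockBlockGemmImpl<GemmPipelineScheduler::Interwave> {
    // Returns whether a local prefetch would be performed.
    template <bool DoLocalPrefetch = true>
    constexpr bool operator()() const {
        // Fires only when this operator is instantiated with false.
        static_assert(DoLocalPrefetch,
                      "Interwave BlockGemm requires LocalPrefetch");
        return DoLocalPrefetch;
    }
};

template <>
struct MockBlockGemmImpl<GemmPipelineScheduler::Intrawave> {
    // Intrawave accepts either setting; the pipeline decides.
    template <bool DoLocalPrefetch = true>
    constexpr bool operator()() const {
        return DoLocalPrefetch;
    }
};
```

The alternative the comment mentions, rewording the note to describe what happens when the flag is false, avoids constraining callers but leaves the misuse detectable only at run time.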
