Implement batched gemm bias permute for RDNA4 #3534

ErwinTerpstra · 2026-01-08T11:56:56Z

Proposed changes

This MR implements batched gemm bias permute for RDNA3/4. In practice, this is a multidimensional contraction operation. The MR contains the following:

Profiler and test infrastructure for the batched contraction instances, which had not yet been implemented for the XDL versions
Device struct for batched contraction using WMMA instructions (device_batched_contraction_multiple_d_wmma_cshuffle_v3)
Changes to GridwiseGemmWmmaCShuffleV3 to allow passing in non-naive grid descriptors

Note that support for different dimensions and D tensor configurations is very limited at the moment. More scaffolding would be needed to add generic support for a variable number of dimensions, but this limited implementation at least reaches parity with the XDL versions.
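
To make the "multidimensional contraction" description concrete, here is a minimal CPU reference sketch of a batched contraction with a bias term and a permuted output. It is not the Composable Kernel API. The `Dims` struct, the function name, the choice of two M dimensions, three N dimensions and one K dimension (suggested only by the `m2_n3_k1` instance name), the bias being broadcast over M, and the output layout `[G, M0, N0, M1, N1, N2]` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical problem sizes: G batches, M split into (M0, M1),
// N split into (N0, N1, N2), and a single K dimension.
struct Dims
{
    std::size_t G, M0, M1, N0, N1, N2, K;
};

// CPU reference sketch (not the library implementation):
//   E[g, m0, n0, m1, n1, n2] =
//       sum_k A[g, m0, m1, k] * B[g, n0, n1, n2, k] + D[g, n0, n1, n2]
// A, B and D are packed row-major. The interleaved E layout is an assumed
// permutation chosen only to show what "permute" means here.
std::vector<float> contraction_bias_permute(const Dims& d,
                                            const std::vector<float>& a,
                                            const std::vector<float>& b,
                                            const std::vector<float>& bias)
{
    const std::size_t M = d.M0 * d.M1;        // flattened GEMM row count
    const std::size_t N = d.N0 * d.N1 * d.N2; // flattened GEMM column count
    assert(a.size() == d.G * M * d.K);
    assert(b.size() == d.G * N * d.K);
    assert(bias.size() == d.G * N);

    std::vector<float> e(d.G * M * N, 0.0f);
    for(std::size_t g = 0; g < d.G; ++g)
    {
        for(std::size_t m0 = 0; m0 < d.M0; ++m0)
        {
            for(std::size_t m1 = 0; m1 < d.M1; ++m1)
            {
                const std::size_t m = m0 * d.M1 + m1;
                for(std::size_t n0 = 0; n0 < d.N0; ++n0)
                {
                    for(std::size_t n1 = 0; n1 < d.N1; ++n1)
                    {
                        for(std::size_t n2 = 0; n2 < d.N2; ++n2)
                        {
                            const std::size_t n = (n0 * d.N1 + n1) * d.N2 + n2;

                            // Plain GEMM inner product over the single K dimension.
                            float acc = 0.0f;
                            for(std::size_t k = 0; k < d.K; ++k)
                            {
                                acc += a[(g * M + m) * d.K + k] * b[(g * N + n) * d.K + k];
                            }
                            acc += bias[g * N + n]; // bias broadcast over M (assumption)

                            // Permuted store: index into [G, M0, N0, M1, N1, N2].
                            const std::size_t out =
                                ((((g * d.M0 + m0) * d.N0 + n0) * d.M1 + m1) * d.N1 + n1) * d.N2 +
                                n2;
                            e[out] = acc;
                        }
                    }
                }
            }
        }
    }
    return e;
}
```

For example, with `Dims{1, 1, 2, 2, 1, 1, 1}`, `a = {1, 2}`, `b = {3, 4}` and `bias = {10, 20}`, the result is `{13, 16, 24, 28}`, whereas an unpermuted `[M, N]` store would give `{13, 24, 16, 28}`. Flattening the M and N indices shows why the operation can reuse a GEMM kernel once the tensor descriptors encode the strides.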

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

EnricoDeg · 2026-01-14T18:30:46Z

Can you also add an example for wmma?

EnricoDeg

Nice work!

.../tensor_operation/gpu/device/impl/device_batched_contraction_multiple_d_wmma_cshuffle_v3.hpp

...ermute/device_batched_gemm_bias_permute_m2_n3_k1_wmma_c_shuffle_f16_f16_f16_f16_instance.cpp

ApoorvaKalyani

Great work!
I also think we need more instances and we need to reverify the tests for those.

ErwinTerpstra · 2026-01-15T14:05:16Z

@EnricoDeg @ApoorvaKalyani Thank you for the reviews. I processed the comments, added an example and added a couple of instances for both v1 and v3 pipelines. Let me know if there's still something you'd like to see changed.

EnricoDeg · 2026-01-16T07:58:14Z

> @EnricoDeg @ApoorvaKalyani Thank you for the reviews. I processed the comments, added an example and added a couple of instances for both v1 and v3 pipelines. Let me know if there's still something you'd like to see changed.

LGTM

ApoorvaKalyani · 2026-01-16T12:41:04Z

> > @EnricoDeg @ApoorvaKalyani Thank you for the reviews. I processed the comments, added an example and added a couple of instances for both v1 and v3 pipelines. Let me know if there's still something you'd like to see changed.
>
> LGTM

Great!

ErwinTerpstra added 5 commits December 19, 2025 16:03

feat: test setup for batched contraction (aka batched gemm multiple d e permute)

873688f

wip: device struct for WMMA batched contraction multiple d based on new gridwise op

c78c353

feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

0fae879

fix: failure to resolve template parameters when calling new function overload

82323d2

fix: passing reference type as parameter instead of underlying types

c67a425

ErwinTerpstra added the organization: streamhpc label Jan 8, 2026

ErwinTerpstra added 4 commits January 8, 2026 12:09

Merge branch 'develop' into eterpstr/96-implement-device_batched_gemm_bias_permute-for-rdna4

ac82e53

fix: merge error caused duplicate definitions

3c6fc61

fix: make sure constness of template and parameters types match

56e9620

fix: don't compile batched contraction test on unsupported architectures

918981d

krithalith requested review from ApoorvaKalyani and EnricoDeg January 9, 2026 12:41

EnricoDeg reviewed Jan 14, 2026

ApoorvaKalyani reviewed Jan 15, 2026

ErwinTerpstra added 7 commits January 15, 2026 10:22

feat: add example for new wmma implementation, and consolidate example code between platforms

7a201c9

style: return inline instead of with branch

d01b8f6

chore: add extra assert on vector memory access sizes

55959b0

chore: clean up some unused variables

51f8d41

fix: correct tail number calculation, added small cases and extra instances to the test

96612c1

Merge branch 'develop' into eterpstr/96-implement-device_batched_gemm_bias_permute-for-rdna4

7759d80

fix: merge caused duplicate function definition

b7e97c8

ErwinTerpstra added 2 commits January 16, 2026 08:31

fix: disable wave transfer for batched contraction

6176d49

fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method

7fb2241

ErwinTerpstra marked this pull request as ready for review January 16, 2026 12:41

ErwinTerpstra requested review from carlushuang and illsilin as code owners January 16, 2026 12:41

ErwinTerpstra requested review from a team, Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, cgmillette, coderfeli, geyyer, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 16, 2026 12:41

ApoorvaKalyani approved these changes Jan 16, 2026

EnricoDeg approved these changes Jan 16, 2026

illsilin approved these changes Jan 16, 2026

ErwinTerpstra merged commit fe40a5d into develop Jan 17, 2026
22 checks passed

ErwinTerpstra deleted the eterpstr/96-implement-device_batched_gemm_bias_permute-for-rdna4 branch January 17, 2026 07:30

5 participants
