Remove unnecessary token padding for MoE in BF16 mode #2255
Conversation
```diff
-TOKEN_GROUP_ALIGN_SIZE_M = 8
+TOKEN_GROUP_ALIGN_SIZE_M = 1
 ValidTokenGroupAlignmentSize = Literal[8, 16, 32]
```
This fix is "soft", in the sense that the padding code path still exists for bf16.
I wonder whether it's viable to go one step further: remove all padding logic for bf16 and keep it only in the quantized paths. cc @danielvegamyhre
Either way is fine. The 8-token alignment is needed if we want to use TMA in any kernels operating on each token group (8 tokens * 2 bytes per element = 16-byte alignment). However, if we only do that in the low-precision code path, then there's no reason to pad in bf16.
Feel free to remove bf16 padding entirely.
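A quick back-of-envelope check of that arithmetic (illustrative only, not torchtitan code):

```python
# With 8-token group alignment and bf16 elements (2 bytes each), every token
# group boundary falls on a 16-byte multiple, which is what TMA requires.
TOKEN_GROUP_ALIGN_SIZE_M = 8
BF16_BYTES_PER_ELEM = 2
assert TOKEN_GROUP_ALIGN_SIZE_M * BF16_BYTES_PER_ELEM == 16
```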
For bf16, 8-token alignment is not needed anywhere; see:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2048, 4096, device="cuda", dtype=torch.bfloat16).requires_grad_(True)
w = torch.randn(2, 4096, 7168, device="cuda", dtype=torch.bfloat16).requires_grad_(True)

# group offsets that are not multiples of 8
offs = torch.tensor([1023, 2048], device="cuda", dtype=torch.int32)

out = F.grouped_mm(x, w, offs=offs)
gO = torch.rand_like(out)
out.backward(gO)

# check that gradients are computed
print(x.grad.sum(), w.grad.sum())
```
@danielvegamyhre
Right now we are mixing padding and permutation in one kernel. Since bf16 doesn't require padding, I wonder if it makes sense to move the padding into the quantization kernel? The argument is that the permute kernel itself should be general and not require the user to do padding from outside.
Sure, if we agree to move the padding logic to the quant paths, then I will refactor to remove TOKEN_GROUP_ALIGN_SIZE_M in torchtitan.
@tianyu-l we have a version of the permutation and pad/fill kernel in torchao now, used in the MXFP8 EP primitives. It is not fused with quantization though. To clarify, are you asking if we can delete the permute+pad kernel from torchtitan and replace it with a fused permute+pad+quantize kernel in torchao?
@danielvegamyhre My request is that we remove padding from torchtitan entirely, while keeping correctness.
In the past we had the permute+pad kernel to avoid a d2h sync on the padding part. Now that we no longer need padding for bf16, I'd hope we can remove the kernel altogether, but that requires torchao to handle padding.
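For context, a minimal sketch of why host-side padding triggers that d2h sync (hypothetical code, not the torchtitan kernel):

```python
import torch

# The per-expert token counts live on the GPU after routing. Computing the
# padded buffer size on the host reads them back, blocking until the routing
# kernels finish -- this is the d2h sync the fused permute+pad kernel avoids.
tokens_per_expert = torch.tensor([1023, 1025], device="cuda")  # produced by the router
padded = ((tokens_per_expert + 7) // 8) * 8                    # round up to multiples of 8
total_padded = int(padded.sum().item())                        # .item() forces the d2h sync
buf = torch.empty(total_padded, 4096, device="cuda", dtype=torch.bfloat16)
```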
That is doable
The Solar Open-102B technical report is very interesting @rakkit, thanks for sharing it!
Fix of #2225.
Context: the Solar Open-102B technical report points out that in BF16 mode, expert parallel did unnecessary token padding (this also applies to the non-EP case).
This PR sets `TOKEN_GROUP_ALIGN_SIZE_M = 1` by default. `indices_padding_wrapper_permute` takes `TOKEN_GROUP_ALIGN_SIZE_M = 1` and `padded_max_len = x.shape[0]`, which avoids any padding.
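To make the effect concrete, here is a small illustrative sketch (not torchtitan's actual code) of how the alignment size determines per-group padding; with an alignment of 1, every group size is already aligned, so no padding tokens are inserted:

```python
import torch

def padded_group_sizes(tokens_per_group: torch.Tensor, align_m: int) -> torch.Tensor:
    # Round each group's token count up to the next multiple of align_m.
    return ((tokens_per_group + align_m - 1) // align_m) * align_m

tokens_per_group = torch.tensor([1023, 1025])
print(padded_group_sizes(tokens_per_group, 8))  # tensor([1024, 1032]) -> 8 padding tokens added
print(padded_group_sizes(tokens_per_group, 1))  # tensor([1023, 1025]) -> no padding
```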
Test:

Original implementation (with `TOKEN_GROUP_ALIGN_SIZE_M=8`):

```bash
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --debug.seed 10
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --debug.seed 10 --parallelism.expert_parallel_degree=2
```

[screenshots of the training runs]

And with this PR:

```bash
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --debug.seed 10
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --debug.seed 10 --parallelism.expert_parallel_degree=2
```

[screenshots of the training runs]