[NPU]: Add NPU support for the tvd operator #998
Conversation
Force-pushed from 1a31a76 to d8740a3
Hi @Tcc0403 @zheliuyu @noemotiovon
```python
# Fallback to desired block size if no best practice found (no tiling needed)
BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(V))

MAX_BATCH_PER_KERNEL = 65535  # 每个kernel最大处理量
```
Please use English comments.
```python
def tv_distance_forward_triton(p, q, shift_labels, reduction, ignore_index, has_label):
    BT, V = p.shape
    # NPU does not support bfloat16 type
    p = p.to(torch.float32)
```
In some previous operators, I recall that there are test cases using the bf16 dtype that can run correctly on the Ascend Triton backend. It would be helpful to explicitly document the reason why this operator requires conversion to float32.
Additionally, once the NPU backend adds native support for bf16, this cast should be removed. Consider adding a TODO comment here, for example:
```python
# TODO: Remove float32 conversion after Ascend NPU supports bf16 natively
```
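For illustration, a documented version of the cast could look something like the sketch below; the explicit dtype check and the handling of `q` are assumptions for the example, not necessarily how this PR structures it:

```python
# The Ascend Triton backend cannot run this kernel on bf16 inputs yet,
# so compute in float32 for now (assumed guard, shown only for illustration).
# TODO: Remove float32 conversion after Ascend NPU supports bf16 natively
if p.dtype == torch.bfloat16:
    p = p.to(torch.float32)
    q = q.to(torch.float32)
```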
```python
grads_chunk = grads[start:end]
labels_chunk = shift_labels[start:end] if has_label else torch.empty(1, device=p.device)

_tv_distance_kernel[grid](
```
It is possible to redesign the kernel to process multiple rows per program using a kernel-side loop with a fixed grid (e.g., 65535 programs).
Given that the current kernel is a simple per-row implementation, the overhead of multiple launches is likely negligible.
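A rough sketch of that alternative, assuming a hypothetical per-row TVD kernel: the pointer/stride arguments and the plain `0.5 * |p - q|` loss/gradient math are illustrative, not the exact signature or kernel used in this PR. The point is only the fixed grid with a row-stride loop:

```python
import triton
import triton.language as tl

@triton.jit
def _tv_distance_persistent_kernel(
    p_ptr, p_stride,
    q_ptr, q_stride,
    loss_ptr,
    grads_ptr, grads_stride,
    n_rows,                      # total rows (BT), may exceed the grid limit
    n_cols,                      # vocabulary size (V)
    BLOCK_SIZE: tl.constexpr,    # >= n_cols, i.e. one block covers a row
    NUM_PROGRAMS: tl.constexpr,  # fixed grid size, e.g. 65535
):
    # Each program strides over rows, so a fixed-size grid covers any BT
    # with a single kernel launch.
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_cols
    for row in range(tl.program_id(0), n_rows, NUM_PROGRAMS):
        p = tl.load(p_ptr + row * p_stride + offsets, mask=mask, other=0.0)
        q = tl.load(q_ptr + row * q_stride + offsets, mask=mask, other=0.0)
        diff = p - q
        # TVD(p, q) = 0.5 * sum|p - q|; d/dp = 0.5 * sign(p - q)
        tl.store(loss_ptr + row, tl.sum(0.5 * tl.abs(diff), axis=0))
        grad = tl.where(diff >= 0, 0.5, -0.5)
        tl.store(grads_ptr + row * grads_stride + offsets, grad, mask=mask)

# Launch sketch (argument names are the hypothetical ones above):
# grid = (min(BT, 65535),)
# _tv_distance_persistent_kernel[grid](p, p.stride(0), q, q.stride(0),
#                                      loss, grads, grads.stride(0),
#                                      BT, V, BLOCK_SIZE=BLOCK_SIZE,
#                                      NUM_PROGRAMS=grid[0])
```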
Force-pushed from 5ae19a8 to d7b5254
```python
REDUCTION_LITERAL = Literal["none", "sum", "mean", "batchmean"]

_REDUCTION_MODE_NONE = tl.constexpr(0)
_REDUCTION_MODE_SUM = tl.constexpr(1)
_REDUCTION_MODE_MEAN = tl.constexpr(2)
_REDUCTION_MODE_BATCHMEAN = tl.constexpr(3)

_str_to_reduction_mode = {
    "none": _REDUCTION_MODE_NONE.value,
    "sum": _REDUCTION_MODE_SUM.value,
    "mean": _REDUCTION_MODE_MEAN.value,
    "batchmean": _REDUCTION_MODE_BATCHMEAN.value,
}
```
You should be able to assign a str to a tl.constexpr variable directly, without having to manually convert it to an int. For instance, reduction in Liger's cross_entropy is just a str; it doesn't need an extra mapping onto ints.
This mapping exists in the original tvd implementation, but I think we should remove it for readability as well.
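A minimal sketch of that string-constexpr pattern, using a toy kernel rather than the tvd kernel in this PR; Triton specializes the kernel per distinct constexpr value, so the string comparison is resolved at compile time:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _demo_kernel(x_ptr, out_ptr, n_cols, BLOCK_SIZE: tl.constexpr, reduction: tl.constexpr):
    # `reduction` is an ordinary Python string passed as a constexpr,
    # so this branch costs nothing at runtime.
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_cols
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    if reduction == "none":
        tl.store(out_ptr + offsets, x, mask=mask)
    else:  # scalar reduction variants collapsed into one branch for this toy example
        tl.store(out_ptr, tl.sum(x, axis=0))

x = torch.randn(1000, device="cuda")  # or "npu" with torch_npu + triton-ascend
out = torch.empty_like(x)
_demo_kernel[(1,)](x, out, x.numel(), BLOCK_SIZE=1024, reduction="none")
```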
It has been modified and passed the test.
```python
if reduction == _REDUCTION_MODE_BATCHMEAN.value:
    # TODO: Remove float32 conversion after Ascend NPU supports bf16 natively
    return output_tensor.sum().to(torch.float32) / n_non_ignore, grads.to(torch.float32) / n_non_ignore
elif reduction == _REDUCTION_MODE_SUM.value:
    return output_tensor.sum(dim=0), grads
elif reduction == _REDUCTION_MODE_MEAN.value:
    return output_tensor.sum().to(torch.float32) / (n_non_ignore * V), grads.to(torch.float32) / (n_non_ignore * V)
else:
    return output_tensor, grads
```
Without the str-to-int mapping, this part will be more readable.
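A rough sketch of what that epilogue could look like once `reduction` stays a string; `_reduce_tvd` is a hypothetical helper name and the float32 casts from the original code are omitted for brevity:

```python
def _reduce_tvd(output_tensor, grads, reduction, n_non_ignore, V):
    # Hypothetical helper: plain string comparisons instead of mapped int codes.
    if reduction == "batchmean":
        return output_tensor.sum() / n_non_ignore, grads / n_non_ignore
    elif reduction == "sum":
        return output_tensor.sum(dim=0), grads
    elif reduction == "mean":
        return output_tensor.sum() / (n_non_ignore * V), grads / (n_non_ignore * V)
    else:  # "none"
        return output_tensor, grads
```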
modified
```python
tl.store(grads_row_ptr + offsets, grad_res, mask=mask)

if reduction == _REDUCTION_MODE_NONE:
```
And it will be a simple `if reduction == "none":`.
modified
Feel free to re-request review when it's ready.
Force-pushed from cdae03e to 7b7f071
On NPU, bfloat16 execution of the tvd operator involves low-precision accumulation in the underlying matmul, and the stricter tolerance may therefore lead to false negatives in correctness tests.
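For context, correctness tests typically widen tolerances for low-precision dtypes rather than reuse float32 thresholds. A rough pytest-style sketch; the reference computation and the specific tolerance values are illustrative stand-ins, not the actual tests in this repo:

```python
import pytest
import torch

@pytest.mark.parametrize(
    "dtype, atol, rtol",
    [
        (torch.float32, 1e-8, 1e-6),
        # bf16 accumulates in low precision, so far looser thresholds are
        # needed to avoid false negatives.
        (torch.bfloat16, 1e-2, 1e-2),
    ],
)
def test_tvd_matches_reference(dtype, atol, rtol):
    p = torch.softmax(torch.randn(8, 128), dim=-1).to(dtype)
    q = torch.softmax(torch.randn(8, 128), dim=-1).to(dtype)
    expected = 0.5 * (p.float() - q.float()).abs().sum(dim=-1)
    actual = 0.5 * (p - q).abs().sum(dim=-1).float()  # stand-in for the NPU kernel output
    assert torch.allclose(actual, expected, atol=atol, rtol=rtol)
```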
Meanwhile, the checkstyle failure does not involve any files I modified in this PR; it might be caused by other commits.
@Tcc0403 It's ready for review, if you have time.
Tcc0403 left a comment
Thanks!
Summary
This PR mainly completes the adaptation of the tvd operator on the NPU:
1. Solve the operator UB (unified buffer) overflow problem.
2. Use a chunking strategy to work around the triton-ascend maximum grid size of 65535 (see the sketch after this list).
3. bf16 is not supported by the backend, so inputs are converted to fp32.
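A minimal sketch of the chunking idea, assuming a hypothetical per-row Triton kernel `_tv_distance_kernel` (not defined here) whose signature matches the call below; argument names and the fp32 output buffers are illustrative, not the exact wrapper in this PR:

```python
import torch

MAX_BATCH_PER_KERNEL = 65535  # triton-ascend caps the launch grid at 65535 programs

def tv_distance_forward_chunked(p, q, BLOCK_SIZE):
    BT, V = p.shape
    loss = torch.zeros(BT, dtype=torch.float32, device=p.device)
    grads = torch.empty(BT, V, dtype=torch.float32, device=p.device)
    # Launch the per-row kernel once per chunk so each grid stays within the limit.
    for start in range(0, BT, MAX_BATCH_PER_KERNEL):
        end = min(start + MAX_BATCH_PER_KERNEL, BT)
        n_rows = end - start
        _tv_distance_kernel[(n_rows,)](
            p[start:end], p.stride(0),
            q[start:end], q.stride(0),
            loss[start:end],
            grads[start:end], grads.stride(0),
            V,
            BLOCK_SIZE=BLOCK_SIZE,
        )
    return loss, grads
```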
Testing Done
tvd forward and backward pass tests
- make test to ensure correctness
- make checkstyle to ensure code style
- make test-convergence to ensure convergence