Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Unexpected behavior of torch.scatter on Gaudi-2

taesukim_squeezebits

Hello,

We’ve discovered that torch.scatter can produce incorrect results on Gaudi-2. When using a mask tensor to filter a random tensor (as in top-p sampling from the Transformers library), Gaudi-2 intermittently returns the wrong output. The minimal repro code is shown below.

import torch

# On Gaudi, importing habana_frameworks.torch.core may be required to register
# the "hpu" device, depending on the software stack version.
# import habana_frameworks.torch.core as htcore

torch.manual_seed(42)

device = "cuda" if torch.cuda.is_available() else "hpu"

num = 1011
for i in range(20):
    # Index tensor: a random permutation of 0..num-1; mask tensor: last two entries True.
    a_cpu, b_cpu = torch.arange(0, num).unsqueeze(0), torch.zeros(1, num, dtype=torch.bool)

    idx = torch.randperm(a_cpu.nelement())
    a_cpu = a_cpu.view(-1)[idx].view(a_cpu.size())
    b_cpu[:, -2:] = True

    a, b = a_cpu.to(device), b_cpu.to(device)

    # Reference scatter on CPU vs. the same scatter on the accelerator.
    # Note that the mask tensor is used both as the destination and the src.
    a_cpu = b_cpu.scatter(1, a_cpu, b_cpu)
    a = b.scatter(1, a, b)

    assert torch.all(a.cpu() == a_cpu)

This assertion does not fail on NVIDIA GPUs (tested on an A6000), but on Gaudi-2 it occasionally trips, especially as num grows larger. We ran the code in eager mode.
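
For context, this is roughly the pattern the Transformers top-p logits processor uses to build its removal mask (a paraphrased sketch, not the exact library code). Note that the scatter call uses the mask tensor as both the destination and the src, the same aliasing as b.scatter(1, a, b) in the repro above:

import torch

def top_p_removal_mask(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    # Sort ascending so the lowest-probability tokens come first.
    sorted_logits, sorted_indices = torch.sort(logits, descending=False)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    # Tokens whose cumulative probability stays below (1 - top_p) are removed.
    sorted_mask = cumulative_probs <= (1 - top_p)
    # Map the mask back to the original token order; the mask tensor is both
    # the scatter destination and the src, as in the repro above.
    return sorted_mask.scatter(1, sorted_indices, sorted_mask)

logits = torch.randn(1, 32)
removal_mask = top_p_removal_mask(logits)
filtered_logits = logits.masked_fill(removal_mask, float("-inf"))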

 

Our environment is as follows:

HL-SMI Version: hl-1.21.1-fw-59.2.3.0

Driver Version: 1.21.0-ca59b5a

Nic Driver Version: 1.21.0-732bcf3

docker image: vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

1 Solution
MyLinhG
Employee

Hello taesukim_squeezebits,

 

Per our engineering team, the issue will be fixed in version 1.22. Please upgrade to that version when it becomes available. The issue was caused by incorrect handling of TPC input re-use in eager mode, which led to both the output tensor and the update (src) tensor being modified.
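
As an illustration of that root cause, the repro above passes the same tensor as both the scatter destination and the update (src): b.scatter(1, a, b). One possible way to sidestep that aliasing until 1.22 ships is to clone the src. This is a sketch only; whether it actually avoids the TPC input re-use path is an assumption and has not been verified:

import torch
# import habana_frameworks.torch.core as htcore  # may be needed to register "hpu"

num = 1011
a = torch.randperm(num).unsqueeze(0).to("hpu")   # permutation used as indices
b = torch.zeros(1, num, dtype=torch.bool)
b[:, -2:] = True
b = b.to("hpu")

# Pass a separate buffer as src so the destination and the update never alias.
out = b.scatter(1, a, b.clone())
# instead of: out = b.scatter(1, a, b)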

 

Thank you for bringing the issue to our attention.

 


3 Replies
MyLinhG
Employee

Hello taesukim_squeezebits.

 

We have had our engineering team reproduce the issue on the 1.21.x releases, and they confirm they see the same incorrect torch.scatter results. They will follow up to confirm a fix in an upcoming release.

 

Thank you for your patience.

 
