Hello,
We've discovered that torch.scatter can produce incorrect results on Gaudi-2. When using a mask tensor to filter a random tensor (as in top-p sampling from the Transformers library), Gaudi-2 intermittently returns the wrong output. A minimal repro is shown below.
import torch
try:
    # Registers the "hpu" device on Gaudi; not needed on CUDA machines.
    import habana_frameworks.torch.core  # noqa: F401
except ImportError:
    pass

torch.manual_seed(42)
device = "cuda" if torch.cuda.is_available() else "hpu"
num = 1011
for i in range(20):
    a_cpu = torch.arange(0, num).unsqueeze(0)
    b_cpu = torch.zeros(1, num, dtype=torch.bool)
    # Shuffle a_cpu so it holds a random permutation of 0..num-1.
    idx = torch.randperm(a_cpu.nelement())
    a_cpu = a_cpu.view(-1)[idx].view(a_cpu.size())
    b_cpu[:, -2:] = True
    a, b = a_cpu.to(device), b_cpu.to(device)
    # CPU reference result vs. the same scatter on the device.
    a_cpu = b_cpu.scatter(1, a_cpu, b_cpu)
    a = b.scatter(1, a, b)
    assert torch.all(a.cpu() == a_cpu)
This assertion never fails on NVIDIA GPUs (tested on an A6000), but on Gaudi-2 it occasionally trips, especially as num grows larger. We tested the code in eager mode.
Our environment is as follows:
HL-SMI Version: hl-1.21.1-fw-59.2.3.0
Driver Version: 1.21.0-ca59b5a
Nic Driver Version: 1.21.0-732bcf3
docker image: vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
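For context, the scatter call in our real workload follows the top-p filtering pattern used in Transformers' sampling code. The sketch below is modeled on that pattern (not copied from the library) and uses the same bool-mask scatter shape as the repro above:

```python
import torch

top_p = 0.9
logits = torch.tensor([[5.0, 1.0, 0.0, -1.0]])  # toy scores for 4 tokens

# Sort descending and accumulate probability mass.
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)

# Drop tokens past the top_p mass, always keeping the most likely token.
sorted_indices_to_remove = cumulative_probs > top_p
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = False

# The scatter that maps the mask back to the original token order --
# the same bool-mask scatter pattern as the repro.
indices_to_remove = sorted_indices_to_remove.scatter(
    1, sorted_indices, sorted_indices_to_remove
)
filtered = logits.masked_fill(indices_to_remove, float("-inf"))
```

With these toy logits, only the most likely token survives the filter.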
Hello taesukim_squeezebits,
Per our engineering team, the issue will be fixed in version 1.22; please upgrade when that release is available. The bug was caused by incorrect handling of TPC input re-use in eager mode, which caused both the output and the update tensor to be modified.
Thank you for bringing the issue to our attention.
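Until 1.22 ships, a possible workaround, assuming the aliasing between the self and update (src) tensors is what triggers the input re-use bug, is to pass a clone of the update tensor. This is a sketch, not a vendor-confirmed fix:

```python
import torch

torch.manual_seed(42)
num = 1011

a = torch.arange(0, num).unsqueeze(0)
idx = torch.randperm(a.nelement())
a = a.view(-1)[idx].view(a.size())  # random permutation of 0..num-1
b = torch.zeros(1, num, dtype=torch.bool)
b[:, -2:] = True

# Pass a clone as the update (src) tensor so it no longer aliases self.
out = b.scatter(1, a, b.clone())
```

Since `a` is a full permutation, every output position is written exactly once, so `out` contains exactly two True entries.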
Hello taesukim_squeezebits,
Our engineering team has reproduced the issue on the 1.21.x releases and confirms the same incorrect torch.scatter results. They will confirm a fix for the next release.
Thank you for your patience.