Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Unexpected behavior of torch.scatter on Gaudi-2

taesukim_squeezebits

Hello,

We’ve discovered that torch.scatter can produce incorrect results on Gaudi-2. When using a mask tensor to filter a random tensor (as in top-p sampling from the Transformers library), Gaudi-2 intermittently returns the wrong output. The minimal repro code is shown below.

import torch

# On Gaudi, importing habana_frameworks.torch.core may be required to register
# the "hpu" device, depending on the software stack version.
# import habana_frameworks.torch.core as htcore

torch.manual_seed(42)

device = "cuda" if torch.cuda.is_available() else "hpu"

num = 1011
for i in range(20):
    # Index tensor: a random permutation of 0..num-1; mask tensor: last two entries True.
    a_cpu, b_cpu = torch.arange(0, num).unsqueeze(0), torch.zeros(1, num, dtype=torch.bool)

    idx = torch.randperm(a_cpu.nelement())
    a_cpu = a_cpu.view(-1)[idx].view(a_cpu.size())
    b_cpu[:, -2:] = True

    a, b = a_cpu.to(device), b_cpu.to(device)

    # Reference scatter on CPU vs. the same scatter on the accelerator.
    # Note that the mask tensor is used both as the destination and the src.
    a_cpu = b_cpu.scatter(1, a_cpu, b_cpu)
    a = b.scatter(1, a, b)

    assert torch.all(a.cpu() == a_cpu)

This assertion does not fail on NVIDIA GPUs (tested on an A6000), but on Gaudi-2 it occasionally trips, especially as num grows larger. We ran the code in eager mode.
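
For context, this is roughly the pattern the Transformers top-p logits processor uses to build its removal mask (a paraphrased sketch, not the exact library code). Note that the scatter call uses the mask tensor as both the destination and the src, the same aliasing as b.scatter(1, a, b) in the repro above:

import torch

def top_p_removal_mask(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    # Sort ascending so the lowest-probability tokens come first.
    sorted_logits, sorted_indices = torch.sort(logits, descending=False)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    # Tokens whose cumulative probability stays below (1 - top_p) are removed.
    sorted_mask = cumulative_probs <= (1 - top_p)
    # Map the mask back to the original token order; the mask tensor is both
    # the scatter destination and the src, as in the repro above.
    return sorted_mask.scatter(1, sorted_indices, sorted_mask)

logits = torch.randn(1, 32)
removal_mask = top_p_removal_mask(logits)
filtered_logits = logits.masked_fill(removal_mask, float("-inf"))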

 

Our environment is as follows:

HL-SMI Version: hl-1.21.1-fw-59.2.3.0

Driver Version: 1.21.0-ca59b5a

Nic Driver Version: 1.21.0-732bcf3

docker image: vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

1 Solution
MyLinhG
Employee

Hello taesukim_squeezebits,

 

Per our engineering team, the issue will be fixed in version 1.22. Please upgrade to that version when it becomes available. The issue was caused by incorrect handling of TPC input re-use in eager mode, which led to both the output tensor and the update (src) tensor being modified.
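
As an illustration of that root cause, the repro above passes the same tensor as both the scatter destination and the update (src): b.scatter(1, a, b). One possible way to sidestep that aliasing until 1.22 ships is to clone the src. This is a sketch only; whether it actually avoids the TPC input re-use path is an assumption and has not been verified:

import torch
# import habana_frameworks.torch.core as htcore  # may be needed to register "hpu"

num = 1011
a = torch.randperm(num).unsqueeze(0).to("hpu")   # permutation used as indices
b = torch.zeros(1, num, dtype=torch.bool)
b[:, -2:] = True
b = b.to("hpu")

# Pass a separate buffer as src so the destination and the update never alias.
out = b.scatter(1, a, b.clone())
# instead of: out = b.scatter(1, a, b)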

 

Thank you for bringing the issue to our attention.

 


3 Replies
MyLinhG
Employee

Hello taesukim_squeezebits.

 

We have had our engineering team reproduce the issue on the 1.21.x releases, and they confirm they see the same incorrect torch.scatter results. They will follow up to confirm a fix in an upcoming release.

 

Thank you for your patience.

 
