Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Unexpected behavior of torch.scatter on Gaudi-2

taesukim_squeezebits

Hello,

We’ve discovered that torch.scatter can produce incorrect results on Gaudi-2. When a mask tensor is used to filter a random tensor (as in top-p sampling in the Transformers library), Gaudi-2 intermittently returns the wrong output. A minimal repro is below.

import torch

# On Gaudi, the "hpu" device is registered by importing the Habana bridge.
try:
    import habana_frameworks.torch.core  # noqa: F401
except ImportError:
    pass

torch.manual_seed(42)

device = "cuda" if torch.cuda.is_available() else "hpu"

num = 1011
for i in range(20):
    # Indices: a random permutation of [0, num); mask: last two entries True.
    a_cpu = torch.arange(0, num).unsqueeze(0)
    b_cpu = torch.zeros(1, num, dtype=torch.bool)

    idx = torch.randperm(a_cpu.nelement())
    a_cpu = a_cpu.view(-1)[idx].view(a_cpu.size())
    b_cpu[:, -2:] = True

    a, b = a_cpu.to(device), b_cpu.to(device)
    # Out-of-place scatter on CPU (reference) and on the accelerator.
    a_cpu = b_cpu.scatter(1, a_cpu, b_cpu)
    a = b.scatter(1, a, b)

    assert torch.all(a.cpu() == a_cpu)

This assertion never fails on NVIDIA GPUs (tested on an A6000), but on Gaudi-2 it occasionally trips, especially as num grows. We tested in eager mode.
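While the behavior is under investigation, one defensive option (our own workaround idea, not something suggested by Habana) is to wrap scatter so the accelerator result is cross-checked against a CPU reference and replaced on mismatch. This sketch assumes the CPU result is correct, which held in our tests:

```python
import torch

def scatter_checked(target, dim, index, src):
    # Out-of-place scatter on the tensors' device, verified against a
    # CPU reference; falls back to the CPU result on mismatch.
    # Assumption: the CPU scatter is correct (it was in our tests).
    out = target.scatter(dim, index, src)
    ref = target.cpu().scatter(dim, index.cpu(), src.cpu())
    if not torch.equal(out.cpu(), ref):
        out = ref.to(target.device)
    return out

# CPU-only demonstration of the wrapper.
mask = torch.zeros(1, 8, dtype=torch.bool)
mask[:, -2:] = True
perm = torch.randperm(8).unsqueeze(0)
result = scatter_checked(mask, 1, perm, mask)
assert torch.equal(result, mask.scatter(1, perm, mask))
```

The extra CPU pass is obviously slow, so this only makes sense as a temporary guard on the affected op.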

 

Our environment is as follows:

HL-SMI Version: hl-1.21.1-fw-59.2.3.0

Driver Version: 1.21.0-ca59b5a

Nic Driver Version: 1.21.0-732bcf3

docker image: vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

1 Solution
MyLinhG
Employee

Hello taesukim_squeezebits,

 

Per our engineering team, the issue will be fixed in release 1.22; please upgrade once that version is available. The root cause was incorrect handling of TPC input re-use in Eager mode, which modified both the output and the update tensor.
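Given that the root cause involves in-place re-use of an input, and that the repro passes the same tensor as both scatter target and update source, one speculative interim workaround until 1.22 is to break that aliasing by cloning the update tensor. This is only a sketch inferred from the description above, not a confirmed fix; please verify it on your own Gaudi stack:

```python
import torch

def scatter_unaliased(target, dim, index, src):
    # The repro passes the same tensor as both target and src; cloning
    # src removes that aliasing, which the described TPC input re-use
    # bug may depend on (assumption, not a confirmed fix).
    return target.scatter(dim, index, src.clone())

# On CPU this matches the plain out-of-place scatter.
b = torch.zeros(1, 8, dtype=torch.bool)
b[:, -2:] = True
a = torch.randperm(8).unsqueeze(0)
assert torch.equal(scatter_unaliased(b, 1, a, b), b.scatter(1, a, b))
```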

 

Thank you for bringing the issue to attention.

 

3 Replies
MyLinhG
Employee

Hello taesukim_squeezebits.

 

Our engineering team has reproduced the issue on the 1.21.x releases and confirms the same incorrect torch.scatter results. They will confirm a fix for the next release.

 

Thank you for your patience.

 
