topic Intel Arc GPU in GPU Compute Software

https://community.intel.com/t5/GPU-Compute-Software/Intel-Arc-GPU/m-p/1749479#M2372

PyTorch XPU backward pass crash with Transformer/SDPA on Intel Arc iGPU

Environment:
- CPU: Intel Core Ultra 9 285H (Arrow Lake-H)
- GPU: Intel Arc iGPU (8 Xe-cores, shared memory, 128 GB DDR5)
- OS: Linux (Ubuntu 24.04)
- PyTorch: 2.12.0+xpu
- Intel oneAPI XPU driver: latest

I'm experiencing a crash during the backward pass of nn.TransformerEncoderLayer (or F.scaled_dot_product_attention) when running on Intel XPU. The forward pass works fine, but loss.backward() crashes with memory allocation errors or segfaults.

Minimal repro:

    import torch, torch.nn as nn
    m = nn.TransformerEncoderLayer(2048, 16, batch_first=True).to('xpu')
    x = torch.randn(8, 512, 2048, device='xpu')
    m(x).sum().backward()  # crash

Error message (the value varies each run, from about -7.9e16 to -5.0e17, and looks like integer overflow):

    RuntimeError: Trying to create tensor with negative dimension -79243236477491020: [-79243236477491020]

Sometimes also:

    IndexError: select(): index -1 out of range for tensor of size [0] at dimension 0

In severe cases (e.g., when AMP BF16 is enabled), the entire system freezes and requires a hard reboot -- the GPU driver itself crashes, not just the Python process.

Observations:
1. The same code runs perfectly on CPU (device='cpu').
2. CNN operations (Conv2d, Linear, BatchNorm) work fine on XPU -- only attention backward triggers this.
3. The forward pass is always fine; only loss.backward() crashes.
4. Not always reproducible with tiny models (batch=2, hidden=512), but almost guaranteed with larger sizes (batch=8, hidden=2048).
5. The system freeze (driver crash) happens with AMP BF16 enabled.
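As a sanity check on the "integer overflow" guess, the sketch below does the arithmetic for the shapes in the minimal repro and for the reported negative dimension. The helper names (describe_bad_dim, repro_sizes) are made up for illustration, FP32 storage is assumed, and the attention-score count only applies if the scores are actually materialized rather than handled by a fused kernel.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def describe_bad_dim(value):
    """Classify a reported dimension against the 32- and 64-bit signed ranges."""
    return {
        "fits_int32": INT32_MIN <= value <= INT32_MAX,
        "fits_int64": INT64_MIN <= value <= INT64_MAX,
        # Bit pattern of the value when stored as a 64-bit two's-complement integer.
        "as_uint64_hex": hex(value & (2**64 - 1)),
    }


def repro_sizes(batch=8, seq=512, d_model=2048, heads=16, bytes_per_elem=4):
    """Element counts and sizes for the main tensors in the repro (FP32 assumed)."""
    activation_elems = batch * seq * d_model
    attn_score_elems = batch * heads * seq * seq  # only if scores are materialized
    return {
        "activation_elems": activation_elems,
        "attn_score_elems": attn_score_elems,
        "activation_mib": activation_elems * bytes_per_elem / 2**20,
        "attn_score_mib": attn_score_elems * bytes_per_elem / 2**20,
    }


if __name__ == "__main__":
    print(describe_bad_dim(-79243236477491020))
    print(repro_sizes())
```

With the repro's shapes this gives 2**23 activation elements (32 MiB) and 2**25 attention-score elements (128 MiB), far below both the 32-bit element limit and the 128 GB of shared memory. The reported dimension fits in int64 but not in int32. One possible reading is that the size is a corrupted or uninitialized 64-bit value rather than the result of multiplying the actual shapes, but that is a guess, not a diagnosis.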
Things I've tried that didn't help:
- Replacing nn.MultiheadAttention with F.scaled_dot_product_attention
- AMP BF16 (made it worse -- system freeze)
- Periodic torch.xpu.empty_cache() + gc.collect() (delays the crash but doesn't prevent it)
- torch.xpu.synchronize() before/after backward

Is this a known PyTorch XPU backend bug, an Intel oneAPI driver issue, or something wrong with my setup? Any known fixes or workarounds would be greatly appreciated.

PlanteAmigor
Fri, 29 May 2026 15:45:36 GMT
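One way to narrow this down further is to compare the backward pass of F.scaled_dot_product_attention against attention written out from basic ops (matmul and softmax) on the same inputs. The sketch below is only a diagnostic idea, not a confirmed fix: manual_attention and compare_attention_grads are hypothetical helper names, the small default shapes are arbitrary, and whether the hand-written path avoids the crash on XPU is untested.

```python
import math


def manual_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V from basic ops, bypassing the SDPA entry point."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v


def compare_attention_grads(device="cpu", batch=2, heads=4, seq=16, head_dim=8, seed=0):
    """Return the max abs difference between dL/dq of the manual and SDPA paths."""
    # Imported here so the helpers can be defined even where torch is not installed.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(seed)
    # Generate inputs on CPU so both paths and all devices see identical values.
    q0, k0, v0 = (torch.randn(batch, heads, seq, head_dim) for _ in range(3))

    grads = {}
    for name, fn in (("manual", manual_attention),
                     ("sdpa", F.scaled_dot_product_attention)):
        # Fresh leaf tensors per path so gradients do not accumulate across runs.
        q, k, v = (t.clone().to(device).requires_grad_(True) for t in (q0, k0, v0))
        fn(q, k, v).sum().backward()
        grads[name] = q.grad.detach().cpu()

    return (grads["manual"] - grads["sdpa"]).abs().max().item()
```

Calling compare_attention_grads(device="cpu") gives a baseline. Running it with device="xpu" and progressively larger batch, seq, and head_dim values would show whether only the SDPA path fails (the call raises in the "sdpa" iteration) or whether the hand-written attention backward fails too, which would point away from the fused kernel.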