<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629) in Graphics</title>
    <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745556#M151092</link>
    <description>&lt;P&gt;I am running a similar use case on the B70 but have found a different issue that limits both dual- and single-card configurations on the ASRock Taichi Creator, which supports PCIe 5.0 x16&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P class=""&gt;Arc Pro B70 (BMG-G31) advertises LnkCap Gen 1 x1 on dual-card configuration — card-level ceiling, not platform&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;HR /&gt;&lt;H2&gt;Summary&lt;/H2&gt;&lt;P class=""&gt;I have two Intel Arc Pro B70 GPUs in a dual-card AI inference workstation. Both cards train at PCIe Gen 1 x1 (2.5 GT/s × 1) regardless of platform configuration. After extensive platform-side diagnostics — including a full motherboard BIOS update from AGESA 1.2.0.3e to AGESA 1.3.0.0a — the LnkCap register on both cards continues to advertise a maximum of Gen 1 x1, meaning the ceiling is at the card / silicon / firmware level rather than the motherboard or riser. Platform-side variables have been exhaustively ruled out. This is blocking production deployment because tensor parallelism (required for Intel LLM-Scaler vLLM multi-card serving) would be crippled at this link speed.&lt;/P&gt;&lt;P class=""&gt;Requesting engineering engagement to determine whether this is a known launch-firmware issue on BMG-G31, whether a newer firmware exists beyond the currently-installed BMG__31.1058, or whether this represents a hardware defect requiring RMA.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Hardware&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;GPUs under test:&lt;/STRONG&gt; 2x Intel Arc Pro B70 (32 GB GDDR6, BMG-G31 silicon)&lt;UL class=""&gt;&lt;LI&gt;Card 1: PCI 0000:03:00.0, MEI device /dev/mei0, firmware BMG__31.1058&lt;/LI&gt;&lt;LI&gt;Card 2: PCI 0000:08:00.0, MEI device /dev/mei1, firmware BMG__31.1058&lt;/LI&gt;&lt;LI&gt;Device ID: 8086:e223&lt;/LI&gt;&lt;LI&gt;Subsystem ID: 8086:1701&lt;/LI&gt;&lt;LI&gt;Both cards stock Intel reference design, purchased Q1 
2026&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Motherboard:&lt;/STRONG&gt; ASRock X870 Taichi Creator&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;BIOS pre-test:&lt;/STRONG&gt; 3.33 (AGESA 1.2.0.3e)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;BIOS post-update:&lt;/STRONG&gt; 4.10 (AGESA 1.3.0.0a, released 2026-02-10)&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;CPU:&lt;/STRONG&gt; AMD Ryzen 9 9900X (Zen 5, 12C/24T)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Memory:&lt;/STRONG&gt; 30 GB DDR5&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Power:&lt;/STRONG&gt; Adequate PSU for 2× 250W B70s + CPU (not a power-limit issue)&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Operating System / Software Stack&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;Ubuntu 24.04.4 LTS&lt;/LI&gt;&lt;LI&gt;Kernel: 6.17.0-20-generic&lt;/LI&gt;&lt;LI&gt;Kernel driver: xe (both cards claimed correctly, lspci -k confirms)&lt;/LI&gt;&lt;LI&gt;Intel compute-runtime: 26.09.37435.1 (latest from intel/compute-runtime GitHub)&lt;/LI&gt;&lt;LI&gt;Intel Graphics Compiler (IGC): v2.30.1 build 20950&lt;/LI&gt;&lt;LI&gt;Intel oneAPI DPC++ Compiler: 2025.3.3 (2025.3.3.20260319)&lt;/LI&gt;&lt;LI&gt;GuC/HuC firmware: Latest HEAD from linux-firmware.git&lt;UL class=""&gt;&lt;LI&gt;/lib/firmware/xe/bmg_guc_70.bin.zst&lt;/LI&gt;&lt;LI&gt;/lib/firmware/xe/bmg_huc.bin.zst&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;SYCL runtime confirms both cards as SYCL devices:&lt;UL class=""&gt;&lt;LI&gt;[level_zero:gpu][0] Intel(R) Graphics [0xe223]&lt;/LI&gt;&lt;LI&gt;[level_zero:gpu][1] Intel(R) Graphics [0xe223]&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;igsc version 0.9.3 installed and successfully enumerates both cards&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;The Specific Symptom&lt;/H2&gt;&lt;P class=""&gt;lspci -vv output for both cards (identical behavior):&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV 
class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;03:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e223]&lt;/SPAN&gt;&lt;SPAN&gt;    ...&lt;/SPAN&gt;&lt;SPAN&gt;    LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s &amp;lt;64ns, L1 &amp;lt;1us&lt;/SPAN&gt;&lt;SPAN&gt;    LnkSta: Speed 2.5GT/s, Width x1&lt;/SPAN&gt;&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;08:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e223]&lt;/SPAN&gt;&lt;SPAN&gt;    ...&lt;/SPAN&gt;&lt;SPAN&gt;    LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s &amp;lt;64ns, L1 &amp;lt;1us&lt;/SPAN&gt;&lt;SPAN&gt;    LnkSta: Speed 2.5GT/s, Width x1&lt;/SPAN&gt;&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;&lt;STRONG&gt;Critical observation:&lt;/STRONG&gt; LnkCap (the advertised maximum capability) is Speed 2.5GT/s, Width x1. This is not a link-down negotiation — the cards are telling the system they cannot go faster. 
At the LnkCap level, the ceiling is set by the downstream endpoint (the B70 card itself), not the upstream root complex.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Upstream path is healthy.&lt;/STRONG&gt; dmesg reports:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;pci 0000:01:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link&lt;/SPAN&gt;&lt;SPAN&gt;pci 0000:06:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link&lt;/SPAN&gt;&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;The CPU-to-slot path provides full PCIe 4.0 x8 bandwidth (≈16 GB/s per card). Motherboard and slot training to Gen 4 x8 is fine. The B70 cards are the link-training constraint.&lt;/P&gt;&lt;H2&gt;Variables Ruled Out (Exhaustive Platform-Side Testing)&lt;/H2&gt;&lt;OL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Riser cable signal integrity&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Tested with Conbull PCIe 5.0 riser (brand A) → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;Replaced with different-brand PCIe 5.0 riser → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Tested both cards direct in motherboard PCIE1 and PCIE2 slots, no risers → still Gen 1 x1&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;BIOS auto-negotiation&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Forced PCIe Gen 4 in BIOS (pre-update 3.33) → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;Default/Auto PCIe negotiation (post-update 4.10) → Gen 1 x1&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Motherboard BIOS / AGESA&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Updated from 3.33 (AGESA 1.2.0.3e) to 4.10 (AGESA 1.3.0.0a) → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;This is a major AGESA revision boundary. 
No change in B70 link state.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Idle power saving / ASPM&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Link state monitored via /sys/class/drm/card*/device and lspci -vv at 0.5-second intervals during active LLM inference&lt;/LI&gt;&lt;LI&gt;Link state remains Gen 1 x1 across all observed activity levels&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Driver / compute-runtime versions&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Multiple combinations tested; currently on latest upstream (26.09.37435.1 + IGC v2.30.1)&lt;/LI&gt;&lt;LI&gt;Battlemage GuC/HuC firmware loaded from current linux-firmware.git&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Slot bifurcation&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;BIOS 4.10 exposes explicit PCIe Gen 5 x16 and x8/x8 bifurcation options (new vs 3.33)&lt;/LI&gt;&lt;LI&gt;Neither affects B70 link training — both cards report the same LnkCap regardless&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;H2&gt;Variables NOT Yet Ruled Out&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;B70 on-card firmware.&lt;/STRONG&gt; Currently BMG__31.1058 on both cards. No newer firmware has been published on Intel's Linux support channels. Intel's Linux FW support article (000096950) states: &lt;EM&gt;"The Linux driver package does not update FW."&lt;/EM&gt; Community workaround of extracting firmware from the Windows driver package is available but unsupported and risks voiding warranty. 
Not attempted.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Silicon-level defect on both cards.&lt;/STRONG&gt; Low probability (identical behavior on two cards from likely-different production batches suggests pattern, not random defect) but cannot rule out entirely.&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Performance Impact&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Steady-state inference:&lt;/STRONG&gt; unaffected at this link speed because weights stay in VRAM&lt;UL class=""&gt;&lt;LI&gt;Current benchmark on Qwen3.5-9B Q8_0 via llama.cpp SYCL: 47 tok/s generation&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Model load latency:&lt;/STRONG&gt; Extended (~60 s for a ~9 GB model at Gen 1 x1, vs. &amp;lt;1 s at Gen 5 x8)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Multi-GPU tensor parallelism:&lt;/STRONG&gt; Effectively blocked. oneCCL/Level Zero cross-card transfers at 250 MB/s per card make tensor parallelism worse than single-card independent serving.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Business impact:&lt;/STRONG&gt; Cannot migrate to Intel LLM-Scaler vLLM container (26.18.8.2, released 2026-04-22) for multi-user concurrent serving, which is the documented production path for dual-B70 deployments at the advertised 140+ tok/s throughput (reference: Hal9000AIML 2×B70 setup, also Intel Project Battlematrix validation).&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Requested Engineering Response&lt;/H2&gt;&lt;OL class=""&gt;&lt;LI&gt;Is BMG__31.1058 the current production firmware for B70?&lt;/LI&gt;&lt;LI&gt;Is there a known link-training issue on early BMG-G31 cards that would cause the endpoint to advertise LnkCap Gen 1 x1? A launch-batch firmware regression would be consistent with two cards from the same SKU exhibiting identical behavior.&lt;/LI&gt;&lt;LI&gt;If a newer firmware exists, what is the supported path for updating on Linux? Can Intel provide the firmware binary (.bin file) for direct application with igsc fw update, as currently needed by workstation Linux deployments? 
The "install Windows to update" path is impractical for production AI inference servers.&lt;/LI&gt;&lt;LI&gt;If firmware is current and behavior is expected at launch, what is the expected firmware release window for a link-training fix?&lt;/LI&gt;&lt;LI&gt;If neither of the above applies, please initiate RMA evaluation for both cards.&lt;/LI&gt;&lt;/OL&gt;</description>
    <pubDate>Thu, 23 Apr 2026 23:49:18 GMT</pubDate>
    <dc:creator>gadget</dc:creator>
    <dc:date>2026-04-23T23:49:18Z</dc:date>
    <item>
      <title>Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744546#M150906</link>
      <description>&lt;P&gt;Full disclosure, I've been working with Codex and Claude on this generation/upscaling pipeline project. This current issue and following message about it has been drafted with assistance from Claude.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Subject: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629) — DPC access violation during sustained PyTorch XPU compute&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;## Project Background&lt;/P&gt;&lt;P&gt;I'm running a local AI image generation and upscaling pipeline on Windows 10, powered by an Intel Arc Pro B70. The stack includes:&lt;/P&gt;&lt;P&gt;- **ComfyUI** (embedded Python 3.13.11) for image generation via Flux.1/Flux.2 models&lt;BR /&gt;- **SUPIR** (SDXL-based image restoration/upscaling) running as a subprocess bridge&lt;BR /&gt;- **PyTorch 2.9.1+xpu / torchvision 0.24.1+xpu / torchaudio 2.9.1+xpu** (oneDNN + Level Zero backend)&lt;BR /&gt;- Mixed precision: VAE in bf16, UNet in fp16&lt;BR /&gt;- Tiled VAE encode/decode to manage VRAM, chunked SDPA (512-token blocks) to avoid OOM on large sequences&lt;/P&gt;&lt;P&gt;The B70's 32 GB VRAM is a key reason I chose this card — it handles models that won't fit on consumer GPUs.&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## The Problem&lt;/P&gt;&lt;P&gt;After several minutes of sustained high GPU utilization during SUPIR inference (tiled VAE → EDM denoising loop), the system hard crashes with **BSOD 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL)**. This has occurred **6+ times within a 24-hour window**, all on the same driver build. No Python exception precedes the crash — the process simply stops producing output, then the OS reboots.&lt;/P&gt;&lt;P&gt;Driver **32.0.101.8629** (released 2026-04-02) is confirmed installed and reported as **up to date** by Intel Pro Graphics Software. 
This is not a "update your driver" situation.&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## System Configuration&lt;/P&gt;&lt;P&gt;| Component | Detail |&lt;BR /&gt;|---|---|&lt;BR /&gt;| **GPU** | Intel Arc Pro B70 (32 GB VRAM) |&lt;BR /&gt;| **CPU** | AMD Ryzen 9 7900X 12-Core, 4.70 GHz |&lt;BR /&gt;| **Motherboard** | ASUS ROG STRIX B650E-F |&lt;BR /&gt;| **RAM** | G.Skill Flare X5 32 GB (2×16 GB) DDR5-6000 |&lt;BR /&gt;| **Storage** | Crucial P3 2 TB PCIe Gen3 NVMe M.2 |&lt;BR /&gt;| **PSU** | Seasonic Prime 750W Platinum |&lt;BR /&gt;| **OS** | Windows 10 Pro 22H2 (build 19045) |&lt;BR /&gt;| **GPU Driver** | 32.0.101.8629 (DriverDate 2026-04-01) |&lt;BR /&gt;| **PyTorch** | 2.9.1+xpu |&lt;BR /&gt;| **Level Zero runtime** | 1.14.37111 (from oneDNN log) |&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## Dump Analysis — Root Cause&lt;/P&gt;&lt;P&gt;I captured a full kernel dump and analyzed it with `cdb.exe` (WinDbg/Microsoft Debugging Tools) with Microsoft symbols. The results are unambiguous:&lt;/P&gt;&lt;P&gt;```&lt;BR /&gt;BUGCHECK_CODE: D1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL)&lt;BR /&gt;BUGCHECK_P1: ffffe301ecde483c ← invalid kernel pointer (read)&lt;BR /&gt;BUGCHECK_P2: b ← IRQL 11 (DISPATCH_LEVEL+)&lt;BR /&gt;BUGCHECK_P3: 0 ← read operation&lt;BR /&gt;BUGCHECK_P4: fffff807a0ce42c9 ← faulting RIP&lt;/P&gt;&lt;P&gt;FAULTING_MODULE: igdkmdnd64.sys&lt;BR /&gt;IMAGE_VERSION: 32.0.101.8629&lt;BR /&gt;SYMBOL_NAME: igdkmdnd64+0x3a4c29&lt;BR /&gt;FAILURE_BUCKET_ID: AV_igdkmdnd64!unknown_function&lt;BR /&gt;FAILURE_ID_HASH: {f72986a3-e8f9-3600-9d7c-dd8a40f557df}&lt;/P&gt;&lt;P&gt;Faulting instruction:&lt;BR /&gt;fffff807`a0ce42c9 4181 3c81 fd100011 cmp dword ptr [r9+rax*4], 110010FDh&lt;BR /&gt;← reads [r9+rax*4] at an invalid/freed kernel address&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;**Call stack (bottom → top):**&lt;BR /&gt;```&lt;BR /&gt;nt!KiIdleLoop&lt;BR /&gt;nt!KiRetireDpcList&lt;BR /&gt;nt!KiExecuteAllDpcs&lt;BR /&gt;dxgkrnl!DpiFdoDpcForIsr ← graphics ISR completion DPC&lt;BR 
/&gt;igdkmdnd64+0x39be8&lt;BR /&gt;igdkmdnd64+0x17cc3&lt;BR /&gt;igdkmdnd64+0x45bca0&lt;BR /&gt;igdkmdnd64+0x45bbca&lt;BR /&gt;igdkmdnd64+0x45a85a&lt;BR /&gt;igdkmdnd64+0x45c731&lt;BR /&gt;dxgkrnl!DpSynchronizeExecution&lt;BR /&gt;nt!KeSynchronizeExecution&lt;BR /&gt;igdkmdnd64+0x45c854&lt;BR /&gt;igdkmdnd64+0x3b2dd9&lt;BR /&gt;igdkmdnd64+0x395566&lt;BR /&gt;igdkmdnd64+0x3969ed&lt;BR /&gt;igdkmdnd64+0x3a07e6&lt;BR /&gt;igdkmdnd64+0x3a3bcd&lt;BR /&gt;igdkmdnd64+0x3a4c29 ← ACCESS VIOLATION&lt;BR /&gt;nt!KiPageFault&lt;BR /&gt;nt!KiBugCheckDispatch&lt;BR /&gt;nt!KeBugCheckEx&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;The crash is inside the **graphics ISR completion DPC** — the code path exercised by sustained Level Zero command-list submission. The KMD is dereferencing what appears to be a stale or freed object pointer (`r9`) during DPC synchronization after prolonged compute submission.&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## Reproduction Pattern&lt;/P&gt;&lt;P&gt;1. Load SDXL-class model to XPU in bf16/fp16 mixed precision (~9 GB resident)&lt;BR /&gt;2. Run tiled VAE encode over ~956×1399 image at tile_size=1024 → ~546 kernel submissions, ~10 s **(succeeds)**&lt;BR /&gt;3. Run tiled VAE decode of latent (352×240) at decoder_tile_size=128 → ~738 kernel submissions, ~10 s **(succeeds)**&lt;BR /&gt;4. Re-encode stage-1 output → ~546 submissions, ~3 s **(succeeds)**&lt;BR /&gt;5. Enter 35-step EDM denoise loop (full SDXL UNet forward per step, chunked attention, many oneDNN matmul/conv primitives); GPU sustains ~100% utilization&lt;BR /&gt;6. 
After several minutes (typically between step 3 and step 20 by elapsed-time correlation), system dies — no Python exception, bridge process stops, OS reboots&lt;/P&gt;&lt;P&gt;**Crash timeline — all same bugcheck 0xD1, same driver 32.0.101.8629:**&lt;BR /&gt;```&lt;BR /&gt;2026-04-15 01:45:22 (MEMORY.DMP attached from this crash)&lt;BR /&gt;2026-04-15 00:51:05&lt;BR /&gt;2026-04-15 00:04:50&lt;BR /&gt;2026-04-14 22:40:59&lt;BR /&gt;2026-04-14 19:40:06&lt;BR /&gt;2026-04-14 17:35:32&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## What Has Been Ruled Out&lt;/P&gt;&lt;P&gt;| Hypothesis | Test | Result |&lt;BR /&gt;|---|---|---|&lt;BR /&gt;| TDR timeout | Set `TdrDelay=60`, `TdrDdiDelay=60`, rebooted | Still crashed — not a timeout |&lt;BR /&gt;| Python/user-space bug | Moved UNet denoise entirely to CPU (fp32) | Still BSODed — crash is in KMD regardless |&lt;BR /&gt;| Out of memory | Monitor VRAM during run | &amp;lt;12 GB used; no OOM exceptions |&lt;BR /&gt;| Stale driver | Checked Intel Pro Graphics Software | 32.0.101.8629 = latest available |&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## Associated User-Space Errors (Possibly Related)&lt;/P&gt;&lt;P&gt;Prior runs logged oneDNN errors before the crash path matured:&lt;BR /&gt;```&lt;BR /&gt;oneDNN error: CL_INVALID_BINARY at src/gpu/intel/ocl/engine.cpp:269&lt;BR /&gt;RuntimeError: could not create a primitive&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;This suggests the SPIR-V JIT / Level Zero kernel binary pipeline can produce a binary the runtime rejects. 
We've applied env-var cache suppression as a local mitigation (`NEO_CACHE_PERSISTENT=0`, `SYCL_CACHE_PERSISTENT=0`, `ZE_ENABLE_LOADER_CACHE=0`, `MKLDNN_PRIMITIVE_CACHE_CAPACITY=0`, `ONEDNN_PRIMITIVE_CACHE_CAPACITY=0`) but BSODs continue.&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## Attachments&lt;/P&gt;&lt;P&gt;- **`MEMORY.DMP`** — full kernel dump, 2026-04-15 01:45:10 (~2.4 GB, since this exceeds upload limit, I cannot attach it but it is available)&lt;BR /&gt;- **`analyze_20260415.txt`** — complete `!analyze -v` + `lmkv` output from `cdb.exe`&lt;BR /&gt;- Bridge process log with `[supir-phase]` breadcrumbs available on request — gives the exact Level Zero submission call site at the moment of crash on the next repro&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;## Ask&lt;/P&gt;&lt;P&gt;1. Please match `igdkmdnd64+0x3a4c29` against Intel private symbols to identify the freed/stale object being dereferenced&lt;BR /&gt;2. If a fix exists in a beta/Xe2 insider branch or a newer Pro B-series driver build, please share a build number to test against&lt;BR /&gt;3. If a known workaround exists (Level Zero command-list batching knob, KMD feature flag, submission frequency limit), please advise&lt;/P&gt;&lt;P&gt;Happy to run any additional diagnostic captures or test against a driver drop. This is a reproducible, kernel-confirmed defect on the latest shipping driver.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Apr 2026 03:11:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744546#M150906</guid>
      <dc:creator>RyanR3</dc:creator>
      <dc:date>2026-04-16T03:11:03Z</dc:date>
    </item>
    <item>
      <title>Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744670#M150932</link>
      <description>&lt;P&gt;Hi RyanR3,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thank you for reaching out.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;I appreciate your transparency about working with Codex and Claude on your generation/upscaling pipeline project.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;To help us better understand and troubleshoot the issue you're experiencing, would it be possible for you to share a sample project or some code snippets? Having something we can reproduce on our end would significantly help us identify what's causing the problem and work toward a solution.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Additionally, are you working as a developer, or is this more of a personal/research project?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;This information will help me determine the cause and provide appropriate troubleshooting steps.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We're looking forward to your response.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Best regards,&lt;/P&gt;&lt;P&gt;Jonzyl B.&lt;/P&gt;&lt;P&gt;Intel Customer Support Technician&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 16 Apr 2026 22:14:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744670#M150932</guid>
      <dc:creator>Jonzyl_Intel</dc:creator>
      <dc:date>2026-04-16T22:14:50Z</dc:date>
    </item>
    <item>
      <title>Re: Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744689#M150934</link>
      <description>&lt;P&gt;Thank you for the quick response. I hope you don't mind but I used Claude to try to get the exact information on what you're needing. This is a personal/enthusiast project and my code experience is &lt;STRONG&gt;long&lt;/STRONG&gt; out of date (think Web 1.0 &lt;span class="lia-unicode-emoji" title=":face_with_tears_of_joy:"&gt;😂&lt;/span&gt;) so any issues with code structure are purely on Codex/Claude. I've used those models to&amp;nbsp;build a local AI image generation and upscaling pipeline for home use because I didn't like the limitations of a lot of stuff out there, like Comfy's visual spaghetti mess, and wanted something that was more intuitive for my use while making use of Comfy's mature backend.&lt;BR /&gt;&lt;BR /&gt;I had been working with a 4060ti 16GB previously running Forge UI with SDXL and Flux.1. I decided to upgrade hardware with the launch of the B70, and expand into working with Flux.2, SUPIR, hopefully Wan video, and possibly other LLMs with the extra freedom the 32GB VRAM in B70 offers.&lt;BR /&gt;&lt;BR /&gt;Here's what Claude helped me put together for you. If it left anything out, please let me know and I'll try to provide it:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;1. Why Code Snippets May Not Be the Full Picture&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;I want to be transparent: the crash is in igdkmdnd64.sys kernel-mode code, confirmed by full dump analysis with cdb.exe and Microsoft symbols. The Python application is not throwing an exception — the OS crashes inside a graphics ISR completion DPC. That said, I can provide both a minimal standalone reproduction script and the actual modified source files so your team can understand the submission pattern that triggers it.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;2. Minimal Standalone Reproduction Script&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;This does not require SUPIR or any AI models. 
It mimics the same sustained Level Zero compute pattern (chunked attention + convolution) that reliably triggers the crash on my system. Expected runtime before crash: 3–10 minutes at ~100% GPU utilization.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;"""
Minimal XPU stress repro for igdkmdnd64.sys BSOD (0xD1).
Tested on: Intel Arc Pro B70, driver 32.0.101.8629, PyTorch 2.9.1+xpu, Windows 10 22H2.
Expected: system crashes with DRIVER_IRQL_NOT_LESS_OR_EQUAL after several minutes.
"""
 
import torch
import time
import gc
 
assert hasattr(torch, "xpu") and torch.xpu.is_available(), "XPU not available"
device = torch.device("xpu")
print(f"Device : {torch.xpu.get_device_name(0)}")
print(f"PyTorch: {torch.__version__}")
print("Starting sustained XPU compute -- expect BSOD within ~3-10 minutes on affected driver.\n")
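# (Added sketch, not part of the original repro script.) Because the BSOD leaves
# no Python traceback, persisting the last completed step to disk lets the next
# boot correlate the step count with the bugcheck timestamp; the file name and
# helper below are hypothetical additions:

```python
import os

def log_heartbeat(step, elapsed, path="xpu_repro_heartbeat.txt"):
    """Record the last completed step durably; os.fsync pushes the write past
    the OS cache so the record survives a hard reset."""
    with open(path, "w") as hb:
        hb.write(f"last_completed_step={step} elapsed={elapsed:.1f}s\n")
        hb.flush()
        os.fsync(hb.fileno())
```

# If used, call log_heartbeat(step + 1, elapsed) just after the per-step print
# in the loop below.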
 
start = time.time()
 
for step in range(500):
    # --- Chunked multi-head attention (mirrors SDXL UNet CrossAttention on XPU) ---
    # seq_len=84480 is the actual token count for a 352x240 latent; use 8192 here
    # for a faster repro that still exercises the same kernel submission path.
    B, H, N, D = 1, 8, 8192, 160          # batch, heads, seq_len, head_dim
    q = torch.randn(B, H, N, D, dtype=torch.float16, device=device)
    k = torch.randn(B, H, N, D, dtype=torch.float16, device=device)
    v = torch.randn(B, H, N, D, dtype=torch.float16, device=device)
 
    CHUNK = 512
    scale = D ** -0.5
    chunks = []
    for i in range(0, N, CHUNK):
        q_c  = q[:, :, i:i + CHUNK]
        attn = torch.einsum("bhid,bhjd-&amp;gt;bhij", q_c, k) * scale
        attn = attn.softmax(dim=-1)
        chunks.append(torch.einsum("bhij,bhjd-&amp;gt;bhid", attn, v))
        del attn, q_c
    out = torch.cat(chunks, dim=2)
    del q, k, v, chunks, out
 
    # --- Conv2d block (mirrors UNet ResBlock) ---
    x    = torch.randn(1, 512, 64, 64, dtype=torch.float16, device=device)
    conv = torch.nn.Conv2d(512, 512, 3, padding=1).half().to(device)
    y    = conv(x)
    del x, y, conv
 
    elapsed = time.time() - start
    print(f"Step {step+1:&amp;gt;4} | {elapsed:6.1f}s elapsed", flush=True)
 
    if step % 20 == 19:
        torch.xpu.empty_cache()
        gc.collect()
 
print("\nDone -- no crash this run.")&lt;/LI-CODE&gt;&lt;H4&gt;To Run&lt;/H4&gt;&lt;LI-CODE lang="markup"&gt;pip install torch==2.9.1+xpu torchvision==0.24.1+xpu torchaudio==2.9.1+xpu \
    --index-url https://download.pytorch.org/whl/xpu
python xpu_stress_repro.py&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;3. Actual Application Code — Key Modified Files&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;These are the three files our pipeline modifies from the &lt;STRONG&gt;stock SUPIR repository&lt;/STRONG&gt; that are relevant to this crash:&lt;/P&gt;&lt;H4&gt;3.1&amp;nbsp; tilevae.py — XPU VRAM Detection + oneDNN Primitive Fallback&lt;/H4&gt;&lt;LI-CODE lang="markup"&gt;# XPU-aware VRAM detection (added; stock code was CUDA-only)
def _get_gpu_total_memory_mb():
    if torch.cuda.is_available():
        return torch.cuda.get_device_properties(devices.device).total_memory // 2**20
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.xpu.get_device_properties(devices.device).total_memory // 2**20
    return 0
 
# Decoder tile size fix (CRITICAL -- stock code passed pixel-space tile_size to decoder,
# which operates in latent space at 1/8 scale, causing tiling to be skipped entirely
# and triggering one massive untiled forward pass -&amp;gt; earlier BSOD path)
def get_recommend_decoder_tile_size():
    total_memory = _get_gpu_total_memory_mb()
    if   total_memory &amp;gt; 30*1000: return 256   # B70 with 32 GB lands here
    elif total_memory &amp;gt; 16*1000: return 192
    elif total_memory &amp;gt; 12*1000: return 128
    elif total_memory &amp;gt; 8*1000:  return 96
    else:                        return 64
 
# oneDNN primitive failure fallback (added in attn_forward)
try:
    h_ = _compute_attention(q, k, v)
except RuntimeError as err:
    msg = str(err)
    if ("could not create a primitive" not in msg) and ("CL_INVALID_BINARY" not in msg):
        raise
    # Intel XPU oneDNN intermittently fails primitive creation for bmm.
    # Fall back to CPU for this block.
    print("[Tiled VAE] attn_forward fallback to CPU after XPU primitive creation failure")
    h_ = _compute_attention(q.float().cpu(), k.float().cpu(), v.float().cpu()
         ).to(device=v.device, dtype=v.dtype)&lt;/LI-CODE&gt;&lt;H4&gt;3.2&amp;nbsp; attention.py — Chunked SDPA for XPU (No Flash Attention Backend)&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/H4&gt;&lt;LI-CODE lang="markup"&gt;# XPU has no flash/efficient SDPA backend. Naive path for large sequences
# (e.g. 84,480 tokens from a 352x240 latent) tries to allocate a 49.85 GB
# attention matrix and OOMs. We chunk along the query dimension instead.
 
_N      = q.shape[2]
_is_xpu = q.device.type == "xpu"
 
if _is_xpu and mask is None and _N &amp;gt; 2048:
    _CHUNK = 512          # ~3.2 GB per chunk on B70; well within free VRAM
    scale  = q.shape[-1] ** -0.5
    out_chunks = []
    for _i in range(0, _N, _CHUNK):
        q_c    = q[:, :, _i:_i + _CHUNK]
        attn_c = torch.einsum("b h i d, b h j d -&amp;gt; b h i j", q_c, k) * scale
        attn_c = attn_c.softmax(dim=-1)
        out_c  = torch.einsum("b h i j, b h j d -&amp;gt; b h i d", attn_c, v)
        out_chunks.append(out_c)
        del attn_c, q_c
    out = torch.cat(out_chunks, dim=2)
    del out_chunks
else:
    # CUDA path unchanged; nullcontext used on XPU (sdp_kernel is CUDA-only)
    with sdp_kernel(**BACKEND_MAP[self.backend]):
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)&lt;/LI-CODE&gt;&lt;H4&gt;3.3&amp;nbsp; supir_worker.py — Staged Model Load + Decoder Tile Size Fix&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/H4&gt;&lt;LI-CODE lang="markup"&gt;# Decoder tile size fix applied at load time:
# pixel-space tile_size (e.g. 1024) -&amp;gt; latent-space (//8), clamped [64, 256]
_decoder_ts = max(64, min(normalized_tile_size // 8, 256))
model.init_tile_vae(
    encoder_tile_size=normalized_tile_size,   # 1024 -&amp;gt; encoder stays pixel-space
    decoder_tile_size=_decoder_ts,            # 1024 -&amp;gt; 128 for decoder
)
 
# VAE in bf16, UNet/denoiser in fp16 (matches runtime YAML config).
# Blanket .half() previously put the VAE in fp16 while autocast wrapped
# it in bf16 -&amp;gt; dtype mismatch -&amp;gt; unnecessary oneDNN kernel recompilation.
_bf16_names = {"first_stage_model"}
for name, child in model.named_children():
    target_dtype = torch.bfloat16 if name in _bf16_names else torch.float16
    child.to(dtype=target_dtype)
 
# Staged device transfer: move submodules individually with XPU cache flushes
# rather than a single model.to(device) for ~12 GB.
for name, child in model.named_children():
    mb = sum(p.numel() * p.element_size() for p in child.parameters()) / 1024**2
    child.to(device)
    if mb &amp;gt; 100:
        torch.xpu.empty_cache()
        time.sleep(0.5)&lt;/LI-CODE&gt;&lt;H4&gt;&lt;BR /&gt;&lt;STRONG&gt;4. Cache Suppression Environment Variables&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Applied as a mitigation — does not prevent the crash, but reduces poisoned JIT cache risk:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;NEO_CACHE_PERSISTENT=0
SYCL_CACHE_PERSISTENT=0
ZE_ENABLE_LOADER_CACHE=0
MKLDNN_PRIMITIVE_CACHE_CAPACITY=0
ONEDNN_PRIMITIVE_CACHE_CAPACITY=0&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;5. Attachments Available&lt;/STRONG&gt;&lt;/H4&gt;&lt;TABLE width="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;&lt;STRONG&gt;File&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;&lt;STRONG&gt;Description&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;MEMORY.DMP (~2.4 GB)&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Full kernel dump from confirmed crash, 2026-04-15 01:45:10&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;analyze_20260415.txt&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Complete !analyze -v output pinpointing igdkmdnd64+0x3a4c29&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;Bridge log with [supir-phase] breadcrumbs&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Available on next repro — correlates exact Level Zero submission with fault moment&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&lt;BR /&gt;Please advise on your preferred method to transfer the dump file, if needed, as it will exceed forum upload limits.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;6. Original Bug Report (INTEL_BUG_REPORT.md)&lt;/STRONG&gt;&lt;/H4&gt;&lt;H4&gt;&lt;STRONG&gt;Title&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys 32.0.101.8629 — DPC AV during sustained PyTorch 2.9.1+xpu (oneDNN/Level Zero) compute&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Severity&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;High — reproducible system crash on current latest driver. 
Kernel-mode fault, not recoverable by application restart.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;On Windows 10 (19045) with an Intel Arc Pro B70, running sustained PyTorch 2.9.1+xpu compute (SUPIR image upscaling: tiled VAE encode/decode + 35-step EDM denoise loop), the system reliably blue-screens with bugcheck 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL) after several minutes of high GPU utilization. Full kernel dump captured; !analyze -v pinpoints igdkmdnd64.sys reading an invalid kernel pointer inside a DPC off the graphics ISR.&lt;/P&gt;&lt;P&gt;This reproduces on driver 32.0.101.8629 (released 2026-04-02), reported by Intel Pro Graphics Software as the latest available for this GPU.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Hardware / OS&lt;/STRONG&gt;&lt;/H4&gt;&lt;TABLE width="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;&lt;STRONG&gt;Component&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;&lt;STRONG&gt;Detail&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;GPU&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Intel(R) Arc(TM) Pro B70 Graphics&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;PCI ID&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;VEN_8086 DEV_E223 SUBSYS_17018086 REV_00&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;GPU Driver&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;32.0.101.8629 (DriverDate 2026-04-01)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;VRAM&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;32 GB&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;CPU&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;AMD Ryzen 9 7900X 12-Core, 4.70 GHz&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;Motherboard&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;ASUS ROG STRIX 
B650E-F&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;RAM&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;G.Skill Flare X5 32 GB (2x16 GB) DDR5-6000&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;Storage&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Crucial P3 2 TB PCIe Gen3 NVMe M.2&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;PSU&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Seasonic Prime 750W Platinum&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;OS&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Windows 10 Pro 22H2 64-bit, 10.0.19045&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;BIOS&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;AMI SMBIOS 3842 (2026-03-10)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H4&gt;&lt;STRONG&gt;&lt;BR /&gt;Software Stack&lt;/STRONG&gt;&lt;/H4&gt;&lt;TABLE width="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;&lt;STRONG&gt;Item&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;&lt;STRONG&gt;Value&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;Python&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;3.13.11 (Comfy embedded)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;PyTorch&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;2.9.1+xpu&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;torchvision&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;0.24.1+xpu&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;torchaudio&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;2.9.1+xpu&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;Level Zero runtime&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;1.14.37111 (from oneDNN log)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD 
width="233"&gt;&lt;P&gt;Framework&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;SUPIR (SDXL-based image restoration)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;Attention path&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;F.scaled_dot_product_attention, chunked by 512-token query blocks on XPU&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;VAE dtype&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;bf16 (ae_dtype)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="233"&gt;&lt;P&gt;UNet dtype&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;fp16 (diffusion_dtype)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H4&gt;&lt;STRONG&gt;&lt;BR /&gt;Bugcheck Details (from !analyze -v)&lt;/STRONG&gt;&lt;/H4&gt;&lt;LI-CODE lang="markup"&gt;BUGCHECK_CODE:        D1  (DRIVER_IRQL_NOT_LESS_OR_EQUAL)
BUGCHECK_P1:          ffffe301ecde483c   (read address -- invalid kernel pointer)
BUGCHECK_P2:          b                  (IRQL 11 -- DISPATCH_LEVEL+)
BUGCHECK_P3:          0                  (read)
BUGCHECK_P4:          fffff807a0ce42c9   (faulting RIP)
 
PROCESS_NAME:         System (DPC context)
FAULTING_THREAD:      ffffe301e9ecb640
READ_ADDRESS:         ffffe301ecde483c
 
IP_IN_PAGED_CODE:     igdkmdnd64+0x3a4c29
  fffff807`a0ce42c9   4181 3c81 fd100011   cmp dword ptr [r9+rax*4], 110010FDh
 
SYMBOL_NAME:          igdkmdnd64+0x3a4c29
MODULE_NAME:          igdkmdnd64
IMAGE_NAME:           igdkmdnd64.sys
IMAGE_VERSION:        32.0.101.8629
FAILURE_BUCKET_ID:    AV_igdkmdnd64!unknown_function
FAILURE_ID_HASH:      {f72986a3-e8f9-3600-9d7c-dd8a40f557df}&lt;/LI-CODE&gt;&lt;H4&gt;&lt;STRONG&gt;Call Stack (bottom to top)&lt;/STRONG&gt;&lt;/H4&gt;&lt;LI-CODE lang="markup"&gt;nt!KiIdleLoop+0x9e
nt!KiRetireDpcList+0x1f4
nt!KiExecuteAllDpcs+0x30e
dxgkrnl!DpiFdoDpcForIsr+0x66            &amp;lt;- graphics ISR completion DPC
igdkmdnd64+0x39be8
igdkmdnd64+0x17cc3
igdkmdnd64+0x45bca0
igdkmdnd64+0x45bbca
igdkmdnd64+0x45a85a
igdkmdnd64+0x45c731
dxgkrnl!DpSynchronizeExecution+0xac
nt!KeSynchronizeExecution+0x48
igdkmdnd64+0x45c854
igdkmdnd64+0x3b2dd9
igdkmdnd64+0x395566
igdkmdnd64+0x3969ed
igdkmdnd64+0x3a07e6
igdkmdnd64+0x3a3bcd
igdkmdnd64+0x3a4c29                      &amp;lt;- ACCESS VIOLATION here
nt!KiPageFault+0x478
nt!KiBugCheckDispatch+0x69
nt!KeBugCheckEx&lt;/LI-CODE&gt;&lt;H4&gt;&lt;BR /&gt;Reproduction Pattern&lt;/H4&gt;&lt;OL&gt;&lt;LI&gt;Load SDXL-class model to XPU in bf16/fp16 mixed precision (~9 GB resident).&lt;/LI&gt;&lt;LI&gt;Run tiled VAE encode over a ~956x1399 image at tile_size=1024 -- ~546 kernel submissions, ~8-12s (succeeds).&lt;/LI&gt;&lt;LI&gt;Run tiled VAE decode of latent (352x240) at decoder_tile_size=128 -- ~738 kernel submissions, ~10-12s (succeeds).&lt;/LI&gt;&lt;LI&gt;Re-encode stage-1 image -- ~546 submissions, ~3s (succeeds).&lt;/LI&gt;&lt;LI&gt;Enter 35-step EDM denoise loop; each step runs a full SDXL UNet forward (chunked attention, many oneDNN matmul/conv primitives). GPU stays at ~100% utilization.&lt;/LI&gt;&lt;LI&gt;After several minutes of denoising (usually between step 3 and step 20), the system dies with the bugcheck above. No Python exception precedes it.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;STRONG&gt;Crash timeline — all same bugcheck 0xD1, same driver 32.0.101.8629:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;2026-04-15 01:45:22   (MEMORY.DMP attached from this crash)
2026-04-15 00:51:05
2026-04-15 00:04:50
2026-04-14 22:40:59
2026-04-14 19:40:06
2026-04-14 17:35:32&lt;/LI-CODE&gt;&lt;H4&gt;&lt;STRONG&gt;&lt;BR /&gt;What Has Been Ruled Out&lt;/STRONG&gt;&lt;/H4&gt;&lt;TABLE width="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="187"&gt;&lt;P&gt;&lt;STRONG&gt;Hypothesis&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;&lt;STRONG&gt;Test&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;&lt;STRONG&gt;Result&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="187"&gt;&lt;P&gt;Not a TDR timeout&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;Set TdrDelay=60 and TdrDdiDelay=60, rebooted&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;Still crashed — not a timeout&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="187"&gt;&lt;P&gt;Not a Python/user-space bug&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;Moved UNet denoise entirely to CPU (fp32)&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;Still BSODed — crash is in KMD regardless&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="187"&gt;&lt;P&gt;Out of memory&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;Monitored VRAM during run&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;Less than 12 GB used; no OOM exceptions&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="187"&gt;&lt;P&gt;Stale driver&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;Checked Intel Pro Graphics Software&lt;/P&gt;&lt;/TD&gt;&lt;TD width="219"&gt;&lt;P&gt;32.0.101.8629 = latest available&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H4&gt;&lt;BR /&gt;&lt;STRONG&gt;Associated User-Space Errors (Possibly Related)&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Prior runs logged oneDNN errors before the crash path matured:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;oneDNN error: CL_INVALID_BINARY at src/gpu/intel/ocl/engine.cpp:269
RuntimeError: could not create a primitive&lt;/LI-CODE&gt;&lt;P&gt;This suggests the SPIR-V JIT / Level Zero kernel binary pipeline can produce a binary the runtime rejects. Cache-suppression environment variables have been applied as a local mitigation, but BSODs continue.&lt;/P&gt;</description>
      <pubDate>Fri, 17 Apr 2026 00:21:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744689#M150934</guid>
      <dc:creator>RyanR3</dc:creator>
      <dc:date>2026-04-17T00:21:24Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744775#M150950</link>
      <description>&lt;P&gt;Some additional findings I thought I'd pass along in case they're helpful.&amp;nbsp;&lt;SPAN&gt;In further personal testing, I have found that using SUPIR to upscale a 730x730 image 2x to 1472x1472 will &lt;STRONG&gt;not&lt;/STRONG&gt; crash the system, but attempting to upscale that same 730x730 image 4x will BSOD, as will attempting to upscale a 1472x1472 image only 2x.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Apr 2026 14:41:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744775#M150950</guid>
      <dc:creator>RyanR3</dc:creator>
      <dc:date>2026-04-17T14:41:31Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744989#M150988</link>
      <description>&lt;P&gt;Further testing this weekend has identified what would be considered "high-risk" upscaling runs that lead to the BSOD condition.&lt;BR /&gt;&lt;BR /&gt;Report compiled by Claude over a couple of test runs:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;High-Risk Workload Profile&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;&lt;EM&gt;Master Forge / SUPIR on Intel Arc Pro B70&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Conditions under which the Intel Arc Pro B70 KMD (igdkmdnd64.sys) has been observed to BSOD with a DPC AV (0xD1) in this application.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Hardware / Driver Context&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;GPU:&lt;/STRONG&gt; Intel Arc Pro B70 (Battlemage), 32 GB&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Driver:&lt;/STRONG&gt; igdkmdnd64.sys — faulting module across all observed crashes&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Stack:&lt;/STRONG&gt; PyTorch 2.9.1+xpu, Level Zero / oneDNN, Python 3.12, Windows 11&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Fault signature:&lt;/STRONG&gt; DRIVER_IRQL_NOT_LESS_OR_EQUAL (0xD1) at igdkmdnd64+0x3a4c29 inside a DPC&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Workload Class&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;SDXL-family UNet (SUPIR fork)&lt;/STRONG&gt; with CFG-batched chunked SDPA&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Dtype:&lt;/STRONG&gt; bf16 for both UNet and latent tensors (previously fp16, but the bf16 switch did not resolve the crash)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Attention path:&lt;/STRONG&gt; naive (non-flash / non-xformers) chunked scaled-dot-product implemented in fp32 with stabilized softmax — XPU has no flash/efficient SDPA backend&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;VAE:&lt;/STRONG&gt; tiled decode/encode, tile_size ≤ 768&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Submission-Rate Risk Factors (the smoking gun)&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Attention sequence length N &amp;gt; ~130,000 tokens&lt;/STRONG&gt; (latent 368×368 or 
larger)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Per-attention-call chunk count &amp;gt; ~256&lt;/STRONG&gt; when running the chunked-SDPA fallback&lt;/LI&gt;&lt;LI&gt;Observed crash: ~530 chunks × CFG-batch × ~70 attention blocks × 35 steps ≈ ~1.3 M submissions across a single image&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Sustained Level Zero submissions over several minutes&lt;/STRONG&gt; — faults typically surface 1–5 minutes into a heavy run&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Observed latency anomaly:&lt;/STRONG&gt; denoise step 0 took &lt;STRONG&gt;113.6 s&lt;/STRONG&gt; on a 368×368 latent vs &lt;STRONG&gt;4.2 s&lt;/STRONG&gt; on a 168×168 latent in the same session (27× slowdown = GPU saturation signal preceding the BSOD)&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Output-Size Risk Thresholds (empirical)&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;High-risk output (warn):&lt;/STRONG&gt; output max_dim ≥ 1792 OR output pixels ≥ 4 Mpix&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Crash-zone output (block):&lt;/STRONG&gt; output max_dim &amp;gt; 2048 OR output pixels &amp;gt; 5 Mpix&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Known-crashing inputs:&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;655×655 → 1344×1344 at tile_size=1024, upscale=2, steps=35 (§52 of our logs — 1.8 Mpix output, below the old “high-risk” gate)&lt;/LI&gt;&lt;LI&gt;1472×1472 → 2944×2944 at tile_size=768, upscale=2, steps=35 (§55 — 8.66 Mpix output, crash on first image of batch)&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Tiling / Memory Pressure&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;tile_size=1024 triggers the fault even on sub-2-Mpix outputs&lt;/LI&gt;&lt;LI&gt;tile_size=768 mitigates only the VAE path — UNet attention still saturates the KMD on large latents&lt;/LI&gt;&lt;LI&gt;Occurs well below VRAM limits (~9.3 GB resident, ~30% of 32 GB) — &lt;STRONG&gt;not&lt;/STRONG&gt; an OOM, it’s a submission-rate / queue-depth issue&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Temporal / Batch Patterns That Elevate 
Risk&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Second consecutive run&lt;/STRONG&gt; within ~5 s of a previous completion (resolved in-app with an 8 s inter-run cool-down, but the vulnerability is driver-side)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Back-to-back high-risk images&lt;/STRONG&gt; in a batch with no inter-image settle (resolved in-app with a 6 s per-image cool-down after any job with output ≥ 1792 px / 4 Mpix)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Model reload + immediate heavy attention call&lt;/STRONG&gt; — the second run after a reload has been observed to fault more often than the first&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;What Does NOT Trigger It (ruled out)&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;Not an fp16 overflow (reproduced on bf16)&lt;/LI&gt;&lt;LI&gt;Not an OOM (VRAM stays ~30%)&lt;/LI&gt;&lt;LI&gt;Not specific to 4× upscale (2× is sufficient when the output is large enough)&lt;/LI&gt;&lt;LI&gt;Not specific to a single image content (reproduced across unrelated source images)&lt;/LI&gt;&lt;LI&gt;Not ComfyUI / Flux generation — SUPIR is the observed reproducer, but the common factor is &lt;STRONG&gt;sustained chunked-SDPA submissions on XPU&lt;/STRONG&gt;, so any SDXL/SDXL-derivative workload without a flash-attention backend on B70 is a theoretical candidate&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Minimal Repro Recipe&lt;/STRONG&gt;&lt;/H4&gt;&lt;OL&gt;&lt;LI&gt;SUPIR (or any SDXL CrossAttention fork) with naive chunked-SDPA fallback on XPU&lt;/LI&gt;&lt;LI&gt;Input latent ≥ 168×168 (N ≥ 28,224 tokens) is sufficient for §52’s crash at tile_size=1024&lt;/LI&gt;&lt;LI&gt;bf16 UNet + bf16 latent&lt;/LI&gt;&lt;LI&gt;35 denoise steps, CFG-batched (cond + uncond), num_samples=1&lt;/LI&gt;&lt;LI&gt;Run twice back-to-back; second run reliably crashes within ~60 s on the affected machine&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Sun, 19 Apr 2026 23:58:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1744989#M150988</guid>
      <dc:creator>RyanR3</dc:creator>
      <dc:date>2026-04-19T23:58:30Z</dc:date>
    </item>
    <item>
      <title>Re:Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745090#M151007</link>
      <description>&lt;P&gt;Hi&amp;nbsp;RyanR3,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thank you for providing this information.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will do further research on this matter and post the response on this thread once it is available.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you have questions, please let us know. Thank you.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;&lt;P&gt;Jonzyl B.&lt;/P&gt;&lt;P&gt;Intel Customer Support Technician&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 20 Apr 2026 19:21:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745090#M151007</guid>
      <dc:creator>Jonzyl_Intel</dc:creator>
      <dc:date>2026-04-20T19:21:04Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745308#M151055</link>
      <description>&lt;P&gt;Wanted to update with additional information that may be useful.&lt;BR /&gt;&lt;BR /&gt;My reports so far have revolved around image upscaling. While waiting for word on that issue, I decided to try some other AI-related tasks. After some trial and error with OOM and memory management, I've gotten successful image generation in Flux.1 Dev FP8 (I was only using NF4 on my 4060ti) and Flux.2 Dev FP4. Moving on from that, I decided to try some LoRA training, starting with Flux.2 since it is the more demanding of the two.&lt;BR /&gt;&lt;BR /&gt;I'm now running into the same BSOD under LoRA training as I was with upscaling; the trigger again appears to be sustained XPU load.&lt;BR /&gt;&lt;BR /&gt;I understand that the B70 is brand-new tech, that early adopters always pay a bit of a "penalty" in running into new issues, and that you are all working hard on this issue (and others). But the B70 is billed as a pro-level workhorse, and if I'm being honest, it's letting me down on what should be fairly basic, mainstream AI work; I haven't even tried video generation yet. All of this should be well within reach of a card with 32GB of VRAM to work with. 
For perspective, I could pretty easily run 100 batch, 45 step, 2MP Flux.1 image generation runs on my 4060ti 16GB for hours on end with zero issue, while trying to get what I'd consider performance commensurate with a 32GB card seems to just be a continual fight because it seems like the B70 can't handle sustained workload.&lt;BR /&gt;&lt;BR /&gt;Regardless, just wanted to provide some customer feedback on my B70 experience so far before getting on to the latest information which I had Codex help compile into a new report after the failed LoRA training:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Driver Stability Update: Escalation from User-Mode AV to BSOD (0xD1)&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Prepared April 22, 2026 | Driver branch in use: 32.0.101.x | Faulting kernel module: igdkmdnd64.sys&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have new telemetry that strengthens the case for a driver/runtime stability defect under sustained XPU compute load.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;H4&gt;&lt;STRONG&gt;New Incident Summary (April 21-22, 2026)&lt;/STRONG&gt;&lt;/H4&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;A) Repeated user-mode access violations&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;- python.exe crashes with Exception code 0xC0000005&lt;/P&gt;&lt;P&gt;- Faulting module: torch\lib\c10.dll&lt;/P&gt;&lt;P&gt;- Repeated fault signature:&lt;/P&gt;&lt;P&gt;&amp;nbsp; - c10.dll fault offset: 0x000000000008f514&lt;/P&gt;&lt;P&gt;&amp;nbsp; - Example event times:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - April 21, 2026 11:13:34 PM (Application Error 1000)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - April 22, 2026 12:21:45 AM (Application Error 1000)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;B) Follow-on kernel bugcheck&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;- System later BSODs with 0x000000D1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL)&lt;/P&gt;&lt;P&gt;- WER SystemErrorReporting confirms 
dump creation:&lt;/P&gt;&lt;P&gt;&amp;nbsp; - April 22, 2026 12:50:13 AM&lt;/P&gt;&lt;P&gt;&amp;nbsp; - Bugcheck params:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - P1=0xffff89820710a83c&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - P2=0x000000000000000b&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - P3=0x0000000000000000&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - P4=0xfffff8018d2f42c9&lt;/P&gt;&lt;P&gt;- Prior matching D1 event also present:&lt;/P&gt;&lt;P&gt;&amp;nbsp; - April 20, 2026 5:42:32 PM&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;H4&gt;&lt;STRONG&gt;Observed Failure Pattern&lt;/STRONG&gt;&lt;/H4&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Under sustained XPU workload, failure now appears as a two-stage progression:&lt;/P&gt;&lt;P&gt;1) User-mode AV in PyTorch runtime (c10.dll, 0xC0000005)&lt;/P&gt;&lt;P&gt;2) System instability escalation to kernel bugcheck 0xD1 (igdkmdnd64 path)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is materially different from normal application exceptions and strongly indicates a low-level runtime/driver fault path.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;H4&gt;&lt;STRONG&gt;Correlated Runtime Telemetry&lt;/STRONG&gt;&lt;/H4&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;- Long-running compute phase proceeds normally for several minutes.&lt;/P&gt;&lt;P&gt;- Logs then end abruptly during active compute (no graceful Python exception path).&lt;/P&gt;&lt;P&gt;- In the latest run, workload entered active iteration and continued until abrupt termination; system subsequently recorded unexpected shutdown and bugcheck events.&lt;/P&gt;&lt;P&gt;- This behavior is consistent with runtime/device state corruption rather than deterministic script failure.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;H4&gt;&lt;STRONG&gt;Why This Points to Driver/Runtime (Not App 
Logic)&lt;/STRONG&gt;&lt;/H4&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;- Crash signatures are stable across runs and dates.&lt;/P&gt;&lt;P&gt;- User-mode AV occurs inside c10.dll (native runtime), not Python-level tracebacks.&lt;/P&gt;&lt;P&gt;- Kernel crash class is consistent and recurring (0xD1).&lt;/P&gt;&lt;P&gt;- Same machine has now produced multiple D1 incidents over separate sessions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;H4&gt;&lt;STRONG&gt;Artifacts Available&lt;/STRONG&gt;&lt;/H4&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;- MEMORY.DMP&lt;/P&gt;&lt;P&gt;- 042226-272031-01.dmp&lt;/P&gt;&lt;P&gt;- WER System bugcheck report ID:&lt;/P&gt;&lt;P&gt;&amp;nbsp; - 0d951bab-690c-4693-ac8f-4de6fb0a6d5b&lt;/P&gt;&lt;P&gt;- App crash report ID (python.exe / c10.dll):&lt;/P&gt;&lt;P&gt;&amp;nbsp; - 412a16dc-2f53-4fb9-9164-90eecd4203c2&lt;/P&gt;&lt;P&gt;- Additional earlier bugcheck report ID:&lt;/P&gt;&lt;P&gt;&amp;nbsp; - 49b93ebc-1f05-4362-b468-ba73da016f26&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;H4&gt;&lt;STRONG&gt;Request to Intel&lt;/STRONG&gt;&lt;/H4&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please treat this as an escalation of the existing issue:&lt;/P&gt;&lt;P&gt;- Investigate the repeated user-mode AV in c10.dll as a potential precursor to KMD failure.&lt;/P&gt;&lt;P&gt;- Correlate with repeated 0xD1 incidents in igdkmdnd64.sys under sustained XPU compute.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Apr 2026 06:14:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745308#M151055</guid>
      <dc:creator>RyanR3</dc:creator>
      <dc:date>2026-04-22T06:14:20Z</dc:date>
    </item>
    <item>
      <title>Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745556#M151092</link>
      <description>&lt;P&gt;I am running a similar use case on the B70 but have found a different issue that limits both dual- and single-card configurations on the ASRock X870 Taichi Creator, which supports PCIe 5.0 x16.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P class=""&gt;Arc Pro B70 (BMG-G31) advertises LnkCap Gen 1 x1 on dual-card configuration — card-level ceiling, not platform&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;HR /&gt;&lt;H2&gt;Summary&lt;/H2&gt;&lt;P class=""&gt;I have two Intel Arc Pro B70 GPUs in a dual-card AI inference workstation. Both cards link-train at PCIe Gen 1 x1 (2.5 GT/s × 1) regardless of platform configuration. After extensive platform-side diagnostics — including a full motherboard BIOS update from AGESA 1.2.0.3e to AGESA 1.3.0.0a — the LnkCap register on both cards continues to advertise a maximum of Gen 1 x1, meaning the ceiling is at the card / silicon / firmware level rather than the motherboard or riser. Platform-side variables have been exhaustively ruled out. This is blocking production deployment because tensor parallelism (required for Intel LLM-Scaler vLLM multi-card serving) would be crippled at this link speed.&lt;/P&gt;&lt;P class=""&gt;Requesting engineering engagement to determine whether this is a known launch-firmware issue on BMG-G31, whether a newer firmware exists beyond the currently installed BMG__31.1058, or whether this represents a hardware defect requiring RMA.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Hardware&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;GPUs under test:&lt;/STRONG&gt; 2x Intel Arc Pro B70 (32 GB GDDR6, BMG-G31 silicon)&lt;UL class=""&gt;&lt;LI&gt;Card 1: PCI 0000:03:00.0, MEI device /dev/mei0, firmware BMG__31.1058&lt;/LI&gt;&lt;LI&gt;Card 2: PCI 0000:08:00.0, MEI device /dev/mei1, firmware BMG__31.1058&lt;/LI&gt;&lt;LI&gt;Device ID: 8086:e223&lt;/LI&gt;&lt;LI&gt;Subsystem ID: 8086:1701&lt;/LI&gt;&lt;LI&gt;Both cards stock Intel reference design, purchased Q1 
2026&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Motherboard:&lt;/STRONG&gt; ASRock X870 Taichi Creator&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;BIOS pre-test:&lt;/STRONG&gt; 3.33 (AGESA 1.2.0.3e)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;BIOS post-update:&lt;/STRONG&gt; 4.10 (AGESA 1.3.0.0a, released 2026-02-10)&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;CPU:&lt;/STRONG&gt; AMD Ryzen 9 9900X (Zen 5, 12C/24T)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Memory:&lt;/STRONG&gt; 30 GB DDR5&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Power:&lt;/STRONG&gt; Adequate PSU for 2× 250W B70s + CPU (not a power-limit issue)&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Operating System / Software Stack&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;Ubuntu 24.04.4 LTS&lt;/LI&gt;&lt;LI&gt;Kernel: 6.17.0-20-generic&lt;/LI&gt;&lt;LI&gt;Kernel driver: xe (both cards claimed correctly, lspci -k confirms)&lt;/LI&gt;&lt;LI&gt;Intel compute-runtime: 26.09.37435.1 (latest from intel/compute-runtime GitHub)&lt;/LI&gt;&lt;LI&gt;Intel Graphics Compiler (IGC): v2.30.1 build 20950&lt;/LI&gt;&lt;LI&gt;Intel oneAPI DPC++ Compiler: 2025.3.3 (2025.3.3.20260319)&lt;/LI&gt;&lt;LI&gt;GuC/HuC firmware: latest HEAD from linux-firmware.git&lt;UL class=""&gt;&lt;LI&gt;/lib/firmware/xe/bmg_guc_70.bin.zst&lt;/LI&gt;&lt;LI&gt;/lib/firmware/xe/bmg_huc.bin.zst&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;SYCL runtime confirms both cards as SYCL devices:&lt;UL class=""&gt;&lt;LI&gt;[level_zero:gpu][0] Intel(R) Graphics [0xe223]&lt;/LI&gt;&lt;LI&gt;[level_zero:gpu][1] Intel(R) Graphics [0xe223]&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;igsc version 0.9.3 installed and successfully enumerates both cards&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;The Specific Symptom&lt;/H2&gt;&lt;P class=""&gt;lspci -vv output for both cards (identical behavior):&lt;/P&gt;&lt;PRE&gt;&lt;SPAN&gt;03:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e223]&lt;/SPAN&gt;&lt;SPAN&gt;    ...&lt;/SPAN&gt;&lt;SPAN&gt;    LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s &amp;lt;64ns, L1 &amp;lt;1us&lt;/SPAN&gt;&lt;SPAN&gt;    LnkSta: Speed 2.5GT/s, Width x1&lt;/SPAN&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;SPAN&gt;08:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e223]&lt;/SPAN&gt;&lt;SPAN&gt;    ...&lt;/SPAN&gt;&lt;SPAN&gt;    LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s &amp;lt;64ns, L1 &amp;lt;1us&lt;/SPAN&gt;&lt;SPAN&gt;    LnkSta: Speed 2.5GT/s, Width x1&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;&lt;STRONG&gt;Critical observation:&lt;/STRONG&gt; LnkCap (the advertised maximum capability) is Speed 2.5GT/s, Width x1. This is not a link-down negotiation — the cards are telling the system they cannot go faster. At the LnkCap level, the ceiling is set by the downstream endpoint (the B70 card itself), not the upstream root complex.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Upstream path is healthy.&lt;/STRONG&gt; dmesg reports:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN&gt;pci 0000:01:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link&lt;/SPAN&gt;&lt;SPAN&gt;pci 0000:06:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;The CPU-to-slot path provides full PCIe 4.0 x8 bandwidth (≈16 GB/s per card). Motherboard and slot training to Gen 4 x8 is fine. The B70 cards are the link-training constraint.&lt;/P&gt;&lt;H2&gt;Variables Ruled Out (Exhaustive Platform-Side Testing)&lt;/H2&gt;&lt;OL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Riser cable signal integrity&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Tested with Conbull PCIe 5.0 riser (brand A) → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;Replaced with a different-brand PCIe 5.0 riser → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Tested both cards directly in motherboard PCIE1 and PCIE2 slots, no risers → still Gen 1 x1&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;BIOS auto-negotiation&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Forced PCIe Gen 4 in BIOS (pre-update 3.33) → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;Default/Auto PCIe negotiation (post-update 4.10) → Gen 1 x1&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Motherboard BIOS / AGESA&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Updated from 3.33 (AGESA 1.2.0.3e) to 4.10 (AGESA 1.3.0.0a) → Gen 1 x1&lt;/LI&gt;&lt;LI&gt;This is a major AGESA revision boundary. 
No change in B70 link state.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Idle power saving / ASPM&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Link state monitored via /sys/class/drm/card*/device and lspci -vv at 0.5-second intervals during active LLM inference&lt;/LI&gt;&lt;LI&gt;Link state remains Gen 1 x1 across all observed activity levels&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Driver / compute-runtime versions&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;Multiple combinations tested; currently on latest upstream (26.09.37435.1 + IGC v2.30.1)&lt;/LI&gt;&lt;LI&gt;Battlemage GuC/HuC firmware loaded from current linux-firmware.git&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Slot bifurcation&lt;/STRONG&gt;&lt;UL class=""&gt;&lt;LI&gt;BIOS 4.10 exposes explicit PCIe Gen 5 x16 and x8/x8 bifurcation options (new vs 3.33)&lt;/LI&gt;&lt;LI&gt;Neither affects B70 link training — both cards report the same LnkCap regardless&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;H2&gt;Variables NOT Yet Ruled Out&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;B70 on-card firmware.&lt;/STRONG&gt; Currently BMG__31.1058 on both cards. No newer firmware has been published on Intel's Linux support channels. Intel's Linux FW support article (000096950) states: &lt;EM&gt;"The Linux driver package does not update FW."&lt;/EM&gt; Community workaround of extracting firmware from the Windows driver package is available but unsupported and risks voiding warranty. 
Not attempted.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Silicon-level defect on both cards.&lt;/STRONG&gt; Low probability (identical behavior on two cards from likely-different production batches suggests pattern, not random defect) but cannot rule out entirely.&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Performance Impact&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Steady-state inference:&lt;/STRONG&gt; unaffected at this link speed because weights stay in VRAM&lt;UL class=""&gt;&lt;LI&gt;Current benchmark on Qwen3.5-9B Q8_0 via llama.cpp SYCL: 47 tok/s generation&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Model load latency:&lt;/STRONG&gt; Extended (~60 s for a ~9 GB model at Gen 1 x1, vs. &amp;lt;1 s at Gen 5 x8)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Multi-GPU tensor parallelism:&lt;/STRONG&gt; Effectively blocked. oneCCL/Level Zero cross-card transfers at 250 MB/s per card make tensor parallelism worse than single-card independent serving.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Business impact:&lt;/STRONG&gt; Cannot migrate to Intel LLM-Scaler vLLM container (26.18.8.2, released 2026-04-22) for multi-user concurrent serving, which is the documented production path for dual-B70 deployments at the advertised 140+ tok/s throughput (reference: Hal9000AIML 2×B70 setup, also Intel Project Battlematrix validation).&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Requested Engineering Response&lt;/H2&gt;&lt;OL class=""&gt;&lt;LI&gt;Is BMG__31.1058 the current production firmware for B70?&lt;/LI&gt;&lt;LI&gt;Is there a known link-training issue on early BMG-G31 cards that would cause the endpoint to advertise LnkCap Gen 1 x1? A launch-batch firmware regression would be consistent with two cards from the same SKU exhibiting identical behavior.&lt;/LI&gt;&lt;LI&gt;If a newer firmware exists, what is the supported path for updating on Linux? Can Intel provide the firmware binary (.bin file) for direct application with igsc fw update, as currently needed by workstation Linux deployments? 
The "install Windows to update" path is impractical for production AI inference servers.&lt;/LI&gt;&lt;LI&gt;If firmware is current and behavior is expected at launch, what is the expected firmware release window for a link-training fix?&lt;/LI&gt;&lt;LI&gt;If neither of the above applies, please initiate RMA evaluation for both cards.&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Thu, 23 Apr 2026 23:49:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745556#M151092</guid>
      <dc:creator>gadget</dc:creator>
      <dc:date>2026-04-23T23:49:18Z</dc:date>
    </item>
    <item>
      <title>Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745981#M151174</link>
      <description>&lt;P&gt;Hi&amp;nbsp;RyanR3,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for patiently waiting! I wanted to give you a quick update on your case.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We're actively looking into the issue you reported, and our team is working hard to get to the bottom of it. To help us investigate further, we'll need a dump file from when the issue occurs. &lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Steps to Create a DMP File:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;1) Enable Memory Dump Collection:&lt;/P&gt;&lt;UL&gt;&lt;UL&gt;&lt;LI&gt;Right-click "This PC" and select "Properties"&lt;/LI&gt;&lt;LI&gt;Click "Advanced system settings"&lt;/LI&gt;&lt;LI&gt;Under "Startup and Recovery," click "Settings"&lt;/LI&gt;&lt;LI&gt;In the "Write debugging information" dropdown, select "Complete memory dump"&lt;/LI&gt;&lt;LI&gt;Click "OK" to save&lt;/LI&gt;&lt;/UL&gt;&lt;/UL&gt;&lt;P&gt;2) Saving Dump Files:&lt;/P&gt;&lt;UL&gt;&lt;UL&gt;&lt;LI&gt;The system will automatically create a dump file (usually saved in C:\Windows\MEMORY.DMP)&lt;/LI&gt;&lt;LI&gt;After the system restarts, locate this file and send it to us&lt;/LI&gt;&lt;/UL&gt;&lt;/UL&gt;&lt;P&gt;3) Alternative Method (if the above doesn't work):&lt;/P&gt;&lt;UL&gt;&lt;UL&gt;&lt;LI&gt;Download and install WinDbg from Microsoft&lt;/LI&gt;&lt;LI&gt;Use Task Manager to create a dump when the issue happens&lt;/LI&gt;&lt;LI&gt;Go to Task Manager &amp;gt; Details tab &amp;gt; Right-click the problematic process &amp;gt; "Create dump file"&lt;/LI&gt;&lt;/UL&gt;&lt;/UL&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Let me know once you've got that dump file ready, and we'll dive right into analyzing it!&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Once you have the dump file ready, please share it using your preferred file sharing service (Google Drive, OneDrive, Dropbox, etc.) and send us the download link. 
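&lt;/P&gt;&lt;P&gt;If you prefer, the same "Complete memory dump" setting can be applied directly in the registry (Windows CrashControl settings; a reboot is required before it takes effect):&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Key:   HKLM\SYSTEM\CurrentControlSet\Control\CrashControl
Value: CrashDumpEnabled (REG_DWORD) = 1    ; 1 = Complete memory dump&lt;/LI-CODE&gt;&lt;P&gt;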
These files can be quite large, so using a cloud service will make it much easier for both of us!&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for your patience, and feel free to reach out if you need any help with these steps.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;&lt;P&gt;Jonzyl B.&lt;/P&gt;&lt;P&gt;Intel Customer Support Technician&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 27 Apr 2026 23:16:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745981#M151174</guid>
      <dc:creator>Jonzyl_Intel</dc:creator>
      <dc:date>2026-04-27T23:16:39Z</dc:date>
    </item>
    <item>
      <title>Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745983#M151175</link>
      <description>&lt;P&gt;Hi gadget,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;To address your concern about that, I'd recommend opening a separate case for it. This way, we can make sure it gets the proper focus it deserves without getting mixed up with what we're currently working on. It'll help us stay organized and ensure both issues get the attention they need to be resolved properly.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;&lt;P&gt;Jonzyl B.&lt;/P&gt;&lt;P&gt;Intel Customer Support Technician&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 27 Apr 2026 23:17:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745983#M151175</guid>
      <dc:creator>Jonzyl_Intel</dc:creator>
      <dc:date>2026-04-27T23:17:12Z</dc:date>
    </item>
    <item>
      <title>Re: Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745989#M151177</link>
      <description>&lt;P&gt;Hi Jonzyl,&lt;BR /&gt;I have the dump file from April 15, when I opened this ticket, uploaded to my Google Drive account.&lt;BR /&gt;&lt;BR /&gt;Where can I send the link without posting it publicly? I did not see an option to DM you directly in your profile.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Apr 2026 00:33:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1745989#M151177</guid>
      <dc:creator>RyanR3</dc:creator>
      <dc:date>2026-04-28T00:33:06Z</dc:date>
    </item>
    <item>
      <title>Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746094#M151203</link>
      <description>&lt;P&gt;Hi RyanR3,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thank you for your response. I've sent an email to your active email address.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Please check both your Inbox and Spam folder for my email, and once received, kindly send the file to us.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;&lt;P&gt;Jonzyl B.&lt;/P&gt;&lt;P&gt;Intel Customer Support Technician&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 28 Apr 2026 17:33:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746094#M151203</guid>
      <dc:creator>Jonzyl_Intel</dc:creator>
      <dc:date>2026-04-28T17:33:04Z</dc:date>
    </item>
    <item>
      <title>Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746131#M151216</link>
      <description>&lt;H5&gt;To add further information: since my April 22 post, I've updated the Intel driver to 32.0.101.8724, released April 16. After that update, I attempted a LoRA training run that did not result in a BSOD but did fail with an error. Per Claude and Codex, the error may be related to the initial issue or to the newly released driver.&lt;/H5&gt;&lt;P&gt;Here's the report on it I had Claude compile. I will upload the mentioned files in a .zip to my Google Drive and send the link privately as before:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Intel Arc Pro B70 — Userspace stoull failure pinpointed to sycl::device::ext_oneapi_supports_cl_extension in sycl8.dll 2025.3.0.0&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;&lt;EM&gt;Date: 2026-04-28. Posted as a follow-up to the existing BSOD thread ("Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys 32.0.101.8629") because the new finding below was captured on the same hardware, after applying the 32.0.101.8724 driver update, and may be related to either the original kernel-mode defect Intel is already working on or to a separate userspace defect introduced or exposed by the 8724 driver / oneAPI runtime stack on this GPU.&lt;/EM&gt;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Headline Result&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;A user-mode process dump captured on 2026-04-28 at the moment of the first C++ exception throw, combined with WinDbg/cdb stack reconstruction, identifies the failing API as:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;sycl::_V1::device::ext_oneapi_supports_cl_extension
  in sycl8.dll  version 2025.3.0.0  (Intel oneAPI DPC++ Library, build 2026-01-12)&lt;/LI-CODE&gt;&lt;P&gt;The throw originates inside sycl8.dll, propagates as sycl::_V1::exception (observed 12 times in rapid succession by the dump tool), and is eventually re-raised by torch-xpu as a Python RuntimeError carrying the message "invalid stoull argument". The C++ stack establishes that the upstream trigger is torch's XPU device-properties query (c10::xpu::get_raw_device → at::xpu::getDeviceProperties), called as part of an ordinary tensor.contiguous() / convolution forward path on Arc Pro B70.&lt;/P&gt;&lt;P&gt;The std::stoull failure inside sycl8.dll is consistent with the SYCL extension-query path parsing a driver-returned extension-version string and encountering an empty or non-numeric token. We did not observe a clean PI / Level Zero error code in the throw chain (no PI_ERROR_* or ze_result_t string in the dump's exception path); the behavior matches an internal SYCL parse failure on metadata returned through the extension-support API.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Context: Existing BSOD Thread on This Hardware&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;For continuity with the in-progress investigation, the relevant facts from the original BSOD report on this hardware:&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Bugcheck&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;&lt;STRONG&gt;0xD1 DRIVER_IRQL_NOT_LESS_OR_EQUAL&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Faulting module&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;igdkmdnd64.sys (Intel Graphics Kernel-Mode Driver)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Driver image version in dump&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;32.0.101.8629&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD 
width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Fault site (per WinDbg)&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;igdkmdnd64+0x3a4c29&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Failure bucket&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;AV_igdkmdnd64!unknown_function&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Failure hash&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;{f72986a3-e8f9-3600-9d7c-dd8a40f557df}&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Stack shape&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;KiPageFault followed by multiple igdkmdnd64 frames and dxgkrnl DPC/ISR synchronization frames (invalid kernel-address access in DPC context, not application exception)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Trigger conditions&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;Sustained Intel XPU compute from PyTorch / oneDNN / Level Zero workloads. Increasing TDR delay did not prevent the BSOD.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;Mitigation attempted&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;Updated graphics driver to 32.0.101.8724 (released 2026-04-16). This is the driver under test for the new finding below.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The original kernel-mode MEMORY.DMP that produced the data above is already provided to Intel through the existing thread channel. 
The new evidence in this post is a USER-MODE process dump from a different failure mode that began appearing after the 8724 driver update (see below).&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Driver and Software State at Capture&lt;/STRONG&gt;&lt;/H4&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="175"&gt;&lt;P&gt;&lt;STRONG&gt;Graphics driver&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;&lt;STRONG&gt;32.0.101.8724 (Intel graphics driver, released 2026-04-16). Installed prior to the 2026-04-28 capture.&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="175"&gt;&lt;P&gt;&lt;STRONG&gt;Prior graphics driver&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;32.0.101.8629 (released 2026-04-02). Replaced by 8724 before this capture.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="175"&gt;&lt;P&gt;&lt;STRONG&gt;Level Zero driver version&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;1.15.37669 (reported by oneDNN verbose info banner during this capture).&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="175"&gt;&lt;P&gt;&lt;STRONG&gt;GPU&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Intel Arc Pro B70 Graphics, binary_kernels:enabled&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="175"&gt;&lt;P&gt;&lt;STRONG&gt;PyTorch XPU stack&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;torch 2.11.0+xpu / torchvision 0.26.0+xpu / torchaudio 2.11.0+xpu. Embedded Python 3.13.11 inside the ComfyUI portable distribution. ComfyUI 0.16.4.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="175"&gt;&lt;P&gt;&lt;STRONG&gt;Workload&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="391"&gt;&lt;P&gt;Flux-style transformer LoRA training, batch size 1, BF16 mixed precision, AdamW, gradient checkpointing on, block-swap CPU&amp;lt;-&amp;gt;XPU, SDPA + split_attn, LoRA rank/alpha clamped to 16. 
Throw reproduces in the cache-latents phase (VAE encode) before training begins.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Capture Methodology (this report)&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;A full user-mode process dump was captured of the failing python.exe training subprocess at the moment of the first C++ exception throw, with the following procedure:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Master Forge launched in PowerShell with MF_ONEDNN_VERBOSE=1 set on the parent shell, propagating ONEDNN_VERBOSE=1 to the training subprocess. Confirmed by master_forge.log banner showing "MF_ONEDNN_VERBOSE='1' (oneDNN verbose tracing requested)" and by subsequent oneDNN verbose output in the training subprocess stdout.&lt;/LI&gt;&lt;LI&gt;ProcDump (Sysinternals v11.1) launched in a separate elevated PowerShell, attached to the cache-latents subprocess by PID. Filter: -ma -e 1 -f sycl -f invalid_argument (full dump on first-chance C++ exception whose mangled type name contains "sycl" or "invalid_argument").&lt;/LI&gt;&lt;LI&gt;Training run started against a 30-image, 1024-bucket image-caption dataset that the Arc Pro B70 / sycl8 had not previously processed at those bucket sizes. The cache-latents phase (VAE encode, no training yet) reproduced the throw within ~5 seconds of dispatch.&lt;/LI&gt;&lt;LI&gt;ProcDump observed 12 sycl::_V1::exception throws in rapid succession, then std::invalid_argument throws, all at the same wall-clock second. Dump triggered on the first sycl::exception and written to disk at 4.16 GB.&lt;/LI&gt;&lt;LI&gt;Dump analyzed with cdb (Microsoft Console Debugger), Microsoft public symbol server enabled. 
Microsoft public symbols resolved KERNELBASE and VCRUNTIME140 frames cleanly; sycl8.dll, c10_xpu.dll, torch_xpu.dll frames showed nearest-exported-symbol approximations because Intel and PyTorch private PDBs are not on the public symbol server.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;ProcDump exception-monitor record at the moment of capture (verbatim):&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[time]Exception: E06D7363.?AVexception@_V1@sycl@@   (12 occurrences)
[time]Exception: E06D7363.?AVinvalid_argument@std@@   (4 occurrences)
[time]Exception: E06D7363.msc
[time]Process Exit: PID NNNNN, Exit Code 0x00000001&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;C++ Stack at the First sycl::exception Throw&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Frames marked "+0xNNNN" against an exported symbol such as getBorderColor are nearest-exported-symbol approximations because sycl8.dll private PDBs are not on Microsoft's public symbol server. The module name, the publicly-named ext_oneapi_supports_cl_extension caller, and the torch-xpu / c10-xpu caller frames are unambiguous and constitute the actionable triage signal.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;00 KERNELBASE!RaiseException+0x69                              ; OS exception raise
01 VCRUNTIME140!_CxxThrowException+0x97                        ; MSVC C++ throw
02 sycl8!sycl::_V1::detail::getBorderColor+0xa402              ; THROW INSIDE sycl8.dll
03 sycl8!sycl::_V1::detail::getBorderColor+0xa87b              ; sycl8 anonymous internals
04 sycl8!sycl::_V1::detail::getBorderColor+0xae7d
05 sycl8!sycl::_V1::detail::getBorderColor+0xfc89
06 sycl8!sycl::_V1::device::ext_oneapi_supports_cl_extension+0x77   ; THE CULPRIT API
07 c10_xpu!c10::xpu::get_raw_device+0xa85                      ; torch-xpu device facade
08 torch_xpu!at::xpu::getDeviceProperties+0x99                 ; device-properties query
09 torch_xpu!at::native::xpu::copysign_kernel+0x65df2          ; (anon, near copysign)
0a torch_xpu!at::native::xpu::copy_kernel+0xbe1ed              ; copy_kernel internals
0b torch_xpu!at::native::xpu::copy_kernel+0xb0a4e
0c torch_xpu!at::native::xpu::copy_kernel+0xa9ab5
0d torch_xpu!at::native::xpu::copy_kernel+0x114b
0e torch_xpu!at::native::xpu::copy_kernel+0xa86
0f torch_xpu!at::native::xpu::copy_kernel+0xab
10 torch_xpu!at::native::structured_max_pool2d_with_indices_out_xpu::impl+0x184e
11 torch_xpu!at::native::structured_max_pool2d_with_indices_out_xpu::impl+0xa40
12 torch_cpu!at::native::copy_ignoring_overlaps+0xaad          ; cpu-side copy bridge
13 torch_cpu!at::native::copy_+0x10d
14 torch_cpu!at::_ops::copy_::call+0x188
15 torch_cpu!at::native::clone+0x21e
16 torch_cpu!at::compositeexplicitautograd::view_copy_symint_outf+0x37ce
17 torch_cpu!at::compositeexplicitautograd::bucketize_outf+0x594c7
18 torch_cpu!at::_ops::clone::call+0x17e
19 torch_cpu!at::native::contiguous+0xf5                        ; .contiguous() inner
1a torch_cpu!at::compositeimplicitautograd::where+0x213e
1b torch_cpu!at::compositeimplicitautograd::broadcast_to_symint+0x49c87
1c torch_cpu!at::_ops::contiguous::call+0x17e                   ; Tensor.contiguous() entry
1d torch_cpu!at::TensorBase::__dispatch_contiguous+0x29
1e torch_cpu!at::TensorBase::contiguous+0xb3
1f torch_cpu!at::Tensor::contiguous+0x1c
20 torch_xpu!at::native::structured_mm_out_xpu::impl+0x6b6b     ; matmul on XPU
21 torch_xpu!at::native::structured_mm_out_xpu::impl+0xbc7f
22 torch_xpu!at::native::structured_mm_out_xpu::impl+0x951f
23 torch_cpu!at::_ops::convolution_overrideable::call+0x52c     ; convolution forward
24 torch_cpu!at::native::_convolution+0x2173
25 torch_cpu!at::compositeexplicitautograd::view_copy_symint_outf+0x1c9e
26 torch_cpu!at::compositeexplicitautograd::bucketize_outf+0x56d47
27 torch_cpu!at::_ops::_convolution::call+0x3cf
28 torch_cpu!at::native::convolution+0x1f4
...   (continuing up through autograd dispatch and into the Python C API)&lt;/LI-CODE&gt;&lt;H4&gt;&lt;STRONG&gt;Modules Implicated (versions for triage)&lt;/STRONG&gt;&lt;/H4&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;sycl8.dll&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;&lt;STRONG&gt;Path: &amp;lt;embedded_python&amp;gt;\Library\bin\sycl8.dll. Version 2025.3.0.0. Build timestamp Mon Jan 12 17:05:33 2026. Intel oneAPI DPC++ Library. This is the module that throws.&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;ze_loader.dll&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;Path: C:\Windows\System32\ze_loader.dll. Version 1.28.2.0. Build timestamp Fri Feb 20 14:42:08 2026. oneAPI Level Zero Loader for Windows. Loaded but no PI / Level Zero error code observed in the throw chain itself.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;torch_xpu.dll&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;Path: &amp;lt;embedded_python&amp;gt;\Lib\site-packages\torch\lib\torch_xpu.dll. PyTorch 2.11.0+xpu. Build timestamp Sat Mar 21 01:22:23 2026. Caller of c10::xpu::get_raw_device.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;c10_xpu.dll&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;Path: &amp;lt;embedded_python&amp;gt;\Lib\site-packages\torch\lib\c10_xpu.dll. PyTorch 2.11.0+xpu. Build timestamp Fri Mar 20 23:39:19 2026. Provides c10::xpu::get_raw_device, which calls into sycl::device::ext_oneapi_supports_cl_extension.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;torch_cpu.dll&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;PyTorch 2.11.0+xpu. 
Build timestamp Sat Mar 21 00:02:28 2026.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="170"&gt;&lt;P&gt;&lt;STRONG&gt;torch_python.dll&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="395"&gt;&lt;P&gt;PyTorch 2.11.0+xpu. Build timestamp Sat Mar 21 01:24:17 2026.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Python-Level Symptom vs C++ Root Cause (this capture)&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;The Python-level traceback for this capture surfaces the failure inside F.group_norm (torch.nn.functional.group_norm → torch.group_norm → C++ engine) with the message "invalid stoull argument". The dump shows that the actual C++ throw is upstream of group_norm: it occurs during the tensor.contiguous() / clone / copy chain that runs inside the convolution forward pass earlier in the VAE encoder block. That chain calls into torch-xpu's at::xpu::getDeviceProperties and c10::xpu::get_raw_device, which in turn calls sycl::device::ext_oneapi_supports_cl_extension on sycl8.dll. The extension-support query throws sycl::_V1::exception, the exception propagates up the C++ stack, eventually a downstream torch wrapper catches a related std::invalid_argument and re-raises as the Python RuntimeError. The Python op at the top of the stack at re-raise (group_norm) is incidental — it is simply the op that happened to be dispatching when the cached exception state surfaced.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Possible Relationship to the Existing BSOD Investigation&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;I cannot establish a causal link between the kernel-mode 0xD1 BSOD on 32.0.101.8629 and the userspace sycl::exception throw on 32.0.101.8724 from the evidence alone. They are different layers (kernel-mode driver vs userspace oneAPI runtime) and different failure mechanisms (IRQL violation in DPC context vs std::stoull parse failure on extension metadata). 
They do, however, share enough context to warrant Intel's attention as a single investigation:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Same hardware (Arc Pro B70).&lt;/LI&gt;&lt;LI&gt;Same workload class (PyTorch XPU training, sustained Level Zero submissions).&lt;/LI&gt;&lt;LI&gt;The 8724 driver update is the only major environmental change between the BSOD reproductions on 8629 and the userspace stoull reproductions on 8724.&lt;/LI&gt;&lt;LI&gt;The userspace stoull crash occurs early enough in training that sustained Level Zero compute is never reached, which is why I cannot currently confirm or deny whether 8724 fixes the original 0xD1 KMD defect. The userspace defect is gating the BSOD retest.&lt;/LI&gt;&lt;LI&gt;If the 8724 driver returns malformed extension-version metadata that trips sycl8's std::stoull parser, that same metadata change might also be a symptom of, or co-located with, the kernel-mode change that is the subject of the existing thread.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Practical implication: even if the two issues are owned by different Intel teams (KMD vs oneAPI DPC++), resolving the userspace stoull is a prerequisite for confirming whether the KMD fix on 8724 actually landed.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Hypothesis on the std::stoull Parse Failure&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Without sycl8.dll private PDBs we cannot point at the exact std::stoull call inside getBorderColor offsets. 
Based on the public API name (ext_oneapi_supports_cl_extension), the most likely scenario is:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;ext_oneapi_supports_cl_extension takes a cl_intel_* or cl_khr_* extension name and returns whether the device supports it, potentially with a minimum version.&lt;/LI&gt;&lt;LI&gt;Internally, sycl8 likely queries the device's reported extension list and version metadata via Level Zero / OpenCL (clGetDeviceInfo CL_DEVICE_EXTENSIONS_WITH_VERSION or equivalent Level Zero property).&lt;/LI&gt;&lt;LI&gt;The returned metadata contains version tokens that sycl8 parses with std::stoull (or std::stoi/strtoul wrapped into invalid_argument).&lt;/LI&gt;&lt;LI&gt;On Intel Arc Pro B70 with the current Level Zero driver build, one or more extension entries returns an empty token (or a non-numeric token, e.g. an empty version field after a delimiter). std::stoull on an empty input throws std::invalid_argument. SYCL's extension subsystem catches and re-throws as sycl::exception.&lt;/LI&gt;&lt;LI&gt;12 throws in succession suggest the query iterates over a list of extensions, all of which trip the same parse, OR a single query iterates internally and counts each parse failure as one throw.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Either way, the fix surface is in two places: (a) sycl8 should treat empty / non-numeric extension version tokens as benign (fail-soft, treat as version 0 or unsupported, not throw); (b) the Level Zero driver for Arc Pro B70 should not return malformed extension version tokens. 
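&lt;/P&gt;&lt;P&gt;As an illustration only (the real sycl8 parser is not public, and the token format is assumed from the CL_DEVICE_EXTENSIONS_WITH_VERSION convention), the fail-soft behavior proposed in (a) amounts to:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def parse_ext_version(token):
    """Fail-soft stand-in for the internal std::stoull call: empty or
    non-numeric version tokens degrade to 0 instead of throwing."""
    try:
        return int(token)
    except ValueError:
        return 0  # treat malformed metadata as version 0 / unsupported

# A version string with a trailing empty field reproduces the failure shape:
# std::stoull("") throws std::invalid_argument, just as int("") raises here.
assert [parse_ext_version(t) for t in "1.2.".split(".")] == [1, 2, 0]&lt;/LI-CODE&gt;&lt;P&gt;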
Intel can confirm which side owns the fix.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Mitigations Already Applied (for completeness)&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;None of the following resolve the bug, but they eliminate confounds and give the report a clean reproducer:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Inherited environment scrub before training subprocess launch: PYTORCH_*, TORCH_*, CUDA_*, IPEX_*, INTEL_*, SYCL_*, ZE_*, L0_*, LEVEL_ZERO_*, ONEAPI_*, NEO_*, ONEDNN_*, MKL_DEBUG, MKLDNN_* prefixes stripped on child startup.&lt;/LI&gt;&lt;LI&gt;Allocator-config keys actively pop'd from child env (PYTORCH_ALLOC_CONF, PYTORCH_XPU_ALLOC_CONF, PYTORCH_CUDA_ALLOC_CONF) — earlier failure mode on torch 2.9.1+xpu was an unrelated allocator-parser stoull that has not recurred on torch 2.11.0+xpu.&lt;/LI&gt;&lt;LI&gt;Single heavy-GPU-job coordination so generation, training, SUPIR, and captioning never contend for the XPU.&lt;/LI&gt;&lt;LI&gt;Verified system-RAM release at every workflow boundary (post-completion ComfyUI /free + gc.collect + poll-until-target).&lt;/LI&gt;&lt;LI&gt;DiT compatibility check now reads only the safetensors JSON header (no full mmap), avoiding pagefile-commit failures on Windows.&lt;/LI&gt;&lt;LI&gt;Compatibility shim around the Flux patchify rearrange to avoid non-contiguous tensor.reshape on XPU (view + permute + .contiguous() + view chain).&lt;/LI&gt;&lt;LI&gt;Pre-loop XPU cooldown sleep before first training step.&lt;/LI&gt;&lt;LI&gt;DataLoader workers forced to 0 on Windows+XPU.&lt;/LI&gt;&lt;LI&gt;LoRA rank clamped to 16 for 32 GB Arc Pro B70 headroom.&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Sanitized Repro Shape&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;The cache-latents phase reproduces the throw without any training. Command shape (paths sanitized):&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;&amp;lt;embedded_python&amp;gt;\python.exe -u src/musubi_tuner/flux_1_dev_cache_latents.py \
  --dataset_config &amp;lt;workspace&amp;gt;\configs\&amp;lt;lora_name&amp;gt;.toml \
  --vae &amp;lt;comfy_models&amp;gt;\vae\ae.safetensors \
  --model_version dev \
  --vae_dtype bfloat16 \
  --skip_existing \
  --device xpu&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Dataset shape: a small image-caption dataset (30 images, 1024-bucket, two aspect-ratio sub-buckets), batch size 1. Specific image content and captions are not relevant — the throw fires during the encoder forward pass, before any image data is consumed beyond the first tensor allocation.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;What I Can and Cannot Claim&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;&lt;STRONG&gt;Can claim:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The throw originates inside sycl8.dll 2025.3.0.0 at sycl::_V1::device::ext_oneapi_supports_cl_extension. Confirmed by WinDbg/cdb stack walk of a captured user-mode dump.&lt;/LI&gt;&lt;LI&gt;The upstream caller is c10::xpu::get_raw_device → at::xpu::getDeviceProperties from torch 2.11.0+xpu.&lt;/LI&gt;&lt;LI&gt;The exception is sycl::_V1::exception (12 in succession at the same wall-clock second), then std::invalid_argument(s), then surfaced to Python as RuntimeError("invalid stoull argument").&lt;/LI&gt;&lt;LI&gt;The std::stoull failure is internal to sycl8 — no PI / Level Zero error code appears in the throw chain.&lt;/LI&gt;&lt;LI&gt;Subsequent runs against the same input shape work because the compiled kernel binary is cached in the persistent SYCL kernel cache; fresh shapes that have never been compiled re-trigger the throw.&lt;/LI&gt;&lt;LI&gt;Driver 32.0.101.8629 produced the original BSOD (kernel-mode dump available); driver 32.0.101.8724 has been installed and has not produced a recurrence, but training cannot reach sustained-compute to confirm.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Cannot claim:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The exact line inside sycl8.dll where std::stoull is called or what string it parses. 
Nearest-exported-symbol approximations (the getBorderColor +0xNNNN pattern in the stack) are an artifact of sycl8's private PDBs not being available on the Microsoft public symbol server.&lt;/LI&gt;&lt;LI&gt;Whether the malformed extension-version metadata originates in sycl8's parser being too strict or in the Level Zero driver returning bad metadata. Intel guidance on which component owns the fix is requested.&lt;/LI&gt;&lt;LI&gt;Whether 32.0.101.8724 fixed the original igdkmdnd64.sys 0xD1 defect.&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;&lt;STRONG&gt;Materials Available via Private Channel&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;All items below are from the 2026-04-28 capture session. They are available on request through Intel's private support upload path; they are not attached to the public forum thread.&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="177"&gt;&lt;P&gt;&lt;STRONG&gt;User-mode crash dump&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="389"&gt;&lt;P&gt;python.exe_260428_184651.dmp, 4.16 GB. Captured 2026-04-28 by Sysinternals ProcDump v11.1 on the cache-latents subprocess. Triggered on the first sycl::_V1::exception throw (filter: sycl OR invalid_argument). Full process dump (-ma).&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="177"&gt;&lt;P&gt;&lt;STRONG&gt;WinDbg/cdb analysis output&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="389"&gt;&lt;P&gt;analyze_v2.txt, 223 KB. Output of a cdb script run against the above dump with Microsoft public symbols. Contains .lastevent / .exr -1 / .ecxr / kbn 200 / ~* kbn 50 / lmDvm for sycl8 / ze_loader / *xpu / c10* / torch* modules.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="177"&gt;&lt;P&gt;&lt;STRONG&gt;oneDNN verbose trace&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="389"&gt;&lt;P&gt;Captured during the same training session that produced the dump (ONEDNN_VERBOSE=1 set on the training subprocess). 
Confirms 100 percent of oneDNN dispatches on this workload were GPU matmul jit:gemm:any (885+ exec events, all successful). Establishes that oneDNN is not the suspect.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="177"&gt;&lt;P&gt;&lt;STRONG&gt;master_forge.log section&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="389"&gt;&lt;P&gt;Log excerpt of the 2026-04-28 session covering app startup, ComfyUI initialization, LoRA-server boot, training launch, cache-latents subprocess execution, and the failure traceback.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Questions for Intel Engineering&lt;/STRONG&gt;&lt;/H4&gt;&lt;OL&gt;&lt;LI&gt;Which Intel team owns sycl8.dll (Intel oneAPI DPC++ Library) version 2025.3.0.0? Specifically, the team responsible for sycl::_V1::device::ext_oneapi_supports_cl_extension and the extension-version-string parsing path. Should this be triaged in the existing thread, or split into a sibling thread that can be cross-referenced?&lt;/LI&gt;&lt;LI&gt;Is the std::stoull failure in ext_oneapi_supports_cl_extension a known issue on Arc-class GPUs with Level Zero driver 1.15.37669 / graphics driver 32.0.101.8724? If so, is a fix available in a newer sycl8.dll release?&lt;/LI&gt;&lt;LI&gt;If the parse failure is caused by malformed extension-version metadata returned by the Level Zero driver for Arc Pro B70, would the appropriate fix be: (a) sycl8 fail-soft on empty version tokens, or (b) the L0 driver returning well-formed tokens, or (c) both?&lt;/LI&gt;&lt;LI&gt;Does graphics driver 32.0.101.8724 (released 2026-04-16) carry a fix for the igdkmdnd64.sys 0xD1 bucket cited earlier in this thread (failure bucket AV_igdkmdnd64!unknown_function, hash {f72986a3-e8f9-3600-9d7c-dd8a40f557df})? 
If so, the original MEMORY.DMP already provided in this thread should be a useful confirmation; if not, please advise on a target driver release. The userspace stoull crash described above is currently gating any sustained-compute retest on 8724.&lt;/LI&gt;&lt;LI&gt;Are debug builds of sycl8.dll (with private PDBs) available through Intel's NDA / support channels so we could pinpoint the exact std::stoull caller line and the offending extension name? If yes, we can run the reproducer once more with the debug build loaded.&lt;/LI&gt;&lt;/OL&gt;&lt;H4&gt;&lt;STRONG&gt;Requested Next Step&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Continue the existing kernel-mode triage on the 8629 BSOD bucket and, in parallel, route the userspace sycl8 finding above to the appropriate oneAPI DPC++ team. Please advise whether to keep both items in this thread or split the userspace finding into a sibling thread cross-referenced from here. We are available to run additional diagnostics on request, including ONEDNN_VERBOSE=2, Level Zero / SYCL trace, ETW / GPUView, a debug-build sycl8.dll repro, or a live cdb attach with Intel-provided private symbols.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Apr 2026 00:02:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746131#M151216</guid>
      <dc:creator>RyanR3</dc:creator>
      <dc:date>2026-04-29T00:02:46Z</dc:date>
    </item>
    <item>
      <title>Re: Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746266#M151255</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/477489"&gt;@Jonzyl_Intel&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/481129"&gt;@gadget&lt;/a&gt;&amp;nbsp;:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can confirm the issue has happened on two out of four cards for me so far. In my case the affected cards are irreversibly downgrading to Gen 4, affecting P2P, as can easily be seen in synthetic measurements using ze_peak. There is no clear trigger and no fault of my own; the user cannot intentionally cause the behavior or revert it. Boards returned. This needs a hotfix right now. Please provide the link to the new issue here so I can add data to it (I will try to fish for it now).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;DG&lt;/P&gt;</description>
      <pubDate>Wed, 29 Apr 2026 21:30:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746266#M151255</guid>
      <dc:creator>DCGo</dc:creator>
      <dc:date>2026-04-29T21:30:32Z</dc:date>
    </item>
    <item>
      <title>Re: Re: Arc Pro B70 BSOD 0xD1 in igdkmdnd64.sys (32.0.101.8629)</title>
      <link>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746276#M151259</link>
      <description>&lt;P&gt;Reported&amp;nbsp;&lt;A href="https://community.intel.com/t5/Graphics/Arc-Pro-B70-cards-permanently-downgraded-from-PCIe-Gen5-x16-to/m-p/1746275/highlight/true#M151258" target="_blank" rel="noopener"&gt;here&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Apr 2026 22:41:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Graphics/Arc-Pro-B70-BSOD-0xD1-in-igdkmdnd64-sys-32-0-101-8629/m-p/1746276#M151259</guid>
      <dc:creator>DCGo</dc:creator>
      <dc:date>2026-04-29T22:41:23Z</dc:date>
    </item>
  </channel>
</rss>

