
Driver ends compute due to timeout on (dual) Arc Pro B50

Micah_II
Beginner
483 Views

Hi, I was hoping someone could help me with odd timeout errors that I get while running compute workloads on my machine.

 

I have an AM4 3600 on an X570 Phantom Gaming 4 board. It has two x16 slots, both occupied by Intel Arc Pro B50s, though one of the slots runs at x4. That should not matter, in theory.

 

I dual-boot Win11 and Linux (Kubuntu 24.04 LTS). Under Linux I get the following errors while running ollama (Vulkan backend) or llama.cpp (SYCL or Vulkan) with a model of about 14G split between the two cards:

 

[ 1974.084104] xe 0000:06:00.0: [drm] GT0: Timedout job: seqno=17319, lrc_seqno=17319, guc_id=2, flags=0x0 in ollama [5245]
[ 1974.167593] xe 0000:06:00.0: [drm] Xe device coredump has been created
[ 1974.167599] xe 0000:06:00.0: [drm] Check your /sys/class/drm/card1/device/devcoredump/data

 

This coincides with my application side error:

apr 01 19:46:58 desktop ollama[2317]: [Inferior 1 (process 5245) detached]
apr 01 19:46:58 desktop ollama[2317]: terminate called after throwing an instance of 'vk::DeviceLostError'
apr 01 19:46:58 desktop ollama[2317]: what(): vk::Device::waitForFences: ErrorDeviceLost
apr 01 19:46:58 desktop ollama[2317]: SIGABRT: abort
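
The kernel line names the process that owned the timed-out job (`ollama [5245]`), and the journal names the same PID in its "detached" line, while `ollama[2317]` is the logging parent. A minimal sketch for matching the two automatically is shown here; the regular expressions are assumptions derived only from the two excerpts above and may need adjusting for other kernel or ollama versions.

```python
import re

# Assumed format: the xe "Timedout job" line exactly as quoted in this post.
XE_TIMEOUT = re.compile(
    r"xe (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): \[drm\] "
    r"(?P<gt>GT\d+): Timedout job: seqno=(?P<seqno>\d+), .*?"
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\]"
)
# Assumed format: the debugger "detached" line from the ollama journal excerpt.
DETACHED = re.compile(r"\(process (?P<pid>\d+)\) detached")


def parse_xe_timeout(line):
    """Return the fields of an xe timeout line, or None if the line does not match."""
    m = XE_TIMEOUT.search(line)
    if m is None:
        return None
    return {
        "pci": m["pci"],
        "gt": m["gt"],
        "seqno": int(m["seqno"]),
        "comm": m["comm"],
        "pid": int(m["pid"]),
    }


def detached_pids(journal_lines):
    """Collect the PIDs named in '(process N) detached' journal lines."""
    return {int(m["pid"]) for line in journal_lines if (m := DETACHED.search(line))}


if __name__ == "__main__":
    kernel_line = (
        "[ 1974.084104] xe 0000:06:00.0: [drm] GT0: Timedout job: seqno=17319, "
        "lrc_seqno=17319, guc_id=2, flags=0x0 in ollama [5245]"
    )
    journal = ["apr 01 19:46:58 desktop ollama[2317]: [Inferior 1 (process 5245) detached]"]
    job = parse_xe_timeout(kernel_line)
    # True when the crashed runner is the process whose GPU job timed out.
    print(job is not None and job["pid"] in detached_pids(journal))
```

Feeding it the full `dmesg` and `journalctl -u ollama` output would show whether every application crash lines up with a driver timeout on the same card.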

 

This error happens time and again, after a few rounds of conversation, but not necessarily in the same conversation!

All the usual suspects are eliminated: ReBAR is on, and so is SR-IOV, since otherwise the kernel would run out of address space. CSM is disabled and the firmware is up to date. This seems to be a driver-related issue, since running the same workload with ollama on my Windows install does not give this error. I tried forcing HuC to load on GT0, since it only loads on GT1, but to no avail.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc amd_iommu=on i915.enable_guc=3 i915.force_huc=1"
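
One thing that may be worth checking here: the dmesg lines above show the `xe` driver bound to 0000:06:00.0, while the command line sets `i915.*` options. A `module.param=value` token on the kernel command line is delivered to the module named in its prefix, so options addressed to `i915` would not be expected to reach `xe`. The sketch below flags such mismatches; the helper names are hypothetical, and it makes no claim about which parameters either driver actually accepts.

```python
import os


def module_params(cmdline):
    """Group 'module.param=value' tokens from a kernel command line by module name."""
    grouped = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        module, dot, param = name.partition(".")
        if sep and dot and module and param:
            grouped.setdefault(module, {})[param] = value
    return grouped


def params_for_other_drivers(cmdline, bound, candidates=("i915", "xe")):
    """Return params addressed to a GPU driver module other than the bound one."""
    return {
        module: params
        for module, params in module_params(cmdline).items()
        if module in candidates and module != bound
    }


def bound_driver(pci_addr, sysfs_root="/sys"):
    """Name of the driver bound to a PCI device, read from its sysfs 'driver' symlink."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or the device does not exist
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    cmdline = "quiet splash pci=realloc amd_iommu=on i915.enable_guc=3 i915.force_huc=1"
    # Fall back to "xe" (the driver named in the dmesg excerpt) when sysfs is unavailable.
    driver = bound_driver("0000:06:00.0") or "xe"
    print(params_for_other_drivers(cmdline, driver))
```

If the result is non-empty, those options are probably not influencing the card at all, which would be consistent with the HuC setting having no visible effect.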

 

Could someone please look into why the driver decides to kill the calculation? And if possible, is there any way of stretching the timeout in the meantime, short of switching back to the OSS Xe driver and hacking it myself?
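
On the question of stretching the timeout: the `xe` driver appears to expose a per-engine-class `job_timeout_ms` file in sysfs, so a first step could be to read the current values. The sketch below is read-only. The path layout (`card*/device/tile*/gt*/engines/<class>/job_timeout_ms`) is an assumption that should be verified on the running kernel, and the function name is hypothetical.

```python
import glob
import os


def read_job_timeouts(drm_root="/sys/class/drm"):
    """Map each job_timeout_ms file found under drm_root to its integer value.

    Assumed layout: card*/device/tile*/gt*/engines/<class>/job_timeout_ms.
    Files that cannot be read or parsed are skipped.
    """
    pattern = os.path.join(
        drm_root, "card*", "device", "tile*", "gt*", "engines", "*", "job_timeout_ms"
    )
    timeouts = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                timeouts[path] = int(fh.read().strip())
        except (OSError, ValueError):
            continue
    return timeouts


if __name__ == "__main__":
    found = read_job_timeouts()
    if not found:
        print("no job_timeout_ms files found; the assumed sysfs layout may not apply")
    for path, value in found.items():
        print(f"{path}: {value} ms")
```

If the files exist, writing a larger value as root to the compute engine class on both cards would be the experiment to try. Whether that merely delays the reset or actually lets the job finish would itself be useful information for the driver team.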

 

I have enclosed all the (kernel/driver/app) diagnostics and dump and hope someone can tell me what to tweak. If more info is needed I'll happily oblige tomorrow.

0 Kudos
10 Replies
Raymund_Intel
Moderator
475 Views

Thank you for contacting Intel Customer Support and for trusting us with the issue you are experiencing on your system.

 

We understand that you have already performed several troubleshooting steps, and we appreciate the effort you’ve put into isolating the issue. To help us further narrow down the root cause and provide the most effective support, we would like to request some additional information from you.

 

Please confirm the following details:

 

  • The exact Linux kernel version you are using
  • Mesa, Vulkan, and Level Zero package versions
  • Whether the issue occurs when running on only one GPU
  • Whether the timeout happens during model loading or during inference
  • Whether both GPUs report the same firmware and GuC/HuC status


If available, providing the Xe device coredump would also be helpful for our investigation.
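
The first two items in the list above can be gathered in one step. This is a minimal sketch, assuming a Debian or Ubuntu system where `dpkg-query` is available; the package names in the example are illustrative and should be replaced with whatever is installed.

```python
import platform
import subprocess


def package_version(name, run=subprocess.run):
    """Ask dpkg-query for the installed version of a package; None if unavailable."""
    try:
        result = run(
            ["dpkg-query", "-W", "-f=${Version}", name],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return None  # dpkg-query itself is missing (not a Debian-based system)
    version = result.stdout.strip()
    return version if result.returncode == 0 and version else None


def collect_report(packages, run=subprocess.run):
    """Return the running kernel release plus the version of each named package."""
    report = {"kernel": platform.release()}
    for name in packages:
        report[name] = package_version(name, run=run)
    return report


if __name__ == "__main__":
    # Example package names only; adjust to the packages actually in use.
    for key, value in collect_report(["mesa-vulkan-drivers", "libze-intel-gpu1"]).items():
        print(f"{key}: {value}")
```

The `run` parameter exists only so the lookup can be exercised without `dpkg-query` present.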

 

For reference, the supported systems are listed below:

 

 

Once we receive your feedback, we will continue working with you to resolve the issue.

 

Best regards,

 

Raymund P.

Intel Customer Support Technician


0 Kudos
Raymund_Intel
Moderator
414 Views

Hello Micah_II,

 

I wanted to check if you had the chance to review the questions I posted. Please let me know at your earliest convenience so that we can determine the best course of action to resolve this matter.

 

Best regards,

 

Raymund P.

Intel Customer Support Technician


0 Kudos
Micah_II
Beginner
407 Views

Hi Raymund,

 

I have gone over the questions and put together a list describing the state of my PC at the time I uploaded the first post. I have since upgraded my PC, but the problem persists after the updates.

 

  • The exact Linux kernel version you are using: 6.17.0-20-generic. My dmesg.txt has been uploaded for more specifics.
  • Mesa, Vulkan, and Level Zero package versions
    • mesa-vulkan-drivers 25.2.8-0ubuntu0.25.10.1
    • libze-intel-gpu1 25.31.34666.3-1ubuntu1
    • If I missed one, it should be listed in packages.txt
  • Whether the issue occurs when running on only one GPU
    • When I run the model on a single GPU, the problem persists and performance is low.
  • Whether the timeout happens during model loading or during inference
    • During inference; the model appears to load fine.
  • Whether both GPUs report the same firmware and GuC/HuC status
    • The GPU data is identical; see also fw.txt

 

 

Kind regards,

Micah

0 Kudos
Raymund_Intel
Moderator
351 Views

Hi Micah_II,

 

Thank you for your response. I will do further research on this matter and post an update on this thread once it is available.

 

If you have questions, please let us know. Thank you.

 

Best regards,

 

Raymund P.

Intel Customer Support Technician


0 Kudos
Raymund_Intel
Moderator
298 Views

Hello Micah_II,


Thank you for patiently waiting for our response regarding this issue.

 

We would like to set expectations that Ubuntu is the only Linux operating system we officially support. While some Ubuntu‑based derivatives may work, they are not fully validated and therefore may have limited or unpredictable support. That said, we are still willing to investigate the issue to the best of our ability and provide guidance where possible.

 

To help us better understand and troubleshoot the problem, could you please clarify the following:

  • Are you currently running the Hardware Enablement (HWE) kernel on your system?
  • Was the GPU installed and configured according to the steps outlined in the Installing Client GPUs — Intel® software for general purpose GPU capabilities documentation?
  • Does the issue occur only when running intensive or high‑load workloads, or does it also happen under normal usage?
  • If applicable, could you share a sample project along with step‑by‑step instructions that reliably reproduce the issue?

 

This information will help us narrow down the root cause and proceed more effectively.

 

We appreciate your cooperation and look forward to your response.

 

Best regards,

 

Raymund P.

Intel Customer Support Technician


0 Kudos
Raymund_Intel
Moderator
277 Views

Hello Micah_II,

 

Could you please confirm whether you have reviewed the information I posted? Your feedback at your earliest convenience would be greatly appreciated so we can decide on the best way to proceed with resolving this matter.

 

Best regards,

 

Raymund P.

Intel Customer Support Technician


0 Kudos
Raymund_Intel
Moderator
250 Views


Hello Micah_II, 


Since I haven't received a response from you, I will be closing this inquiry. If you need further assistance, please submit a new question, as this thread will no longer be monitored.

 

Best regards,

 

Raymund P.

Intel Customer Support Technician


0 Kudos