Hi, I was hoping someone could help me with weird timeout errors that I get while running LLM inference on my box.
I have an AM4 3600 on an X570 Phantom Gaming 4 motherboard. Its two x16 slots are both occupied by Intel Arc Pro B50s, though one of the slots runs at x4. That should not matter, in theory.
I dual-boot Windows 11 and Linux (Kubuntu 24.04 LTS). Under Linux I get the following errors while running ollama (Vulkan backend) or llama.cpp (SYCL or Vulkan) with a model of about 14 GB split across the two cards:
[ 1974.084104] xe 0000:06:00.0: [drm] GT0: Timedout job: seqno=17319, lrc_seqno=17319, guc_id=2, flags=0x0 in ollama [5245]
[ 1974.167593] xe 0000:06:00.0: [drm] Xe device coredump has been created
[ 1974.167599] xe 0000:06:00.0: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
This coincides with the application-side error:
apr 01 19:46:58 desktop ollama[2317]: [Inferior 1 (process 5245) detached]
apr 01 19:46:58 desktop ollama[2317]: terminate called after throwing an instance of 'vk::DeviceLostError'
apr 01 19:46:58 desktop ollama[2317]: what(): vk::Device::waitForFences: ErrorDeviceLost
apr 01 19:46:58 desktop ollama[2317]: SIGABRT: abort
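To correlate the kernel-side timeouts with the application-side crashes over a longer log, something like the following could help. It is only a minimal sketch: the regular expression is modelled on the single `Timedout job` line quoted above, so other kernel versions may word the message differently, and the names `TIMEOUT_RE`, `parse_timeouts`, and `SAMPLE` are made up for illustration.

```python
import re

# Pattern modelled on the one xe "Timedout job" line quoted in this post.
# ASSUMPTION: other kernel versions may format the message differently.
TIMEOUT_RE = re.compile(
    r"xe (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): \[drm\] "
    r"GT(?P<gt>\d+): Timedout job: seqno=(?P<seqno>\d+), "
    r"lrc_seqno=(?P<lrc_seqno>\d+), guc_id=(?P<guc_id>\d+), "
    r"flags=(?P<flags>0x[0-9a-fA-F]+) in (?P<comm>\S+) \[(?P<pid>\d+)\]"
)

# The kernel line from this post, used as sample input.
SAMPLE = (
    "[ 1974.084104] xe 0000:06:00.0: [drm] GT0: Timedout job: "
    "seqno=17319, lrc_seqno=17319, guc_id=2, flags=0x0 in ollama [5245]"
)


def parse_timeouts(log_text):
    """Return one dict per xe job-timeout line found in log_text."""
    events = []
    for match in TIMEOUT_RE.finditer(log_text):
        event = match.groupdict()
        # Convert the purely numeric fields so they can be compared and sorted.
        for key in ("gt", "seqno", "lrc_seqno", "guc_id", "pid"):
            event[key] = int(event[key])
        events.append(event)
    return events


if __name__ == "__main__":
    for event in parse_timeouts(SAMPLE):
        # PCI address, process name and PID of the job that timed out.
        print(event["pci"], event["comm"], event["pid"], event["seqno"])
```

The extracted PID can then be matched against the process number that shows up in the ollama journal entries, and the PCI address shows which of the two cards reported the timeout.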
All the usual suspects are eliminated: ReBAR is on, and so is SR-IOV, since otherwise the kernel runs out of address space. CSM is disabled and the firmware is up to date. This seems to be a driver-related issue, since running the same workload in ollama on my Windows install does not produce this error. The weird thing is that the issue is intermittent: inference works for a while, then suddenly starts crashing and does not recover, i.e. compute cannot be resumed successfully. The issue also seems to start when I run two separate inferences simultaneously; everything starts timing out on the GPU side, even though ollama should be able to batch these requests, since they use the same model.
I have since updated my kernel to 6.19.10, but the issue persists, though less frequently.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc amd_iommu=on"
Could someone please look into why the driver decides to kill the calculation? And if possible, is there any way of stretching the timeout, short of switching back to the OSS Xe driver and hacking it myself?
I have attached all the kernel, driver, and application diagnostics and the dump (from kernel 6.17), and hope someone can tell me if there is anything I can do.
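On the question of stretching the timeout, a first step might be to see what the driver currently reports. The sketch below assumes that the xe driver on this kernel exposes per-engine-class tunables such as `job_timeout_ms` under `/sys/class/drm/card*/device/tile*/gt*/engines/<class>/`. That layout is an assumption and should be verified on the running system; the function name `read_engine_timeouts` is hypothetical. The code only reads values and changes nothing.

```python
from pathlib import Path

# ASSUMPTION: the xe driver exposes these per-engine-class files under
# <sysfs>/class/drm/card*/device/tile*/gt*/engines/<class>/ on this kernel.
# Check that the directories exist before relying on the result.
TUNABLES = ("job_timeout_ms", "preempt_timeout_us", "timeslice_duration_us")


def read_engine_timeouts(sysfs_root="/sys"):
    """Map each engine-class directory found to the tunables readable in it."""
    results = {}
    pattern = "class/drm/card*/device/tile*/gt*/engines/*"
    for engine_dir in sorted(Path(sysfs_root).glob(pattern)):
        if not engine_dir.is_dir():
            continue
        values = {}
        for name in TUNABLES:
            try:
                values[name] = int((engine_dir / name).read_text().strip())
            except (OSError, ValueError):
                # File missing, unreadable, or not an integer: skip it.
                continue
        if values:
            results[str(engine_dir.relative_to(sysfs_root))] = values
    return results


if __name__ == "__main__":
    for path, values in read_engine_timeouts().items():
        print(path, values)
```

An empty result means the assumed paths do not exist on this system. Whether raising `job_timeout_ms` would actually avoid the resets, rather than just delay them, is untested here.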
Hello Micah_II,
Thank you for reaching out to the Intel Community.
Could you please confirm whether you have tested the setup with a single Arc Pro B50 graphics card? If so, please let us know whether any passing scenario was observed during that testing.
Your confirmation on the above details will help us analyze the issue further and assist you more effectively.
Thank you for your continued cooperation. We look forward to your response.
Best regards,
Nikhil
Intel Customer Support Technician
Hello Micah_II,
I hope you are doing well.
I’m writing to follow up on our previous request. Could you please confirm whether the setup was tested with a single Arc Pro B50 graphics card, and whether any passing scenario was observed?
Your feedback will help us determine the next steps and continue assisting you effectively.
Best regards,
Nikhil
Intel Customer Support Technician
Hello Micah_II,
I hope you are doing well.
As I have not received a response to our previous message, I will proceed with closing this inquiry for now.
If you still require assistance or have any additional questions, please feel free to submit a new support request, and we will be happy to assist you. Please note that this thread will no longer be actively monitored.
Thank you for your understanding.
Best regards,
Nikhil
Intel Customer Support Technician