Hello,
I'm guessing this issue is similar or identical to the one described in https://community.intel.com/t5/Intel-DevCloud/Gen12LP-device-selection-timeout/td-p/1235540
When I submit anything that uses the Iris Xe MAX GPU, the executable hangs. The affected node is `s012-n001`, which happens to be the first node PBS picks when you request `neednodes=1:iris_xe_max:ppn=2,nodes=1:iris_xe_max:ppn=2,walltime=00:01:00`, so I have to carefully avoid that node. I haven't checked whether other Xe nodes have the same issue, but it may be worth checking them as well.
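Avoiding one node by hand means knowing which other nodes carry the same property and are currently free. The sketch below filters `pbsnodes`-style text for that. It assumes the Torque layout in which each node name sits on an unindented line, followed by indented `state = ...` and `properties = ...` lines, with `state` listed before `properties`. The function name `free_nodes_with_property` is hypothetical.

```shell
# Print free nodes that advertise a given property, skipping one known-bad node.
# Reads pbsnodes-style text on stdin. Assumed layout (Torque):
#   <node name>
#        state = free
#        properties = a,b,c
# Usage: pbsnodes | free_nodes_with_property PROPERTY NODE_TO_SKIP
free_nodes_with_property() {
  awk -v prop="$1" -v skip="$2" '
    /^[^ \t]/ { node = $1; state = ""; next }          # unindented line: new node record
    $1 == "state" && $2 == "=" { state = $3; next }    # remember this node state
    $1 == "properties" && $2 == "=" {
      n = split($3, props, ",")
      for (i = 1; i <= n; i++)
        if (props[i] == prop && state == "free" && node != skip)
          print node
    }
  '
}
```

For example, `pbsnodes | free_nodes_with_property iris_xe_max s012-n001` would list candidate nodes. Torque generally accepts a host name in place of the node count in `-l nodes=...`, though whether that is permitted on this cluster is an assumption.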
dmesg on that node is flooded with GPU HANG messages:
[Mon Jan 18 18:06:31 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
[Mon Jan 18 18:06:44 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:44 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
[Mon Jan 18 18:06:52 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:52 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
[Mon Jan 18 18:06:59 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:59 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
[Mon Jan 18 18:07:05 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:05 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
[Mon Jan 18 18:07:13 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:13 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
[Mon Jan 18 18:07:19 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:19 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in [0]
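To gauge how badly a node is affected without reading the whole log, the messages above can be counted. This is a minimal sketch that reads kernel log text on stdin and assumes the two message shapes shown above (`[drm] GPU HANG` and `[drm] Resetting ...`). The function name `summarize_i915_log` is hypothetical.

```shell
# Count i915 hang and engine-reset messages in kernel log text read from stdin.
# Usage: dmesg | summarize_i915_log   (dmesg may need elevated privileges)
summarize_i915_log() {
  awk '
    /\[drm\] GPU HANG/   { hangs++ }    # lines reporting a hang
    /\[drm\] Resetting / { resets++ }   # lines reporting an engine reset attempt
    END { printf "hangs=%d resets=%d\n", hangs, resets }
  '
}
```

A count that keeps growing between two runs a few seconds apart suggests the driver is still stuck in the hang and reset cycle.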
I've attached the full output of my job.
The submitted executable (miniapp) runs fine on Gen9 nodes, and it was also tested on our own UoB Zoo Iris Pro 580 node without any problems.
Hi,
Thanks for reaching out to us.
We are able to log in to the node (s012-n001) without any issue. Could you please check again? Make sure that the node is free before running the command.
You can check the state of the node by using the command below:
pbsnodes
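Run without arguments, `pbsnodes` lists every node, so it can help to query one node and pull out only its state. The sketch below assumes Torque-style output in which the state appears on an indented `state = ...` line. The function name `node_state` is hypothetical.

```shell
# Print the value of the first "state = ..." line in pbsnodes-style text on stdin.
# Comparing the first field exactly keeps "power_state = ..." lines from matching.
# Usage: pbsnodes s012-n001 | node_state
node_state() {
  awk '!seen && $1 == "state" && $2 == "=" { print $3; seen = 1 }'
}
```

A value of `free` means the scheduler can place a job there. It says nothing about whether the GPU driver on that node is healthy.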
If the issue persists, please let us know, and please share the commands you used.
Thanks.
Hi Athira,
Thanks for getting back to me.
The issue isn't that I can't log in to the node. Whenever I run anything that uses the GPU, the executable hangs forever. The kernel log attachment from the original post shows the driver was stuck and unable to reset.
I can't secure that node at the moment, as it seems to be in use.
You can reproduce this by running `clinfo` on that node and observing that the program hangs.
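Because a hung command never returns, it is safer to run the reproduction under a time limit. The sketch below wraps any command in GNU coreutils `timeout` and reports whether the limit was hit. The function name `hangs_within` and the 5-second kill grace period are assumptions for illustration.

```shell
# Return success (0) if COMMAND is still running after SECONDS, failure otherwise.
# GNU timeout exits with 124 when the command times out, and with 137 when the
# KILL signal had to be sent (here: 5 seconds after the initial TERM).
# Note: a command killed with KILL by something else is also reported as a hang.
# Usage: hangs_within SECONDS COMMAND [ARGS...]
hangs_within() {
  limit=$1
  shift
  status=0
  timeout -k 5 "$limit" "$@" >/dev/null 2>&1 || status=$?
  [ "$status" -eq 124 ] || [ "$status" -eq 137 ]
}
```

For example, `hangs_within 30 clinfo && echo "clinfo hung"` would flag the problem on the affected node. A process blocked inside the driver may not respond even to KILL, in which case this wrapper can itself block.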
Cheers,
Tom
Hi,
Sorry for the inconvenience. We have contacted the relevant team about this issue and will keep you posted on updates.
Thanks.
Hi,
We are checking on the issue and will keep you posted on updates.
Thanks.
Hi,
The DevCloud admin team informed us that s012-n001 has been set offline for troubleshooting.
We suggest using the generic request below, which lets PBS assign any available iris_xe_max node so that your workload can run:
qsub -l nodes=1:ppn=2:iris_xe_max -d . run.sh
Could you please let us know whether we can close this case?
Thanks.
Hi,
Could you please give us an update? Can we close this thread?
Thanks.
Hi,
We have not heard back from you, so we won't be monitoring this thread any longer. If you need further assistance, please post a new thread.
Thanks.
