Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

iris_xe_max node hangs

tom91136
Beginner

Hello,

I suspect this issue is similar or identical to the one described in https://community.intel.com/t5/Intel-DevCloud/Gen12LP-device-selection-timeout/td-p/1235540

When I submit anything that uses the Iris Xe MAX GPU, the executable hangs. The affected node is `s012-n001`, which happens to be the first node PBS picks for a request such as `neednodes=1:iris_xe_max:ppn=2,nodes=1:iris_xe_max:ppn=2,walltime=00:01:00`, so I have to carefully avoid that node. I haven't checked whether other Xe nodes have the same issue, but it may be worth checking them as well.

dmesg on that node is flooded with GPU HANG messages:

[Mon Jan 18 18:06:31 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:44 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:44 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:52 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:52 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:59 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:59 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:07:05 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:05 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:07:13 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:13 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:07:19 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:19 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
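
To gauge how often the driver is hanging, the kernel log can be summarised with a small shell helper. This is only a minimal sketch: the function name `summarise_gpu_hangs` is made up for illustration, and it assumes the i915 messages look like the lines above.

```shell
# Summarise i915 hang activity from kernel log text supplied on stdin.
# summarise_gpu_hangs is a hypothetical helper name, not an existing tool.
summarise_gpu_hangs() {
    awk '
        /GPU HANG:/                            { hangs++ }   # one per reported hang
        /Resetting .* for preemption time out/ { resets++ }  # one per engine reset attempt
        END { printf "hangs=%d resets=%d\n", hangs, resets }
    '
}

# Usage on the node (reading the kernel log may need extra privileges):
#   dmesg | summarise_gpu_hangs
```

A hang count that keeps growing while no job is running would suggest the device never recovers after a reset.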

I've attached the full output of my job.

The submitted executable (a miniapp) runs fine on Gen9 nodes, and it was also tested on our own UoB Zoo IrisPro580 node without any problems.

7 Replies
AthiraM_Intel
Moderator

Hi,


Thanks for reaching out to us.

We are able to log in to the node (s012-n001) without any issue. Could you please check again? Make sure that the node is free before running the command.

You can check the state of the node with the command below:

pbsnodes
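
The full `pbsnodes` listing can be long, so it may help to filter it. The sketch below assumes the Torque-style layout, where each node name starts a line and its attributes follow as indented `key = value` lines with comma-separated `properties`. The helper name `list_free_nodes` is hypothetical.

```shell
# list_free_nodes PROPERTY
# Read pbsnodes-style text on stdin and print the names of nodes whose
# state is exactly "free" and whose properties include PROPERTY.
# Assumes the Torque output layout; adjust if your scheduler differs.
list_free_nodes() {
    awk -v want="$1" '
        function emit() {
            if (name != "" && state == "free" && has_prop) print name
        }
        # A line starting in column 0 begins a new node record.
        /^[^ \t]/ { emit(); name = $1; state = ""; has_prop = 0; next }
        $1 == "state" && $2 == "=" { state = $3 }
        $1 == "properties" && $2 == "=" {
            n = split($3, props, ",")
            for (i = 1; i <= n; i++) if (props[i] == want) has_prop = 1
        }
        END { emit() }
    '
}

# Usage:
#   pbsnodes | list_free_nodes iris_xe_max
```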

If the issue persists, please let us know and share the commands you used.


Thanks.


tom91136
Beginner

Hi Athira,

Thanks for getting back to me.

The issue isn't that I can't log in to the node. Whenever I run anything that uses the GPU, the executable hangs forever. The kernel log attachment from the original post shows the driver was stuck and unable to reset.

I can't secure that node at the moment as it seems to be in use.

You can reproduce this by running `clinfo` on that node and observing that the program hangs.
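
To keep such a check from blocking the shell indefinitely, the command can be wrapped in a time limit. This is a minimal sketch using coreutils `timeout`; the helper name `probe_gpu` and the 30-second limit in the usage line are arbitrary choices for illustration.

```shell
# probe_gpu SECONDS COMMAND [ARGS...]
# Run COMMAND with a time limit and report "ok", "error", or "hang".
# probe_gpu is a hypothetical helper; assumes coreutils timeout is installed.
probe_gpu() {
    local limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang"     # timeout exits with 124 when the limit is reached
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "error"    # the command failed on its own before the limit
    fi
}

# Usage:
#   probe_gpu 30 clinfo
```

If the process is stuck in an uninterruptible driver call, the signal sent by `timeout` may not take effect, so this wrapper is not guaranteed to return on a badly wedged node.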

Cheers,

Tom

AthiraM_Intel
Moderator

Hi,


Sorry for the inconvenience. We have contacted the relevant team regarding this issue and will keep you posted on updates.


Thanks.



AthiraM_Intel
Moderator

Hi,


We are checking on the issue and will keep you posted on updates.


Thanks.


AthiraM_Intel
Moderator

Hi,


The DevCloud admin team has informed us that s012-n001 has been set offline for troubleshooting.

We suggest you use the generic method to log in to available iris_xe_max nodes, which will guarantee your workload execution:


qsub -l nodes=1:ppn=2:iris_xe_max -d . run.sh


Could you please let us know whether we can close this case?


Thanks.



AthiraM_Intel
Moderator

Hi,


Could you please give us an update? Can we close this thread?


Thanks.


AthiraM_Intel
Moderator

Hi,


We have not heard back from you, so we will no longer be monitoring this thread. If you need further assistance, please post a new thread.


Thanks.

