tom91136
Beginner
275 Views

iris_xe_max node hangs

Hello,

I suspect this issue is similar or identical to the one described in https://community.intel.com/t5/Intel-DevCloud/Gen12LP-device-selection-timeout/td-p/1235540

When I submit anything that uses the Iris Xe MAX GPU, the executable hangs. The affected node is `s012-n001`, which happens to be the first node PBS picks when you request `neednodes=1:iris_xe_max:ppn=2,nodes=1:iris_xe_max:ppn=2,walltime=00:01:00`, so I have to carefully avoid that node. I haven't checked whether other Xe nodes have the same issue, but it may be worth looking into as well.

dmesg on that node is flooded with GPU HANG messages:

[Mon Jan 18 18:06:31 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:44 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:44 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:52 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:52 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:59 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:59 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:07:05 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:05 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:07:13 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:13 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:07:19 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:07:19 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
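
To put a number on how often the driver is hanging, the log can be summarized with a short script. This is only a minimal sketch in Python: the function name `summarize_i915_log` is hypothetical, and the regular expressions assume that every message has exactly the shape of the lines shown above.

```python
import re
from collections import Counter

# Patterns are assumptions derived only from the dmesg lines in this post.
HANG_RE = re.compile(
    r"i915 (?P<pci>[0-9a-f:.]+): \[drm\] GPU HANG: ecode (?P<ecode>[0-9a-f:]+)"
)
RESET_RE = re.compile(
    r"i915 (?P<pci>[0-9a-f:.]+): \[drm\] Resetting (?P<engine>\w+) for preemption time out"
)


def summarize_i915_log(text):
    """Count GPU HANG and engine-reset messages in dmesg-style text.

    Returns two Counters: hangs keyed by (pci_address, ecode) and
    resets keyed by (pci_address, engine).
    """
    hangs = Counter()
    resets = Counter()
    for line in text.splitlines():
        match = HANG_RE.search(line)
        if match:
            hangs[(match.group("pci"), match.group("ecode"))] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[(match.group("pci"), match.group("engine"))] += 1
    return hangs, resets


# Three lines copied from the log above, used as sample input.
sample = """\
[Mon Jan 18 18:06:31 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] Resetting bcs0 for preemption time out
[Mon Jan 18 18:06:38 2021] i915 0000:1b:00.0: [drm] GPU HANG: ecode 12:2:18800102, in  [0]
"""

hangs, resets = summarize_i915_log(sample)
for (pci, ecode), count in hangs.items():
    print(f"{pci}: {count} hang(s) with ecode {ecode}")
for (pci, engine), count in resets.items():
    print(f"{pci}: {count} reset(s) of {engine}")
```

On a real node, the same function could be fed the full kernel log text instead of the sample string.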

I've attached the full output of my job.

The submitted executable (miniapp) runs fine on Gen9 nodes, and it was also tested on our own UoB Zoo IrisPro580 node without any problems.

7 Replies
AthiraM_Intel
Moderator
257 Views

Hi,


Thanks for reaching out to us.

We are able to log in to the node (s012-n001) without any issue. Could you please check again? Make sure that the node is free before running the command.

You can check the state of the node using the command below:

pbsnodes

If the issue persists, please let us know and share the commands you used.


Thanks.


tom91136
Beginner
250 Views

Hi Athira,

Thanks for getting back to me.

The issue isn't that I can't log in to the node. Whenever I run anything that uses the GPU, the executable hangs forever. The kernel log attachment from the original post shows that the driver was stuck and unable to reset.

I can't secure that node at the moment as it seems to be in use.

You can reproduce this by running `clinfo` on that node and observing that the program hangs.
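
To check for the hang without tying up a terminal indefinitely, the command can be wrapped in a timeout. The following is a minimal sketch in Python; the helper name `runs_within` and the 30-second limit mentioned afterwards are assumptions for illustration, not anything DevCloud provides.

```python
import subprocess


def runs_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False otherwise.

    On timeout, subprocess.run kills the child process before raising
    TimeoutExpired, so a hung program is not left running by this helper.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        return False
```

On the affected node, a call such as `runs_within(["clinfo"], 30)` would be expected to return `False`, while a healthy node should return `True`. One caveat: if the process is blocked inside the kernel driver in an uninterruptible state, the kill itself may not take effect and the helper could block as well.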

Cheers,

Tom

AthiraM_Intel
Moderator
238 Views

Hi,


Sorry for the inconvenience caused. We have contacted the relevant team about this issue and will keep you posted on updates.


Thanks.



AthiraM_Intel
Moderator
135 Views

Hi,


We are checking on the issue and will keep you posted on updates.


Thanks.


AthiraM_Intel
Moderator
104 Views

Hi,


The DevCloud admin team informed us that s012-n001 has been taken offline for troubleshooting.

We suggest using the generic method below to request any available iris_xe_max node, which will guarantee that your workload executes:


qsub -l nodes=1:ppn=2:iris_xe_max -d . run.sh


Could you please let us know whether we can close this case?


Thanks.



AthiraM_Intel
Moderator
64 Views

Hi,


Could you please give us an update? Can we close this thread?


Thanks.


AthiraM_Intel
Moderator
35 Views

Hi,


We have not heard back from you, so we won't be monitoring this thread any longer. If you need further assistance, please post a new thread.


Thanks.