Intel® DevCloud

OpenMP GPU Offloading fails on particular compute node

atuft
Novice

I am attempting to practice GPU offloading via OpenMP using the example here. I have been able to compile and execute it on several gen9 GPU-equipped nodes; however, offloading fails when the job is allocated to one particular compute node.
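
For reference (the tutorial source isn't reproduced in this post), a minimal sketch of what a matmul offload program along those lines might look like; the array size, names, and the device-count check are my own illustration, not the actual src/matmul-offload.c:

/* Hypothetical minimal matmul offload sketch, for illustration only. */
#include <stdio.h>
#include <omp.h>

#define N 512

int main(void) {
    static float a[N][N], b[N][N], c[N][N];

    /* Initialize inputs on the host. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0f;
            b[i][j] = 2.0f;
            c[i][j] = 0.0f;
        }

    /* Report how many offload devices libomptarget can see. */
    printf("OpenMP devices available: %d\n", omp_get_num_devices());

    /* Offload the matrix multiply to the default device (the GPU when
       OMP_TARGET_OFFLOAD=MANDATORY and a device is present). */
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }

    printf("c[0][0] = %f (expected %f)\n", c[0][0], 2.0 * N);
    return 0;
}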

 

This is my job script:

 

#!/bin/bash
# Compile the matmul example with OpenMP offloading to SPIR-V (GPU) targets
icx -g -qopenmp -fopenmp-targets=spir64 src/matmul-offload.c -o matmul
# Abort rather than falling back to the host if no offload device is usable
export OMP_TARGET_OFFLOAD=MANDATORY
# Emit verbose libomptarget diagnostics to stderr
export LIBOMPTARGET_DEBUG=2
# Record which compute node the job ran on
echo HOSTNAME=${HOSTNAME} >&2
./matmul

 

 

I submit it with this command:

 

qsub -l nodes=1:gen9:ppn=2 -d . -o out.txt -e err.txt job.sh

 

 

When the job executes on node s001-n140, I see the following errors in err.txt:

 

/var/spool/torque/mom_priv/prologue.d//100-resetpcie.prologue: line 6: echo: write error: No such device
...
/var/spool/torque/mom_priv/jobs/1867180.v-qsvr-1.aidevcloud.SC: line 18: 1599332 Aborted                 ./matmul
/var/spool/torque/mom_priv/epilogue.d//100-resetpcie.epilogue: line 10: echo: write error: No such device

 

 

I've attached the full err.txt, in which it appears that libomptarget fails to offload to the GPU even though this node has a gen9 GPU:

 

$ pbsnodes | grep n140 -A 12
s001-n140
     state = free
     power_state = Running
     np = 2
     properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
     ntype = cluster
     status = rectime=1647469474,macaddr=ac:1f:6b:ad:88:60,cpuclock=Fixed,varattr=,jobs=,state=free,netload=5332538412953,gres=,loadave=1.01,ncpus=12,physmem=65672432kb,availmem=62430876kb,totmem=67671276kb,idletime=318047,nusers=3,nsessions=3,sessions=1090726 1090738 2362987,uname=Linux s001-n140 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

 

 

What is causing these errors? Have I made a mistake somewhere which is causing this job to fail? Is there some way I can exclude my job from being allocated to this particular node?

 

Many thanks,

 

Adam

MadhuK_Intel
Moderator

Hi,

 

Thank you for posting in Intel communities.

 

We were able to reproduce your issue on the particular node (s001-n140) that you mentioned. Thank you for notifying us; we are looking into the issue.

 

>>"Have I made a mistake somewhere which is causing this job to fail?"

What you are doing is correct.

 

>>"Is there some way I can exclude my job from being allocated to this particular node?"

There is no command to exclude a job from being allocated to a particular node. However, you can use the command below to list the gen9-featured nodes that are currently free:

 

pbsnodes | grep gen9 -B 4 | grep free -B 1

 

Example output:

[Screenshot: example pbsnodes output listing free gen9 nodes]

 

 

You can then select a particular node by name from the above output using the command below:

qsub […] -l nodes=[node_name]:ppn=2

 

For example: 

qsub -I -l nodes=s001-n228:ppn=2
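
If you want to automate picking a node, a small shell sketch along these lines could work (an illustrative helper, not an official DevCloud command; it assumes a job script named job.sh as in your post):

#!/bin/bash
# Pick the first gen9 node currently reporting "state = free" and submit to it.
# The pbsnodes pipeline is the one shown above; head -n 1 takes the first node name.
node=$(pbsnodes | grep gen9 -B 4 | grep free -B 1 | head -n 1)
echo "Submitting to ${node}"
qsub -l nodes=${node}:ppn=2 -d . -o out.txt -e err.txt job.sh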

 

For more information on DevCloud job submission, please refer to this URL: https://devcloud.intel.com/oneapi/documentation/job-submission/#accessing-compute-nodes

 

Thanks and regards,

Madhu

 

MadhuK_Intel
Moderator

Hi,

 

We have not heard back from you. Could you please provide an update on your issue? If your issue is resolved, can we go ahead and close this thread?

 

Best regards,

Madhu

 

atuft
Novice

Hi,

 

I do not have an update; I was waiting for you to reply with a solution, since you said you had reproduced the error and were looking into it. Has the issue with GPU offloading on this node been resolved?

 

Thanks,

 

Adam

JyothisV_Intel
Moderator

Hi,


Good day to you.


Sorry for the delay on our part.


We have informed the Intel DevCloud development team regarding this issue and will update you when it is resolved.


We also apologize for the inconvenience caused and ask that you use the workaround mentioned above until our development team resolves this.


Thanks and Regards,

Jyothis V James


JyothisV_Intel
Moderator

Hi,

 

Good day to you.

 

Sorry for the delay on our side.

 

Our development team has informed us that node s001-n140 has hardware issues and is not expected to come back into the node queue. We recommend running your jobs on the other available nodes. You can get the list of free gen9 nodes by running the following command:

 

 

pbsnodes | grep gen9 -B 4 | grep free -B 1

 

 

Thanks for letting us and the community know about this, and we apologize for the inconvenience caused.

 

If this resolves your issue, kindly mark this as a solution as it will help others with a similar issue.

 

Regards,

Jyothis V James

 

JyothisV_Intel
Moderator

Hi,


Good day to you.


We have not received any update from you. We hope that you were able to perform GPU offloading successfully on the other nodes. If not, do get back to us.


Thanks and Regards,

Jyothis V James


JyothisV_Intel
Moderator

Hi,


Good day to you.


We assume that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks and Regards,

Jyothis V James

