Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

home directory not available on devcloud nodes

Brice1
Beginner

Hello
I haven't been able to run a single job on DevCloud recently, apparently because my home directory is unavailable on the compute nodes (see below).
Thanks

$ export PBS_DEFAULT=v-qsvr-nda
$ qsub -l nodes=1:single_gpu:ppn=2 -I
qsub: waiting for job 28195.v-qsvr-nda.aidevcloud to start
qsub: job 28195.v-qsvr-nda.aidevcloud ready


########################################################################
# Date: Wed 26 Jan 2022 12:15:25 AM PST
# Job ID: 28195.v-qsvr-nda.aidevcloud
# User: uxxxxx
# Resources: neednodes=1:single_gpu:ppn=2,nodes=1:single_gpu:ppn=2,walltime=06:00:00
########################################################################

PBS: chdir to '/home/uxxxxx' failed: No such file or directory

qsub: job 28195.v-qsvr-nda.aidevcloud completed 

 

RahulU_Intel
Moderator

Hi,


Thanks for posting in Intel Communities. We looked into your case and tried the command you gave, but we were not able to reproduce your error. Could you please confirm the following points:

  1. Are you running the commands from the login node?
  2. Are you able to get compute nodes in JupyterLab?
  3. Are you able to see your home directory by running the 'pwd' command in both the DevCloud GUI and the JupyterLab terminal?
  4. Are you able to access the generic queue?
  5. Do you have access to the NDA queue?
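
For point 3, the check can also be scripted. The following is only a minimal sketch, assuming a bash shell on the login node or inside an interactive job; the function name `check_dir` is hypothetical and is not part of any DevCloud or PBS tooling:

```shell
#!/bin/bash
# check_dir: hypothetical helper, not a DevCloud or PBS command.
# Reports whether a directory (default: $HOME) exists on the current host
# and returns a non-zero status when it does not.
check_dir() {
    local dir="${1:-$HOME}"
    if [ -d "$dir" ]; then
        echo "OK: $dir exists on $(uname -n)"
    else
        echo "MISSING: $dir does not exist on $(uname -n)"
        return 1
    fi
}
```

Calling `check_dir` with no argument tests `$HOME`; calling it with a path tests that path instead, which makes it easy to compare the login node with a compute node.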


You can try the commands below to list the nodes and to request a particular node by property or by name:

  1. To request an interactive node: qsub -I
  2. To list the nodes: pbsnodes. To list only the free nodes: pbsnodes -l free
  3. To request a node by node property: qsub […] -l nodes=1:[property]:ppn=2. For example, to request a CPU node: qsub […] -l nodes=1:cpu:ppn=2
  4. To request a node by node name: qsub […] -l nodes=[node_name]:ppn=2. For example, if your node name is s001-s001, you can run:

qsub -I -l nodes=s001-s001:ppn=2
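
To combine the listing and property steps, the node list can be filtered before choosing what to request. This is only a sketch: it assumes that `pbsnodes` prints each node name on an unindented line followed by indented `key = value` attribute lines (the usual Torque layout), and it has not been tested against DevCloud itself. The function name `free_nodes_with_property` is hypothetical.

```shell
#!/bin/bash
# free_nodes_with_property: hypothetical helper, not a PBS command.
# Reads pbsnodes-style text on stdin and prints the names of nodes whose
# state is exactly "free" and whose properties list contains the given property.
free_nodes_with_property() {
    local prop="$1"
    awk -v prop="$prop" '
        # An unindented, non-empty line starts a new node record.
        /^[^ \t]/ { node = $1; state = ""; next }
        # Attribute lines look like "     state = free".
        $1 == "state" { state = $3 }
        $1 == "properties" {
            n = split($3, props, ",")
            for (i = 1; i <= n; i++)
                if (props[i] == prop && state == "free") print node
        }
    '
}
```

Assuming that output layout, usage would be `pbsnodes | free_nodes_with_property cpu`. Nodes in states such as "down" or "down,offline" are skipped.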

Hope this helps


Thanks

Rahul


Brice1
Beginner

Yes, I am on the login node, and I have access to the NDA queue. The same commands had been working fine for several months, and my account was recently renewed for another year; things only broke recently. I don't use Jupyter, only ssh access and qsub. Some nodes of the generic queue seem to work better: I just got node s001-n054 from "qsub -I" and my home directory is available there. Can you check which node was allocated to job 28258.v-qsvr-nda.aidevcloud and whether my home directory is available there? There is likely an issue with some nodes and not with others.

RahulU_Intel
Moderator

Hi,


We are checking on this internally and will get back to you with an update soon.


Thanks

Rahul


Brice1
Beginner

Hello,
You asked me to try again in another thread, but I am replying here since it's more appropriate. I still have the problem, for instance with job 28370.v-qsvr-nda.aidevcloud that I just submitted:

uxxxxx@login-2:~$ export PBS_DEFAULT=v-qsvr-nda

uxxxxx@login-2:~$ qsub -l nodes=1:single_gpu:ppn=2 -I
qsub: waiting for job 28370.v-qsvr-nda.aidevcloud to start
qsub: job 28370.v-qsvr-nda.aidevcloud ready


########################################################################
# Date: Mon 31 Jan 2022 11:56:50 PM PST
# Job ID: 28370.v-qsvr-nda.aidevcloud
# User: xxxxx
# Resources: neednodes=1:single_gpu:ppn=2,nodes=1:single_gpu:ppn=2,walltime=06:00:00
########################################################################

PBS: chdir to '/home/uxxxxx' failed: No such file or directory

qsub: job 28370.v-qsvr-nda.aidevcloud completed


Brice

Brice1
Beginner

Hello
NDA nodes s004-n003 and s004-n001 have the problem now. If I run an interactive job, it gets killed with "PBS: chdir to '/home/u49077' failed: No such file or directory". If I submit a batch job, it never seems to run: I don't see it in qstat and I never get any job output/error files. The same reservation command lines work fine on s003-n001 for both interactive and batch jobs.

JananiC_Intel
Moderator

Hi,

 

Thanks for the update.

 

Sorry for the delay. We are checking on this. We will get back to you soon.

 

Regards,

Janani Chandran

 

JananiC_Intel
Moderator

Hi,


Our DevCloud team has fixed the issue. Could you check and let us know whether you are still facing the issue?


Regards,

Janani Chandran


Brice1
Beginner

All NDA nodes that were recently affected (SPR nodes) are currently down/offline. I'll check when they are back online.

JananiC_Intel
Moderator

Hi,


Have you been able to check? Do you have any update?


Regards,

Janani Chandran


Brice1
Beginner

Hello.
s003-n001 is still marked "down" in qnodes since last time, and s004-n001 and s004-n003 are still marked "down,offline". I couldn't re-test anything on these previously-affected nodes.

Brice

 

JananiC_Intel
Moderator

Hi,


Could you check whether your home directory is available on the other available nodes and let us know?


Regards,

Janani Chandran


Brice1
Beginner

I tested 6 different nodes in the main queue, no problem.
In the NDA queue, s004-n003 and s004-n001 have disappeared, and the new s017-n001 and n002 work fine. s003-n001 is still down, so I can't tell about that one.
So things look better, even if I can't be 100% sure.

Thanks

Brice

 

 

JananiC_Intel
Moderator

Hi,


Thanks for the update.


Please let us know if we can go ahead and close this case.


Regards,

Janani Chandran


Brice1
Beginner

Hello

If s003-n001 doesn't get brought back online, and s004-n003 and s004-n001 aren't put back in devcloud, there is no point in keeping this case open.

Brice

 

JananiC_Intel
Moderator

Hi,


Thanks for the immediate response.


s003-n001 is now online. For your information, s004-n001 and s004-n003 were experimental nodes that were brought online and then taken down.


Also, the recommended way to use these nodes is to request them by property name rather than by actual node name.


s003-n001's properties = xeon,spr,ram1024gb,netgbe,batch


For example, to target Sapphire Rapids, use: qsub -l nodes=1:spr:ppn=2 my_script.sh
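
If property-based requests are used often, the resource string can be built by a small helper. This is a minimal sketch only: `node_spec` is a hypothetical function name, and the defaults (one node, ppn=2) simply mirror the commands used earlier in this thread.

```shell
#!/bin/bash
# node_spec: hypothetical helper that builds the value passed to "qsub -l"
# from a node property, an optional ppn count (default 2), and an optional
# node count (default 1). It only formats a string; it does not call qsub.
node_spec() {
    local property="$1" ppn="${2:-2}" count="${3:-1}"
    echo "nodes=${count}:${property}:ppn=${ppn}"
}
```

It could then be used as `qsub -l "$(node_spec spr)" my_script.sh`, which expands to the same request as the example above.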


Since this issue has been addressed, we are closing this case. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Regards,

Janani Chandran

