Gregosh
Beginner
637 Views

DevCloud node allocation limits


Hello.

I'm trying to learn the DevCloud infrastructure and the qsub utility. The official web page about PBS gives this instruction:

 

[<user>@login-2 ~]$ echo cat \$PBS_NODEFILE | qsub -l nodes=4:ppn=2

 

I'm trying to run this command from the login node and from a compute node obtained with the command:

qsub -I 

 

And I always get this message:

qsub: submit error (Job exceeds queue resource limits MSG=job violates queue/server max resource limits)

 

I want to ask: is there a limit on the number of nodes?

Since this command is given as an example, shouldn't it execute properly?

How can I check why this command is not executing properly?

Thanks in advance.
Regards,
Grzegorz

10 Replies
Gopika_Intel
Moderator
598 Views

Hi,

Thank you for posting in Intel Communities and pointing this issue out. We were able to reproduce the issue and we are sorry for the inconvenience caused. As a workaround, you can request multiple nodes using the command:

echo cat \$PBS_NODEFILE | qsub -l nodes=s00X-nXXX+s00X-nXXX+s00X-nXXX:ppn=2

provided the nodes s00X-nXXX are free.

To find out which nodes are free, run:

pbsnodes
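
As a minimal sketch of the workaround above, the node list for `-l nodes=` can be assembled from the names that pbsnodes reports as free. The helper `build_nodes_spec` is hypothetical (it is not part of PBS), and the node names are the same placeholders used in the command above. The sketch only prints the resulting qsub option and does not submit anything.

```shell
# Hypothetical helper (not part of PBS): join node names with '+' and
# append the ppn value, mirroring the form used in the workaround above.
build_nodes_spec() {
    ppn="$1"
    shift
    spec=""
    for node in "$@"; do
        if [ -z "$spec" ]; then
            spec="$node"
        else
            spec="$spec+$node"
        fi
    done
    printf '%s:ppn=%s\n' "$spec" "$ppn"
}

# Placeholder names; replace them with nodes that pbsnodes shows as free.
spec=$(build_nodes_spec 2 s00X-nXXX s00X-nXXX)
echo "qsub -l nodes=$spec"
```

Once the placeholders are replaced with real free nodes, the printed option can be used in place of the hand-typed list in the command above.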

 

Hope this helps

Regards

Gopika

 

ghnunes
Beginner
544 Views

Hi Gopika,

I'm running my master's experiments in Python; they are convolutional neural networks for image classification. But I keep getting disconnected during execution, as shown in this screenshot. Would running with more nodes solve my problem? I get disconnected in 9 out of every 10 runs; each run takes about 3 hours on average to complete the 30 repetitions.

To run them, I'm using these commands:

ssh devcloud

qsub -I

ssh s001-n0XX.aidevcloud

 

[Attached image: problemaConexao.jpeg]

I tried to run the commands you mentioned to request more nodes, but I got this error: "error: unable to send message to qmaster using port 6444 on host "gustavo": can't resolve host name"

Gregosh
Beginner
446 Views

Hello.

Thank you for your helpful response.

The workaround you suggested works, and I will be able to use it.

However, because the workaround is much less convenient than using a single qsub command, I would like to ask:
are you going to fix this allocation method in qsub? If so,
when do you expect native allocation of more than one node with qsub to work again?

 

Regards.
Grzegorz

Gopika_Intel
Moderator
518 Views

Hi,

 

Thank you for the update. We saw that you raised this question here: https://community.intel.com/t5/Intel-DevCloud/Problem-with-disconnect/m-p/1310806/emcs_t/S2h8ZW1haWx... and that thread is active. Since the query is being answered in that thread, can we discontinue monitoring this thread?

Regards

Gopika


ghnunes
Beginner
509 Views

Of course, thank you very much!

Gopika_Intel
Moderator
297 Views

Hi,

Sorry for the delay in response. To answer your question:

> But, because the workaround is very less comfortable that using the only qsub utility and one command I would like to ask. Are You going to work about this allocation method with the qsub? And if Your response will be positive in this question. When do You expect that allocation of more than one node with the qsub natively will be working again?

 

A: Our configuration previously allowed only 2 jobs, each running on 1 node (2x1); multi-node jobs were not allowed. We have temporarily updated that to 2 jobs, each with up to 2 nodes (2x2).
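
To illustrate the updated limit, here is a small sketch that checks a requested node count before submitting. The value of `MAX_NODES_PER_JOB` is taken from the 2x2 description above and may change with the configuration; the helper `nodes_within_limit` is hypothetical and only prints a verdict, it does not call qsub.

```shell
# Assumed limit, based on the 2x2 configuration described above:
# up to 2 nodes per job.
MAX_NODES_PER_JOB=2

# Hypothetical helper: succeed only if the requested node count fits the limit.
nodes_within_limit() {
    [ "$1" -ge 1 ] && [ "$1" -le "$MAX_NODES_PER_JOB" ]
}

# nodes=4 is the request from the original question; nodes=2 is the
# largest request expected to fit under the assumed limit.
for n in 2 4; do
    if nodes_within_limit "$n"; then
        echo "nodes=$n: within the limit"
    else
        echo "nodes=$n: exceeds the limit of $MAX_NODES_PER_JOB nodes per job"
    fi
done
```

Under this assumption, the original example with nodes=4 would still be rejected, while the same command with nodes=2 should be accepted.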

Hope this helps

Regards

Gopika



Gopika_Intel
Moderator
238 Views

Hi,

We have not heard from you. Is your query resolved? If yes, can we discontinue monitoring this thread?

Regards

Gopika


Gregosh
Beginner
173 Views

Hello.

Thank you for your kind response.

For now, I accept the restriction of submitting jobs in the 2x2 configuration.
So thank you for your help; we can close this topic.

Regards.

Grzegorz

ghnunes
Beginner
223 Views

Hi @Gopika_Intel, sorry, I forgot to answer. Everything worked out; thank you so much for your help!

Gopika_Intel
Moderator
113 Views

Hi,

 

Thank you for the confirmation and for accepting our response as the solution. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.

Regards

Gopika

 
