Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

Problem with disconnect

ghnunes
Beginner

Hello, how are you?

I'm running the experiments for my master's degree in Python. They are convolutional neural networks for image classification, but I'm having a problem: I get disconnected partway through a run. I need to run the experiment 30 times, which takes about 3 hours on average to complete. Around run 25 I always get disconnected, as shown in the screenshot below. I've already tried several internet connections, but the problem persists.

 

To run the code I use these commands:

ssh devcloud

qsub -I

ssh s001-n000.aidevcloud

 

I was reading this post, https://community.intel.com/t5/Intel-DevCloud/Login-Node-Versus-Compute-Node-in-Intel-DevCloud/mp/12..., and I don't know whether I am doing something wrong. Could I get more processing capacity, or is there another way to solve my problem? I had to try 10 times to get one complete run; the other 9 times I was disconnected.

 

Thanks!!

[Screenshot: problemaConexao.jpeg]

11 Replies
ArunJ_Intel
Moderator

Hi Ghnunes,

 

By default, the maximum wall-clock time a job may run on DevCloud is 6 hours. This can be increased up to 24 hours if required, e.g. qsub -I -l walltime=24:00:00. Since your job takes only about 3 hours, you don't seem to be exceeding the wall time.

 

The error (client_loop: send disconnect: Broken pipe) occurs because the SSH connection is not kept alive during periods of inactivity, so in your case it is dropped before the 3 hours are up.

 

To resolve this, open your SSH config file (~/.ssh/config) and find the entries for the hosts below:

1) devcloud

2) *.aidevcloud

3) devcloud-vscode

 

Add the lines below to each of those entries, or use a Host * block, as shown here, to apply them to every host:

 

 

Host *

  ServerAliveInterval 300

  TCPKeepAlive no

 

 

For example:

 

 

Host devcloud
User uXXXX
IdentityFile ~/.ssh/devcloud-access-key-uXXXX.txt
ProxyCommand ssh -T -i ~/.ssh/devcloud-access-key-uXXXX.txt guest@ssh.devcloud.intel.com
ServerAliveInterval 300
TCPKeepAlive no 
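
One way to confirm that these settings are actually picked up for a given host is to ask the SSH client for its effective configuration with ssh -G. The small filter below is only a sketch: keepalive_options is a hypothetical helper name, and it assumes an OpenSSH client recent enough to support the -G option.

```shell
# Keep only the keep-alive related lines from `ssh -G <host>` output (read on stdin).
# Assumption: `ssh -G` prints one "keyword value" pair per line with lowercase keywords.
keepalive_options() {
    grep -Ei '^(serveraliveinterval|serveralivecountmax|tcpkeepalive) '
}

# Usage, with the host alias from the config above:
# ssh -G devcloud | keepalive_options
```

If the lines were added to the right Host entry, the output should list the interval and TCPKeepAlive values from the config file rather than the client defaults.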

 

 

 

 

Once the configuration is added, please try again and let me know if you still see connection timeouts.

 

Thanks

Arun

 

 

ghnunes
Beginner


Hi @ArunJ_Intel , thanks so much for the help.

I made the changes you recommended, but when I ran my code it disconnected me again after about 2-3 hours of processing. The terminal showed this message:

 

[Screenshot: WhatsApp Image 2021-08-30 at 10.27.10(1).jpeg]

 

ghnunes
Beginner

My SSH config file is:

 

################################################################################################
# oneAPI DevCloud SSH config
################################################################################################
Host devcloud
User uXXXXXXXX
IdentityFile ~/.ssh/devcloud-access-key-xxxxx.txt
ProxyCommand ssh -T -i ~/.ssh/devcloud-access-key-xxxxxx.txt guest@ssh.devcloud.intel.com
ServerAliveInterval 300
TCPKeepAlive no

Host devcloud.proxy
User uXXXXXXX
Port 4022
IdentityFile ~/.ssh/devcloud-access-key-xxxxxx.txt
ProxyCommand ssh -T devcloud-via-proxy

# If you must route outgoing SSH connection via a corporate proxy,
# replace PROXY_HOSTNAME and PORT below with the values provided by
# your network administrator.
Host devcloud-via-proxy
User guest
Hostname ssh.devcloud.intel.com
IdentityFile ~/.ssh/devcloud-access-key-xxxxxx.txt
LocalForward 4022 c009:22
ProxyCommand nc -x PROXY_HOSTNAME:PORT %h %p
################################################################################################

################################################################################################
# DevCloud VSCode config
################################################################################################
Host devcloud-vscode
UserKnownHostsFile /dev/null
StrictHostKeyChecking no
Hostname localhost
User uXXXXXX
Port 5022
IdentityFile ~/.ssh/devcloud-access-key-XXXXXXX.txt
ServerAliveInterval 300
TCPKeepAlive no
################################################################################################

################################################################################################
# SSH Tunnel config
################################################################################################
Host *.aidevcloud
User uXXXXX
IdentityFile ~/.ssh/devcloud-access-key-xxxxxxxx.txt
ProxyCommand ssh -T devcloud nc %h %p
LocalForward 5022 localhost:22
LocalForward 5901 localhost:5901
ServerAliveInterval 300
TCPKeepAlive no
################################################################################################

ArunJ_Intel
Moderator

Hi Ghnunes,

 

Could you please make sure your connection was stable during that period? I was able to connect to nodes and run processes for more than 4 hours without any issues using the configuration I shared.

From your screenshots I can see that the DevCloud session where you submit the interactive job (i.e. run qsub -I) is the one that is timing out. A workaround is to submit a batch job using qsub, so that the job is not killed even if your connection to the DevCloud terminal times out.

To do this, run qsub Jobfile.txt instead of qsub -I.

Jobfile.txt should contain the code you need to execute on the node. In this case, if you just want to request a node and run your workload from VS Code, create Jobfile.txt with the content below to keep the job running for 6 hours.

 

 

 

echo "job started"
sleep 6h
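
As a variation on the sleep-based job file above, the job file can also run the workload itself, so that nothing depends on an open terminal or VS Code session. The following is only a sketch: train.py, the results directory, and the run_repeats helper are placeholder names that do not come from this thread, and the 30 repetitions simply mirror the experiment described earlier.

```shell
#!/bin/bash
# Sketch of a job file that runs the workload directly instead of sleeping.
# Assumption: train.py and results/ are placeholder names, not from this thread.

# Run a command a given number of times, writing one log file per run.
run_repeats() {
    runs="$1"
    outdir="$2"
    shift 2
    mkdir -p "$outdir"
    i=1
    while [ "$i" -le "$runs" ]; do
        echo "run $i of $runs started"
        # A failed run is reported, and the loop moves on to the next one.
        "$@" > "$outdir/run_$i.log" 2>&1 || echo "run $i failed"
        i=$((i + 1))
    done
}

# In a real job file, uncomment and adapt the lines below:
# cd "$PBS_O_WORKDIR"   # directory the job was submitted from
# run_repeats 30 results python train.py
```

Because each run writes its own log file, a run that fails partway through leaves the earlier results on disk. If the runs may exceed the default limit, the walltime option mentioned earlier can be added to the qsub command.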

 

 

 

After the job is submitted, run the command below to get the node name:

 

 

 

 qstat -xf <your_job_id> 

 

 

 

The node name is in the <exec_host> field of the output of the above command. Once you have the node name, you can connect from the second terminal (the one on the right in your screenshot).
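
Picking the node name out of the qstat -xf output by eye can be tedious, so a small filter may help. This is a sketch under assumptions: exec_host_from_xml is a hypothetical helper name, and it assumes the field looks like <exec_host>s001-n007/0-1</exec_host>, which may not match the exact format on DevCloud.

```shell
# Print the node name found in `qstat -xf` XML output read on stdin.
# Assumption: the value has the form <exec_host>NODE/CORES</exec_host>;
# only the part before the first "/" is printed.
exec_host_from_xml() {
    sed -n 's:.*<exec_host>\([^/<]*\).*:\1:p' | head -n 1
}

# Usage (replace the placeholder with your job ID):
# qstat -xf <your_job_id> | exec_host_from_xml
```

The function prints nothing when the output has no <exec_host> field, which is typically the case while a job is still queued and has not yet been assigned a node.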

 

Thanks

Arun

 

ghnunes
Beginner

Hi @ArunJ_Intel , thanks so much for the help.

I will contact my internet provider and check if I have any problems.

I managed to create the job file on the server and run it the new way you recommended, and I was able to run my experiment right away. One question: with this approach, will my code continue running on the server even if VS Code gets disconnected?

And if so, can I reconnect to the same node, or would reconnecting stop the process that is running?

 

Thanks!!

ArunJ_Intel
Moderator

Hi Ghnunes,

 

Yes, when you submit it as a batch job, the contents of the job file continue to run on the server even if you are disconnected from the node.

To reconnect to the node, take the node name from the qstat -xf command and connect to the machine using ssh.

 

If you are connecting from your host machine, you can connect with

ssh hostname.aidevcloud

e.g.

ssh s001-n007.aidevcloud

 

Or you could ssh to the node from the DevCloud login node using the node name:

ssh s001-n007

 

Replace s001-n007 with your hostname from the qstat -xf command.

 

 

Thanks

Arun Jose

 

 

 

ghnunes
Beginner

Just to confirm that I understand.

I'm running my code from VS Code, so in my job file I put the commands you mentioned:

echo...
sleep...

So if VS Code gets disconnected, will the code continue running on the server, and will the outputs be saved in the folders I specified?

 

Sorry for all the questions; I'm new to working with remote servers.

ghnunes
Beginner

It worked, @ArunJ_Intel. I started the run, turned off my PC on purpose, and when I came back my output had been saved.

I don't even know how to thank you. Thank you very much for taking the time to help me.

ArunJ_Intel
Moderator

Hi Ghnunes


Glad to know your issue has been resolved. We will stop monitoring this thread; please open a new thread if you have further issues.


Thanks

Arun Jose


ghnunes
Beginner

Hi @ArunJ_Intel 

 

I have another problem now: the output no longer includes the exec_host when I run these commands:

ssh devcloud
qsub JobFile.txt
xxxxx.v-qsvr-1.aidevcloud
qstat -xf xxxxx.v-qsvr-1.aidevcloud

The output no longer contains <exec_host>.


When I try with qsub -I, I sometimes get this error:

qsub: submission error (maximum number of jobs already queued for MSG user = total number of jobs for current user exceeds queue limit: user u80023@login-2.aidevcloud, queue batch)

ArunJ_Intel
Moderator

Hi Ghnunes,

 

I can see you have raised your follow-up issue in the thread below. It will be addressed there.

https://community.intel.com/t5/Intel-DevCloud/Problem-with-qsub/m-p/1313092#M2883

 

Thanks

Arun Jose

 

 
