Hello again.
Sorry for the millionth post here. I am probably cursed with discovering every bug and issue that exists.
I am using the Intel Tiber Developer Cloud (ITDC, console.cloud.intel.com) with the training nodes and JupyterLab, and I connect via VS Code.
I saw a very similar (or probably the same) problem here (https://community.intel.com/t5/Intel-Developer-Cloud/JupyterHub-Error-messages-restart-cycle-and-server/m-p/1622450), but I was asked to create a separate thread.
The problem is that JupyterLab frequently stops responding. The error message is the same as the one shown when a job reaches the 4-hour timeout, so it looks like the server crashes or something. VS Code also disconnects. I have to close the JupyterLab tab and launch a new JupyterLab, which crashes again after a while.
It always crashes on quarter-hour marks, that is, at :00, :15, :30, or :45.
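For anyone who wants to check the same pattern on their own crash times, here is a minimal Python sketch. The function names and the 60-second tolerance are my own assumptions for illustration, not anything provided by the platform.

```python
from datetime import datetime

QUARTER_S = 15 * 60  # length of a quarter hour in seconds


def seconds_from_quarter_hour(ts: datetime) -> int:
    """Distance in seconds from ts to the nearest :00/:15/:30/:45 mark."""
    offset = (ts.minute % 15) * 60 + ts.second
    return min(offset, QUARTER_S - offset)


def near_quarter_hour(ts: datetime, tolerance_s: int = 60) -> bool:
    """True if ts lies within tolerance_s seconds of a quarter-hour mark.

    The default tolerance is an arbitrary illustrative choice.
    """
    return seconds_from_quarter_hour(ts) <= tolerance_s
```

If most recorded crash times come back as near a quarter-hour mark, that points to something scheduled (a periodic job or cleanup) rather than a random network drop.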
Also, when I list the running Slurm jobs with the squeue command, I can see jobs that have been running for longer than 15 minutes, so other people are seemingly using it just fine for up to 4 hours. Except for me.
Thanks in advance for help.
Jakub
-----
Right now, I started JupyterLab at approx. 11:02 UTC (13:02 CEST). It is now 11:18 UTC, and JupyterLab has still not crashed. I did not connect VS Code this time; I only let it idle to see what happens.
The connection was still OK at 11:30 UTC. I then started the VS Code server (./vscode tunnel), without connecting a client, and at 11:45 UTC it had still not crashed. Then I connected my VS Code client, and at 12:00 UTC it crashed: not only the VS Code connection, but the whole JupyterLab instance.
So perhaps there is some weird interaction with VS Code. But it worked fine until probably last Thursday/Friday.
This is what appears when it crashes:
Clicking restart does not work, this is what appears then:
Now, I witnessed this message written to the terminal right before the connection and the JupyterHub crashed:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd-idc-training-gpu-compute-09: error: *** STEP 34498.3 ON idc-training-gpu-compute-09 CANCELLED AT 2024-08-28T12:30:19 ***
So something is probably calling scancel on my Slurm job. It did not time out and it did not lose connection; the Slurm job was simply cancelled. Furthermore, when I attempt to cancel the job myself (scancel JOBID), it says I don't even have permission to execute scancel (the binary has permissions 700 and is owned by root). So it is not me, and it most definitely can't be VS Code doing it, since it is running under my user.
Can someone from the DevCloud team please look into what is going on? It is really extremely frustrating to have to reconnect every 15 minutes.
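To make the timing in that slurmstepd message easier to compare across crashes, here is a rough Python sketch that pulls the job ID, step, node, and cancel time out of such a line. The regular expression is based only on the single line quoted above, so treat the format as an assumption; other Slurm versions or configurations may print it differently. The names StepCancel, parse_step_cancel, and seconds_after_quarter_hour are hypothetical helpers.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Pattern inferred from one observed slurmstepd line; this is an assumption,
# not a documented Slurm log format.
CANCEL_RE = re.compile(
    r"STEP (?P<job>\d+)\.(?P<step>\d+) ON (?P<node>\S+) "
    r"CANCELLED AT (?P<when>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
)


@dataclass
class StepCancel:
    job_id: int
    step_id: int
    node: str
    when: datetime


def parse_step_cancel(line: str) -> Optional[StepCancel]:
    """Return the cancel event described by line, or None if it does not match."""
    m = CANCEL_RE.search(line)
    if m is None:
        return None
    return StepCancel(
        job_id=int(m["job"]),
        step_id=int(m["step"]),
        node=m["node"],
        when=datetime.fromisoformat(m["when"]),
    )


def seconds_after_quarter_hour(ts: datetime) -> int:
    """Seconds elapsed since the most recent :00/:15/:30/:45 mark."""
    return (ts.minute % 15) * 60 + ts.second
```

For the line quoted above, this gives job 34498, step 3, on idc-training-gpu-compute-09, cancelled 19 seconds after the 12:30 mark, which fits the quarter-hour pattern.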
Hi Jakub,
Thank you for reaching out to us.
We apologize for the inconvenience. Please let us know what your last action was in JupyterHub when you encountered the error. Additionally, kindly provide your Cloud Account ID, the URL of your JupyterHub, and the country from which you are connecting.
Regards,
Thivagar
Hi,
The only things I do in JupyterHub are opening a terminal and starting a VS Code server (./vscode tunnel), following the docs (https://console.cloud.intel.com/docs/tutorials/vs_code.html).
Cloud Account ID: 309713633000
JupyterHub URL: https://jupyter-batch-us-region-1.cloud.intel.com/user/<USERNAME>/lab? . I will post the <USERNAME> in the internal ticket that got created (#06337298).
Country: Czechia
Anyway, for the past ~3 hours (since approx. 16:00 UTC), the JupyterHub itself is not crashing anymore. Did the team fix it?
However, the VS Code connection still often crashes (quite randomly now), but I am able to reconnect without restarting the JupyterHub. This is what the VS Code server writes when it disconnects:
[2024-08-28 18:56:41] info [tunnels::connections::ws] error reading websocket: WebSocket protocol error: Connection reset without closing handshake
[2024-08-28 18:56:41] warn Tunnel exited unexpectedly but gracefully, reconnecting
[2024-08-28 18:56:41] info [rpc.11] Disposed of connection to running server.
Still, this was not happening before approx. last Thursday.
Edit: it was disconnecting approx. once every 5-10 minutes. Now (~21:30 UTC), it has been OK for the last 40 minutes without any issue. You don't have to deal with this disconnect issue; I think it was just some random thing, and I will create a new ticket if necessary.
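In case it helps anyone quantify how often the tunnel drops, here is a small Python sketch that scans saved VS Code tunnel output for the "Tunnel exited unexpectedly" warning and computes the gaps between disconnects. It assumes every line starts with a "[YYYY-MM-DD HH:MM:SS]" timestamp as in the excerpt above; the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Assumed line shape, taken from the excerpt above: "[timestamp] level message".
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+(\w+)\s+(.*)$")


def disconnect_times(log_text: str) -> List[datetime]:
    """Timestamps of every 'Tunnel exited unexpectedly' line in log_text."""
    times = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and "Tunnel exited unexpectedly" in m.group(3):
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_between(times: List[datetime]) -> List[timedelta]:
    """Time elapsed between each pair of consecutive disconnects."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Run on the three lines quoted above, disconnect_times finds a single disconnect at 18:56:41, so gaps_between returns an empty list; a longer capture is needed to see whether the drops are regular or random.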
Hi Jakub,
Thank you for the information provided. I’m glad to hear that the JupyterHub disconnection issue has been resolved. Please let us know if the issue persists and we’ll be happy to assist further.
Regards,
Thivagar
Hi,
yes, the issue seems to have resolved itself. It works fine now without any problems.
No idea what was wrong.
Anyway, thanks,
Jakub
Hi Jakub,
This thread will no longer be monitored since this issue has been resolved. If you need any additional information from Intel, please submit a new question.
Regards,
Thivagar