This particular problem is likely not an Intel problem, but someone here may have run into it and have some advice on resolving it.
I have a compute-intensive application that is MPI distributed and OpenMP threaded. In my office I can run this program directly (no MPI) or distributed (mpirun) on 1 or 2 nodes. The local systems are a Xeon host (CentOS) and a KNL host (CentOS). I have also run this successfully by ssh-ing into the Colfax Cluster using 1 to 8 KNLs (I couldn't get 16 KNLs to schedule).
I am now attempting to run tests on a hardware vendor's setup.
After resolving configuration and installation issues, I can:
ssh into their login server (Xeon host)
su to super user
ssh to KNL node (Xeon KNL)
mpirun ... using two KNL's
When I mpirun'd the application, it started up as expected (periodically emitting progress information to the console). Several minutes into the run it hung. I thought this was a programming error resulting in deadlock, or maybe a watchdog timer had killed a thread or process without killing the MPI process manager.
To eliminate possible causes, I started the application stand-alone (without mpirun).
Several minutes into this run, the program hung as well, so it was not an MPI messaging issue.
Pressing Ctrl-C on the keyboard (through two ssh connections) yielded no response (application not killed). I thought one of the systems in the ssh connections went down. Prior to doing anything on those ssh connections, I wrote an email to my client explaining the hang issue. Several minutes passed.
Now for the interesting part.
After this several-minute "hang", 100 to 150 lines of progress output from the application appeared in the console window, followed by the message that the program had been terminated by Ctrl-C. What appears to have happened is that the application was running fine during the console hang, but the terminal output was suspended (as if flow control had instructed it to stop). And no, I did not press Ctrl-S or the Pause key.
Does anyone have information on this and how to avoid the hang, or at least how to resume the console output without killing the application?
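One way to tell a stalled console from a stalled application is to keep a second copy of the progress output in a file that can be inspected from a separate ssh session. The following is only a minimal sketch of that idea, in Python for brevity. The function name report_progress and its parameters are hypothetical and are not part of the application described above.

```python
import sys
import time


def report_progress(message, log_file, console=None):
    """Write one time-stamped progress line to a log file and to the console.

    The file copy is written and flushed first, so it stays current even
    when the console stream is delayed somewhere along the ssh chain.
    """
    if console is None:
        console = sys.stdout
    line = f"{time.strftime('%H:%M:%S')} {message}\n"
    log_file.write(line)
    log_file.flush()
    console.write(line)
    console.flush()
    return line
```

If the console goes quiet, running tail -f on the log file from another login shows whether the application is still making progress.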
Sounds like a very long timeout for a lost packet somewhere between you and the vendor....
I have seen similar problems inside my own building if I have to traverse a flaky firewall between my desk and the system. It is quite irritating....
Now for the interesting part. It may be of interest to other readers.
In a different test run of the application, I let it stay hanging for a long time, and then the buffered text was dumped out. In other words, the application was running correctly; only the text output was held up.
I was, and still am, having issues with the hardware vendor not configuring their InfiniBand switches properly. As it turns out, the IP address that was supposed to route over IB to a second KNL looped back to the same KNL. This resulted in both ranks running on the same node, each behaving as if it had all of the node's computational resources available. In other words, the hardware threads were double subscribed. Apparently the MPI management thread(s) and the console output thread have a lower priority than the application's compute threads (which in itself is OK). The text output came out only when the rank 0 master process (and master OpenMP thread) reached a serialization point to post a summary to a MySQL database. At that point the MPI console management thread got scheduled and dumped the output.
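A start-up check along the following lines could flag this kind of double subscription before a long run. It is only a sketch, in Python for brevity. It assumes the host name reported by each rank has already been collected on one rank (for example with MPI_Get_processor_name and a gather). The function name find_oversubscribed_hosts, the host names, and the thread counts are hypothetical.

```python
from collections import Counter


def find_oversubscribed_hosts(rank_hosts, threads_per_rank, hw_threads_per_host):
    """Return {host: requested_threads} for every host that is asked to run
    more software threads than it has hardware threads.

    rank_hosts: host name reported by each rank, indexed by rank
    threads_per_rank: OpenMP threads each rank starts (assumed uniform)
    hw_threads_per_host: hardware threads on one node (assumed uniform)
    """
    ranks_on_host = Counter(rank_hosts)
    return {
        host: count * threads_per_rank
        for host, count in ranks_on_host.items()
        if count * threads_per_rank > hw_threads_per_host
    }
```

With two ranks that both report the same host and each start as many threads as the node has hardware threads, the function returns that host with twice the available count. With the ranks on two different hosts it returns an empty dictionary.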
Another adventure in Wonderland...