Intel Trace Collector Crashing with Large Number of Cores

Mohamad_Sindi · ‎12-05-2013

Dear Support,

I am currently running on RedHat Linux 6.2 64-bit with Intel compilers 12.1.0 and Intel MPI 4.0.3.008 over Qlogic Infiniband QDR (PSM). I am also using Intel Trace Analyzer and Collector 8.0.3.007.

I am trying to debug an MPI problem when running on a large number of cores (>6000) and I compile my application with "-check_mpi". My application is mixed FORTRAN, C, and C++ and most MPI calls are in FORTRAN.

I launch my MPI job with the below options:

mpiexec.hydra -env I_MPI_FABRICS tmi -env I_MPI_TMI_PROVIDER psm -env I_MPI_DEBUG 5 .......

As soon as I launch the application the trace collector crashes with the below error:

[0] Intel(R) Trace Collector ERROR: cannot create socket: socket(): Too many open files
[32] Intel(R) Trace Collector ERROR: connection closed by peer #0, receiving remaining 8 of 8 bytes failed
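
For reference, "Too many open files" is the standard system message for the EMFILE error, which a process receives when it reaches its per-process limit on open file descriptors. Sockets count against that limit. Below is a minimal sketch in Python for printing the limit that applies to a process on a compute node. The function names are illustrative and are not part of any Intel tool, and the values seen inside an MPI job may differ from those in an interactive shell.

```python
import errno
import os
import resource


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe_limit(value):
    """Render a limit value, mapping RLIM_INFINITY to a readable word."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    # EMFILE is the errno returned when a process has no descriptors left.
    print("EMFILE message:", os.strerror(errno.EMFILE))
    print("soft limit:", describe_limit(soft))
    print("hard limit:", describe_limit(hard))
```

Running this through the same launcher as the application, rather than from a login shell, should show the limit the MPI ranks actually inherit.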

It works fine on a smaller number of cores, but I need to debug beyond 6000 cores, since that's when my application starts giving me problems with MPI.

Any suggestions on how to overcome this limitation? Is there a way to have the trace collector run over Infiniband instead of TCP sockets?

Thank you for your help.

Mohamad Sindi

EXPEC Advanced Research Center

Saudi Aramco

Gergana_S_Intel · ‎12-05-2013

Hey Mohamad,

Thanks for getting in touch. That's a tough question. You're using the MPI Correctness Checking library as part of the Intel® Trace Analyzer and Collector (via -check_mpi) and that library hasn't been tested at that scale yet.

The way the tool works is: the correctness checker starts up a background thread for each MPI rank your application is running. Those background threads are in charge of doing a variety of checks and communicate with each other via TCP sockets. That ensures the correctness checker threads do not interfere with the running MPI ranks, which could lead to false results.
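
As a rough illustration of why per-rank TCP connections can run into a descriptor limit, here is a small Python sketch. It rests on an assumption that is not confirmed anywhere in this thread: that one rank (rank 0 in the error above) holds a socket to every other rank, i.e. a star layout. The actual connection layout of the correctness checker may be different. The function names, the `reserved` allowance for other open files, and the limit values in the example are all hypothetical.

```python
def sockets_needed_star(n_ranks):
    """Sockets held by the hub rank if it connects to every other rank.

    ASSUMPTION: star layout with a single hub; the real tool may differ.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    return n_ranks - 1


def fits_descriptor_budget(n_ranks, soft_limit, reserved=32):
    """Check whether the hub's sockets fit under the open-file soft limit.

    `reserved` is a hypothetical allowance for descriptors used by
    everything else in the process (stdio, libraries, input files).
    """
    return sockets_needed_star(n_ranks) + reserved <= soft_limit


if __name__ == "__main__":
    # 1024 and 8192 are example soft limits, not values measured on this cluster.
    for limit in (1024, 8192):
        for ranks in (512, 6000):
            ok = fits_descriptor_budget(ranks, limit)
            print(f"ranks={ranks} limit={limit} fits={ok}")
```

Under this assumption the number of descriptors needed on the hub grows linearly with the rank count, so the rank count at which the job fails would depend on the open-file limit in effect on that node.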

I don't know of a way to change the underlying communication system from TCP sockets to something else. I'll ping the developers and see if there's anything else we can do.

In the meantime, I would recommend that you run the MPI Correctness Checker with a smaller number of ranks. The tool does have the ability to report on both existing and potential issues in your code. Odds are that, even though you're seeing the error at 6,000 cores, the root cause of the error is there even in a 3,000-core run. Can you try a smaller run and let me know if you see the tool report any problems?

Hope this helps and I look forward to hearing back.

Regards,
~Gergana

Mohamad_Sindi · ‎12-08-2013

Dear Gergana,

Thanks for your quick response.

I've tried running on fewer cores, as low as 1000, and I still get the same error from the trace collector. I won't be able to run on fewer cores than that, as the model we are tackling is quite large and requires a large number of nodes and cores.

Please keep us updated if your developers come back to you with any solutions or workarounds for this.

For the time being, are there any other Intel tools that you could suggest for debugging the MPI layer at run time for large-scale runs?

Thanks again for your help, I really appreciate it.

Mohamad Sindi

EXPEC Advanced Research Center

Saudi Aramco