<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hey Mohamad, in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-Trace-Collector-Crashing-with-Large-Number-of-Cores/m-p/929059#M2535</link>
    <description>&lt;P&gt;Hey Mohamad,&lt;/P&gt;

&lt;P&gt;Thanks for getting in touch.&amp;nbsp; That's a tough question.&amp;nbsp; You're using the MPI Correctness Checking library as part of the Intel® Trace Analyzer and Collector (via -check_mpi) and that library hasn't been tested at that scale yet.&lt;/P&gt;

&lt;P&gt;The way the tool works is: the correctness checker starts a background thread for each MPI rank your application is running.&amp;nbsp; Those background threads are in charge of running a variety of checks and communicate with each other via TCP sockets.&amp;nbsp; That keeps the correctness checker threads from interfering with the running MPI ranks, which could otherwise lead to false results.&lt;/P&gt;

&lt;P&gt;I don't know of a way to change the underlying communication system from TCP sockets to something else.&amp;nbsp; I'll ping the developers and see if there's anything else we can do.&lt;/P&gt;

&lt;P&gt;In the meantime, I would recommend that you run the MPI Correctness Checker at a smaller number of ranks.&amp;nbsp; The tool is able to report on both existing and potential issues in your code.&amp;nbsp; Odds are that, even though you're seeing the error at 6,000 cores, the root cause is present even in a 3,000-core run.&amp;nbsp; Can you try a smaller run and let me know if the tool reports any problems?&lt;/P&gt;

&lt;P&gt;Hope this helps and I look forward to hearing back.&lt;/P&gt;

&lt;P&gt;Regards,&lt;BR /&gt;
	~Gergana&lt;/P&gt;</description>
    <pubDate>Thu, 05 Dec 2013 17:28:45 GMT</pubDate>
    <dc:creator>Gergana_S_Intel</dc:creator>
    <dc:date>2013-12-05T17:28:45Z</dc:date>
    <item>
      <title>Intel Trace Collector Crashing with Large Number of Cores</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-Trace-Collector-Crashing-with-Large-Number-of-Cores/m-p/929058#M2534</link>
      <description>&lt;P&gt;Dear Support,&lt;/P&gt;

&lt;P&gt;I am currently running on RedHat Linux 6.2 64-bit with Intel compilers 12.1.0 and Intel MPI 4.0.3.008 over QLogic InfiniBand QDR (PSM). I am also using Intel Trace Analyzer and Collector 8.0.3.007.&lt;/P&gt;

&lt;P&gt;I am trying to debug an MPI problem when running on a large number of cores (&amp;gt;6,000), and I compile my application with "-check_mpi". My application is mixed Fortran, C, and C++, and most MPI calls are in Fortran.&lt;/P&gt;

&lt;P&gt;I launch my MPI job with the below options:&lt;/P&gt;

&lt;P&gt;mpiexec.hydra -env I_MPI_FABRICS tmi -env I_MPI_TMI_PROVIDER psm -env I_MPI_DEBUG 5 .......&lt;/P&gt;

&lt;P&gt;As soon as I launch the application the trace collector crashes with the below error:&lt;/P&gt;

&lt;P&gt;[0] Intel(R) Trace Collector ERROR: cannot create socket: socket(): Too many open files&lt;BR /&gt;
	[32] Intel(R) Trace Collector ERROR: connection closed by peer #0, receiving remaining 8 of 8 bytes failed&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;P&gt;It works fine on a smaller number of cores, but I need to debug beyond 6,000 cores since that's when my application starts having MPI problems.&lt;/P&gt;

&lt;P&gt;Any suggestions on how to overcome this limitation? Is there a way to have the trace collector run over Infiniband instead of TCP sockets?&lt;/P&gt;

&lt;P&gt;Thank you for your help.&lt;/P&gt;

&lt;P&gt;Mohamad Sindi&lt;/P&gt;

&lt;P&gt;EXPEC Advanced Research Center&lt;/P&gt;

&lt;P&gt;Saudi Aramco&lt;/P&gt;

</description>
      <pubDate>Thu, 05 Dec 2013 13:20:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-Trace-Collector-Crashing-with-Large-Number-of-Cores/m-p/929058#M2534</guid>
      <dc:creator>Mohamad_Sindi</dc:creator>
      <dc:date>2013-12-05T13:20:53Z</dc:date>
    </item>
    <item>
      <title>Hey Mohamad,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-Trace-Collector-Crashing-with-Large-Number-of-Cores/m-p/929059#M2535</link>
      <description>&lt;P&gt;Hey Mohamad,&lt;/P&gt;

&lt;P&gt;Thanks for getting in touch.&amp;nbsp; That's a tough question.&amp;nbsp; You're using the MPI Correctness Checking library as part of the Intel® Trace Analyzer and Collector (via -check_mpi) and that library hasn't been tested at that scale yet.&lt;/P&gt;

&lt;P&gt;The way the tool works is: the correctness checker starts a background thread for each MPI rank your application is running.&amp;nbsp; Those background threads are in charge of running a variety of checks and communicate with each other via TCP sockets.&amp;nbsp; That keeps the correctness checker threads from interfering with the running MPI ranks, which could otherwise lead to false results.&lt;/P&gt;
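&lt;P&gt;To make the scaling concrete, here's a rough back-of-the-envelope sketch (the one-socket-per-peer-rank assumption is mine, not confirmed Trace Collector behavior): a process whose checker thread holds a TCP socket per peer rank needs on the order of one file descriptor per rank, and the default per-process limit is often 1024.&lt;/P&gt;

```shell
# Hedged sketch: "nranks" and the one-socket-per-peer-rank assumption are
# illustrative, not confirmed Trace Collector internals.
limit=$(ulimit -n)        # per-process open-file limit; commonly 1024
nranks=6000
echo "fd limit: $limit, ranks: $nranks"
if [ "$nranks" -gt "$limit" ]; then
  # roughly the condition the collector hits: socket() fails with EMFILE
  echo "rank count exceeds fd limit: Too many open files"
fi
```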

&lt;P&gt;I don't know of a way to change the underlying communication system from TCP sockets to something else.&amp;nbsp; I'll ping the developers and see if there's anything else we can do.&lt;/P&gt;
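&lt;P&gt;One thing you could try on your side (a workaround sketch, not something we've validated at your scale): the immediate failure is the per-process open-file limit, so raising that limit on every compute node before the ranks start may push the crash out.&lt;/P&gt;

```shell
# Hedged workaround sketch: 16384 is an arbitrary illustrative value, and
# whether raising the limit suffices depends on how many sockets the
# collector really opens per rank. "your_app" is a placeholder.
ulimit -n                 # check the current soft limit
ulimit -n 16384           # raise it for this shell (hard limit permitting)
# To raise it on every node, wrap the application in the launch line, e.g.:
#   mpiexec.hydra ... bash -c 'ulimit -n 16384; exec ./your_app'
```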

&lt;P&gt;In the meantime, I would recommend that you run the MPI Correctness Checker at a smaller number of ranks.&amp;nbsp; The tool is able to report on both existing and potential issues in your code.&amp;nbsp; Odds are that, even though you're seeing the error at 6,000 cores, the root cause is present even in a 3,000-core run.&amp;nbsp; Can you try a smaller run and let me know if the tool reports any problems?&lt;/P&gt;
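&lt;P&gt;For reference, a smaller checking run could look like the following. The rank count and binary name are placeholders, and passing -check_mpi on the launcher assumes your Intel MPI version accepts it there as well as at link time, so treat this as a sketch:&lt;/P&gt;

```shell
# Hedged example launch: 1000 ranks instead of 6000; ./your_app is a
# placeholder for the real binary.
mpiexec.hydra -check_mpi -n 1000 \
    -env I_MPI_FABRICS tmi -env I_MPI_TMI_PROVIDER psm \
    -env I_MPI_DEBUG 5 ./your_app
```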

&lt;P&gt;Hope this helps and I look forward to hearing back.&lt;/P&gt;

&lt;P&gt;Regards,&lt;BR /&gt;
	~Gergana&lt;/P&gt;</description>
      <pubDate>Thu, 05 Dec 2013 17:28:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-Trace-Collector-Crashing-with-Large-Number-of-Cores/m-p/929059#M2535</guid>
      <dc:creator>Gergana_S_Intel</dc:creator>
      <dc:date>2013-12-05T17:28:45Z</dc:date>
    </item>
    <item>
      <title>Dear Gergana,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-Trace-Collector-Crashing-with-Large-Number-of-Cores/m-p/929060#M2536</link>
      <description>&lt;P&gt;Dear Gergana,&lt;/P&gt;

&lt;P&gt;Thanks for your quick response.&lt;/P&gt;

&lt;P&gt;I've tried running on fewer cores, as low as 1,000, and I still get the same error from the trace collector. I won't be able to run on fewer cores than that, as the model we are tackling is quite large and requires a large number of nodes and cores.&lt;/P&gt;

&lt;P&gt;Please keep us updated if your developers come back to you with any solutions or workarounds for this.&lt;/P&gt;

&lt;P&gt;For the time being are there any other Intel tools that you might be able to suggest to debug the MPI layer during run time for large scale runs?&lt;/P&gt;

&lt;P&gt;Thanks again for your help, I really appreciate it.&lt;/P&gt;

&lt;P&gt;Mohamad Sindi&lt;/P&gt;

&lt;P&gt;EXPEC Advanced Research Center&lt;/P&gt;

&lt;P&gt;Saudi Aramco&lt;/P&gt;

</description>
      <pubDate>Sun, 08 Dec 2013 09:51:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-Trace-Collector-Crashing-with-Large-Number-of-Cores/m-p/929060#M2536</guid>
      <dc:creator>Mohamad_Sindi</dc:creator>
      <dc:date>2013-12-08T09:51:00Z</dc:date>
    </item>
  </channel>
</rss>

