<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Thanks Gergana, in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079512#M4858</link>
    <description>&lt;P&gt;Thanks Gergana,&lt;/P&gt;

&lt;P&gt;The problem disappeared once I removed the "setenv I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so" line and executed with "srun --mpi=pmi2 ..." instead of "srun ...".&lt;/P&gt;

&lt;P&gt;For OpenMPI, it seems like --mpi=pmi2 should be used if pmi2 is enabled. Is there something similar for Intel MPI?&lt;/P&gt;

&lt;P&gt;"If the pmi2 support is enabled then the command line options '--mpi=pmi2' has to be specified on the srun command line." &amp;lt;=&amp;nbsp;from &lt;A href="http://slurm.schedmd.com/mpi_guide.html#open_mpi" target="_blank"&gt;http://slurm.schedmd.com/mpi_guide.html#open_mpi&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;And I am encountering another problem.&lt;/P&gt;

&lt;P&gt;With srun --mpi=pmi2 on 128 or more nodes (1 MPI process per node; no error message up to 64 nodes), I get "slurmstepd: error: tree_msg_to_stepds: host=g161, rc = 1" in MPI_Init_thread(), but the code seems to work fine. With mpirun or mpiexec, MPI_Init_thread() does not produce any error message, but MPI communication is much slower.&lt;/P&gt;

&lt;P&gt;Any idea?&lt;/P&gt;

&lt;P&gt;Thank you very much!!!&lt;/P&gt;

&lt;P&gt;-seunghwa&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Gergana S. (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Seunghwa,&lt;/P&gt;

&lt;P&gt;Thanks for getting in touch.&amp;nbsp; This is more likely a configuration error than an issue with your application.&amp;nbsp; Although, it's likely your application takes up more memory than the defaults allow.&lt;/P&gt;

&lt;P&gt;In your case, you're saying that using I_MPI_FABRICS=shm:dapl (which is running over your local InfiniBand software stack, likely OFED) works fine when doing mpirun, mpirun -bootstrap=slurm, and mpiexec.hydra.&amp;nbsp; But doing the same with srun causes the "trying to free memory block" errors you see.&lt;/P&gt;

&lt;P&gt;The main difference in all of these cases is the launch mechanism.&amp;nbsp; When using mpirun/mpiexec.hydra, you're relying on the Intel MPI Library to start your job using the underlying SLURM startup method.&amp;nbsp; But in the srun case, you're actually asking SLURM to start your MPI job by pulling in the appropriate Intel MPI libs and scripts.&amp;nbsp; So the issue with srun is that some of the defaults on your system might be set differently as compared to when starting up with mpirun.&lt;/P&gt;

&lt;P&gt;Do you know if your memory limits are set appropriately?&amp;nbsp; Check out &lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/329053#comment-1713129"&gt;this forum thread&lt;/A&gt; which talks about how to set some of these limits.&amp;nbsp; Furthermore, the same error was resolved&amp;nbsp;&lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/370967#comment-1725651"&gt;here&lt;/A&gt; by setting log_num_mtt to 24.&lt;/P&gt;

&lt;P&gt;I hope this helps.&amp;nbsp; Let me know if updating your settings changes the outcome.&lt;/P&gt;

&lt;P&gt;Regards,&lt;BR /&gt;
	~Gergana&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 12 Jan 2016 21:55:10 GMT</pubDate>
    <dc:creator>Seunghwa_Kang</dc:creator>
    <dc:date>2016-01-12T21:55:10Z</dc:date>
    <item>
      <title>memory error occurs only with a certain job launching method and shm:dapl</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079510#M4856</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I'm experimenting with different job launching methods (http://slurm.schedmd.com/mpi_guide.html#intel_mpi) and get the following error only when I launch a job using srun (my code works fine with mpirun, mpirun -bootstrap slurm, and mpiexec.hydra) AND use shm:dapl (it works fine with shm:tcp).&lt;/P&gt;

&lt;P&gt;If I launch the job with&lt;/P&gt;

&lt;P&gt;setenv I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so&lt;BR /&gt;
	setenv I_MPI_FABRICS shm:dapl&lt;BR /&gt;
	srun -n 2 my_exec&lt;/P&gt;

&lt;P&gt;I get&lt;/P&gt;

&lt;P&gt;1: [1] trying to free memory block that is currently involved to uncompleted data transfer operation&lt;BR /&gt;
	1: &amp;nbsp;free mem &amp;nbsp;- addr=0x2b7a44547f70 len=1146388320&lt;BR /&gt;
	1: &amp;nbsp;RTC entry - addr=0x2b7a4bc93a00 len=1254064 cnt=1&lt;BR /&gt;
	1: Assertion failed in file ../../i_rtc_cache.c at line 1338: 0&lt;BR /&gt;
	1: internal ABORT - process 1&lt;BR /&gt;
	0: [0] trying to free memory block that is currently involved to uncompleted data transfer operation&lt;BR /&gt;
	0: &amp;nbsp;free mem &amp;nbsp;- addr=0x2ab3a253ff90 len=2723413888&lt;BR /&gt;
	0: &amp;nbsp;RTC entry - addr=0x2ab3a7aada80 len=1182864 cnt=1&lt;BR /&gt;
	0: Assertion failed in file ../../i_rtc_cache.c at line 1338: 0&lt;BR /&gt;
	0: internal ABORT - process 0&lt;/P&gt;

&lt;P&gt;And this error disappears if I set I_MPI_FABRICS to shm:tcp&lt;/P&gt;

&lt;P&gt;So what is the difference between srun and the other launch methods in this regard? I want to find out whether this could be caused by a bug in my code (which I would need to fix) or whether it is just a configuration issue, in which case simply not using srun would be sufficient.&lt;/P&gt;</description>
      <pubDate>Sat, 09 Jan 2016 01:19:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079510#M4856</guid>
      <dc:creator>Seunghwa_Kang</dc:creator>
      <dc:date>2016-01-09T01:19:56Z</dc:date>
    </item>
    <item>
      <title>Hi Seunghwa,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079511#M4857</link>
      <description>&lt;P abp="807"&gt;Hi Seunghwa,&lt;/P&gt;

&lt;P abp="808"&gt;Thanks for getting in touch.&amp;nbsp; This is more likely a configuration error than an issue with your application.&amp;nbsp; That said, it is likely that your application uses more memory than the defaults allow.&lt;/P&gt;

&lt;P abp="809"&gt;In your case, you're saying that using I_MPI_FABRICS=shm:dapl (which is running over your local InfiniBand software stack, likely OFED) works fine when doing mpirun, mpirun -bootstrap=slurm, and mpiexec.hydra.&amp;nbsp; But doing the same with srun causes the "trying to free memory block" errors you see.&lt;/P&gt;

&lt;P abp="810"&gt;The main difference in all of these cases is the launch mechanism.&amp;nbsp; When using mpirun/mpiexec.hydra, you're relying on the Intel MPI Library to start your job using the underlying SLURM startup method.&amp;nbsp; But in the srun case, you're actually asking SLURM to start your MPI job by pulling in the appropriate Intel MPI libs and scripts.&amp;nbsp; So the issue with srun is that some of the defaults on your system might be set differently as compared to when starting up with mpirun.&lt;/P&gt;

&lt;P abp="811"&gt;Do you know if your memory limits are set appropriately?&amp;nbsp; Check out &lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/329053#comment-1713129"&gt;this forum thread&lt;/A&gt; which talks about how to set some of these limits.&amp;nbsp; Furthermore, the same error was resolved&amp;nbsp;&lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/370967#comment-1725651"&gt;here&lt;/A&gt; by setting log_num_mtt to 24.&lt;/P&gt;
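
&lt;P&gt;As a rough sketch of what checking these settings could look like (not an official procedure): the script below prints the locked-memory limit of the current shell and, if present, the log_num_mtt value. The /sys path assumes the Mellanox mlx4_core driver, which this thread does not confirm; on other adapters the file will not exist.&lt;/P&gt;

&lt;PRE&gt;
```shell
#!/bin/sh
# Sketch only: report two settings that commonly matter for InfiniBand jobs.

# Locked-memory limit as seen by this shell (a number in KiB or "unlimited").
memlock_limit="$(ulimit -l)"
echo "max locked memory: ${memlock_limit}"

# Assumption: the adapter uses the mlx4_core driver. If it does not,
# this parameter file is absent and the else branch is taken.
mtt_file=/sys/module/mlx4_core/parameters/log_num_mtt
if [ -r "$mtt_file" ]; then
    echo "log_num_mtt: $(cat "$mtt_file")"
else
    echo "log_num_mtt: not available (mlx4_core parameter file not found)"
fi
```
&lt;/PRE&gt;

&lt;P&gt;Running it once from a login shell and once inside a job step started by srun should show whether the two environments see different limits.&lt;/P&gt;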

&lt;P abp="811"&gt;I hope this helps.&amp;nbsp; Let me know if updating your settings changes the outcome.&lt;/P&gt;

&lt;P abp="811"&gt;Regards,&lt;BR /&gt;
	~Gergana&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jan 2016 16:26:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079511#M4857</guid>
      <dc:creator>Gergana_S_Intel</dc:creator>
      <dc:date>2016-01-12T16:26:03Z</dc:date>
    </item>
    <item>
      <title>Thanks Gergana,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079512#M4858</link>
      <description>&lt;P&gt;Thanks Gergana,&lt;/P&gt;

&lt;P&gt;The problem disappeared once I removed the "setenv I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so" line and executed with "srun --mpi=pmi2 ..." instead of "srun ...".&lt;/P&gt;
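
&lt;P&gt;A minimal sketch of the setup that worked, written in sh syntax rather than the csh "setenv" form used earlier in this thread. The process count and the my_exec name are the placeholders from the original post, and the command is only printed here, not executed, so the sketch can be tried outside a Slurm allocation.&lt;/P&gt;

&lt;PRE&gt;
```shell
#!/bin/sh
# Sketch only: environment and launch line for the srun + PMI2 case.

# Do not point Intel MPI at the PMI-1 library when using --mpi=pmi2.
unset I_MPI_PMI_LIBRARY

# Same fabric selection as in the original report.
export I_MPI_FABRICS=shm:dapl

# my_exec and -n 2 are placeholders taken from the first post.
launch_cmd="srun --mpi=pmi2 -n 2 my_exec"

# Print the command instead of running it.
echo "$launch_cmd"
```
&lt;/PRE&gt;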

&lt;P&gt;For OpenMPI, it seems like --mpi=pmi2 should be used if pmi2 is enabled. Is there something similar for Intel MPI?&lt;/P&gt;

&lt;P&gt;"If the pmi2 support is enabled then the command line options '--mpi=pmi2' has to be specified on the srun command line." &amp;lt;=&amp;nbsp;from &lt;A href="http://slurm.schedmd.com/mpi_guide.html#open_mpi" target="_blank"&gt;http://slurm.schedmd.com/mpi_guide.html#open_mpi&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;And I am encountering another problem.&lt;/P&gt;

&lt;P&gt;With srun --mpi=pmi2 on 128 or more nodes (1 MPI process per node; no error message up to 64 nodes), I get "slurmstepd: error: tree_msg_to_stepds: host=g161, rc = 1" in MPI_Init_thread(), but the code seems to work fine. With mpirun or mpiexec, MPI_Init_thread() does not produce any error message, but MPI communication is much slower.&lt;/P&gt;

&lt;P&gt;Any idea?&lt;/P&gt;

&lt;P&gt;Thank you very much!!!&lt;/P&gt;

&lt;P&gt;-seunghwa&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Gergana S. (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Seunghwa,&lt;/P&gt;

&lt;P&gt;Thanks for getting in touch.&amp;nbsp; This is more likely a configuration error than an issue with your application.&amp;nbsp; Although, it's likely your application takes up more memory than the defaults allow.&lt;/P&gt;

&lt;P&gt;In your case, you're saying that using I_MPI_FABRICS=shm:dapl (which is running over your local InfiniBand software stack, likely OFED) works fine when doing mpirun, mpirun -bootstrap=slurm, and mpiexec.hydra.&amp;nbsp; But doing the same with srun causes the "trying to free memory block" errors you see.&lt;/P&gt;

&lt;P&gt;The main difference in all of these cases is the launch mechanism.&amp;nbsp; When using mpirun/mpiexec.hydra, you're relying on the Intel MPI Library to start your job using the underlying SLURM startup method.&amp;nbsp; But in the srun case, you're actually asking SLURM to start your MPI job by pulling in the appropriate Intel MPI libs and scripts.&amp;nbsp; So the issue with srun is that some of the defaults on your system might be set differently as compared to when starting up with mpirun.&lt;/P&gt;

&lt;P&gt;Do you know if your memory limits are set appropriately?&amp;nbsp; Check out &lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/329053#comment-1713129"&gt;this forum thread&lt;/A&gt; which talks about how to set some of these limits.&amp;nbsp; Furthermore, the same error was resolved&amp;nbsp;&lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/370967#comment-1725651"&gt;here&lt;/A&gt; by setting log_num_mtt to 24.&lt;/P&gt;

&lt;P&gt;I hope this helps.&amp;nbsp; Let me know if updating your settings changes the outcome.&lt;/P&gt;

&lt;P&gt;Regards,&lt;BR /&gt;
	~Gergana&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jan 2016 21:55:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079512#M4858</guid>
      <dc:creator>Seunghwa_Kang</dc:creator>
      <dc:date>2016-01-12T21:55:10Z</dc:date>
    </item>
    <item>
      <title>This turned out to be an</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079513#M4859</link>
      <description>&lt;P&gt;This turned out to be an issue with the system I am using, and it is now fixed.&lt;/P&gt;

&lt;P&gt;Thanks for the support!&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Seunghwa Kang wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Thanks Gergana,&lt;/P&gt;

&lt;P&gt;The problem disappeared once I removed the "setenv I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so" line and executed with "srun --mpi=pmi2 ..." instead of "srun ...".&lt;/P&gt;

&lt;P&gt;For OpenMPI, it seems like --mpi=pmi2 should be used if pmi2 is enabled. Is there something similar for Intel MPI?&lt;/P&gt;

&lt;P&gt;"If the pmi2 support is enabled then the command line options '--mpi=pmi2' has to be specified on the srun command line." &amp;lt;=&amp;nbsp;from &lt;A href="http://slurm.schedmd.com/mpi_guide.html#open_mpi"&gt;http://slurm.schedmd.com/mpi_guide.html#open_mpi&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;And I am encountering another problem.&lt;/P&gt;

&lt;P&gt;With srun --mpi=pmi2 on 128 or more nodes (1 MPI process per node; no error message up to 64 nodes), I get "slurmstepd: error: tree_msg_to_stepds: host=g161, rc = 1" in MPI_Init_thread(), but the code seems to work fine. With mpirun or mpiexec, MPI_Init_thread() does not produce any error message, but MPI communication is much slower.&lt;/P&gt;

&lt;P&gt;Any idea?&lt;/P&gt;

&lt;P&gt;Thank you very much!!!&lt;/P&gt;

&lt;P&gt;-seunghwa&lt;/P&gt;

&lt;P&gt;&lt;STRONG class="quote-header"&gt;Quote:&lt;/STRONG&gt;&lt;/P&gt;

&lt;BLOCKQUOTE class="quote-msg quote-nest-1 odd"&gt;
	&lt;DIV class="quote-author"&gt;&lt;EM class="placeholder"&gt;Gergana S. (Intel)&lt;/EM&gt; wrote:&lt;/DIV&gt;

	&lt;P&gt;&amp;nbsp;&lt;/P&gt;

	&lt;P&gt;Hi Seunghwa,&lt;/P&gt;

	&lt;P&gt;Thanks for getting in touch.&amp;nbsp; This is more likely a configuration error than an issue with your application.&amp;nbsp; Although, it's likely your application takes up more memory than the defaults allow.&lt;/P&gt;

	&lt;P&gt;In your case, you're saying that using I_MPI_FABRICS=shm:dapl (which is running over your local InfiniBand software stack, likely OFED) works fine when doing mpirun, mpirun -bootstrap=slurm, and mpiexec.hydra.&amp;nbsp; But doing the same with srun causes the "trying to free memory block" errors you see.&lt;/P&gt;

	&lt;P&gt;The main difference in all of these cases is the launch mechanism.&amp;nbsp; When using mpirun/mpiexec.hydra, you're relying on the Intel MPI Library to start your job using the underlying SLURM startup method.&amp;nbsp; But in the srun case, you're actually asking SLURM to start your MPI job by pulling in the appropriate Intel MPI libs and scripts.&amp;nbsp; So the issue with srun is that some of the defaults on your system might be set differently as compared to when starting up with mpirun.&lt;/P&gt;

	&lt;P&gt;Do you know if your memory limits are set appropriately?&amp;nbsp; Check out &lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/329053#comment-1713129"&gt;this forum thread&lt;/A&gt; which talks about how to set some of these limits.&amp;nbsp; Furthermore, the same error was resolved&amp;nbsp;&lt;A href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/370967#comment-1725651"&gt;here&lt;/A&gt; by setting log_num_mtt to 24.&lt;/P&gt;

	&lt;P&gt;I hope this helps.&amp;nbsp; Let me know if updating your settings changes the outcome.&lt;/P&gt;

	&lt;P&gt;Regards,&lt;BR /&gt;
		~Gergana&lt;/P&gt;

	&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2016 18:40:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/memory-error-occurs-only-with-a-certain-job-launching-method-and/m-p/1079513#M4859</guid>
      <dc:creator>Seunghwa_Kang</dc:creator>
      <dc:date>2016-01-20T18:40:54Z</dc:date>
    </item>
  </channel>
</rss>

