<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:Enforcing MPI progress thread when overlapping communication and computation in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1408505#M9760</link>
<description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for posting in Intel communities.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Could you please provide the details below so that we can investigate your issue further?&lt;/P&gt;&lt;P&gt;1) OS details.&lt;/P&gt;&lt;P&gt;2) Complete reproducer code and the steps to reproduce your issue.&lt;/P&gt;&lt;P&gt;3) Are you running the program on a single node or on multiple nodes?&lt;/P&gt;&lt;P&gt;4) How are you measuring the runtime of your program?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Hemanth&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Tue, 16 Aug 2022 10:26:36 GMT</pubDate>
    <dc:creator>HemanthCH_Intel</dc:creator>
    <dc:date>2022-08-16T10:26:36Z</dc:date>
    <item>
      <title>Enforcing MPI progress thread when overlapping communication and computation</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1408118#M9759</link>
      <description>&lt;P&gt;I've implemented Cannon's algorithm which performs distributed memory tensor-matrix multiplication. During this, I thought it would be clever to hide communication latencies by overlapping computation and communication.&lt;/P&gt;
&lt;P&gt;I then started micro-benchmarking the components, i.e. the communication, the computation and the overlapped communication and computation, and something odd came out of it: the overlapped operation takes &lt;EM&gt;2.1 times&lt;/EM&gt; as long as the longer of the two individual operations. Sending a single message alone took 521639 us and computing on the (same-sized) data alone took 340435 us, but overlapping the two took 1111500 us.&lt;/P&gt;
&lt;P&gt;After numerous test-runs involving independent data buffers, changing the order of the operations in the overlap and even serializing the overlap, I have come to the conclusion that the problem is being caused by MPI's progression.&lt;/P&gt;
&lt;P&gt;The following is the desired behaviour:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;only&lt;/STRONG&gt; the thread identified by &lt;EM&gt;COMM_THREAD&lt;/EM&gt; handles the communication and&lt;/LI&gt;
&lt;LI&gt;all the other threads perform the computation.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;If the above behaviour can be enforced, I would expect the overlapped operation in the example above to take ~521639 us.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Information&lt;/STRONG&gt;:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;The MPI implementation is Intel MPI, shipped as part of oneAPI v2021.6.0.&lt;/LI&gt;
&lt;LI&gt;A single compute node has 2 sockets of Intel Xeon Platinum 8168 (2x 24 = 48 cores).&lt;/LI&gt;
&lt;LI&gt;SMT is not used, i.e. each thread is pinned to its own physical core.&lt;/LI&gt;
&lt;LI&gt;Before each experiment, the data is initialized and mapped to the memory nodes required by the computation.&lt;/LI&gt;
&lt;LI&gt;In the given example, the tensor is sized N=600, i.e. it has 600^3 data points. However, the same behaviour was observed for smaller sizes as well.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;STRONG&gt;What I've tried&lt;/STRONG&gt;:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Just making asynchronous calls in the overlap:&lt;/LI&gt;
&lt;/OL&gt;
&lt;PRE&gt;&lt;CODE&gt;// ...
#define COMM_THREAD 0
// ...
#pragma omp parallel
{
    if (omp_get_thread_num() == COMM_THREAD)
    {
        // perform the comms.
        auto requests = std::array&amp;lt;MPI_Request, 4&amp;gt;{};
        const auto r1 = MPI_Irecv(tens_recv_buffer_, 2 * tens_recv_buffer_size_, MPI_DOUBLE, src_rank_tens, 2, MPI_COMM_WORLD, &amp;amp;requests[0]);
        const auto s1 = MPI_Isend(tens_send_buffer_, 2 * tens_send_buffer_size_, MPI_DOUBLE, dest_rank_tens, 2, MPI_COMM_WORLD, &amp;amp;requests[1]);
        const auto r2 = MPI_Irecv(mat_recv_buffer_, 2 * mat_recv_buffer_size_, MPI_DOUBLE, src_rank_mat, 3, MPI_COMM_WORLD, &amp;amp;requests[2]);
        const auto s2 = MPI_Isend(mat_send_buffer_, 2 * mat_send_buffer_size_, MPI_DOUBLE, dest_rank_mat, 3, MPI_COMM_WORLD, &amp;amp;requests[3]);
        if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
        {
            throw std::runtime_error("mpi_sendrecv_error");
        }
        if (MPI_SUCCESS != MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE))
        {
            throw std::runtime_error("mpi_waitall_error");
        }
    }
    else
    {
        const auto work_indices = schedule_thread_work(tens_recv_buffer_size_, 1);
        shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
    }
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;Trying manual progression:&lt;/LI&gt;
&lt;/OL&gt;
&lt;PRE&gt;&lt;CODE&gt;// ...
#define COMM_THREAD 0
// ...
#pragma omp parallel
{
    if (omp_get_thread_num() == COMM_THREAD)
    {
        // perform the comms.
        auto requests = std::array&amp;lt;MPI_Request, 4&amp;gt;{};
        const auto r1 = MPI_Irecv(tens_recv_buffer_, 2 * tens_recv_buffer_size_, MPI_DOUBLE, src_rank_tens, 2, MPI_COMM_WORLD, &amp;amp;requests[0]);
        const auto s1 = MPI_Isend(tens_send_buffer_, 2 * tens_send_buffer_size_, MPI_DOUBLE, dest_rank_tens, 2, MPI_COMM_WORLD, &amp;amp;requests[1]);
        const auto r2 = MPI_Irecv(mat_recv_buffer_, 2 * mat_recv_buffer_size_, MPI_DOUBLE, src_rank_mat, 3, MPI_COMM_WORLD, &amp;amp;requests[2]);
        const auto s2 = MPI_Isend(mat_send_buffer_, 2 * mat_send_buffer_size_, MPI_DOUBLE, dest_rank_mat, 3, MPI_COMM_WORLD, &amp;amp;requests[3]);
        if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
        {
            throw std::runtime_error("mpi_sendrecv_error");
        }

        // custom wait-all to ensure COMM_THREAD makes progress happen
        auto comm_done = std::array&amp;lt;int, 4&amp;gt;{0, 0, 0, 0};
        auto all_comm_done = false;
        while (!all_comm_done)
        {
            auto open_comms = 0;
            for (auto request_index = std::size_t{}; request_index &amp;lt; requests.size(); ++request_index)
            {
                if (comm_done[request_index])
                {
                    continue;
                }
                MPI_Test(&amp;amp;requests[request_index], &amp;amp;comm_done[request_index], MPI_STATUS_IGNORE);
                ++open_comms;
            }
            all_comm_done = open_comms == 0;
        }
    }
    else
    {
        const auto work_indices = schedule_thread_work(tens_recv_buffer_size_, 1);
        shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
    }
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;OL start="3"&gt;
&lt;LI&gt;Using the environment variables mentioned here: &lt;A href="https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-async-progress-control.html" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-async-progress-control.html&lt;/A&gt; in my job-script:&lt;/LI&gt;
&lt;/OL&gt;
&lt;PRE&gt;&lt;CODE&gt;export I_MPI_ASYNC_PROGRESS=1 I_MPI_ASYNC_PROGRESS_THREADS=1 I_MPI_ASYNC_PROGRESS_PIN="0"&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; and then running the code in variant 1.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;All of the above attempts have resulted in the same undesirable behaviour.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Question&lt;/STRONG&gt;: How can I force only &lt;EM&gt;COMM_THREAD&lt;/EM&gt; to participate in MPI progression?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Any thoughts, suggestions, speculations and ideas will be greatly appreciated. Thanks in advance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Notes&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;1. Although the buffers &lt;EM&gt;tens_send_buffer&lt;/EM&gt; and &lt;EM&gt;mat_send_buffer&lt;/EM&gt; are accessed concurrently during the overlap, this access is read-only.&lt;/P&gt;
&lt;P&gt;2. The function &lt;EM&gt;schedule_thread_work&lt;/EM&gt; performs static round-robin scheduling of the computation work across all threads except &lt;EM&gt;COMM_THREAD&lt;/EM&gt;.&lt;/P&gt;</description>
      <pubDate>Sun, 14 Aug 2022 13:51:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1408118#M9759</guid>
      <dc:creator>nitin_malapally</dc:creator>
      <dc:date>2022-08-14T13:51:36Z</dc:date>
    </item>
    <item>
      <title>Re:Enforcing MPI progress thread when overlapping communication and computation</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1408505#M9760</link>
<description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for posting in Intel communities.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Could you please provide the details below so that we can investigate your issue further?&lt;/P&gt;&lt;P&gt;1) OS details.&lt;/P&gt;&lt;P&gt;2) Complete reproducer code and the steps to reproduce your issue.&lt;/P&gt;&lt;P&gt;3) Are you running the program on a single node or on multiple nodes?&lt;/P&gt;&lt;P&gt;4) How are you measuring the runtime of your program?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Hemanth&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 16 Aug 2022 10:26:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1408505#M9760</guid>
      <dc:creator>HemanthCH_Intel</dc:creator>
      <dc:date>2022-08-16T10:26:36Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Enforcing MPI progress thread when overlapping communication and computation</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1408508#M9761</link>
      <description>&lt;P&gt;Dear Hemanth,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks for your reply.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. The OS is Rocky Linux 8.5.&lt;/P&gt;
&lt;P&gt;2. Unfortunately, I cannot provide a minimal reproducer.&lt;/P&gt;
&lt;P&gt;3. I am running the program on 8 nodes.&lt;/P&gt;
&lt;P&gt;4. I am calculating the time using this library &lt;A href="https://gitlab.com/anxiousprogrammer/tixl" target="_blank"&gt;https://gitlab.com/anxiousprogrammer/tixl&lt;/A&gt; which uses a monotonic clock.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Alternatively, it would be very helpful to know the following information:&lt;/P&gt;
&lt;P&gt;1. Which threads participate in MPI progression in an OpenMP parallel region when&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; a. I initialize with &lt;EM&gt;MPI_Init_thread&lt;/EM&gt; with MPI_THREAD_FUNNELED&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; b. I initialize with &lt;EM&gt;MPI_Init_thread&lt;/EM&gt; with MPI_THREAD_SERIALIZED? (A sketch of this initialization follows below.)&lt;/P&gt;
&lt;P&gt;2. Can I ensure that only one thread participates in progression?&lt;/P&gt;
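&lt;P&gt;For reference, the initialization mentioned in question 1 looks like the following minimal sketch (the &lt;EM&gt;provided&lt;/EM&gt; check is illustrative, not taken from my actual code):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;#include &amp;lt;mpi.h&amp;gt;
#include &amp;lt;stdexcept&amp;gt;

int main(int argc, char** argv)
{
    // Request a threading level and check what the library actually grants:
    // MPI_THREAD_FUNNELED   - only the main thread will make MPI calls,
    // MPI_THREAD_SERIALIZED - any thread may call MPI, but never concurrently.
    auto provided = int{};
    MPI_Init_thread(&amp;amp;argc, &amp;amp;argv, MPI_THREAD_FUNNELED, &amp;amp;provided);
    if (provided &amp;lt; MPI_THREAD_FUNNELED)
    {
        throw std::runtime_error("insufficient_mpi_thread_support");
    }

    // ... OpenMP parallel region with COMM_THREAD, as in the original post ...

    MPI_Finalize();
    return 0;
}&lt;/CODE&gt;&lt;/PRE&gt;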
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks in advance,&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;/P&gt;
&lt;P&gt;Nitin&lt;/P&gt;</description>
      <pubDate>Tue, 16 Aug 2022 10:39:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1408508#M9761</guid>
      <dc:creator>nitin_malapally</dc:creator>
      <dc:date>2022-08-16T10:39:28Z</dc:date>
    </item>
    <item>
      <title>Re:Enforcing MPI progress thread when overlapping communication and computation</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1409835#M9776</link>
<description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Could you please follow the MPI thread-split model and use&amp;nbsp;MPI_THREAD_MULTIPLE to improve performance? For more information, refer to the link below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;A href="https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/multiple-endpoints-support/mpi-thread-split-programming-model.html" target="_blank"&gt;https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/multiple-endpoints-support/mpi-thread-split-programming-model.html&lt;/A&gt;&lt;/P&gt;
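&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;A minimal sketch of that pattern (an illustration based on the linked guide, not code taken from it; it assumes the I_MPI_THREAD_SPLIT=1 runtime setting described there, and a real program would also verify that &lt;EM&gt;provided&lt;/EM&gt; equals MPI_THREAD_MULTIPLE):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Sketch: in the thread-split model each thread communicates on its own
// communicator, so transfers can progress concurrently across threads.
#include &amp;lt;mpi.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;
#include &amp;lt;vector&amp;gt;

int main(int argc, char** argv)
{
    auto provided = int{};
    MPI_Init_thread(&amp;amp;argc, &amp;amp;argv, MPI_THREAD_MULTIPLE, &amp;amp;provided);

    // One duplicated communicator per OpenMP thread: threads are
    // distinguished by the communicator they use.
    auto comms = std::vector&amp;lt;MPI_Comm&amp;gt;(omp_get_max_threads());
    for (auto&amp;amp; comm : comms)
    {
        MPI_Comm_dup(MPI_COMM_WORLD, &amp;amp;comm);
    }

    #pragma omp parallel
    {
        const auto tid = omp_get_thread_num();
        // Post and complete this thread's transfers on comms[tid], e.g.
        // MPI_Isend / MPI_Irecv / MPI_Waitall with comms[tid] as communicator.
        (void)tid;
    }

    for (auto&amp;amp; comm : comms)
    {
        MPI_Comm_free(&amp;amp;comm);
    }
    MPI_Finalize();
    return 0;
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards&lt;/P&gt;&lt;P&gt;Hemanth&lt;/P&gt;&lt;BR /&gt;</description>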
      <pubDate>Mon, 22 Aug 2022 10:18:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1409835#M9776</guid>
      <dc:creator>HemanthCH_Intel</dc:creator>
      <dc:date>2022-08-22T10:18:41Z</dc:date>
    </item>
    <item>
      <title>Re:Enforcing MPI progress thread when overlapping communication and computation</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1411653#M9802</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We haven't heard back from you. Could you please provide an update on your issue?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards&lt;/P&gt;&lt;P&gt;Hemanth&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 30 Aug 2022 12:23:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1411653#M9802</guid>
      <dc:creator>HemanthCH_Intel</dc:creator>
      <dc:date>2022-08-30T12:23:48Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Enforcing MPI progress thread when overlapping communication and computation</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1411666#M9803</link>
      <description>&lt;P&gt;Dear Hemanth,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Using tracing software, I was able to confirm that only the calling thread participates in MPI progression when &lt;SPAN class="ph codeph"&gt;I_MPI_ASYNC_PROGRESS&lt;/SPAN&gt;=0.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As for the problem behaviour stated above, the bug turned out to be in the benchmarking itself: in the overlap benchmark I was unfortunately not initializing all of the test data, so the measurement also included the page faults incurred on first touch of that data. Having fixed this, the overlap takes approximately as long as the communication alone, which is the desired outcome.&lt;/P&gt;
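&lt;P&gt;For anyone who runs into the same pitfall, the fix amounts to touching (initializing) every page of each test buffer before the timed region; a minimal sketch with illustrative names:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;#include &amp;lt;vector&amp;gt;

// Touch every element so the pages are faulted in (and, via the first-touch
// policy, placed on the NUMA node of the thread that will later use them)
// before the benchmark timer starts.
void warm_up(std::vector&amp;lt;double&amp;gt;&amp;amp; buffer)
{
    #pragma omp parallel for schedule(static)
    for (auto i = std::size_t{}; i &amp;lt; buffer.size(); ++i)
    {
        buffer[i] = 0.0;
    }
}

// usage (illustrative):
//   warm_up(tens_recv_buffer);
//   warm_up(mat_recv_buffer);
//   const auto start = /* monotonic clock */;  // only now start timing
&lt;/CODE&gt;&lt;/PRE&gt;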
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks very much for your time and interest. Please excuse my delayed reply.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;/P&gt;
&lt;P&gt;Nitin&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2022 13:32:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1411666#M9803</guid>
      <dc:creator>nitin_malapally</dc:creator>
      <dc:date>2022-08-30T13:32:39Z</dc:date>
    </item>
    <item>
      <title>Re:Enforcing MPI progress thread when overlapping communication and computation</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1412426#M9809</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Hemanth&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 02 Sep 2022 13:13:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Enforcing-MPI-progress-thread-when-overlapping-communication-and/m-p/1412426#M9809</guid>
      <dc:creator>HemanthCH_Intel</dc:creator>
      <dc:date>2022-09-02T13:13:47Z</dc:date>
    </item>
  </channel>
</rss>