Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Enforcing MPI progress thread when overlapping communication and computation

nitin_malapally

I've implemented Cannon's algorithm, which performs distributed-memory tensor-matrix multiplication. While doing so, I thought it would be clever to hide the communication latency by overlapping computation and communication.

I then micro-benchmarked the components, i.e. the communication, the computation, and the overlapped communication and computation, and something odd came out of it: the overlapped operation takes about 2.1 times as long as the slower of the two individual operations. Communication alone took 521639 us, computation alone on data of the same size took 340435 us, but overlapping the two took 1111500 us.

After numerous test runs involving independent data buffers, changing the order of the operations in the overlap, and even serializing the overlap, I have come to the conclusion that the problem is caused by MPI's progression.

The following is the desired behaviour:

  1. only the thread identified by COMM_THREAD handles the communication and
  2. all the other threads perform the computation.

If this behaviour can be enforced, I expect the overlapped operation in the above example to take ~521639 us.

Information:

  1. The MPI implementation is Intel MPI, shipped as part of oneAPI v2021.6.0.
  2. A single compute node has 2 sockets of Intel Xeon Platinum 8168 (2 x 24 = 48 cores).
  3. SMT is not used, i.e. each thread is pinned to its own physical core.
  4. Before each experiment, the data is initialized and mapped to the memory (NUMA) nodes required by the computation.
  5. In the given example, the tensor has size N=600, i.e. 600^3 data points. However, the same behaviour was observed for smaller sizes as well.

What I've tried:

  1. Just making asynchronous calls in the overlap:
// ...
#define COMM_THREAD 0
// ...
#pragma omp parallel
{
    if (omp_get_thread_num() == COMM_THREAD)
    {
        // perform the comms.
        auto requests = std::array<MPI_Request, 4>{};
        const auto r1 = MPI_Irecv(tens_recv_buffer_, 2 * tens_recv_buffer_size_, MPI_DOUBLE, src_rank_tens, 2, MPI_COMM_WORLD, &requests[0]);
        const auto s1 = MPI_Isend(tens_send_buffer_, 2 * tens_send_buffer_size_, MPI_DOUBLE, dest_rank_tens, 2, MPI_COMM_WORLD, &requests[1]);
        const auto r2 = MPI_Irecv(mat_recv_buffer_, 2 * mat_recv_buffer_size_, MPI_DOUBLE, src_rank_mat, 3, MPI_COMM_WORLD, &requests[2]);
        const auto s2 = MPI_Isend(mat_send_buffer_, 2 * mat_send_buffer_size_, MPI_DOUBLE, dest_rank_mat, 3, MPI_COMM_WORLD, &requests[3]);
        if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
        {
            throw std::runtime_error("mpi_sendrecv_error");
        }
        if (MPI_SUCCESS != MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE))
        {
            throw std::runtime_error("mpi_waitall_error");
        }
    }
    else
    {
        const auto work_indices = schedule_thread_work(tens_recv_buffer_size_, 1);
        shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
    }
}
  2. Trying manual progression:
// ...
#define COMM_THREAD 0
// ...
#pragma omp parallel
{
    if (omp_get_thread_num() == COMM_THREAD)
    {
        // perform the comms.
        auto requests = std::array<MPI_Request, 4>{};
        const auto r1 = MPI_Irecv(tens_recv_buffer_, 2 * tens_recv_buffer_size_, MPI_DOUBLE, src_rank_tens, 2, MPI_COMM_WORLD, &requests[0]);
        const auto s1 = MPI_Isend(tens_send_buffer_, 2 * tens_send_buffer_size_, MPI_DOUBLE, dest_rank_tens, 2, MPI_COMM_WORLD, &requests[1]);
        const auto r2 = MPI_Irecv(mat_recv_buffer_, 2 * mat_recv_buffer_size_, MPI_DOUBLE, src_rank_mat, 3, MPI_COMM_WORLD, &requests[2]);
        const auto s2 = MPI_Isend(mat_send_buffer_, 2 * mat_send_buffer_size_, MPI_DOUBLE, dest_rank_mat, 3, MPI_COMM_WORLD, &requests[3]);
        if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
        {
            throw std::runtime_error("mpi_sendrecv_error");
        }

        // custom wait-all: poll with MPI_Test so that COMM_THREAD drives progress
        auto comm_done = std::array<int, 4>{0, 0, 0, 0};
        auto all_comm_done = false;
        while (!all_comm_done)
        {
            auto open_comms = 0;
            for (auto request_index = std::size_t{}; request_index < requests.size(); ++request_index)
            {
                if (comm_done[request_index])
                {
                    continue;
                }
                MPI_Test(&requests[request_index], &comm_done[request_index], MPI_STATUS_IGNORE);
                if (!comm_done[request_index])
                {
                    ++open_comms; // count only the requests that are still outstanding
                }
            }
            all_comm_done = open_comms == 0;
        }
    }
    else
    {
        const auto work_indices = schedule_thread_work(tens_recv_buffer_size_, 1);
        shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
    }
}
  3. Using the environment variables mentioned here: https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-async-progress-control.html in my job-script:
export I_MPI_ASYNC_PROGRESS=1 I_MPI_ASYNC_PROGRESS_THREADS=1 I_MPI_ASYNC_PROGRESS_PIN="0"

    and then running the code in variant 1.

 

All of the above attempts have resulted in the same undesirable behaviour.

 

Question: How can I force only COMM_THREAD to participate in MPI progression?

 

Any thoughts, suggestions, speculations and ideas will be greatly appreciated. Thanks in advance.

 

Notes:

1. Although the buffers tens_send_buffer and mat_send_buffer are accessed concurrently during the overlap, the access is read-only.

2. The function schedule_thread_work statically schedules the computational work round-robin over all threads except COMM_THREAD (a sketch of such a scheduler follows below).
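
Since schedule_thread_work itself is not shown in the post, the following is a minimal, purely illustrative sketch of such a round-robin scheduler; the return type, the treatment of the second parameter and the exact thread-to-worker mapping are assumptions, not the original implementation.

// Hypothetical sketch only -- the real schedule_thread_work is not part of the post.
// Assumes it is called from inside the parallel region with more than one thread.
#include <cstddef>
#include <vector>
#include <omp.h>

#define COMM_THREAD 0

// Returns the indices of the work items assigned to the calling thread.
// Worker w handles items w, w + n_workers, w + 2 * n_workers, ... so that the
// computation is spread round-robin over every thread except COMM_THREAD.
inline std::vector<std::size_t> schedule_thread_work(std::size_t total_work, std::size_t /*chunk_size*/)
{
    const auto thread_num = omp_get_thread_num();
    const auto n_workers = static_cast<std::size_t>(omp_get_num_threads() - 1);

    auto work_indices = std::vector<std::size_t>{};
    if (thread_num == COMM_THREAD)
    {
        return work_indices; // the communication thread does no computation
    }

    // map thread numbers {0, ..., n_threads - 1} minus COMM_THREAD onto worker ids {0, ..., n_workers - 1}
    const auto worker_id = static_cast<std::size_t>(thread_num > COMM_THREAD ? thread_num - 1 : thread_num);

    for (auto item = worker_id; item < total_work; item += n_workers)
    {
        work_indices.push_back(item);
    }
    return work_indices;
}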

HemanthCH_Intel
Moderator

Hi,


Thanks for posting in Intel communities.


Could you please provide the details below so that we can investigate your issue further?

1) OS details.

2) A complete reproducer and the steps to reproduce your issue.

3) Are you running the program on a single node or on multiple nodes?

4) How are you measuring the run time of your program?


Thanks & Regards,

Hemanth


nitin_malapally

Dear Hemanth,

 

Thanks for your reply.

 

1. The OS is Rocky Linux 8.5.

2. Unfortunately, I cannot provide a minimal reproducer.

3. I am running the program on 8 nodes.

4. I am measuring the time using this library, which uses a monotonic clock: https://gitlab.com/anxiousprogrammer/tixl

 

Alternatively, it would be very helpful to know the following information:

1. Which threads participate in MPI progression in an OpenMP parallel region when

    a. I initialize with MPI_Init_thread and MPI_THREAD_FUNNELED, and

    b. I initialize with MPI_Init_thread and MPI_THREAD_SERIALIZED? (Both variants are sketched below.)

2. Can I ensure that only one thread participates in progression?
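
For reference, a minimal sketch of the two initialization variants in question, together with a check of the threading level the library actually granted (standard MPI calls only):

// Sketch of MPI_Init_thread with MPI_THREAD_FUNNELED / MPI_THREAD_SERIALIZED
// and a query of the level that was actually provided.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv)
{
    int provided = MPI_THREAD_SINGLE;

    // variant (a): only the main thread will make MPI calls
    // MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    // variant (b): any thread may make MPI calls, but never concurrently
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

    // the provided level may differ from the requested one, so check it
    int level = MPI_THREAD_SINGLE;
    MPI_Query_thread(&level);
    std::printf("provided threading level: %d\n", level);

    MPI_Finalize();
    return 0;
}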

 

Thanks in advance,

Best Regards,

Nitin

HemanthCH_Intel
Moderator

Hi,


Could you please follow the MPI thread-split model and use MPI_THREAD_MULTIPLE to improve the performance? For more information, refer to the link below:

 https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/multiple-endpoints-support/mpi-thread-split-programming-model.html
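
As a point of reference, the pattern underlying the thread-split model is one communicator per communicating thread under MPI_THREAD_MULTIPLE. The sketch below uses standard MPI calls only; the Intel-specific environment settings and hints for the thread-split model are described in the linked documentation and are not reproduced here.

// Generic sketch: one duplicated communicator per thread under MPI_THREAD_MULTIPLE.
#include <vector>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv)
{
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    const int n_threads = omp_get_max_threads();

    // one communicator per thread, so that traffic issued by different threads
    // is matched and progressed independently
    std::vector<MPI_Comm> comms(n_threads);
    for (int t = 0; t < n_threads; ++t)
    {
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);
    }

    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        (void)t; // each thread would post its sends/receives and waits on comms[t]
    }

    for (int t = 0; t < n_threads; ++t)
    {
        MPI_Comm_free(&comms[t]);
    }
    MPI_Finalize();
    return 0;
}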


Thanks & Regards

Hemanth


HemanthCH_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide an update on your issue?


Thanks & Regards

Hemanth


nitin_malapally

Dear Hemanth,

 

Using tracing software, I was able to confirm that only the calling thread participates in MPI progression when I_MPI_ASYNC_PROGRESS=0.

 

As for the problem behaviour described above, the bug turned out to be in the benchmarking itself: in the overlap benchmark I was unfortunately not initializing all of the test data properly, so I was also measuring the time taken by the page faults on part of the test data. Having fixed this, the overlap takes approximately as long as the communication alone, which is the desired outcome.
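
For completeness, the fix amounts to touching every page of the benchmark buffers before the timed region. A minimal sketch of such a first-touch initialization (the buffer type and function name are placeholders, not the original code):

// Allocate without value-initialization, then touch every element in parallel so
// that the page faults (and the NUMA placement) happen before the timed region.
#include <cstddef>
#include <memory>

std::unique_ptr<double[]> allocate_and_first_touch(std::size_t n)
{
    auto buffer = std::unique_ptr<double[]>{new double[n]}; // pages not yet faulted
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
    {
        buffer[i] = 0.0; // first touch: the page is committed on the touching thread's node
    }
    return buffer;
}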

 

Thanks very much for your time and interest. Please excuse my delayed reply.

 

Best Regards,

Nitin

HemanthCH_Intel
Moderator

Hi,


Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Hemanth

