Greetings,
While working with RMA operations on an HPC cluster, I noticed that Intel MPI fails to ensure progress for certain RMA operations (here, MPI_Fetch_and_op). In particular, if rank 0 performs an RMA operation with rank 1 as the target, rank 1 may spin forever waiting for the operation to complete. Forcing progress through an extraneous MPI call ensures completion, but I believe that requiring such a call violates the MPI standard.
A minimal example:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv); // Initialize MPI

    int rank, nproc;
    // Get MPI rank and world size
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int *rma_memory; // RMA memory (to be allocated)
    MPI_Win rma_window;
    MPI_Win_allocate(sizeof(int), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &rma_memory, &rma_window);

    // Get and display memory model for window
    int *memory_model, flag;
    MPI_Win_get_attr(rma_window, MPI_WIN_MODEL, &memory_model, &flag);
    if (*memory_model == MPI_WIN_UNIFIED) {
        printf("Rank %d created RMA window with the unified memory model\n", rank);
    } else if (*memory_model == MPI_WIN_SEPARATE) {
        printf("Rank %d created RMA window with the separate memory model\n", rank);
    } else {
        printf("Rank %d created RMA window with an unknown memory model(???)\n", rank);
    }

    *rma_memory = 0; // Initialize to zero

    // All processes will lock the window
    MPI_Win_lock_all(MPI_MODE_NOCHECK, rma_window);

    if (rank == 0) {
        // Rank 0: wait for rank 1 to enter its spinlock, then use MPI_Fetch_and_op to
        // increment *rma_memory at rank 1

        // Receive zero-byte message indicating that rank 1 is ready to enter its spinlock
        MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Wait a further 0.1s so that rank 1 should have assuredly completed any
        // progress-making MPI calls
        double tic = MPI_Wtime();
        while (MPI_Wtime() - tic < 0.1);
        tic = MPI_Wtime(); // Reset tic value to account for delay

        // Perform fetch-and-op
        int one = 1;
        int result = -1;
        MPI_Fetch_and_op(&one, &result, MPI_INT, 1, 0, MPI_SUM, rma_window);

        // Flush the window to ensure completion
        MPI_Win_flush_all(rma_window);
        printf("Rank 0: sent %d, received %d (should be 0) in %.2fms\n",
               one, result, (MPI_Wtime() - tic) * 1e3);
    } else if (rank == 1) {
        // Rank 1: Send a message to rank 0 indicating readiness for Fetch_and_op
        MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        double tic = MPI_Wtime();

        // Spinlock waiting for '1' to be written to the RMA window
        while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
            // Separate memory model: synchronize remote and local copies of window
            // Unified memory model: memory barrier
            MPI_Win_sync(rma_window);
        }
        int old_value = *rma_memory;
        printf("Rank 1: Memory value %d (should be 1) in %.2fms\n",
               old_value, 1e3 * (MPI_Wtime() - tic - 0.1));

        // Demonstrate forced progress
        MPI_Win_flush(1, rma_window); // Should be a no-op, since there are no pending RMA ops from this rank
        MPI_Win_sync(rma_window);
        if (old_value != *rma_memory) {
            printf("Rank 1: After flush, memory value is now %d (should be 1)\n", *rma_memory);
        }
    }

    MPI_Win_unlock_all(rma_window);
    MPI_Win_free(&rma_window);
    MPI_Finalize();
    return 0;
}
Executing this across two cluster nodes (one rank per node) gives:
$ mpicc --version -diag-disable 10441
icc (ICC) 2021.8.0 20221119
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
$ mpicc -diag-disable 10441 progress.c
$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.8 Build 20221129 (id: 339ec755a1)
Copyright 2003-2022, Intel Corporation.
$ mpirun -np 2 -ppn 1 ./a.out
Rank 0 created RMA window with the unified memory model
Rank 1 created RMA window with the unified memory model
Rank 1: Memory value 0 (should be 1) in 4900.00ms
Rank 1: After flush, memory value is now 1 (should be 1)
Rank 0: sent 1, received 0 (should be 0) in 4903.32ms
When run on the same cluster node, this code completes successfully (with rank 1 reading the value 1) with the RMA portion taking a fraction of a millisecond (presumably because of the use of shared memory).
I believe that this behaviour violates the MPI 3.1 specification at §11.7.3 regarding progress. Lines 18-19 state:
One-sided communication has the same progress requirements as point-to-point communication: once a communication is enabled it is guaranteed to complete.
… although the standard is equivocal about whether true asynchronous progress is required. However, lines 28-48 do state that progress is made during MPI calls; the call to MPI_Win_sync should therefore engage the progress engine.
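For reference, the kind of workaround I have in mind (beyond the MPI_Win_flush already shown in the example) is to drive the progress engine explicitly from inside rank 1's spin loop with some unrelated MPI call. The choice of MPI_Iprobe below is mine and only a sketch, but I would expect it to behave like the flush:

    int probe_flag;
    while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
        MPI_Win_sync(rma_window); // memory barrier / window synchronization, as before
        // Extraneous call whose only purpose is to enter the MPI progress engine
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &probe_flag, MPI_STATUS_IGNORE);
    }

Needing this kind of extra call is precisely what I believe the standard does not intend.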
Hi,
Thanks for posting in the Intel forum.
Could you please provide us with the following details?
1. OS
2. Output of the lscpu command
>>When run on the same cluster node, this code completes successfully (with rank 1 reading the value 1) with the RMA portion taking a fraction of a millisecond (presumably because of the use of shared memory).
We are also able to run the code successfully on a single node.
Thanks & Regards
Shivani
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.3 (Ootpa)
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 2
Core(s) per socket: 40
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
Stepping: 6
CPU MHz: 2984.771
CPU max MHz: 2301.0000
CPU min MHz: 800.0000
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 61440K
NUMA node0 CPU(s): 0-39,80-119
NUMA node1 CPU(s): 40-79,120-159
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid md_clear pconfig flush_l1d arch_capabilities
In addition, it looks like I can reproduce the problem on a single node if I disable the shared memory transport:
$ mpirun -genv 'I_MPI_SHM=off' -np 2 ./a.out
Rank 0 created RMA window with the unified memory model
Rank 1 created RMA window with the unified memory model
Rank 1: Memory value 0 (should be 1) in 4900.00ms
Rank 1: After flush, memory value is now 1 (should be 1)
Rank 0: sent 1, received 0 (should be 0) in 4900.14ms
Hi,
We are working on it and will get back to you.
Thanks & Regards
Shivani
Hi,
Thanks for your patience.
>>>" The call to MPI_Win_sync should therefore engage the progress engine"
As the memory is allocated in a unified memory model, MPI_Win_sync is implemented as a memory barrier not triggering the MPI progress engine.
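To illustrate the distinction (a sketch only, not the actual library source), under the unified memory model a conforming implementation may reduce MPI_Win_sync to little more than a local memory fence, which orders loads and stores but performs no communication:

    #include <mpi.h>        // MPI_Win, MPI_SUCCESS
    #include <stdatomic.h>  // atomic_thread_fence

    // Sketch of a unified-model MPI_Win_sync: a fence, with no progress-engine entry
    static int win_sync_unified_sketch(MPI_Win win) {
        (void)win;                                 // window argument unused in this sketch
        atomic_thread_fence(memory_order_seq_cst); // compiler + CPU memory barrier only
        return MPI_SUCCESS;
    }

This is why the spin loop on rank 1 does not observe the remote update until some other MPI call drives the communication.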
Thanks & Regards
Shivani
Hi,
As we did not hear back from you, could you please let us know whether the above information was helpful?
Thanks & Regards
Shivani
Hi,
The difference in output between single-node and multi-node runs is caused by the different transports. As you noticed, if the shared memory transport is disabled on a single node, the behavior is the same as on two nodes. The shared memory transport is the only transport capable of 'true' passive one-sided progress.
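If asynchronous progress is needed when the shared memory transport is not in use, the Intel MPI Library also offers asynchronous progress threads via the I_MPI_ASYNC_PROGRESS environment variable. I cannot promise that it changes the passive-target RMA behaviour observed here, but it may be worth trying, for example:
$ mpirun -genv I_MPI_ASYNC_PROGRESS=1 -np 2 -ppn 1 ./a.out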
I have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks & Regards
Shivani
Drat, I hate how often my work e-mail sends the Intel forum notices to spam.
This response is a bit disappointing, but it means my complaint now rises to the level of a specification ambiguity.
