fractureman

Memory registration cache feature in DAPL -> random failure in simple code


  MPI w/ DAPL users ** beware **
  
  In our open source finite element code, we have encountered a simple
  manager-worker code section that fails randomly while moving arrays (blocks)
  of double precision data from worker ranks to manager rank 0.
  
  The failures occur (consistently) with DAPL but never with tcp over IB (IPoIB).
  
  After much effort, the culprit was found to be the memory registration
  cache feature in DAPL.
  
  This feature/bug is ON by default ** even though ** the manual states:
   
   "The cache substantially increases performance, but may lead 
    to correctness issues in certain situations."
    
    From: Intel® MPI Library for Linux OS Developer Reference (2017). pg 95
  
  Once we set this option OFF, the code runs successfully for all test cases over
  large and small numbers of cluster nodes. The DAPL performance is still
  at least 2x better than IPoIB.
  
    export I_MPI_DAPL_TRANSLATION_CACHE=0
  
  Recommendation to Intel MPI group:
  
     Set I_MPI_DAPL_TRANSLATION_CACHE=0 as the DEFAULT. Encourage developers
     to explore setting this option ON ** if ** their code works properly
     with OFF.
     
  Specifics:
  
     - Intel ifort 17.0.2
     - Intel MPI 17.0.2
     - Ohio Supercomputer Center, Owens Cluster.
         RedHat 7.3
         Mellanox EDR (100Gbps) Infiniband
         Broadwell/Haswell cluster nodes.

 Code section that randomly fails:

    -> Blocks are ALLOCATEd with variable size in a Fortran
       derived type (itself also allocated to the number of blocks).
       All blocks on rank 0 are created before the code below is entered.

  sync worker ranks to this point
  
  if rank = 0 then
  
     loop sequentially over all blocks to be moved
        if rank 0 owns block -> next block
        send worker who owns block the block number (MPI_SEND)
        receive block from worker (MPI_RECV)
     end loop
     
     loop over all workers
       send block = 0 to signal we are done moving blocks
     end loop
     
  else ! worker code
  
     loop
        post MPI_RECV to get a block number 
        if block number = 0 -> done
        if worker does not own this block, the manager made an error !
        send manager (rank 0) the entire block -> MPI_SEND
     end loop
     
  end if
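
 For readers who want to step through the message ordering above without an
 MPI installation, here is a minimal sketch of the same handshake in Python.
 It is only a model: in-process queues stand in for MPI_SEND / MPI_RECV, so it
 cannot reproduce the DAPL failure. All names (gather_blocks, worker,
 owner_of, blocks_by_rank) are hypothetical and do not come from our code.
 Block numbers are assumed to start at 1, so that 0 can serve as the "done"
 signal.

```python
import queue
import threading

DONE = 0  # block number 0 tells a worker that no more blocks will be requested


def worker(rank, my_blocks, requests, replies):
    """Worker side: answer block requests until the DONE signal arrives."""
    while True:
        block_no = requests.get()  # stands in for MPI_RECV of a block number
        if block_no == DONE:
            return
        if block_no not in my_blocks:  # the manager asked the wrong rank
            replies.put(LookupError(f"rank {rank} does not own block {block_no}"))
            return
        replies.put(list(my_blocks[block_no]))  # stands in for MPI_SEND of the block


def gather_blocks(owner_of, blocks_by_rank):
    """Manager side (rank 0): request and receive one block at a time."""
    worker_ranks = sorted(r for r in blocks_by_rank if r != 0)
    requests = {r: queue.Queue() for r in worker_ranks}
    replies = {r: queue.Queue() for r in worker_ranks}
    threads = [
        threading.Thread(
            target=worker,
            args=(r, blocks_by_rank[r], requests[r], replies[r]),
            daemon=True,  # do not keep the process alive if the manager fails
        )
        for r in worker_ranks
    ]
    for t in threads:
        t.start()

    gathered = dict(blocks_by_rank.get(0, {}))  # blocks rank 0 already owns
    error = None
    for block_no in sorted(owner_of):  # loop sequentially over all blocks
        owner = owner_of[block_no]
        if owner == 0:
            continue  # rank 0 owns this block -> next block
        requests[owner].put(block_no)  # send the owner the block number
        data = replies[owner].get()  # receive the block from that worker
        if isinstance(data, Exception):
            error = data
            break
        gathered[block_no] = data

    for r in worker_ranks:  # send block = 0 to signal we are done
        requests[r].put(DONE)
    for t in threads:
        t.join()
    if error is not None:
        raise error
    return gathered


if __name__ == "__main__":
    owner_of = {1: 0, 2: 1, 3: 2, 4: 1}
    blocks_by_rank = {
        0: {1: [1.0]},
        1: {2: [2.0, 2.5], 4: [4.0]},
        2: {3: [3.0, 3.5, 3.75]},
    }
    print(gather_blocks(owner_of, blocks_by_rank))
```

 The sketch shows that each request is fully answered before the next one is
 issued, so at most one block is in flight at any time. Only the transfer of
 variable-size ALLOCATEd buffers is left to the MPI layer.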
              
