Intel® MPI Library

MPI_BCAST vs MPI_PUT/MPI_GET

Pierpaolo_M_
New Contributor I

Hi,

I am trying to explore one-sided communication using Intel MPI (version 5.0.3.048, ifort version 15.0.2 20150121).

I have a cluster of 4 nodes (8 cores/node), and on each node only one rank generates a big array. This array then has to be copied to every other rank on the same node.

With MPI_BCAST I use this code:

      PROGRAM MAPS
      USE MPI
      IMPLICIT NONE
      INTEGER, PARAMETER :: n=100000000
      INTEGER, PARAMETER :: cores=8
      REAL, DIMENSION(n) :: B
      INTEGER :: ierr,world_rank,world_size,i,j
      INTEGER :: rank2,comm2,size2
      
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD,world_rank,ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD,world_size,ierr)
!     split MPI_COMM_WORLD into one shared-memory communicator per node
      CALL MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,0,  &
     &                          MPI_INFO_NULL,comm2,ierr)
     
      CALL MPI_COMM_RANK(comm2,rank2,ierr)
      CALL MPI_COMM_SIZE(comm2,size2,ierr)
      
      DO j=1,100
      
        IF(rank2 == 0)THEN
          DO i =1,n
            B(i)=FLOAT(i)*FLOAT(j)
          END DO
        END IF
        CALL MPI_BCAST(B,n,MPI_REAL,0,comm2,ierr)
        
      END DO
      
      CALL MPI_FINALIZE(ierr)
      
      END PROGRAM MAPS

With one-sided communication I use, for example, this code:

      PROGRAM MAPS
      USE MPI
      IMPLICIT NONE
      INTEGER, PARAMETER :: n=100000000
      INTEGER, PARAMETER :: cores=8
      REAL, DIMENSION(n) :: B
      INTEGER :: disp_int,win,ierr,world_rank,world_size,i,j,k
      INTEGER :: rank2,comm2,size2
      LOGICAL :: flag
      INTEGER (KIND=MPI_ADDRESS_KIND) :: lowerbound,size,realextent
!     MPI_WIN_GET_ATTR returns attribute values as address-kind integers
      INTEGER (KIND=MPI_ADDRESS_KIND) :: disp_aint,memory_model
      
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD,world_rank,ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD,world_size,ierr)
      CALL MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,0,  &
     &                          MPI_INFO_NULL,comm2,ierr)
     
      CALL MPI_COMM_RANK(comm2,rank2,ierr)
      CALL MPI_COMM_SIZE(comm2,size2,ierr)
      
      CALL MPI_TYPE_GET_EXTENT(MPI_REAL,lowerbound,realextent,ierr)
      disp_int=realextent
      size=n*realextent
      CALL MPI_WIN_CREATE(B,size,disp_int,MPI_INFO_NULL,comm2,win,ierr)
      CALL MPI_WIN_GET_ATTR(win,MPI_WIN_MODEL,memory_model,flag,ierr)

      disp_aint=0
      
      DO k=1,100
      
        IF(rank2 == 0)THEN
          DO i =1,n
            B(i)=FLOAT(i)*FLOAT(k)
          END DO
        END IF
        
!       open the access/exposure epoch on all ranks of the node
        CALL MPI_WIN_FENCE(0,win,ierr)

!       every rank except 0 reads the whole array from rank 0's window
        IF(rank2 /= 0)THEN
          CALL MPI_GET(B,n,MPI_REAL,0,disp_aint,n,MPI_REAL,win,ierr)
        END IF

!       close the epoch; the gets are complete after this fence
        CALL MPI_WIN_FENCE(0,win,ierr)
        
      END DO
      
      CALL MPI_WIN_FREE(win,ierr)
      CALL MPI_FINALIZE(ierr)
      
      END PROGRAM MAPS
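
For reference, the MPI_PUT direction would just reverse the transfer: rank 0 pushes its array into every other rank's window inside the same fence epoch. A sketch of that inner loop only, with the same declarations and window setup as above:

      DO k=1,100

        IF(rank2 == 0)THEN
          DO i =1,n
            B(i)=FLOAT(i)*FLOAT(k)
          END DO
        END IF

        CALL MPI_WIN_FENCE(0,win,ierr)

!       rank 0 writes its copy of B into the window of every other
!       rank of the node communicator
        IF(rank2 == 0)THEN
          DO j=1,size2-1
            CALL MPI_PUT(B,n,MPI_REAL,j,disp_aint,n,MPI_REAL,win,ierr)
          END DO
        END IF

        CALL MPI_WIN_FENCE(0,win,ierr)

      END DO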

I also tried MPI_WIN_POST/START/COMPLETE/WAIT and MPI_WIN_LOCK/MPI_WIN_UNLOCK, but with the same performance.
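
For example, the lock-based variant looks roughly like this (again only a sketch, same declarations and window as above):

      DO k=1,100

        IF(rank2 == 0)THEN
!         exclusive lock on the local window while rank 0 updates B,
!         so the update becomes visible to later remote gets
          CALL MPI_WIN_LOCK(MPI_LOCK_EXCLUSIVE,0,0,win,ierr)
          DO i =1,n
            B(i)=FLOAT(i)*FLOAT(k)
          END DO
          CALL MPI_WIN_UNLOCK(0,win,ierr)
        END IF

!       make sure rank 0 has finished before anyone reads
        CALL MPI_BARRIER(comm2,ierr)

        IF(rank2 /= 0)THEN
!         passive-target epoch: shared lock on rank 0, get, unlock
          CALL MPI_WIN_LOCK(MPI_LOCK_SHARED,0,0,win,ierr)
          CALL MPI_GET(B,n,MPI_REAL,0,disp_aint,n,MPI_REAL,win,ierr)
          CALL MPI_WIN_UNLOCK(0,win,ierr)
        END IF

!       keep rank 0 from starting the next update too early
        CALL MPI_BARRIER(comm2,ierr)

      END DO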

I compile both of them in this way:

mpiifort -O0 -g -debug inline-debug-info bcast.f90
mpiifort -O0 -g -debug inline-debug-info win-fence2.f90

I launch both of them in this way:

mpiexec.hydra -f ./mpd.hosts -print-rank-map -ppn 8 -n 32 -env I_MPI_FABRICS shm:tcp a.out

where mpd.hosts contains my 4 nodes. I repeat the same operation 100 times only to obtain an elapsed time large enough to measure reliably.
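
(The elapsed time can be measured with MPI_WTIME around that loop; a minimal sketch of the measurement, with t0 and t1 declared alongside the other variables:)

      DOUBLE PRECISION :: t0,t1          ! with the other declarations

      CALL MPI_BARRIER(MPI_COMM_WORLD,ierr)
      t0=MPI_WTIME()

      DO j=1,100
!       ... MPI_BCAST or one-sided copy as in the codes above ...
      END DO

      CALL MPI_BARRIER(MPI_COMM_WORLD,ierr)
      t1=MPI_WTIME()
      IF(world_rank == 0)PRINT *,'elapsed time (s): ',t1-t0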

I noticed that the MPI_BCAST version is faster than the MPI_GET version, even though I was trying to speed up this copy operation using the new RMA features.

Is there some naive error in my MPI_GET version? Is it correct to expect better performance from MPI_BCAST for a problem like this?

Any suggestions or comments would be very helpful.

Thanks in advance

1 Reply
Dmitry_S_Intel
Moderator

Hi,

MPI_BCAST in the Intel(R) MPI Library is highly optimized for Intel platforms by the Intel MPI Library engineering team.

So if you get a version that is faster than MPI_BCAST with the same result, please let us know.

--

Dmitry Sivkov

Intel(R) Cluster Tools TCE
