Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
1890 Discussions

## MPI communication: is it possible to get information from another CPU?

Beginner
100 Views

I am trying to use MPI to solve a problem of the form

C=A*B

where A=A(M,N), B=B(N,O)

Before calculating C, I need to create A and B using MPI-parallelized code. However, for some reason, A can only be parallelized using M as the distribution index, i.e. A is distributed as rows AM(1:N) across different CPUs. On the other hand, B can only be distributed as columns BO(1:N). Since both A and B are very large, neither gather nor broadcast is good for memory. So I am thinking of just keeping A and B as they are. When I calculate C, I use the B distribution as the CPU index; when I need the information of A (i.e. AM), I go to the responsible CPU to get AM, like this:

```
do i=1,M
  do j=1,O
    // do l=1,N   // parallelized
    mpi_send(AM, i, ..., o, ...)
    mpi_recv(AM, o, ..., i, ...)
    CMO = sum(AM*BO)
    // enddo
  enddo
enddo
```

Of course, this will not work, because the sends and receives are issued serially within a single thread.

So, I am here to ask for help. Is there any better idea? Thanks.
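For reference, the target computation with the distributions described above can be sketched in a single process. This is only an illustration of the math, not MPI code: NumPy stands in for the per-rank pieces, and the names `A_rows`/`B_cols` are made up here to mimic the AM and BO distributions.

```python
import numpy as np

M, N, O = 4, 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
B = rng.standard_normal((N, O))

# The row distribution: CPU i owns AM = A(i, 1:N), one row of length N.
A_rows = [A[i, :] for i in range(M)]
# The column distribution: CPU j owns BO = B(1:N, j), one column of length N.
B_cols = [B[:, j] for j in range(O)]

# Each element C(i,j) needs exactly one row of A and one column of B:
# CMO = sum(AM*BO), a dot product over the shared index N.
C = np.empty((M, O))
for i in range(M):
    for j in range(O):
        C[i, j] = np.sum(A_rows[i] * B_cols[j])

assert np.allclose(C, A @ B)
```

The point of the sketch is that each C(i,j) needs only one AM row and one BO column, never the full A or B, which is why a communication scheme that moves single rows can avoid a gather or broadcast.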

2 Replies

Black Belt

I would suggest you look at breaking up AM*BO into tiles (stripes), computing partial sum()s of the partial product matrices, and then summing the partial sums. That way you can queue up the production of the AM stripes and of the BO stripe(s) (assuming B is stripeable). Break the code up into two loops: one to queue up the sends (mpi_send...), and another to produce the partial products when data becomes available.

```
do i=1,M
  do j=1,O
    // do l=1,N   // parallelized
    mpi_send(AM, i, ..., o, ...)
    // enddo
  enddo
  do j=1,O
    // do l=1,N   // parallelized
    mpi_recv(AM, o, ..., i, ...)
    CMO = sum(AM*BO)
    // enddo
  enddo
enddo
```

Then break down the AM into stripes and produce the partial sums of the partial products.
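The striping idea can be sketched in a single process as follows. This is a hedged NumPy stand-in for the real code (the stripe width is a made-up parameter, and in the MPI version each stripe of AM would arrive via mpi_recv before its partial sum is accumulated):

```python
import numpy as np

M, N, O = 4, 8, 3
rng = np.random.default_rng(1)
A = rng.standard_normal((M, N))
B = rng.standard_normal((N, O))

stripe = 2  # hypothetical stripe width along the shared index N
C = np.zeros((M, O))
# Accumulate partial products stripe by stripe: each partial sum needs
# only a stripe of A's row and the matching stripe of B's column, so the
# stripes can be produced, sent, and consumed independently.
for l0 in range(0, N, stripe):
    l1 = l0 + stripe
    for i in range(M):
        for j in range(O):
            C[i, j] += np.sum(A[i, l0:l1] * B[l0:l1, j])

assert np.allclose(C, A @ B)
```

Because the partial sums over disjoint stripes add up to the full dot product, the order in which stripes are processed does not affect the result, which is what makes the two-loop queue-then-consume structure above workable.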

Jim Dempsey

Beginner
Jim, thanks.

If I send all the data in the first loop, where is the data staying? In cache? And it seems each AM is sent O times; is that OK? During the receive, does it matter which data is received earlier, i.e. is it OK for data sent later to be received earlier?
