Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Code examples showing how to move an MPI computation from the CPU to a GPU

New Contributor III


It would be very helpful for me to see some simple examples of how to perform an MPI computation on a GPU instead of on the CPU. Today, a computation-intensive part of my code is parallelized via MPI.

Will it also be possible to use a third-party subroutine package on the GPU via some "interface layer" code?

Best regards

Anders S

Honored Contributor III

Why not keep MPI and integrate GPU into each Rank's code?

Why not keep MPI and integrate OpenMP with GPU support into each rank's code?

Note that a one-rank system is equivalent to plain OpenMP with GPU offload.

Why take out MPI? Your problem may grow beyond the capability of one system.


Jim Dempsey


New Contributor III

Hi Jim,

Thanks for the suggestions! Does this mean that each rank will be limited by the number of threads of the GPU instead of the CPU? Can you give a reference to where I can read about this?

I am not sure I fully understand your last comment.

Best regards

Anders S


Simplest example: you have 2 servers with 8 cores each and 1 GPU each. You launch a 2-rank MPI job, 1 rank per node.

This MPI job runs like it currently does.  BUT you find some loop nests that take a lot of time.

So the first thing you try is: EACH MPI rank runs with 8 threads (OMP_NUM_THREADS=8). Then you put a !$OMP PARALLEL DO around that loop with the necessary clauses for data sharing.
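As a sketch of that step (the loop body, the arrays a, b, c, and the bound n are placeholders, not from the original post):

```fortran
! Hypothetical loop nest; each MPI rank runs this with its own
! 8 CPU threads (OMP_NUM_THREADS=8 on each node).
!$omp parallel do default(none) shared(a, b, c, n) private(i, j)
do j = 1, n
   do i = 1, n
      c(i, j) = a(i, j) + b(i, j)
   end do
end do
!$omp end parallel do
```

The default(none) clause forces you to spell out the data sharing explicitly, which is a good habit before moving the loop to a device.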


So far so good?

Then, instead of running 8 threads on the CPU, you use !$OMP TARGET with MAP clauses to run that loop on the GPU.
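The same hypothetical loop, offloaded to the GPU instead of CPU threads (again, the array names are made up for illustration):

```fortran
! MAP clauses move the data: a and b are copied to the device,
! c is copied back to the host when the region ends.
!$omp target teams distribute parallel do collapse(2) &
!$omp&   map(to: a, b) map(from: c)
do j = 1, n
   do i = 1, n
      c(i, j) = a(i, j) + b(i, j)
   end do
end do
!$omp end target teams distribute parallel do
```

Note that only the directive changes; the loop body itself stays the same, which is what makes this an incremental path from CPU threading to GPU offload.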


This is called MPI+X, where X is pthreads, OpenMP, OpenACC, CUDA, or some other mechanism to speed up EACH MPI rank.

You do not change the MPI - you still divide up the data and work in parallel, exchanging data.  But the WORK you do within each rank is put on multiple threads on CPU cores OR you offload the work to GPUs.  2 levels of parallelism.
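Putting the two levels together, a minimal MPI+X skeleton along these lines might look as follows (the array size, values, and reduction are purely illustrative):

```fortran
program mpi_plus_x
   use mpi
   implicit none
   integer :: ierr, rank, nranks, i
   integer, parameter :: n = 1000000
   real :: local(n), local_sum, total

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

   ! Coarse grain: each rank owns its own slice of the data.
   local = real(rank + 1)

   ! Fine grain: this rank offloads its loop to its node's GPU.
   !$omp target teams distribute parallel do map(tofrom: local)
   do i = 1, n
      local(i) = local(i) * 2.0
   end do
   !$omp end target teams distribute parallel do

   ! The MPI part is unchanged: ranks still exchange/reduce data.
   local_sum = sum(local)
   call MPI_Reduce(local_sum, total, 1, MPI_REAL, MPI_SUM, 0, &
                   MPI_COMM_WORLD, ierr)
   if (rank == 0) print *, 'total =', total

   call MPI_Finalize(ierr)
end program mpi_plus_x
```

With ifx this kind of code would typically be built with MPI wrappers plus the OpenMP offload flags (e.g. -fiopenmp -fopenmp-targets=spir64); if no GPU is present, the target region falls back to running on the host.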

Large-grain parallelism with MPI.

Fine-grain parallelism at the loop or procedure level within each rank.