Hi all,
I have reached a stage where I want to offload some host code to the MIC, but I ran into an error.
This is the section of the code I want to offload:
#ifdef CCSMCOUPLED
!  call MPI_BARRIER(POP_Communicator, ierr)
!dir$ offload begin target(mic:0)
   ierr = 13
   call MPI_ABORT(MPI_COMM_WORLD, errorCode, ierr)
!dir$ end offload
#else
!dir$ offload begin target(mic:0)
   call MPI_BARRIER(POP_Communicator, ierr)
   call MPI_ABORT(MPI_COMM_WORLD, errorCode, ierr)
   call MPI_FINALIZE(ierr)
!dir$ end offload
#endif
I got the following error. Is there a workaround to this problem? I want to offload host MPI calls to the Xeon Phi.
If you want to run MPI in native or "symmetric" mode on the MIC, you compile with mpiifort -mmic, which causes the offload directives to be ignored. I don't believe MPI is supported inside offloaded regions; if it were, it would require mpi_init_thread(MPI_THREAD_MULTIPLE) and linking explicitly with a library which supports that mode.
Offloaded regions can be used in an application which runs under MPI, but they would run inside a single MPI process, as in mpi_thread_funneled mode.
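To make the funneled-mode point concrete, here is a minimal sketch (a hypothetical program; it assumes the Intel offload directives from the original post and the standard MPI Fortran bindings) of an MPI rank that uses an offload region for local computation only, with all communication staying on the host:

   program offload_under_mpi_sketch
      ! Sketch only: an MPI rank that contains an offload region.  The
      ! offloaded code does purely local work; all MPI calls stay on the
      ! host, which is the funneled-style behaviour described above.
      use mpi
      implicit none
      integer :: provided, ierr, my_rank
      real    :: partial

      call MPI_INIT_THREAD(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)

      partial = 0.0
   !dir$ offload begin target(mic:0)
      ! no MPI calls in here, only local computation
      partial = real(my_rank) + 1.0
   !dir$ end offload

      ! communication happens back on the host
      call MPI_ALLREDUCE(MPI_IN_PLACE, partial, 1, MPI_REAL, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(ierr)
   end program offload_under_mpi_sketch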
Hi,
Thank you for the prompt reply.
It's interesting that you mentioned this:
"It would require mpi_init_thread(MPI_THREAD_MULTIPLE) and linking explicitly with a library which supports that mode."
So is it possible, with some tweaks or library support, for MPI functions intended for the host to be offloaded and run on the MIC?
This is important to us because:
1) I have a huge SIMD loop with deep nesting of functions (function1 -> func2 -> func3 ...). One of these functions contains MPI, and there is no way I could offload the loop without offloading MPI.
What Tim was saying was that you would start the code on the coprocessor, not offload a section and do an mpi_init_thread in that offload section.
When you offload code to the coprocessor, are you using any other type of parallelism, such as OpenMP or Pthreads? Typically, that is how you make use of multiple threads in an offload section. If the only form of parallelism is MPI, then, instead of offloading part of your code, you would start up some or all of your MPI ranks on the coprocessor. To do this you would need to disable or remove your offload directives. The -mmic option that Tim told you about will both disable the offload directives and compile the code specifically for the coprocessor. You would use the -mmic option even if you physically removed the offload directives. For the host, to disable the offload directives, you would use the -qno-offload option.
You can find more information in the Intel reference manuals, such as the Intel® MPI Library Reference Manual. The manuals are at https://software.intel.com/en-us/intel-software-technical-documentation. The 5.0.3 version of the Intel® MPI Library Reference Manual is located at https://software.intel.com/en-us/mpi-refman-lin-5.0.3-html.
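For concreteness, here is a rough sketch of the exit code from the original post with the offload directives removed, the way it would look for a native or symmetric run. The subroutine name is hypothetical and POP_Communicator/errorCode are passed as arguments here rather than coming from a module; the same source would then be compiled once for the host with mpiifort and once for the coprocessor with mpiifort -mmic.

   ! Sketch only: the exit path from the original post with the offload
   ! directives removed, so the same source can be built for the host
   ! (mpiifort) and natively for the coprocessor (mpiifort -mmic).
   subroutine exit_pop_native(POP_Communicator, errorCode)
      use mpi
      implicit none
      integer, intent(in) :: POP_Communicator, errorCode
      integer :: ierr

   #ifdef CCSMCOUPLED
      ierr = 13
      call MPI_ABORT(MPI_COMM_WORLD, errorCode, ierr)
   #else
      call MPI_BARRIER(POP_Communicator, ierr)
      call MPI_ABORT(MPI_COMM_WORLD, errorCode, ierr)
      call MPI_FINALIZE(ierr)
   #endif
   end subroutine exit_pop_native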
Hi Frances,
I am currently looking at exploiting thread-level parallelism.
The main reason I asked about offloading MPI is that my code is structured in such a way that I want to offload but can't exclude MPI.
It's a loop I wish to offload, but along the call tree I reach these MPI calls. There is no way I can offload without this support.
Check the image below.
As you can see from the call tree, there is no getting away from it.
Also, the depth of nesting is making life harder through compile errors like "A procedure called in an offload must have xxxxxxx attribute" (see the sketch at the end of this post).
The nesting is so deep that offloading takes a lot of time for me, and we are now looking at other ways of getting things done.
Is there a solution you can suggest to the problems below?
1) Host MPI calls inside an offload region
2) Handling deep nesting
3) Can you suggest a loop profiler? (The --loop option doesn't work for parallel applications, and CESM doesn't run on a single process; it takes ages.)
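For reference, this is the kind of attribute the compiler is asking for (a sketch with made-up routine names): every procedure reachable from the offload region needs it, all the way down the call tree, which is what makes the deep nesting so tedious.

   subroutine func1(x)
      implicit none
   !dir$ attributes offload:mic :: func1
   !dir$ attributes offload:mic :: func2   ! the callee's attribute is declared at the call site too
      real, intent(inout) :: x
      call func2(x)
   end subroutine func1

   subroutine func2(x)
      implicit none
   !dir$ attributes offload:mic :: func2
      real, intent(inout) :: x
      x = 2.0 * x                          ! stand-in for the real work
   end subroutine func2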
Let me make sure I understand your drawing. You have a module function baroclinic::baroclinic_driver. I am assuming a module since you seem to be programming in Fortran, although I guess it could be a class just as easily. Inside that function is a loop which can call one or more of overflows::ovf_utlda, blocks::get_block, advection::comp_flux_vel_ghost, etc., some of which call other functions, which call other functions, and so on, until some of them eventually call exit_mod::exit_pop, where you have some MPI code.
Assuming that is right, is the MPI code in exit_mod::exit_pop communicating with other ranks which have also gone through baroclinic::baroclinic_driver, or is it passing work off to ranks which have been sitting quietly waiting for work? I would assume that exit_mod::exit_pop is calling idle ranks, if only because the many alternate paths to exit_mod::exit_pop would make it too easy to deadlock.
At present, the loop in baroclinic::baroclinic_driver is running on a single thread and you would like to multithread it, is that correct? Possibly with OpenMP or are you looking at MPI there as well? With those function calls inside the loop, the loop won't vectorize although there may be inner loops that will. If I am interpreting your code correctly, you will need to be very careful to either have separate MPI communicators in the different threads or use barriers to keep more than one thread from entering the MPI section at a time. Otherwise, you can deadlock very easily or end up with messed up results.
In any event, I don't think offload is what you want. You could run the code natively on the coprocessor, then have the loop threaded and calling MPI, all on the coprocessor. Personally, I would avoid the complication of multithreading the loop, but if I really wanted to multithread it, I would probably thread the loop on the host, set up barriers to let only one of the threads into the MPI section at a time rather than setting up different communicators, and have MPI on the host send work to MPI ranks running on the coprocessor. Well, actually I would probably rip the MPI out and just use OpenMP to thread everything, but that is a personal preference and probably not rational for your case.
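A rough sketch of that "one thread at a time in the MPI section" idea, using OpenMP on the host (the names and the work inside the loop are made up; the point is only the structure of the critical section and the thread-support level requested at startup):

   program threaded_loop_sketch
      ! Sketch only: an OpenMP-threaded loop where a single thread at a
      ! time is allowed into the MPI section, so MPI_THREAD_SERIALIZED
      ! support is sufficient.
      use mpi
      implicit none
      integer :: provided, ierr, i, my_rank
      real    :: work(1000), total

      call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED, provided, ierr)
      ! A real code should check that provided >= MPI_THREAD_SERIALIZED.

      total = 0.0
   !$omp parallel do reduction(+:total)
      do i = 1, size(work)
         work(i) = real(i)                ! stand-in for the per-iteration work
         total   = total + work(i)

   !$omp critical (mpi_section)
         ! Only one thread at a time gets here; real communication calls
         ! would replace this local query.
         call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
   !$omp end critical (mpi_section)
      end do
   !$omp end parallel do

      call MPI_FINALIZE(ierr)
   end program threaded_loop_sketch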
Frances
