Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Intel coarray hybrid program

adenchfi
Novice
436 Views

Hello,

 

Intel coarrays can be run in either distributed or shared mode, which as far as I know emulates MPI vs. OpenMP behavior under the hood. I have some questions about writing programs for clusters of computers that get the advantages of hybrid MPI+OpenMP:

 

1) Can we achieve hybrid MPI+OpenMP(/pthread?) performance entirely within coarray language features in the current Intel compilers? If not, is it on the roadmap?

  • For example, assigning a team to a node, so that the images within each team use a shared-memory model and thus avoid communication bottlenecks, while teams communicate with each other over MPI? (since cluster OpenMP isn't a thing anymore)
  • If a user can't do this explicitly (the above point), does it occur under the hood? If so, does the user have any control over it?
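To make the first bullet concrete, here is a sketch of what I have in mind using Fortran 2018 teams. This assumes images land on nodes contiguously in launch order; the names (`node_id`, `node_team`, `images_per_node`) are mine, not from any Intel API:

```fortran
! Sketch: split images into one team per node, assuming a fixed,
! contiguous mapping of images to nodes (an assumption, not guaranteed).
program team_per_node
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: node_team
  integer, parameter :: images_per_node = 4   ! illustrative value
  integer :: node_id

  ! Images 1..4 -> team 1, images 5..8 -> team 2, etc.
  node_id = (this_image() - 1) / images_per_node + 1
  form team (node_id, node_team)

  change team (node_team)
    ! Here this_image() and coarray references are relative to the
    ! node-local team. Whether intra-team traffic actually uses shared
    ! memory is up to the implementation, which is exactly my question.
  end team
end program team_per_node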

2) If we are limited to choosing one paradigm for coarrays but want the hybrid performance, what is the best approach for writing new programs?

  • It seems to me that, for ease of programming, coarrays in distributed mode (with one image per node) plus OpenMP(/pthread?) on each node is the way to go. How does one actually implement this? I presume compiling with both coarrays and OpenMP/pthreads works, but are there differences or limitations in the syntax? Are there simple examples out there? In Slurm scripts I can assign only one process per node, and IIRC coarrays will launch the available processes by default. Are there practical limitations to this approach compared to MPI+OpenMP/pthreads?
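The kind of program I imagine is sketched below: coarrays for inter-node data movement, OpenMP for node-local loops. The compile line and the work array are my assumptions, not a verified recipe:

```fortran
! Sketch of one-image-per-node coarrays + OpenMP threads within a node.
! Assumed compile line (my guess): ifort -coarray=distributed -qopenmp prog.f90
program hybrid_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real :: work(n)[*]          ! coarray: one copy per image (per node here)
  integer :: i

  ! Node-local compute: OpenMP threads share this image's memory.
  !$omp parallel do
  do i = 1, n
    work(i) = real(i) * real(this_image())
  end do
  !$omp end parallel do

  sync all    ! inter-image (here inter-node) synchronization via coarrays
  if (this_image() == 1) print *, 'image 1 done, work(n) =', work(n)
end program hybrid_sketch
```

Is something like this supported, and does it behave the way the syntax suggests?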

Thanks.

0 Kudos
3 Replies
Steve_Lionel
Honored Contributor III
425 Views

Intel's "shared" coarray implementation uses MPI, not OpenMP. Each image is its own process. There is no difference in syntax. I have seen quite a few examples of combining OpenMP and coarrays. The most important thing is to not over-subscribe the system. Keep in mind that OpenMP by default will start as many threads as you have cores, and isn't aware of MPI.
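One way to avoid oversubscription is to size the thread count per image explicitly before launching. This is a sketch, not an official recipe; the core and image counts are illustrative, and `FOR_COARRAY_NUM_IMAGES` is the Intel Fortran control for the number of images:

```shell
# Sketch: keep images x threads <= cores per node (illustrative numbers).
CORES_PER_NODE=16
IMAGES_PER_NODE=4
export FOR_COARRAY_NUM_IMAGES=$IMAGES_PER_NODE                # Intel Fortran image count
export OMP_NUM_THREADS=$((CORES_PER_NODE / IMAGES_PER_NODE))  # threads per image
echo "threads per image: $OMP_NUM_THREADS"
```

Without something like this, each image's OpenMP runtime will default to one thread per core, so four images on a 16-core node would try to run 64 threads.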

0 Kudos
adenchfi
Novice
414 Views

Thank you for the clarification. I have been reading the Intel MPI documentation for the last few minutes and have seen that. However, right where it clarifies that, it also states that OpenMP is not supported with coarrays:

[attached screenshot: adenchfi_0-1676833758445.png]

Is this documentation then outdated, based on your answer?

 

I suppose my other questions come down to what the Intel MPI implementation is like. I read through https://www.intel.com/content/www/us/en/developer/articles/technical/tuning-the-intel-mpi-library-basic-techniques.html which has a lot of useful information.

However, I am still not entirely clear on one question: do Intel coarrays (in distributed mode) get faster communication when images are intra-node, comparable to OpenMP/pthreads? If so, is there any way for me to write my program with this in mind, to minimize communication bottlenecks, purely within coarrays (and possibly Intel MPI environment variables)?
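By "Intel MPI environment variables" I mean controls like the following, which I found in the tuning article. These are real Intel MPI variables as I understand them, but I am guessing at whether they influence coarray traffic, and their effect would need to be measured:

```shell
# Sketch: Intel MPI knobs that might affect coarray (distributed) traffic.
export I_MPI_FABRICS=shm:ofi   # shared memory intra-node, OFI inter-node
export I_MPI_DEBUG=5           # print pinning/fabric info at startup
export I_MPI_PIN_DOMAIN=omp    # reserve a core domain per rank for OpenMP threads
echo "fabrics: $I_MPI_FABRICS"
```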

For example, I may want to do domain decomposition where different nodes are responsible for different subdomains, such that optimal memory sharing is possible within each node. After reviewing the coarray documentation, I don't believe this is possible within the coarray syntax alone, which would necessitate coarrays+OpenMP/pthreads. Is this behavior on the roadmap, though?

0 Kudos
Steve_Lionel
Honored Contributor III
398 Views

Intel doesn't want you to mix OpenMP and coarrays, but you can do it if you're careful.

OpenMP inter-thread communication and synchronization is much faster than using MPI, but Intel MPI tries to do a good job of same-node communication. Typically for coarray programs you want to minimize passing data and synchronization between images. That doesn't mean eliminate it, but recognize that there is quite a bit of overhead involved, and it's best if an image has a lot of work to do based on initial data. Teams can help with this.
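As one illustration of minimizing synchronization, pair-wise `sync images` costs much less than a global `sync all` when each image only needs to coordinate with its neighbors. A sketch (the chain topology and `edge` coarray are illustrative):

```fortran
! Sketch: synchronize only with neighbor images (halo-exchange style)
! instead of all images, to cut synchronization overhead.
program neighbor_sync
  implicit none
  real :: edge[*]
  integer :: me, np

  me = this_image()
  np = num_images()
  edge = real(me)

  ! Pair-wise synchronization: each image waits only on its neighbors
  ! in the chain, not on every image as sync all would.
  if (me > 1)  sync images (me - 1)
  if (me < np) sync images (me + 1)

  if (me > 1)  print *, 'image', me, 'sees left edge', edge[me - 1]
end program neighbor_sync
```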

I can't speak for Intel regarding roadmaps. If I were to hazard a guess, it would be that they aren't devoting resources to the combination of coarrays and OpenMP.

0 Kudos
Reply