Hi,
I want to parallelize code that uses two assumed-size vectors, and the PRIVATE clause cannot be applied to them to give each thread its own copy of these arrays. Is there any solution to this problem?
Best regards
Anders_S
If your code looks like this:
subroutine foo(A, B)
implicit none
real :: A(:), B(:)
...
You do not state whether the subroutine is called from within a parallel region (in the caller, or some level above the caller)
.or.
whether the subroutine itself contains a parallel region
and/or whether the dummy arguments are threadprivate variables declared up-level.
Assume the case where foo is not called from within a parallel region (from the caller or above). A and B are quite large and you would like to do a DOT_PRODUCT in parallel.
function DOT(A, B) result(ret)
implicit none
real :: A(:), B(:), ret
integer :: I
ret = 0.0 ! reduction(+:ret) zeroes each thread's private copy, not the original ret
!$omp parallel do reduction(+:ret)
do I=1, size(A)
ret = ret + A(I) * B(I)
end do
!$omp end parallel do
end function DOT
Each thread performs a partial dot product on a different sub-section of the arrays. Each thread's local value of ret is then thread-safely summed into the shared variable ret (owned by the thread calling DOT).
Now then, if this function is called from within a parallel region, and A and B are threadprivate (and different) sub-sections of arrays A and B, then this procedure would NOT require the !$omp statements. Including them in this case would be counter-productive.
If your usage is different from the two scenarios above, please be specific.
Jim Dempsey
Hi Jim,
Thanks for a rapid reply!
The subroutines using the assumed-size arrays are obtained as DLLs from a third party.
In my main program I specify the following:
INTEGER n1,n2,s2,arr1,arr2
PARAMETER(n1=20000,n2=40000,s2=2)
DIMENSION arr1(n1),arr2(n2)
As I get it, arr1 and arr2 are used to store and transfer data while the DLLs compute intermediate results before delivering the wanted output. That is all I can tell.
Best regards
Anders _S
Some answers you haven't supplied:
- Does the DLL contain parallel regions?
- Are you calling the DLL from parallel regions?
- Both?
My assumptions are: No, Yes, No
If my assumptions are correct, then, does the third party assure the library is thread safe?
If yes, then each thread must have its own arr1 and arr2 (or a sub-slice of single arrays) .and. you have to figure out how to perform any reduction (consolidation), if one is necessary.
If no, then either:
a) protect the calls to the DLL with a critical section (and lose the parallel advantage), or
b) adapt the program to use MPI as opposed to (or in addition to) OpenMP (more coding work).
Note for sub-slicing of single arrays:
!$omp parallel private(iThread, nThreads, stride1, stride2, Begin1, Begin2, End1, End2)
iThread = omp_get_thread_num()
nThreads = omp_get_num_threads()
stride1 = (size(arr1) + nThreads - 1) / nThreads
stride2 = (size(arr2) + nThreads - 1) / nThreads
Begin1 = 1 + stride1 * iThread
Begin2 = 1 + stride2 * iThread
End1 = min(Begin1 + stride1 - 1, size(arr1)) ! -1 so adjacent slices do not overlap
End2 = min(Begin2 + stride2 - 1, size(arr2))
if(Begin1 > size(arr1) .or. Begin2 > size(arr2)) STOP "fix code"
call DLL(arr1(Begin1:End1), arr2(Begin2:End2), ...)
!$omp end parallel
CAUTION
If the code in the DLL has values with loop-order dependencies (e.g. the value of the current element incorporates the value of the prior element, or 0), then you will have to add code to propagate the end value of the first section to the second section, the new end of that section to the third section, and so on. What exactly has to be done depends on what is required.
Note, the slicing of the arrays may need to be done only once, into threadprivate variables, provided the array sizes do not change.
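A sketch of that one-time slicing, assuming module-level threadprivate bounds (all names here are illustrative, not from the actual DLL):

```fortran
module slice_bounds
  integer :: Begin1, End1, Begin2, End2
  !$omp threadprivate(Begin1, End1, Begin2, End2)
end module slice_bounds

subroutine compute_slices(n1, n2)
  use omp_lib
  use slice_bounds
  implicit none
  integer, intent(in) :: n1, n2
  integer :: iThread, nThreads, stride1, stride2
  !$omp parallel private(iThread, nThreads, stride1, stride2)
  iThread  = omp_get_thread_num()
  nThreads = omp_get_num_threads()
  stride1 = (n1 + nThreads - 1) / nThreads
  stride2 = (n2 + nThreads - 1) / nThreads
  Begin1 = 1 + stride1 * iThread          ! threadprivate: persists across regions
  Begin2 = 1 + stride2 * iThread
  End1 = min(Begin1 + stride1 - 1, n1)
  End2 = min(Begin2 + stride2 - 1, n2)
  !$omp end parallel
end subroutine compute_slices
```

Once computed, later parallel regions (with the same thread count) can reuse each thread's bounds without recomputing them.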
Jim Dempsey
Hi @Anders_S_1
I agree with @jimdempseyatthecove; a little more information is needed: where are the arrays declared as assumed size, and where do you want to privatize them?
Strictly speaking, no, there is no way to privatize an assumed-size array.
However, if you have access to the code, would it not be simpler to change the assumed size to an explicit shape?
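For illustration only (the real DLL interface is unknown, so dll_work here is a hypothetical stand-in): with an explicit-shape dummy, the caller can give each thread its own private workspace:

```fortran
! Hypothetical stand-in for the DLL routine, with an explicit-shape
! dummy (arr1(n)) instead of an assumed-size one (arr1(*)).
subroutine dll_work(arr1, n)
  implicit none
  integer, intent(in) :: n
  real :: arr1(n)
  arr1 = 0.0                 ! placeholder for the real work
end subroutine dll_work

program demo
  implicit none
  integer, parameter :: n1 = 20000
  real :: work1(n1)          ! local array, so PRIVATE is now legal
  integer :: i
  !$omp parallel do private(work1)
  do i = 1, 8
    call dll_work(work1, n1)
  end do
  !$omp end parallel do
end program demo
```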
Hi Jim and TobiasK,
Jim, your assumptions were correct. If I get you right, the idea is to make the assumed-size array nThreads times longer and then let each thread use its own part of the array. This seems to be a smart way to obtain "privacy" for each thread! Is this something that has been tried and verified to work (or should work)?
TobiasK, thanks for joining the discussion. You suggest that the assumed-size array should be changed to a fixed-size array. Will such a change guarantee that the PRIVATE clause can be applied? Does your rather strong statement on making assumed-size arrays "private" contradict the proposal by Jim?
I am presently using MPI, which works fine except that the computation stops stochastically (this is the topic of another task sent to Intel). Another driving force to try OpenMP is that OpenMP seems to be one of the ways to include GPUs in HPC calculations. Am I right here? I have asked several times whether MPI can be used together with GPUs, but never got any answer! Maybe it is a very stupid question, but I have not been able to find any info on this topic.
Best regards
Anders_S
Hi @Anders_S_1
With "thread private" I refer to the OpenMP standard definition of threadprivate. In Jim's suggestion the vector is not private but shared. (Technically, if the compiler does not create a temporary copy of the vector, every thread can still access the entire vector via out-of-bounds indexing...)
OpenMP offers offloading to a target; a target can be a GPU.
However, I still have problems understanding where you want to add OpenMP parallelization. If you just want to call a subroutine with each thread, then this approach will work on the GPU, but the performance will be far inferior compared to a CPU. If you want to enable GPU acceleration, you will have to do much more to make it performant.
MPI supports GPUs.
I had a look at the other thread; you are referring to the Data Center GPUs, which are not supported on Windows. So in that case it's not supported, but it's not a question of whether MPI supports GPUs; it's a question of whether that particular GPU is supported on a particular OS in general.
As for the stochastic issues with MPI, I would recommend running your application with -check_mpi and compiling it with -check all:
https://www.intel.com/content/www/us/en/docs/trace-analyzer-collector/user-guide-reference/2021-10/correctness-checking-of-mpi-applications.html
If that does not find an error, you may consider running the application with valgrind enabled.
>> if MPI can be used together with GPUs
Yes you can. You can use MPI together with OpenMP, and OpenMP with MKL.
MKL has a threaded version and a serial version.
OpenMP + MKL generally uses the serial version of MKL (with care, to avoid oversubscription, you can use the parallel version of MKL with OpenMP).
In a serial Fortran (ifx) application, you can use the !$omp directives to offload to GPU (provided you have a supported GPU).
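A minimal offload sketch, assuming ifx with a supported GPU (compiler flag names vary by version; with no device present, OpenMP falls back to executing on the host):

```fortran
program offload_demo
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n), c(n)
  integer :: i
  a = 1.0; b = 2.0
  ! Map inputs to the device, run the loop there, map the result back.
  !$omp target teams distribute parallel do map(to: a, b) map(from: c)
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute parallel do
  print *, c(1)
end program offload_demo
```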
Jim Dempsey
Hi Jim and TobiasK,
Thanks for your rapid and informative replies.
Today:
My code runs fine except for the stochastic stops. Removing these stops is priority one, and I am applying the advice to trace MPI. I got one error message when applying the MPI debugger:
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
RANK 54 PID 11708 RUNNING AT DESKTOP-A63MN13
EXIT STATUS: -1 (ffffffff)
This message is obtained for all 55 ranks. How can I understand this message and proceed?
Next step to increase computational speed:
From your comments it is clear that faster CPU cores, more efficient code, and optimized MPI use (to suppress MPI overhead and increase the number of ranks) are the way to go.
OpenMP has to wait until the third-party vendor modifies the DLL library.
The GPU road (e.g. the Max series) means a lot of work, if it is possible at all, with a hard-to-estimate speed gain.
Best regards
Anders_S
One of your ranks abnormally ended.
First thing to do is to compile the code with full runtime checks enabled.
Then run the code in the debugger using 1 rank, launching the application directly as opposed to via mpirun or mpiexec. This will assure that the code itself is functional (bypassing any MPI-related issues).
If that works, then run 1 rank but launch via mpirun or mpiexec.
If that works, then run with 2 ranks on the same system. Note, debugging multiple ranks is a bit tricky; please locate and review techniques.
Jim Dempsey
Hi Jim,
How do I run my code in the debugger? Simply by writing the name of the executable on the command line and pressing return?
Best regards
Anders_S
Did you read the link for a recommendation?
For an MPI application that is launched via mpiexec or mpirun, insert a console read statement (a PAUSE will do) that is executed by rank 0, .AND. follow the conditional that executes the PAUSE with an MPI_BARRIER.
Then run the program from the command line (or from the MS VS IDE, issuing mpiexec/mpirun).
Wait for rank 0 to reach the pause, and for all other ranks to wait at the barrier.
Issue Debug | Attach to Process.
Locate all the ranks and attach to them.
For each rank, insert your break point(s).
Then, in the CMD window waiting at the PAUSE, press return.
Note, if you have MS VS Professional, you may have the MPI-specific debugger, which makes setting/managing breakpoints much easier. But you do not need that version of the debugger. This is another reason why you might want to debug using a small number of ranks.
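A sketch of that rank-0 gate, with a console READ standing in for PAUSE (the MPI calls are standard; the program name is illustrative):

```fortran
program attach_gate
  use mpi
  implicit none
  integer :: ierr, rank
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
    print *, 'Attach the debugger to each rank, then press return'
    read (*, *)                            ! rank 0 waits for the console
  end if
  call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! all other ranks wait here
  ! ... rest of the application ...
  call MPI_Finalize(ierr)
end program attach_gate
```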
Jim Dempsey