Intel® Fortran Compiler

OpenMP parallel calculations and module data

Le_Callet__Morgan_M

I would like to collect some guidance on doing parallel calculations with OpenMP using data from a module.

1. Within a module function/subroutine, is it safe to do some parallel calculation that uses and updates a module variable?

2. Within a module function/subroutine, is it safe to call other module functions/subroutines which themselves will be updating/using a module variable?

 

TimP
Honored Contributor III
Perhaps your concern is that module variables are shared, while local variables without SAVE in a procedure declared RECURSIVE and called in a parallel region will be thread-local.
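As a rough sketch of that distinction (the module, procedure and variable names below are illustrative, not from the original posts): the module variable is a single shared instance, while the unsaved local in a RECURSIVE procedure gets its own instance in every call on every thread.

module shared_state
   implicit none
   integer :: mod_counter = 0          ! module variable: one instance, shared by all threads
end module shared_state

recursive subroutine work(depth)
   use shared_state
   implicit none
   integer, intent(in) :: depth
   integer :: local_tmp                ! no SAVE: a fresh copy per call, hence thread-local
   local_tmp = 2 * depth
!$omp critical
   mod_counter = mod_counter + local_tmp   ! shared update must be serialized
!$omp end critical
   if (depth > 1) call work(depth - 1)
end subroutine work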
jimdempseyatthecove
Honored Contributor III

A module variable is shared amongst threads (when not given the THREADPRIVATE attribute). If you intend for multiple threads to update such a variable, then you must use a serializing construct such as one of the following:

! Unnamed critical section
!$OMP CRITICAL
modVar = modVar + delta
!$OMP END CRITICAL

! Named critical section (serializes only against regions with the same name)
!$OMP CRITICAL(modVar_critical)
modVar = modVar + delta
!$OMP END CRITICAL(modVar_critical)

! Atomic update (lighter weight, restricted to simple update statements)
!$OMP ATOMIC
modVar = modVar + delta
!$OMP END ATOMIC

Note, there are variations on these directives. Consult the documentation.
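Pulling one of those variants into a self-contained sketch (the module and program wrapper here are illustrative additions; only modVar and delta come from the snippets above): a parallel loop in which every thread atomically updates the shared module variable.

module accum_mod
   implicit none
   real :: modVar = 0.0
end module accum_mod

program atomic_demo
   use accum_mod
   implicit none
   integer :: i
   real :: delta
!$omp parallel do private(delta)
   do i = 1, 1000
      delta = real(i)
!$omp atomic
      modVar = modVar + delta          ! atomic update of the shared module variable
   end do
!$omp end parallel do
   print *, 'modVar =', modVar         ! expect 500500.0 regardless of thread count
end program atomic_demo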

Jim Dempsey

Le_Callet__Morgan_M

OK, thanks. Assuming each thread will be updating a different slice of a module array (call it u(:)) with data from another module array that is read-only (call it read_array(:)), can't I use !$OMP PARALLEL DO just as I would for arrays declared locally, where both u and read_array are public?

 

!$omp parallel do default(shared) private(i)
do i = 1, counter
   u(i) = u(i) + read_array(i)
end do
!$omp end parallel do

 

Is there anything special about the arrays being shared via the module?

mecej4
Honored Contributor III

If you have an EXE that uses routines from a user-built DLL, and there are routines in both that USE data from a module, then sharing that data between the EXE and the DLL (not to mention between threads in the EXE and threads in the DLL) requires careful attention to detail. Be aware that otherwise the EXE and the DLL may end up with distinct and incoherent copies of the module variables.
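A rough illustration of how that is usually addressed on Windows (the directive usage here is an assumption to be verified against the Intel Fortran documentation for your version): the module compiled into the DLL exports its variables so that the EXE binds to the DLL's single copy instead of getting its own.

module shared_data
   implicit none
   real :: u(1000)                      ! updated by both EXE and DLL routines
   real :: read_array(1000)             ! read-only input data
!DEC$ ATTRIBUTES DLLEXPORT :: u, read_array
end module shared_data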

Le_Callet__Morgan_M

Thanks all,

So basically I should stick to locally defined variables within my procedures to be on the safe side?

jimdempseyatthecove
Honored Contributor III

>> So basically I should stick to locally defined variables within my procedures to be on the safe side?

mecej4 did not say or indicate that this is what you should do. Rather, the advice is to learn the nuances of shared data, data scope, and thread pools.

What you show in #4 is correct...
... provided that array u is not also concurrently being processed by a separate parallel do, or by threads other than those running the code snip in #4.

Note that while the code above does not show such concurrency, the accompanying statement does not rule it out either.

Jim Dempsey

John_Campbell
New Contributor II

Le Callet, Morgan M wrote:

!$omp parallel do default(shared) private(i)
do i = 1, counter
   u(i) = u(i) + read_array(i)
end do
!$omp end parallel do

This example is a special case. Array U can be shared, as it is addressed by the OMP "i" index. In this case, the same element of U would not be written by different threads, so in this respect there is no problem and CRITICAL is not required.
There are practical concerns, however: array U is being updated by different threads, so the same pages of memory are touched by all threads (cores), which can lead to caching inefficiency.
If read_array(i) is an array rather than a function, there is not much work being done in each iteration, so you may wish to review the SCHEDULE option; SCHEDULE(STATIC) could be best. Also, the use of CRITICAL has an overhead which may swamp any performance gain you are hoping to achieve. PARALLEL DO, CRITICAL, ATOMIC, etc. all have different (significant) overheads when measured in processor cycles, so the computational effort for each iteration of the OMP loop must be sufficient to overcome this overhead.

You could try something like the following and see whether this simple example provides any performance improvement. If n is too small, the OMP overhead will be excessive, while if n is too large, the memory-access overhead will be limiting. In the best case you give each thread something significant to do.

! get_processor_ticks() stands in for whatever timing routine you use (e.g. SYSTEM_CLOCK)

t1 = get_processor_ticks ()
sum = 0
!$OMP PARALLEL DO PRIVATE(i) SHARED(A,n) SCHEDULE(STATIC) REDUCTION(+:sum)
do i = 1, n
   sum = sum + A(i)
end do
!$OMP END PARALLEL DO
to = get_processor_ticks () - t1    ! elapsed ticks for the threaded sum

t1 = get_processor_ticks ()
sum = 0
do i = 1, n
   sum = sum + A(i)
end do
ts = get_processor_ticks () - t1    ! elapsed ticks for the serial sum

Too often, !$OMP demonstration examples use very simple DO loops to demonstrate the functionality of multiple threads, but no practical benefit is achieved, as the potential computational saving { run_time * (num_threads-1) / num_threads } is lost in the !$OMP initiation overheads.

Le_Callet__Morgan_M

Dear All,

Many thanks for the feedback, and I apologise if I sounded off.

I will go and study the different nuances as suggested. This really struck a chord, however:

"Too often, !$OMP demonstration examples use very simple DO loops to demonstrate the functionality of using multiple threads, but the practicality of using multi-threading is not achieved, as the computational saving { run time * (num_threads-1) / num_threads } is lost in the !$OMP initiation overheads."

I wonder if I should just let the automatic parallelisation do the job?

 

jimdempseyatthecove
Honored Contributor III

John,

The case I was presenting was not the simple single parallel region producing a reducible sum. Thanks for pointing out the REDUCTION clause to Morgan. I was rather referring to having multiple concurrent parallel regions, which can occur via nested parallelism or via separate tasks, each with a parallel region manipulating the same module/common variable. Those situations require a bit more care to handle efficiently.
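A rough sketch of that kind of situation (module, routine names and loop bounds are illustrative, not from the thread): nested parallel regions in which both levels update the same module variable, so every update site has to be guarded with ATOMIC or a named CRITICAL.

module nest_mod
   implicit none
   real :: total = 0.0                  ! module variable updated from nested parallel regions
end module nest_mod

subroutine nested_update_demo()
   use nest_mod
   use omp_lib
   implicit none
   integer :: i, j
   call omp_set_max_active_levels(2)    ! allow one level of nested parallelism
!$omp parallel do private(j)
   do i = 1, 4
!$omp parallel do
      do j = 1, 1000
!$omp atomic
         total = total + 1.0            ! guard every update site, in inner and outer regions alike
      end do
!$omp end parallel do
   end do
!$omp end parallel do
end subroutine nested_update_demo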

Jim Dempsey

TimP
Honored Contributor III
Auto-parallelization sometimes matches the performance of explicit OpenMP and may uncover missed opportunities. If you are implying it might overcome the efficiency thresholds discussed above, that is unlikely.
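For reference, a minimal sketch of trying the auto-parallelizer (the option names below are my recollection for the classic ifort driver and should be checked against the documentation for your compiler version; the newer ifx driver may differ):

! Build with auto-parallelization and an optimization report, e.g.:
!   Linux:    ifort -parallel -qopt-report=2 autopar_demo.f90
!   Windows:  ifort /Qparallel /Qopt-report:2 autopar_demo.f90
! The report shows which loops the compiler parallelized and why others were skipped.
program autopar_demo
   implicit none
   integer, parameter :: n = 100000
   real :: a(n), b(n)
   integer :: i
   call random_number(b)
   do i = 1, n                          ! independent iterations the auto-parallelizer can recognize
      a(i) = 2.0 * b(i) + 1.0
   end do
   print *, sum(a)
end program autopar_demo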