Intel® Fortran Compiler

Reduction with large OpenMP arrays

Adrian_A_1
Beginner

Hi all,

I know this has been asked before, for example here: https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/345415, and on Stack Overflow: https://stackoverflow.com/questions/24882425/openmp-array-reductions-with-fortran and https://stackoverflow.com/questions/20413995/reducing-on-array-in-openmp, but I would like your opinion, because the scalability I get is not what I expect.

So I need to fill a really large array of complex numbers, and I would like to parallelize that work with OpenMP. Our first approach is this:

COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), ALLOCATABLE :: thread_huge_array(:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index

ALLOCATE(huge_array(huge_number))

!$OMP PARALLEL FIRSTPRIVATE(thread_huge_array)
      ALLOCATE(thread_huge_array(SIZE(huge_array)))
      thread_huge_array = ZERO
!$OMP DO
      DO index1=1,bignumber1
         ! Some calculations
         DO index2=1,bignumber2
            ! Some calculations
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,smallnumber
                     depending_index = function(index1, index2, index3, index4, index5)
                     thread_huge_array(depending_index) = thread_huge_array(depending_index)
                  ENDDO
               ENDDO 
            ENDDO
         ENDDO 
      ENDDO 
!$OMP END DO
!$OMP BARRIER
!$OMP MASTER
      huge_array = ZERO
!$OMP END MASTER
!$OMP CRITICAL
      huge_array = huge_array + thread_huge_array
!$OMP END CRITICAL
      DEALLOCATE(thread_huge_array)
!$OMP END PARALLEL

With that approach we get good scalability up to 8 cores, reasonable scalability up to 32 cores, and from 40 cores onward it is actually slower than with 16 cores (the machine has 80 physical cores). Of course, we cannot use the REDUCTION clause, because the array is so big that the per-thread copies don't fit in the stack (even after increasing ulimit to the maximum allowed on the machine).
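For reference, a minimal sketch of the REDUCTION form we ruled out would look something like the following (the actual contribution term is elided, as in the snippet above, and any locals of the "Some calculations" blocks would also need to be PRIVATE). The per-thread private copies of huge_array are what blow up the memory; depending on the compiler they can land on the worker-thread stacks, which are limited by OMP_STACKSIZE / KMP_STACKSIZE rather than by ulimit -s:

!$OMP PARALLEL DO REDUCTION(+:huge_array) &
!$OMP&   PRIVATE(index2, index3, index4, index5, depending_index)
      DO index1=1,bignumber1
         ! Some calculations
         DO index2=1,bignumber2
            ! Some calculations
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,smallnumber
                     depending_index = function(index1, index2, index3, index4, index5)
                     ! contribution elided, as in the original snippet
                     huge_array(depending_index) = huge_array(depending_index) ! + contribution
                  ENDDO
               ENDDO
            ENDDO
         ENDDO
      ENDDO
!$OMP END PARALLEL DO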

We have also tried a second approach:

    USE OMP_LIB   ! for omp_get_max_threads / omp_get_thread_num

    COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
    COMPLEX(KIND=DBL), POINTER :: thread_huge_array(:,:)
    INTEGER :: huge_number, num_thread, index_ii
    
    ALLOCATE(huge_array(huge_number))
    
    ALLOCATE(thread_huge_array(SIZE(huge_array),omp_get_max_threads()))
    thread_huge_array = ZERO
    
    !$OMP PARALLEL PRIVATE (num_thread)
    
          num_thread = omp_get_thread_num()+1
    !$OMP DO
          DO index1=1,bignumber1
             ! Some calculations
             DO index2=1,bignumber2
                ! Some calculations
                DO index3=1,6
                   DO index4=1,num_weights_sp
                      DO index5=1,smallnumber
                         depending_index = function(index1, index2, index3, index4, index5)
                         thread_huge_array(depending_index, num_thread) = thread_huge_array(depending_index, num_thread)
                      ENDDO
                   ENDDO 
                ENDDO
             ENDDO 
          ENDDO 
    !$OMP END DO
    !$OMP END PARALLEL
    
    huge_array = ZERO
    
    DO index_ii = 1,omp_get_max_threads()
       huge_array = huge_array + thread_huge_array(:,index_ii)
    ENDDO
    
    DEALLOCATE(thread_huge_array)
    
    DEALLOCATE(huge_array)

 

And in this last case we obtain longer times overall (due to the much larger memory allocation) and worse relative speedup.
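A related point on an 80-core machine (presumably multi-socket) is page placement: the serial statement thread_huge_array = ZERO touches every page of the 2D array from a single thread, so under the usual Linux first-touch policy the whole array ends up on one NUMA node. A sketch of first-touch initialization, with each thread zeroing its own column inside the parallel region before the work loop, would be:

    !$OMP PARALLEL PRIVATE(num_thread)
          num_thread = omp_get_thread_num() + 1
          ! first touch: place each column on the NUMA node of the thread that will update it
          thread_huge_array(:, num_thread) = ZERO
    !$OMP END PARALLEL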

Can you provide some hints for achieving better speedup? Or is that simply not possible with such huge arrays in OpenMP?

Thanks for reading. This forum always helps me a lot with Fortran, and you have always provided me with very useful insights into the language and standards.

Best,

Adrian.

jimdempseyatthecove
Honored Contributor III

It may help if you provide additional information.

Run your 1st code without OpenMP (using stub functions), and under VTune.

What is the % CPU time for:

1st ! Some calculations
2nd ! Some calculations
depending_index = function(...

Also, explain in more detail what this statement is intended to do:

thread_huge_array(depending_index) = thread_huge_array(depending_index)

as written, it is not computing anything.

Also, can you show the code of: depending_index = function(index1, index2, index3, index4, index5)

Jim Dempsey

Adrian_A_1
Beginner

Hi Jim,

Thanks for answering me so fast! I didn't want to bother you with the details, since it is a double integral that resembles a convolution operation. I have updated the first code with the numerical loop limits of one real case, plus some insight into the scalability of this particular case:

COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), ALLOCATABLE :: thread_huge_array(:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index

ALLOCATE(huge_array(huge_number))

!$OMP PARALLEL FIRSTPRIVATE(thread_huge_array)
      ALLOCATE(thread_huge_array(SIZE(huge_array)))
      thread_huge_array = ZERO
!$OMP DO
      DO index1=1,396
         ! CPU time here: negligible (<0.000001)
         DO index2=1,3432
            ! CPU time from here to assignment of thread_huge_array: approx. 1 ms
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,100
                     thread_huge_array(depending_index+1) = thread_huge_array(depending_index+1) + function1(values_computed_before)
                     thread_huge_array(depending_index+2) = thread_huge_array(depending_index+2) + function2(values_computed_before)
                     thread_huge_array(depending_index+3) = thread_huge_array(depending_index+3) + function3(values_computed_before)
                     thread_huge_array(depending_index+4) = thread_huge_array(depending_index+4) + function4(values_computed_before)
                     thread_huge_array(depending_index+5) = thread_huge_array(depending_index+5) + function5(values_computed_before)
                     thread_huge_array(depending_index+6) = thread_huge_array(depending_index+6) + function6(values_computed_before)
                     depending_index = depending_index + 6;
                  ENDDO
                  depending_index = depending_index - 6*100;
               ENDDO 
               depending_index = depending_index + 6*100;
            ENDDO
         ENDDO 
      ENDDO 
!$OMP END DO
!$OMP MASTER
      huge_array = ZERO
!$OMP END MASTER
!$OMP BARRIER
!$OMP CRITICAL
      huge_array = huge_array + thread_huge_array
!$OMP END CRITICAL
      DEALLOCATE(thread_huge_array)
!$OMP END PARALLEL

Also, I moved the barrier, since the previous placement wasn't thread-safe (my bad).

Scalability results for this case (and for the whole code that I have shown you), in seconds:

Threads | First code      | Serial reduction
   1    | 132.90          |
   8    |  17.58  (0.42)  |  19.22  (0.37)
  20    |   8.33  (0.951) |   9.522 (0.785)
  40    |   5.89  (1.802) |   7.79  (1.58)
  80    |   6.748 (3.597) |  10.188 (3.27)

The first column of results is with the first code and the second with the serial reduction; the numbers in parentheses are the time spent in the reduction. You can see that going from 40 to 80 threads not only fails to improve the time but actually makes it worse, because the time spent in the reduction grows so much.
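If the reduction is what grows with the thread count, one thing we could try (just a sketch, assuming the shared 2D thread_huge_array(:,1:nthreads) layout of the second approach, with the extra integers ichunk, chunk_lo, chunk_hi, chunk_size, ith and nthreads declared) is to parallelize the final sum as well, so that each thread accumulates one contiguous chunk of huge_array over all per-thread columns instead of every thread adding its whole copy inside a CRITICAL section:

      nthreads   = omp_get_max_threads()
      chunk_size = (SIZE(huge_array) + nthreads - 1) / nthreads   ! illustrative chunking
      huge_array = ZERO
!$OMP PARALLEL DO PRIVATE(chunk_lo, chunk_hi, ith) SCHEDULE(STATIC)
      DO ichunk = 1, nthreads
         chunk_lo = (ichunk-1)*chunk_size + 1
         chunk_hi = MIN(ichunk*chunk_size, SIZE(huge_array))
         IF (chunk_lo > chunk_hi) CYCLE   ! possible empty last chunk
         DO ith = 1, nthreads
            huge_array(chunk_lo:chunk_hi) = huge_array(chunk_lo:chunk_hi) &
                                          + thread_huge_array(chunk_lo:chunk_hi, ith)
         ENDDO
      ENDDO
!$OMP END PARALLEL DO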

Thanks! I am looking forward to possible solutions (if any).

Adrian.

 

jimdempseyatthecove
Honored Contributor III

Can you partition the loop into two passes? Not seeing the code, I cannot make this assessment. Also, is depending_index now an input argument, or is there a missing statement that initializes depending_index?

While the depending_index values will repeat, it is not clear whether they overlap at specific loop levels. If they do not, then you can possibly eliminate thread_huge_array and selectively write the data directly into huge_array (like a filter process).

Something like the following:

!$omp parallel private(index1, index2, index3, index4, index5, index6, ...) shared(huge_array)
! *** note, all threads execute complete loop
iThread = omp_get_thread_num()
nThreads = omp_get_num_threads()
     DO index1=1,396
         ! CPU time here: negligible (<0.000001)
         ...
         DO index2=1,3432
            ! CPU time from here to end do: approx. 1 ms
            ...
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,100
                     ! *** filter the following by cache line and thread
                     if (MOD((depending_index+1-1)/8, nThreads) == iThread) huge_array(depending_index+1) = huge_array(depending_index+1) + function1(values_computed_before)
                     if (MOD((depending_index+2-1)/8, nThreads) == iThread) huge_array(depending_index+2) = huge_array(depending_index+2) + function2(values_computed_before)
                     if (MOD((depending_index+3-1)/8, nThreads) == iThread) huge_array(depending_index+3) = huge_array(depending_index+3) + function3(values_computed_before)
                     if (MOD((depending_index+4-1)/8, nThreads) == iThread) huge_array(depending_index+4) = huge_array(depending_index+4) + function4(values_computed_before)
                     if (MOD((depending_index+5-1)/8, nThreads) == iThread) huge_array(depending_index+5) = huge_array(depending_index+5) + function5(values_computed_before)
                     if (MOD((depending_index+6-1)/8, nThreads) == iThread) huge_array(depending_index+6) = huge_array(depending_index+6) + function6(values_computed_before)
                     depending_index = depending_index + 6;
                  ENDDO
                  depending_index = depending_index - 6*100;
               ENDDO 
               depending_index = depending_index + 6*100;
            ENDDO
         ENDDO 
      ENDDO 
!$omp end parallel

The above can be optimized further should this produce the correct results.

The effectiveness will depend on the function1..function6 times versus the approximately 1 ms time.

Jim Dempsey

Adrian_A_1
Beginner

Hi Jim,

again, thanks for answering me! I tried to split the loop, but since function1, function2, function3, ... depend on values_computed_before (from the earlier calculations), each thread would still have to run the same preceding code and then fill its corresponding positions in huge_array. So although it is a good idea for some applications, I don't think it applies here. Also, wouldn't I need a CRITICAL clause to assign values to huge_array, since it is shared?
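If writing directly into the shared huge_array were the way to go, one alternative to a CRITICAL section that I can think of (just a sketch, and assuming the compiler supports !$OMP ATOMIC updates on COMPLEX(KIND=DBL) scalars) would be to protect each element update with an atomic, for example:

!$OMP ATOMIC
      huge_array(depending_index+1) = huge_array(depending_index+1) + function1(values_computed_before)

although with this many updates the per-update overhead might easily dominate.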

I assume that for arrays this large it is simply not good practice to go much beyond 40 threads in OpenMP because of the memory problems, and that's that... unless there is something else I can do!

Thanks a lot for the help,

Adrian.

jimdempseyatthecove
Honored Contributor III

In the scheme above, all threads iterate the full range of indexes .AND. all threads compute:

! CPU time here: negligible (<0.000001)
  and
! CPU time from here to end do: approx. 1 ms

However, the filter if(MOD(depending_index... selects which thread performs the function... .AND. the corresponding write.

There is no need for a reduction and no need for a critical section.

The filter code will be effective WHEN the compute time for each function exceeds that of the redundant calculations.

I do not have sufficient information to determine the computation costs. Rather I will just provide the basic concept in sketch form for you to consider.

*** The code provided in #3 is insufficient to determine whether the outer two loops generate independent regions of writes into the output array. If they do, then parallelizing the outer loop (or collapsing the outer two loops) without a reduction will produce non-conflicting writes (non-overlapping depending_index values).
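A sketch of that collapse variant (assuming depending_index can be computed from the loop indices alone, as in the function(index1, ...) form of your first post, and assuming writes from different (index1, index2) pairs never overlap; any locals of the "Some calculations" blocks would also need to be private):

!$omp parallel do collapse(2) shared(huge_array) &
!$omp&   private(index3, index4, index5, depending_index)
      DO index1=1,396
         DO index2=1,3432
            ! Some calculations (moved inside index2 so the outer two loops are perfectly nested)
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,100
                     depending_index = function(index1, index2, index3, index4, index5)
                     huge_array(depending_index+1) = huge_array(depending_index+1) + function1(values_computed_before)
                     ! ... likewise for function2 through function6 ...
                  ENDDO
               ENDDO
            ENDDO
         ENDDO
      ENDDO
!$omp end parallel do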

***.*** Should the depending_index values conflict with the straight loop construct, the conflict may occur only between adjacent index1 .AND./.OR. adjacent index2 values. If that is the case, then consider DO index1=1,396,2 and/or DO index2=1,3432,2, then follow that/those loop(s) with DO index1=2,396,2 and/or DO index2=2,3432,2; in other words, one pass for each half or quadrant.

Jim Dempsey
