I am using allocatable 3D arrays in my application and I often need to perform simple operations like initialization to a constant value (usually 0), multiplication, subtraction, etc. I thought it would be a good idea to make this parallel using OpenMP, so I have e.g. this for the initialization:
[fxfortran]
      subroutine zero_field ( s )
      include 'settings.inc'
      include 'variables.inc'
      integer*4 i, j, k
      real*4 s(nx,ny,nz)
!$omp parallel if ( enableOpenMP ) num_threads ( threads ) default ( shared )
!$omp do schedule(static) private ( i, j, k )
      do i = 1, nx
         do j = 1, ny
            do k = 1, nz
               s(i,j,k) = 0.0
            end do
         end do
      end do
!$omp end do
!$omp end parallel
      return
      end
[/fxfortran]
I tried checking locks and waits using Intel VTune Amplifier, and the result is that the wait time is enormous for the parallelized nested loop above, even with nx=250, ny=150, nz=50, which I do not think are particularly small values.
Is there any document related to making such simple things efficient and optimal during parallelization? What could I try changing in my code to decrease the wait time?
4 Replies
There is a certain overhead in creating a parallel region and then leaving it again. The loop you show actually does very little work, which means the array would have to be enormous before you see any performance gain.
Why not rely on (simple) array operations? The compiler can optimise those quite well. In the above:
s = 0.0
would suffice.
Regards,
Arjen
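To spell out the whole-array version: the subroutine reduces to a single assignment statement. This is a minimal sketch using the same names as the original routine; the include files are assumed to supply nx, ny and nz as before:
[fxfortran]
      subroutine zero_field ( s )
      include 'settings.inc'
      include 'variables.inc'
      real*4 s(nx,ny,nz)
!     One whole-array assignment; the compiler is free to turn
!     this into an optimised memset-style fill.
      s = 0.0
      return
      end
[/fxfortran]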
Under OpenMP, the compiler isn't permitted to reorder your loop nesting, so it will heed your specification, which produces massive false sharing: neighbouring i values, handled by different threads, land in the same cache lines.
In the non-parallel case suggested by Arjen, recent versions of ifort should substitute an automatic fast memset, which switches to non-temporal stores (bypassing cache) when the array is large compared with cache. With older compilers, you might achieve a similar effect by placing a VECTOR NONTEMPORAL directive.
You would expect performance to be limited by memory bandwidth, so that parallelization might achieve no gain on a single CPU platform.
On a multiple-CPU platform, you should achieve some gain up to 1 thread per CPU by using the multiple memory channels more effectively. This requires setting BIOS NUMA mode on platforms that have it, setting KMP_AFFINITY, and using consistent access patterns so that the array stays partitioned effectively across the memory channels.
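For reference, the directive mentioned above can be sketched like this (the !DIR$ syntax is Intel-specific; the routine name and the size argument n are placeholders, not from the original code):
[fxfortran]
      subroutine zero_big ( s, n )
      integer*4 n, i
      real*4 s(n)
!     Intel-specific hint: use streaming (non-temporal) stores,
!     bypassing the cache for arrays too large to fit in it.
!DIR$ VECTOR NONTEMPORAL
      do i = 1, n
         s(i) = 0.0
      end do
      return
      end
[/fxfortran]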
Thank you guys for your suggestions and explanations. As I am not an expert in Fortran, it is getting difficult for me to follow what TimP wrote. Anyway, what I take from both posts is that I should try s = 0 and let the compiler (I am using the latest version) do its best to optimize the operation. The question is whether I should enable the auto-parallelization option (it is probably intended only for loops), or whether I should do anything else to make s = 0 parallel or simply faster. Or will vectorization take care of this?
I cannot really comment on that (not being an expert in compiler technology), but here are some general observations:
- For the compiler it is easier to optimise such array operations than the do-loop, simply because there can be no mistake about the intention (most real do-loops are a bit more complicated than the one you showed).
- In general, with nested do-loops, let the rightmost index vary the slowest and the leftmost index the fastest. That way you get the most localised memory access.
In your example you let the third index vary fastest - that means you are accessing array elements that are widely separated from each other. That goes against the caching mechanism built into all computers of the past several decades ;). The compiler is able to solve that in simple cases, but it may be hampered by the OpenMP directives in doing so.
The preferred loop order would be something like (using the array bounds from your routine):
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         s(i,j,k) = 0.0
      end do
   end do
end do
In Fortran the elements in an array are stored such that s(i+1,j,k) is next to s(i,j,k) etc.
(I can never remember if that is row-major or column-major order ;))
Regards,
Arjen
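Putting the two replies together, here is a hedged sketch of the original subroutine with the loop order corrected: the parallelised k loop now hands each thread whole contiguous (i,j) planes. Same assumed include files and variables as the original question:
[fxfortran]
      subroutine zero_field ( s )
      include 'settings.inc'
      include 'variables.inc'
      integer*4 i, j, k
      real*4 s(nx,ny,nz)
!$omp parallel if ( enableOpenMP ) num_threads ( threads ) default ( shared )
!$omp do schedule(static) private ( i, j, k )
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
!              i varies fastest: contiguous, cache-friendly stores,
!              and each thread owns whole planes - no false sharing.
               s(i,j,k) = 0.0
            end do
         end do
      end do
!$omp end do
!$omp end parallel
      return
      end
[/fxfortran]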