Array alignment with OpenMP private arrays

Bendi1 · 11-07-2025

I can write serial code that ifort/ifx vectorises with aligned access. However, as soon as I put an array into an OMP private clause, the compiler no longer treats it as suitable for aligned access. This makes sense, since each per-thread copy beyond the first is allocated at runtime. Even so, it seems to me that there must be a solution out there. I have spent a lot of time looking for one with no joy.

Here is my minimal working example (filename align.f90):

program openmp_align_test
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 128
  integer :: i
  real(dp), dimension(n) :: array1, array2

  !$OMP PARALLEL DEFAULT(NONE) PRIVATE(array1, array2, i)
  ! Initialize arrays
  array1 = 1.0_dp
  array2 = 2.0_dp

  ! Simple vectorizable loop
  !$OMP DO
  do i = 1, n
     array1(i) = array1(i) + array2(i)
  end do
  !$OMP END DO
  !$OMP END PARALLEL
end program openmp_align_test

First I compiled with:

ifort -O3 -xCORE-AVX512 -align array64byte -qopt-zmm-usage=high -qopt-report=5 align.f90

Note that this is without -qopenmp. In this case the three loops (the two initialisations and the addition) are fused and fully vectorised with aligned access. The optimisation report (align.optrpt) states: "estimated potential speedup: 9.450".

Then I compiled with '-qopenmp' and '-vec-threshold0':

ifort -O3 -xCORE-AVX512 -qopenmp -vec-threshold0 -align array64byte -qopt-zmm-usage=high -qopt-report=5 align.f90

Now the two initialisation loops are fused with aligned access, but the addition loop is vectorised with unaligned access (without the threshold flag the compiler chose not to vectorise it at all). This time the report states: "estimated potential speedup: 1.920".
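
One way to check the point about runtime allocation is to print, per thread, where the private copies actually land relative to a 64-byte boundary. The following is only a diagnostic sketch, not a fix. The program name check_private_alignment and the variables addr1 and addr2 are made up for illustration. The arrays are given the target attribute so that c_loc is legal, and iso_c_binding is used solely to obtain an address. It needs to be built with -qopenmp, and the offsets printed will depend on compiler, flags and thread count.

```fortran
program check_private_alignment
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 128
  ! "target" is added only so that c_loc may be applied; this is an
  ! assumption of the sketch and could itself influence code generation.
  real(dp), dimension(n), target :: array1, array2
  integer(c_intptr_t) :: addr1, addr2

  !$OMP PARALLEL DEFAULT(NONE) PRIVATE(array1, array2, addr1, addr2)
  array1 = 1.0_dp
  array2 = 2.0_dp

  ! Reinterpret the C address of each private copy as an integer.
  addr1 = transfer(c_loc(array1), addr1)
  addr2 = transfer(c_loc(array2), addr2)

  ! Serialise the output so lines from different threads do not interleave.
  !$OMP CRITICAL
  print '(a,i0,2(a,i0))', 'thread ', omp_get_thread_num(), &
        ': array1 offset = ', modulo(addr1, 64_c_intptr_t), &
        ', array2 offset = ', modulo(addr2, 64_c_intptr_t)
  !$OMP END CRITICAL
  !$OMP END PARALLEL
end program check_private_alignment
```

An offset of 0 means that copy happens to sit on a 64-byte boundary. A non-zero offset on any thread would be consistent with the compiler declining to assume aligned access for the private arrays.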

I have been using ifort 2021.2.0 but would be very happy with ifx-only solutions.

Finally, here are some more details on what I am trying to achieve beyond the minimum example above.

Making array1 and array2 allocatable or on the heap in any way is desirable. Using iso_c_binding to achieve this is less desirable.
I would prefer solutions that do not use Intel-specific directives. I am happy with OpenMP solutions to get the alignment right. However, it seems the OpenMP standard is ahead of most compilers when it comes to alignment.
I have tried making array1 and array2 persistent threadprivate arrays, but I get worse results than when declaring them in an OMP private clause as above.