Solved: allocatable arrays in modules has unaligned access on Xeon Phi

perturbed_spoon · ‎12-14-2016

Hi everyone,

As the title said, an array from my modules is said to have unaligned access.
It's something like this:

module some_module
...
real(dp), allocatable :: a(:,:)
!dir$ attributes align:64 :: a 
...
end module some_module

It gives me unaligned access problem:

            remark #15389: vectorization support: reference some_module_mp_a_ has unaligned access

As far as I can tell from articles Data Alignment to Assist Vectorization and Fortran Array Data and Arguments and Vectorization, that is not the right thing to do. Another similar thread "Ifort array alignment problem" has the suggestion to use assume_aligned.

I tried this way, in another file:

subroutine some_subroutine

use some_module
...
!$omp parallel do private(...) schedule(...)
!dir$ assume_aligned a:64
do i = 1, n
    a(:,i) = 0.0d0
    do j = 1, n
        ! update a(:,i) here
    enddo
enddo
!$omp end parallel do
...
end subroutine some_subroutine

Compiling with version 16.0 gives me this error:

error #8018: The memory reference associated with an ASSUME_ALIGNED directive must not be in COMMON, a field reference, use associated, or host associated.

While compiling with version 17.0 works but it still says unaligned access in the report.

I complied all of them with -align array64byte. As far as I understand, aligned access with allocatable arrays from modules has problems. I originally assumed it's because of the OpenMP's clauses, but removing them doesn't change the report. Did I misunderstood something?

Martyn_C_Intel · ‎12-19-2016

Hi,

You're applying the VECTOR ALIGNED directives to the wrong loop. You should apply it to the array assignment(s), (the implied inner loop(s)), not the outer DO loop:

!$omp parallel do private(...) schedule(...)
do i = 1, n
!dir$ vector aligned
    a(:,i) = 0.0d0

    do j = 1, n
    !dir$ vector aligned
        ! update a(:,i) here
    enddo
enddo
!$omp end parallel do

For this to work, the first dimension of A needs to be a multiple of 64 bytes, as well as A being aligned.

-align array64bytes works for aligning both locally and globally declared arrays. But the alignment of dummy arguments in a subroutine depends on the alignment (and contiguity) of the actual arguments that you pass into it.

For KNC, if the compiler that you build with doesn't match (is newer than) the run-time libraries available on the coprocessor, I'd expect you to have more severe problems than alignment. For kmp_aligned_malloc(), you should make sure that libiomp5.so is available and accessible via the LD_LIBRARY_PATH for the coprocessor. I'd be surprised if the assume_aligned directive was what caused that reference, though. assume_aligned just asserts alignment to the compiler optimizer, it doesn't cause the compiler to align anything at allocation.

View solution in original post

Kevin_D_Intel · ‎12-15-2016

I just recently learned of some remaining issues related to the situation you nicely outlined.

For the loop-nest you showed, see if using a syntax where you specify the alignment of the first element of the array eliminates the unaligned accesses. (e.g. !dir$ assume_aligned a(1,1):64)

This may or may not eliminate the unaligned accesses for your case. From what I've learned, there can still be unaligned access issues related to array references in the loop. Here's an example case where the issue is not fully resolved and for which our Developers are considering how to resolve:

real(8),allocatable :: C(:,:),A(:,:),B(:,:)
!DIR$ ATTRIBUTES ALIGN: 16 :: C,A,B
...
!DIR$ ASSUME_ALIGNED  C(1,1):16,  A(1,1):16,  B(1,1):16
!!DIR$ ASSUME (MOD(N,2).eq.0)
    do J=1,5
      do I=1,N     
        C(I,1) = C(I,1) - A(I,J   )*B(I,J)  ! Only C(I,1) is aligned, others unaligned                          
        C(I,2) = C(I,2) - A(I,J+ 5)*B(I,J)  ! All unaligned
        C(I,3) = C(I,3) - A(I,J+10)*B(I,J)  ! All unaligned
      end do
    end do

I asked my colleague with more knowledge about this look over your post and reply also.

Martyn_C_Intel · ‎12-15-2016

Hi,

Both the switch -align array64byte and the directive !DIR$ ATTRIBUTES ALIGN:64 :: A work for aligning an allocatable array at its lower bound. Unlike for static arrays, the compiler does not in general know the lower bound at compile time. It can even vary during run-time, if the array is reallocated. The problem comes in asserting the alignment to the compiler for use in vectorization, since

!DIR$ ASSUME_ALIGNED A:64 doesn’t tell the compiler which element of A is aligned.

In the 17.0 compiler, support was added (for allocatable arrays or pointers only) for the syntax

!DIR$ ASSUME_ALIGNED A(1):64

which tells the compiler that element 1 of A is aligned. (Or it could be A(-1) if the lower bound is -1).

This works for 1-dimensional allocatable arrays or pointers. However, for multi-dimensional arrays there’s another problem: even if the first column is aligned, the compiler doesn’t know whether subsequent columns are aligned, because it doesn’t know the extent of the first dimension. (It needs to be a multiple of the alignment factor, here 64 bytes, for all columns to be aligned). For static arrays, you can assert this with an ASSUME directive, as in Kevin’s example, but this doesn’t work for allocatable arrays or pointers, for the same reason that the dimensions are only defined at run-time and can vary.

In simple cases such as your examples, you can assert that all columns are aligned like this:

!DIR$ ASSUME_ALIGNED A(1,i):64 (note the second subscript is "I", not "1").

(use the actual second subscript in your loop, even if it’s something like icount+1 or 13 )

There are currently some limitations – the directive needs to be inside your outer loop over J, either just before or inside the inner loop or array assignment, and you can’t make multiple assertions for different references to the same array in one loop. We’re working on relaxing these restrictions, hopefully in a forthcoming compiler update.

Please let us know whether the above works for you.

Incidentally, nothing about this is specific to Intel Xeon Phi, apart from the alignment factor of 64 bytes for Intel AVX-512. (It would be 32 bytes when targeting Intel AVX or AVX2 and 16 bytes when targeting Intel SSE).

Martyn_C_Intel · ‎12-16-2016

Incidentally, if all the array accesses in a loop are aligned, the simplest way to ask the compiler to generate aligned loads is

!DIR$ VECTOR ALIGNED immediately before the loop.

This works for array assignments as well as DO loops, and for all types of arrays, allocatable or not. You need to be sure that the data really are contiguous and aligned.

perturbed_spoon · ‎12-18-2016

Thank you for your extensive explanation.

I tried assumed_aligned directives and it gives me this error when running a native Xeon Phi application:

./my-executable: relocation error: ./my-executable: symbol kmp_aligned_malloc, version VERSION not defined in file libiomp5.so with link time reference

Most likely because I compiled it with version 17, but on the coprocessor itself only has version 15 is available. I don't think this will be fixed soon enough.
I then tried to use the vector aligned directives with -align array64byte.

subroutine some_subroutine

use some_module
...
!$omp parallel do private(...) schedule(...)
!dir$ vector aligned
do i = 1, n
    a(:,i) = 0.0d0
    !dir$ vector aligned ! or here, tried both positions, there are other arrays involved in this calculation
    do j = 1, n
        ! update a(:,i) here
    enddo
enddo
!$omp end parallel do
...
end subroutine some_subroutine

For single-threaded execution, KMP_PLACE_THREADS=1C,1t, it gives me this error:

forrtl: severe (154): array index out of bounds
Image                     PC                 Routine         Line     Source             
my-excutable             000000000050272D    Unknown         Unknown  Unknown
my-excutable             00000000005004C7    Unknown         Unknown  Unknown
my-excutable             00000000004A8464    Unknown         Unknown  Unknown
my-excutable             00000000004A8276    Unknown         Unknown  Unknown
my-excutable             0000000000434D65    Unknown         Unknown  Unknown
my-excutable             0000000000438B7D    Unknown         Unknown  Unknown
libpthread.so.0          00007F075D5FE4D0    Unknown         Unknown  Unknown
my-excutable             000000000041942E    Unknown         Unknown  Unknown

While multi-threaded execution, KMP_PLACE_THREADS=60C,4t, gives me segmentation faults:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred

In any case, I tried adding the "attributes align:64" directives around, where I declare the allocatable array a(:,:), where I allocate the array a(3,n), and where I use the array a itself. Obviously, there are several allocatable arrays other than a(:,:) involved. And the vectorization report also complained about their unaligned access. I suspect openmp is also part of the problem.

I wonder if you have any suggestion where to go from here?

Martyn_C_Intel · ‎12-19-2016

Hi,

You're applying the VECTOR ALIGNED directives to the wrong loop. You should apply it to the array assignment(s), (the implied inner loop(s)), not the outer DO loop:

!$omp parallel do private(...) schedule(...)
do i = 1, n
!dir$ vector aligned
    a(:,i) = 0.0d0

    do j = 1, n
    !dir$ vector aligned
        ! update a(:,i) here
    enddo
enddo
!$omp end parallel do

For this to work, the first dimension of A needs to be a multiple of 64 bytes, as well as A being aligned.

-align array64bytes works for aligning both locally and globally declared arrays. But the alignment of dummy arguments in a subroutine depends on the alignment (and contiguity) of the actual arguments that you pass into it.

For KNC, if the compiler that you build with doesn't match (is newer than) the run-time libraries available on the coprocessor, I'd expect you to have more severe problems than alignment. For kmp_aligned_malloc(), you should make sure that libiomp5.so is available and accessible via the LD_LIBRARY_PATH for the coprocessor. I'd be surprised if the assume_aligned directive was what caused that reference, though. assume_aligned just asserts alignment to the compiler optimizer, it doesn't cause the compiler to align anything at allocation.

perturbed_spoon · ‎12-21-2016

Martyn Corden (Intel) wrote:

For this to work, the first dimension of A needs to be a multiple of 64 bytes, as well as A being aligned.

Thanks again for the extensive explanation.

In that case, at least for my code, a solution is perhaps to split the multi-dimensional array a to separate arrays.

Let me ask the admin about the mismatch of libraries on the host and xeon phi.
I'm running the code as native execution, it's possible everyone else in my place use the offload approach.