Data alignment for supporting more efficient vectorization

Bob4 · 12-13-2018

I am testing on an AVX machine. The code looks roughly like this:

 include "Sub_Prog_1.f90"
 include "Sub_Prog_2.f90"

 program MyCode

 use Sub_Prog_1_Mod
 use Sub_Prog_2_Mod
 
 implicit none
                                                    
 integer, parameter :: dp = selected_real_kind(15,307), dp2 =selected_real_kind(15,307)    
 integer  :: array_size, i, j, k, l
 integer, dimension(:), allocatable :: idx  
 real(kind = dp2) :: time1, time2, omp_get_wtime
 real(kind = dp), dimension(:), allocatable :: a, b, c 

 array_size = 100000000  !assuming I read from an input file, here I just wrote like this

 allocate ( idx(array_size), a(array_size), b(array_size), c(array_size) ) 

   ! Initialization
       do i = 1, array_size
          a(i) = dble(i)   ;   b(i) = dble(i * 2)   ;   idx(i) = array_size - i + 1
       end do

   time1 = omp_get_wtime()

      !$omp parallel
       do i = 1, 10

          call Sub_Prog_1 ( array_size, idx, a, b )
          call Sub_Prog_2 ( array_size, a, b, c ) 

       end do
      !$omp end parallel

   time2 = omp_get_wtime()

   print *, c(8000000)
   print *, 'Results =', time2 - time1

 end program MyCode

!==================================================================

 subroutine Sub_Prog_1 ( array_size, idx, a, b )

 implicit none

 integer, parameter :: dp = selected_real_kind(15,307), dp2 = selected_real_kind(15,307)
 integer :: array_size, i, j, k, l
 integer, dimension(:), allocatable :: idx    
 real(kind = dp), dimension(:), allocatable :: a, b, c    
      
       !$omp do private(i) schedule(runtime)
       !dir$ vector aligned 
       !$omp simd simdlen(4)
        do i = 1, array_size   
            a(i) = a(idx(i)) + dble(i)
                if (a(i) <= 3000.0d+0) then
                     a(i) = dble(idx(i)) / 200.0d+0
                end if 
            b(i) = sqrt(b(i)) + dble(i * 2)
        end do
       !$omp end simd
       !$omp end do

    end subroutine Sub_Prog_1

!==================================================================

 subroutine Sub_Prog_2 ( array_size, a, b, c )

 implicit none

 integer, parameter :: dp = selected_real_kind(15,307), dp2 = selected_real_kind(15,307)  
 integer  :: array_size, i, j, k, l
 real(kind = dp), dimension(:), allocatable :: a, b, c    

       !$omp do private(i) schedule(runtime)
       !dir$ vector aligned
       !$omp simd simdlen(4)
        do i = 1, array_size 
            c(i) = a(i) + sqrt(b(i)) / 3.67d+0
               if (c(i) <= 350.0d+0) then
                     c(i) = a(i) + sqrt(b(i)) / 8.67d+0
               end if 
        end do
       !$omp end simd
       !$omp end do       

 end subroutine Sub_Prog_2

I wanted to use the Intel Compiler 19's support for aligned data access to get more efficient vectorization. Thus, I compiled using the flags "ifort -O3 -qopt-report5 -qopenmp -align array32byte -xAVX -o MyCode.exe Main.f90". Now, I have two questions.

I was wondering why I cannot combine !dir$ vector aligned and !$omp simd simdlen(...) as written above. The compiler always shows me a message like this:

Sub_Prog_1.f90(17): catastrophic error: **Internal compiler error: internal abort** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.
compilation aborted for Main.f90 (code 1)

As I actually prefer OpenMP directives to the Intel ones, I had also previously used the directives "!$omp simd simdlen(4) aligned(a,b,idx :32)" and "!$omp simd simdlen(4) aligned(a,b,c :32)" for the first and second subroutines, respectively. However, the vectorization reports showed that the arrays were still accessed unaligned. The only way I found to get both aligned access and vectorization was to use "!dir$ simd vectorlength(4)" instead of "!$omp simd simdlen(4)".

Could someone please explain this matter?

Many thanks.

Best wishes,

Steve_Lionel · 12-13-2018

Item 1 is a compiler bug. Please report it to Intel through https://supporttickets.intel.com/?lang=en-US and provide a complete source that reproduces the problem, along with the exact command line you used to compile.

I'll let someone else address your other question.

TimP · 12-14-2018

I don't know what you mean by "efficient vectorization." Restricting optimizations to those matching simdlen(4) seems likely to reduce performance, unless perhaps you hope to optimize short loop counts that are odd multiples of 4. I could understand the compiler failing to adapt, although it must not give an internal error, even if the compiler developers think it nonsense to go about it this way. As you are using Intel directives anyway, a loop count directive seems more apt.
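
For illustration, a loop count hint might look like the sketch below. This is only a minimal example under assumptions: the program name loop_count_hint, the arrays x and y, and the trip count of 12 are made up and do not come from the code above. Compilers that do not recognize the directive treat it as an ordinary comment.

```fortran
program loop_count_hint
   implicit none
   integer, parameter :: dp = selected_real_kind(15, 307)
   integer :: n, i
   real(kind=dp), dimension(:), allocatable :: x, y

   n = 12                      ! hypothetical short trip count
   allocate (x(n), y(n))
   x = 4.0_dp

   ! Intel-specific hint: tell the optimizer the expected trip count
   ! instead of forcing a particular vector length.
   !dir$ loop count (12)
   do i = 1, n
      y(i) = sqrt(x(i)) + real(i, dp)
   end do

   print *, y(n)
end program loop_count_hint
```

The hint only informs the optimizer's choices; it does not change what the loop computes.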

What are you looking for with your alignment directive? The only thing that should change is how the compiler generates the code that adjusts for alignment at the beginning of the loop. I suppose this could be examined by diffing the generated code listings, if you don't trust the opt_report when it says aligned. As it doesn't cost anything on a CPU that supports AVX, code is generated to support unaligned access even when optimized for aligned, unless you find the internal option to change this, in which case an internal error might not be a bug. You wouldn't see a difference in timing tests except with specific short loop counts. You could also achieve that for AVX with the vector unaligned directive.
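
Besides reading the report or the generated code, the actual alignment of an allocated array can be checked at run time. The following is a minimal sketch, assuming a compiler that supports iso_c_binding; the program name check_alignment, the array size, and the variable names are hypothetical and not taken from the code above. It prints how far the first and second elements of an array sit from a 32-byte boundary.

```fortran
program check_alignment
   use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
   implicit none
   integer, parameter :: dp = selected_real_kind(15, 307)
   integer(c_intptr_t), parameter :: boundary = 32   ! bytes in one AVX register
   real(kind=dp), dimension(:), allocatable, target :: a
   integer(c_intptr_t) :: addr1, addr2

   allocate (a(1000))          ! hypothetical size, for illustration only
   a = 0.0_dp

   ! Reinterpret the C addresses of a(1) and a(2) as integers.
   addr1 = transfer(c_loc(a(1)), addr1)
   addr2 = transfer(c_loc(a(2)), addr2)

   ! An offset of 0 means the element starts on a 32-byte boundary.
   print *, 'offset of a(1) from a 32-byte boundary:', modulo(addr1, boundary)
   print *, 'offset of a(2) from a 32-byte boundary:', modulo(addr2, boundary)
end program check_alignment
```

Whether a(1) reports an offset of 0 depends on the allocator and on options such as -align array32byte. Consecutive double precision elements are 8 bytes apart, so a vector access that starts at an arbitrary element, such as the indirect a(idx(i)) in the first subroutine, cannot be assumed to be 32-byte aligned.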