Using ifort 14.0.1.106 Build 20131008, the following code:
[fortran]module my_module
  type :: my_type
    real, allocatable :: arr(:,:,:)
  end type
contains
  subroutine mult_1(self,scalar)
    class(my_type) :: self
    real :: scalar
    self%arr = self%arr * scalar      ! source line 9 in the report below
  end subroutine
  subroutine mult_2(self,scalar)
    type(my_type) :: self
    real :: scalar
    self%arr = self%arr * scalar      ! source line 14 in the report below
  end subroutine
  subroutine mult_3(self,scalar)
    class(my_type) :: self
    real :: scalar
    select type(self)
    type is (my_type)
      self%arr = self%arr * scalar    ! source line 21 in the report below
    end select
  end subroutine
end module[/fortran]
Compiled with: ifort -opt-report=1 -c
Gives the optimization report copied below, from which I conclude that only the loop at line 14 was successfully vectorized.
I would like to understand:
- line 9: why no vectorization? The advice regarding arrays of structures seems wrong: this is a REAL array, which I thought should be contiguous.
- line 21: why no vectorization? IMO, the optimizer should see this line exactly as it sees line 14. What is the difference?
<;-1:-1;IPO;;0>
WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false
INLINING OPTION VALUES:
-inline-factor: 100
-inline-min-size: 30
-inline-max-size: 230
-inline-max-total-size: 2000
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000
<test_opt.f90;1:1;IPO INLINING;my_module._;0>
INLINING REPORT: (my_module._) [1/4=25.0%]
<test_opt.f90;6:10;IPO INLINING;my_module_mp_mult_1_;0>
INLINING REPORT: (my_module_mp_mult_1_) [2/4=50.0%]
HPO VECTORIZER REPORT (my_module_mp_mult_1_) LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hpo_vectorization;my_module_mp_mult_1_;0>
HPO Vectorizer Report (my_module_mp_mult_1_)
test_opt.f90(9:3-9:3):VEC:my_module_mp_mult_1_: loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
HLO REPORT LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hlo;my_module_mp_mult_1_;0>
High Level Optimizer Report (my_module_mp_mult_1_)
<test_opt.f90;9:9;hlo_linear_trans;my_module_mp_mult_1_;0>
Poor memory locality: Array of structurese detected
Advice: Replacing Array of Structures by Structure of Arrays might help loopnest at lines: 9
<test_opt.f90;11:15;IPO INLINING;my_module_mp_mult_2_;0>
INLINING REPORT: (my_module_mp_mult_2_) [3/4=75.0%]
HPO VECTORIZER REPORT (my_module_mp_mult_2_) LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hpo_vectorization;my_module_mp_mult_2_;0>
HPO Vectorizer Report (my_module_mp_mult_2_)
test_opt.f90(14:3-14:3):VEC:my_module_mp_mult_2_: LOOP WAS VECTORIZED
loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
<test_opt.f90;-1:-1;hlo;my_module_mp_mult_2_;0>
High Level Optimizer Report (my_module_mp_mult_2_)
<test_opt.f90;16:23;IPO INLINING;my_module_mp_mult_3_;0>
INLINING REPORT: (my_module_mp_mult_3_) [4/4=100.0%]
  -> strcmp(EXTERN)
HPO VECTORIZER REPORT (my_module_mp_mult_3_) LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hpo_vectorization;my_module_mp_mult_3_;0>
HPO Vectorizer Report (my_module_mp_mult_3_)
test_opt.f90(21:4-21:4):VEC:my_module_mp_mult_3_: loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
<test_opt.f90;-1:-1;hlo;my_module_mp_mult_3_;0>
High Level Optimizer Report (my_module_mp_mult_3_)
Loopnest Preprocessing Report:
<test_opt.f90;21:0;hlo;my_module_mp_mult_3_;0>
Preprocess Loopnests <my_module_mp_mult_3_>: Moving Out Store @Line<0> in Loop @Line<21>
Preprocess Loopnests <my_module_mp_mult_3_>: Moving Out Store @Line<0> in Loop @Line<21>
Block, Unroll, Jam Report:
(loop line numbers, unroll factors and type of transformation)
<test_opt.f90;21:21;hlo_unroll;my_module_mp_mult_3_;0>
Loop at line 21 unrolled with remainder by 2
<test_opt.f90;-1:-1;PGO;my_module_mp_mult_3_;0>
STATIC:test_opt.f90 my_module_mp_mult_3_
<test_opt.f90;-1:-1;PGO;my_module_mp_mult_2_;0>
STATIC:test_opt.f90 my_module_mp_mult_2_
<test_opt.f90;-1:-1;PGO;my_module_mp_mult_1_;0>
STATIC:test_opt.f90 my_module_mp_mult_1_
<test_opt.f90;-1:-1;PGO;my_module._;0>
STATIC:test_opt.f90 my_module._
<;-1:-1;PGO;;0>
4 FUNCTIONS HAD VALID STATIC PROFILES
IPO CURRENT QUALITY METRIC: 50.0%
IPO POSSIBLE QUALITY METRIC: 50.0%
IPO QUALITY METRIC RATIO: 100.0%
mult_1 (line 9): vectorization of a write to an allocatable component of a class is a known issue: the compiler is not able to prove that this is safe, even though it can do so for an explicit derived type. I'm afraid the HLO message about memory locality and arrays of structures is a distraction. You can work around this by inserting !DIR$ SIMD before the array assignment (or, if you prefer OpenMP 4.0 standard syntax, !$OMP SIMD, but then you must compile with -openmp or -openmp-simd). If the allocatable component were one-dimensional, you could also use !DIR$ IVDEP; however, for a multidimensional array assignment, the compiler does not know to which loop (i.e., which array index) to apply the directive. I'll report back when a compiler containing a fix for this issue is available, but I don't know when that might be.
I believe the issue in mult_3 (line 21) is similar: the compiler cannot satisfy itself that vectorization of class my_type is safe, and it doesn't seem to realize that it could switch to the rules for types. This can be worked around with compiler directives, as before (see the sketch below). I believe a fix for this simpler case may already be targeted for a future compiler.
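For reference, here is a minimal sketch of the directive workaround applied to mult_1 (whether a given compiler version honours a simd directive placed on an array assignment is discussed further down the thread):
[fortran]subroutine mult_1(self,scalar)
  class(my_type) :: self
  real :: scalar
  ! Assert to the compiler that vectorizing this array assignment is safe.
  ! The OpenMP 4.0 form would be "!$omp simd", which additionally requires
  ! compiling with -openmp or -openmp-simd.
  !dir$ simd
  self%arr = self%arr * scalar
end subroutine[/fortran]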
Incidentally, you can get a better idea of what is going on by increasing the diagnostic level of the report, but restricting it to the "high performance optimizer" (vectorization and parallelization):
ifort -c -opt-report=2 -opt-report-phase=hpo
or use -opt-report=3 to get explicit dependency messages (which don't help much in this case):
HPO VECTORIZER REPORT (my_module_mp_mult_1_) LOG OPENED ON Mon Jan 27 16:50:10 2014
<test_type_vec.f90;-1:-1;hpo_vectorization;my_module_mp_mult_1_;0>
HPO Vectorizer Report (my_module_mp_mult_1_)
test_type_vec.f90(9:3-9:3):VEC:my_module_mp_mult_1_: loop was not vectorized: existence of vector dependence
potential FLOW dependence
potential ANTI dependence
loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
Another way to get just the vectorization report is with -vec-report2 (or -vec-report3 or -vec-report6) instead of -opt-report.
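For example, assuming the source file is named test_opt.f90 as in the report above:
ifort -c -vec-report2 test_opt.f90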
Thanks Martyn. I've been playing around with this some more, but the forum software won't let me reply on this thread with more than a few lines, so I've started another thread with another code sample.
Please see http://software.intel.com/en-us/forums/topic/500890
Alexis,
I'm sorry, but we can't see your new post; I don't know what happened there. Could you please try again to reply to this thread (or open a new one if it's a new topic or issue), using plain text if inserting/attaching code isn't working?
My other post finally appeared on the forum this morning, maybe 10 hours after I'd initially submitted it. But it appears to have been deleted now. I reproduce it here.
I am investigating vectorization strategies for my project, where most of the computation is over three-dimensional allocatable arrays within type-bound procedures. Typically I allocate a real array and then set up a complex pointer to it (see example below), so that I can do in-place Fourier transforms. For now, I'm testing on a Xeon E5-2680 using composer_xe_2013_sp1.0.080 with the minimal "reproducer" below, but I am also testing on a Xeon Phi board, hence my focus on vectorization right now.
The results I get (copied below) with -O2 suggest some terrible code generation from the do concurrent and inefficient vectorization of the array assignment with the simd directive. Also, the explicit nested do loops are much better optimized than the other constructs, which I expected to be equivalent. What's going on?
Finally, the fastest way to do this work with -O2 was to loop over the complex array, presumably because there are fewer iterations when looping over the complex array (Edit: I had an error in my original code listing, now corrected; the complex array has 2x fewer elements, not 4x, and the timings have been updated accordingly). What is the compiler doing there that it can't do when looping over the real array? Can I help or force it to do so? That would be a great speedup if I could generalize it to other loops. I have tried to look at the assembly using Amplifier, but I am not experienced enough to make sense of it.
When compiled with -O2 -assume realloc_lhs -heap-arrays -mkl, here are the timings:
mult1 : did 1000 trips in .72 cpu seconds
mult1_simd : did 1000 trips in 1.30 cpu seconds
mult1_do : did 1000 trips in .52 cpu seconds
mult1_do_ptr : did 1000 trips in .52 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .32 cpu seconds
mult1_do_concurrent: did 1000 trips in 6.74 cpu seconds
mult1_forall : did 1000 trips in .72 cpu seconds
When compiled with -xHost -no-prec-div -O3 -assume realloc_lhs -heap-arrays -mkl, annoyingly the code seems significantly slower in most cases. What's going on there? Perhaps other optimization flags would work better for this compiler?
mult1 : did 1000 trips in 1.01 cpu seconds
mult1_simd : did 1000 trips in 1.55 cpu seconds
mult1_do : did 1000 trips in 1.00 cpu seconds
mult1_do_ptr : did 1000 trips in .52 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .32 cpu seconds
mult1_do_concurrent: did 1000 trips in 6.78 cpu seconds
mult1_forall : did 1000 trips in 1.36 cpu seconds
Here is the code:
[fortran]module my_module
  use iso_c_binding
  type :: my_type
    real, allocatable :: arr(:,:,:)
    !dir$ attributes align:64 :: arr
    real, pointer, contiguous :: arr_ptr(:,:,:)
    complex, pointer, contiguous :: arr_ptr_comp(:,:,:)
  end type
contains
  subroutine my_allocation(self)
    type(my_type), target :: self
    type(c_ptr) :: cptr
    allocate(self%arr(1024,1024,1))
    self%arr_ptr => self%arr
    cptr = c_loc(self%arr)
    call c_f_pointer(cptr,self%arr_ptr_comp,[512,1024,1])
  end subroutine
  subroutine mult_1(self,scalar)
    class(my_type) :: self
    real :: scalar
    self%arr = self%arr * scalar
  end subroutine
  subroutine mult_1_simd(self,scalar)
    class(my_type) :: self
    real :: scalar
    !dir$ simd
    self%arr = self%arr * scalar
  end subroutine
  subroutine mult_1_do(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do k=1,size(self%arr,3)
      do j=1,size(self%arr,2)
        do i=1,size(self%arr,1)
          self%arr(i,j,k) = self%arr(i,j,k) * scalar
        enddo
      enddo
    enddo
  end subroutine
  subroutine mult_1_do_ptr(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do k=1,size(self%arr,3)
      do j=1,size(self%arr,2)
        do i=1,size(self%arr,1)
          self%arr_ptr(i,j,k) = self%arr_ptr(i,j,k) * scalar
        enddo
      enddo
    enddo
  end subroutine
  subroutine mult_1_do_ptr_comp(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do k=1,size(self%arr_ptr_comp,3)
      do j=1,size(self%arr_ptr_comp,2)
        do i=1,size(self%arr_ptr_comp,1)
          self%arr_ptr_comp(i,j,k) = self%arr_ptr_comp(i,j,k) * scalar
        enddo
      enddo
    enddo
  end subroutine
  subroutine mult_1_do_concurrent(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do concurrent(i=1:size(self%arr,1),j=1:size(self%arr,2),k=1:size(self%arr,3))
      self%arr(i,j,k) = self%arr(i,j,k) * scalar
    enddo
  end subroutine
  subroutine mult_1_forall(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    forall(i=1:size(self%arr,1),j=1:size(self%arr,2),k=1:size(self%arr,3))
      self%arr(i,j,k) = self%arr(i,j,k) * scalar
    end forall
  end subroutine
end module

program vectest
  use my_module
  implicit none
  type(my_type) :: my_obj
  integer, parameter :: number_of_trips = 1000
  real :: time_begin,time_end
  integer :: current_trip

  call my_allocation(my_obj)

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1              : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_simd(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_simd         : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do           : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr_ptr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do_ptr(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do_ptr       : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr_ptr_comp = (1.0e0,1.0e0)
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do_ptr_comp(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do_ptr_comp  : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do_concurrent(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do_concurrent: did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_forall(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_forall       : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'
end program vectest[/fortran]
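For completeness, the build and run steps corresponding to the first set of timings would look roughly like this (the source file name here is an assumption):
ifort -O2 -assume realloc_lhs -heap-arrays -mkl vectest.f90 -o vectest
./vectest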
Sorry - I think I deleted the post by accident thinking it was a duplicate.
These timings were taken after commenting out the simd directive (what did your compiler report doing with it?).
On my Windows laptop (the only one I have with AVX) with ifort 14.0.1 :
ifort -assume:realloc_lhs -heap-arrays ar.f90
$ ./ar
mult1 : did 1000 trips in .98 cpu seconds
mult1_simd : did 1000 trips in .97 cpu seconds
mult1_do : did 1000 trips in .72 cpu seconds
mult1_do_ptr : did 1000 trips in .73 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .45 cpu seconds
mult1_do_concurrent: did 1000 trips in 18.52 cpu seconds
mult1_forall : did 1000 trips in 3.00 cpu seconds
ifort -assume:realloc_lhs -heap-arrays -QxHost ar.f90
$ ./ar
mult1 : did 1000 trips in .39 cpu seconds
mult1_simd : did 1000 trips in .41 cpu seconds
mult1_do : did 1000 trips in .69 cpu seconds
mult1_do_ptr : did 1000 trips in .72 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .44 cpu seconds
mult1_do_concurrent: did 1000 trips in 20.22 cpu seconds
mult1_forall : did 1000 trips in 2.84 cpu seconds
The forall is faster on the stack. For a rank-1 forall, ifort will skip the temporary allocation when !dir$ ivdep is set (a sketch follows).
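As a rough illustration of that rank-1 case (made up for this post, not taken from the code above):
[fortran]! ivdep asserts there is no loop-carried dependence, which, per the remark
! above, lets ifort skip the temporary it would otherwise create for the forall.
!dir$ ivdep
forall (i = 1:n)
  a(i) = a(i) * scalar
end forall[/fortran]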
I didn't see you mention that the compiler reports "subscript too complex" and declines to optimize the do concurrent until the loop order is changed to outer, middle, inner, at which point it reaches the .72-second time. I don't know whether that is an accepted way to optimize do concurrent; gfortran will optimize only a single-level loop there.
gfortran is faster than ifort on mult1_do and mult1_do_ptr, and sse2 vs. avx makes little difference.
According to
$ ifort -assume:realloc_lhs -Qvec-report7 -QxHost /Qunroll2 ar.f90 /link /stack:800000000 2>&1 | ../vecanalysis/vecanalysis.py --annotate
the in-lined versions of mult1_do, mult1_do_ptr, and the optimized do_concurrent use 50% more "vector operations" than the faster ones, which evidently is reflected in the run time. The in-lining in this benchmark appears possibly counter-productive (perhaps accounting for the losses in comparison to gfortran?).
Thanks for your comments, Tim.
To answer your points:
- my compiler said SIMD LOOP WAS VECTORIZED at line 27. Is this not what you expected? Why did you remove it? I was following Martyn's suggestion above, but am I missing something?
- "forall is faster on stack". Do you mean without -heap-arrays? I believe I have to have -heap-arrays in there for unrelated reasons, so I shan't explore further. But do you know why the heap v stack may make a difference here?
- thanks for the hint regarding inlining. Disabling inlining makes the results look a bit saner:
mult1 : did 1000 trips in .70 cpu seconds
mult1_simd : did 1000 trips in .70 cpu seconds
mult1_do : did 1000 trips in .25 cpu seconds
mult1_do_ptr : did 1000 trips in .25 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .24 cpu seconds
mult1_do_concurrent: did 1000 trips in 6.63 cpu seconds
mult1_forall : did 1000 trips in .65 cpu seconds
Again, do you have a feeling for why preventing inlining may help? I'd love to understand this better so I might use lessons learnt here in "real world" applications.
I guess it seems to me that the compiler should be doing a much better job of optimizing mult1 and mult1_forall. And the treatment of the do concurrent construct seems buggy to me. I much prefer the gfortran behaviour you report.
I'm surprised that your version of the compiler accepted a simd directive on an array assignment, but it isn't needed, unless it can be used to suppress a temporary allocation. It's difficult to guess where the compiler writers have set the dividing line between directives which can apply only to DO loops or similar and those which apply also to array assignments.
The developers of the OpenMP 4 standard must have decided not to tackle the ambiguity of simd clauses on array assignments (since the clause applies specifically to the DO loop immediately following the directive). Apparently, there was debate on whether omp workshare should be retained and to what extent it should be useful; in practice it's not at all useful with either ifort or gfortran (an illustration of the construct follows).
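For context, the workshare construct in question wraps array assignments like this (a generic illustration, not taken from the code in this thread):
[fortran]! OpenMP workshare divides the array assignment among the threads of the team;
! as noted above, neither ifort nor gfortran gains anything useful from it here.
!$omp parallel workshare
self%arr = self%arr * scalar
!$omp end parallel workshare[/fortran]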
Your example points out how the compiler makes a distinction between multi-rank assignments and DO loops where you have specified the optimum order of operations. I already remarked on how it might be surprising that ifort can begin to optimize DO CONCURRENT only with a specific order of index specifications (and that other compilers can't optimize it even with the order changed). I use a lot of plain DO loops with the inner loop written as a rank-1 array assignment (sketched below), but that is an unpopular style.
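A sketch of that style, applied to the arrays from the example above:
[fortran]do k = 1, size(self%arr,3)
  do j = 1, size(self%arr,2)
    ! innermost dimension expressed as a rank-1 array-section assignment
    self%arr(:,j,k) = self%arr(:,j,k) * scalar
  enddo
enddo[/fortran]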
MATMUL is the most obvious case where a compiler should be expected to optimize reliably what is in effect a multiple-rank assignment. Each compiler has its own peculiar MATMUL options.
The greater speed of FORALL with stack allocation presumably is due to the greater overhead of heap memory allocation (library function calls, and so on). So I suppose the lesson is to be doubly wary of FORALL requiring allocation of a temporary array when using -heap-arrays. I avoid FORALL usage where it requires over-writing an operand, which will cause the compiler to generate a temporary copy (see the example below). FORALL cases where DO CONCURRENT doesn't work are best avoided.
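For instance, a FORALL such as this one (a made-up rank-1 case, not from the code above) over-writes its own operand, so the right-hand sides must be evaluated into a temporary before any assignment takes place:
[fortran]forall (i = 1:n-1)
  ! a(i+1) is read here but is also assigned by the i+1 instance, so FORALL
  ! semantics force the right-hand sides to be captured in a temporary first.
  a(i) = a(i) + a(i+1)
end forall[/fortran]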
By popular demand, commercial compilers are expected to attempt to optimize away redundant extra looping discovered through inter-procedural analysis, but it isn't done reliably. It's a little surprising that in your case this makes things inconsistently worse. Popular demand has also led to the elimination of any documented compiler reports on attempts (successful or not) at dead-code removal. Vectorization reports in the latest gfortran are so verbose and arcane as to be nearly useless for anyone but compiler developers, and the same is probably true of the undocumented il0 reports in ifort.
Alexis,
All three of your loops are vectorized by version 16 of the Intel compiler, which was released a few days ago. (Loop 3 was already vectorized by the version 15 compiler). I believe the handling of DO CONCURRENT has also improved significantly since version 14.0. The version 16 compiler is available from registrationcenter.intel.com with current support. Thanks for reporting these vectorization issues.
There are some known performance issues associated with using -heap-arrays with OpenMP applications, which are being worked on. But that's another subject...
