Using ifort 14.0.1.106 Build 20131008, the following code:
[fortran]module my_module
  type :: my_type
    real, allocatable :: arr(:,:,:)
  end type
contains
  subroutine mult_1(self,scalar)
    class(my_type) :: self
    real :: scalar
    self%arr = self%arr * scalar      ! source line 9 in the report below
  end subroutine
  subroutine mult_2(self,scalar)
    type(my_type) :: self
    real :: scalar
    self%arr = self%arr * scalar      ! source line 14 in the report below
  end subroutine
  subroutine mult_3(self,scalar)
    class(my_type) :: self
    real :: scalar
    select type(self)
    type is (my_type)
      self%arr = self%arr * scalar    ! source line 21 in the report below
    end select
  end subroutine
end module[/fortran]
Compiled with: ifort -opt-report=1 -c
Gives the optimization report copied below, from which I conclude that only the loop at line 14 was successfully vectorized.
I would like to understand:
- line 9: why no vectorization? The advice regarding arrays of structures seems wrong: this is a REAL array, which I thought should be contiguous.
- line 21: why no vectorization? IMO, the optimizer should see this line exactly as it sees line 14. What is the difference?
<;-1:-1;IPO;;0>
WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false
INLINING OPTION VALUES:
-inline-factor: 100
-inline-min-size: 30
-inline-max-size: 230
-inline-max-total-size: 2000
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000
<test_opt.f90;1:1;IPO INLINING;my_module._;0>
INLINING REPORT: (my_module._) [1/4=25.0%]
<test_opt.f90;6:10;IPO INLINING;my_module_mp_mult_1_;0>
INLINING REPORT: (my_module_mp_mult_1_) [2/4=50.0%]
HPO VECTORIZER REPORT (my_module_mp_mult_1_) LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hpo_vectorization;my_module_mp_mult_1_;0>
HPO Vectorizer Report (my_module_mp_mult_1_)
test_opt.f90(9:3-9:3):VEC:my_module_mp_mult_1_: loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
HLO REPORT LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hlo;my_module_mp_mult_1_;0>
High Level Optimizer Report (my_module_mp_mult_1_)
<test_opt.f90;9:9;hlo_linear_trans;my_module_mp_mult_1_;0>
Poor memory locality: Array of structurese detected
Advice: Replacing Array of Structures by Structure of Arrays might help loopnest at lines: 9
<test_opt.f90;11:15;IPO INLINING;my_module_mp_mult_2_;0>
INLINING REPORT: (my_module_mp_mult_2_) [3/4=75.0%]
HPO VECTORIZER REPORT (my_module_mp_mult_2_) LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hpo_vectorization;my_module_mp_mult_2_;0>
HPO Vectorizer Report (my_module_mp_mult_2_)
test_opt.f90(14:3-14:3):VEC:my_module_mp_mult_2_: LOOP WAS VECTORIZED
loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
<test_opt.f90;-1:-1;hlo;my_module_mp_mult_2_;0>
High Level Optimizer Report (my_module_mp_mult_2_)
<test_opt.f90;16:23;IPO INLINING;my_module_mp_mult_3_;0>
INLINING REPORT: (my_module_mp_mult_3_) [4/4=100.0%]
  -> strcmp(EXTERN)
HPO VECTORIZER REPORT (my_module_mp_mult_3_) LOG OPENED ON Sun Jan 26 00:27:04 2014
<test_opt.f90;-1:-1;hpo_vectorization;my_module_mp_mult_3_;0>
HPO Vectorizer Report (my_module_mp_mult_3_)
test_opt.f90(21:4-21:4):VEC:my_module_mp_mult_3_: loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
<test_opt.f90;-1:-1;hlo;my_module_mp_mult_3_;0>
High Level Optimizer Report (my_module_mp_mult_3_)
Loopnest Preprocessing Report:
<test_opt.f90;21:0;hlo;my_module_mp_mult_3_;0>
Preprocess Loopnests <my_module_mp_mult_3_>: Moving Out Store @Line<0> in Loop @Line<21>
Preprocess Loopnests <my_module_mp_mult_3_>: Moving Out Store @Line<0> in Loop @Line<21>
Block, Unroll, Jam Report:
(loop line numbers, unroll factors and type of transformation)
<test_opt.f90;21:21;hlo_unroll;my_module_mp_mult_3_;0>
Loop at line 21 unrolled with remainder by 2
<test_opt.f90;-1:-1;PGO;my_module_mp_mult_3_;0>
STATIC:test_opt.f90 my_module_mp_mult_3_
<test_opt.f90;-1:-1;PGO;my_module_mp_mult_2_;0>
STATIC:test_opt.f90 my_module_mp_mult_2_
<test_opt.f90;-1:-1;PGO;my_module_mp_mult_1_;0>
STATIC:test_opt.f90 my_module_mp_mult_1_
<test_opt.f90;-1:-1;PGO;my_module._;0>
STATIC:test_opt.f90 my_module._
<;-1:-1;PGO;;0>
4 FUNCTIONS HAD VALID STATIC PROFILES
IPO CURRENT QUALITY METRIC: 50.0%
IPO POSSIBLE QUALITY METRIC: 50.0%
IPO QUALITY METRIC RATIO: 100.0%
mult_1 (line 9): vectorization of a write to an allocatable component of a class is a known issue: the compiler is not able to prove that this is safe, even though it can do so for an explicit derived type. I'm afraid the HLO message about memory locality and arrays of structures is a distraction. You can work around this by inserting !DIR$ SIMD before the array assignment (or, if you prefer OpenMP 4.0 standard syntax, !$OMP SIMD, but then you must compile with -openmp or -openmp-simd). If the allocatable component were one-dimensional, you could also use !DIR$ IVDEP; however, for a multidimensional array assignment, the compiler does not know to which loop (i.e., which array index) to apply the directive. I'll report back when a compiler containing a fix for this issue is available, but I don't know when that might be.
I believe the issue in mult_3 (line 21) is similar: the compiler cannot satisfy itself that vectorization of class my_type is safe, and it doesn't seem to realize that it could switch to the rules for types. This can be worked around with compiler directives, as before (see the sketch below). I believe a fix for this simpler case may already be targeted for a future compiler.
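For reference, here is a minimal sketch of the directive workaround applied to mult_1 (whether a given compiler version honours a simd directive placed on an array assignment is discussed further down the thread):
[fortran]subroutine mult_1(self,scalar)
  class(my_type) :: self
  real :: scalar
  ! Assert to the compiler that vectorizing this array assignment is safe.
  ! The OpenMP 4.0 form would be "!$omp simd", which additionally requires
  ! compiling with -openmp or -openmp-simd.
  !dir$ simd
  self%arr = self%arr * scalar
end subroutine[/fortran]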
Incidentally, you can get a better idea of what is going on by increasing the diagnostic level of the report, but restricting it to the "high performance optimizer" (vectorization and parallelization):
ifort -c -opt-report=2 -opt-report-phase=hpo
or use -opt-report=3 to get explicit dependency messages (which don't help much in this case):
HPO VECTORIZER REPORT (my_module_mp_mult_1_) LOG OPENED ON Mon Jan 27 16:50:10 2014
<test_type_vec.f90;-1:-1;hpo_vectorization;my_module_mp_mult_1_;0>
HPO Vectorizer Report (my_module_mp_mult_1_)
test_type_vec.f90(9:3-9:3):VEC:my_module_mp_mult_1_: loop was not vectorized: existence of vector dependence
potential FLOW dependence
potential ANTI dependence
loop was not vectorized: not inner loop
loop was not vectorized: not inner loop
Another way to get just the vectorization report is with -vec-report2 (or -vec-report3 or -vec-report6) instead of -opt-report.
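For example, assuming the source file is named test_opt.f90 as in the report above:
ifort -c -vec-report2 test_opt.f90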
Thanks Martyn. I've been playing around with this some more, but the forum software won't let me reply on this thread with more than a few lines, so I've started another thread with another code sample.
Please see http://software.intel.com/en-us/forums/topic/500890
Alexis,
I'm sorry, but we can't see your new post; I don't know what happened there. Could you please try again to reply to this thread (or open a new one if it's a new topic or issue), using plain text if inserting/attaching code isn't working?
My other post finally appeared on the forum this morning, maybe 10 hours after I'd initially submitted it. But it appears to have been deleted now. I reproduce it here.
I am investigating vectorization strategies for my project, where most of the computation is over three-dimensional allocatable arrays within type-bound procedures. Typically I allocate a real array and then set up a complex pointer to it (see example below), so that I can do in-place Fourier transforms. For now, I'm testing on a Xeon E5-2680 using composer_xe_2013_sp1.0.080 with the minimal "reproducer" below, but I am also testing on a Xeon Phi board, hence my focus on vectorization right now.
The results I get (copied below) with -O2 suggest some terrible code generation from the do concurrent and inefficient vectorization of the array assignment with the simd directive. Also, the explicit nested do loops are much better optimized than the other constructs, which I expected to be equivalent. What's going on?
Finally, the fastest way to do this work with -O2 was to loop over the complex array, presumably because there are fewer iterations when looping over the complex array (Edit: I had an error in my original code listing, now corrected; the complex array has 2x fewer elements, not 4x, and the timings have been updated accordingly). What is the compiler doing there that it can't do when looping over the real array? Can I help or force it to do so? That would be a great speedup if I could generalize it to other loops. I have tried to look at the assembly using Amplifier, but I am not experienced enough to make sense of it.
When compiled with -O2 -assume realloc_lhs -heap-arrays -mkl, here are the timings:
mult1 : did 1000 trips in .72 cpu seconds
mult1_simd : did 1000 trips in 1.30 cpu seconds
mult1_do : did 1000 trips in .52 cpu seconds
mult1_do_ptr : did 1000 trips in .52 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .32 cpu seconds
mult1_do_concurrent: did 1000 trips in 6.74 cpu seconds
mult1_forall : did 1000 trips in .72 cpu seconds
When compiled with -xHost -no-prec-div -O3 -assume realloc_lhs -heap-arrays -mkl, annoyingly the code seems significantly slower in most cases. What's going on there? Perhaps other optimization flags would work better for this compiler?
mult1 : did 1000 trips in 1.01 cpu seconds
mult1_simd : did 1000 trips in 1.55 cpu seconds
mult1_do : did 1000 trips in 1.00 cpu seconds
mult1_do_ptr : did 1000 trips in .52 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .32 cpu seconds
mult1_do_concurrent: did 1000 trips in 6.78 cpu seconds
mult1_forall : did 1000 trips in 1.36 cpu seconds
Here is the code:
[fortran]module my_module
  use iso_c_binding
  type :: my_type
    real, allocatable :: arr(:,:,:)
    !dir$ attributes align:64 :: arr
    real, pointer, contiguous :: arr_ptr(:,:,:)
    complex, pointer, contiguous :: arr_ptr_comp(:,:,:)
  end type
contains
  subroutine my_allocation(self)
    type(my_type), target :: self
    type(c_ptr) :: cptr
    allocate(self%arr(1024,1024,1))
    self%arr_ptr => self%arr
    cptr = c_loc(self%arr)
    call c_f_pointer(cptr,self%arr_ptr_comp,[512,1024,1])
  end subroutine
  subroutine mult_1(self,scalar)
    class(my_type) :: self
    real :: scalar
    self%arr = self%arr * scalar
  end subroutine
  subroutine mult_1_simd(self,scalar)
    class(my_type) :: self
    real :: scalar
    !dir$ simd
    self%arr = self%arr * scalar
  end subroutine
  subroutine mult_1_do(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do k=1,size(self%arr,3)
      do j=1,size(self%arr,2)
        do i=1,size(self%arr,1)
          self%arr(i,j,k) = self%arr(i,j,k) * scalar
        enddo
      enddo
    enddo
  end subroutine
  subroutine mult_1_do_ptr(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do k=1,size(self%arr,3)
      do j=1,size(self%arr,2)
        do i=1,size(self%arr,1)
          self%arr_ptr(i,j,k) = self%arr_ptr(i,j,k) * scalar
        enddo
      enddo
    enddo
  end subroutine
  subroutine mult_1_do_ptr_comp(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do k=1,size(self%arr_ptr_comp,3)
      do j=1,size(self%arr_ptr_comp,2)
        do i=1,size(self%arr_ptr_comp,1)
          self%arr_ptr_comp(i,j,k) = self%arr_ptr_comp(i,j,k) * scalar
        enddo
      enddo
    enddo
  end subroutine
  subroutine mult_1_do_concurrent(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    do concurrent(i=1:size(self%arr,1),j=1:size(self%arr,2),k=1:size(self%arr,3))
      self%arr(i,j,k) = self%arr(i,j,k) * scalar
    enddo
  end subroutine
  subroutine mult_1_forall(self,scalar)
    class(my_type) :: self
    real :: scalar
    integer :: i,j,k
    forall(i=1:size(self%arr,1),j=1:size(self%arr,2),k=1:size(self%arr,3))
      self%arr(i,j,k) = self%arr(i,j,k) * scalar
    end forall
  end subroutine
end module

program vectest
  use my_module
  implicit none
  type(my_type) :: my_obj
  integer, parameter :: number_of_trips = 1000
  real :: time_begin,time_end
  integer :: current_trip

  call my_allocation(my_obj)

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1              : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_simd(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_simd         : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do           : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr_ptr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do_ptr(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do_ptr       : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr_ptr_comp = (1.0e0,1.0e0)
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do_ptr_comp(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do_ptr_comp  : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_do_concurrent(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_do_concurrent: did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'

  my_obj%arr = 1.0e0
  call cpu_time(time_begin)
  do current_trip=1,number_of_trips
    call mult_1_forall(my_obj,1.01e0)
  enddo
  call cpu_time(time_end)
  write(*,'(a,i0,a,f0.2,a)') 'mult1_forall       : did ', number_of_trips, ' trips in ', time_end-time_begin, ' cpu seconds'
end program vectest[/fortran]
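For completeness, the build and run steps corresponding to the first set of timings would look roughly like this (the source file name here is an assumption):
ifort -O2 -assume realloc_lhs -heap-arrays -mkl vectest.f90 -o vectest
./vectest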
Sorry - I think I deleted the post by accident thinking it was a duplicate.
These timings were taken after commenting out the simd directive (what did your compiler report doing with it?).
On my Windows laptop (the only one I have with AVX) with ifort 14.0.1 :
ifort -assume:realloc_lhs -heap-arrays ar.f90
$ ./ar
mult1 : did 1000 trips in .98 cpu seconds
mult1_simd : did 1000 trips in .97 cpu seconds
mult1_do : did 1000 trips in .72 cpu seconds
mult1_do_ptr : did 1000 trips in .73 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .45 cpu seconds
mult1_do_concurrent: did 1000 trips in 18.52 cpu seconds
mult1_forall : did 1000 trips in 3.00 cpu seconds
ifort -assume:realloc_lhs -heap-arrays -QxHost ar.f90
$ ./ar
mult1 : did 1000 trips in .39 cpu seconds
mult1_simd : did 1000 trips in .41 cpu seconds
mult1_do : did 1000 trips in .69 cpu seconds
mult1_do_ptr : did 1000 trips in .72 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .44 cpu seconds
mult1_do_concurrent: did 1000 trips in 20.22 cpu seconds
mult1_forall : did 1000 trips in 2.84 cpu seconds
The forall is faster on the stack. For a rank-1 forall, ifort will skip the temporary allocation when !dir$ ivdep is set (a sketch follows).
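As a rough illustration of that rank-1 case (made up for this post, not taken from the code above):
[fortran]! ivdep asserts there is no loop-carried dependence, which, per the remark
! above, lets ifort skip the temporary it would otherwise create for the forall.
!dir$ ivdep
forall (i = 1:n)
  a(i) = a(i) * scalar
end forall[/fortran]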
I didn't see you mention that the compiler reports "subscript too complex" and declines to optimize the do concurrent until the loop order is changed to outer, middle, inner, at which point it reaches the .72-second time. I don't know whether that is an accepted way to optimize do concurrent; gfortran will optimize only a single-level loop there.
gfortran is faster than ifort on mult1_do and mult1_do_ptr, and sse2 vs. avx makes little difference.
According to
$ ifort -assume:realloc_lhs -Qvec-report7 -QxHost /Qunroll2 ar.f90 /link /stack:800000000 2>&1 | ../vecanalysis/vecanalysis.py --annotate
the in-lined versions of mult1_do, mult1_do_ptr, and the optimized do_concurrent use 50% more "vector operations" than the faster ones, which evidently is reflected in the run time. The in-lining in this benchmark appears possibly counter-productive (perhaps accounting for the losses in comparison to gfortran?).
Thanks for your comments, Tim.
To answer your points:
- my compiler said SIMD LOOP WAS VECTORIZED at line 27. Is this not what you expected? Why did you remove it? I was following Martyn's suggestion above, but am I missing something?
- "forall is faster on stack". Do you mean without -heap-arrays? I believe I have to have -heap-arrays in there for unrelated reasons, so I shan't explore further. But do you know why the heap v stack may make a difference here?
- thanks for the hint regarding inlining. Disabling inlining makes the results look a bit saner:
mult1 : did 1000 trips in .70 cpu seconds
mult1_simd : did 1000 trips in .70 cpu seconds
mult1_do : did 1000 trips in .25 cpu seconds
mult1_do_ptr : did 1000 trips in .25 cpu seconds
mult1_do_ptr_comp : did 1000 trips in .24 cpu seconds
mult1_do_concurrent: did 1000 trips in 6.63 cpu seconds
mult1_forall : did 1000 trips in .65 cpu seconds
Again, do you have a feeling for why preventing inlining may help? I'd love to understand this better so I might use lessons learnt here in "real world" applications.
I guess it seems to me that the compiler should be doing a much better job of optimizing mult1 and mult1_forall. And the treatment of the do concurrent construct seems buggy to me. I much prefer the gfortran behaviour you report.
I'm surprised that your version of the compiler accepted a simd directive on an array assignment, but it isn't needed, unless it can be used to suppress a temporary allocation. It's difficult to guess where the compiler writers have set the dividing line between directives which can apply only to DO loops or similar and those which apply also to array assignments.
The developers of the OpenMP 4 standard must have decided not to tackle the ambiguity of simd clauses on array assignments (since the clause applies specifically to the DO loop immediately following the directive). Apparently, there was debate on whether omp workshare should be retained and to what extent it should be useful; in practice it's not at all useful with either ifort or gfortran (an illustration of the construct follows).
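For context, the workshare construct in question wraps array assignments like this (a generic illustration, not taken from the code in this thread):
[fortran]! OpenMP workshare divides the array assignment among the threads of the team;
! as noted above, neither ifort nor gfortran gains anything useful from it here.
!$omp parallel workshare
self%arr = self%arr * scalar
!$omp end parallel workshare[/fortran]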
Your example points out how the compiler makes a distinction between multi-rank assignments and DO loops where you have specified the optimum order of operations. I already remarked on how it might be surprising that ifort can begin to optimize DO CONCURRENT only with a specific order of index specifications (and that other compilers can't optimize it even with the order changed). I use a lot of plain DO loops with the inner loop written as a rank-1 array assignment (sketched below), but that is an unpopular style.
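A sketch of that style, applied to the arrays from the example above:
[fortran]do k = 1, size(self%arr,3)
  do j = 1, size(self%arr,2)
    ! innermost dimension expressed as a rank-1 array-section assignment
    self%arr(:,j,k) = self%arr(:,j,k) * scalar
  enddo
enddo[/fortran]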
MATMUL is the most obvious case where a compiler should be expected to optimize reliably what is in effect a multiple-rank assignment. Each compiler has its own peculiar MATMUL options.
The greater speed of FORALL with stack allocation presumably is due to the greater overhead of heap memory allocation (library function calls, and so on). So I suppose the lesson is to be doubly wary of FORALL requiring allocation of a temporary array when using -heap-arrays. I avoid FORALL usage where it requires over-writing an operand, which will cause the compiler to generate a temporary copy (see the example below). FORALL cases where DO CONCURRENT doesn't work are best avoided.
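For instance, a FORALL such as this one (a made-up rank-1 case, not from the code above) over-writes its own operand, so the right-hand sides must be evaluated into a temporary before any assignment takes place:
[fortran]forall (i = 1:n-1)
  ! a(i+1) is read here but is also assigned by the i+1 instance, so FORALL
  ! semantics force the right-hand sides to be captured in a temporary first.
  a(i) = a(i) + a(i+1)
end forall[/fortran]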
By popular demand, commercial compilers are expected to attempt to optimize away redundant extra looping discovered through inter-procedural analysis, but it isn't done reliably. It's a little surprising that in your case this makes things inconsistently worse. Popular demand has also led to the elimination of any documented compiler reports on attempts (successful or not) at dead-code removal. Vectorization reports in the latest gfortran are so verbose and arcane as to be nearly useless for anyone but compiler developers, and the same is probably true of the undocumented il0 reports in ifort.
Alexis,
All three of your loops are vectorized by version 16 of the Intel compiler, which was released a few days ago. (Loop 3 was already vectorized by the version 15 compiler). I believe the handling of DO CONCURRENT has also improved significantly since version 14.0. The version 16 compiler is available from registrationcenter.intel.com with current support. Thanks for reporting these vectorization issues.
There are some known performance issues associated with using -heap-arrays with OpenMP applications, which are being worked on. But that's another subject...
