
vahid_a_1 · ‎08-04-2016

Hi..

Would you please help me understand what is wrong with the code structure below? I spent a couple of months converting my straight-line code into modules and subroutines, in a form similar to the one below. The code now runs about four times slower. By the way, in the new structure I didn't use "interface", and the code works fine. The big issue right now is the speed.

The new object oriented code structure

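module arrayMod
  implicit none
  ! A deferred-shape array at module level must be allocatable (or a pointer)
  real, dimension(:,:), allocatable :: theArray
end module arrayMod

program test
   use arrayMod
   implicit none

   call arraySub
   write(*,*) theArray

end program test

subroutine arraySub
   use arrayMod
   implicit none

   write(*,*) 'Inside arraySub()'
   ! ... perform operations on theArray (it must be allocated first) ...

end subroutine arraySub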

The old straightforward code structure

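program test
   implicit none
   ! A deferred-shape array must be allocatable (or a pointer)
   real, dimension(:,:), allocatable :: theArray

   ! ... perform the operations on theArray (it must be allocated first) ...
   write(*,*) theArray

end program test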

Steven_L_Intel1 · ‎08-04-2016

This isn't your real program. Please show the actual code; before and after would help. You have left out the most important part: what is actually taking the time! There is also no object-oriented code here.

I would also recommend that "arraySub" be a procedure in the module rather than external.
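
As a rough illustration of that recommendation (a sketch only; the array size and the operation are placeholders, not taken from the original program), arraySub can be placed after a contains statement inside arrayMod. Any unit that uses the module then gets an explicit interface for the call, which lets the compiler check the call and gives it more scope for inlining:

```fortran
module arrayMod
   implicit none
   ! Allocatable so that the shape can be chosen at run time.
   real, dimension(:,:), allocatable :: theArray
contains
   ! Module procedure: "use arrayMod" provides its explicit interface.
   subroutine arraySub()
      integer :: i, j
      ! Placeholder operation. The column index j is the outer loop so the
      ! inner loop walks contiguous memory (Fortran arrays are column-major).
      do j = 1, size(theArray, 2)
         do i = 1, size(theArray, 1)
            theArray(i, j) = real(i + j)
         end do
      end do
   end subroutine arraySub
end module arrayMod

program test
   use arrayMod
   implicit none

   allocate(theArray(3, 2))   ! hypothetical size, for illustration only
   call arraySub()
   write(*,*) theArray
end program test
```

Whether this changes the run time of the real solver is not something the sketch can show; it only demonstrates the structure being recommended.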

TimP · ‎08-04-2016

This kind of question usually requires more digging, at least to the extent of comparing the results of -qopt-report=4. A factor of 4 would seem to imply missed vectorization or some such thing which should show up, maybe with reasons, in that optional output.

Vectorization Advisor might facilitate identification of such performance regressions.
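
One way to act on this, sketched with a hypothetical file name (kernel.f90) and a made-up loop rather than anything from the real solver: isolate a suspect loop in a small file, compile the old and new versions with the report enabled, and compare the loop entries in the two .optrpt files. The ifort command line in the comments is an assumed typical invocation; adjust it to the options actually used for the real build.

```fortran
! kernel.f90 (hypothetical file name)
!
! Assumed invocation; the report is written to kernel.optrpt:
!   ifort -O3 -qopt-report=4 kernel.f90
!
! Build the "before" and "after" versions this way and compare the
! vectorization remarks reported for the same loop.
module kernelMod
   implicit none
contains
   ! Illustrative kernel: y = y + a*x. Assumes size(y) >= size(x).
   subroutine scaleAdd(a, x, y)
      real, intent(in)    :: a
      real, intent(in)    :: x(:)
      real, intent(inout) :: y(:)
      integer :: i
      do i = 1, size(x)
         y(i) = y(i) + a * x(i)
      end do
   end subroutine scaleAdd
end module kernelMod

program reportDemo
   use kernelMod
   implicit none
   real :: x(1000), y(1000)

   x = 1.0
   y = 2.0
   call scaleAdd(0.5, x, y)
   write(*,*) y(1), y(1000)
end program reportDemo
```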

vahid_a_1 · ‎08-04-2016

Steve Lionel (Intel) wrote:

This isn't your real program. Please show the actual code; before and after would help. You have left out the most important part: what is actually taking the time! There is also no object-oriented code here.

I would also recommend that "arraySub" be a procedure in the module rather than external.

Thanks Steve for such a quick reply.

The real program is more than 3000 lines long. I can email it to you without any hesitation. The real code is an explicit finite-difference solver for two partial differential equations, with some preconditioning steps.

By the way, I taught myself Fortran, and I have reached a point where the usual internet resources are no longer improving my coding skills.

vahid_a_1 · ‎08-04-2016

Tim P. wrote:

This kind of question usually requires more digging, at least to the extent of comparing the results of -qopt-report=4. A factor of 4 would seem to imply missed vectorization or some such thing which should show up, maybe with reasons, in that optional output.

Vectorization Advisor might facilitate identification of such performance regressions.

Dear Tim,

I compiled the code with the -qopt-report=4 option and have attached the output here. Is there a manual that explains how to read the contents of this report file?

Thanks.

TimP · ‎08-05-2016

We need to compare the report entries, before and after, for the sections of code that are losing time. A convenient way to do this is by running under Intel Parallel Advisor. The beta Advisor fully supports compilers back to XE2016 and is sufficiently useful with 2015.

In the recent compilers, the optrpt will quote compile options, so, for example, we can see the target instruction set and -align: settings.

The report shows some "not vectorized" loops associated with decisions not to interchange loops. It seems ambiguous whether "imperfect loop nest" bears on this. Imperfect means roughly that there are operations inside the outer loop but outside the inner one. So you might look at any places where you changed the code in such a way, and check the suggestions about enabling outer loop vectorization by !$omp simd.
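
To make the terminology concrete, here is a small sketch; the arrays, sizes, and arithmetic are invented for illustration and do not come from the attached report. The first nest is imperfect because a statement sits between the two do lines. The second version hoists that statement into its own loop, leaving a perfect nest, and shows where an !$omp simd directive on the outer loop would go. The directive is treated as a plain comment unless the compiler's OpenMP SIMD support is active (ifort has -qopenmp-simd for this).

```fortran
program nestDemo
   implicit none
   integer, parameter :: n = 64          ! hypothetical size
   real :: a1(n, n), a2(n, n), b(n, n), colScale(n)
   integer :: i, j

   call random_number(b)

   ! Imperfect nest: the assignment to colScale(j) is inside the outer
   ! loop but outside the inner one.
   do j = 1, n
      colScale(j) = 1.0 / real(j)
      do i = 1, n
         a1(i, j) = b(i, j) * colScale(j)
      end do
   end do

   ! Hoist the per-column work into a separate loop ...
   do j = 1, n
      colScale(j) = 1.0 / real(j)
   end do

   ! ... so that the remaining nest is perfect. The directive asks for
   ! vectorization of the outer loop; i is made private explicitly.
   !$omp simd private(i)
   do j = 1, n
      do i = 1, n
         a2(i, j) = b(i, j) * colScale(j)
      end do
   end do

   ! Both versions perform the same arithmetic, so this is a sanity check.
   write(*,*) 'max difference:', maxval(abs(a1 - a2))
end program nestDemo
```

Whether outer-loop vectorization actually pays off here is for the optimization report to confirm; in a nest like this one the inner loop is already contiguous, so the compiler may prefer to vectorize it instead.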

You surely got the compiler tied up in knots at source line 1199. Also note the comments that -O3 would be better than some unspecified more aggressive optimization setting the compiler appears to have picked up. If you have exceeded some compiler internal threshold and caused it to stop optimizing, that could well account for your slowdown. Compiling thousands of source lines in a single compilation unit can easily provoke such problems (and form a test case to see whether a more up to date compiler handles it better).

My FORTRAN object oriented code is very slow. Any suggestions?