I have been looking at how old-style DO loops respond to optimisation, and I have found one case which performs contrary to what is expected.
I have considered 6 options for calculating a Dot_Product, using ifort ver 11.1:
1) Including a conventional DO loop in a block of code.
[fortran]
c = 0
do k = JEQ_bot, J-1
   c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
end do
A(A_ptr_0+J) = A(A_ptr_0+J) - c
[/fortran]
2) Converting the DO loop to Dot_Product, using array sections (a sketch of options 2 and 3 appears below the list).
3) Wrapping the DO loop in an F77-style function.
4) Wrapping the Dot_Product in an F77-style function, to avoid array sections.
5) Modifying the DO loop to hold the subscripts in temporary variables.
[fortran]
c = 0
do k = JEQ_bot, J-1
   ia = A_ptr_0 + k
   ib = B_ptr_0 + k
   c = c + A(ia) * B(ib)
end do
A(A_ptr_0+J) = A(A_ptr_0+J) - c
[/fortran]
6) Modifying the DO loop to auto-increment temporary subscripts.
[fortran]
c = 0
do k = JEQ_bot, J-1
   c = c + A(A_ptr_b) * B(B_ptr_b)
   A_ptr_b = A_ptr_b + 1
   B_ptr_b = B_ptr_b + 1
end do
A(A_ptr_b) = A(A_ptr_b) - c
[/fortran]
A and B are real*8 vectors and all subscripts are integer*4.
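For reference, the sketch below shows roughly what options 2 and 3 look like for the same bounds; the function name vec_dot and its argument list are my own illustration, not the routine actually used in the original code.

[fortran]
! Option 2: Dot_Product on array sections, replacing the DO loop
c = DOT_PRODUCT( A(A_ptr_0+JEQ_bot : A_ptr_0+J-1), &
                 B(B_ptr_0+JEQ_bot : B_ptr_0+J-1) )
A(A_ptr_0+J) = A(A_ptr_0+J) - c

! Option 3: the same DO loop wrapped in an F77-style function, called as
!    c = vec_dot( A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), J-JEQ_bot )
real*8 function vec_dot (a, b, n)
   integer*4 n, k
   real*8 a(n), b(n), c
   c = 0
   do k = 1, n
      c = c + a(k)*b(k)
   end do
   vec_dot = c
end
[/fortran]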
I have tested these on a Xeon processor with compiler options:
/o1, /o2 or /o3 (/o2 is default)
/Qvec or /Qvec- (/Qvec is default)
/QxHost
All coding options EXCEPT option 3 show an improvement from (/o1, or /o2 /Qvec-) to /o2, with /o3 similar, reducing from 13 seconds to 8 seconds, presumably benefiting from the vector instructions. /QxHost does not have any significant effect on the results.
However, for option 3, /o1 alone or (/o2 /Qvec- /QxHost) took 13 seconds, but if /QxHost and (/o2 or /o3) were combined, with /Qvec at its default, the run time blew out to 22 seconds. /o3 without /QxHost reduced slightly to 11.5 seconds. This is much different from all the other options.
I was expecting /QxHost to select the code best suited to the processor installed, but in case 3 it appears to fail.
Option 3 is a coding approach I have used in a lot of old F77-style codes, where the wrapper is one of a set of common calculations stored in a library of shared routines. They are written as simple routines that have been expected to benefit from optimisation at a local level. This approach suits a number of other compilers.
From this I conclude that for ifort:
- /QxHost should not be used, and
- I should review my use of libraries of common calculations.
Why would the combination of /QxHost with /o2 or /o3 cause such a contrary result for coding option 3?
John
3 Replies
What make and exact model CPU are you using? I would not expect /QxHost to make things worse.
Your option 3 means that the compiler has no idea how long the loops are nor whether the arguments are aligned. Otherwise it looks ok. But I would suggest you look at the BLAS dot-product routines in MKL if performance is important.
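To make that concrete, a call to DDOT (the double-precision BLAS level-1 dot product, available in MKL) for the loop in question might look roughly like the sketch below; the bounds follow the earlier snippets and the temporary n is my own.

[fortran]
! Sketch: the inner loop replaced by the BLAS routine DDOT.
! Arguments: vector length, first vector, its stride, second vector, its stride.
real*8 ddot, c
external ddot
integer*4 n

n = J - JEQ_bot                 ! number of terms in the sum
c = ddot( n, A(A_ptr_0+JEQ_bot), 1, B(B_ptr_0+JEQ_bot), 1 )
A(A_ptr_0+J) = A(A_ptr_0+J) - c
[/fortran]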
I have now updated ifort from Ver 11.1 to Ver 12.1.5.344, and have also upgraded the Xeon processor and moved to Win 7 x64.
The problem I found with /QxHost has now gone, so I am relieved that my old approach of using libraries of simple procedures can still be used. (The problem still occurred with Ver 11.1 on the upgraded PC.)
I now need to go back and see what other problems I was having.
The aim of this review (and of the selection of ifort) has been to identify how to parallelise a skyline direct solver for large sets of linear equations.
I have found vectorising easy to use, but I struggled with the advice I received last year on parallelising my code. While I am very experienced in coding to F77 and F95, a consequence of this experience is that my age makes it more difficult to learn new techniques!
If you could recommend sections of the ifort documentation that I should read first to better understand how to approach parallelising, it would be appreciated.
John
I suggest that you do a build with Guided Auto Parallelization/Vectorization (GAP) enabled to see what the compiler has to say about what you might do differently. This is a build using /Qguide; it does not create an executable, but it can output diagnostics with recommendations.
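For what it's worth, a GAP pass is just a normal compile with /Qguide added; the file name below is only a placeholder, and /Qparallel is optional if you also want auto-parallelization guidance.

[bash]
rem Diagnostic pass only: no executable is produced; the compiler reports
rem source or option changes that would let it vectorize/parallelize.
ifort /O2 /Qparallel /Qguide skyline_solver.f90
[/bash]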
