I have been looking at how old-style DO loops respond to optimisation, and I have found one case which performs contrary to what is expected.
I have considered 6 options for calculating a Dot_Product, using ifort ver 11.1:
1) Including a conventional DO loop in a block of code.
[fortran]
c = 0
do k = JEQ_bot, J-1
   c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
end do
A(A_ptr_0+J) = A(A_ptr_0+J) - c
[/fortran]
2) Converting the DO loop to Dot_Product, using array sections (a sketch of options 2 and 3 appears below the list).
3) Wrapping the DO loop in an F77-style function.
4) Wrapping the Dot_Product in an F77-style function, to avoid array sections.
5) Modifying the DO loop to hold the subscripts in temporary variables.
[fortran]
c = 0
do k = JEQ_bot, J-1
   ia = A_ptr_0 + k
   ib = B_ptr_0 + k
   c = c + A(ia) * B(ib)
end do
A(A_ptr_0+J) = A(A_ptr_0+J) - c
[/fortran]
6) Modifying the DO loop to auto-increment temporary subscripts.
[fortran]
c = 0
do k = JEQ_bot, J-1
   c = c + A(A_ptr_b) * B(B_ptr_b)
   A_ptr_b = A_ptr_b + 1
   B_ptr_b = B_ptr_b + 1
end do
A(A_ptr_b) = A(A_ptr_b) - c
[/fortran]
A and B are real*8 vectors and all subscripts are integer*4.
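For reference, the sketch below shows roughly what options 2 and 3 look like for the same bounds; the function name vec_dot and its argument list are my own illustration, not the routine actually used in the original code.

[fortran]
! Option 2: Dot_Product on array sections, replacing the DO loop
c = DOT_PRODUCT( A(A_ptr_0+JEQ_bot : A_ptr_0+J-1), &
                 B(B_ptr_0+JEQ_bot : B_ptr_0+J-1) )
A(A_ptr_0+J) = A(A_ptr_0+J) - c

! Option 3: the same DO loop wrapped in an F77-style function, called as
!    c = vec_dot( A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), J-JEQ_bot )
real*8 function vec_dot (a, b, n)
   integer*4 n, k
   real*8 a(n), b(n), c
   c = 0
   do k = 1, n
      c = c + a(k)*b(k)
   end do
   vec_dot = c
end
[/fortran]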
I have tested these on a Xeon processor with compiler options:
/o1, /o2 or /o3 (/o2 is default)
/Qvec or /Qvec- (/Qvec is default)
/QxHost
All coding options EXCEPT option 3 show an improvement from (/o1, or /o2 /Qvec-) to /o2, with /o3 similar, reducing from 13 seconds to 8 seconds, presumably benefiting from the vector instructions. /QxHost does not have any significant effect on the results.
However, for option 3, /o1 alone or (/o2 /Qvec- /QxHost) took 13 seconds, but if /QxHost and (/o2 or /o3) were combined, with /Qvec at its default, the run time blew out to 22 seconds. /o3 without /QxHost reduced slightly to 11.5 seconds. This is much different from all the other options.
I was expecting /QxHost to select the code best suited to the processor installed, but in case 3 it appears to fail.
Option 3 is a coding approach I have used in a lot of old F77-style codes, where the wrapper is one of a set of common calculations stored in a library of shared routines. They are written as simple routines that have been expected to benefit from optimisation at a local level. This approach suits a number of other compilers.
From this I conclude that for ifort:
- /QxHost should not be used, and
- I should review my use of libraries of common calculations.
Why would the combination of /QxHost with /o2 or /o3 cause such a contrary result for coding option 3?
John
3 Replies
What make and exact model CPU are you using? I would not expect /QxHost to make things worse.
Your option 3 means that the compiler has no idea how long the loops are nor whether the arguments are aligned. Otherwise it looks ok. But I would suggest you look at the BLAS dot-product routines in MKL if performance is important.
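To make that concrete, a call to DDOT (the double-precision BLAS level-1 dot product, available in MKL) for the loop in question might look roughly like the sketch below; the bounds follow the earlier snippets and the temporary n is my own.

[fortran]
! Sketch: the inner loop replaced by the BLAS routine DDOT.
! Arguments: vector length, first vector, its stride, second vector, its stride.
real*8 ddot, c
external ddot
integer*4 n

n = J - JEQ_bot                 ! number of terms in the sum
c = ddot( n, A(A_ptr_0+JEQ_bot), 1, B(B_ptr_0+JEQ_bot), 1 )
A(A_ptr_0+J) = A(A_ptr_0+J) - c
[/fortran]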
I have now updated ifort from Ver 11.1 to Ver 12.1.5.344, and have also upgraded the Xeon processor and moved to Win 7 x64.
The problem I found with /QxHost has now gone, so I am relieved that my old approach of using libraries of simple procedures can still be used. (The problem still occurred with Ver 11.1 on the upgraded PC.)
I now need to go back and see what other problems I was having.
The aim of this review (and of the selection of ifort) has been to identify how to parallelise a skyline direct solver for large sets of linear equations.
I have found vectorising easy to use, but I struggled with the advice I received last year on parallelising my code. While I am very experienced in coding to F77 and F95, a consequence of this experience is that my age makes it more difficult to learn new techniques!
If you could recommend sections of the ifort documentation that I should read first to better understand how to approach parallelising, it would be appreciated.
John
I suggest that you do a build with Guided Auto Parallelization/Vectorization (GAP) enabled to see what the compiler has to say about what you might do differently. This is a build using /Qguide; it does not create an executable, but it can output diagnostics with recommendations.
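For what it's worth, a GAP pass is just a normal compile with /Qguide added; the file name below is only a placeholder, and /Qparallel is optional if you also want auto-parallelization guidance.

[bash]
rem Diagnostic pass only: no executable is produced; the compiler reports
rem source or option changes that would let it vectorize/parallelize.
ifort /O2 /Qparallel /Qguide skyline_solver.f90
[/bash]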
