Hi,
I compile the following small code with "-O3 -shared-intel" on three different clusters:
- cluster1: Intel(R) Xeon(R) CPU X5675 with ifort 12.1.0
- cluster2: Intel(R) Xeon(R) CPU X5650 with ifort 12.1.0
- cluster3: Intel(R) Xeon(R) CPU E5-2650 v2 with ifort 15.0.0
      program main
c
      implicit none
      integer jma, kma, ntstepmax
      integer na
      integer nfx,nfy,nfz
      real lnx,lny,lnz
      parameter (jma = 139, kma = 16)
      parameter (ntstepmax = 100)
      parameter (nfx = 1180, nfy = 8, nfz = 14)
      parameter (lnx = 590, lny = 4, lnz = 7)
      parameter (na = 1)
c
      integer ntstep
      integer i,j,k,i2,j2,k2,l
      real a(na,-nfx:nfx,-nfy:nfy,-nfz:nfz)
      real xu(-nfx:nfx,-nfy+1:jma+nfy,-nfz+1:kma+nfz)
      real yu(1:jma+2,1:kma+2)
c
      do ntstep = 1,ntstepmax
c
         write (*,*) ' ........ ntstep = ',ntstep
c
         l = 1
c
         do k = 1,kma
            do j = 1,jma
               yu(j+1,k+1) = 0.0
c
               do k2 = -nfz,nfz
                  do j2 = -nfy,nfy
                     do i2 = -nfx,nfx
c$$$               do i2 = -nfx,nfx
c$$$                  do j2 = -nfy,nfy
c$$$                     do k2 = -nfz,nfz
                        yu(j+1,k+1) = yu(j+1,k+1) +
     &                       xu(i2,j+j2,k+k2)*a(l,i2,j2,k2)
                     enddo
                  enddo
               enddo
c
            enddo
         enddo
c
      enddo
c
      end
The results are quite strange:
- cluster1: 0m29s
- cluster2: 0m37s
- cluster3: 2m32s
Between cluster1 and cluster2 the time difference is small and could be explained by the difference in CPU frequency between the two clusters.
But why is cluster3 so slow? Its hardware is quite new (from last year) compared with cluster1 and cluster2 (hardware and software from 2011).
Is it a problem with the code above, or an optimization problem?
Any help or suggestions would be appreciated.
Best regards,
Guillaume De Nayer
As an educated guess: in a(l,i2,j2,k2), with l=1 and na=1, the array is declared real a(na,-nfx:nfx,-nfy:nfy,-nfz:nfz), so the first dimension has extent 1. The compiler does not recognize the access as stride-1 when traversing the i2 index, and as a result your loop runs in scalar (or pseudo-gather) mode.
To confirm this, use:
real a(-nfx:nfx,-nfy:nfy,-nfz:nfz, na)
and
a(i2,j2,k2,l)
While the workaround may restore performance, your inclination may be to say "fix the compiler". The compiler should indeed be fixed, but the fact that you declared real a(na,-nfx:nfx,-nfy:nfy,-nfz:nfz) suggests to me that at some point (or even now) your eventual application will have na > 1 and will perform a similar computation loop. At that point you would experience a similar slowdown. Moving the na dimension to the other end ensures your inner loop runs with stride 1 and can therefore be optimized favorably.
FWIW, take your program above, set na to a number representative of what you intend it to be, and re-run your tests on all three systems.
(Then move the na index position to last and run again.)
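A minimal free-form sketch of the rearranged layout (with hypothetical small extents, just to illustrate that moving na to the last dimension makes i2 the fastest-varying, stride-1 index; these are not the values from the benchmark above):

```fortran
program stride1
  implicit none
  ! Hypothetical small extents for illustration only
  integer, parameter :: nfx = 4, nfy = 2, nfz = 2, na = 1
  real :: a(-nfx:nfx, -nfy:nfy, -nfz:nfz, na)   ! na moved to the last dimension
  real :: xu(-nfx:nfx, -nfy:nfy, -nfz:nfz)
  real :: s
  integer :: i2, j2, k2, l

  a  = 1.0
  xu = 2.0
  l  = 1
  s  = 0.0
  do k2 = -nfz, nfz
     do j2 = -nfy, nfy
        ! i2 is now the leftmost index of a, so consecutive iterations
        ! of this inner loop touch consecutive memory locations in both
        ! xu and a (Fortran is column-major), which the compiler can
        ! vectorize with unit-stride loads
        do i2 = -nfx, nfx
           s = s + xu(i2,j2,k2)*a(i2,j2,k2,l)
        end do
     end do
  end do
  write (*,'(F0.1)') s   ! 9*5*5 terms of 2.0 each = 450.0
end program stride1
```

With a(na,...) the same inner loop would step through memory in jumps of na elements for a, which is exactly the non-unit-stride pattern described above.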
Jim Dempsey
Thanks a lot for your answer.
Indeed, you're right: real a(-nfx:nfx,-nfy:nfy,-nfz:nfz,na) fixes the problem.
Thanks for the tip. I will check the whole code.
Best regards
I think this program is not useful as a 'benchmark' because it references an uninitialized array (xu).
Depending on the hardware, the FPU exception flags in effect and the compiler options used, parts of the uninitialized array may contain bit patterns that, when interpreted as IEEE floating-point values, trigger exceptions that are handled in the background. Handling these exceptions billions of times can cause more of the program's time to be spent in exception-handling code than in your intended 'calculations'.
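A minimal sketch of the kind of initialization meant here (a hypothetical standalone example, not the benchmark itself): assigning defined values before any timed computation rules out this class of slowdown.

```fortran
program initcheck
  implicit none
  integer, parameter :: n = 8
  real :: xu(n)

  ! Give the array defined values before any timed computation;
  ! uninitialized memory may hold denormal or NaN bit patterns that
  ! make every arithmetic operation on them far more expensive.
  xu = 1.5

  write (*,'(F0.1)') sum(xu)   ! 8 * 1.5 = 12.0
end program initcheck
```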
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your answer, mcej4.
The 'benchmark' I gave in my first post is only a small part of our code. In the whole code all values are initialized, and the performance problem is still present.
I have tested the 'benchmark' above with initialized arrays, and it has no impact on the results.
Best regards