Hi,
I compile the following small code with "-O3 -shared-intel" on three different clusters:
- cluster1: Intel(R) Xeon(R) CPU X5675 with ifort 12.1.0
- cluster2: Intel(R) Xeon(R) CPU X5650 with ifort 12.1.0
- cluster3: Intel(R) Xeon(R) CPU E5-2650 v2 with ifort 15.0.0
      program main
c
      implicit none
      integer jma, kma, ntstepmax
      integer na
      integer nfx,nfy,nfz
      real lnx,lny,lnz
      parameter (jma = 139, kma = 16)
      parameter (ntstepmax = 100)
      parameter (nfx = 1180, nfy = 8, nfz = 14)
      parameter (lnx = 590, lny = 4, lnz = 7)
      parameter (na = 1)
c
      integer ntstep
      integer i,j,k,i2,j2,k2,l
      real a(na,-nfx:nfx,-nfy:nfy,-nfz:nfz)
      real xu(-nfx:nfx,-nfy+1:jma+nfy,-nfz+1:kma+nfz)
      real yu(1:jma+2,1:kma+2)
c
      do ntstep = 1,ntstepmax
c
         write (*,*) ' ........ ntstep = ',ntstep
c
         l = 1
c
         do k = 1,kma
            do j = 1,jma
               yu(j+1,k+1) = 0.0
c
               do k2 = -nfz,nfz
                  do j2 = -nfy,nfy
                     do i2 = -nfx,nfx
c$$$               do i2 = -nfx,nfx
c$$$                  do j2 = -nfy,nfy
c$$$                     do k2 = -nfz,nfz
                        yu(j+1,k+1) = yu(j+1,k+1) +
     &                       xu(i2,j+j2,k+k2)*a(l,i2,j2,k2)
                     enddo
                  enddo
               enddo
c
            enddo
         enddo
c
      enddo
c
      end
The results are quite strange:
- cluster1: 0m29s
- cluster2: 0m37s
- cluster3: 2m32s
Between cluster1 and cluster2 the time difference is small and could be explained by the difference in CPU frequency between the two clusters.
But why is cluster3 so slow? Its hardware is quite new (from last year) compared with cluster1 and cluster2 (hardware and software from 2011).
Is it a problem with the code above, or an optimization problem?
Any help or suggestions would be appreciated.
Best regards,
Guillaume De Nayer
As an educated guess: in a(l,i2,j2,k2), with l=1 and na=1, the array is declared real a(na,-nfx:nfx,-nfy:nfy,-nfz:nfz), so the first dimension has extent 1. The compiler does not recognize the access as stride-1 when traversing the i2 index, and as a result your loop runs in scalar (or pseudo-gather) mode.
To confirm this, use:
real a(-nfx:nfx,-nfy:nfy,-nfz:nfz, na)
and
a(i2,j2,k2,l)
While the workaround may restore performance, your inclination may be to say "fix the compiler". The compiler should indeed be fixed, but the fact that you declared real a(na,-nfx:nfx,-nfy:nfy,-nfz:nfz) suggests to me that at some point (or even now) your eventual application will have na > 1 and will perform a similar computation loop. At that point you would experience a similar slowdown. Moving the na dimension to the other end ensures your inner loop runs with stride 1 and can therefore be optimized favorably.
FWIW, take your program above, set na to a number representative of what you intend it to be, and re-run your tests on all three systems.
(Then move the na index position to last and run again.)
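A minimal free-form sketch of the rearranged layout (with hypothetical small extents, just to illustrate that moving na to the last dimension makes i2 the fastest-varying, stride-1 index; these are not the values from the benchmark above):

```fortran
program stride1
  implicit none
  ! Hypothetical small extents for illustration only
  integer, parameter :: nfx = 4, nfy = 2, nfz = 2, na = 1
  real :: a(-nfx:nfx, -nfy:nfy, -nfz:nfz, na)   ! na moved to the last dimension
  real :: xu(-nfx:nfx, -nfy:nfy, -nfz:nfz)
  real :: s
  integer :: i2, j2, k2, l

  a  = 1.0
  xu = 2.0
  l  = 1
  s  = 0.0
  do k2 = -nfz, nfz
     do j2 = -nfy, nfy
        ! i2 is now the leftmost index of a, so consecutive iterations
        ! of this inner loop touch consecutive memory locations in both
        ! xu and a (Fortran is column-major), which the compiler can
        ! vectorize with unit-stride loads
        do i2 = -nfx, nfx
           s = s + xu(i2,j2,k2)*a(i2,j2,k2,l)
        end do
     end do
  end do
  write (*,'(F0.1)') s   ! 9*5*5 terms of 2.0 each = 450.0
end program stride1
```

With a(na,...) the same inner loop would step through memory in jumps of na elements for a, which is exactly the non-unit-stride pattern described above.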
Jim Dempsey
Thanks a lot for your answer.
Indeed, you're right: real a(-nfx:nfx,-nfy:nfy,-nfz:nfz,na) fixes the problem.
Thanks for the tip. I will check the whole code.
Best regards
I think this program is not useful as a 'benchmark' because it references an uninitialized array (xu).
Depending on the hardware, the FPU exception flags in effect and the compiler options used, parts of the uninitialized array may contain bit patterns that, when interpreted as IEEE floating-point values, trigger exceptions that are handled in the background. Handling these exceptions billions of times can cause more of the program's time to be spent in exception-handling code than in your intended 'calculations'.
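A minimal sketch of the kind of initialization meant here (a hypothetical standalone example, not the benchmark itself): assigning defined values before any timed computation rules out this class of slowdown.

```fortran
program initcheck
  implicit none
  integer, parameter :: n = 8
  real :: xu(n)

  ! Give the array defined values before any timed computation;
  ! uninitialized memory may hold denormal or NaN bit patterns that
  ! make every arithmetic operation on them far more expensive.
  xu = 1.5

  write (*,'(F0.1)') sum(xu)   ! 8 * 1.5 = 12.0
end program initcheck
```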
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your answer, mcej4.
The 'benchmark' I gave in my first post is only a small part of our code. In the whole code all values are initialized, and the performance problem is still present.
I have tested the 'benchmark' above with initialized arrays, and it has no impact on the results.
Best regards