Software Tuning, Performance Optimization & Platform Monitoring

Global Variables with ifort

Rafael_Silva — Mon, 17 Sep 2012 13:45:11 GMT

Hello,

I'm using the Intel Fortran compiler and I'm getting performance issues when I use global variables (arrays). Basically, I have a four-dimensional array and a loop processing all of its elements. When I pass this array as a parameter to the subroutine, I get a certain execution time. When I use it directly inside the routine as a global variable, the execution time is double the previous one. I'm guessing the compiler disables some optimizations when I use a global variable. Is that the case? If so, how can I enable them again? If not, does anyone have any idea what causes this slowdown?

The arrays are declared in a global module like this:

real, allocatable, target :: ux(:,:,:,:), uy(:,:,:,:)

And allocated with:

allocate(ux(nmin1-4:nmax1+4+u_pad, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))

Any idea what is happening?

TimP — Mon, 17 Sep 2012 14:18:33 GMT

If you've read the ifort documentation but are still having difficulty understanding the compiler's optimization reports, you could follow up with an actual code sample on the Fortran forum appropriate to your platform (Windows, or Linux and Mac). Your description leaves too much up to the imagination.

Rafael_Silva — Mon, 17 Sep 2012 14:32:10 GMT

I've read the documentation and didn't find anything specific to globals. Since this is related to optimization, I thought this was the correct place to post. I can't see what is missing from the description. The central point is: I changed an array from a parameter to a global and it slowed down the code, doubling the execution time (the same code, same allocation, same types, same names). I would like to know what is missing from the description...

TimP — Mon, 17 Sep 2012 19:33:42 GMT

If you believe there's a difference in compilation, comparing the output of the opt-report-file option for the two cases should verify it. If you don't want to give a working example, you could at least quote the differences in the reports, in case they mean more to us than to you. Again, you'd get more expert opinions on the relevant Fortran forum, in case that's what you are interested in.

Rafael_Silva — Mon, 17 Sep 2012 23:24:50 GMT

I've posted on the Intel Fortran Compiler forum, but got no answer. If you think it is more appropriate to move this topic there, please do so (or someone who can do that).

This is the loop:

do k=nmin3-4,nmax3+4
  do j=nmin2-4,nmax2+4
    do i=nmin1-4,nmax1+4
      ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k, 1) - 4.*ux(i,j,k,0) + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
      uy(i,j,k,3) = (20.*uy(i,j,k,2) - 6.*uy(i,j,k, 1) - 4.*uy(i,j,k,0) + uy(i,j,k,-1) + 12.*uy(i,j,k,3)*dt2)*ctt
      uz(i,j,k,3) = (20.*uz(i,j,k,2) - 6.*uz(i,j,k, 1) - 4.*uz(i,j,k,0) + uz(i,j,k,-1) + 12.*uz(i,j,k,3)*dt2)*ctt
    end do
  end do
end do

This is the declaration:

real, allocatable, target :: ux(:,:,:,:), uy(:,:,:,:)
real, allocatable, target :: uz(:,:,:,:)

This is the allocation:

allocate(ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))
allocate(uy(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))
allocate(uz(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))

Case 1: The arrays are declared and allocated inside a subroutine called Source. This subroutine calls another one, called Update, passing ux, uy, uz as parameters, which are "defined" by the Update subroutine like this:

real :: ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)
real :: uy(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)
real :: uz(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)

In fact, it is the same definition as in the caller subroutine, Source.

Case 2: The arrays are still allocated by the Source subroutine, but declared as globals in another module accessible to both the Source and Update subroutines. So the Update subroutine just accesses them as global variables, with no need to define them or pass them as parameters.

My problem is: case 2 is two times slower than case 1.

In case 1, the line numbers are:

19: do k=nmin3-4,nmax3+4
20: do j=nmin2-4,nmax2+4
21: do i=nmin1-4,nmax1+4

The HLO report for case 1 on these lines is:

LOOP DISTRIBUTION in update_3d_ at line 21
LOOP DISTRIBUTION in update_3d_ at line 21
LOOP DISTRIBUTION in update_3d_ at line 21
Loop Interchange not done due to: Original Order seems proper
Advice: Loop Interchange might help Loopnest at lines: 19 20 21
: Original Order found to be proper, but by a close margin

In case 2, the line numbers are:

12: do k=nmin3-4,nmax3+4
13: do j=nmin2-4,nmax2+4
14: do i=nmin1-4,nmax1+4

and the report says:

LOOP DISTRIBUTION in update_3d_ at line 14
Loop Interchange not done due to: Original Order seems proper
Advice: Loop Interchange might help Loopnest at lines: 12 13 14
: Original Order found to be proper, but by a close margin

I can see that there are two more LOOP DISTRIBUTION lines in the first case. This is the only difference I see. Any idea what is causing this and why?
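For anyone who wants to reproduce the comparison, the two cases described above could be sketched roughly as follows. This is a minimal, hypothetical reproducer and not the original code: the module, routine and program names are made up, only ux is shown, and the bounds are simplified to start at 1. One difference that is visible in the declarations quoted in this post is kept on purpose: the module array carries the target attribute, while the dummy argument does not.

```fortran
! Hypothetical reproducer: all names except ux, dt2 and ctt are assumptions.
module globals_mod
  implicit none
  ! Case 2: the array is a module ("global") variable, declared with TARGET
  real, allocatable, target :: ux(:,:,:,:)
end module globals_mod

module update_mod
  implicit none
contains
  ! Case 1: the array arrives as an explicit-shape dummy argument
  subroutine update_arg(ux, n1, n2, n3, dt2, ctt)
    integer, intent(in) :: n1, n2, n3
    real, intent(in) :: dt2, ctt
    real, intent(inout) :: ux(n1, n2, n3, -1:3)
    integer :: i, j, k
    do k = 1, n3
      do j = 1, n2
        do i = 1, n1
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) &
                         + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_arg

  ! Case 2: the same loop, but the array is reached through the module
  subroutine update_global(n1, n2, n3, dt2, ctt)
    use globals_mod, only: ux
    integer, intent(in) :: n1, n2, n3
    real, intent(in) :: dt2, ctt
    integer :: i, j, k
    do k = 1, n3
      do j = 1, n2
        do i = 1, n1
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) &
                         + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_global
end module update_mod

program compare_cases
  use globals_mod, only: ux
  use update_mod
  implicit none
  integer, parameter :: n1 = 16, n2 = 8, n3 = 4
  real, parameter :: dt2 = 0.01, ctt = 0.5
  real, allocatable :: ref(:,:,:,:)

  allocate(ux(n1, n2, n3, -1:3))
  allocate(ref(n1, n2, n3, -1:3))
  call random_number(ux)
  ref = ux                                     ! identical starting data

  call update_arg(ref, n1, n2, n3, dt2, ctt)   ! case 1
  call update_global(n1, n2, n3, dt2, ctt)     ! case 2

  ! Both variants must compute the same values; only the speed may differ
  if (maxval(abs(ref - ux)) > 1.0e-4) stop 1
  print *, 'case 1 and case 2 agree'
end program compare_cases
```

Compiling this with the vectorization report enabled and comparing the remarks for update_arg and update_global should show whether the same difference appears outside the full application.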

TimP — Tue, 18 Sep 2012 00:12:43 GMT

Are you saying that neither case (or maybe both) show full vectorization, so you don't see a difference there? The compiler is more likely to perform distribution on vectorized loops, but I certainly wouldn't like to rely on that as the only indicator.

Rafael_Silva — Tue, 18 Sep 2012 20:47:38 GMT

You're right, I was not enabling the full report. The faster version (case 1) says:

(629:13-629:13):VEC:sourcewave_: PARTIAL LOOP WAS VECTORIZED

The slower version (case 2) says:

HPO Vectorizer Report (update_3d_)
src/Update.f90(14): (col. 13) remark: loop was not vectorized: existence of vector dependence.
src/Update.f90(21): (col. 15) remark: vector dependence: assumed ANTI dependence between (unknown) line 21 and (unknown) line 15.
src/Update.f90(15): (col. 15) remark: vector dependence: assumed FLOW dependence between (unknown) line 15 and (unknown) line 21.
src/Update.f90(20): (col. 15) remark: vector dependence: assumed ANTI dependence between (unknown) line 20 and (unknown) line 15.
(...)

But why? Considering it is the same code, why does the compiler consider one version to have dependences and the other not?

Rafael_Silva — Wed, 19 Sep 2012 12:01:23 GMT

Forcing the vectorization, using the directive that says there is no loop-carried dependence, makes the execution time almost the same for both cases. But I still can't understand why the compiler considers one case to have dependences and the other not. The code is the same... I just changed the declarations from local to global.
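The post does not name the directive that was used. A minimal sketch, assuming it was Intel's IVDEP directive placed on the innermost loop of the global-variable version, might look like this. The module, routine and program names and the array sizes are hypothetical, and only ux is shown.

```fortran
! Hypothetical sketch: names and sizes are illustrative, not the poster's code.
module wave_globals
  implicit none
  ! Same attributes as the declaration quoted in the thread
  real, allocatable, target :: ux(:,:,:,:)
contains
  subroutine update_ivdep(dt2, ctt)
    real, intent(in) :: dt2, ctt
    integer :: i, j, k
    do k = lbound(ux, 3), ubound(ux, 3)
      do j = lbound(ux, 2), ubound(ux, 2)
        ! Assumed directive: asks ifort to ignore assumed (unproven) vector
        ! dependences in the next loop; other compilers treat it as a comment.
        !DIR$ IVDEP
        do i = lbound(ux, 1), ubound(ux, 1)
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) &
                         + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_ivdep
end module wave_globals

program ivdep_demo
  use wave_globals
  implicit none
  allocate(ux(-3:12, -3:8, -3:8, -1:3))
  ux = 1.0
  call update_ivdep(0.25, 0.5)
  ! With every input equal to 1.0: (20 - 6 - 4 + 1 + 12*0.25)*0.5 = 7.0
  if (any(abs(ux(:,:,:,3) - 7.0) > 1.0e-5)) stop 1
  print *, 'update completed'
end program ivdep_demo
```

The directive only removes the compiler's assumed dependences, so it is safe here only because each iteration reads and writes the same (i,j,k) element and no other iteration touches it.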