Intel® Fortran Compiler

Very slow compilation of certain loop structures (FC 13.x)

Sait_U_
Beginner
665 Views

All versions of FC 13.x take a very long time to compile certain loop structures. I have isolated the problem to a few of my subroutines and found a way to change this behavior. I am attaching the routines that take a long time and a changed version that does not.

The compilation flags are:

-c -free -warn all -nogen-interfaces -O3 -xHost -openmp

The CPUs are: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz

For your information,

Sammy

0 Kudos
8 Replies
mecej4
Honored Contributor III

I am unable to match your hardware and software setup, so take this note as suggestions only. You may try -O2 instead of -O3 if you want to improve compilation time. The variant code below may be more cache-friendly.

[fortran]
do iz = 1, ncolz
   do iy = 1, ncoly
      do ix = 1, ncolx
         fout(ix,iy,iz) = fout(ix,iy,iz) + &
            dot_product(d1x(ix,1:ncolx),finn(1:ncolx,iy,iz,1)) + &
            dot_product(d1y(iy,1:ncoly),finn(ix,1:ncoly,iz,2)) + &
            dot_product(d1z(iz,1:ncolz),finn(ix,iy,1:ncolz,3))
      end do
   end do
end do
[/fortran]

Sait_U_
Beginner

Hi, thanks for the suggestions. I am sure those will compile faster, like the second example I attached.

My purpose in posting this is for the developers to fix the problem, since I had to spend a lot of time trying to figure out why it takes many more minutes to compile my large code, and I narrowed it down to a number of subroutines.

If you try to compile this little subroutine on my machine, it literally takes minutes to compile, while the altered version compiles in seconds. It could be the extra optimization it is trying to do for my processor.

Cheers,

Sait

jimdempseyatthecove
Honored Contributor III

Sait,

In looking at your code and mecej4's sample, I think something in mecej4's code was glossed over. In Fortran, you should nest your loops so that the innermost loop runs over the leftmost index in the array subscripts. In other words, the index order (left to right) should match the loop nest level (inner to outer). Your sample code has this reversed. At -O3 the compiler's optimizer may be working hard to interchange the loop order, as well as trying to fuse the loops. I suggest you reorder the loops and see what happens to the compile time, then also run a test to see what happens to the run time. You may be pleasantly surprised on both counts.
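
A minimal sketch of the loop-ordering point, for illustration only. It is not taken from the attached routines: the arrays a, b, c and the size n are hypothetical, and only the index names ix, iy, iz follow the thread. Fortran stores arrays in column-major order, so the leftmost subscript varies fastest in memory. The program fills one array with ix innermost and another with iz innermost, times both, and checks that the results agree.

```fortran
program loop_order_demo
  implicit none
  ! n is a hypothetical grid size chosen for illustration only.
  integer, parameter :: n = 128
  real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  integer :: ix, iy, iz
  integer :: t0, t1, t2, rate

  allocate(a(n,n,n), b(n,n,n), c(n,n,n))
  call random_number(a)

  call system_clock(t0, rate)

  ! Column-major friendly: the leftmost subscript (ix) is the innermost
  ! loop, so consecutive iterations touch adjacent memory locations.
  do iz = 1, n
     do iy = 1, n
        do ix = 1, n
           b(ix,iy,iz) = 2.0*a(ix,iy,iz)
        end do
     end do
  end do

  call system_clock(t1)

  ! Reversed nesting: same arithmetic, but the innermost loop (iz)
  ! jumps n*n elements through memory on every iteration.
  do ix = 1, n
     do iy = 1, n
        do iz = 1, n
           c(ix,iy,iz) = 2.0*a(ix,iy,iz)
        end do
     end do
  end do

  call system_clock(t2)

  print '(a,f8.4,a)', 'ix innermost: ', real(t1 - t0)/real(rate), ' s'
  print '(a,f8.4,a)', 'iz innermost: ', real(t2 - t1)/real(rate), ' s'
  print '(a,l1)', 'results identical: ', all(b == c)
end program loop_order_demo
```

At higher optimization levels the compiler may interchange the second loop nest on its own, in which case the two timings can come out similar. Building once at a low optimization level and once at -O3 shows whether that is happening.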

Jim Dempsey

Sait_U_
Beginner

Hello Jim,

I just tried your suggestion and it made no difference to the compile time. Very large subroutines compile in a short time, and then I have to wait for three routines (one of which is the original gradient.f90) to compile. I run a parallel compile on a 16-processor machine. After the initial compile phase, 3 processors work at 100% for about a minute to compile these routines. Do you think a compiler should spend that much time on such a small subroutine?

Regarding the loop ordering: that was something we were doing twenty years ago. I assumed compilers are now clever enough that we don't have to worry about such things. Do you think it would really make a runtime difference? That would be disappointing. I inlined my very large code and found that it runs slower, which was a disappointment. Perhaps I should try it again.

Sait_U_
Beginner

By the way.....ifort versions 12.x compile these very fast...there is no such delay.

mecej4
Honored Contributor III

"By the way.....ifort versions 12.x compile these very fast"

On my lowly PC with an E8400 CPU, 8 GB RAM, SATA 5400 r.p.m. HD, OpenSuse 12.2-X64, my findings regarding compilation time are quite different from yours. With the command

ifort -c -O3 -xHost -openmp divergence-long.f90

IFort 12.1.7.367 took 20 seconds, whereas IFort 13.1.1.163 took only 1.6 seconds.

Sait_U_
Beginner

On my computer the identical command takes 78 seconds (yes, more than a minute!). With -O2 -xHost it takes a fraction of a second, and with -O3 alone (no -xHost) it takes 16 seconds. This is a dual Xeon E5-2687W (8 cores each) with 64 GB. Perhaps it is the AVX optimization. I am surprised that your 12.x result is 20 seconds; in my case it takes just a few seconds.

Steven_L_Intel1
Employee

When I try it, with -O3 and -xHost (on a Nehalem system), 12.1 takes 25 seconds and 13.1 takes 50 seconds. I will look into this further.
