Very slow compilation of certain loop structures (FC 13.x)

Sait_U_ · ‎04-10-2013

All versions of FC 13.x take a very long time to compile certain loop structures. I have isolated the problem to a few of my subroutines and found a way to change this behavior. I am attaching the routines that take a long time and a changed version that does not.

The compilation flags are: -c -free -warn all -nogen-interfaces -O3 -xHost -openmp

The CPUs are: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz

For your info,

Sammy

mecej4 · ‎04-10-2013

I am unable to match your hardware and software setup, so take this note as consisting of suggestions only. You may try -O2 instead of -O3 if you want to improve compilation time. The variant code below may be more cache-friendly.

[fortran]
do iz = 1, ncolz
   do iy = 1, ncoly
      do ix = 1, ncolx
         fout(ix,iy,iz) = fout(ix,iy,iz) + &
            dot_product(d1x(ix,1:ncolx),finn(1:ncolx,iy,iz,1)) + &
            dot_product(d1y(iy,1:ncoly),finn(ix,1:ncoly,iz,2)) + &
            dot_product(d1z(iz,1:ncolz),finn(ix,iy,1:ncolz,3))
      end do
   end do
end do
[/fortran]
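For anyone who wants to try the variant in isolation, here is a minimal self-contained sketch. The array shapes (d1x as ncolx by ncolx, finn with a fourth dimension of extent 3, and so on) are inferred from the subscripts in the loop and are assumptions, as are the small grid sizes, the random test data, and the program name. The reference loop is written here only as a check; it is not the attached original routine, which is not reproduced in this thread.

[fortran]
program divergence_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Hypothetical grid sizes, chosen small so the check runs instantly
   integer, parameter :: ncolx = 4, ncoly = 5, ncolz = 6
   ! Shapes below are assumptions inferred from the subscripts in the loop
   real(dp) :: d1x(ncolx,ncolx), d1y(ncoly,ncoly), d1z(ncolz,ncolz)
   real(dp) :: finn(ncolx,ncoly,ncolz,3)
   real(dp) :: fout(ncolx,ncoly,ncolz), fref(ncolx,ncoly,ncolz)
   integer :: ix, iy, iz, k

   call random_number(d1x)
   call random_number(d1y)
   call random_number(d1z)
   call random_number(finn)
   fout = 0.0_dp
   fref = 0.0_dp

   ! Variant from the post: one dot_product per direction, ix innermost
   do iz = 1, ncolz
      do iy = 1, ncoly
         do ix = 1, ncolx
            fout(ix,iy,iz) = fout(ix,iy,iz) + &
               dot_product(d1x(ix,1:ncolx), finn(1:ncolx,iy,iz,1)) + &
               dot_product(d1y(iy,1:ncoly), finn(ix,1:ncoly,iz,2)) + &
               dot_product(d1z(iz,1:ncolz), finn(ix,iy,1:ncolz,3))
         end do
      end do
   end do

   ! Reference: the same sums written as explicit loops over k
   do iz = 1, ncolz
      do iy = 1, ncoly
         do ix = 1, ncolx
            do k = 1, ncolx
               fref(ix,iy,iz) = fref(ix,iy,iz) + d1x(ix,k)*finn(k,iy,iz,1)
            end do
            do k = 1, ncoly
               fref(ix,iy,iz) = fref(ix,iy,iz) + d1y(iy,k)*finn(ix,k,iz,2)
            end do
            do k = 1, ncolz
               fref(ix,iy,iz) = fref(ix,iy,iz) + d1z(iz,k)*finn(ix,iy,k,3)
            end do
         end do
      end do
   end do

   print *, 'max abs difference:', maxval(abs(fout - fref))
   ! The two forms should agree up to rounding; stop with an error code if not
   if (maxval(abs(fout - fref)) > 1.0e-12_dp) stop 1
end program divergence_sketch
[/fortran]

Compiling this file with and without -O3 -xHost gives a quick way to see whether the dot_product form alone triggers the long compile on a given setup.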

Sait_U_ · ‎04-10-2013

Hi, thanks for the suggestions. I am sure those will compile faster, like the second example I attached.

My purpose in posting this is for the developers to fix the problem, since I had to spend a lot of time figuring out why my large code takes many more minutes to compile, and reducing it to a number of subroutines.

If you try to compile this little subroutine on my machine, it literally takes minutes to compile, while the altered version compiles in seconds. It could be the extra optimization it is trying to do for my processor.

Cheers,

Sait

jimdempseyatthecove · ‎04-11-2013

Sait,

In looking at your code and mecej4's sample, I think there is something in mecej4's code that was glossed over. In Fortran, you should nest your loops so that the innermost loop runs over the leftmost index in the array subscripts. In other words, the index order (left to right) should match the loop nest level (inner to outer). Your sample code has this reversed. The optimizer at -O3 may be working hard to invert the loop order, as well as trying to fuse the loops. I suggest you reorder the loops and see what happens to the compile time. Then also run a test to see what happens to the runtime. You may be pleasantly surprised on both counts.
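A minimal sketch of the two orderings, for anyone who wants to measure the runtime effect themselves. The array name, its size, and the update performed are all hypothetical and unrelated to the attached routines. Fortran stores arrays in column-major order, so the first nest walks memory contiguously and the second one strides through it. Whether the second nest is actually slower depends on the compiler and optimization level, since the optimizer may interchange the loops on its own.

[fortran]
program loop_order_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: i8 = selected_int_kind(18)
   integer, parameter :: n = 200            ! hypothetical size, about 64 MB
   real(dp), allocatable :: a(:,:,:)
   integer :: i, j, k
   integer(i8) :: t0, t1, t2, rate

   allocate(a(n,n,n))
   call random_number(a)

   call system_clock(t0, rate)

   ! Recommended order: leftmost subscript (i) in the innermost loop
   do k = 1, n
      do j = 1, n
         do i = 1, n
            a(i,j,k) = 2.0_dp*a(i,j,k) + 1.0_dp
         end do
      end do
   end do

   call system_clock(t1)

   ! Reversed order: leftmost subscript (i) in the outermost loop
   do i = 1, n
      do j = 1, n
         do k = 1, n
            a(i,j,k) = 2.0_dp*a(i,j,k) + 1.0_dp
         end do
      end do
   end do

   call system_clock(t2)

   print *, 'i innermost (s):', real(t1 - t0, dp)/real(rate, dp)
   print *, 'i outermost (s):', real(t2 - t1, dp)/real(rate, dp)
   ! Print one element so the updates cannot be optimized away
   print *, 'a(1,1,1) =', a(1,1,1)
end program loop_order_sketch
[/fortran]

Comparing the two timings at -O0 and at -O3 also gives a rough hint of whether the compiler is interchanging the loops for you.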

Jim Dempsey

Sait_U_ · ‎04-11-2013

Hello Jim,

I just tried your suggestion and it made no difference to the compile time. Very large subroutines compile in a short time, and then I have to wait for three routines (one of which is the original gradient.f90) to compile. I run a parallel compile on a 16-processor machine. After the initial compile phase, 3 processors work at 100% for about a minute to compile these routines. Do you think a compiler should spend that much time on such a small subroutine?

Regarding the loop ordering: that was something we were doing twenty years ago. I assumed compilers are now clever enough that we don't have to worry about such things. Do you think it would really make a runtime difference? That would be disappointing. I inlined my very large code and found that it runs slower, which was a disappointment. Perhaps I should try it again.

Sait_U_ · ‎04-11-2013

By the way.....ifort versions 12.x compile these very fast...there is no such delay.

mecej4 · ‎04-11-2013

By the way.....ifort versions 12.x compile these very fast

On my lowly PC with an E8400 CPU, 8 GB RAM, SATA 5400 r.p.m. HD, OpenSuse 12.2-X64, my findings regarding compilation time are quite different from yours. With the command

ifort -c -O3 -xHost -openmp divergence-long.f90

IFort 12.1.7.367 took 20 seconds, whereas IFort 13.1.1.163 took only 1.6 seconds.

Sait_U_ · ‎04-11-2013

On my computer the identical command takes 78 seconds (yes, more than a minute!). -O2 -xHost takes a fraction of a second, and -O3 alone, without -xHost, takes 16 seconds. This is a Xeon E5-2687W (8 core, dual) with 64 GB. Perhaps it is the AVX optimization. I am surprised that your 12.x result is 20 seconds. In my case it is just a few seconds.

Steven_L_Intel1 · ‎04-11-2013

When I try it, with -O3 and -xHost (on a Nehalem system), 12.1 takes 25 seconds and 13.1 takes 50 seconds. I will look into this further.