Intel® Fortran Compiler

Degraded Run Time

Nick2
New Contributor I

In going from one version of my Fortran program to another, I've somehow increased the run time from 24:10 (24 minutes) to 35:34.  Looking over the code changes, there are a few more algebraic expressions, changed FORMAT statements, some extra assignments, and a few DATA statements - nothing we haven't done before.  This only happens in release mode:

 

    <Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" GenAlternateCodePaths="codeForAVX" IntegerKIND="integerKIND8" RealKIND="realKIND8" LocalSavedScalarsZero="true" FloatingPointExceptionHandling="fpe0" FlushDenormalResultsToZero="true" Traceback="true"/>

 

In debug mode, the two code versions have pretty much identical run times.  I tried moving major subroutines from the new version into the old version, one at a time, to figure out which one is causing the slowdown; apparently none of them is.  I tried to run it through a trial version of VTune, with flags like this:

 

    <Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" DebugInformationFormat="debugEnabled" GenAlternateCodePaths="codeForAVX" WarnInterfaces="true" IntegerKIND="integerKIND8" RealKIND="realKIND8" LocalSavedScalarsZero="true" FloatingPointExceptionHandling="fpe0" FlushDenormalResultsToZero="true" Traceback="true" RuntimeChecks="rtChecksAll" BoundsCheck="true" RuntimeLibrary="rtMultiThreadedDebug" Interfaces="true"/>

 

But I got identical run times between the two code versions again.  With 15 MB of source code, I'm worried that I may be hitting some kind of compiler limit - one of my configurations caused this error:

 

catastrophic error: Internal compiler error: internal abort. Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.

 

(I did report it btw)

 

Any ideas on the best way to figure out why my run time suddenly jumped by 46%?  By the way, I tried the same experiment on OpenVMS, and the difference in run times in release mode is only 28%.  My test machine: Windows 8, Xeon E5-2687W, Intel(R) Visual Fortran Compiler XE 12.1.4.325 [Intel(R) 64].

Les_Neilson
Valued Contributor II
Difficult to say without code (I know, your code may be too big or too private to post) but some ideas to consider:

(a) An increase in the creation of temporary arrays in your "few more algebraic statements"
(b) Increased paging - are you accessing matrices in the correct (most efficient) subscript order? (see the sketch below)
(c) Increased I/O
(d) Antivirus software interference
(e) Network traffic

Les
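On point (b): Fortran stores arrays in column-major order, so the leftmost subscript should vary fastest in nested loops. A minimal hypothetical sketch (array name and bounds invented for illustration):

    ! Hypothetical sketch: Fortran is column-major, so the leftmost
    ! subscript should be the innermost loop index.
    PROGRAM ORDER_DEMO
      IMPLICIT NONE
      INTEGER :: I, J
      REAL :: A(1000,1000)
      ! Cache-unfriendly: the inner loop strides by 1000 elements.
      DO I = 1, 1000
        DO J = 1, 1000
          A(I,J) = REAL(I+J)
        END DO
      END DO
      ! Cache-friendly: the inner loop touches consecutive memory.
      DO J = 1, 1000
        DO I = 1, 1000
          A(I,J) = REAL(I+J)
        END DO
      END DO
      PRINT *, A(500,500)
    END PROGRAM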
Lorri_M_Intel
Employee
A couple of other things I noticed.

First - the VTune run was with non-optimized (i.e., debug-enabled) code. You said there was little-to-no difference between the two program versions with non-optimized code, so finding no difference in VTune is probably expected.

Second - I find it interesting that you're seeing a degradation with your OpenVMS implementation as well. The Intel Fortran compiler and the OpenVMS Fortran compiler have the same heritage, but they diverged about 10 years ago and have virtually nothing in common anymore, especially related to optimizations.

Therefore - I think you should look at some of Les's suggestions in the previous reply. Are you accessing matrices differently now? Are there more temporary arrays? I *believe* that /check:arg_temp_created was supported in the OpenVMS version (it has been 10 years and all), so you may be able to use that to find any new temps.

I hope this helps -

--Lorri
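For reference, a minimal hypothetical example of the kind of code /check:arg_temp_created flags at run time - passing a non-contiguous array section to an explicit-shape dummy argument forces the compiler to create a contiguous temporary copy:

    ! Hypothetical demo: the strided row section A(1,:) cannot be passed
    ! directly to an explicit-shape dummy, so a temporary copy is made,
    ! which /check:arg_temp_created reports at run time.
    PROGRAM TEMP_DEMO
      IMPLICIT NONE
      REAL :: A(100,100)
      A = 1.0
      CALL WORK(A(1,:), 100)
    CONTAINS
      SUBROUTINE WORK(V, N)
        INTEGER :: N
        REAL :: V(N)        ! explicit-shape dummy requires contiguous storage
        PRINT *, SUM(V)
      END SUBROUTINE
    END PROGRAM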
Nick2
New Contributor I
Thanks everyone for the great suggestions!

The algebraic statements are of the kind:

    IF (ZSGWRTAP(2,IL).GT.TINY) THEN
      ZWSGWRL(IL)=MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(1,IL))
      FZWSGWRL(IL)=ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(2,IL)-ZSGWRTAP(1,IL))
    ENDIF

One of the tests I tried was to compile the "old" code with the "new" common blocks (where I store all the arrays), and this didn't change the run time. We haven't added any big arrays - all in all, we've added about 150 REAL variables. It's very possible that a small change can destabilize some large algorithm, but if that's the case I'm having trouble finding it.

As for accessing the matrices in the most efficient subscript order: mostly we do, sometimes we don't. But we haven't added any big matrix operations in this round.

The old code reads and writes about 205 MB of data, and the new one 193 MB. It's all done on a local machine, with just one user logged on, using the "Console" build to minimize other effects. I've repeated this result enough times that I doubt antivirus or any such interference is causing the difference.

Maybe I'm not compiling the code correctly for VTune. When I try to run my Release code with VTune, I don't get any stack traces. That's why I started with the Debug version and added "Maximize Speed" and "Generate AVX Code." Is there a better way?
TimP
Honored Contributor III
You don't show what TINY is. While the intrinsic names aren't reserved, this usage seems dangerous, particularly if you don't use implicit none. TINY(ZSGWRTAP(2,IL)) might make sense.
Nick2
New Contributor I
Sorry... we use statements like these very frequently:

    REAL LARGE,TINY,SMALL
    DATA LARGE/1.E6/,TINY/1.E-6/,SMALL/1.E-3/

Typical design intent is to ignore small physical quantities (e.g., less than 1 mm or 1 g) more so than to ignore small numerical quantities. We have a "check for uninitialized variables" build; that should catch the type of error you're referring to, right?
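To make the shadowing explicit, here is a minimal hypothetical sketch of what Tim is warning about - once TINY is declared as a variable, the intrinsic of the same name is hidden in that scope:

    ! Hypothetical sketch: a local declaration hides the TINY intrinsic.
    PROGRAM SHADOW_DEMO
      IMPLICIT NONE
      REAL LARGE,TINY,SMALL
      DATA LARGE/1.E6/,TINY/1.E-6/,SMALL/1.E-3/
      PRINT *, TINY       ! 1.0E-06: the user's threshold, not the intrinsic
      ! TINY(1.0) would be about 1.18E-38, but here TINY is a scalar
      ! variable, so a reference like TINY(1.0) no longer compiles.
    END PROGRAM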
jimdempseyatthecove
Honored Contributor III
Try commenting out just these statements:

    IF (ZSGWRTAP(2,IL).GT.TINY) THEN
      ZWSGWRL(IL)=MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(1,IL))
      FZWSGWRL(IL)=ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(2,IL)-ZSGWRTAP(1,IL))
    ENDIF

Leave the remainder of the code in place. This particular test in a loop may interfere with vectorization. Should the commented-out code restore the performance, then you may need to look at a vector-friendly conditional statement.

Also, if IL is your loop control variable, then consider swapping indices on ZSGWRTAP such that it is:

    IF (ZSGWRTAP(IL,2).GT.TINY) THEN
      ZWSGWRL(IL)=MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(IL,1))
      FZWSGWRL(IL)=ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(IL,2)-ZSGWRTAP(IL,1))
    ENDIF

Jim Dempsey
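As one illustration of a vector-friendly rewrite, a hypothetical sketch using MERGE to compute both paths and select instead of branching (N, IL, and the array declarations are assumed from the surrounding code; the guarded divide should be checked against the design intent):

    ! Hypothetical branch-free form; when the condition is false, each
    ! element keeps its previous value, matching the original IF.
    ! MAX(TINY,...) keeps the divide safe even when the condition is false.
    DO IL = 1, N
      ZWSGWRL(IL)  = MERGE(MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(1,IL)), &
                           ZWSGWRL(IL), ZSGWRTAP(2,IL).GT.TINY)
      FZWSGWRL(IL) = MERGE(ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(2,IL)-ZSGWRTAP(1,IL)), &
                           FZWSGWRL(IL), ZSGWRTAP(2,IL).GT.TINY)
    END DO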
Nick2
New Contributor I
Thanks! I tried commenting that out, and I tried swapping out various pieces of code. Now this gets even more interesting: I ran the "old" code vs. the "new" code, varying the 32-bit vs. 64-bit platform, and whether I set RealKIND="realKIND8" or not (SX vs. DX). It's only the double-precision builds that are taking longer to run...

    Old Build SX 32   29:43
    Old Build DX 32   34:37
    Old Build SX 64   20:41
    Old Build DX 64   24:09
    New Build SX 32   30:13
    New Build DX 32   48:06
    New Build SX 64   20:15
    New Build DX 64   33:55

How does that happen?
Nick2
New Contributor I
I just realized this... the "old build" used ifort 12.0.4.196, and the "new build" used ifort 12.1.4.325. So I tried:

    Old Build DX 64 (new compiler)   36:03
    New Build DX 64 (old compiler)   26:09

It's not just the clock time increasing; the number of timesteps taken to converge is also increasing (mainly in the subroutine that produced the "catastrophic error" - see Premier issue 630226).

What's changed in 12.1.4.325 to make this worse? Has this been improved in the 2013 release? Should I wait for "Update 1"?
jimdempseyatthecove
Honored Contributor III
In any code where you have convergence, the number of iterations is highly dependent upon the precision. The algorithm affects the number of iterations to converge (or fail) depending upon the precision... higher precision does not necessarily mean faster or slower convergence.

It is important to note that the default precision is REAL(4) in IVF. Therefore you should be aware that 1.0 and 123.45E+6 are REAL(4) literals, whereas 1.0D0 and 123.45D+6 are REAL(8) literals (compiler options can change the default real size). Therefore

    IF (Real8Var .GT. 1.0) THEN

is a comparison of a REAL(8) to a REAL(4), which, depending upon compile-time options, may result in runtime code that loads the REAL(4) literal, promotes it to REAL(8), and then performs the test. Promotion of literals is not guaranteed at compile time; this is to maintain compatibility of results with code where promotion was only performed at run time.

Jim Dempsey
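To see the literal-kind effect directly, a small hypothetical demo (values chosen only to expose the representation gap between kinds):

    ! Hypothetical demo of REAL(4) vs. REAL(8) literals in a mixed comparison.
    PROGRAM LITERAL_KINDS
      IMPLICIT NONE
      REAL(8) :: X
      X = 0.1D0
      PRINT *, X .GT. 0.1     ! F: the promoted REAL(4) 0.1 lands slightly ABOVE 0.1D0
      PRINT *, X .GT. 0.1D0   ! F: both sides are the same REAL(8) value
      PRINT *, 0.1, 0.1D0     ! shows the representation gap between the two kinds
    END PROGRAM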
Nick2
New Contributor I
We run a utility to convert things like 1.E0 to 1.D0 for production DX code. I also tried running "unconverted" single-precision code as double precision without the utility, by adding the RealKIND="realKIND8" flag alone, and the run times between "unconverted" and "converted" code were pretty much the same (33:36 vs. 33:55). Good job on that one, Intel!

Now, here are the clock run times and the number of timesteps for my different builds (keep in mind "old build" also means "old compiler" and "new build" means "new compiler"):

    Build             Time    Timesteps
    Old Build SX 32   29:43   267762
    Old Build DX 32   34:37   265440
    Old Build SX 64   20:41   270827
    Old Build DX 64   24:09   271496
    New Build SX 32   30:13   272290
    New Build DX 32   48:06   316374
    New Build SX 64   20:15   271076
    New Build DX 64   33:55   327141

Do you see why I'm very suspicious of the compiler here?
jimdempseyatthecove
Honored Contributor III
You must look closely at how you compute the convergence - not only at the source code level but also in combination with the compiler options relating to expression evaluation (the default may have changed). Check/experiment with the /fp:... switches.

You might consider inserting trace printouts during various sections of the convergence:

    Iterations 1:100
    Iterations 1000:1100
    Iterations 10000:10100
    Iterations 100000:100100
    Iterations 270000:270100
    Iterations 316200:316300
    Iterations 327000:327100

(or use spans of +10). Seeing how the components of what determines convergence vary may yield information as to what is going on and how to fix it. You may find that your old code was so wrong that correct code may yield convergence in a fraction of the number of iterations (and with better results).

Jim Dempsey
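One hypothetical way to wire up those windows without flooding the output (the span bounds, and the RESID/CORR names standing in for whatever quantities actually drive the convergence test, are placeholders):

    ! Hypothetical trace hook: report only inside selected iteration windows.
    MODULE TRACE_MOD
    CONTAINS
      LOGICAL FUNCTION IN_WINDOW(IT)
        INTEGER, INTENT(IN) :: IT
        INTEGER, PARAMETER :: STARTS(7) = &
          (/1, 1000, 10000, 100000, 270000, 316200, 327000/)
        IN_WINDOW = ANY(IT >= STARTS .AND. IT <= STARTS + 100)
      END FUNCTION
    END MODULE

    ! In the convergence loop:
    !   IF (IN_WINDOW(ITER)) WRITE(*,'(I8,2ES16.8)') ITER, RESID, CORR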
Nick2
New Contributor I
A 4-year-old version of my code, today's version of my code - makes no difference. All single-precision runs, OpenVMS runs, G95/Linux runs, and double-precision runs with the old compiler take around 270,000 iterations to complete the calculation. Double-precision runs with ifort 2011 Update 6 or later take more iterations and more run time.

Check out the file I attached, in release/Win32. I compared run times with Update 4 vs. Update 10:

    Core 2 Quad Q9450:  new ifort 17.55 seconds; old ifort 9.45 seconds
    Xeon E5-2687W:      new ifort 14.96 seconds; old ifort 4.52 seconds

    <Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" OmitFramePointers="false" GenAlternateCodePaths="codeForAVX" IntegerKIND="integerKIND8" LocalSavedScalarsZero="true" FloatingPointExceptionHandling="fpe0" FlushDenormalResultsToZero="true" Traceback="true"/>

Now tell me there is still something wrong with my code?? What changed in Update 6???
jimdempseyatthecove
Honored Contributor III
Sorry I could not embellish the above post with additional comments, due to this $%&! forum editor going bonkers when your reply exceeds the size you can stretch this edit box. Suggestions now are: enable IPO in the compiler (at least single-file) *** and enable IPO in the linker.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
BTW, regarding the prior-to-last post (the INTEGER(8) loop control variable): perform this optimization only if (when) IPO does not reduce the loop control overhead to an acceptable portion of the overall loop iteration. Also note that in your example, i was not used inside the loop, so the i=i+1 could be eliminated.

Jim Dempsey
Nick2
New Contributor I
This is interesting... if I do 1e9 iterations, integer(4) vs. integer(8) doesn't make any difference, with either the new or the old compiler. It's the 1e10 iterations that makes a difference. If I rewrite it as two loops (one loop inside the other) of 1e5 iterations each, both versions of the compiler run at the same speed (fast), AND give me digit-for-digit identical results!

I'll try my full-blown code with the new compiler and with changing integer(8) to integer(4) everywhere (though my "single-precision" build also has all integers as integer(8), so I'm not sure how much difference it will make; changing all integers to integer(8) was just a quick, cheap fix we did for a legacy hashing/integer-packing algorithm).
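For the record, a hypothetical sketch of the loop split being described - 1e10 total trips expressed as 1e5 x 1e5, so each counter fits comfortably in integer(4) (the loop body stands in for the real computation):

    ! Hypothetical sketch of the two loop shapes.
    PROGRAM LOOP_SPLIT
      IMPLICIT NONE
      INTEGER(4) :: J, K
      INTEGER(8) :: I          ! needed only for the single-loop form below
      REAL(8) :: S
      S = 0.0D0
      ! Single loop: the counter must be INTEGER(8) to reach 1E10.
      ! DO I = 1_8, 10_8**10
      !   S = S + 1.0D0
      ! END DO
      ! Nested form: both counters stay within INTEGER(4) range.
      DO K = 1, 100000
        DO J = 1, 100000
          S = S + 1.0D0
        END DO
      END DO
      PRINT *, S
    END PROGRAM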
Nick2
New Contributor I
Unfortunately, making all integers integer(4) didn't help my run time, or the number of iterations to converge, in my release code. So, is the issue identified above another (unrelated) bug, Intel?
jimdempseyatthecove
Honored Contributor III
>>Unfortunately, making all integers integer(4) didn't help my run time, or the number of iterations to converge in my release code.

If your integer(4) loop control variables are iterating 2^31 or more times, then this will cause havoc in those loops (note: this may not cause a program crash). Should a loop require more iterations than that, you will need to modify the code as discussed above (nested loops) or use integer(8) loop control.

Jim Dempsey
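A tiny hypothetical demo of that hazard - integer overflow is not defined by the Fortran standard, and with IVF's default settings it typically wraps silently rather than crashing:

    ! Hypothetical demo: an INTEGER(4) counter wrapping past HUGE.
    PROGRAM WRAP_DEMO
      IMPLICIT NONE
      INTEGER(4) :: N
      N = HUGE(N)       ! 2147483647
      N = N + 1_4       ! overflow: typically wraps to -2147483648
      PRINT *, N
    END PROGRAM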
Nick2
New Contributor I
jimdempseyatthecove wrote:
If your integer(4) loop control variables are iterating 2^31 or more times, then this will cause havoc in those loops (note: this may not cause a program crash).

Understood... I didn't need integer(8) for loops; I needed it for a goofy utility integer-packing algorithm.