Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

significant speed reduction when converting to dynamic code

tim_s_1
Beginner
6,712 Views

I have been updating some legacy code to make it dynamic, primarily converting COMMON blocks to modules. The code uses many very large multi-dimensional arrays, mostly of rank 3 but some of higher rank as well. I have seen about a 50% increase in run time after making this change. I am guessing that some of the optimization is taking a hit now that the array sizes are not known at compile time. Is this type of speed reduction typical, and/or are there some flags I could set at compile time that would help?

Thanks.

P.S. I am currently compiling for 64-bit Windows but will also be compiling for 64-bit Linux. I am using Intel Visual Fortran Composer XE 2013.1.119.

tim_s_1
Beginner
OK, I ran a quick test using my last idea: I modified #define A1(i,j,k) A(i,j,k,1:1) to #define A1(i,j,k) A(i,j,k,1). ----I got the exact same number of assembly lines for both versions; now to see if it runs faster.---- THIS WAS WRONG (how come we can't use a del HTML tag?). Thanks, I'll let you know how it goes.
jimdempseyatthecove
Honored Contributor III
Tim s,

I used the preprocessor trick on an F77-to-F90 conversion on a solution with 13 projects, ~750 files, and 600,000 lines of code. The #define and USE were conditionalized at the top of the files such that, for most of the conversion process, the same source files could compile using modules or COMMON. This helped immensely in the first phase of the conversion. Once all was running as it should, new features were added into the source files (making the code no longer compilable as F77).

I think you can start slowly, such as with the A1...A6 hack. This will be a diminishing-returns type of thing; hopefully a few such changes will be all that is required. Don't give up on modules too soon.

On my conversion from F77 COMMONs to F90 ALLOCATEables, principally for conversion from serial to parallel programming with OpenMP, I was able to get the serial program to run 10x faster (finding and correcting some serious performance issues); added to that was the scaling from parallel coding. In the end, on a 4-core system, I attained a ~40x performance boost for my efforts. Your Mileage May Vary (this was non-typical).

Jim Dempsey
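The conditionalized #define/USE pattern Jim describes can be sketched in C preprocessor terms (fpp follows the same expansion rules; every name below is a hypothetical stand-in, not code from the actual conversion):

```c
/* Sketch of conditionalizing the data layout: with USE_MODULES defined,
   the same source reads and writes the "module" storage; otherwise it
   uses the "COMMON" storage. All names here are hypothetical. */
#define USE_MODULES

static double a_module[10];   /* stands in for a module array  */
static double a_common[10];   /* stands in for a COMMON block  */

#ifdef USE_MODULES
#define A(i) a_module[(i)]
#else
#define A(i) a_common[(i)]
#endif

/* The same "source" compiles unchanged under either layout. */
static double fill_and_read(void) {
    for (int i = 0; i < 10; i++)
        A(i) = 2.0 * i;
    return A(4);   /* 8.0 under either definition of A */
}
```

Flipping the single USE_MODULES switch is what let the same files build both ways during the migration.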
jimdempseyatthecove
Honored Contributor III
Tim s,

In your post beginning with "the structure of this portion of the code is of the form...": if the frequency of IF(AP(I,J,K).GT.1.0E+19) being true is very low, then consider removing the test from the DO I loop and adding a second DO I loop following the current one, which (seldom) overstrikes the results generated in the first loop. This will improve vectorization.

Jim Dempsey
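Jim's suggestion, splitting the rarely-true IF out of the hot loop into a separate fix-up pass, can be sketched in C (the arrays and the body of the update are hypothetical stand-ins for Tim's Fortran loop):

```c
#define N 8
#define BIG 1.0e+19f   /* the sentinel the IF tests against */

/* Original shape: the (rarely true) branch sits inside the hot loop. */
static void update_fused(const float *ap, const float *b, float *e, float *f) {
    for (int i = 0; i < N; i++) {
        e[i] = 0.5f * b[i];      /* hypothetical stand-ins for the real updates */
        f[i] = b[i] - e[i];
        if (ap[i] > BIG) {       /* seldom taken */
            e[i] = 0.0f;
            f[i] = b[i];
        }
    }
}

/* Fissioned shape: a branch-free loop that can vectorize, then a cheap
   second pass that overstrikes the few affected elements. */
static void update_fissioned(const float *ap, const float *b, float *e, float *f) {
    for (int i = 0; i < N; i++) {
        e[i] = 0.5f * b[i];
        f[i] = b[i] - e[i];
    }
    for (int i = 0; i < N; i++) {
        if (ap[i] > BIG) {
            e[i] = 0.0f;
            f[i] = b[i];
        }
    }
}

/* Both shapes must produce identical results. */
static int results_match(void) {
    float ap[N] = {0, 0, 2.0e+19f, 0, 0, 0, 1.5e+19f, 0};
    float b[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
    float e1[N], f1[N], e2[N], f2[N];
    update_fused(ap, b, e1, f1);
    update_fissioned(ap, b, e2, f2);
    for (int i = 0; i < N; i++)
        if (e1[i] != e2[i] || f1[i] != f2[i]) return 0;
    return 1;
}
```

The transformation is safe because the second pass only overwrites (overstrikes) elements the first pass already produced; nothing in the first loop reads what the fix-up writes.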
tim_s_1
Beginner
OK, my last post was wrong. I am not sure what happened; I must have looked at the wrong file. Give me a minute to sort things out.
tim_s_1
Beginner
OK, so the number of assembly lines actually went up. I will try the new version, though, and see what the difference is.
tim_s_1
Beginner
Jim, I would like to understand the #define command better. Is using #define A1(i,j,k) A(i,j,k,1) somewhat like a find-and-replace done at compile time, with each instance of A1(i,j,k) replaced by A(i,j,k,1)? Or is it smarter than that: will it also replace A1(i,j-1,k) with A(i,j-1,k,1), or A1(3,j,k) with A(3,j,k,1), or A(l,m,n) with A1(l,m,n)? Also, is it case sensitive? Thanks
jimdempseyatthecove
Honored Contributor III
You are right (BTW, this is almost the same as the C preprocessor). There is a "gotcha", though: the substitution is "as-is" (or "as-was") for each token. What this means is that you may need to enclose the dummy arguments in ()'s to avoid parsing-priority issues.

Example:

#define myScale(a,b) a * b

Now consider myScale(x + y, z + q). This will expand to "x + y * z + q", which is not what you expect. Whereas with

#define myScale(a,b) (a) * (b)

the ()'s look unnecessary but are necessary: it expands to "(x + y) * (z + q)", which is what you want. Sometimes you may need additional ()'s:

#define myScale(a,b) ((a) * (b))

What is required will depend on your code; the extra parens, "((a) * (b))", will almost always work.

Note: use #include for preprocessor directive files. The FORTRAN INCLUDE may mislead you into thinking anything enclosed in the file has the scope of the subroutine/function; a #define's scope extends to the end of the compilation unit (or until #undef yourMacroName).

FORTRAN also has ASSOCIATE / END ASSOCIATE. I suggest you experiment with that before the #define; I used the #define hack prior to ASSOCIATE / END ASSOCIATE.

Jim Dempsey
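Jim's myScale example behaves the same under the C preprocessor, so the expansion pitfall can be checked directly (the wrapper functions are just scaffolding for the check, not part of his example):

```c
/* Jim's example names, checked under the C preprocessor (fpp expands
   the same way). With x=1, y=2, z=3, q=4:
     myScaleBad(x + y, z + q)  expands to  x + y * z + q     = 11
     myScaleGood(x + y, z + q) expands to  ((x + y) * (z + q)) = 21 */
#define myScaleBad(a, b)  a * b
#define myScaleGood(a, b) ((a) * (b))

static int bad_expansion(void)  { int x = 1, y = 2, z = 3, q = 4; return myScaleBad(x + y, z + q); }
static int good_expansion(void) { int x = 1, y = 2, z = 3, q = 4; return myScaleGood(x + y, z + q); }
```

The "bad" macro silently computes the wrong value because operator precedence applies to the expanded text, not to the macro's argument list.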
JVanB
Valued Contributor II
You're lucky that you don't know anything about assembly language, because otherwise looking at the way the unoptimized code generated by the compiler crawls along would cause debilitating physical pain. The explosion of code is due to the compiler having to walk through the descriptor for allocatable array AE(59,59,*), but all that should be absent in optimized code.

There are some who don't like global variables at all, whether in common or in modules. There is always the possibility that another invisible part of the program could touch your global variables. For example, if your program does I/O in a loop, you might have installed code to hook the I/O which could result in your global variables being modified on return, so they would have to be reloaded every iteration of the loop. This could cause the compiler to recompute the address of the next array element from scratch at every iteration rather than just adding a constant offset to the address of the last array element.

You could avoid this, if it were a problem for optimized code, by placing the compute-intensive part in a subroutine and passing in the global variables to be used as dummy arguments. Then the Fortran aliasing rules forbid changing or reallocating the global variables associated with dummy arguments, assuming the subroutine references only the dummy arguments, not the module variables.

Your code doesn't seem to do I/O or invoke external procedures in the inner loop, though, so you probably don't have to go through the contortions of the last paragraph. Try posting the optimized assembly language, because it gives a clearer picture of what your program may be doing to hinder optimization and is also easier to read.
jimdempseyatthecove
Honored Contributor III
[fortran]
ALLOCATE(A(NX,NY,NZ*6))
...
DO K = KSTART,KEND
  DO J = JSTART,JEND
    E(IST) = 0.0
    F(IST) = B(IST,J,K)
    ! ASSOCIATE partitions A(NX,NY,NZ*6) into 1D slices
    ASSOCIATE( A1 => A(:,J,K),      A2 => A(:,J,K+NZ),   A3 => A(:,J,K+NZ*2), &
             & A4 => A(:,J,K+NZ*3), A5 => A(:,J,K+NZ*4), A6 => A(:,J,K+NZ*5) )
      DO I = ISTART,IEND
        R = A1(I)*B(I,J+1,K)+A2(I)*B(I,J-1,K)+A3(I)*B(I,J,K+1)+A4(I)*B(I,J,K-1)+C(I,J,K)
        Y = T*(A1(I)+A2(I)+A3(I))
        R = R - Y*B(I,J,K)
        X = 1.0/(A6(I)-F-A6(I)*E(I-1))
        E(I) = A5(I)*X
        F(I) = (R+A6(I)*F(I-1))*X
      END DO
    END ASSOCIATE ! end of ASSOCIATEd A1...A6
    DO I = ISTART,IEND
      IF(AP(I,J,K).GT.1.0E+19) THEN
        E(I) = 0.0
        F(I) = B(I,J,K)
      ENDIF
    END DO
    DO II = ISTART,IEND
      I = NI+IEND-II
      B(I,J,K) = (E(I)*B(I+1,J,K))+F(I)
      C = C + MIN(ABS(E(I)*B(I+1,J,K)),ABS(F(I)))/MAX(1.0E-25,ABS(E(I)*B(I+1,J,K)),ABS(F(I)))
    END DO
  END DO
END DO
[/fortran]
Jim Dempsey
jimdempseyatthecove
Honored Contributor III
In looking at the ASSOCIATE method (without looking at disassembly), it looks as if the array descriptor accesses would be reduced, but the loop would not be able to keep the necessary data in registers:

6 for A1-A6
5 for the variations on B
1 for C
1 for E
1 for F
n for misc.

Fully optimizing this loop may be a bit more difficult.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
You might improve vectorization with the following:

[fortran]
ALLOCATE(A(NX,NY,NZ*6))
...
! local variables
REAL :: R(IEND)
DO K = KSTART,KEND
  DO J = JSTART,JEND
    E(IST) = 0.0
    F(IST) = B(IST,J,K)
    ! ASSOCIATE partitions A(NX,NY,NZ*6) into 1D slices
    ASSOCIATE( A1 => A(:,J,K),      A2 => A(:,J,K+NZ),   A3 => A(:,J,K+NZ*2), &
             & A4 => A(:,J,K+NZ*3), A5 => A(:,J,K+NZ*4), A6 => A(:,J,K+NZ*5) )
      DO I = ISTART,IEND
        RTEMP = A1(I)*B(I,J+1,K)+A2(I)*B(I,J-1,K)+A3(I)*B(I,J,K+1)+A4(I)*B(I,J,K-1)+C(I,J,K)
        Y = T*(A1(I)+A2(I)+A3(I))
        R(I) = RTEMP - Y*B(I,J,K)
      END DO
      DO I = ISTART,IEND
        IF(AP(I,J,K).GT.1.0E+19) THEN
          E(I) = 0.0
          F(I) = B(I,J,K)
        ELSE
          X = 1.0/(A6(I)-F-A6(I)*E(I-1))
          E(I) = A5(I)*X
          F(I) = (R(I)+A6(I)*F(I-1))*X
        ENDIF
      END DO
    END ASSOCIATE ! end of ASSOCIATEd A1...A6
    DO II = ISTART,IEND
      I = NI+IEND-II
      B(I,J,K) = (E(I)*B(I+1,J,K))+F(I)
      C = C + MIN(ABS(E(I)*B(I+1,J,K)),ABS(F(I)))/MAX(1.0E-25,ABS(E(I)*B(I+1,J,K)),ABS(F(I)))
    END DO
  END DO
END DO
[/fortran]

Jim Dempsey
tim_s_1
Beginner
So the optimized modified code with the preprocessor assignments was faster. Not fast enough yet, so I will have to look at other modules which may be slowing things down.

Can the define statement be used to adjust the shape of an array, i.e., treat the array Z(L,N,M) as a 1D array of size (L*N*M), or vice versa? (Assume that Z is in a module and I can define it as an allocatable array of either shape.) I am thinking it might not be too hard to go from 3D to 1D, but the other way around might not be so easy. Would #define Z(i,j,k) Z(i+((j)-1)*(L)+((k)-1)*(N)*(L)) work? If so, would it kill any auto-vectorization? Assume that the Z array is in a simple three-tier nested do loop.

I also just realized another issue in the code: the module variables are sometimes passed to other routines, something like call foo(A3,A4,r,t,j), where A3 and A4 are included in the module in the calling routine but not in the called routine. When using the #define, A3 and A4 now seem to be treated as uninitialized scalars passed to the called routine instead of the beginning memory location of an initialized array. I tried adding #define A3 A(:,:,:,3), but now it gives a warning that the A3 macro is redefined, and the intermediate preprocessed file has A(:,:,:,3)(I,J,K) in place of A3(I,J,K). Do I have to limit the code each define works on (using #undef A3), or would it be better to change the original code to use call foo(A3(:,:,:),A4(:,:,:),r,t,j)? I like the second option best, as it seems to be less intrusive.

Thanks again
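A 1-based, column-major linearization macro like the one Tim proposes can be sketched in C (the storage is renamed z1d here to sidestep macro self-reference, and the extents L, N, M are hypothetical values chosen for the sketch):

```c
/* Hypothetical extents standing in for Tim's L, N, M. */
#define L 4
#define N 3
#define M 2

static float z1d[L * N * M];   /* the 1-D storage behind the macro */

/* 1-based, column-major (Fortran-order) linearization:
   Z(i,j,k) lives at offset (i-1) + (j-1)*L + (k-1)*L*N.
   The storage is named z1d rather than Z so the macro does not have
   to rely on the preprocessor's no-re-expansion rule for its own name. */
#define Z(i, j, k) z1d[((i) - 1) + ((j) - 1) * L + ((k) - 1) * L * N]

static float store_and_fetch(void) {
    Z(2, 3, 1) = 7.5f;     /* offset (2-1) + (3-1)*4 + (1-1)*12 = 9 */
    return Z(2, 3, 1);
}
```

In cpp (and fpp, which follows the same rule), a macro name is not re-expanded inside its own expansion, so Tim's self-referential #define Z(i,j,k) Z(...) would not recurse; still, keeping the macro and the storage names distinct makes the expansion easier to reason about.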
jimdempseyatthecove
Honored Contributor III
Per my last post, loops containing the statements E(I-1) and F(I-1) will thwart vectorization. Breaking apart the loop into two loops, with the addition of the R(IEND) array, is designed to permit the first (more complex) loop to execute with vectorization, at the expense of writing to the new array R(I). Should vectorization be attained, the number of reads and arithmetic operations is reduced to 1/2, 1/4, or 1/8 (dependent on float/double and on vector width, SSE or AVX).

Your NX was stated as having a mean of 70. The number of writes to R is a subset of 70 (ISTART, IEND); you did not disclose the relationship of NX to ISTART, IEND. Possibly this is 68. 68 floats is 272 bytes, or if the variables are doubles, 544 bytes; in either case it will easily fit within L1 cache. In the second loop, the reads of R(I) are almost free.

BTW, I just noticed that your original code, and my copy/paste rework of it, has an error relating to the original statement: X = 1.0/(A6(I,J,K)-F-A6(I,J,K)*E(I-1)) (missing subscript on F?)

Note, you should be able to partition the inner I loop into two inner loops when using the #define route.

Jim Dempsey
tim_s_1
Beginner
Thanks for all of your help to this point. I now have the routine that was causing significant slowdown tackled; it is now running just as fast as before (which is all I was hoping for at this point of the project I am working on; additional optimization will have to come later).

I have what to me is an even more baffling problem: a separate routine that was untouched (at least directly) by the common-to-module conversion is running about twice as slow as before. The source file for the routine was unchanged, the resultant assembly is identical if I compile it alone, and it is essentially identical when I look at the VTune "assembly" (I am guessing that this is what Jim refers to as disassembly, and that it is converted from the binary back into assembly).

The routine itself is passed 60+ arguments, most of which are 3D arrays, but only the name is passed and the size is defined in the routine. Many of these arrays come from a module in a routine which passes them to a routine which passes them to another routine which then passes them to the routine in question, i.e.:

[fortran]
subroutine sub1
  use cblock ! this module and many others have many arrays, scalars, and logicals;
             ! assume for this example they contain A1(NX,NY,NZ), A2(NX,NY,NZ), ..., A60(NX,NY,NZ)
  ! some useful code
  call sub2(A1,A2,A3,...,A60)
  ! more useful code
end

subroutine sub2(A1,A2,A3,...,A60)
  ! some useful code
  call sub3(A1,A2,A3,...,A60)
  ! more useful code
end

subroutine sub3(A1,A2,A3,...,A60)
  ! lots of useful code
end
[/fortran]

In this example it is sub3 causing all of the trouble. As far as I can tell it should behave identically, because it sees all of its arguments in the exact same way, but it runs half as fast. There are about 10 lines that slow down, surrounded by a bunch of other lines which run at normal speed. Most of the other lines are conditionals or potentially bypassed by the conditionals, so the fact that I don't see slowdown throughout might be irrelevant. Any ideas on how this could be happening? Thanks
jimdempseyatthecove
Honored Contributor III
Are the names of the arguments in the top-level call the same as the names in the module cblock? Are the dummy names in sub2 and sub3 the same as in cblock? Can you show at least SUBROUTINE sub3(...) together with the dummy declarations? Can you show the loops with the 10 lines that slow down? Is there anything you learned from the first issue that can apply to this issue?

Jim Dempsey
Steven_L_Intel1
Employee
Passing 60 arguments in itself is time-consuming. If you can reduce that, perhaps by passing a derived type with the variables in it, or using module variables, that may help.
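Steve's derived-type suggestion, rendered in C terms, amounts to packaging the pointers and scalars in one struct so that a single address travels through the call chain (a minimal sketch; FieldPack, its members, and the three arrays are all hypothetical):

```c
/* A C analogue of passing a derived type: gather the ~60 arguments
   into one struct and pass its address, so each call level moves a
   single pointer instead of dozens of array arguments. */
typedef struct {
    float *a1, *a2, *a3;   /* stand-ins for A1..A60 */
    int nx, ny, nz;        /* the shared extents    */
} FieldPack;

static float sum_first_elements(const FieldPack *p) {
    return p->a1[0] + p->a2[0] + p->a3[0];
}

static float demo(void) {
    static float a1[1] = {1.0f}, a2[1] = {2.0f}, a3[1] = {3.0f};
    FieldPack p = {a1, a2, a3, 1, 1, 1};
    return sum_first_elements(&p);   /* one argument instead of many */
}
```

In Fortran the equivalent would be a derived type holding the scalars (and, as Jim notes later, possibly pointers to TARGET arrays); the call-site cost then no longer grows with the number of fields.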
tim_s_1
Beginner
I would not have chosen to write the code the way that it was; I think that whoever did wanted to avoid global variables.

Steve, passing 60 arguments may be slow, but why slower now than before?

Jim, nothing I learned before seems to apply now, as it seemed like using multiple variables where a common block was used added to register pressure. This routine should not show any difference, because the same assembly is used. As far as I know, the same variable names are used throughout, from the module down through the three routines.
jimdempseyatthecove
Honored Contributor III
I think the main problem is the loops in the last level; the 60 args per call level may be a secondary issue. This (60 args) can be addressed by using a derived type to package the scalars (Steve's suggestion) and either using the arrays directly from the module (should the arrays be the same on all calls) or packaging pointers to the arrays in a separate derived type (the module arrays will require the TARGET attribute).

The reasoning for the two packagings is that at some time in the near future you may wish to parallelize this code, and the further out in the call stack you perform the parallelization, generally the better the performance. The scalars would be used for the slice-and-dice (sectioning) of the arrays; these parameters would vary thread by thread.

An alternate way is to take Repete Offender's advice and pass the arrays via an F77-style call (pass a cell reference to a subroutine with an unknown interface). This works well when the dummy array can be expressed using lower rank. In the example you listed earlier, all three indices of (some of) the arrays were being manipulated, thus the dummy would have to construct a near duplicate of the original array descriptor (no advantage).

As for optimizing the last level, none of us on this forum can offer advice without seeing the problem code.

Jim Dempsey
tim_s_1
Beginner
I guess I am not looking for how to optimize the lowest level of the code, at least not yet. Right now I am really just trying to track down the cause of the slowdown, which seems to have happened without any modification to the code. I could post the whole code, but it is ugly, really ugly, and I am afraid that no one would focus on the real issue I am facing, only on how to rework the code.

The real issue I am having with this routine is that it takes two times as long to run, and I have not changed the routine at all; the only thing that has changed is how a routine several levels up gets access to the variables which get passed down to this routine. Once I figure out the cause of that, I may have time to move on to optimizing the routine some. I guess what I really need is someone who understands the internal workings of how arguments are passed to a routine to explain how that could be changing and why that may cause slowdown.

I just added a counter and verified that the routine is called the same number of times (57,000) in both cases. Thanks
jimdempseyatthecove
Honored Contributor III
Look in the IVF documentation: Start | All Programs | Intel Parallel Studio XE | Documentation | Visual Fortran ... | (click on the link to the HTML documentation). (I suggest to Intel that the VS Help have a direct link to this in addition to, or in lieu of, the link to the comingled documentation.) Then select the Index tab | declarations | for arrays. This will give you the specifics on the different ways arrays are passed as arguments:

[fortran]
SUBROUTINE SUB(N, C, D, Z)
  REAL, DIMENSION(N, 15) :: IARRY        ! An explicit-shape array
  REAL C(:), D(0:)                       ! Assumed-shape arrays
  REAL, POINTER :: B(:,:)                ! A deferred-shape array pointer
  REAL, ALLOCATABLE, DIMENSION(:) :: K   ! A deferred-shape allocatable array
  REAL :: Z(N,*)                         ! An assumed-size array
[/fortran]

The older documents had a table illustrating the relative performance impact of each of the calling conventions (fastest to slowest). I am unable to locate this table in the current IVF document. Generally the order (faster to slower) is: explicit-shape, assumed-size, assumed-shape, deferred-shape. Steve may be able to locate the table in the reference.

Jim Dempsey
Steven_L_Intel1
Employee
I don't remember such a table, and I don't think the issue can be reduced to a table or list. It depends a lot on what the called routine does. But, seriously, speculation without hard evidence, such as VTune Amplifier XE analysis, is a waste of time. Once you identify the sections of code that are dragging down the performance, only then can you fruitfully think about ways to improve it.