Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

significant speed reduction when converting to dynamic code

tim_s_1
Beginner
5,069 Views

I have been updating some legacy code to make it dynamic, primarily by converting COMMON blocks to modules. The code uses many very large multi-dimensional arrays, mostly of rank 3 but some of higher rank as well. I have seen about a 50% increase in run time after making this change. I am guessing that some of the optimization is taking a hit now that the array sizes are not known at compile time. Is this type of speed reduction typical, and/or are there some flags I could set at compile time that would help?

Thanks.

P.S. I am currently compiling for 64-bit Windows but will also be compiling for 64-bit Linux. I am using Intel Visual Fortran Composer XE 2013.1.119.

0 Kudos
43 Replies
TimP
Honored Contributor III
3,156 Views
Switching to modules ought not to prevent the compiler from knowing array sizes, although I agree I also have such concerns. Alignment will definitely be affected; you may want to try the 13.0 Fortran options like /align:array32byte. In the case of intel64 compilers, one would think the default should be like /align:array16byte, but it may be worth checking (e.g. with cloc()). I would never have guessed that "making it dynamic" meant changing from COMMON to MODULE. Do you mean also changing arrays to allocatable? That might kill optimizations based on array length, unless you add loop count directives.
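A minimal sketch of the loop count directive mentioned above (the subroutine name and the count of 70 are illustrative assumptions, not from the thread):

```fortran
! Sketch: !DIR$ LOOP COUNT gives the optimizer a typical trip count for
! the loop that follows, which can restore some unrolling/vectorization
! decisions lost when the bound became a run-time variable.
      SUBROUTINE SCALE_PLANE(G, A, NX, NY, NZ, J, K, X)
      INTEGER NX, NY, NZ, J, K
      REAL G(NX), A(NX,NY,NZ), X
      INTEGER I
!DIR$ LOOP COUNT (70)
      DO I = 1, NX
         G(I) = A(I,J,K)*X
      END DO
      END SUBROUTINE SCALE_PLANE
```

The directive is a hint only; it does not change semantics if the actual trip count differs.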
0 Kudos
tim_s_1
Beginner
3,156 Views
Yes, I do also make the arrays allocatable and allocate them early on in the program.
0 Kudos
jimdempseyatthecove
Honored Contributor III
3,156 Views
Can you run VTune on both revisions of the code? This may give you insight into how to make small changes to improve performance. A 50% difference in performance is rather large; finding out what accounts for this difference should yield a solution.

Jim Dempsey
0 Kudos
tim_s_1
Beginner
3,156 Views
To be a bit more specific, the code had many common blocks with a mix of arrays and scalars, most of which were implicitly typed. A representative example would be something like

      PARAMETER(NX=170,NY=59,NZ=80,NPS=12,NSL=650)
      COMMON /CBLOCK/ARRAY1(NX,NY,NZ),ARRAY2(NX,NY,NZ),NSCAL1,ARRAY3(NPS,NSL),NSCAL2,NARRAY4(NSL,NPS,NX,NY,NZ),SCAL3

      DO K=1,NZ
        DO J=1,NY
          DO I=1,NX
            ARRAY1(I,J,K)=something
          END DO
        END DO
      END DO

(note that the limits of the DO loops were parameters). I made a module as follows:

      MODULE CBLOCK
        REAL SCAL3
        INTEGER NSCAL1,NSCAL2
        REAL, ALLOCATABLE :: ARRAY1(:,:,:),ARRAY2(:,:,:),ARRAY3(:,:)
        INTEGER, ALLOCATABLE :: NARRAY4(:,:,:,:,:)
      CONTAINS
        SUBROUTINE ALLOC_CBLOCK(NX,NY,NZ,NPS,NSL)
          ALLOCATE(ARRAY1(NX,NY,NZ),ARRAY2(NX,NY,NZ),ARRAY3(NPS,NSL),NARRAY4(NSL,NPS,NX,NY,NZ))
        END SUBROUTINE ALLOC_CBLOCK
      END MODULE CBLOCK

I then replaced the COMMON block statement with "USE CBLOCK" and the PARAMETER statement with

      COMMON /PARAMS/NX,NY,NZ,NPS,NSL

in all of the routines which contained them. In a NEW routine called near the start of the code, I have the following:

      USE CBLOCK
      COMMON /PARAMS/NX,NY,NZ,NPS,NSL
      {read in some file info to determine NX, NY, NZ, NPS, and NSL}
      CALL ALLOC_CBLOCK(NX,NY,NZ,NPS,NSL)

There were something like 120 different common blocks replaced in this manner, with a total of several dozen different parameters. These parameters have historically been set for each case and a new executable compiled using them. We want the code to be compiled once and used many times without re-compiling for each case. During any execution of the code the array sizes will be constant once allocated; however, between runs the sizes may differ. Was this the best way to replace these parameters with variables that can be determined at run time? If so, is a 50% increase in run time reasonable? I think that the nested DO loops that were limited by parameters and are now limited by variables may be where the optimization is hurting.
I guess that this may be where loop count directives come in. I have never used these, but it seems they give the compiler an idea of the range of a DO loop: a directive is placed before each loop whose limit is a variable so the compiler has an idea of how to optimize that loop. If that is actually the case, that's thousands of additional lines of code (which I would like to avoid if at all possible). Is it possible to define a directive for a given variable once per routine and let the compiler apply it to each loop? i.e., at the top of the routine I would give a directive stating that NX has a max of 300, a min of 20, and a mean of 70, with a similar directive for each old parameter, then have the compiler apply that as a loop count directive wherever NX is used as a limit. Would such a broad range even be helpful to the compiler if it were given as a loop count directive or otherwise?
0 Kudos
tim_s_1
Beginner
3,156 Views
In response to Jim: I have run VTune on both versions, but I am not really sure what to make of the difference between the two. (This is my first time using VTune, so I am still learning how to interpret its output. Also, I have never done any form of code optimization or profiling before, which is probably evident in my obvious lack of knowledge on the subject.) What would be your first suggestion to look at in the VTune output? I can see which routines are running slower (many are), but I am not sure how to determine the cause. Thanks
0 Kudos
jimdempseyatthecove
Honored Contributor III
3,156 Views
Have you disabled the runtime checks for array index out of bounds? If not, do so and compare runtimes again. If you have...

There is a runtime option to emit a diagnostic report as to when an array temporary was created. Often when rearranging code for module use, the new code may generate array temporaries. Identifying these sections of code, and reworking them to eliminate the array temporary, may improve performance.

The next issue can be that the use of allocatable arrays requires the code to generate and pass array descriptors (which are used to locate the array data) as opposed to the base address of the fixed array (when using COMMON with fixed dimensions). Also, the use of allocatables might result in the creation of array temporaries where none were required before. This may be correctable with minor rework of the code.

VTune: I suggest you configure a test program that runs 60 seconds or so on the old COMMON-structured code. Then configure the new code to use the same size and data source (or a copy thereof). Using two instances of Visual Studio, run and collect the data for each configuration (do not run them at the same time, and do not futz with Performance Monitor while running your data collections). Then look at the reports side-by-side (dual monitors can help). Usually the report is a table sorted in runtime order (routine by routine).

*** Note, run this test with at least /O1, and with IPO disabled. Also, remove runtime checks for array index out of bounds.
0 Kudos
tim_s_1
Beginner
3,156 Views
I included the /check:arg_temp_created option but did not see any output statements. Should they show up in the standard output or as some sort of diagnostic file? Also, Jim mentioned,
The next issue can be that the use of allocatable arrays requires the code to generate and pass array descriptors (which are used to locate the array data) as opposed to the base address of the fixed array (when using COMMON with fixed dimensions).
Is there a way to detect and correct this, or is it just one of the costs of using allocatable arrays?
0 Kudos
Steven_L_Intel1
Employee
3,156 Views
It's a cost of using deferred-shape arrays - not just allocatable. But the overhead for that itself is relatively low.
0 Kudos
jimdempseyatthecove
Honored Contributor III
3,156 Views
>>If so is a 50% increase in run time reasonable?

How are you timing the 50% increase in runtime? (e.g. 0.5 seconds versus 0.75 seconds, or 50 seconds versus 75 seconds) Is the timing inside the program, or wall clock from the command line? If your timing is for a relatively short duration, the COMMON block format loads in with the executable, whereas the allocatable loads later. Note, allocatables can (often) have their load deferred until first touch. IOW:

      ! using COMMON
      T1 = omp_get_wtime()
      DO K=1,NZ
        DO J=1,NY
          DO I=1,NX
            ARRAY1(I,J,K)=something
          END DO
        END DO
      END DO
      T2 = omp_get_wtime()
      ElapseTime = T2-T1

      ! using module
      ALLOCATE(ARRAY1(NX,NY,NZ))
      T1 = omp_get_wtime()
      DO K=1,NZ
        DO J=1,NY
          DO I=1,NX
            ARRAY1(I,J,K)=something
          END DO
        END DO
      END DO
      T2 = omp_get_wtime()
      ElapseTime = T2-T1

Comments: Although the above two are functionally the same, under the hood they are different. In the COMMON program, the virtual memory of ARRAY1 has already been "touched" by the program load. In the ALLOCATABLE program, the virtual memory of ARRAY1 has not been "touched" until the very first ARRAY1(I,J,K)=something. Virtual memory is allocated in page-size chunks (4KB or 4MB) at run time (not at allocation time), the first time the particular virtual memory is "touched". Should ARRAY1 be DEALLOCATEd and then subsequently reALLOCATEd, or something else be allocated at the same virtual address, then the virtual memory pages at those addresses need not be allocated again. This first-time allocation induces a latency into the program. Due to this latency, you should consider making multiple passes over the timed sections of your application so that you can discard the first-touch runs. Note, if your application has the characteristics of "use once, then end program", then consider looking at your O/S runtime library for a routine that can perform this virtual memory allocation/first touch in one step.
This may also be a linker option or a compile-time option.

Jim Dempsey
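Jim's first-touch point can be sketched as follows (a hedged example: ALLOC_AND_TOUCH is an assumed name, and zero-filling is just one way to touch every page up front):

```fortran
! Sketch: pay the first-touch page-fault latency once, at allocation
! time, instead of inside the first timed compute loop.
      SUBROUTINE ALLOC_AND_TOUCH(NX, NY, NZ)
      USE CBLOCK                ! module holding ARRAY1, as in the posts above
      INTEGER NX, NY, NZ
      ALLOCATE(ARRAY1(NX,NY,NZ))
      ARRAY1 = 0.0              ! touches every page of the new allocation
      END SUBROUTINE ALLOC_AND_TOUCH
```

The zero-fill itself costs one pass over the array, but it moves the page-fault latency out of the timed sections, which matters when comparing short benchmark runs.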
0 Kudos
tim_s_1
Beginner
3,156 Views
The program is one that typically runs for a week or more, so a 50% increase in run time is an enormous cost. It performs many large iterations (made up of a hundred or more smaller iterations, which are in turn made up of thousands of tiny iterations) that take on the order of an hour or more. I have compared the run time for running several of the largest iterations, and that is what has increased by roughly 50%: from roughly 1 hr 30 min with the original code to 2 hrs 15 min with the dynamic version. I have run the same case with very small data sets, which run much faster, and gotten similar ratios: from 12 min to 17 min.

In comparing the profiles between the old and new code, it seems as though the lines that have the greatest increase in CPU time are the ones that contain references to several of the multidimensional arrays, particularly when they are in nested loops. For example, something like

      R=A1(I,J,K)*B(I+1,J,K)+A2(I,J,K)*B(I-1,J,K)+A3(I,J,K)*B(I,J+1,K)+A4(I,J,K)*B(I,J-1,K)+A5(I,J,K)*B(I,J,K)+A6(I,J,K)

takes much longer. For reference, the A arrays are included by a "use" statement within this routine; the B array is a dummy argument passed to the routine from its caller, and is included in each of the calling routines by "use" statements. The B array does not seem to have as much difficulty as the A arrays. Any thoughts?
0 Kudos
jimdempseyatthecove
Honored Contributor III
3,156 Views
Without looking at the disassembly code (VTune can show disassembly), I will guess that what's at issue here is register pressure. In the module format the above statement could require 7 registers to hold the bases of A1-A6 and B (or the address of the descriptor for each), plus registers for R, etc... In the COMMON format, A1-A6 and B are at fixed addresses, the base of which is a known offset (not requiring a register).

You can reduce the register pressure by coalescing the An arrays into a single array:

      ALLOCATE(A(NX, NY, NZ*6))
      ...
      R=A(I,J,K)*B(I+1,J,K)+A(I,J,K+NZ)*B(I-1,J,K)+A(I,J,K+NZ*2)*B(I,J+1,K)+A(I,J,K+NZ*3)*B(I,J-1,K)+A(I,J,K+NZ*4)*B(I,J,K)+A(I,J,K+NZ*5)

Or you might consider using the Fortran preprocessor:

      #define A1(i,j,k) A(i,j,k)
      #define A2(i,j,k) A(i,j,k+NZ)
      ...
      #define A6(i,j,k) A(i,j,k+(NZ*5))

Then use the original statement:

      R=A1(I,J,K)*B(I+1,J,K)+A2(I,J,K)*B(I-1,J,K)+A3(I,J,K)*B(I,J+1,K)+A4(I,J,K)*B(I,J-1,K)+A5(I,J,K)*B(I,J,K)+A6(I,J,K)

Jim Dempsey
0 Kudos
tim_s_1
Beginner
3,156 Views
Would this register pressure still be an issue on other lines where a single array is referenced? I see a similar increase in time from a given line with something like

      G(I)=A(I,J,K)*X

Combining the arrays into a single array may be possible, but it's not simple. Would the preprocessor command work if placed within the module? Or would it have to be in each routine which uses it?
0 Kudos
jimdempseyatthecove
Honored Contributor III
3,156 Views
The preprocessor statement processing happens at compile time. Also, the scope of a #define is from the #define to the end of the compilation unit. Use #include 'YourMacros.inc' and place your macros in there. Place the #include towards the top of each of the source files in which you wish the macros to take effect:

      ! fooBar - blabla
      ! Copyright(c)...
      ! ...
      #include 'YourMacros.inc'
      SUBROUTINE FOOBAR(...
      ...
      END SUBROUTINE FOOBAR
      SUBROUTINE FEE(...
      (macros still in effect here)
      ...

>>I see a similar increase in time from a given line with something like G(I)=A(I,J,K)*X

That shouldn't be an issue.

      DO I=1, NX
        G(I)=A(I,J,K)*X
      END DO

If NX is larger than a few, the above should run the same for both configurations.

Have you verified that the runtime check for subscript out of bounds is disabled? Same for uninitialized variable checks (and any other runtime checks)?

Also, did your code change to modules also include a feature change (e.g. to OpenMP)? If so, and if you are misusing PRIVATE and/or REDUCTION, you may have unnecessary overhead.

Jim Dempsey
0 Kudos
tim_s_1
Beginner
3,156 Views
Have you verified that the runtime check for subscript out of bounds is disabled? Same for uninitialized variable checks (and any other runtime checks)
I believe so. I am compiling from the command line with the following:

      ifort /fpp /O3 /align:array32byte /Qdiag-disable:8290,8291 *.f -o program.exe

(the two disabled warnings are about print statement sizes being smaller than a suggested size). The bounds checking and uninitialized variable checks should both be off by default, from my understanding. I suppose I could disable them explicitly just in case.

In looking at the assembly, the line I discussed before:

      G(I)=A(I,J,K)*X

is blowing up. The assembly for this line in the original code was:

      movss xmm10, dword ptr [rdi+r14*1+0x1f30f0fc]
      mulss xmm10, xmm13
      movss dword ptr [rdi+rsi*1+0x12660a7c], xmm10

Now it is:

      mov r15, qword ptr [rip+0x34028f]
      mov qword ptr [rbp+0x108], r15
      mov rax, qword ptr [rip+0x3401fd]
      mov qword ptr [rbp+0x278], rax
      mov rdi, qword ptr [rbp+0x108]
      imul rdi, rsi
      mov rcx, qword ptr [rbp+0xb8]
      add rdi, qword ptr [rbp+0x148]
      mov qword ptr [rbp+0x1e0], rdi
      mov qword ptr [rbp+0x1d8], r8
      mov qword ptr [rbp+0x1d0], r9
      mov qword ptr [rbp+0x1c8], r10
      mov qword ptr [rbp+0x1c0], r11
      mov qword ptr [rbp+0x1b8], r12
      mov qword ptr [rbp+0x1b0], r13
      mov qword ptr [rbp+0x1a8], r14
      mov qword ptr [rbp+0x1a0], rbx
      mov qword ptr [rbp+0x198], r15
      mov qword ptr [rbp+0x190], rdx
      mov qword ptr [rbp+0x1f8], rcx
      mov qword ptr [rbp+0xd0], rsi
      mov qword ptr [rbp+0x188], rax
      mov qword ptr [rbp+0x200], r14
      mov rdx, r14
      mov r15, qword ptr [rbp+0x1e0]
      lea r15, ptr [r15+rax*4]
      mov qword ptr [rbp+0x208], r15
      mov rax, qword ptr [rbp+0x208]
      movss xmm10, dword ptr [rdi+rax*1]
      mulss xmm10, xmm13
      movss dword ptr [r15+r14*1+0x555405c], xmm10

I post this in hopes someone could see why the compiler would generate all of these additional operations, which would hopefully point to a more general solution. I myself am an engineer by trade and only program out of necessity.
I know nearly nothing about assembly, but it sure seems like a lot of extra work is going on in the new version of the code, and I am assuming that is directly leading to the increased run time. Thanks
0 Kudos
John_Campbell
New Contributor II
3,156 Views
Tim,

You stated: "I then replaced the COMMON block statement with "USE CBLOCK" and the PARAMETER statement with COMMON /PARAMS/NX,NY,NZ,NPS,NSL in all of the routines which contained them."

I would place the variables NX,NY,NZ,NPS,NSL in the module CBLOCK and not in a new COMMON, if possible.

A problem you might be having is that with these parameters now changeable, the ability to increase the size of the problem might be exceeding your available physical memory. I would place a memory usage report in SUBROUTINE ALLOC_CBLOCK to identify if this is a problem. SIZEOF can report the size of these arrays. Make sure it reports as INTEGER(8), then divide by 2.**30 for gigabytes. (Report each of the arrays.)

For arrays that have a large memory footprint, if you are extending beyond the available physical memory, you should check the array index order to address memory sequentially as much as possible/practical. Failing to do so can result in reduced performance.

Changes such as Jim has described can increase the complexity of the code, especially when coming back to change it later. If you adopt this, make sure it is well documented.

For "G(I)=A(I,J,K)*X", you could try using array syntax, such as:

      G(:)=A(:,J,K)*X

or

      G(i1:i2)=A(i1:i2,J,K)*X

or even a simple F77 wrapper like

      call vector_multiply ( G, A(1,j,k), X, N)

You might find the compiler can clean this up.

I have done similar restructures to transfer COMMON arrays to allocatable arrays in a MODULE and found the conversion worked well. The size looks to be the most likely problem.

Modern compilers are (claim to be) better at managing higher-rank arrays, so I would expect the problem is somewhere else. If the problem is not related to what I have suggested, I'd use VTune or a profiler and try to isolate where the reduced performance is occurring.

Hope this might help.

John
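The F77-style wrapper John mentions might look like this (a sketch; VECTOR_MULTIPLY spelled out here is only a guessed elaboration of his one-line call):

```fortran
! Inside the wrapper, G and A are explicit-shape dummies, so the inner
! loop indexes from a plain base address with no array descriptor.
      SUBROUTINE VECTOR_MULTIPLY(G, A, X, N)
      INTEGER N
      REAL G(N), A(N), X
      INTEGER I
      DO I = 1, N
         G(I) = A(I)*X
      END DO
      END SUBROUTINE VECTOR_MULTIPLY
```

The caller passes the first element of the desired column, so the wrapper sees contiguous storage: CALL VECTOR_MULTIPLY(G, A(1,J,K), X, NX).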
0 Kudos
tim_s_1
Beginner
3,156 Views
John,
I would place the variables NX,NY,NZ,NPS,NSL in the module CBLOCK and not in a new COMMON, if possible.
I could put them there, and they are passed to the allocate routine that is within the module; however, I have hundreds of different common blocks which all share about 20 or so parameters, so if the parameters were in each of these modules they would be repeated many times in any routine that uses several of the modules. What is the benefit of putting them in the modules?
A problem you might be having is that with these paramaters now changeable, the ability to increase the size of the problem might be exceeding your available physical memory
Why would my physical memory usage now be any greater than it was before? As far as I can tell, it should now be less than or equal to what it was: before, the arrays only had to be equal to or greater than my needed space, and now they should match my needed space exactly.

Also, would these memory issues explain the increase in assembly code for a given line of Fortran? My same subroutine goes from 167 lines of assembly (with the COMMON block method) to 317 (with the module method) when no optimization has been performed, and from around 650 to 1500 when optimized using /O3. Both the optimized and un-optimized codes take around 50% longer to run when using allocatable arrays in modules vs. common blocks:

the un-optimized code went from 5357 sec to 7935 sec, a 48% increase
the optimized code went from 1613 sec to 2305 sec, a 43% increase

Could a memory issue still explain that? Thanks
0 Kudos
jimdempseyatthecove
Honored Contributor III
3,156 Views
In the disassembly list for "G(I)=A(I,J,K)*X", most of the "mov" instructions were for saving registers that will be used later on in the code sequence. These happen to get "billed" to the listed statement. Discounting the register-saving movs, you can still see that when required to access the array descriptor, a fair amount of code is generated.

The question to ask then becomes: how can I amortize the array descriptor work over several (many) statements? Old code tended to be written to reduce loop overhead:

      DO I=1,NX
        (tens or hundreds of statements)
      END DO

Due to the finite number of GP registers (16, ~14 usable), the body of the loop may exceed the ability to registerize all frequently referenced data (including information from within array descriptors). To correct for this, where possible, we code to reduce the number of registers required. Consider:

      DO I=1,NX
        (part of statements)
      END DO
      DO I=1,NX
        (other part of statements)
      END DO
      ...
      DO I=1,NX
        (remaining part of statements)
      END DO

The compiler can do some of this for you, but the compiler cannot do this when it is questionable. These loops may also be replaced by coding as John Campbell suggests:

      G(:)=A(:,J,K)*X   ! implied loop

Of course you will have to check the code to see if there are dependencies, and code accordingly.

Jim Dempsey
0 Kudos
tim_s_1
Beginner
3,156 Views
I think I can see a bit better what the problem may be related to. In the un-optimized code, the Fortran lines which refer only to local or common variables are translated to essentially identical assembly code. This is true even for an array which was passed to the routine and given dimensions in the routine by parameters, as in this case:

      SUBROUTINE FOO(ISTART,JSTART,KSTART,NI,NJ,NK,B)
      use CBLOCK
      PARAMETER( ... declarations of NX NY NZ and NBIG)
      DIMENSION B(NX,NY,NZ)
      COMMON /CBLOCK2/E(NBIG),F(NBIG)

Any lines such as

      F(I) = B(I,J,K)

or

      R = R-D*B(I,J,K)

come out in assembly essentially identical to how they did before the use of the modules, but a line like

      E(I) = AE(I,J,K)*X

(where AE came from the common block before but now comes from a module) goes from 27 lines of assembly to 57. The actual line of code (so you can match up spacing) is

      E(I) = AE(I,J,K)*DENOM

and COEF is the name of the common block/module. It went from:

      mov eax, DWORD PTR [40+rbp]       ;58.15
      movsxd rax, eax                   ;58.15
      imul rax, rax, 13924              ;58.23
      lea rdx, QWORD PTR [COEF]         ;58.23
      add rdx, 6182256                  ;58.23
      add rdx, rax                      ;58.15
      add rdx, -13924                   ;58.15
      mov eax, DWORD PTR [52+rbp]       ;58.15
      movsxd rax, eax                   ;58.23
      imul rax, rax, 236                ;58.23
      add rdx, rax                      ;58.15
      add rdx, -236                     ;58.15
      mov eax, DWORD PTR [68+rbp]       ;58.15
      movsxd rax, eax                   ;58.23
      imul rax, rax, 4                  ;58.23
      add rdx, rax                      ;58.15
      add rdx, -4                       ;58.15
      movss xmm0, DWORD PTR [rdx]       ;58.23
      movss xmm1, DWORD PTR [112+rbp]   ;58.23
      mulss xmm0, xmm1                  ;58.15
      mov eax, DWORD PTR [68+rbp]       ;58.32
      movsxd rax, eax                   ;58.32
      imul rax, rax, 4                  ;58.15
      lea rdx, QWORD PTR [COEFL]        ;58.15
      add rdx, rax                      ;58.15
      add rdx, -4                       ;58.15
      movss DWORD PTR [rdx], xmm0       ;58.15

to:

      lea rax, QWORD PTR [COEF_mp_AE]   ;58.23
      add rax, 56                       ;58.23
      mov edx, 48                       ;58.15
      add rax, rdx                      ;58.15
      mov edx, DWORD PTR [40+rbp]       ;58.15
      movsxd rdx, edx                   ;58.23
      imul rdx, QWORD PTR [rax]         ;58.23
      add rdx, QWORD PTR [COEF_mp_AE]   ;58.15
      lea rax, QWORD PTR [COEF_mp_AE]   ;58.23
      add rax, 56                       ;58.23
      mov ecx, 48                       ;58.15
      add rax, rcx                      ;58.15
      mov rax, QWORD PTR [rax]          ;58.23
      lea rcx, QWORD PTR [COEF_mp_AE]   ;58.23
      add rcx, 64                       ;58.23
      mov ebx, 48                       ;58.15
      add rcx, rbx                      ;58.15
      imul rax, QWORD PTR [rcx]         ;58.15
      sub rdx, rax                      ;58.15
      lea rax, QWORD PTR [COEF_mp_AE]   ;58.23
      add rax, 56                       ;58.23
      mov ecx, 24                       ;58.15
      add rax, rcx                      ;58.15
      mov ecx, DWORD PTR [52+rbp]       ;58.15
      movsxd rcx, ecx                   ;58.23
      imul rcx, QWORD PTR [rax]         ;58.23
      add rdx, rcx                      ;58.15
      lea rax, QWORD PTR [COEF_mp_AE]   ;58.23
      add rax, 56                       ;58.23
      mov ecx, 24                       ;58.15
      add rax, rcx                      ;58.15
      mov rax, QWORD PTR [rax]          ;58.23
      lea rcx, QWORD PTR [COEF_mp_AE]   ;58.23
      add rcx, 64                       ;58.23
      mov ebx, 24                       ;58.15
      add rcx, rbx                      ;58.15
      imul rax, QWORD PTR [rcx]         ;58.15
      sub rdx, rax                      ;58.15
      mov eax, DWORD PTR [68+rbp]       ;58.15
      movsxd rax, eax                   ;58.23
      imul rax, rax, 4                  ;58.23
      add rdx, rax                      ;58.15
      lea rax, QWORD PTR [COEF_mp_AE]   ;58.23
      add rax, 64                       ;58.23
      mov rax, QWORD PTR [rax]          ;58.23
      imul rax, rax, 4                  ;58.15
      sub rdx, rax                      ;58.15
      movss xmm0, DWORD PTR [rdx]       ;58.23
      movss xmm1, DWORD PTR [112+rbp]   ;58.23
      mulss xmm0, xmm1                  ;58.15
      mov eax, DWORD PTR [68+rbp]       ;58.32
      movsxd rax, eax                   ;58.32
      imul rax, rax, 4                  ;58.15
      lea rdx, QWORD PTR [COEFL]        ;58.15
      add rdx, rax                      ;58.15
      add rdx, -4                       ;58.15
      movss DWORD PTR [rdx], xmm0       ;58.15

Like I said, I don't know anything about assembly. These lines are from the file resulting from the command

      ifort /fpp /Od /S file.f

To Steve Lionel: Is this additional assembly just the added overhead for the deferred-shape arrays? It does not seem relatively low in this case. Thanks
0 Kudos
tim_s_1
Beginner
3,047 Views
The structure of this portion of the code is of the form:

      01 DO K = KSTART,KEND
      02   DO J = JSTART,JEND
      03     E(IST) = 0.0
      04     F(IST) = B(IST,J,K)
      05     DO I = ISTART,IEND
      06       IF(AP(I,J,K).GT.1.0E+19) THEN
      07         E(I) = 0.0
      08         F(I) = B(I,J,K)
      09       ELSE
      10         R = A1(I,J,K)*B(I,J+1,K)+A2(I,J,K)*B(I,J-1,K)+A3(I,J,K)*B(I,J,K+1)+A4(I,J,K)*B(I,J,K-1)+C(I,J,K)
      11         Y = T*(A1(I,J,K)+A2(I,J,K)+A3(I,J,K))
      12         R = R - Y*B(I,J,K)
      13         X = 1.0/(A6(I,J,K)-F-A6(I,J,K)*E(I-1))
      14         E(I) = A5(I,J,K)*X
      15         F(I) = (R+A6(I,J,K)*F(I-1))*X
      16       ENDIF
      17     END DO
      18     DO II = ISTART,IEND
      19       I = NI+IEND-II
      20       B(I,J,K) = (E(I)*B(I+1,J,K))+F(I)
      21       C = C + MIN(ABS(E(I)*B(I+1,J,K)),ABS(F(I)))/MAX(1.0E-25,ABS(E(I)*B(I+1,J,K)),ABS(F(I)))
      22     END DO
      23   END DO
      24 END DO

Lines like 3, 4, 7, 8, 12, 19, 20, and 21 have identical assembly (at least when not optimized), while lines like 6, 10, 11, 13, 14, and 15 are much larger. The variables in the first group of lines are all local, common, or passed as an argument with the size declared locally; the second group contain variables from the module.

Jim, you mentioned before that combining the A1-A5 arrays may help. It seems as though the original code does not have the same issue, as it treats all of the common block as indexed from a single memory location, while the new code needs to index from many different locations, one per module variable/array. I am now inclined to pursue this option for this module (although others may not work nearly as well). I would prefer not to change the main body of the code, as then I am much less likely to inadvertently change the results. (I am not working with a small piece of code but with hundreds of routines, many of which are quite complex, so any code changes will have to take place in a very general manner, like in each of the modules but not the code body.)
In my limited experience it seems as though some changes will need to be made within the routines themselves and not just the module, or the program will still think of each of the individual A1-A6 arrays as individual blocks of memory and not one large contiguous one. If I change the module such that the only variable in it is, let's say, ARRAY(NX,NY,NZ,6), then in the routine put a preprocessor statement like

      #define A1(i,j,k) A(i,j,k,1:1)
      #define A2(i,j,k) A(i,j,k,2:2)
      #define A3(i,j,k) A(i,j,k,3:3)
      etc.

should that work? Thanks again
0 Kudos
Reply