I'm trying to see if it is possible to speed up a FORTRAN (F95) program. By sprinkling the original code with a great many "call cpu_time()" calls and tracking the elapsed time, I have narrowed it down to a small handful of lines that use up the majority of the run time. One of them is a surprisingly simple single line that just re-initializes a 3-dimensional array to all double-precision real zeroes (it is part of a huge matrix inversion solver setup). The array bounds are 150x150x160, so the total number of elements is 3,600,000. The line is:
if(i_ray .eq. 2) then
    arr_t1u(:, :, :) = 0.0d0
end if
This array is a local variable used within a "contain"ed subroutine that is inside of a subroutine that is called from main. It is declared with
real(re_type), save :: arr_t1u(max_d, max_d, max_rays)
so the memory is not being freed after each loop and/or subroutine call (I've verified that the memory addresses remain the same throughout the run).
My initial hope was that invoking parallel code (OpenMP) would help, but it actually made it run slightly slower. (I suspect a 3.6M element array is not in fact large enough for the parallelism to "pay off".) My timing outputs show that zeroing this array only takes about 2 ms (0.00207 s) per call. However, due to the structure of the program, it is being called over 13000 times, and this is what is adding up to make for a long elapsed run time. I've tried a "forall" variation, as well as nested "do" loops (in different orders), and "arr_t1u = 0.0d0", and the code as above seems to be the fastest.
My question is: any suggestions on what to look into to possibly speed this up? Even if it could shave any time off, it would add up to a big improvement. I've read some things about compiling with -heap-arrays switch, but I'm not sure what that does. Also, would using allocatable array possibly make any difference?
Thanks in advance, any help/suggestions would be appreciated!
3.6 million 8-byte elements is 28.8 million (decimal) bytes. This is potentially cache-containable on some of the larger Intel server chips, but probably won't be when considering other arrays in use.
At an execution time of 2 milliseconds per call, the corresponding rate of stores is 14.4 GB/s (28.8 million bytes / 0.002 s). This is actually a very high rate for a single thread of execution. This loop should be large enough to benefit from parallelization, but if you are running on a multi-socket system you need to be sure that the threads are all "local" to the data.
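As a rough illustration of keeping threads "local" to the data, here is a minimal sketch, assuming an OpenMP-capable compiler and an operating system with a first-touch page placement policy (the Linux default). The program name and the choice of an allocatable array are assumptions for the example, not taken from the original code; re_type is assumed to be double precision. The idea is that a static schedule over the outermost index gives each thread the same contiguous slab on every call, so the pages a thread first touches stay on its own socket, provided the threads are also pinned (for example with the standard OMP_PROC_BIND and OMP_PLACES environment variables).

```fortran
program zero_parallel
  implicit none
  integer, parameter :: re_type = kind(1.0d0)        ! assumed: double precision
  integer, parameter :: max_d = 150, max_rays = 160  ! bounds quoted in the question
  real(re_type), allocatable :: arr_t1u(:, :, :)
  integer :: j, k

  ! Allocation alone does not touch the pages; the first write does.
  allocate(arr_t1u(max_d, max_d, max_rays))

  ! Static schedule over the slowest-varying index: each thread always
  ! zeroes the same contiguous slab of memory.
  !$omp parallel do schedule(static) private(j)
  do k = 1, max_rays
    do j = 1, max_d
      arr_t1u(:, j, k) = 0.0_re_type
    end do
  end do
  !$omp end parallel do

  ! Sanity check so the stores cannot be discarded as unused.
  if (any(arr_t1u /= 0.0_re_type)) error stop "array not zeroed"
  print *, "zeroed elements:", size(arr_t1u)
end program zero_parallel
```

Without the OpenMP compile flag the directives are plain comments and the loop runs serially, so the same source can be timed both ways. Whether this helps depends on the loops that later read and write the array using the same thread-to-slab mapping.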
If the "arr_t1u" array is too big to fit in cache, or if accesses to other arrays are expected to evict the "arr_t1u" array before it is used, then there may be a benefit in using streaming stores to bypass the cache. In Fortran the VECTOR NONTEMPORAL directive (written "!DIR$ VECTOR NONTEMPORAL" with the Intel compiler) should cause the compiler to generate streaming stores.
The relative performance of "ordinary" and "nontemporal" stores varies by processor generation, as well as by array size, in ways that are not easy to summarize. Fortunately adding one directive makes this easy to test.
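A minimal sketch of such a test, assuming the Intel Fortran compiler (other compilers will generally treat the "!DIR$" line as an ordinary comment, so the code still builds but without streaming stores). The program name and the timing scaffold are assumptions for the example; re_type is assumed to be double precision. Timing the same loop with and without the directive line gives the comparison.

```fortran
program zero_nontemporal
  implicit none
  integer, parameter :: re_type = kind(1.0d0)        ! assumed: double precision
  integer, parameter :: i8 = selected_int_kind(18)   ! wide integer for clock counts
  integer, parameter :: max_d = 150, max_rays = 160  ! bounds quoted in the question
  real(re_type), save :: arr_t1u(max_d, max_d, max_rays)
  integer(i8) :: t0, t1, rate
  integer :: i, j, k

  call system_clock(count_rate=rate)
  call system_clock(t0)

  do k = 1, max_rays
    do j = 1, max_d
      ! Intel-specific hint: request streaming (cache-bypassing) stores
      ! for the loop that follows. Delete this line to time ordinary stores.
      !DIR$ VECTOR NONTEMPORAL
      do i = 1, max_d
        arr_t1u(i, j, k) = 0.0_re_type
      end do
    end do
  end do

  call system_clock(t1)

  ! Sanity check so the stores cannot be discarded as unused.
  if (any(arr_t1u /= 0.0_re_type)) error stop "array not zeroed"
  print *, "elapsed seconds:", real(t1 - t0, re_type) / real(rate, re_type)
end program zero_nontemporal
```

A single pass is noisy, so in practice the measurement would be taken inside the real program, where the surrounding code determines whether the array is still in cache when it is next used.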