Intel compiler issue for a long code

Roberto_Pinto_Souto · ‎12-21-2017

Hello all,

This post concerns a problem compiling the routine that performs
the Jacobian calculation (chem_spack_jacdchemdc.f90) in a Numerical Weather Prediction model named BRAMS (http://brams.cptec.inpe.br/).

BRAMS is based on the Regional Atmospheric Modeling System (RAMS) originally
developed at CSU/USA. BRAMS software is under a free license (CC-GPL).

This routine is one of the hotspots of the chemistry module in BRAMS, and we are
trying to speed it up. We decoupled the routine from BRAMS so that we can
work on it without running the forecast model; the standalone version
is called 'chem_spack_jacdchemdc_offline.f90'.

There are two versions of 'chem_spack_jacdchemdc_offline.f90': main and function.
In the main version, the large loop is in the main program itself.
In the function version, the loop is in a function that is called
by the main program, as is done in BRAMS.

The two versions were compiled with Intel (2016 and 2017), gcc 5.3
and pgi 16.5. The times obtained are in the attached worksheet.

Only with Intel (2016 and 2017) is the -O3 executable of the function
version not optimized as well as that of the main version.

The source codes are available at
http://www.lncc.br/~rpsouto/brams/chem_spack_jacdchemdc_offline.tar.
This case uses the RELACS_TUV chemistry scheme, which contains 47 species.

We suspect that this may be related to the size of the main loop
in this routine. We found Intel's article about this issue:
https://software.intel.com/en-us/ARTICLES/INTERNAL-THRESHOLD-WAS-EXCEEDED

For example, the attached code 'chem_spack_jacdchemdc.f90', which
calculates the Jacobian for 72 chemical species (RACM_TUV scheme)
and has a loop of more than 2000 lines, produces the following
message when compiled:
$ ifort -O3 -c chem_spack_jacdchemdc.f90
Space exceeded in Data Dependence Test in jacdchemdc_
Subdivide routine into smaller ones to avoid optimization loss

Although this message does not appear with RELACS_TUV (loop of about
1000 lines), the same kind of internal limit may be part of the explanation.
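
One way to read the compiler's advice to "subdivide routine into smaller ones" is sketched below. This is only an illustration under assumptions: the routine names (jac_part1, jac_part2, jacdchemdc_split), the array sizes, and the double precision kind are hypothetical and are not taken from the BRAMS sources. Each piece holds a small part of the original loop body, and a thin driver calls the pieces in order.

```fortran
module jac_parts_sketch
  implicit none
contains

  ! Hypothetical piece 1: a small group of the JacC assignments.
  subroutine jac_part1(ijkbeg, ijkend, nr, nspec, dw, JacC)
    integer, intent(in) :: ijkbeg, ijkend, nr, nspec
    double precision, intent(in)    :: dw(ijkbeg:ijkend, nr, nspec)
    double precision, intent(inout) :: JacC(ijkbeg:ijkend, nspec, nspec)
    integer :: ijk
    do ijk = ijkbeg, ijkend
       JacC(ijk, 3, 4) = + dw(ijk, 1, 4) + dw(ijk, 36, 4) + dw(ijk, 52, 4)
    end do
  end subroutine jac_part1

  ! Hypothetical piece 2: the next group of assignments.
  subroutine jac_part2(ijkbeg, ijkend, nr, nspec, dw, JacC)
    integer, intent(in) :: ijkbeg, ijkend, nr, nspec
    double precision, intent(in)    :: dw(ijkbeg:ijkend, nr, nspec)
    double precision, intent(inout) :: JacC(ijkbeg:ijkend, nspec, nspec)
    integer :: ijk
    do ijk = ijkbeg, ijkend
       JacC(ijk, 13, 4) = + dw(ijk, 1, 4) - dw(ijk, 36, 4) - dw(ijk, 37, 4)
    end do
  end subroutine jac_part2

  ! Driver with the same role as the original large routine.
  subroutine jacdchemdc_split(ijkbeg, ijkend, nr, nspec, dw, JacC)
    integer, intent(in) :: ijkbeg, ijkend, nr, nspec
    double precision, intent(in)    :: dw(ijkbeg:ijkend, nr, nspec)
    double precision, intent(inout) :: JacC(ijkbeg:ijkend, nspec, nspec)
    call jac_part1(ijkbeg, ijkend, nr, nspec, dw, JacC)
    call jac_part2(ijkbeg, ijkend, nr, nspec, dw, JacC)
  end subroutine jacdchemdc_split

end module jac_parts_sketch

program subdivide_sketch
  use jac_parts_sketch
  implicit none
  ! Made-up sizes, chosen only so that the indices used above are in bounds.
  integer, parameter :: ib = 1, ie = 8, nr = 60, nspec = 20
  double precision :: dw(ib:ie, nr, nspec), JacC(ib:ie, nspec, nspec)

  call random_number(dw)
  JacC = 0.0d0
  call jacdchemdc_split(ib, ie, nr, nspec, dw, JacC)
  print *, "JacC(ib,3,4) =", JacC(ib, 3, 4)
end program subdivide_sketch
```

Whether splitting at this granularity avoids the message for the 72-species routine is not something this sketch can show; it would have to be checked with ifort on the real code.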

Thanks in advance,

Roberto Pinto Souto
HPC analyst at National Laboratory for Scientific Computing (LNCC/Brazil)

jimdempseyatthecove · ‎12-21-2017

Looking at your attached .f90: as written, compiler optimizations (-O3) would be completely ineffective. What I suggest you do is insert

END DO
DO ijk=ijkbeg,ijkend

in front of the second JacC(ijk,... = ... line

Then copy it into the clipboard,

Then advance and paste in front of the subsequent JacC(ijk,... = ... lines

IOW each JacC(ijk,... = ... line of your former code is inside its own loop.

What this will do for your code is allow each loop to be vectorized (when the iteration count is .gt. 1).

While this may seem like a lot of work, it should be relatively easy to automate the edits using awk or other macro editing tool.
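
A minimal sketch of the edit described above, using two JacC assignments from this thread. The array sizes and the double precision kind are assumptions made for illustration, not values from BRAMS. The program computes the same entries once in a single combined loop and once with one loop per assignment, then checks that the results agree.

```fortran
program split_loops_sketch
  implicit none
  ! Hypothetical sizes, chosen only so the indices below are in bounds.
  integer, parameter :: ijkbeg = 1, ijkend = 8, nr = 60, nspec = 20
  double precision :: dw(ijkbeg:ijkend, nr, nspec)
  double precision :: JacC(ijkbeg:ijkend, nspec, nspec)
  double precision :: ref(ijkbeg:ijkend, nspec, nspec)
  integer :: ijk

  call random_number(dw)
  JacC = 0.0d0
  ref  = 0.0d0

  ! Original shape: one loop that contains every assignment.
  do ijk = ijkbeg, ijkend
     ref(ijk,  3, 4) = + dw(ijk, 1, 4) + dw(ijk, 36, 4) + dw(ijk, 52, 4)
     ref(ijk, 13, 4) = + dw(ijk, 1, 4) - dw(ijk, 36, 4) - dw(ijk, 37, 4)
  end do

  ! Split shape: END DO / DO inserted in front of each later assignment,
  ! so every assignment sits in its own short loop.
  do ijk = ijkbeg, ijkend
     JacC(ijk,  3, 4) = + dw(ijk, 1, 4) + dw(ijk, 36, 4) + dw(ijk, 52, 4)
  end do
  do ijk = ijkbeg, ijkend
     JacC(ijk, 13, 4) = + dw(ijk, 1, 4) - dw(ijk, 36, 4) - dw(ijk, 37, 4)
  end do

  ! The split is only valid because no assignment reads a JacC value
  ! written by another one; here both forms must give identical results.
  if (any(JacC /= ref)) error stop "split loops changed the result"
  print *, "split loops match the single loop"
end program split_loops_sketch
```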

Jim Dempsey

Roberto_Pinto_Souto · ‎12-22-2017

Dear Jim Dempsey.

Thank you for your answer and suggestions.
However, we would also like to understand two things:
i) Why is ifort able to optimize the loop when it is inside the main program, but not when the loop is in a function called by the main program?
ii) Why is it that, in both cases (loop inside the main program, and in a function called by it), the gcc and pgi compilers are able to optimize this same loop?

Thanks.

Best regards,

Roberto Pinto Souto

jimdempseyatthecove · ‎12-22-2017

>>the gcc and pgi compilers are able to optimize this same loop?

Have you looked at the "optimized" code? It may be that those compilers gave up and reported the code as optimized (because they couldn't do anything with it).

This is not to say that those compilers couldn't effectively do what is outlined above.

An alternate method that is easier to do using find and replace all:

! *** remove      DO ijk=ijkbeg,ijkend
      JacC(ijkbeg:ijkend,  3,  4) =  + dw(ijkbeg:ijkend,  1,  4) &
                           + dw(ijkbeg:ijkend, 36,  4) &
                           + dw(ijkbeg:ijkend, 52,  4)
      JacC(ijkbeg:ijkend,  4,  4) =  - dw(ijkbeg:ijkend,  1,  4) &
                           - dw(ijkbeg:ijkend, 36,  4) &
                           - dw(ijkbeg:ijkend, 37,  4) &
                           - dw(ijkbeg:ijkend, 39,  4) &
                           - dw(ijkbeg:ijkend, 42,  4) &
                           - dw(ijkbeg:ijkend, 49,  4) &
                           - dw(ijkbeg:ijkend, 53,  4) &
                           - dw(ijkbeg:ijkend,116,  4) &
                           - dw(ijkbeg:ijkend,118,  4) &
                           - dw(ijkbeg:ijkend,121,  4) &
                           - dw(ijkbeg:ijkend,124,  4) &
                           - dw(ijkbeg:ijkend,127,  4) &
                           - dw(ijkbeg:ijkend,129,  4)
      JacC(ijkbeg:ijkend, 13,  4) =  + dw(ijkbeg:ijkend,  1,  4) &
                           - dw(ijkbeg:ijkend, 36,  4) &
                           - dw(ijkbeg:ijkend, 37,  4)
...

(and remove the end do)
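
A small self-checking sketch of the array-section form, assuming made-up array sizes and double precision data (neither is taken from the BRAMS sources). It computes two of the JacC entries with an explicit DO loop and with whole-section assignments, and stops with an error if the two disagree.

```fortran
program array_sections_sketch
  implicit none
  ! Hypothetical sizes, chosen only so the indices below are in bounds.
  integer, parameter :: ijkbeg = 1, ijkend = 8, nr = 60, nspec = 20
  double precision :: dw(ijkbeg:ijkend, nr, nspec)
  double precision :: JacC(ijkbeg:ijkend, nspec, nspec)
  double precision :: ref(ijkbeg:ijkend, nspec, nspec)
  integer :: ijk

  call random_number(dw)
  JacC = 0.0d0
  ref  = 0.0d0

  ! Reference: the original explicit loop over ijk.
  do ijk = ijkbeg, ijkend
     ref(ijk,  3, 4) = + dw(ijk, 1, 4) + dw(ijk, 36, 4) + dw(ijk, 52, 4)
     ref(ijk, 13, 4) = + dw(ijk, 1, 4) - dw(ijk, 36, 4) - dw(ijk, 37, 4)
  end do

  ! Array-section form: no DO / END DO, every ijk replaced by ijkbeg:ijkend.
  JacC(ijkbeg:ijkend,  3, 4) = + dw(ijkbeg:ijkend,  1, 4) &
                               + dw(ijkbeg:ijkend, 36, 4) &
                               + dw(ijkbeg:ijkend, 52, 4)
  JacC(ijkbeg:ijkend, 13, 4) = + dw(ijkbeg:ijkend,  1, 4) &
                               - dw(ijkbeg:ijkend, 36, 4) &
                               - dw(ijkbeg:ijkend, 37, 4)

  if (any(JacC /= ref)) error stop "array sections changed the result"
  print *, "array sections match the explicit loop"
end program array_sections_sketch
```

When doing the find and replace on the real file, every remaining bare ijk index needs to become ijkbeg:ijkend; a leftover scalar ijk would no longer be defined once the DO statement is removed.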

Jim Dempsey

Roberto_Pinto_Souto · ‎12-22-2017

Dear Jim Dempsey.

Thanks!
This simple change in the code finally makes it run faster.

Best regards,

Roberto Pinto Souto.