I have parallelized my code with MPI (OpenMP was not up to that job) and have started to look at the possibilities for
using vectorization. A short subroutine that calculates the base functions in a finite element program is attached to this mail.
What can be done?
In general, vectorization provides significant advantages if the performance is strongly limited by computation, rather than by data motion.
If the routine above is called in a loop (and consecutive iterations of the loop are not dependent on previous iterations), inlining the routine should allow the compiler to vectorize the computations across multiple sets of input data.
The compiler should be able to do this automatically in many circumstances. It is likely to work best if this subroutine is included in the same source file as the calling routine.
I am not particularly familiar with the Intel Fortran compiler, but a quick look at the manual shows support for the annotation:
!DIR$ ATTRIBUTES FORCEINLINE
Combined with the "-finline" compile-line option, this should increase the likelihood that the compiler will inline the routine.
You will need to pay careful attention to the optimization reports. A good start would be adding "-qopt-report=2 -qopt-report-phase=vec" to the compilation command. Sometimes more verbose output (e.g., "-qopt-report=5") is helpful.
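The inlining pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the routine `shape_quad`, its argument list, and the calling loop are hypothetical stand-ins, not the subroutine attached to the original post. The routine sits in the same source file as its caller and carries the FORCEINLINE attribute; compilers other than Intel's treat the directive line as a comment.

```fortran
module shape_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical short routine: 1-D quadratic base functions on [-1, 1].
  subroutine shape_quad(xi, n1, n2, n3)
    !DIR$ ATTRIBUTES FORCEINLINE :: shape_quad
    real(dp), intent(in)  :: xi
    real(dp), intent(out) :: n1, n2, n3
    n1 = 0.5_dp * xi * (xi - 1.0_dp)
    n2 = 1.0_dp - xi * xi
    n3 = 0.5_dp * xi * (xi + 1.0_dp)
  end subroutine shape_quad
end module shape_mod

program inline_demo
  use shape_mod
  implicit none
  integer, parameter :: n = 16
  real(dp) :: xi(n), n1(n), n2(n), n3(n)
  integer :: i

  ! Evenly spaced sample points on [-1, 1] (illustrative input data).
  do i = 1, n
    xi(i) = -1.0_dp + 2.0_dp * real(i - 1, dp) / real(n - 1, dp)
  end do

  ! Iterations are independent of each other, so once shape_quad is
  ! inlined the compiler is free to vectorize across i.
  do i = 1, n
    call shape_quad(xi(i), n1(i), n2(i), n3(i))
  end do

  ! Sanity check: the three base functions sum to 1 at every point,
  ! so the printed deviation should be at rounding level.
  print *, maxval(abs(n1 + n2 + n3 - 1.0_dp))
end program inline_demo
```

Building this with the report options mentioned above (for example "-qopt-report=2 -qopt-report-phase=vec") should show in the report whether the call was inlined and whether the loop over i was vectorized.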
It would help to see the compute-intensive code that calls this subroutine.
The subroutine as listed, even with inlining, might not be sufficient to permit the compiler to vectorize the code efficiently. If you can show the loop that calls it, we will be better able to help you.
Thanks Johan and Jim.
The compute-intensive subroutine evaluates an integral over space, I (variable z), obtained by summing:
where f contains calls to third-party software (which accounts for most of the computing time) and g is a product of simple functions such as gradient, surface area, etc., written by myself. The sum typically runs over 16 subdivisions.
Maybe this can generate some more comments. The actual code can be sent over in the next step if necessary.
Neither the argument list of function f nor that of function g matches the arguments of CalculateQuadraticBaseFunctons.
Do you have access to the source of the third-party functions? Are they part of "standard" libraries (BLAS, MKL, etc.)?
Before diving into the code details I would like to get a better understanding of inlining. I searched the web but did not find what I was looking for.
My first question of principle concerns the case where the calculation of a one-dimensional integral I has the basic structure
I=SUM[ f(x)*g(x) ]*dx
where f(x) is a third party subroutine supplied as a DLL and g(x) is a function written by myself. g(x) is itself a product of x-dependent simple subroutines.
Will inlining work in this case? If the answer is yes, are there any requirements on g(x) in order to make the vectorization work?
If you do not have the source to function f, then the compiler can neither inline it nor vectorize it. What you might experiment with is to first fill an array fPrime(x) with the values of f(x), then write a vectorizable version of g that takes an element of fPrime together with the corresponding element of x. This comes at the cost of writing the temporary array, but with the benefit of vectorizing the multiplication and, depending on the compiler's optimization capability, vectorizing the resulting sum without introducing an additional temporary.
Do not focus on inlining at the expense of vectorization; the one does not always imply the other. Also, it is often good practice to look one or two levels above the hot-spot function when optimizing, because the ability to vectorize is often determined by the code at those levels. Note that Inter-Procedural Optimization (IPO), either intra-file or inter-file, will generally inline functions without explicit !dir$ inline .... directives. And, depending on the code (e.g. the loop) containing the inlined functions, inlining can at times be counter-productive (i.e. when the code of the loop no longer fits in the L1 instruction cache).
There is a similar characteristic with parallelization.
Normally inlining is not very important, but in this case the function is very short and has very few computations. The only hope for vectorization is to inline the body into (what I hope is) an enclosing loop.
>>The only hope for vectorization is to inline the body into (what I hope is) an enclosing loop
At issue is the third-party function (in a DLL), which can be neither inlined nor vectorized. Therefore a possible solution would be to make one complete loop that collects the values from the DLL into an array, followed by a second loop that is vectorized. This does have the additional overhead of writing to a temporary array. Whether it is more efficient remains to be determined.
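The two-loop arrangement described above can be sketched roughly as follows. Everything named here is an assumption made for illustration: `f_external` merely stands in for the third-party DLL routine (whose real interface is not shown in this thread), `g` is an invented product of simple functions, and `fvals` plays the role of the temporary array called fPrime earlier. The midpoint rule and the interval [0, 1] are likewise arbitrary choices.

```fortran
module integrand_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Stand-in for the third-party routine f. In the real code this lives
  ! in a DLL and is opaque to the compiler; here it is ordinary Fortran
  ! only so that the example is self-contained.
  function f_external(x) result(fx)
    real(dp), intent(in) :: x
    real(dp) :: fx
    fx = exp(-x)
  end function f_external

  ! Hypothetical user-written g: a product of simple functions of x.
  elemental function g(x) result(gx)
    real(dp), intent(in) :: x
    real(dp) :: gx
    gx = (1.0_dp + x) * (2.0_dp - x)
  end function g
end module integrand_mod

program two_loop_integral
  use integrand_mod
  implicit none
  integer, parameter :: n = 16          ! 16 subdivisions, as in the thread
  real(dp) :: x(n), fvals(n), dx, total
  integer :: i

  dx = 1.0_dp / real(n, dp)
  do i = 1, n
    x(i) = (real(i, dp) - 0.5_dp) * dx  ! midpoints of the subdivisions
  end do

  ! Loop 1: scalar calls into the opaque routine, stored in a temporary.
  do i = 1, n
    fvals(i) = f_external(x(i))
  end do

  ! Loop 2: only compiler-visible code remains, so the multiply and the
  ! sum reduction are candidates for vectorization.
  total = 0.0_dp
  do i = 1, n
    total = total + fvals(i) * g(x(i))
  end do
  total = total * dx

  print *, 'I =', total
end program two_loop_integral
```

With only 16 subdivisions per integral the second loop is short, so any gain would have to be measured; the vectorization report for that loop is the first thing to check.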