Intel® Fortran Compiler

How to efficiently vectorize constructs apart from loops in Fortran code?

parallelworker
Beginner
Hi,

I hope there is someone from the Intel ifort compiler team who can answer my somewhat odd questions ;-) .


Problem description:

I have a Fortran 90 code that should be vectorized in one specific place in one object file (hotspot analysis indicates this part of the code as the most promising candidate for any form of optimization).

So, I was advised to apply some vector optimization (SSE or AVX) within a limited scope of the source code of one file.

But, as the relevant code snippets do not contain any for, do or while constructs, I wonder how to accomplish this task
in an efficient and straightforward manner.

All the examples well suited for vectorization that I have found (in forum posts, the Intel compiler documentation and tutorials) only
show do, for or while constructs (in C/C++ or Fortran).

The auto-vectorization facility of the ifort compiler does perform some vectorization in two places (indicated
by the vec-report messages). There I receive the message "BLOCK WAS VECTORIZED".


In two other places the compiler announces that profitable vectorization of a loop construct would not be possible. Either way, I can force the compiler to also vectorize these presumed loops.
When these parts of the code are vectorized, the message "BLOCK WAS VECTORIZED" is printed again.



And here are my questions:

1) I wonder whether the ifort compiler is able to (efficiently) vectorize any other code constructs related to arrays or
contiguous data blocks, with static/variable(?) sizes of the data blocks, that are not covered by loop constructs? (I am not very familiar with this sort of stuff ;-) )

If so, what do these constructs look like? (For a concrete example of what I mean, see the small sketch after question 4.)


2) Am I right in supposing that a compiler is only able to vectorize loops (or comparable constructs) if the extent of all array indices is
known at compile time and if the indices do not depend on each other?


3) Is it possible to determine the extent of the blocks vectorized by the compiler without snooping around in the assembly code of the program?
Or, how can I instruct the compiler to tell me the extent of the vectorized blocks? It only tells me the
starting source code line, but unfortunately does not inform me about the end of a block or its extent (in lines) :-( .


4) Why does the compiler tell me that a loop is inefficient with respect to vectorization when doing no optimization,
whereas it reports a block being vectorized once I force the compiler to vectorize all vectorizable constructs in
the object file?
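
To make question 1 a bit more concrete, here is the sort of non-loop construct I have in mind (just a made-up sketch with invented names, not my actual code):

subroutine scale_fields(a, b, c, s)
  implicit none
  real, intent(inout) :: a(:)        ! whole arrays, no explicit DO loop anywhere
  real, intent(in)    :: b(:), c(:)
  real, intent(in)    :: s

  ! Fortran array syntax: an implicit elementwise loop
  a = a + s * b

  ! masked elementwise update
  where (c > 0.0) a = a / c
end subroutine scale_fields

Is this the kind of construct the vectorizer reports on with "BLOCK WAS VECTORIZED"?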


Thanks in advance for taking the time to read through all this stuff and my questions. ;-)


Greetings, Sebastian.
Steven_L_Intel1
Employee
I suggest you try the -guide option (in Composer XE 2011). This will give you explicit advice on what you can do to aid vectorization (and parallelization). There are directives you can add that give the compiler more information. You may also find that using -ipo aids vectorization.
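
For example (illustration only; the loop and names below are made up), the IVDEP directive tells the compiler to ignore assumed vector dependences it cannot disprove on its own:

subroutine gather_add(a, b, ix, n)
  implicit none
  integer, intent(in) :: n
  integer, intent(in) :: ix(n)
  real, intent(inout) :: a(*)
  real, intent(in)    :: b(n)
  integer :: i

  !DEC$ IVDEP   ! assert that the indirect accesses do not overlap
  do i = 1, n
    a(ix(i)) = a(ix(i)) + b(i)
  end do
end subroutine gather_add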
jimdempseyatthecove
Honored Contributor III
Sebastian,

Have you experimented with seeing what is reported with this 'hack':

DO
... ! your code here
EXIT
END DO


IOW create a run-once loop for the purpose of the compiler identifying a loop to optimize (and thus having a loop to report on).
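
In context it might look like this (just a sketch; x, y and n stand in for whatever your real data are):

subroutine hot_spot(x, y, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: x(n)
  real, intent(in)    :: y(n)

  do                    ! run-once loop, purely so the compiler has a loop to report on
    x = x + 2.0 * y     ! ... your straight-line code here ...
    exit                ! leave after the first (and only) pass
  end do
end subroutine hot_spot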

Jim Dempsey
parallelworker
Beginner
Hey Jim,

thanks for your reply. :-)

Unfortunately, the DO ... [own code] EXIT ... END DO hack did not help at all. I just tried it out!

The ifort compiler writes out exactly the same messages, and no further information with regard to vectorization is
given to me :-( .


I just forced vectorization of all the parts in question in the relevant source code file by passing -vec-threshold0 to ifort.
Not ideal, but somehow it worked?!

I did some comparative measurements: Unfortunately, no luck with regard to performance enhancements. :-(


Any other helpful "hack" ideas, Jim ??? ;-)


parallelworker
Beginner
Hey Steve,

I took some time to carefully test your tips with regard to the actual problem.

The option -ipo didn't help at all. Even worse, it caused the compiler to suppress the vec-report
information :-( .

Looking at the ifort man page I realized that this option is completely wrong for me, as I am not doing any interprocedural optimization between different files.
(The file in question is built independently of the other files of the same library - so there is no linking step while building that file.)

Unfortunately, the ifort compiler was not able to give any more guidance with respect to vectorization when
using the -guide option. Even with -guide4, ifort only tells me the following: "Number of advice-messages emitted for this compilation session: 0." That is not very nice. ;-)

I also tried out the directive !DEC$ VECTOR in front of the fake DO loop, but did not get any additional compiler output.

Finally, I also tried out the option -fasm-verbose. But even when passing this option, I did not gain anything :-(
(no assembly file was created).

The option -save-temps does not seem to be honoured by ifort: I am not able to locate any intermediate/temporary files, not even in
the /tmp folder??? Only the object file is created.


Thanks for your advice ;-) .


P.S.: Perhaps you know about some other hacks to increase the verbosity of the compiler???


jimdempseyatthecove
Honored Contributor III
Can you show a section of code that you believe should be vectorizable but is not?

Jim Dempsey
parallelworker
Beginner
Hey Jim,

disregarding the explicit content of my source code file, are you able to clarify the meaning of the
message "BLOCK WAS VECTORIZED" written by the vectorizer into the vec-report?

The diagnostics do not give any information about the extent
(in source code lines) of the so-called "block construct" - only the starting line is given in
the reports.

But looking at the source code in the neighborhood of the corresponding line, I am not able
to locate or recognize any construct resembling a block.

So, I do not believe any block construct exists within the interesting scope of the source code
file - but the compiler seems to recognize a block construct nevertheless.


As soon as I am able to reduce the size of the corresponding source code file and to
localize the extent of the problem encountered, I'll submit a shortened version of the source code
file to Intel Premier Support. Then one of your team will be able to dig into my problem. ;-)


Best greetings, Sebastian.
jimdempseyatthecove
Honored Contributor III
Sebastian,

Vectorization refers to Single Instruction Multiple Data (SIMD). This requires that your code manipulations do, or can be made to, collect and use multiple data items with a single instruction. This usually occurs with array operations

ArrayA = ArrayB op ArrayC op Scalar

and in loops traversing arrays where adjacent elements are processed per iteration and the compiler can determine that the iterations can be unrolled and pairs or quads fused (joined) without introducing errors.

Non-loop straight-line code may not lend itself to vectorization, except when the code manipulates arrays or does predominantly the same thing with multiple sets of data, and where the data is declared in a manner that is favorable to vectorization (IOW adjacent in memory).

real :: A,B,C,D
...
A = 0.0
B = 0.0
C = 0.0
D = 0.0
...
A = A + A * scale
B = B + B * scale
C = C + C * scale
D = D + D * scale
...
The above is a likely candidate for vectorization.
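
The same thing written with an array instead of four named scalars makes the pattern explicit to the compiler (sketch only):

real :: v(4)
real :: scale
v     = 0.0
scale = 1.5
v = v + v * scale   ! one array statement covering all four elements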

A "block" of code can be an entire subroutine, or DO/WHILE loops. And blocks can be nested and/or combined (when the compiler can do so without introducing errors).

Jim Dempsey
parallelworker
Beginner
Hey Jim,

thanks for the detailed explanations in your last reply. :-)

I just had a closer look at "A Guide to Vectorization with Intel C++ Compilers" (doc no. ???) and read some articles dealing with compiler vectorization techniques available on Wikipedia.

As far as I understand the information on compiler auto-vectorization techniques, the term vectorization is used to describe the unrolling, collapsing and merging of loop constructs (and so on) in order to obtain block-like constructs as you described them in your previous reply.
These blocks are then ready for direct application of AVX/SSE SIMD operations at compile time - as far as the compiler is able to do so with (a) repeatedly executed instruction(s) on (an) array(s).


Normally, the compiler isn't able to vectorize expressions containing function calls inside loops.

But there are a few (special) math functions like cos, sin and exp built into the Intel compiler libraries that come with a vectorized version.


I just compiled a source code file into an auto-vectorized and a non-vectorized (-no-vec) assembly file and had a look
at the calls to the exp function.
As expected, I saw a lot of reordering in the assembly code: above all, changes in the usage of vector registers in the executable part before the call to the exp function, but also data segment reordering was visible when comparing the two assembly files.
But I was surprised that in both cases (vectorized or non-vectorized block constructs) the call to the exp function is always simply named exp.

Does the compiler determine at runtime which version of exp should be used for the call to the exp function???


Thanks for your help and all the best, Sebastian.
jimdempseyatthecove
Honored Contributor III
>>Does the compiler determine at runtime which version of the exp should be implemented by the call to the exp function?

I cannot directly answer this. However, only one entry point is required.

The EXP function can take a scalar value as argument (in an XMM or YMM register) as well as a small vector. The computation within the function to produce e**x can be coded to use vector values as opposed to scalar, and the computation time will be the same. When a scalar is operated on, the computation in the unused cells of the small vector produces junk, which may overflow or be NaN if the scalar load of x does not zero out those additional cells. The compiler will (should) generate the appropriate load (movss or movsd), and therefore junk will not be produced in the unused cells of the small vector (when manipulating scalars). A similar thing can be done with other intrinsic functions.

Jim Dempsey
TimP
Honored Contributor III
I don't know whether any of us are answering the question, as maybe we don't know what is being asked.
Supposing that a compile option which enables auto-vectorization is set, it will depend on whether the compiler determines that exp is called from a vectorizable DO loop. If so, it will choose one of the svml exp functions (4 at a time for single precision SSE, 2 for double precision SSE2, ...), depending on which -fimf- options are set. If exp() is not called from a vectorized loop, an imf library version will be chosen, with higher accuracy guarantees than the vectorized ones.
The default compilation allows a run-time choice of the architecture implementation of exp() according to the platform detected. -fimf-arch-consistency=true overrules the run-time choice and picks one which will run on all SSE2 platforms.
The choice of math function implementation isn't controlled directly by the compiler architecture option you set, except possibly for 32-bit -mia32.
If you want the glibc x87 implementation of exp(), you may be able to get it by options which prevent linking the Intel math libraries (or by editing out the Intel functions from the libraries you link against).
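
For illustration (a made-up loop, not from the original code), a vectorizable loop such as:

subroutine exp_all(x, y, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: x(n)
  double precision, intent(out) :: y(n)
  integer :: i

  do i = 1, n
    y(i) = exp(x(i))   ! vectorized: mapped to a packed svml exp, 2 doubles at a time for SSE2
  end do
end subroutine exp_all

would get the svml treatment, whereas the same exp() call outside any vectorized loop goes to the scalar imf library version.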
jimdempseyatthecove
Honored Contributor III
Thanks Tim
Jim
parallelworker
Beginner
Thanks Jim, thanks Tim,

that's just the sort of internal stuff I wanted to know about:
The compiler decides at runtime which version of the (serial/vectorized) exp function to call. ;-)

Just another question, going a bit further:

There also exist vectorized versions of the (complex) exp function built into the MKL library, belonging to the VML function class of MKL.
Are the VML variants of the exp function in MKL substantially different from the vectorized version(s) of the exp function built into the imf library of the compiler???


Thanks in advance for a short reply, Sebastian.
TimP
Honored Contributor III
Run-time decisions between vector and non-vector exp() would be based on array section length and alignment. However, the compiler was changed several versions ago to use the same svml function for full 128- or 256-bit segments and for remainder segments (by adding dummy array elements which are then discarded). This was done because there had been excessive variations in numerics according to alignment. It does mean that a vectorized loop will be slower when the vector is length 1.
In some cases there is multiple versioning; if you look in the opt-report and see that there are both vectorized and non-vectorized versions of a given loop, yes, then there would be a run-time decision among them.
There may be VML functions of various accuracies which you could select by name. There isn't a guarantee of identical results even between the full accuracy VML and corresponding svml functions, nor between either of those and the scalar functions.
parallelworker
Beginner
Hi Tim, Hi Jim,

I did some experiments with the same Fortran combustion simulation code:

I just used the vectorized exp version from MKL, called vmdexp, instead of making several
calls to the serial exp function from libc.

I looked at the assembly code and realized that the compiler, unlike before, did use the AVX extensions in the
assembly code by populating ymm registers. :-)

Unfortunately, by using ymm regs on a machine with a CPU with AVX extensions, my Fortran program slows down by a
factor of three in the measured function. :-(

Do you think there is a serious vector alignment problem (the vector was written by hand with 138 elements
and I did not use any do loop to create the vector), or could there be some other serious design flaw concerning
this specific vector (something like copy-in/copy-out issues)???
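
For reference, the call pattern is roughly like this (simplified, not my exact code):

double precision :: arg(138), res(138)
! arg is filled element by element by hand, no DO loop involved
call vdexp(138, arg, res)   ! vmdexp additionally takes an accuracy-mode argument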

Thanks for your suggestions, Sebastian.