Vectorization in "large" routines

David_DiLaura1 · ‎03-11-2008

I have a relatively large routine (3000 Fortran statements). I discovered (with VTune) that the compiler was NOT generating vectorized code. In many places I'm operating on 4-vectors of 4-byte floats. I would have thought that the compiler would be using the SIMD instructions that fetch and operate on 4 floats at a time. For example, with this source:

EmitterProjectedCornerPoint(1:4,2) = EmitterPartialProjectedVerts(1:4,2) + CalcPoint(1:4,2)*EmitterMu(1:4)

the compiler produces:

mov eax, -0x10

RENDERING+0x12ba7:movss xmm0, DWORD PTR [eax+027a58960h]

mulss xmm0, DWORD PTR [eax+027a588e0h]

addss xmm0, DWORD PTR [eax+027a58a64h]

movss DWORD PTR [eax+027a58990h], xmm0

add eax, 0x4h

jnz RENDERING+0x12ba7

Interestingly, it doesn't even unroll the loop.

If I take this typical line of code and put it in a very small routine, the compiler generates the expected SIMD instructions that fetch 4 floats at a time. No loop involved: one move, one mult, and another move. My compiler options are:

/nologo /Zi /O3 /QxP /Qparallel /assume:buffered_io /free /module:"Release" /object:"Release" /libs:static /threads /c

In the compiler's defense (as it were), it does report that it has run out of space. I get the following message:

Space exceeded in Data Dependence Test in _MAIN__

Subdivide routine into smaller ones to avoid optimization loss

And . . . if I use /QaxP the out of space message is NOT issued, but the compiler generates code that doesn't even use SIMD instructions; the old arithmetic unit instructions are used.

So (finally!) my questions:

1) What 'space' is it that the compiler is running out of? Is there something that I can do/set/indicate?

2) Evidently I don't really understand the difference between /QxP and /QaxP. Shouldn't /QaxP also properly vectorize this code? I'm not getting a message that the compiler has run out of space . . .

Please don't send me to Premier Support. I've been going round and round with them for two weeks (TWO WEEKS!) and have gotten nowhere. Has anyone else encountered difficulty getting code vectorized?

David

TimP · ‎03-11-2008

I'm not familiar with that specific limit problem.
If you have more than one subroutine in the file, but don't need interprocedural optimization, /Qipo- or /Qip- may help. If you do need ipo, there is /QipoN (make N object files rather than 1).
The big hammer, at your own risk, is to set -override_limits.
The compiler cuts off optimization for large files in order to avoid danger of getting hung or out of memory.
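
The compiler message's own advice (subdivide the routine) might look something like the sketch below. This is only an illustration: the module name vec4_kernels and the subroutine project_column are made up for this example, the test values are arbitrary, and the array shapes (4,2) are assumed from the (1:4,2) sections in David's statement. Whether the compiler actually vectorizes the small routine in the real application still has to be checked in the generated code, as David did with VTune.

```fortran
! Hypothetical helper module; names are illustrative, not from the original code.
module vec4_kernels
  implicit none
contains
  ! Computes res = base + pt*mu for one 4-element column.
  subroutine project_column(res, base, pt, mu)
    real(4), intent(out) :: res(4)
    real(4), intent(in)  :: base(4), pt(4), mu(4)
    res = base + pt*mu
  end subroutine project_column
end module vec4_kernels

program subdivide_demo
  use vec4_kernels
  implicit none
  ! Shapes assumed from the (1:4,2) array sections in the original statement.
  real(4) :: EmitterProjectedCornerPoint(4,2)
  real(4) :: EmitterPartialProjectedVerts(4,2), CalcPoint(4,2), EmitterMu(4)
  real(4) :: expected(4)

  ! Arbitrary test data.
  EmitterPartialProjectedVerts = 1.0
  CalcPoint = 2.0
  EmitterMu = (/ 0.5, 1.0, 1.5, 2.0 /)
  EmitterProjectedCornerPoint = 0.0

  ! The statement from the large routine, moved into the small helper.
  call project_column(EmitterProjectedCornerPoint(:,2), &
                      EmitterPartialProjectedVerts(:,2), &
                      CalcPoint(:,2), EmitterMu)

  ! Self-check: 1 + 2*mu for each element.
  expected = (/ 2.0, 3.0, 4.0, 5.0 /)
  if (any(abs(EmitterProjectedCornerPoint(:,2) - expected) > 1.0e-6)) stop 1
  print *, EmitterProjectedCornerPoint(:,2)
end program subdivide_demo
```

Passing whole columns such as (:,2) keeps each actual argument contiguous, so no temporary copies should be needed for the explicit-shape dummies. Note that if the compiler inlines the helper back into the large routine, the original limit could conceivably come into play again.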

jimdempseyatthecove · ‎03-11-2008

David,

While you are waiting for a fix, you might try experimenting by creating a user-defined type:

type Vec4
real(4) :: v(4)
end type Vec4
...
type(Vec4) :: EmitterProjectedCornerPoint(nCorners), EmitterPartialProjectedVerts(nVerts), CalcPoint(nPoints)
type(Vec4) :: EmitterMu

EmitterProjectedCornerPoint(2)%v = EmitterPartialProjectedVerts(2)%v + CalcPoint(2)%v*EmitterMu%v

You might find that the compiler has less to think about when the code is written this way.

Jim Dempsey
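
For anyone who wants to try the derived-type suggestion above in isolation, here is a minimal complete program built around it. The array sizes (nCorners, nVerts, nPoints all set to 2) and the test values are placeholders picked for this example, not taken from David's code, and nothing here guarantees the compiler will emit packed SSE instructions; that still needs to be confirmed in the disassembly.

```fortran
program vec4_type_demo
  implicit none

  ! Four single-precision floats wrapped in a derived type.
  type Vec4
    real(4) :: v(4)
  end type Vec4

  ! Placeholder sizes, chosen only so the example compiles and runs.
  integer, parameter :: nCorners = 2, nVerts = 2, nPoints = 2

  type(Vec4) :: EmitterProjectedCornerPoint(nCorners)
  type(Vec4) :: EmitterPartialProjectedVerts(nVerts), CalcPoint(nPoints)
  type(Vec4) :: EmitterMu
  real(4) :: expected(4)

  ! Arbitrary test data.
  EmitterPartialProjectedVerts(2)%v = 1.0
  CalcPoint(2)%v = 2.0
  EmitterMu%v = (/ 0.5, 1.0, 1.5, 2.0 /)

  ! Same arithmetic as the original array-section statement.
  EmitterProjectedCornerPoint(2)%v = EmitterPartialProjectedVerts(2)%v &
                                   + CalcPoint(2)%v*EmitterMu%v

  ! Self-check: 1 + 2*mu for each element.
  expected = (/ 2.0, 3.0, 4.0, 5.0 /)
  if (any(abs(EmitterProjectedCornerPoint(2)%v - expected) > 1.0e-6)) stop 1
  print *, EmitterProjectedCornerPoint(2)%v
end program vec4_type_demo
```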