Optimization feature request (SIMD)

jim_dempsey · ‎10-28-2005

I posted this on the premier support section as a feature request. I thought this would be a good subject for the forum and will welcome a discussion on this subject.

When optimizing for SSE3 or other SIMD instruction sets there is a non-negligible amount of overhead in determining alignment issues. This overhead is significant in function/subroutine calls with short iteration counts. And because of uncertainties about alignment, narrow-aperture array functions, which may or may not be inlined, cannot take advantage of SIMD instructions. To overcome this and to increase the performance of your runtime code I have the following suggestions.

1) Add a "cDEC$ attributes interfacerequired :: SubOrFuncName" to be used in the source module that contains the code for the subroutine or function. This decorates the name so that it will only be found (at link time) if the program referencing the subroutine/function also contains "cDEC$ attributes interfacerequired :: SubOrFuncName" in its interface declaration for the subroutine/function. The purpose of this feature is to require an interface in order to access the subroutine/function.

2) Permit "cDEC$ attributes align: n :: var" to specify alignment requirements of arguments to the subroutine/function.

3) The subroutine/function is permitted to assert the alignment but does not contain code to work with misaligned arguments. i.e. you produce a runtime error.

cDEC$ attributes interfacerequired :: foo
subroutine foo(out,inA,inB, count)
cDEC$ attributes align: 16 :: out
cDEC$ attributes align: 16 :: inA
cDEC$ attributes align: 16 :: inB
integer :: count
real(8) :: out(count),inA(count),inB(count)
_asm { ! compiler inserts fast code to test for alignment
mov eax,dword ptr out
or eax,dword ptr inA
or eax,dword ptr inB
and eax,0Fh
jne ReportError
// fall through to execute section
}
do i=1,count
out(i) = inA(i) + inB(i) // will process two at a time
end do
end subroutine foo
_asm {
ReportError:
call ArgAllignmentError(ModuleName)
}

4) When compiling source code that uses the interface with alignment requirements, the compiler should assert the alignment, i.e. if the compiler knows the alignment does not meet the requirements then it issues an error at the calling statement rather than at runtime in the routine called.

5) If the compile-time assertion fails because the compiler knows the argument is misaligned, the failure is an error.

6) If the compile-time assertion fails because the compiler cannot determine the alignment, the failure is a warning (or maybe you can require a declaration of alignment in the source code).

7) When the function/subroutine is inlined and the alignment is known, the inlined function/subroutine's test for alignment can be removed.

Jim Dempsey

Intel_C_Intel · ‎10-28-2005

Dear Jim,

Hopefully you will be happy to hear that we are working on a Fortran equivalent of the Intel C/C++ __assume_aligned() hint. However, the compiler will not generate code that tests if the asserted property is actually true (since the whole purpose of such hints is to avoid any runtime overhead). As for any assertion, the compiler will simply optimize the code accordingly. If you give a wrong assertion, the code may break (cause a runtime exception in this case).

Aart Bik
http://www.aartbik.com/

jim_dempsey · ‎10-28-2005

assume_aligned (or whatever you work up for it) is fair. The user always has the option of:

#ifdef _DEBUG

if(mod(loc(arg1),16) .ne. 0) call ('Fool')

#endif

Or leave the test in the Release version at the programmer's choice.
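
For readers without the vendor loc() extension, a minimal sketch of such a check in standard Fortran 2003 might look like the following. The module name align_check and the function is_aligned are made up for illustration. The sketch assumes a C address fits in integer(c_intptr_t) and that the boundary is a power of two.

```fortran
module align_check
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t, c_double
  implicit none
contains
  ! True when the first element of x starts on a 'boundary'-byte boundary.
  ! 'boundary' is assumed to be a power of two (16 for SSE loads/stores).
  logical function is_aligned(x, n, boundary)
    integer, intent(in) :: n, boundary
    real(c_double), intent(in), target :: x(n)
    integer(c_intptr_t) :: addr
    addr = transfer(c_loc(x), addr)   ! reinterpret the address as an integer
    is_aligned = iand(addr, int(boundary - 1, c_intptr_t)) == 0
  end function is_aligned
end module align_check

program demo
  use, intrinsic :: iso_c_binding, only: c_double
  use align_check
  implicit none
  real(c_double) :: a(8)
  a = 0.0_c_double
  ! Consecutive real(8) elements are 8 bytes apart, so a(1) and a(2)
  ! cannot both start on a 16-byte boundary.
  print *, 'a(1) 16-byte aligned: ', is_aligned(a, 8, 16)
  print *, 'a(2) 16-byte aligned: ', is_aligned(a(2), 7, 16)
end program demo
```

Wrapping the call in #ifdef _DEBUG, as above, keeps the test out of Release builds. Which of the two lines prints T depends on where the compiler happens to place the array.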

The programmer can always look at the CallStack.

My suggestion was to catch the error _prior_ to debugging. As we both know, test cases sometimes pass debugging where end-user use of the application will fail. Having the compiler assert the alignment would verify the proper calling sequence at compile time. You currently perform type checking if the user uses the interface block. I suggested that this be extended to include the alignment attribute in addition to type, rank, etc.

I think the modifications I suggested could be integrated into ifort relatively easily. Everything is in place to do this type of test.

Jim Dempsey

Intel_C_Intel · ‎10-28-2005

Dear Jim,

The Intel compilers use rather advanced methods to propagate alignment information within and across subroutines (see Chapter 6 of the Software Vectorization Handbook, more information at http://www.intel.com/intelpress/sum_vmmx.htm). Unaligned or unknown alignments are still optimized using static and dynamic methods. The assertions are provided to avoid the overhead of the dynamic methods for situations where the programmer knows more about the alignment than the compiler.

If I understand you correctly, you want to make alignment part of the type system, i.e. something that is statically enforced (so calls that pass unaligned arrays to aligned arguments are rejected). Although interesting, such a solution seems rather intrusive for the programmer (I already encounter resistance if I ask customers to add one pragma to their code). Furthermore, programs that are provably type-correct (using your definition) are exactly the programs for which the compiler already propagates exact alignment information using the existing methodology and a sufficiently large compilation scope (and, hence, need no further annotation). The only added value I see with your proposal is when modules are compiled in strict isolation, but I am not sure if this warrants such an elaborate extension (requiring cooperation between the language specs, compiler, linker and acceptance by sufficient programmers). But the floor is open for input from others.

Aart Bik
http://www.aartbik.com/

Message Edited by abik on 10-28-2005 01:51 PM

TimP · ‎10-28-2005

We have had default alignment of local arrays since ifort 8.0, to simplify the problem.

Message Edited by tim18 on 10-29-2005 07:18 AM

Intel_C_Intel · ‎10-28-2005

Tim,

Please advise our customers correctly! The aligned annotation has *no* impact on assumed data dependences that prevent (unconditional) vectorization at all. So, the code:

!DIR$ VECTOR ALIGNED
do i = 1, n
a(i) = a(i+x) + 10
enddo

Still assumes loop-carried flow dependences. To do what you allude to, one would have to use the following annotations:

!DIR$ IVDEP
!DIR$ VECTOR ALIGNED
do i = 1, n
a(i) = a(i+x) + 10
enddo
end

Furthermore, Jim's suggestion is to make alignment properties part of the type system itself.

Aart Bik
http://www.aartbik.com/

Message Edited by abik on 10-28-2005 02:19 PM

jim_dempsey · ‎10-29-2005

Aart,

Yes, enforcement through the type system.

Enforcement as it is done now works only when the entire set of source modules is at the disposal of the compiler. This fails to work effectively when you link in .lib or other .obj files for which the compiler has no source (or database tacked on as a resource). The suggestion for function name decoration is something borrowed from the CPP group.

subroutine foo(ivar,fvar,dvar)
integer(4) :: ivar
real(4) :: fvar
real(8) :: dvar

Might have the name "_foo_pi4pr4pr8" but currently is "_FOO"
Alignment requirements can be mangled in there as well:

subroutine foo(ivar,fvar,dvar)
integer(4) :: ivar
!dec$ attributes align:16 :: fvar // proposed extension
real(4) :: fvar(ivar)
!dec$ attributes align:16 :: dvar // proposed extension
real(8) :: dvar

Might have the name "_foo_pi4pr4a16pr8a16"

Then if the user compiles without the interface the call goes to "_FOO", which is not found. Or you can make an alternate "oops" entry point that displays an error message at runtime if the user manages to link to it.
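
To make the decoration scheme concrete, here is a small sketch that builds names in the style shown above. It is only an illustration of the proposal, not anything a compiler does: the module decorate_mod and the function decorate are hypothetical, and reading the "p" prefix as "passed by reference" is an assumption.

```fortran
module decorate_mod
  implicit none
contains
  ! Build "_" // name // "_", then per argument "p" // type letter // kind
  ! in bytes, followed by "a<N>" when an alignment of N bytes is required.
  ! align(i) == 0 means "no alignment requirement" for argument i.
  function decorate(name, type_letter, kind_bytes, align) result(s)
    character(*), intent(in) :: name
    character(1), intent(in) :: type_letter(:)
    integer, intent(in) :: kind_bytes(:), align(:)
    character(:), allocatable :: s
    character(16) :: buf
    integer :: i
    s = '_' // name // '_'
    do i = 1, size(type_letter)
      write (buf, '(i0)') kind_bytes(i)
      s = s // 'p' // type_letter(i) // trim(buf)
      if (align(i) > 0) then
        write (buf, '(i0)') align(i)
        s = s // 'a' // trim(buf)
      end if
    end do
  end function decorate
end module decorate_mod

program demo
  use decorate_mod
  implicit none
  ! Prints _foo_pi4pr4pr8
  print '(a)', decorate('foo', ['i', 'r', 'r'], [4, 4, 8], [0, 0, 0])
  ! Prints _foo_pi4pr4a16pr8a16
  print '(a)', decorate('foo', ['i', 'r', 'r'], [4, 4, 8], [0, 16, 16])
end program demo
```

A caller compiled without the alignment attributes would generate the first name and fail to link against a routine exported under the second, which is the enforcement being proposed.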

The point is you are not requiring the user to use the pragmas. If they want the performance enhancement then they will willingly use the pragma (!DEC$). This is no different from permitting the user to use !DEC$ to align data: the user is not obligated to use it, but if they do use it then the program might benefit a bit.

If you look at the assembler code for a simple vector sum of two arrays into one (I know you have, Aart, but others here may not have) you will notice the compiler does a really good job at figuring out if SIMD will work and what it needs to do to synchronize the operation so that it can make effective use of SIMD. There is also a quick test for a loop threshold (for real(8) it is 9 iterations) to see if it is worth bothering to attempt to use SIMD. If the iteration count is low then the overhead is high (and at times SIMD even performs worse than not using SIMD).

Tim,

!DEC$ VECTOR ALIGNED

This is true, but aligned to what? 2, 4, 8, 16? How many bytes?
This is not specified. I've tried !DEC$ VECTOR ALIGNED but without certain success. I get the impression from looking at the code that the alignment is assumed to be at 8 not 16. This means the SSE3 instructions are not optimal. What is needed is

!DEC$ VECTOR ALIGNED : 16

But there is no syntax for that. Also, which vectors are aligned? Maybe my loop has one aligned vector and one unaligned vector. What then?

This would all be moot if it were permitted to use "!DEC$ ATTRIBUTES ALIGN : 16 :: VAR" on the dummy arguments to function and subroutine calls. You are permitted to align automatic (stack) variables, so why not permit the declaration on the input arguments? Presumably because of the potential problem of the caller passing in unaligned data. Well, "!DEC$ VECTOR ALIGNED : 16" isn't going to fix that potential problem either. Some programmers can aim between their toes.

Both,

I will have to admit the compiler writers (does that include you Aart?) are doing an amazingly good job.

BTW in a past life I wrote a post-compile-time optimizing utility that I sold from my software company. It was targeted towards Borland C++ but was generic enough to work with MS C++. This was way back in 1990/1991. What it would do is shell out to run the C++ compiler, modifying the switches to produce .ASM files. Then it would edit the .ASM files and pass the result on to TASM or MASM. Then link. The optimizations would (in addition to general cleanup) convert all huge pointer references into flat model Scale Index and Base operations. The resulting program had to run in Real Mode (not VM) and ran with a replacement for the XMS manager. The replacement XMS manager kept the Granularity set to Large as well as provided an extended heap. The code was still restricted to the lower 640KB but the data could span to the full extent of physical memory (a few MB in those days).

double huge *array = new huge double[123456];
...
array[index] = var;

It would be compiled to that ugly huge-mode code, but the conversion process would produce nice lean SIB-mode code as well as replace the new handler for huge. Programs would run 400% faster with 10x the RAM. This gave much better performance than DOS extenders but required the real Real Mode DOS. Not many DOS customers.

Jim Dempsey

TimP · ‎10-29-2005

VECTOR ALIGNED asserts that all potentially vectorizable operands are 16-byte aligned, so the compiler can skip generation of code which checks and adjusts alignment, as I think you asked. If one of the arrays is not aligned, there may be less to be gained by telling the compiler about it.
If you know that all arrays which are written into are aligned, and had a way to assert that other arrays will not be aligned, you could save generation of unneeded code by your data alignment assertions. Certainly, in the case where an array is declared aligned, but the loop starts at a non-aligned interval from the start, your proposal could do the job.
Aart and Steve have scolded me before for working so much on old-fashioned code where there is much built-in information on alignment, much of which Aart has taught the compiler to take advantage of. I'm not meaning to argue much against this, but I'm somewhat skeptical of general acceptance of more specialized optimization stuff like what has been added to C. AMD waged a campaign to have other brands of Fortran include low-level SSE intrinsics, which came originally from Intel C. They weren't nearly as popular as auto-vectorization, so now all the compilers have copied from Aart's work.
Intel C people used to argue that all data should be declared with declspec, all malloc() replaced with non-standard functions, and whole program optimization used, if efficient alignments are wanted. They even turned down requests to have default minimum 64-bit alignment for 64-bit data. I'll admit my gratitude that Fortran isn't likely to adopt such policy, and try to stop making noise.

Intel_C_Intel · ‎10-31-2005

Hello.

The ultimate goal is to write source code with high performance that is invariant with compiler. This is now possible with IVF: even old fashioned Fortran-77 code vectorizes neatly (data were usually stored in a cache friendly unit stride fashion back in the seventies).

Adding many #pragma and !DEC statements everywhere makes the code dependent on one particular compiler, and the readability of the code is usually reduced. Instead, all the automatic features of IVF, like automatic vectorization and automatic parallelization, make life much easier for the programmer. (I am waiting for more automatic features in the future! :-).

If performance is crucial, I recommend putting all the functions that belong together in one source code file and compiling with the options "/O3 /Qip /QxN /Qprof_use". Then alignment issues are not seen, and both compilation and program execution time are reduced.

Best Regards,

Lars Petter

Message Edited by lpe@scandpower.no on 10-31-2005 01:58 AM

jim_dempsey · ‎10-31-2005

Correct me if I am wrong...

As it stands now, all subroutines and functions (that are not inlined) have entry points that are visible to the linker. As such, the compiler cannot know the particulars of all potential callers of the function or subroutine. Therefore the compiler has no choice but to generate code that is insensitive to alignment, i.e. code that has overhead to work out the alignment prior to taking full advantage of SSEn instructions. Routines that are relatively small can be forced inline and thus brought into the scope of the caller, and as a result the alignment of the arguments can be taken into consideration during code generation. If the routine is not inlined then it has the potential of being called from a context where the alignment is unknown, and thus the synchronization code is generated for the routine and executed upon call.

A potential compromise to this quandary would be to generate two or more entry points into the routine. When the compiler has access to both the routine being called and the caller, and the alignments meet the requirements of the routine, the compiler outputs a call to the alternate entry point that bypasses the test and synchronization code. This method would require no !DEC$ (other than to align the variables for allocation or static placement).
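
The two-entry-point idea can be imitated by hand today. The following is a sketch under stated assumptions, not what the compiler generates: the names vadd_mod, vadd, vadd_aligned, vadd_general and aligned16 are invented for illustration, and the address test uses the standard iso_c_binding module instead of loc(). The !DIR$ VECTOR ALIGNED line is an ordinary comment to compilers that do not recognize it.

```fortran
module vadd_mod
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t, c_double
  implicit none
contains
  ! True when x starts on a 16-byte boundary. No INTENT on x because
  ! only its address is inspected, never its value.
  logical function aligned16(x, n)
    integer, intent(in) :: n
    real(c_double), target :: x(n)
    integer(c_intptr_t) :: addr
    addr = transfer(c_loc(x), addr)
    aligned16 = iand(addr, 15_c_intptr_t) == 0
  end function aligned16

  ! General entry point: safe for any caller, pays for the test.
  subroutine vadd(out, a, b, n)
    integer, intent(in) :: n
    real(c_double), intent(out) :: out(n)
    real(c_double), intent(in)  :: a(n), b(n)
    if (aligned16(out, n) .and. aligned16(a, n) .and. aligned16(b, n)) then
      call vadd_aligned(out, a, b, n)
    else
      call vadd_general(out, a, b, n)
    end if
  end subroutine vadd

  ! Alternate entry point: the caller guarantees 16-byte alignment,
  ! so no test or synchronization code is needed here.
  subroutine vadd_aligned(out, a, b, n)
    integer, intent(in) :: n
    real(c_double), intent(out) :: out(n)
    real(c_double), intent(in)  :: a(n), b(n)
    integer :: i
    !DIR$ VECTOR ALIGNED
    do i = 1, n
      out(i) = a(i) + b(i)
    end do
  end subroutine vadd_aligned

  ! Fallback: no alignment assumption at all.
  subroutine vadd_general(out, a, b, n)
    integer, intent(in) :: n
    real(c_double), intent(out) :: out(n)
    real(c_double), intent(in)  :: a(n), b(n)
    integer :: i
    do i = 1, n
      out(i) = a(i) + b(i)
    end do
  end subroutine vadd_general
end module vadd_mod

program demo
  use, intrinsic :: iso_c_binding, only: c_double
  use vadd_mod
  implicit none
  real(c_double) :: x(9), y(9), z(9)
  integer :: i
  x = [(real(i, c_double), i = 1, 9)]
  y = 10.0_c_double
  call vadd(z, x, y, 9)            ! whole arrays
  call vadd(z(2), x(2), y(2), 8)   ! shifted by one element (8 bytes)
  print *, z
end program demo
```

A caller that already knows its arrays are aligned could call vadd_aligned directly and skip the test, which is the role the compiler-chosen alternate entry point would play. Calling it with misaligned data would be exactly the wrong assertion Aart warns about.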

Jim Dempsey

Intel_C_Intel · ‎11-01-2005

Hello,

You write about a "non-negligible amount of overhead in determining alignment issues". In terms of CPU time, what is this overhead relative to the total time the program takes?

Lars Petter

jim_dempsey · ‎11-01-2005

In my first message in this thread I mentioned two circumstances: small iteration count and narrow aperture. For small iteration counts the arrays are relatively small, 1-50 entries. Narrow aperture is similar, being a small section of a larger array. If arrays of real(8) are being processed, then for iteration counts of 9 or less SIMD is bypassed, so all entries are processed one at a time (versus two at a time). If more than 9 entries are to be processed, then code is run to determine whether the array(s) are aligned. If not, then alignment will be attempted and, if possible, the code will proceed with SIMD (two variables at a time).

If the array is .gt. 50 then the overhead is not significant; between 10 and 50 it is significant; and below 10 SIMD is bypassed altogether.
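
The decision logic just described can be written down as plain integer arithmetic. This is a sketch of the description above, not of the code ifort actually emits: the names simd_plan, peel_count and plan are invented, the threshold of 9 is the figure quoted in this post, and falling back to all-scalar code when peeling cannot reach a 16-byte boundary is an assumption (a real compiler may use unaligned loads instead).

```fortran
module simd_plan
  implicit none
  integer, parameter :: threshold  = 9   ! real(8) trip-count cutoff quoted above
  integer, parameter :: elem_bytes = 8   ! size of real(8)
  integer, parameter :: boundary   = 16  ! SSE vector width in bytes
contains
  ! Leading elements to run one at a time until the address reaches a
  ! 16-byte boundary; -1 if stepping by whole elements never gets there.
  integer function peel_count(addr_mod)
    integer, intent(in) :: addr_mod      ! start address modulo boundary
    integer :: gap
    gap = modulo(boundary - addr_mod, boundary)
    if (modulo(gap, elem_bytes) /= 0) then
      peel_count = -1
    else
      peel_count = gap / elem_bytes
    end if
  end function peel_count

  ! Split n iterations into scalar head, two-wide SIMD body, scalar tail.
  subroutine plan(n, addr_mod, head, pairs, tail)
    integer, intent(in)  :: n, addr_mod
    integer, intent(out) :: head, pairs, tail
    head = n; pairs = 0; tail = 0          ! default: everything scalar
    if (n <= threshold) return             ! too short to bother with SIMD
    if (peel_count(addr_mod) < 0) return   ! peeling cannot align the data
    head  = peel_count(addr_mod)
    pairs = (n - head) / 2
    tail  = n - head - 2 * pairs
  end subroutine plan
end module simd_plan

program demo
  use simd_plan
  implicit none
  integer :: head, pairs, tail
  call plan(9, 0, head, pairs, tail)
  print '(3(a,i0))', 'n=9,  aligned:   head=', head, ' pairs=', pairs, ' tail=', tail
  call plan(20, 0, head, pairs, tail)
  print '(3(a,i0))', 'n=20, aligned:   head=', head, ' pairs=', pairs, ' tail=', tail
  call plan(20, 8, head, pairs, tail)
  print '(3(a,i0))', 'n=20, off by 8:  head=', head, ' pairs=', pairs, ' tail=', tail
end program demo
```

With 20 elements starting 8 bytes past a boundary, one element is peeled, nine pairs go through SIMD and one element is left over. With 9 elements everything runs scalar regardless of alignment. The test and the peel are the per-call cost that a declared alignment would remove.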

The result is that the user, in order to take advantage of SIMD, must write snippets of code to be inlined (or write the code in place all over the place). Inlining is not objectionable for relatively simple code segments, but not all code segments are simple.

For example: the code I am working with now performs a finite element analysis of tethers. Each point has several attributes to follow (mass, X, Y, Z, dX, dY, dZ, ddX, ddY, ddZ) plus external forces. The reaction at one point affects the action at adjacent points (thus my interest in narrow aperture). Because the vectors are triplets of real(8), it would be advantageous to process the arrays two vectors at a time (6 variables at a time). The compiler has no choice but to generate SISD (Single Instruction Single Data) code even though I take the effort to align the data.

I inline where possible but this is not practical in all cases (the code to inline is too large). Having the ability to !DEC$ align the dummy arguments would correct for this problem but we are not permitted to do this.

The SIMD instructions can significantly reduce the run times of your application. IVF does a marvelous job at generating SIMD code when it knows the data are aligned. I just want to be able to tell IVF "this data is aligned" but I cannot.

Jim Dempsey

Intel_C_Intel · ‎11-01-2005

Dear Jim,

>I just want to be able to tell IVF "this data is aligned" but I cannot.

The loop-oriented !DIR$ VECTOR ALIGNED and the upcoming data-oriented ASSUME_ALIGNED annotations should suffice to accomplish just this. Vector code in such cases will be optimal. Your initial proposal went a lot further than just the objective you mention above, however.

Aart Bik
http://www.aartbik.com/

jim_dempsey · ‎11-01-2005

Will the ASSUME_ALIGNED be specified per argument? (And again aligned to what 2, 4, 8, 16, 32, ...)

If not then the programmer cannot use the feature if they know not all arguments are aligned. It is not unusual to call a major routine where you pass in 10 or so arguments. Some of which are aligned some are not.

What is the internal argument against permitting "!DEC$ attributes align: 16 :: var" on dummy arguments to a function or subroutine?

When the !DEC$ alignment is used on static, automatic and allocatable data it does two things: a) aligns the data upon allocation/instantiation and b) enters the alignment attribute into the compiler symbol table. What is wrong with extending the !DEC$ alignment to dummy variables whereby you perform only step b) above? (This would eliminate the vague ASSUME_ALIGNED.)

Jim

Intel_C_Intel · ‎11-01-2005

Dear Jim,

>Will the ASSUME_ALIGNED be specified per argument? (And again aligned to what 2, 4, 8, 16, 32, ...)

Of course! The idea is that something like

subroutine aart(p1, p2)
real p1(*), p2(*)
!DIR$ ASSUME_ALIGNED(p1,16)
!DIR$ ASSUME_ALIGNED(p2, 4)
.
end

provides a fine-grained method of conveying alignment information to the compiler (which analyzes the rest of the subroutine based on this per-argument information, much more flexible than the current loop-oriented alignment annotation). I even advocate more elaborate constructs like

subroutine aart(p1, p2)
real p1(*), p2(*)
!DIR$ ASSUME_ALIGNED(p1(1),16)
!DIR$ ASSUME_ALIGNED(p2(2),16)
.
end

which would indicate that array p1 starts 16-byte aligned (as above), and array p2 starts at an address a such that a mod 16 = 12.
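
The arithmetic behind that last statement is short enough to check mechanically. In this sketch the function name start_residue is invented, and default real is assumed to occupy 4 bytes: asserting that element idx is 16-byte aligned fixes the start address modulo 16.

```fortran
module residue_mod
  implicit none
contains
  ! Start address modulo 'boundary' implied by asserting that element
  ! 'idx' of an array of 'elem_bytes'-byte elements is 'boundary'-aligned.
  integer function start_residue(idx, elem_bytes, boundary)
    integer, intent(in) :: idx, elem_bytes, boundary
    ! Element idx lies (idx-1)*elem_bytes bytes past the start, so the
    ! start must sit that many bytes before a boundary.
    start_residue = modulo(-(idx - 1) * elem_bytes, boundary)
  end function start_residue
end module residue_mod

program demo
  use residue_mod
  implicit none
  print '(a,i0)', 'p1(1) aligned: a mod 16 = ', start_residue(1, 4, 16)
  print '(a,i0)', 'p2(2) aligned: a mod 16 = ', start_residue(2, 4, 16)
end program demo
```

The second line gives 12 because p2(2) sits 4 bytes after the start of p2, and 12 + 4 = 16.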

Aart Bik
http://www.aartbik.com/

jim_dempsey · ‎11-01-2005

Aart,

One more point

!DEC$ vector aligned

as well as

ASSUME_ALIGNED

If they follow the FORTRAN standards, they will have to mean that the data is aligned to natural boundaries. This means real(4) is aligned to a multiple of 4 and real(8) to a multiple of 8. SIMD requires alignment to a multiple of 16 bytes (at least for now; later it might be 32).

Jim

jim_dempsey · ‎11-01-2005

Sorry, our messages crossed. Thanks.

jim_dempsey · ‎11-01-2005

I like the idea of specifying the additional alignment information. The programmer can probably work around not having it.

call Aart(array(22), array(17)) ! always even, always odd

Could be replaced with

call Aart(array(22), array(16)) ! always even, always one before the odd

Then the user would add 1 to the indexes into the second array.

When will this be available?

Jim Dempsey

Intel_C_Intel · ‎11-01-2005

> When will this be available?

I am strongly pushing this feature for 9.1 (this obviously requires new FE support, support in the vectorizer is already there).

Steven_L_Intel1 · ‎11-02-2005

I'm pretty sure this will make it into 9.1. It might even sneak into a 9.0 update. Watch the Release Notes over the coming months.

jim_dempsey · ‎11-02-2005

It could tentatively go into a 9.0 update as an undocumented feature. Hide it with a switch if need be. In this manner the production code can be used for testing by the few in the know of the hidden feature.

Jim