SSE Intrinsics.

ramon_2 · ‎06-23-2004

I would like to see SSE intrinsics supported by the Intel Fortran compiler. An alternative would be inline assembly.

There are some cases where the compiler is not able to vectorize code. For instance:

real*4 a(4), c

c = SQRT(SUM(a**2)) ! The norm (modulus) of a vector

This one is quite surprising. Evaluating the distance between two points is a very frequent operation in molecular simulation codes.

I understand that the resources committed to compiler development are limited, and that the compiler cannot support every possible case of vectorization. However, please allow the programmer to do it by hand. Most C++ compilers support SSE intrinsics. The Fortran compiler should support them as well.
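
For comparison, here is a rough sketch of what the hand-written version might look like in C, where SSE intrinsics are available through `<xmmintrin.h>`. The function name `norm4` is made up for illustration, and the scalar fallback is only there so the sketch still compiles on targets without SSE. None of this is an Intel Fortran feature.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  /* SSE intrinsics */

/* norm4 (hypothetical name): Euclidean norm of a 4-element float vector. */
float norm4(const float a[4])
{
    __m128 v  = _mm_loadu_ps(a);     /* load all four elements, no alignment required */
    __m128 sq = _mm_mul_ps(v, v);    /* four squares in one parallel multiply */

    /* Horizontal sum using SSE1 only: fold the upper pair onto the
       lower pair, then add the two remaining lanes. */
    __m128 hi  = _mm_movehl_ps(sq, sq);                          /* lanes: sq2, sq3, sq2, sq3 */
    __m128 s2  = _mm_add_ps(sq, hi);                             /* lane0 = sq0+sq2, lane1 = sq1+sq3 */
    __m128 l1  = _mm_shuffle_ps(s2, s2, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast lane 1 */
    __m128 sum = _mm_add_ss(s2, l1);                             /* lane0 = sum of all four squares */

    return _mm_cvtss_f32(_mm_sqrt_ss(sum));                      /* scalar square root of lane 0 */
}
#else
#include <math.h>

/* Scalar fallback (an assumption for portability, not part of the SSE technique). */
float norm4(const float a[4])
{
    return sqrtf(a[0] * a[0] + a[1] * a[1] + a[2] * a[2] + a[3] * a[3]);
}
#endif
```

Only the multiply is a full 4-wide operation here. The reduction still takes a couple of shuffles and adds, so whether this beats plain scalar code for such a short vector would have to be measured.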

g_f_thomas · ‎06-23-2004

Hear, hear. Keep in mind, though, that Intel's primary line of business is chips; software is quite secondary at best. It seems to me that Fortran is at the back of the bus when it comes to Intel software. Intel Fortran doesn't even come with a 'Hello World' sample to validate the installation, but it does come with a fancy tutorial on how to use (wait for it!) the Intel C++ compiler. Go figure.

Anyways, you might find

http://www.codeproject.com/cpp/sseintro.asp

of interest but I sense that you're already aware of it.

Good luck,

Gerry T.

ramon_2 · ‎06-23-2004

A cheap option would be to allow inline assembly. This option is supported by the Salford Fortran compiler.

g_f_thomas · ‎06-23-2004

If Salford fulfils your needs (which I very much doubt, FWIW) why are you snivelling about IVF? FYI, it's possible to do mixed IVF ASM programming, so just do it.

Ciao,
Gerry T.

jean-vezina · ‎06-23-2004

I have compiled the following test program with the compiler options /O3 /QxK (or /QxN for a Pentium IV machine), and the code is indeed vectorized.

Sample code:

real c,a(4)
a=3.
c = SQRT(SUM(a**2)) ! norm of the vector
print *,c
end

Command used:

Pentium III: ifort /O3 /QxK /FAs vec.f90

Pentium IV: ifort /O3 /QxN /FAs vec.f90

The /FAs option is used to produce an assembly language listing of the code generated by the compiler.

In both cases SSE or SSE2 instructions are generated, showing that the code is vectorized.

Best regards,

Jean Vezina

TimP · ‎06-23-2004

I tried Jean's example with ifort 8.0.050. While it uses a parallel instruction to assign values to a(:), it uses serial SSE instructions to perform the calculations. The /QxP option would be required to allow a parallel instruction for the final addition of the 4 operands. Have you been able to produce a benchmark showing an advantage for a parallel multiplication followed by a serial addition? The compiler's automatic vectorization of sum reductions uses 8 partial sums, which evidently is not applicable to such a short vector.
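
To illustrate the partial-sum point, here is a minimal sketch in plain C of a sum-of-squares reduction with 8 independent accumulators. The function name `sum_sq_partial` is hypothetical, and the layout of the accumulators is an assumption made for illustration; it is not taken from the compiler's actual generated code.

```c
#include <stddef.h>

/* sum_sq_partial (hypothetical name): sum of squares accumulated in
   8 independent partial sums, combined once at the end. */
float sum_sq_partial(const float *a, size_t n)
{
    float p[8] = {0.0f};   /* the 8 partial sums */
    size_t i = 0;

    /* Main loop: consumes 8 elements per iteration, one per partial sum.
       This is the part a vectorizer can map onto parallel instructions. */
    for (; i + 8 <= n; i += 8)
        for (size_t k = 0; k < 8; ++k)
            p[k] += a[i + k] * a[i + k];

    /* Combine the partial sums. */
    float total = 0.0f;
    for (size_t k = 0; k < 8; ++k)
        total += p[k];

    /* Remainder loop: leftover elements are handled one at a time. */
    for (; i < n; ++i)
        total += a[i] * a[i];

    return total;
}
```

With n = 4 the main loop never runs, so all of the work falls to the serial remainder loop. That matches the behaviour described above for a 4-element vector.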

Message Edited by tcprince on 06-23-2004 10:23 AM