
Zhang_F_1 · ‎09-11-2014

Hello,

My question comes from the following code:

float fa[128] __attribute__((align(64)));
float fb[128] __attribute__((align(64)));

for(j=0; j<100000000; j++)
{
    for(k=0; k<128; k++)
    {
        fa[k]=a*fa[k]+fb[k];
    }
}

When I compile it with icc and the -no-vec option, it takes about 124 s to complete; with auto-vectorization it needs only 1.5 s. That is a speedup of about 80x, even though the vector units can only process 16 floats at once.

Doing the same on an Intel Xeon E5-1620 v2 @ 3.70GHz takes 5.6 s with -no-vec and 1.5 s with auto-vectorization.

All tests were done using only 1 core.

Why does the Xeon Phi speed up so much with vector instructions while the Xeon doesn't? Shouldn't the Xeon speed up 8 times, since its vector registers are 256 bits wide?

TimP · ‎09-11-2014

A couple of things to check:

1) Does the compiler skip iterations in one or both cases, seeing that you don't use the intermediate results?

2) Is prefetch more efficient in the vectorized case (1 prefetch to L2 and 1 to L1 per cache line)?

Zhang_F_1 · ‎09-11-2014

Regarding 1): since fa is used on both sides of the assignment, no iteration should be discardable.

2) Shouldn't everything fit in the L1 cache after the first outer iteration anyway? In that case prefetches shouldn't be too significant.

McCalpinJohn · ‎09-11-2014

With vector instructions and aligned data the compiler can incorporate a memory reference into the FP add operation and into the FP multiply operation. In scalar code these will have to be separate MOV instructions (since most will not be 64-byte aligned), which could double the number of instructions in the inner loop.

It should be easy to check the assembly code to see if this is part of the problem.

Why does the Xeon Phi speed up more than 16 times when using Vector Instructions?