Auto vectorisation with AVX not as expected

magicfoot · ‎02-27-2012

I have converted one of my numerical processes to use AVX on an i5-2500K and see good performance improvements this way, as shown in the attached graphic. As the number of cores used increases, the performance improves.

AUTO=Auto vectorisation with Composer
AVX=Manual vectorisation of loops and AVX
Regular=No Vectorisation at all, No AVX or SSE

Using auto vectorisation with the Intel C compiler ("Composer"), I only achieve a good speedup on one core. Does that sound right? Using more cores gives only a marginal speedup.

This process uses openMP.

Georg_Z_Intel · ‎02-27-2012

Hello,

we're continuously improving our compiler optimizations to achieve high performance out of the box. Thus we're interested in cases where the auto vectorization of the compiler does not work as expected. Your case seems quite interesting and I'd like to take a look at it.

But first we should distinguish auto vectorization from multi threading. Auto Vectorization uses SIMD capabilities from each core independently (data parallelism per core). Your data, however, shows the speedup with using multiple cores (multi threading). Auto Vectorization should not have (direct) impact on multi threading, provided the same basic algorithm is used.

That's also my first question:
Did you use the same algorithm for AVX & auto vectorization? And, can you exclude side-effects from changes made by manually using AVX vectorization?
My first impression from your graph is that the auto vectorization example does not scale at all. Hence it seems unlikely that auto vectorization from the compiler is the root cause. Are there data dependencies?
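
As a minimal sketch of the distinction between threading and vectorization (not taken from the code under discussion): the OpenMP pragma distributes rows across cores, while the inner loop over contiguous elements is what a compiler may turn into SIMD instructions inside each thread. The function name `scale_add_rows`, the flat row-major layout, and the `restrict` qualifiers are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i][j] = a * x[i][j] + y[i][j] on flat row-major arrays.
 * Threading: the outer loop over rows is split across cores by OpenMP
 * (the pragma is simply ignored if OpenMP is not enabled at compile time).
 * Vectorization: the inner loop walks contiguous memory with no loop-carried
 * dependency, so the compiler is free to use SIMD within each thread. */
void scale_add_rows(long rows, long cols, float a,
                    const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL && rows >= 0 && cols >= 0);

    #pragma omp parallel for
    for (long i = 0; i < rows; i++) {
        const float *xr = x + (size_t)i * (size_t)cols;  /* start of row i in x */
        float *yr = y + (size_t)i * (size_t)cols;        /* start of row i in y */
        for (long j = 0; j < cols; j++) {
            yr[j] = a * xr[j] + yr[j];
        }
    }
}
```

If a loop like this vectorizes but does not scale with threads, the cause is more likely in how the work is divided (or in memory traffic) than in the SIMD code itself.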

Furthermore I'd need more information:
What compiler version did you use? Recent 12.1 compilers have seen some improvements in (outer-) loop vectorization. Did you use the same compiler options in all cases? And, did you also make use of "/parallel" (Windows) or "-parallel" (Linux)?

The best would be to narrow down the performance incident to a small reproducer (preferably a single loop). Could you kindly provide one?

Thank you & best regards,

Georg Zitzlsberger

magicfoot · ‎02-27-2012

Hello Georg,

The compiler used was Intel C++ Compiler XE 12.1 on a trial basis for 1 month, several months ago. I do recall that I used the .bat files that were with the compiler and modified those. If you get one of your guys to send me a few days extension for the trial then I can re-create the flags I used with the compiler to build the code.

The test code in OpenMP form consists of several loops like the one listed below in section A). For a single core, remove the OpenMP pragmas.
Section B) is a fragment of the code with manual AVX, i.e. intrinsics.

I did not use the /parallel flag in the compiler as far as I recall.

Let me know if you need the full programs.
Regards

SECTION A)
.....

#pragma omp parallel for private(i,j)
for (i = 1; i < ie; i++) {
    for (j = 0; j < je; j++) {
        ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * ( hz[i-1][j] - hz[i][j] );
    }
}

.....

SECTION B)

#pragma omp parallel for private(i,j,nt)
for (i = 1; i < ie; i++) {
    for (j = 0; j < je; j+=8) {

        __m256 oh0=_mm256_load_ps(&ey[i][j]);
        __m256 oh1=_mm256_load_ps(&caey[i][j]);
        __m256 oh2=_mm256_load_ps(&cbey[i][j]);
        __m256 oh3=_mm256_load_ps(&hz[i-1][j]);
        __m256 oh4=_mm256_load_ps(&hz[i][j]);

        __m256 m1 = _mm256_mul_ps(oh0,oh1);
        __m256 m2 = _mm256_sub_ps(oh3,oh4);
        __m256 m3 = _mm256_mul_ps(oh2,m2);
        __m256 m4 = _mm256_add_ps(m1,m3);

        _mm256_store_ps(ey[i]+j,m4);

    }
}
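
For comparison, a self-contained variant of the Section B idea might look like the sketch below. It is not the poster's program: it assumes the arrays are two-dimensional and stored flat in row-major order with `je` floats per row, and the name `update_ey_avx` is hypothetical. `_mm256_load_ps` requires 32-byte-aligned addresses, so the sketch uses `_mm256_loadu_ps` and `_mm256_storeu_ps` in case alignment is not guaranteed, and a scalar tail handles values of `je` that are not a multiple of 8. The AVX path is compiled only when AVX is enabled (e.g. `-mavx` with GCC/Clang); otherwise the scalar loop does all the work.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Assumed update: ey[i][j] = caey[i][j]*ey[i][j] + cbey[i][j]*(hz[i-1][j] - hz[i][j])
 * for i = 1 .. ie-1. All arrays are flat, row-major, je floats per row. */
void update_ey_avx(int ie, int je, float *ey, const float *caey,
                   const float *cbey, const float *hz)
{
    assert(ie >= 1 && je >= 0);

    #pragma omp parallel for
    for (int i = 1; i < ie; i++) {
        size_t row  = (size_t)i * (size_t)je;   /* offset of row i   */
        size_t prev = row - (size_t)je;         /* offset of row i-1 */
        int j = 0;
#ifdef __AVX__
        /* Eight floats per iteration, unaligned loads and stores. */
        for (; j + 8 <= je; j += 8) {
            __m256 e  = _mm256_loadu_ps(ey   + row  + j);
            __m256 ca = _mm256_loadu_ps(caey + row  + j);
            __m256 cb = _mm256_loadu_ps(cbey + row  + j);
            __m256 h0 = _mm256_loadu_ps(hz   + prev + j);
            __m256 h1 = _mm256_loadu_ps(hz   + row  + j);
            __m256 r  = _mm256_add_ps(_mm256_mul_ps(ca, e),
                                      _mm256_mul_ps(cb, _mm256_sub_ps(h0, h1)));
            _mm256_storeu_ps(ey + row + j, r);
        }
#endif
        /* Scalar tail (or the whole row when AVX is not enabled). */
        for (; j < je; j++) {
            ey[row + j] = caey[row + j] * ey[row + j]
                        + cbey[row + j] * (hz[prev + j] - hz[row + j]);
        }
    }
}
```

Because each row `i` only writes `ey` in row `i` and reads `hz` in rows `i-1` and `i`, the rows are independent and the outer loop can be shared among threads.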

Georg_Z_Intel · ‎02-28-2012

Hello,

thank you for the information. I'll look into that now.

Best regards,

Georg Zitzlsberger

Chaitali_C_ · ‎11-26-2014

Hello Georg,

I have questions related to AVX. The first is how to calculate the expected speedup from AVX. For example, if I am doing vector addition on floats and my processor supports AVX, then 8 floats are processed in a single iteration, so can I expect an 8x speedup in this case? And if I am not getting it, what could the reasons be: an alignment problem, or something else? What stops me from getting 8x, given that there are no dependencies in vector addition code?

The other question: is it possible to use all YMM registers in parallel by making changes in assembly or intrinsic code? Say, in vector addition for a bigger vector, if ymm0 holds input array1[0-7], ymm1 holds input array2[0-7], and ymm2 holds the results, can ymm3 and ymm4 at the same time hold array elements [8-15] and ymm5 hold the result elements [8-15], and so on?

Thanks in advance,

Chaitali

TimP · ‎11-26-2014

You talk as if it doesn't matter which CPU generation you use, although full-width stores and L2 access were introduced with Haswell. Also, you seem to ignore memory bandwidth. Loop unrolling is a usual optimization, although not as fantastic as you expect.
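
A rough back-of-the-envelope model shows why memory bandwidth matters for this question. For `c[i] = a[i] + b[i]` on floats, each addition moves 12 bytes (two 4-byte loads and one 4-byte store), so once the arrays no longer fit in cache the attainable rate is capped at bandwidth / 12 additions per second, whatever the SIMD width. The sketch below encodes that estimate; the function name `vec_add_bound` is hypothetical and the model ignores write-allocate traffic and cache effects.

```c
#include <assert.h>

/* Upper bound on float additions per second for c[i] = a[i] + b[i].
 *   peak_adds_per_s: compute-side peak (e.g. SIMD lanes * adds per cycle * clock)
 *   bytes_per_s:     sustained memory bandwidth
 * Simplified model: 12 bytes of traffic per addition, nothing else counted. */
double vec_add_bound(double peak_adds_per_s, double bytes_per_s)
{
    assert(peak_adds_per_s >= 0.0 && bytes_per_s >= 0.0);

    const double bytes_per_add = 12.0;  /* two 4-byte loads + one 4-byte store */
    double memory_bound = bytes_per_s / bytes_per_add;
    return memory_bound < peak_adds_per_s ? memory_bound : peak_adds_per_s;
}
```

With made-up figures of 24e9 additions/s of compute peak and 24e9 bytes/s of bandwidth, the bound is 2e9 additions/s: the loop is limited by memory, not by how many floats fit in a register.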

Chaitali_C_ · ‎11-27-2014

The processor is an Intel Xeon E5-2670, with icc 13.0.1. My question is: can it be done (not considering memory bandwidth)?

TimP · ‎11-27-2014

Can what be done? The compiler attempts to calculate a maximum vector speedup, if that's what you mean. That estimate can be displayed, e.g. with opt-report4, by current compilers. It takes into account more than you would like, but by no means all important factors. Nor does it offer a reliable comparison among various architecture options on a given CPU.

Chaitali_C_ · ‎11-27-2014

Can the combination described below be done on an Intel Xeon processor supporting AVX?

"Is it possible to use all YMM registers in parallel by making changes in assembly or intrinsic code? Say, in vector addition for a bigger vector, if ymm0 holds input array1[0-7], ymm1 holds input array2[0-7], and ymm2 holds the results, can ymm3 and ymm4 at the same time hold array elements [8-15] and ymm5 hold the result elements [8-15], and so on?"

So all in one how many floats(4 bytes) can be packed in 16 YMM registers at a time (through assembly)

Bernard · ‎11-27-2014

>>>So all in one how many floats(4 bytes) can be packed in 16 YMM registers at a time (through assembly)>>>

One YMM register can hold 8 single-precision FP values, so 16 YMM registers can hold 128 floats.

Regarding the first part of your question, if I understood it correctly: when using AVX intrinsics, the load intrinsic reads from a float array and returns an __m256 value, which maps directly to a YMM register; passing that value to the store intrinsic writes it back to a float array. You can see this clearly in post #3.
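
The arrangement asked about, with two independent groups of registers working on elements [0-7] and [8-15], corresponds to unrolling the loop by two. A minimal sketch with intrinsics follows. `vec_add_unroll2` is a hypothetical name, unaligned loads are assumed, and it is the compiler, not the source, that decides which physical YMM registers end up holding `a0`, `b0`, `a1` and `b1`. The AVX path is compiled only when AVX is enabled (e.g. `-mavx` with GCC/Clang); otherwise the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* c[i] = a[i] + b[i], unrolled by two 8-float vectors (16 floats per iteration). */
void vec_add_unroll2(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    size_t i = 0;
#ifdef __AVX__
    for (; i + 16 <= n; i += 16) {
        /* First group: elements [i, i+8) */
        __m256 a0 = _mm256_loadu_ps(a + i);
        __m256 b0 = _mm256_loadu_ps(b + i);
        /* Second group: elements [i+8, i+16), independent of the first */
        __m256 a1 = _mm256_loadu_ps(a + i + 8);
        __m256 b1 = _mm256_loadu_ps(b + i + 8);

        _mm256_storeu_ps(c + i,     _mm256_add_ps(a0, b0));
        _mm256_storeu_ps(c + i + 8, _mm256_add_ps(a1, b1));
    }
#endif
    /* Scalar tail for the remaining elements (or all of them without AVX). */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The two groups have no dependency on each other, so the hardware can overlap them; whether that helps in practice depends on the load/store throughput and memory bandwidth mentioned earlier in the thread.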

jimdempseyatthecove · ‎11-27-2014

>>So all in one how many floats(4 bytes) can be packed in 16 YMM registers at a time (through assembly)

You also have to manipulate the data.

You cannot effectively use these registers as a "register cache" unless your only operations are fetch, store, and single integer operations. Using any floating-point operation requires the use of SSE/AVX registers (unless on IA32 using FPU instructions); other than for some special cases, this will require some AVX registers to perform the operations.

Most of the time it is best to let the compiler decide, in a MRU manner, what and how much to cache in AVX registers.

Processor designs change, usually for the better. Do you want to rewrite your code if you add FMA instructions (this may permit one more register to be cached), or later when you upgrade to a CPU supporting AVX512, or way later, AVX1024 (which will permit more data to be cached in registers)?

Jim Dempsey

Chaitali_C_ · ‎11-27-2014

Thanks a lot Jim !!