support of gcc vector extensions required - Page 2

Nathanael_S_ · ‎06-06-2013

Hello,

I have a code that uses gcc vector extensions and achieves 80% of the nominal peak performance with AVX vectors.
The gcc vector extensions allow me to write explicitly vectorized code with only very few intrinsics (sums, products and the like are simply written a+b, a*b, etc.).
The performance is awesome, but when compiled with icc, the code falls back to scalar data types and the performance is horrible: nearly four times slower (reaching 20% of the nominal peak performance).
Clearly, despite trying hard, icc is not able to vectorize my inner loops correctly.

It would not bother me that much, because gcc is available almost everywhere. But I'm currently trying to run this on the MIC (Xeon Phi), where I have to go through icc, which leads to poor vectorization and poor performance. (20% of the peak of the MIC will be less than 80% of the peak of a 16-core AVX machine...)

Please support gcc vector extensions in icc!

http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
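
For readers unfamiliar with the style described above, here is a minimal sketch of code written with gcc vector extensions. The type name `v4d`, the function `axpy_v4d` and the restriction that `n` is a multiple of 4 are illustrative assumptions; none of them comes from the code discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four doubles in one 32-byte vector: the width of an AVX register. */
typedef double v4d __attribute__((vector_size(32)));

/* y[i] = a * x[i] + y[i], written with plain operators on vector values.
   n is assumed to be a multiple of 4 (a simplification for this sketch;
   real code would add a scalar remainder loop). */
void axpy_v4d(double a, const double *x, double *y, size_t n)
{
    const v4d va = {a, a, a, a};
    for (size_t i = 0; i < n; i += 4) {
        v4d vx, vy;
        memcpy(&vx, x + i, sizeof vx);   /* load that is safe for unaligned data */
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;               /* element-wise, no intrinsics */
        memcpy(y + i, &vy, sizeof vy);   /* store that is safe for unaligned data */
    }
}

/* Small usage example with a check of the result. */
void axpy_example(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {10.0, 20.0, 30.0, 40.0};
    axpy_v4d(2.0, x, y, 4);
    assert(y[0] == 12.0 && y[3] == 48.0);
}
```

The arithmetic line is the whole point: the compiler chooses the instructions for the target, so the same source can be built for SSE2 or AVX.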

SergeyKostrov · ‎06-19-2013

Thanks for these numbers.

>>GCC 4.4.6 : synthesis = 7.6 ms analysis = 7.3 ms
>>ICC 14.0 : synthesis = 9.9 ms analysis = 7.5 ms

For the analysis case GCC outperforms ICC by 2.66% (almost 3%), and based on my statistics I would consider that acceptable. If the difference were greater than 5%, then improvements in code generation would really be needed.

Shenghong_G_Intel · ‎06-19-2013

Nathanael S. wrote:

Quote:

Two suggestions for you guys at intel:

the support for vector extensions should be mentioned somewhere on this page,

the need for explicit cast between vector types and _mmXXX types should be considered for removal.

Hi Nathanael,

Regarding your suggestions, I will submit a request to add a description to the documentation. Regarding the second suggestion, I have already submitted it to the developers to fix. I will update you when it is fixed.

Also, for the performance issue, I will do some investigation. But since you are more familiar with the code, it would be helpful if you could provide some smaller test cases that show the bad performance of ICC (is it related to a specific vector extension? to a specific loop? etc.). I will update you when I make some progress.

Thanks,

Shenghong
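
To illustrate the second suggestion quoted above, here is a hedged sketch of what an explicit cast between a vector-extension type and an `_mm` type looks like. It is x86-only and assumes SSE2; the names `v2d`, `hsum_v2d` and `swap_v2d` are made up for this example and are not taken from the code under discussion.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics and the __m128d type */

typedef double v2d __attribute__((vector_size(16)));

/* Sum of the two lanes of a vector-extension value. */
double hsum_v2d(v2d a)
{
    __m128d m  = (__m128d)a;                  /* explicit cast to the intrinsic type */
    __m128d hi = _mm_unpackhi_pd(m, m);       /* {a1, a1} */
    return _mm_cvtsd_f64(_mm_add_sd(m, hi));  /* a0 + a1 */
}

/* Swap the two lanes, casting in both directions. */
v2d swap_v2d(v2d a)
{
    __m128d m = (__m128d)a;
    return (v2d)_mm_shuffle_pd(m, m, 1);      /* {a1, a0} */
}

/* Small usage example with a check of the result. */
void cast_example(void)
{
    v2d a = {3.0, 4.0};
    assert(hsum_v2d(a) == 7.0);
}
```

If the casts were not required, the two `(__m128d)` and `(v2d)` conversions would simply disappear and the rest of the code would stay the same.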

Nathanael_S_ · ‎06-20-2013

Hello,

An update on the matter: the previous results were obtained using -O3 optimization. When using -O2, the synthesis comes on par with gcc while the analysis slows down a little:

GCC -O3 : synthesis = 7.6ms, analysis = 7.3ms
ICC -O3 : synthesis = 9.9ms, analysis = 7.5ms
ICC -O2 : synthesis = 7.6ms, analysis = 8.6 ms

I suspect that with -O3 option, icc merges loops that should not be merged. If you want to look at the code generated, the relevant function name for synthesis is SH_to_spat_fly2_l

SergeyKostrov · ‎06-20-2013

>>...I suspect that with -O3 option, icc merges loops that should not be merged...

You could compare the generated code. The Intel C++ compiler's /O3 option performs more optimizations than /O2, and in one case I reviewed recently, using /O3 did not improve performance at all.

TimP · ‎06-20-2013

Nathanael S. wrote:

Hello,

An update on the matter: the previous results were obtained using -O3 optimization. When using -O2, the synthesis comes on par with gcc while the analysis slows down a little:

GCC -O3 : synthesis = 7.6ms, analysis = 7.3ms

ICC -O3 : synthesis = 9.9ms, analysis = 7.5ms

ICC -O2 : synthesis = 7.6ms, analysis = 8.6 ms

I suspect that with -O3 option, icc merges loops that should not be merged. If you want to look at the code generated, the relevant function name for synthesis is SH_to_spat_fly2_l

Beginning with the 12.0 release, there is no apparent cost/benefit analysis for loop fusion, and "#pragma nofusion" was introduced as the way to place a barrier against fusion (replacing the prior use of "#pragma distribute point" for this purpose).

Fusion should be reported in opt-report. Common cases where you should test with #pragma nofusion may be:

a) 2nd loop uses result of 1st loop with alignment offset. Best is to adjust loops for alignment by explicit peeling.

b) fusion suppresses partial vectorization

c) .....
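
As a rough sketch of case a), the pragma goes immediately before the loop that should stay separate. The function name and loop bodies here are hypothetical, and the `__INTEL_COMPILER` guard is an addition so that other compilers do not warn about an unknown pragma.

```c
#include <assert.h>
#include <stddef.h>

/* The first loop fills a[0..n-1]. The second loop reads a[i] and a[i+1],
   that is, the first loop's result at an offset, and writes b[0..n-2]. */
void double_then_pair_sum(const double *x, double *a, double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * x[i];

    /* icc only: ask the compiler to keep the next loop separate. */
#ifdef __INTEL_COMPILER
#pragma nofusion
#endif
    for (size_t i = 0; i + 1 < n; i++)
        b[i] = a[i] + a[i + 1];
}

/* Small usage example with a check of the result. */
void nofusion_example(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double a[4], b[3];
    double_then_pair_sum(x, a, b, 4);
    assert(b[0] == 6.0 && b[2] == 14.0);
}
```

Timing the function with and without the pragma is the simplest way to tell whether fusion is what hurts a given loop pair.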

Discussion of the beta compiler is discouraged on this forum. You should report issues with it on premier.intel.com in hope they may be taken up later and perhaps bring them up when they appear in a release.

vincent_b_ · ‎06-20-2013

Hello,

Is there MKL support for MIC in the latest compiler version (14 beta)?
I can't find the path /opt/intel/composerxxx/mkl/lib/mic/, which was present in previous compiler versions.
So for the moment, the code compiles and works fine on CPU but can't be compiled for MIC with the latest version.

What's funny is that previous icc versions apparently do compile the gcc vector extensions code for MIC: no errors are returned, and an executable is created and runs on MIC!

So versions earlier than 14 compile gcc vector extensions for MIC but not for CPU? What's the trick?

And the 14 beta version doesn't have MKL support for MIC?

Thank you,

james_B_8 · ‎06-20-2013

Try /opt/intel/beta/composerxxx/mkl/lib/mic/ for the library path in the beta compilers ;).

SergeyKostrov · ‎06-20-2013

>>...Is there an mkl support for MIC in the latest compiler version (14 beta) ?

If you're a beta tester of Intel C++ compiler version 14, then you could ask that question on the Intel Premier Support website. Regarding MKL support for MIC, review the latest Release Notes.

TimP · ‎06-20-2013

vincent b. wrote:

Is there an mkl support for MIC in the latest compiler version (14 beta) ?
I can't find the path /opt/intel/composerxxx/mkl/lib/mic/ which was present in previous compiler versions.

As we've said, the place to report this is premier.intel.com, and I've done so. Looks like an oversight, in my opinion one which must be fixed before release.

TimP · ‎06-24-2013

The stated intention for that beta compiler is that MKL for MIC should be installed by re-entering the menu after initial installation and using the modify option to add these libraries. This is planned to change before release. I'm still falling back on the 13.1 released compiler installation for those libraries.

Shenghong_G_Intel · ‎06-26-2013

Hi Nathanael,

I have some issues building the code with ICC, although it builds with GCC. I am not sure whether there are some configuration issues. Here is my build script:

cd nschaeff-shtns-341463846739
PATH_FFTW=/opt/intel/composer_xe_2013_sp1.0.040/mkl/include/fftw/

#./configure --enable-openmp --enable-mkl CC=gcc CFLAGS=-I$PATH_FFTW
./configure --enable-openmp --enable-mkl CC=icc CFLAGS=-I$PATH_FFTW
make clean
make
#make
make time_SHT
./time_SHT 1023 -fly -iter=5 -oop

I am building on a 64bit machine, and the errors I get are:

./libshtns.a(sht_init.o): In function `SH_to_point':
sht_init.c:(.text+0x72a2): undefined reference to `__builtin_ia32_vec_ext_v2df'
sht_init.c:(.text+0x72c0): undefined reference to `__builtin_ia32_vec_ext_v2df'
./libshtns.a(sht_init.o): In function `SH_to_grad_point':
sht_init.c:(.text+0x7856): undefined reference to `__builtin_ia32_vec_ext_v2df'
sht_init.c:(.text+0x7880): undefined reference to `__builtin_ia32_vec_ext_v2df'
sht_init.c:(.text+0x78ca): undefined reference to `__builtin_ia32_vec_ext_v2df'
./libshtns.a(sht_init.o):sht_init.c:(.text+0x78e4): more undefined references to `__builtin_ia32_vec_ext_v2df' follow
make: *** [time_SHT] Error 1

It seems like it should work in IA-32 mode only? Should I use the IA-32 ICC to build? (I tried, but ran into other issues with missing compatible libraries for IA-32.)

By the way, if you can isolate a small test case that shows the performance issue, that would be better. I am wondering whether it is caused by wrong code being generated for the vector extensions.

Thanks,

Shenghong
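
A note on the link errors above: as far as I know, GCC uses the `__builtin_ia32_` prefix for its x86 builtins in 64-bit mode as well, so the undefined references more likely mean that the compiler did not recognise `__builtin_ia32_vec_ext_v2df` and treated it as an ordinary function. This is an inference from the error messages, not something confirmed in the thread. Under that assumption, a lane of a two-double vector can be read without the builtin by going through a union. The helper name `v2d_lane` is made up for this sketch.

```c
#include <assert.h>

typedef double v2d __attribute__((vector_size(16)));

/* Read lane i (0 or 1) of a v2d through a union, which needs neither
   vector subscripting nor a GCC-specific builtin. */
double v2d_lane(v2d v, int i)
{
    union { v2d vec; double d[2]; } u;
    u.vec = v;
    return u.d[i];
}

/* Small usage example with a check of the result. */
void lane_example(void)
{
    v2d v = {1.5, 2.5};
    assert(v2d_lane(v, 0) == 1.5);
    assert(v2d_lane(v, 1) == 2.5);
}
```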

Nathanael_S_ · ‎06-26-2013

Hi Shenghong,

Yes, these are errors that I also had to fix. You should download the latest revision:

https://bitbucket.org/nschaeff/shtns/get/tip.tar.gz

Concerning the small test case, it will be difficult, I think. Both analysis and synthesis share the same loop structure; only the operations therein are a little different.
But I will try to pinpoint the exact loop that has performance issues with -O3.

regards,

Shenghong_G_Intel · ‎06-26-2013

Hi,

I get the same error with this version. Could you confirm that this is the latest version? I still need to change the code in 'sht_func.c' to fix the issue we mentioned (cast convention). Maybe you did not check in your code?

If you could pinpoint the loop that generates poorly performing code, that would also be helpful. I guess we can add a timing function around it to make sure the issue is related to a single loop.

Thanks,

Shenghong
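
The timing idea mentioned above could look roughly like the following sketch. It assumes a POSIX system with `clock_gettime`; `suspect_loop` is a hypothetical stand-in for whichever loop is being examined, and the other names are also invented for illustration.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock seconds from a monotonic clock. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hypothetical stand-in for the loop under suspicion: sum of squares. */
double suspect_loop(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * x[i];
    return s;
}

/* Run the loop iters times and return the average seconds per call.
   The accumulated result is written to *sink so that the compiler
   cannot discard the work being timed. */
double time_suspect_loop(const double *x, size_t n, int iters, double *sink)
{
    double acc = 0.0;
    double t0 = now_seconds();
    for (int k = 0; k < iters; k++)
        acc += suspect_loop(x, n);
    double t1 = now_seconds();
    *sink = acc;
    return (t1 - t0) / iters;
}

/* Small usage example with a check of the result. */
void timing_example(void)
{
    double x[4] = {1.0, 1.0, 1.0, 1.0};
    double sink;
    double t = time_suspect_loop(x, 4, 5, &sink);
    assert(t >= 0.0 && sink == 20.0);
}
```

Building the same harness with gcc and icc at -O2 and -O3 and comparing the per-call times is one way to narrow the slowdown to a single loop.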

Nathanael_S_ · ‎06-27-2013

Hi,

I'm really sorry, I forgot to include the last sht_func.c
Can you try again with the same link (updated) ?

vincent_b_ · ‎06-28-2013

Hello,

Thank you for the comments about MIC and beta version of the compiler.

Now I'm facing another problem: Nathanael's program has been developed for SSE / AVX, so at some point it works on 128-bit vectors (composed of two doubles). I wish to keep that 128-bit vector format while porting this code to MIC. Since there is no support for SSE or AVX on MIC, is there any intrinsic to cast these 128-bit vectors into 512-bit vectors? I couldn't find one.

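Also, gcc vector extensions work well on MIC, even for 128-bit vectors.
Here's a code example that works fine on MIC:
#include <stdio.h>

typedef double v2d __attribute__ (( vector_size(8*2) ));    /* two doubles, 16 bytes */
typedef union { v2d i; double v[2]; } vec_v2d;              /* lane access via a union */

int main(void)
{
    v2d a = {1,1};
    v2d b = {4,5};
    v2d c = a + b;                          /* element-wise sum: {5,6} */
    vec_v2d d;
    d.i = c;
    printf("%g,%g\n",d.v[0],d.v[1]);        /* %g prints 5,6 rather than 5.000000,6.000000 */
    return 0;
}

gives as output:
5,6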

So gcc vector extensions are OK for simple operations such as +, -, *, /, etc.
However, when more complex instructions are needed, we're stuck with the fact that no instructions for casting from 128 bits to 512 bits exist on MIC.

Please can you confirm this or give a solution ?

Thank you,
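
No intrinsic for this is confirmed anywhere in the thread. One fallback that uses only the vector extensions is to widen through memory, as in the sketch below. It assumes the compiler accepts `vector_size(64)` the same way it accepts `vector_size(16)`, which has not been checked for icc on MIC, and its performance there is unknown. The names `v8d` and `widen_v2d` are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef double v2d __attribute__((vector_size(16)));   /* 128 bits */
typedef double v8d __attribute__((vector_size(64)));   /* 512 bits */

/* Replicate the two lanes of *in four times across *out:
   {a0, a1} becomes {a0, a1, a0, a1, a0, a1, a0, a1}. */
void widen_v2d(const v2d *in, v8d *out)
{
    double pair[2];
    double lanes[8];
    memcpy(pair, in, sizeof pair);
    for (int i = 0; i < 8; i++)
        lanes[i] = pair[i % 2];
    memcpy(out, lanes, sizeof lanes);
}

/* Small usage example with a check of the result. */
void widen_example(void)
{
    v2d a = {1.0, 2.0};
    v8d w;
    double lanes[8];
    widen_v2d(&a, &w);
    memcpy(lanes, &w, sizeof lanes);
    assert(lanes[0] == 1.0 && lanes[1] == 2.0);
    assert(lanes[6] == 1.0 && lanes[7] == 2.0);
}
```

Replicating the pair is only one possible layout; zero-filling the upper lanes would be an equally simple variant if that suits the surrounding code better.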