Beginner

support of gcc vector extensions required

Hello,

I have code that uses gcc vector extensions and achieves 80% of the nominal peak performance with AVX vectors.
The gcc vector extensions allow me to write explicitly vectorized code with only very few intrinsics (sums, products and the like are all simply written a+b, a*b, etc.).
The performance is awesome, but when compiled with icc, the code falls back to scalar data types, and the performance is horrible: nearly four times slower (reaching 20% of the nominal peak performance).
Clearly, despite trying hard, icc is not able to vectorize my inner loops correctly.

It would not bother me that much, because gcc is available almost everywhere. However, I'm currently trying to run this on the MIC (Xeon Phi), where I have to go through icc, which leads to poor vectorization and poor performance. (20% of the peak of the MIC will be less than 80% of the peak of a 16-core AVX machine...)

Please support gcc vector extensions in icc!

http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
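
For readers unfamiliar with the extension, here is a minimal sketch of what "explicitly vectorized code written with a+b, a*b" can look like. The kernel and the names v4d and axpy_v4d are hypothetical illustrations, not code from the poster's library; the only assumption is a GNU C compiler that implements the vector_size attribute.

```c
#include <stddef.h>

/* Four doubles per vector: 32 bytes, the width of one AVX register. */
typedef double v4d __attribute__ ((vector_size (32)));

/* y[i] = a * x[i] + y[i] on whole vectors, using plain C operators.
   n4 is the number of 4-element vectors (hypothetical helper). */
void axpy_v4d(double a, const v4d *x, v4d *y, size_t n4)
{
    const v4d va = {a, a, a, a};      /* broadcast the scalar by hand */
    for (size_t i = 0; i < n4; i++)
        y[i] = va * x[i] + y[i];      /* element-wise multiply and add */
}
```

With gcc, building with -mavx (or -march=native on an AVX machine) lets each statement map to 256-bit instructions; without it the compiler still accepts the code but lowers it to narrower operations.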

0 Kudos
35 Replies

Hi,

Thank you for submitting the issue. I will file a feature request for you.

Thanks,

Shenghong

0 Kudos
Valued Contributor II

>>...The performance is awesome, but when compiled with icc, it falls back to scalar data-types, and it turns out that
>>the performance is horrible, nearly four times slower (reaching 20% of the nominal peak performance).
>>Clearly despite trying hard, icc is not able to vectorize my inner loops correctly.

I don't think that you've used all the capabilities of the Intel C++ compiler, and you have not provided any test cases with performance numbers. At the same time, I admit that GCC-like C++ compilers, for example MinGW for Windows, are doing a good job when it comes to performance.

I just completed a verification with version 12 (using the latest update) of the Intel C++ compiler, and here are technical details with real performance numbers:

Vec_samples.zip was used from the ..\Composer XE\Samples\en_US\C++ folder (for a Windows platform).

[ Test 1 - Generic settings - Release ]
ROW:256 COL: 256
Execution time is 12.750 seconds
GigaFlops = 0.673720
Sum of result = 1279224.000000

[ Test 2 - Vectorization & Alignment & Inlining & IPO & /O3 are used - Release ]
ROW:256 COL: 256
Execution time is 4.734 seconds
GigaFlops = 1.814519
Sum of result = 1279224.000000

As you can see, Test 2 is ~2.7 times faster than Test 1.
0 Kudos
Beginner

Hello Sergey,

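I don't see the relation of your test case with icc's lack of gcc vector extensions. Anyway, you're right, I have not provided a test case. So here we go with SHTns, a real-world library used in high-performance computing.

- download the latest version : https://bitbucket.org/nschaeff/shtns/get/default.zip
- ./configure --enable-openmp --enable-mkl CC=gcc
- make
- make time_SHT
- ./time_SHT 1023 -fly -iter=5 -oop

This last line will run a timing program which will display several execution times.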

Repeat by replacing CC=icc when running the configure script. Feel free to modify the Makefile with any options you'd like.
See how the executable produced by gcc is about 3 times faster than icc (assuming you have an AVX cpu).

Cheers.

0 Kudos
Valued Contributor II

>>So here we go with SHTns, a real-world library used in high-performance computing.
>>
>>- download the latest version : https://bitbucket.org/nschaeff/shtns/get/default.zip
>>- ./configure --enable-openmp --enable-mkl CC=gcc
>>- make
>>- make time_SHT
>>- ./time_SHT 1023 -fly -iter=5 -oop
>>
>>This last line will run a timing program which will display several execution times.
>>
>>Repeat by replacing CC=icc when running the configure script. Feel free to modify the Makefile with any options you'd like.
>>See how the executable produced by gcc is about 3 times faster than icc (assuming you have an AVX cpu).

Thanks for the information about SHTns. I think Intel software engineers of Intel C++ compiler team should look at the library and test cases because 3x difference is impressive.

And, if you did some measurements please post results in order to demonstrate to everybody that there are significant improvements in code generation of GCC C++ compiler.
0 Kudos

Hi Nathanael,

Thank you for your update. The Intel compiler actually does support some vector operations.
For example:

-bash-4.1$ cat > gnu.c

typedef int v4si __attribute__ ((vector_size (16)));
v4si a, b, c;

void func()
{
    c = a + b;
}
^C
-bash-4.1$ icc -S gnu.c

ICC generates the following code:

func:
..B1.1:                        # Preds ..B1.0
..___tag_value_func.1:                                          #6.1
        movdqa    a(%rip), %xmm0                                #7.6
        paddd    b(%rip), %xmm0                                #7.14
        movdqa    %xmm0, c(%rip)                                #7.6
        ret                                     

The list of supported expressions grows from older to newer ICC versions.

The 13.0 compiler has only initial support for vector operations: +, -, *, unary -, /
The 14.0 compiler supports more operations (comparisons) and more types of vectors (vector_length x vector_element).
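
As an illustration, a small test file covering the five operations listed for 13.0 could look like the sketch below. The function names are hypothetical, and nothing here has been checked against a specific ICC release; it only shows the GNU C syntax involved.

```c
/* Four 32-bit integers per vector, as in the example above. */
typedef int v4si __attribute__ ((vector_size (16)));

/* One function per operator so each can be inspected separately with -S. */
v4si vec_add(v4si a, v4si b) { return a + b; }
v4si vec_sub(v4si a, v4si b) { return a - b; }
v4si vec_mul(v4si a, v4si b) { return a * b; }
v4si vec_neg(v4si a)         { return -a;    }
/* Element-wise integer division: the caller must avoid zero divisors. */
v4si vec_div(v4si a, v4si b) { return a / b; }
```

Compiling such a file with -S, as shown earlier, makes it easy to see which operators map to single vector instructions and which are expanded element by element.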

If you can provide a list of the vector operations you want supported first (or the vector operations for which the generated code performs badly), that would help, as we can implement them according to the priority in your list.

Regarding the SHTns test case, I will give it a try to see why ICC is generating bad code.

Thanks,
Shenghong

0 Kudos

Hi Nathanael,

I get the errors below while building your test case (.hg not found):
gcc -march=native -O2 -I/usr/local/include -L/usr/local/lib -ffast-math -fomit-frame-pointer -std=gnu99 -fopenmp  -D_GNU_SOURCE -L/usr/local/lib -I/usr/local/include -O2 -D_HGID_="\"`hg id -ti`\"" -c sht_init.c -o sht_init.o
abort: there is no Mercurial repository here (.hg not found)
In file included from sht_init.c:31:
sht_private.h:29:19: error: fftw3.h: No such file or directory

Any suggestions for this?

By the way, can this test case demonstrate that the bad performance is caused by vector operations? Again, please list the vector operations that are used in your test case and cause bad performance. I want to check whether the code generated for these vector operations is correct, and report any issues to the developers. First of all, I need to know whether the bad performance of ICC is due to vector operation support or to other possible reasons.

Thanks,
Shenghong

0 Kudos
Valued Contributor II

Hi Shenghong,

>>...sht_private.h:29:19: error: fftw3.h: No such file or directory

If you have MKL on your development / test computer, try to use fftw3.h from the [ ICCInstallDir ]\Composer XE\MKL\Include\fftw folder.
0 Kudos
Beginner

Hello Shenghong,

This is good news. I actually need only simple arithmetic operations, so the support in version 13 should be enough.
Is there documentation somewhere for this vector support ?

When changing your example file into :

#include <pmmintrin.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d a, b, c;
void func()
{
    a = _mm_set1_pd(1.0);
    c = a + b;
    b = c + _mm_set1_pd(1.0);
}

The last line " b = c + _mm_set1_pd(1.0);" reports an error, while the first line a = _mm_set1_pd(1.0); works well. Why is it so ?
What is the preferred way to set all elements of a vector to the same value ?

PS: for the reported error, Sergey gave the right answer, thank you Sergey.

0 Kudos
Beginner

Sergey Kostrov wrote:

>>So here we go with SHTns, a real-world library used in high-performance computing.
>>
>>- download the latest version : https://bitbucket.org/nschaeff/shtns/get/default.zip
>>- ./configure --enable-openmp --enable-mkl CC=gcc
>>- make
>>- make time_SHT
>>- ./time_SHT 1023 -fly -iter=5 -oop
>>
>>This last line will run a timing program which will display several execution times.
>>
>>Repeat by replacing CC=icc when running the configure script. Feel free to modify the Makefile with any options you'd like.
>>See how the executable produced by gcc is about 3 times faster than icc (assuming you have an AVX cpu).

Thanks for the information about SHTns. I think Intel software engineers of Intel C++ compiler team should look at the library and test cases because 3x difference is impressive.

And, if you did some measurements please post results in order to demonstrate to everybody that there are significant improvements in code generation of GCC C++ compiler.

Here are the results I obtain with 16 threads on a 16-core Sandy Bridge machine (2.6 GHz if I recall correctly):

  • ICC (auto-vectorization)
    ./time_SHT 1023 -fly -iter=5 -oop
    synthesis = 29ms, analysis=27ms
  • GCC (vectorization using gcc vector extensions)
    ./time_SHT 1023 -fly -iter=5 -oop

    synthesis = 8.7ms, analysis = 8.4ms  [which corresponds to more than 80% of the peak flops]
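
For reference, the ratios implied by these timings can be worked out directly. The nominal peak in the last line is an assumption: it takes the 2.6 GHz clock quoted above at face value and assumes 8 double-precision flops per cycle per core (one 4-wide AVX add plus one 4-wide AVX multiply), the usual figure for Sandy Bridge.

```latex
\begin{align*}
\text{speedup}_{\text{synthesis}} &= \frac{29\ \text{ms}}{8.7\ \text{ms}} \approx 3.3 \\
\text{speedup}_{\text{analysis}}  &= \frac{27\ \text{ms}}{8.4\ \text{ms}} \approx 3.2 \\
P_{\text{peak}} &\approx 16\ \text{cores} \times 2.6\ \text{GHz} \times 8\ \tfrac{\text{flops}}{\text{cycle}} \approx 333\ \text{Gflop/s}
\end{align*}
```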

With icc 13, I still cannot compile the vector extensions in SHTns due to a strange "internal error" reported by icc :

internal error: 0_1000_3

which does not help a lot...

0 Kudos
Beginner

Hi Shenghong,

As mentioned by Sergey, you have to add the path PATH_FFTW=.../composer_version/mkl/include/fftw to the compiler flags as -I$PATH_FFTW
You also need to comment out the #undef _GCC_VEC_ in sht_private.h in order to compile with gcc vector extensions (the program detects whether it is compiled with icc or gcc and disables gcc vector extensions for icc).
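
The kind of compiler detection described here can be sketched as follows. This is not the actual logic in sht_private.h: the macro USE_VECTOR_EXT and the function add_pairs are hypothetical names. The one fact relied on is that icc predefines __INTEL_COMPILER (and, on Linux, also __GNUC__), so testing __GNUC__ alone is not enough to tell the two compilers apart.

```c
/* Hypothetical feature switch; SHTns uses its own macro (_GCC_VEC_)
   and its own detection, which may differ from this sketch. */
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
  #define USE_VECTOR_EXT 1
#else
  #define USE_VECTOR_EXT 0
#endif

#if USE_VECTOR_EXT
typedef double v2d __attribute__ ((vector_size (16)));

/* Vector path: add two pairs of doubles with one vector operation. */
void add_pairs(const double *a, const double *b, double *c)
{
    v2d va = {a[0], a[1]};
    v2d vb = {b[0], b[1]};
    v2d vc = va + vb;
    c[0] = vc[0];
    c[1] = vc[1];
}
#else
/* Scalar fallback with the same interface. */
void add_pairs(const double *a, const double *b, double *c)
{
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
}
#endif
```

Removing the !defined(__INTEL_COMPILER) test plays the same role as commenting out the #undef: it forces the vector path even under icc.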

Errors still occur during compilation with icc and gcc vector extensions.

Sincerely,

Vincent

0 Kudos

Hi Vincent,

Thank you for this information. I will check according to your suggestions and post an update here.

Thanks

Shenghong

0 Kudos
Valued Contributor II

You have a compilation error because _mm_set1_pd doesn't return a value of type double. That intrinsic function is declared as follows:

[ emmintrin.h ]
...
extern __m128d __ICL_INTRINCC _mm_set1_pd( double );
...

and there is no declared / implemented C++ operator +. Please try to use the API from the dvec.h header file because this is:
...
Definition of a C++ class interface to Intel(R) Pentium(R) 4 processor SSE2 intrinsics.
...
0 Kudos

Nathanael S. wrote:

Hello Shenghong,

This is good news. I actually need only simple arithmetic operations, so the support in version 13 should be enough.
Is there documentation somewhere for this vector support ?

When changing your example file into :

#include <pmmintrin.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d a, b, c;
void func()
{
    a = _mm_set1_pd(1.0);
    c = a + b;
    b = c + _mm_set1_pd(1.0);
}

The last line " b = c + _mm_set1_pd(1.0);" reports an error, while the first line a = _mm_set1_pd(1.0); works well. Why is it so ?
What is the preferred way to set all elements of a vector to the same value ?

PS: for the reported error, Sergey gave the right answer, thank you Sergey.

Please try something like: b = a + (v2d)_mm_set1_ps(200.0);

I have verified that GCC accepts the implicit conversion, but ICC does not. I will report it to the developers to see whether it can be supported/fixed.

Note: I have verified that the explicit conversion does not cause any error (I printed the results and all are as expected).

void func() {
// a=(v2d){1.0f,2.0f,3.0f,4.0f}; // works
    a = _mm_set1_ps(100.0);   // works
//  b = a + _mm_set1_ps(200.0);  // gcc works, icc not work
    b = a + (v2d)_mm_set1_ps(200.0);  // works
}

Feel free to let me know all the unsupported cases of vector operations you meet, as the developers may need a specific test case to fix each one. It will save me some time finding the unsupported code in your test case, as your project has lots of code. Much appreciated.

Thanks,

Shenghong

0 Kudos
Beginner

Hi Sergey,

Sergey Kostrov wrote:

You have a compilation error because _mm_set1_pd doesn't return a value of type double. That intrinsic function is declared as follows:

I don't want a double, I want a vector of double. An explicit cast to my vector type (v2d) solves the compilation problem of this tiny example, and generates the correct machine instructions. It is strange that this explicit cast is needed though. All this shows that icc has indeed support for vector extensions, which is great news. However, in a full scale application (SHTns), strange "internal errors" reported above make the compilation fail, and I have no clue of what to do with these ! So something seems broken...

Sergey Kostrov wrote:

and there is No declared / implemented C++ operator +.

You do not seem to know the vector extensions that are the heart of this discussion. Vector extensions DO define the natural behaviour of arithmetic operations. It is not C++, just an extension of C (you could compare it to what OpenCL does with vector types)

Sergey Kostrov wrote:

Please try to use API from dvec.h header file because this is:
...
Definition of a C++ class interface to Intel(R) Pentium(R) 4 processor SSE2 intrinsics.

Thanks for the suggestion, but it is not suitable for at least two reasons:

  • it is a C++ class (my code is in plain C)
  • the header found in dvec.h says : "Speed and accuracy are sacrificed for utility." which completely defeats my purpose.

0 Kudos
Beginner

shenghong-geng (Intel) wrote:

Please try to use like: b = a + (v2d)_mm_set1_ps(200.0);

I have verified that GCC accepts the implicit conversion, but ICC does not. I will report it to the developers to see whether it can be supported/fixed.

Feel free to let me know all the unsupported cases of vector operations you meet, as the developers may need a specific test case to fix each one. It will save me some time finding the unsupported code in your test case, as your project has lots of code. Much appreciated.

Thanks,

Shenghong

Thank you very much for your help.

Yes, using the explicit cast to (v2d) works in your example. I have added explicit conversions in my code, which suppresses all the errors.
However, the compilation still fails with "internal error 0_1000_3". I have no clue what that is... Can you get more information on this?

0 Kudos

Hi Nathanael,

Great that you have added that to your code. I am doing the same work, but I need to update the casts one by one and compile again and again... Is it possible for you to share with me the updated code with the explicit casts? I will then investigate the internal error you have.

 

By the way, below are the places where I have added explicit casts, but there are still some similar errors:

in sht_private.h:
 #define vdup(x) (v2d)_mm_set1_pd(x)

in sht_func.c(135, 137):
 ((v2d*)q0)[(ntheta-1-k)*lmax +(l-1)] += (v2d)_mm_xor_pd(sgnt, qc);
 ((v2d*)q0)[(ntheta+k)*lmax +(l-1)] += (v2d)_mm_xor_pd( sgnt, qc );


in sht_private.h:
 #define vxchg(a) (s2d)_mm_shuffle_pd(a,a,1)
 #define vall(x) (rnd)_mm256_set1_pd(x)
 _mm256_loadu_pd
 _mm256_set1_pd
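
The vdup cast above can be tried in isolation with a sketch like the following. It assumes an x86 target with SSE2 and a vector of two doubles; vdup_lit, add_one and add_one_lit are hypothetical names added for illustration. The compound-literal variant is a GNU C alternative that avoids the intrinsic altogether; whether ICC 13 accepts it has not been verified here.

```c
#include <emmintrin.h>   /* SSE2: _mm_set1_pd */

typedef double v2d __attribute__ ((vector_size (16)));

/* Broadcast via the intrinsic, with the explicit (v2d) cast
   discussed in this thread. */
#define vdup(x) ((v2d)_mm_set1_pd(x))

/* Intrinsic-free alternative: a vector compound literal (GNU C). */
#define vdup_lit(x) ((v2d){(x), (x)})

/* Both functions add 1.0 to each element of c. */
v2d add_one(v2d c)     { return c + vdup(1.0); }
v2d add_one_lit(v2d c) { return c + vdup_lit(1.0); }
```

Wrapping the cast in one macro keeps the workaround in a single place, so it can be dropped once the compiler accepts the uncast expression.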

Thanks,

Shenghong

0 Kudos
Beginner

shenghong-geng (Intel) wrote:

Hi Nathanael,

Great that you have added that to your code. I am doing the same work, but I need to update the casts one by one and compile again and again... Is it possible for you to share with me the updated code with the explicit casts? I will then investigate the internal error you have.

Sure, please find attached:

  • SHT.tar.gz should replace the content of the SHT sub-directory
  • sht_private.h should replace the file of the corresponding name.

The whole library is not converted yet, but if you type

"make sht_ltr.o"

you should read :

SHT/spat_to_SHst_fly.c(124) (col. 6): internal error: 0_1000_3

Your help is much appreciated.

0 Kudos
Beginner

And here are the files ...

0 Kudos

Hi Nathanael,

I can reproduce the internal error with the 13.1 compiler. It seems to be related to the piece of code below (I expanded the macros):

v2d reo[2*NLAT_2];
double *zl;

q0  += ((double*)reo)[2*(i)] * vdup(zl);
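
A reduced test case built around this fragment might look like the sketch below. It is only a guess at the shape of such a reproducer: the function name accumulate, the loop, the use of zl[i] as the broadcast value, and the compound-literal vdup are all assumptions, and it has not been checked that this triggers the internal error in 13.1. The pattern it keeps from the fragment is a scalar read through a double* cast multiplied by a broadcast vector and accumulated into a v2d.

```c
typedef double v2d __attribute__ ((vector_size (16)));

/* Stand-in for the library's vdup macro (assumption: plain broadcast). */
#define vdup(x) ((v2d){(x), (x)})

/* Accumulate scalar * vector products, mirroring the expanded macro code.
   reo holds n vectors; zl holds n scalars (hypothetical layout). */
v2d accumulate(const v2d *reo, const double *zl, int n)
{
    v2d q0 = {0.0, 0.0};
    for (int i = 0; i < n; i++)
        q0 += ((const double *)reo)[2 * i] * vdup(zl[i]);
    return q0;
}
```

GNU C promotes the scalar operand to a vector in a mixed scalar/vector expression, which is what makes the multiplication above legal under gcc.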

In my understanding, it is easy to build a small test case just from this code. However, when I check the issue with the 14.0 compiler (in beta now), it is fixed. Is it possible for you to try the 14.0 beta, and wait for the 14.0 release?

Thanks,

Shenghong 

0 Kudos