Victor,

Victor_L_2 · ‎08-26-2013

I have a very complex project which has being heavily optimized to take advantage of SSSE2 and SSSE3 instructions. For example in my project I have the following defines:

    #define ADDS8(a, b) (_mm_adds_epi8((a),(b)))
   #define SUB8(a, b) (_mm_sub_epi8((a),(b)))
   #define SUBS8(a, b) (_mm_subs_epi8((a), (b)))
   #define ABS8(a) (_mm_abs_epi8((a)))
   #define ADD64(a, b) (_mm_add_epi64((a), (b)))

I'm trying to evaluate if this program might benefit from running on Xeon Phi. My initial approach for porting was to implement __m128i intrinsics in C++ and hope that Intel compiler will vectorize the resulting code. So I've done something along those lines:

#ifndef __MIC__
    #include <emmintrin.h>
   #include <tmmintrin.h>

   typedef __m128i M128i;

   #define SUB8(a, b) (_mm_sub_epi8((a),(b)))
   #define ABS8(a) (_mm_abs_epi8((a)))
   #define ADD64(a, b) (_mm_add_epi64((a), (b)))
#else
   #include <stdint.h>

   typedef union
   {
       int64_t i64[2];
       int32_t i32[4];
       int16_t i16[8];
       int8_t i8[16];
    } M128i;

    inline M128i SUB8(const M128i& a, const M128i& b)
    {
        M128i res;
        for (int i=0; i<16; i++) {
            res.i8 = a.i8 - b.i8;
        }
        return res;
    }

    inline M128i ABS8(const M128i& a)
    {
        M128i res;
        for (int i=0; i<16; i++) {
            res.i8 = abs(a.i8);
        }
        return res;
    }

    inline M128i ADD64(const M128i& a, const M128i& b)
    {
        M128i res;
        res.i64[0] = a.i64[0] + b.i64[0];
        res.i64[1] = a.i64[1] + b.i64[1];
        return res;
    }
#endif

My ported code runs on Phi. I've verified that Intel compiler vectorized SUB8 and ABS8. But my code still runs 10x times slower on Phi than on Xeon processor. Is there any other way I can port __m128i instructions to Phi without rewriting the whole project? Perhaps using __m512i instructions? I would appreciate any advice.

Frances_R_Intel · ‎08-27-2013

Are you also multithreading your code - such as OpenMP or PThreads? To get maximum benefit from the coprocessor, the code should be both parallelized and vectorized.

Victor_L_2 · ‎08-27-2013

I'm using pthread directly. I've experimented with running between 180 and 240 threads. Unfortunately my app segfaults on Phi when I choose 240 threads - I assume I'm running out of memory. On Xeon E5-2670 I run only 32 threads and still significantly outperform Phi.

Victor_L_2 · ‎08-27-2013

Yes, I'm using pthreads. Running 180 threads on Phi.

Frances_R_Intel · ‎08-27-2013

One thing I notice is that by trying to mimic the SSE2 and SSE3 instructions, you are only using 1/4 of each vector register. Besides multithreading, the Intel(r) Xeon Phi(tm) coprocessor only gets maximum performance if you are making use of the extra wide (512) vector registers. I'm not sure what else might be going on in your code (you might want to use VTune, if you have it, to see if trying to mimic the SSE registers is also interfering with optimal cache use) but the short vectors are probably one of the factors you are not seeing performance.

From your comment that this is a complex project, I take it that you are reluctant to either remove the old vector intrinsics or modify the code to allow the coprocessor to deal with longer vectors. That may be what is needed, however. It may seem counter intuitive that the parallelism needs to increase that much to get speed out to the coprocessor, but the papers linked at http://software.intel.com/en-us/articles/is-intelr-xeon-phitm-coprocessor-right-for-you go into some of this.

Victor_L_2 · ‎08-27-2013

Frances, thank you for quick reply.

By running VTune I've noticed that my vectorization intensity is very low. I assume this is because I can't declare my SUB8(), ADD64() etc functions as "elemental functions" ( __attribute((vector)) ), since they return Union. Any suggestions in that regard? Perhaps returning a pointer instead of union?

John_O_Intel · ‎09-09-2013

You can re-write using 512-bit width intrinsics, or you can re-write the intrinsics in C and have the Intel Compiler vectorize it. This allows portability to other instruction sets.

This is my compiler options: icpc -vec-report3 MIC.cc -S -O2 -mmic

Here is my C++ versions of the loops. Not sure about performance compared to your version. I'd be curious what you see in your example. Note MIC currently doesn't have byte level instructions, so the int8 routines below have a loop trip count limit of 16, to get more iterations use an outer loop. Good luck!

John

void subs_int8( const int8_t *a, const int8_t *b, int8_t *d)

#pragma ivdep

#pragma vector aligned

for (int i=0; i<16; i++) { // trip count has to be 16

d = ( a - b < 0) ? 0 : a - b;

}

void sub_int8( const int8_t *a, const int8_t *b, int8_t *c)

{

#pragma ivdep

#pragma vector aligned

for (int i=0; i<16; i++) { // trip count has to be 16

c = a - b;

}

void abs_int8( const int8_t *a, int8_t *c)

{

#pragma ivdep

#pragma vector aligned

for (int i=0; i<16; i++) { // trip count has to be 16

c = abs(a);

}

void add_int64( const int64_t *a, const int64_t *b, int64_t *c)

{

#pragma ivdep

#pragma vector aligned

for (int i=0; i<16; i++) {

c = a+b;

}

void add_int8( const int8_t *a, const int8_t *b, int8_t *c)

{

#pragma ivdep

#pragma vector aligned

for (int i=0; i<16; i++) { // trip count needs to be 16

c = a+b;

}

John_O_Intel · ‎09-09-2013

By the way, I used #pragma ivdep and #pragma vector aligned to simplify the assembly the compiler generates. You should only use the pragmas when your code applies (i.e. no vector dependencies among your arrays, and the arrays are aligned).

cheers,

john

Victor_L_2 · ‎09-10-2013

Thanks John,

My existing code has a lot of complex expressions. For example (and this is still a simple example :) )

^{M128i currentTrace = OR128(currentTrace,
OR128(AND128(V_WINS, MASK_2[j&1]),
AND128(H_WINS, MASK_1[j&1])));}

Will I still get a benefit of vectorization if I modify the functions you've proposed to return a pointer? Something along those lines:

^{int8_t* add_int8( const int8_t *a, const int8_t *b, int8_t *c)}
^{
^{#pragma ivdep}
^{#pragma vector aligned}
^{for (int i=0; i<16; i++)}
^{c = a+b;}

^{return c;}
^}

This way I hope to minimize the amount of changes in my original code.

George_B_1 · ‎09-16-2013

I have a similar need to port a very complex SSE2/3 heavily multi-threaded code to Phi. The prime motivation for the porting has been the performance gain. The original vectorized code utilities heavily SSE2/_mm_madd_epi16 instruction to vectorize upon x8 operands. Once every bit is not “wasted” 16-bit mantissa is totally sufficient for rendering so, porting vectorization to x8 doubles offers no performance benefits in such case. I hoped that 512-bit SIMD may be effectively utilized as 4x128 SIMD units; however, the way Intel suggests porting indicates that whole 512 array is used to perform the same x8 vectorization. Apparently, the new development with x8 doubles vectorization may benefit from less tedious optimization but hardly it is going to offer a speed advantage over alreadyx8 vectorized code via SSE2. I would really glad to be wrong an to learn how to take advantage from Phi 512-bit SIMD to speed-up the code which already uses x8 vectorization via SSE2.

Thanks,

George

Ozan_S_ · ‎10-16-2013

Victor,

As I read the book about Intel Xeon Phi Coprocessor High Performance Programming, I notice that you should not leave many things to the compiler if you can do it yourself as the compiler will have to validate several requirements, which may prove unnecessary as in your case.

In that sense, I would recommend you to vectorize your loops with #pragma simd, taking full responsibility on your side, for example. Or better, dont leave anything to compiler and use new vector instrinsics directly. Create _m512i variables from the pointers you pass into those functions and use the corresponding vector intrinsic operators directly.

And whenever creating M128i type variables, dynamically or statically, ensure that they are cache-aligned, using _mm_alloc or __attribute__((align(64))) during allocation if you will not use the vector intrinsics of Xeon Phi.

You can always get the level of vectorization in your loops passing -vec-report to the compiler.

George_B_1 · ‎10-23-2013

Well, I've never relied on any automatic vectorization in the production code; only explicit vectorization and explicit multi-threading, all float math in time sensitive parts used to be implemented via integers (twice wider x8x16 over x4xfloats) besides some rare exceptions; it does require a tedious bit counting but doubles the performance. Btw, _mm_alloc had an issue back ICC_8.0, from those times I've used my memory management classes it also has tools to ensure coherency of cache access. A lot tools has been developed to ensure the top performance of heavily multi-threaded 128 bit SSE code thus IT IS REALLY PITY that Xeon-Phi does not support SSE128 natively as 4x128. 4-threads per core is a perfect match to support 4x128... I hope Intel reads the customer's wishes, it's the only reason I write here...

James_C_Intel2 · ‎10-24-2013

george@fovia.com wrote:

IT IS REALLY PITY that Xeon-Phi does not support SSE128 natively as 4x128. 4-threads per core is a perfect match to support 4x128... I hope Intel reads the customer's wishes

We have already announced that Knights Landing (the next generation Xeon Phi processor) will implement AVX-512, for which there is a full specification available via http://software.intel.com/en-us/blogs/2013/avx-512-instructions.

Ozan_S_ · ‎10-24-2013

george@fovia.com wrote:

A lot tools has been developed to ensure the top performance of heavily multi-threaded 128 bit SSE code thus IT IS REALLY PITY that Xeon-Phi does not support SSE128 natively as 4x128. 4-threads per core is a perfect match to support 4x128... I hope Intel reads the customer's wishes, it's the only reason I write here...

Same goes for me!

Porting __m128i instructions to Phi