To get a better idea of MIC's single core, single threaded performance, I tried the following simple experiment:
The following is simple, unvectorized code in which I take two vectors "arr4" and "arr5" of length LENGTH and multiply their corresponding elements with each other, LOOP times. I have kept LENGTH short enough that both vectors fit in the L1 cache, so this shouldn't be memory bound. For example: LOOP = 1000000 and LENGTH < 256 (which should fit within the L1 cache).
I compiled without using any optimization flags.
[cpp]
for(size_t j=0;j<LOOP;j++){
    for(int i=0;i<LENGTH;i+=2){
        real[i/2] = arr4[i]*arr5[i];
        im[i/2] = arr4[i+1]*arr5[i+1];
    }
}
[/cpp]
I get 0.28 Gflops when this runs on the MIC. I count the number of floating-point operations as LENGTH*LOOP, since I perform two floating-point operations in every iteration of the inner loop, which runs LENGTH/2 times.
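For reference, the Gflops number comes from a measurement along these lines (a sketch, not my exact harness; the gettimeofday-based timer is just one way to do it):
[cpp]
/* Sketch of the measurement: wall-clock the loop nest and divide the
   flop count (2 flops per inner iteration, LENGTH/2 iterations, LOOP repeats)
   by the elapsed seconds. */
#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* ... inside main(), wrapping the loop nest shown above ... */
double t_begin = now_seconds();
/* loop nest here */
double t_finish = now_seconds();

double flops  = (double)LENGTH * (double)LOOP;
double gflops = flops / (t_finish - t_begin) / 1e9;
printf("%.2f Gflops\n", gflops);
[/cpp]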
Now I try vectorization as per the following code:
[cpp]
for(size_t j=0;j<LOOP;j++){
    for(int i=0;i<LENGTH;i+=8){
        __m512d m = _mm512_load_pd(&arr1[i]);
        __m512d in = _mm512_load_pd(&arr2[i]);
        t0 = _mm512_mul_pd(m,in);
    }
}
[/cpp]
I get 0.6 Gflops when this runs on the MIC. The number of floating-point operations executed is the same.
I then tried a non-trivial scenario in the vectorized case, i.e. the Hadamard (element-wise complex) product of the two vectors, as follows:
[cpp]
for(size_t j=0;j<LOOP;j++){
    for(int i=0;i<LENGTH;i+=8){
        __m512d m = _mm512_load_pd(&arr1[i]);
        __m512d m_r = _mm512_swizzle_pd(m,_MM_SWIZ_REG_CDAB);
        __m512d in = _mm512_load_pd(&arr2[i]);
        __m512d in_r = _mm512_swizzle_pd(in,_MM_SWIZ_REG_CDAB);
        __m512d reals = _mm512_mask_swizzle_pd(m,0xAA,m,_MM_SWIZ_REG_CDAB);
        __m512d imags = _mm512_mask_sub_pd(m,0x55,zero,m_r);
        t0 = _mm512_mul_pd(reals,in);
        t0 = _mm512_fmadd_pd(imags,in_r,t0);
    }
}
[/cpp]
I get ~1.2 Gflops here; I did account for the different number of floating-point operations in this case.
Shouldn't I get 1 Gflops for the unvectorized case and ~8 Gflops for the vectorized case?
Thanks,
Bharat.
Hi Bharat,
Vectorization does not guarantee a speedup of 8x because it may add instructions necessary to reorder the elements of vectors (as in the case of complex multiply) before or after computation.
To let the compiler vectorize more code, you may want to indicate that the input and output arrays do not overlap using #pragma ivdep for the loops and/or the restrict specifier for the pointers. Using the standard complex numbers declared in <complex.h> may also help.
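For example, something along these lines (just a sketch, reusing the array names from your post; restrict is the C99 spelling, __restrict__ works in C++ with icc):
[cpp]
/* Sketch: restrict-qualified pointers plus #pragma ivdep tell the compiler
   the arrays do not overlap, so it can vectorize the plain scalar loop. */
void multiply(double *restrict real, const double *restrict arr4,
              const double *restrict arr5, int length)
{
    #pragma ivdep
    for (int i = 0; i < length; i++)
        real[i] = arr4[i] * arr5[i];
}
[/cpp]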
Thanks,
Evgueni.
Another thing to consider, Bharat. Because of the two-stage decoder, it is not possible to schedule successive instructions from one thread on adjacent clocks in a core. You need to run a minimum of two threads to take advantage of all a core's resources.
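For example, one way to put two threads on the same core for your multiply loop (a sketch; the affinity settings in the comment are one possibility, and the exact placement syntax may vary):
[cpp]
/* Sketch: split each pass over the arrays between two OpenMP threads.
   Launched with, e.g.,
       export OMP_NUM_THREADS=2
       export KMP_AFFINITY=compact
   "compact" packs consecutive threads onto hardware contexts of the same
   core, so both threads share one core's issue slots. */
#pragma omp parallel num_threads(2)
for (size_t j = 0; j < LOOP; j++) {
    #pragma omp for
    for (int i = 0; i < LENGTH; i++)
        real[i] = arr4[i] * arr5[i];
}
[/cpp]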
I modified the code to the following, for now sticking with the simple multiply, and I did an "export OMP_NUM_THREADS=2".
I also declared the pointers ("arr1", "arr2", "arr4" and "arr5") as __restrict__. Since the complex multiply has the overhead of reordering the elements (the swizzles etc.), I haven't included it here.
Using vector intrinsics:
[cpp]
#pragma ivdep
#pragma omp parallel for
for(size_t j=0;j<LOOP;j++){
    for(int i=0;i<LENGTH;i+=8){
        __m512d m = _mm512_load_pd(&arr1[i]);
        __m512d in = _mm512_load_pd(&arr2[i]);
        t0 = _mm512_mul_pd(m,in);
    }
}
[/cpp]
Unvectorized:
[cpp]
#pragma ivdep
#pragma omp parallel for
for(size_t j=0;j<LOOP;j++){
    for(int i=0;i<LENGTH;i++){
        real[i] = arr4[i]*arr5[i];
    }
}
[/cpp]
Again, I didn't use any compiler optimizations, since I don't want the compiler to vectorize the code itself, only to use the intrinsics I have provided. I see a performance decrease, i.e. the vectorized code now gives me 0.5 Gflops.
If I remove the parallelism and use only the ivdep pragma and the restricted pointers, I get ~3 Gflops when LENGTH is 128 and ~0.5 Gflops when LENGTH is 256, even though both sizes fit in the cache.
For the case of simple multiply, how do I completely utilize a single core's resources, so as to get single core peak performance?
Bharat,
Unfortunately, sometimes forum posts fall through the cracks.
Did you get a resolution to this issue?
Regards
--
Taylor
I see very similar behaviour. The short explanation is that the scalar part of Xeon Phi is not fast enough to execute all of the loop's bookkeeping instructions (counter management, address computation, etc.) while keeping the vector unit fully busy in such a simple loop.
Add some more vector instructions and you'll see performance improve.
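For instance, unrolling your multiply loop so that each trip issues several independent vector multiplies amortizes the counter and address bookkeeping over more vector work (a sketch assuming LENGTH is a multiple of 32; t0..t3 would be declared like your t0):
[cpp]
/* Sketch: 4 vector multiplies (32 doubles) per loop trip instead of 1,
   so the scalar increment/compare/branch overhead is paid once per 4
   multiplies rather than once per multiply. */
for (size_t j = 0; j < LOOP; j++) {
    for (int i = 0; i < LENGTH; i += 32) {
        __m512d a0 = _mm512_load_pd(&arr1[i]);
        __m512d a1 = _mm512_load_pd(&arr1[i + 8]);
        __m512d a2 = _mm512_load_pd(&arr1[i + 16]);
        __m512d a3 = _mm512_load_pd(&arr1[i + 24]);
        __m512d b0 = _mm512_load_pd(&arr2[i]);
        __m512d b1 = _mm512_load_pd(&arr2[i + 8]);
        __m512d b2 = _mm512_load_pd(&arr2[i + 16]);
        __m512d b3 = _mm512_load_pd(&arr2[i + 24]);
        t0 = _mm512_mul_pd(a0, b0);
        t1 = _mm512_mul_pd(a1, b1);
        t2 = _mm512_mul_pd(a2, b2);
        t3 = _mm512_mul_pd(a3, b3);
    }
}
[/cpp]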
Yes. Each core is a Pentium generation machine with some architectural improvements and innovations. But from a scalar standpoint, it's a Pentium.
I'm sure you have all heard the following a hundred times.
- Performance of the vector engine: you can measure this, but remember that it is only the performance of the vector engine and not that of the coprocessor.
- Size of the problem set: even looking only at the vector engine, if the problem isn't large enough, the scalar part (setup, tear-down, etc.) will kill your performance. Scalar code runs at Pentium speeds.
- Performance of the coprocessor: to judge the performance of the coprocessor, you need to exploit the large number of cores, the vector engine, and the multiple threads per core.
Regards
--
Taylor
