
Cilk algorithm slower than the scalar counterpart

Hello everyone,

I am writing a greyscale filter both in Cilk and in scalar code as a university project, to compare Cilk with "normal" code. Each pixel is represented as three RGB floats. As the title states, my issue is that the Cilk version is slower than the scalar one, which I don't really understand. Could you take a look and tell me whether I am doing something wrong, and what the correct approach would be to implement such an algorithm in Cilk?

The code can be found here: https://gist.github.com/feliwir/808d30647501d0f8bb3b31a5f97711dd  

The functions of relevance are greyscale_cilk and greyscale_scalar. I tested the code with both g++ 5.3 and the Intel C++ compiler; with both compilers the Cilk code performs slower than the scalar code.

Thanks in advance for your help.


It looks like your routine "greyscale_cilk" needs at least roughly 2000+ pixels (width*height) to show any performance benefit, according to a quick experiment with icpc. Performance will vary between machines, but you can quickly check whether your data set is big enough. That applies to the part that doesn't use "cilk_for". By the way, I am not sure whether your four implementations are intended to do the same computation, since "n" is unknown. If you are comparing "CILK" vs. "SCALAR" and "CILK MULTITHREADED" vs. "SCALAR MULTITHREADED" respectively (in terms of your algorithm description), the only difference is whether array notation is used, so there is not much you can do except increase the data size, as mentioned.


@Hans,

I don't understand: leaving the multithreading aside, why wouldn't it gain from vectorization?

I thought Cilk would be as efficient as writing SSE intrinsics when dealing with arrays.


I would say the vectorization performance depends on the complexity of the array-notation (AN) operation, and on the resulting vectorization cost. The code below shows the original example (average1) and a simpler example (average2) that uses only unit-stride accesses.

// Original example - average of 3 adjacent elements (stride-3 sections)
void average1(float* p, int length) {
  int len = length / 3;
  p[0:len:3] += p[1:len:3];
  p[0:len:3] += p[2:len:3];
  p[0:len:3] /= 3;
  p[1:len:3] = p[0:len:3];
  p[2:len:3] = p[0:len:3];
}

// Simpler example - average of elements from 3 sub-arrays (unit-stride sections)
void average2(float* p, int length) {
  int len = length / 3;
  p[0:len] += p[len:len];
  p[0:len] += p[2*len:len];
  p[0:len] /= 3;
  p[len:len] = p[0:len];
  p[2*len:len] = p[0:len];
}

The two examples perform similar operations, but the second one performs much better, as the timings below show (average is a plain sequential loop version).

    average of 500 elements took 0.27 usec
    average1 of 500 elements took 0.66 usec
    average2 of 500 elements took 0.33 usec

    average of 1000 elements took 0.47 usec
    average1 of 1000 elements took 1.09 usec
    average2 of 1000 elements took 0.42 usec

    average of 2000 elements took 2.36 usec
    average1 of 2000 elements took 2.11 usec
    average2 of 2000 elements took 0.67 usec

    average of 5000 elements took 6.41 usec
    average1 of 5000 elements took 4.89 usec
    average2 of 5000 elements took 1.26 usec
