Showing results for

- Intel Community
- Software Products
- Software Archive
- Cilk algorithm slower than the scalar counterpart

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Highlighted
##

Stephan_V_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-04-2016
03:06 AM

14 Views

Cilk algorithm slower than the scalar counterpart

Hello everyone,

I am writing a greyscale both in cilk and scalar as a project for university to compare cilk with "normal" code. Each pixel is represented as RGB (floats), as the title already states my issue is that the cilk is slower than the scalar part which i don't really understand so maybe could maybe take a look and tell me if i am doing something wrong and what would be the correct approach to implement such an algorithm in Cilk.

The code can be found here: https://gist.github.com/feliwir/808d30647501d0f8bb3b31a5f97711dd

There functions of relevance are greyscale_cilk and greyscale_scalar. I tested the code both with the g++ 5.3 and the Intel C++ compiler, with both compilers the cilk code performing slower than the scalar code.

Thanks for your help in advance

3 Replies

Highlighted
##

It looks like your routine "grayscale_cilk" requires at least 2000+ pixels (width*height) to get some performance benefit according to my quick experiment using icpc. Performance may vary between different machines, but you can quickly check if your data set is big enough. That is for the part that doesn't use "cilk_for", and I am not sure if your four different implementations are intended to do the same computation by the way since "n" is unknown. If you are comparing performance of "CILK" vs "SCALAR", and "CILK MULTITHREADED" vs "SCALAR MULTITHREADED" respectively (in terms of your algorithm description), whether array notation is used or not is the only difference, so there is not much you can do except increasing the data size as I mentioned.

Hansang_B_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-11-2016
01:29 PM

14 Views

Highlighted
##

Royi

Novice

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-23-2016
03:39 AM

14 Views

@Hans,

I don't understand, leaving the Multi Threading aside, why woudln't it gain form Vectorization?

I thought CILK would be as efficient as writing in SSE Intrinsics dealing with arrays.

Highlighted
##

Hansang_B_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-26-2016
09:15 AM

14 Views

I would say the vectorization performance depends on the complexity of AN operation (and resulting vectorization cost). The code below shows the original example (average1) and a simpler example (average2) only with unit-stride accesses.

// Original example - average of 3 adjacent elements void average1(float* p, int length) { len = length/3; p[0:len:3] += p[1:len:3]; p[0:len:3] += p[2:len:3]; p[0:len:3] /= 3; p[1:len:3] = p[0:len:3]; p[2:len:3] = p[0:len:3]; } // Simpler example - average of elements from 3 sub arrays void average2(float* p, int length) { len = length/3; p[0:len] += p[len:len]; p[0:len] += p[2*len:len]; p[0:len] /= 3; p[len:len] = p[0:len]; p[2*len:len] = p[0:len]; }

The two examples perform similar operations, but the second one performs much better as shown below (average is a sequential version).

average of 500 elements took 0.27 usec

average1 of 500 elements took 0.66 usec

average2 of 500 elements took 0.33 usec

average of 1000 elements took 0.47 usec

average1 of 1000 elements took 1.09 usec

average2 of 1000 elements took 0.42 usec

average of 2000 elements took 2.36 usec

average1 of 2000 elements took 2.11 usec

average2 of 2000 elements took 0.67 usec

average of 5000 elements took 6.41 usec

average1 of 5000 elements took 4.89 usec

average2 of 5000 elements took 1.26 usec

For more complete information about compiler optimizations, see our Optimization Notice.