Hello everyone,
I am writing a greyscale conversion both in Cilk and in scalar code as a university project to compare Cilk with "normal" code. Each pixel is represented as RGB (floats). As the title states, my issue is that the Cilk version is slower than the scalar version, which I don't really understand. Could you take a look and tell me if I am doing something wrong, and what the correct approach would be to implement such an algorithm in Cilk?
The code can be found here: https://gist.github.com/feliwir/808d30647501d0f8bb3b31a5f97711dd
The functions of relevance are greyscale_cilk and greyscale_scalar. I tested the code with both g++ 5.3 and the Intel C++ compiler; with both compilers the Cilk code performs slower than the scalar code.
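For context, here is a minimal sketch of the kind of conversion I mean. The luminance weights (Rec. 601) and the interleaved r,g,b layout are assumptions for illustration, not the exact code from the gist:

```cpp
#include <cstddef>

// Scalar greyscale over interleaved RGB floats (illustrative sketch).
void greyscale_scalar_sketch(float* p, std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        float grey = 0.299f * p[3 * i]      // red weight
                   + 0.587f * p[3 * i + 1]  // green weight
                   + 0.114f * p[3 * i + 2]; // blue weight
        p[3 * i] = p[3 * i + 1] = p[3 * i + 2] = grey;
    }
}

// A Cilk Plus array-notation equivalent would use strided sections;
// it needs icpc or a Cilk-enabled g++, so it is shown as a comment only:
//
// p[0:pixels:3] = 0.299f * p[0:pixels:3]
//               + 0.587f * p[1:pixels:3]
//               + 0.114f * p[2:pixels:3];
// p[1:pixels:3] = p[0:pixels:3];
// p[2:pixels:3] = p[0:pixels:3];
```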
Thanks in advance for your help.
It looks like your routine greyscale_cilk needs at least roughly 2000+ pixels (width*height) to get a performance benefit, according to my quick experiment with icpc. Performance will vary between machines, but you can quickly check whether your data set is big enough. That applies to the part that doesn't use cilk_for. By the way, I am not sure whether your four different implementations are intended to do the same computation, since "n" is unknown. If you are comparing "CILK" vs "SCALAR" and "CILK MULTITHREADED" vs "SCALAR MULTITHREADED" respectively (in terms of your algorithm description), the only difference is whether array notation is used, so there is not much you can do except increase the data size as I mentioned.
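One quick way to check the size threshold is to time the routine over increasing data sizes with std::chrono. A sketch of such a harness (the loop body is a stand-in for the actual greyscale call, not the gist code):

```cpp
#include <chrono>
#include <vector>

// Times one pass over the data and returns microseconds.
// Replace the placeholder loop with the routine under test.
double time_pass_usec(std::vector<float>& data) {
    auto t0 = std::chrono::steady_clock::now();
    for (float& x : data) x *= 0.5f;   // placeholder work
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

Running this for, say, 1000, 10000, and 100000 pixels for both the Cilk and the scalar version shows where (or whether) the crossover point lies on your machine.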
@Hans,
I don't understand: leaving the multithreading aside, why wouldn't it gain from vectorization?
I thought Cilk array notation would be as efficient as writing SSE intrinsics for dealing with arrays.
I would say vectorization performance depends on the complexity of the array-notation (AN) operations (and the resulting vectorization cost). The code below shows the original example (average1) and a simpler version (average2) that uses only unit-stride accesses.
// Original example - average of 3 adjacent elements
void average1(float* p, int length)
{
    int len = length / 3;
    p[0:len:3] += p[1:len:3];
    p[0:len:3] += p[2:len:3];
    p[0:len:3] /= 3;
    p[1:len:3] = p[0:len:3];
    p[2:len:3] = p[0:len:3];
}

// Simpler example - average of elements from 3 sub-arrays
void average2(float* p, int length)
{
    int len = length / 3;
    p[0:len] += p[len:len];
    p[0:len] += p[2*len:len];
    p[0:len] /= 3;
    p[len:len] = p[0:len];
    p[2*len:len] = p[0:len];
}
The two examples perform similar operations, but the second performs much better, as the timings below show (average is a sequential scalar version).
average of 500 elements took 0.27 usec
average1 of 500 elements took 0.66 usec
average2 of 500 elements took 0.33 usec
average of 1000 elements took 0.47 usec
average1 of 1000 elements took 1.09 usec
average2 of 1000 elements took 0.42 usec
average of 2000 elements took 2.36 usec
average1 of 2000 elements took 2.11 usec
average2 of 2000 elements took 0.67 usec
average of 5000 elements took 6.41 usec
average1 of 5000 elements took 4.89 usec
average2 of 5000 elements took 1.26 usec
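To make the access-pattern difference concrete, here are plain scalar renderings of the two layouts (my own paraphrase, not the benchmarked code): the average1 style walks interleaved triples with stride 3, while the average2 style averages three contiguous sub-arrays with unit stride, which compilers can vectorize with plain packed loads instead of gathers or shuffles.

```cpp
#include <cstddef>

// Stride-3 layout: elements of one triple are interleaved (like average1).
void average_strided(float* p, std::size_t length) {
    std::size_t len = length / 3;
    for (std::size_t i = 0; i < len; ++i) {
        float avg = (p[3 * i] + p[3 * i + 1] + p[3 * i + 2]) / 3.0f;
        p[3 * i] = p[3 * i + 1] = p[3 * i + 2] = avg;
    }
}

// Unit-stride layout: three contiguous sub-arrays (like average2).
void average_contiguous(float* p, std::size_t length) {
    std::size_t len = length / 3;
    for (std::size_t i = 0; i < len; ++i) {
        float avg = (p[i] + p[len + i] + p[2 * len + i]) / 3.0f;
        p[i] = p[len + i] = p[2 * len + i] = avg;
    }
}
```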
