Hi,
I am trying to use Intel Cilk Plus to parallelize code on a multi-core CPU. I read on one of Intel's websites that the simplest way to introduce parallelism is to replace an ordinary for loop with cilk_for and include the header. I tried this on one loop in my code that looks suitable, since I don't think there is much interdependency between its iterations. The snippet below shows where I introduced cilk_for. However, it takes more time than the plain for loop. I would like to know whether I am missing something in how I use cilk_for. Thanks in advance for any help with this issue.
Original Code:
for (i = 0; i < 8; i++) {
    Total += abs(index[3])
           + abs(index[2])
           + abs(index[1])
           + abs(index[0])
           + abs(index[7])
           + abs(index[6])
           + abs(index[5])
           + abs(index[4]);
}
Changed Code:
__cilkrts_set_param("nworkers", "4");
cilk_for (i = 0; i < 8; i++) {
    Total += abs(index[3])
           + abs(index[2])
           + abs(index[1])
           + abs(index[0])
           + abs(index[7])
           + abs(index[6])
           + abs(index[5])
           + abs(index[4]);
}
Regards
Harrisson
Hi Harrisson,
I think your loop is so small that Cilk Plus spends more time scheduling tasks across the processors than it saves by running them in parallel.
Thanks,
Tam Nguyen
Hi Tam,
Thank you for your reply. I followed your suggestion and measured the time consumed by this loop in my code. The loop takes 35.283 seconds, and the overall run time of my project is 1026 seconds. Do you think that is enough time to make it worthwhile to parallelize the loop with Intel Cilk Plus? I would also like to know whether I have used the cilk_for keyword correctly, as I am not gaining any performance from it and am in fact losing performance. I would like to hear your views on this issue.
Regards
Harrisson
Assuming the code example above is the exact loop you are running, as Tam mentioned, I don't see a lot of work in the body of the loop.
As a rather approximate rule of thumb, I tend to think of the overheads of coordinating between threads in Cilk Plus (and in fact in many parallel runtimes) as being on the order of microseconds. If the body of a loop is more fine-grained than that, then speedup may be limited. In this case, ensuring that the body of the loop vectorizes (e.g., using #pragma simd) might be a better approach to getting speedup.
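As a rough illustration of the vectorization route, here is a minimal sketch rather than the original poster's code. The helper name `sum_abs` is hypothetical, and treating the data as a flat array of `int` is an assumption about how `index` is laid out. The loop is a plain sum with no other dependency between iterations, which is the shape a compiler can vectorize. With the Intel compiler, `#pragma simd reduction(+:total)` could go directly above the loop; it appears only as a comment here so the sketch builds with any C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: returns the sum of |values[k]| for k in [0, count).
// Assumes the data is a contiguous array of int.
long sum_abs(const int* values, std::size_t count) {
    assert(values != nullptr || count == 0);
    long total = 0;
    // With Intel Cilk Plus, one could add here:
    //   #pragma simd reduction(+:total)
    for (std::size_t k = 0; k < count; ++k) {
        total += std::abs(values[k]);
    }
    return total;
}
```

Called as `sum_abs(data, 8)` on an eight-element array, this stays on one thread, so there is no scheduling overhead to amortize.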
I'm not sure where "Total" is declared, but there is a data race on it if it is declared outside the loop body. You might use a reducer_opadd to eliminate this race.
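To show the idea behind a sum reducer without relying on the Cilk Plus headers, the sketch below uses standard C++ threads as a stand-in. It is not how reducer_opadd is implemented. The helper name `parallel_sum_abs`, the `std::vector<int>` input, and the static chunking are all assumptions made for illustration. Each worker accumulates into its own private value, and the partial sums are combined only after every worker has finished, so no two threads ever update the same total.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: sums |values[k]| using `workers` threads.
// Each thread writes only to its own slot in `partial`, so there is
// no shared accumulator and therefore no data race on the total.
long parallel_sum_abs(const std::vector<int>& values, unsigned workers) {
    assert(workers > 0);
    std::vector<long> partial(workers, 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (values.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&values, &partial, chunk, w] {
            const std::size_t begin = std::min(values.size(), w * chunk);
            const std::size_t end = std::min(values.size(), begin + chunk);
            long local = 0;  // private to this thread
            for (std::size_t k = begin; k < end; ++k) {
                local += std::abs(values[k]);
            }
            partial[w] = local;  // one write per thread, distinct slots
        });
    }
    for (std::thread& t : pool) {
        t.join();
    }

    long total = 0;  // combined serially after all workers have finished
    for (long p : partial) {
        total += p;
    }
    return total;
}
```

For an eight-element input the cost of starting the threads far exceeds the work itself, which is the same granularity problem described above; the pattern only pays off when each worker has substantially more to do.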
More information about reducers and #pragma simd is linked from the Cilk Plus website:
https://www.cilkplus.org/cilk-plus-tutorial
Cheers,
Jim
