Harrisson_M_ · ‎09-15-2014

Hi,

I am trying to use Intel Cilk Plus for parallelization on a multi-core CPU. I read on one of the Intel websites that the simplest way to introduce parallelism is to replace a plain for loop with cilk_for, with the header included. I tried this on one loop in my code that looks suitable for parallelism, as I don't think there is much interdependency between the loop iterations. Below is the code snippet where I introduced cilk_for. However, it takes more time than the plain for loop. I would like to know if I am missing something in my use of cilk_for. Thanks in advance for any help with this issue.

Original Code:
    for (i = 0; i < 8; i++) {
        Total += abs(index[3])
               + abs(index[2])
               + abs(index[1])
               + abs(index[0])
               + abs(index[7])
               + abs(index[6])
               + abs(index[5])
               + abs(index[4]);
    }
Changed Code:
    __cilkrts_set_param("nworkers", "4");
    cilk_for (i = 0; i < 8; i++) {
        Total += abs(index[3])
               + abs(index[2])
               + abs(index[1])
               + abs(index[0])
               + abs(index[7])
               + abs(index[6])
               + abs(index[5])
               + abs(index[4]);
    }

Regards

Harrisson

Tam_N_ · ‎09-16-2014

Hi Harrisson,

I think your loop is so small that Cilk Plus spends more time scheduling the tasks on the processors than it saves by running them in parallel.

Thanks,

Tam Nguyen

Harrisson_M_ · ‎09-16-2014

Hi Tam,

Thank you for your reply. I considered your suggestion and measured the time consumed by the loop shown above. It takes 35.283 seconds, and the overall run time of my project is 1026 seconds. Do you think this is enough time to benefit from parallelizing the loop with Intel Cilk Plus? Also, I would like to know if I have used the cilk_for keyword correctly, as I am not gaining any performance from it and am in turn losing performance. I would like to hear your views on this issue.

Regards
Harrisson

Jim_S_Intel · ‎09-17-2014

How many times is the loop executed in that 35.283 seconds? Even though the program may spend a lot of time executing the loop, if each instance of the loop is short, then you may still spend more time in the overhead of starting a cilk_for loop than you gain from running in parallel.

Assuming the code example above is the exact loop you are running, as Tam mentioned, I don't see a lot of work in the body of the loop.
As a rather approximate rule of thumb, I tend to think of the overheads of coordinating between threads in Cilk Plus (and in fact many parallel runtimes) as being on the order of microseconds. If the body of a loop is more fine-grained than that, then speedup may be limited. In this case, ensuring that the body of the loop vectorizes (e.g., using #pragma simd) might be a better approach to getting speedup.

I'm not sure where "Total" is declared, but there is a data race on it if it is declared outside the loop body. You might use a reducer_opadd to eliminate this race.

More information about reducers and #pragma simd is linked from the Cilk Plus website:

https://www.cilkplus.org/cilk-plus-tutorial

Cheers,

Jim

Performance degrade with Intel Cilk Plus