
(very) Bad parallel efficiency, cilkview interpretation

Matthieu_Ospici
Beginner

Hello,

I tried to parallelise a function with Cilk Plus (the function is basically a periodic convolution with transposition).

The function has 3 nested "for" loops. In a first implementation I simply changed each "for" to "cilk_for". I also tried changing only the first one, or the first two, but with no change in performance. The function is "convSerial_cilk", printed at the end of this post. The iteration space can be large (the first loop iterates from 0 to 20000).

Because I was getting poor performance, I tried the "cilkview" tool (from the SDK).

I call my function like this (using the cilkview API to profile my code):

[cpp]
cilkview_data_t d;
__cilkview_query(d);
convSerial_cilk(num_elements_dim1*num_elements_dim3, num_elements_dim2, out_data_cilk, h_data, f_data, fSIZE);
__cilkview_report(&d, NULL, "main_tag", CV_REPORT_WRITE_TO_RESULTS);
[/cpp]

I get these results:

Whole Program Statistics
1) Parallelism Profile
   Work :                     3,280,552,525 instructions
   Span :                     1,512,348,513 instructions
   Burdened span :                 1,513,138,473 instructions
   Parallelism :                 2.17
   Burdened parallelism :             2.17
   Number of spawns/syncs:             84,500
   Average instructions / strand :         12,940
   Strands along span :                 65
   Average instructions / strand on span :     23,266,900
   Total number of atomic instructions :      84,506
   Frame count :                 169,000

2) Speedup Estimate
     2 processors:     1.12 - 2.00
     4 processors:     1.19 - 2.17
     8 processors:     1.23 - 2.17
    16 processors:     1.25 - 2.17
    32 processors:     1.26 - 2.17
    64 processors:     1.27 - 2.17
   128 processors:     1.27 - 2.17
   256 processors:     1.27 - 2.17

Cilk Parallel Region(s) Statistics - Elapsed time: 7.392 seconds
1) Parallelism Profile
   Work :                     1,768,253,582 instructions
   Span :                     49,570 instructions
   Burdened span :                 839,530 instructions
   Parallelism :                 35671.85
   Burdened parallelism :             2106.24
   Number of spawns/syncs:             84,500
   Average instructions / strand :         6,975
   Strands along span :                 32
   Average instructions / strand on span :     1,549
   Total number of atomic instructions :      84,506
   Frame count :                 169,000
   Entries to parallel region :             2

2) Speedup Estimate
     2 processors:     1.90 - 2.00
     4 processors:     3.80 - 4.00
     8 processors:     7.60 - 8.00
    16 processors:     15.20 - 16.00
    32 processors:     30.40 - 32.00
    64 processors:     60.80 - 64.00
   128 processors:     116.10 - 128.00
   256 processors:     212.30 - 256.00
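
(If I read the numbers correctly, cilkview computes parallelism as work divided by span: 3,280,552,525 / 1,512,348,513 ≈ 2.17 for the whole program. The parallel region only accounts for about 1.77 of the 3.28 billion instructions, so roughly 1.5 billion instructions are serial, and they make up almost all of the whole-program span. That seems to be why the whole-program speedup estimate is capped at about 2.17 no matter how many processors are used.)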

In the Cilk-specific part, cilkview indicates that I can expect good performance. Nevertheless, the Cilk version of my function is slower than the sequential one! Furthermore, increasing the number of workers has no effect on its performance!

With cilkview, I have also generated a plot (enclosed with this post). It was produced on a dual Xeon E5-2670, so I can use up to 16 CPU cores.

We can see that the theoretical speed-up (the burdened-speedup curve) should be good, but the measured speed-up (the trial points) is very bad.

So why do I get such a large difference between the cilkview estimates and my real measurements? What should I check to improve my Cilk Plus performance?

Thanks,

The function:

[cpp]
void convSerial_cilk(unsigned int n1, unsigned int n2, double *restrict tab_out,
                     double *restrict tab_in, const double *restrict in_f, int nf)
{
  unsigned int mod;
  cilk_for(unsigned int i = 0; i < n1; ++i)
    {
      for(unsigned int j = 0; j < n2; ++j)
        {
          double tmp = 0;
          mod = j;
          for(unsigned int k = 0; k < nf; ++k)
            {
              if(mod >= n2)
                mod = 0;                      /* periodic wrap-around */
              tmp += tab_in[i*n2 + mod] * in_f[k];
              ++mod;
            }
          tab_out[j*n1 + i] = tmp;
        }
    }
}
[/cpp]



7 Replies
Barry_T_Intel
Employee
Which OS and compiler are you using? Please give us specific versions. Also, can you give us a version we can run? Thanks. - Barry
Matthieu_Ospici
Beginner
I'm using the Intel compilers v.2013.1.117 on a Linux platform (RHEL 6.2). The hardware is a dual Xeon E5-2670 (2*8 cores).

I use this command to compile (convSeria.cpp is enclosed):

icpc -xAVX -vec-report -I../cilkutil/include/cilktools -std=c++0x -restrict -O3 convSeria.cpp -o test

To run the test case: ./test 64 64 64

The output indicates:

Ref : 759234 µs (time of the original serial code)
Cilk : 1552656 µs (time of the Cilk version of the code)

Thanks,
Barry_T_Intel
Employee
I've started looking at the problem, and found at least one flaw: your implementation of convSerial_cilk has a race. Since you declare "mod" outside of the cilk_for loop, it is shared among all of the workers, which causes memory stalls waiting for the cache line containing mod. Moving the declaration of mod inside the cilk_for loop improved the performance. But it's still not performing like it should....
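The change is a one-line move (a sketch, using the names from your post):

[cpp]
cilk_for(unsigned int i = 0; i < n1; ++i)
  {
    for(unsigned int j = 0; j < n2; ++j)
      {
        double tmp = 0;
        unsigned int mod = j;   /* declared here: private to the iteration, no sharing */
        for(unsigned int k = 0; k < nf; ++k)
          {
            if(mod >= n2)
              mod = 0;          /* periodic wrap-around */
            tmp += tab_in[i*n2 + mod] * in_f[k];
            ++mod;
          }
        tab_out[j*n1 + i] = tmp;
      }
  }
[/cpp]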
TimP
Honored Contributor III
If you were trying to accumulate mod across multiple workers, you would need to do more than simply move the definition in and out of various scopes. From the context, though, Barry's assumption that you mean it to be local to a worker looks correct. You should also be looking at whether the reduction into tmp is optimized, and whether your use of unsigned int is the cause of a performance problem.
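For instance (an untested sketch, assuming nf <= n2 so the index wraps around at most once; len1 is just a name I made up for the run length before the wrap), the inner loops can be written with signed indices and without the wrap-around branch, which gives the vectorizer two contiguous, branch-free loops:

[cpp]
/* Body for one value of i inside the cilk_for; assumes nf <= n2. */
for(int j = 0; j < (int)n2; ++j)
  {
    double tmp = 0;
    int len1 = ((int)n2 - j < nf) ? (int)n2 - j : nf; /* elements before the wrap */
    for(int k = 0; k < len1; ++k)                     /* contiguous, branch-free */
      tmp += tab_in[i*n2 + j + k] * in_f[k];
    for(int k = len1; k < nf; ++k)                    /* wrapped-around tail */
      tmp += tab_in[i*n2 + (k - len1)] * in_f[k];
    tab_out[j*n1 + i] = tmp;
  }
[/cpp]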
Matthieu_Ospici
Beginner
Barry, thanks for your help. As you said, performance is still bad even when mod is declared locally. Perhaps the overhead is due to task creation? How much time does Cilk Plus need to create a task? I tried Intel TBB with the same code (with a parallel_for construct instead of cilk_for) and performance is significantly better than with Cilk.

Tim, mod must be local to each task, so Barry's assumption is right. The reduction is not vectorized by icc, but that is also the case in the non-parallel C version. So even if the reduction is suboptimal, I would expect a speedup compared with the non-parallel C version.

Thanks,
Barry_T_Intel
Employee
OK, in addition to the race, I've found a couple of things.

The first is that OpenMP is keeping its worker threads around after the parallel section is completed, and they're taking up CPU time. Reversing the two sections so the Cilk code goes first fixes that; the Cilk runtime suspends its threads when your code leaves the parallel region, so they don't affect the OpenMP timings. If you want to be sure, you can tell the Cilk runtime to shut down completely by calling __cilkrts_end_cilk(). I'm told that the OpenMP "hang" time can be controlled using the KMP_BLOCKTIME environment variable, but I didn't go back and try it.

The second is that the compiler is known to have some problems with vectorizing inside a cilk_for loop. This was called out by the lack of remarks from the compiler about vectorization in the Cilk implementation. I pulled the cilk_for body into its own function which I called from the cilk_for loop, and the timings tightened up a lot. The compiler folks know about this and are looking at possible solutions.

Next I cut down on the number of worker slots the Cilk runtime allocates. When the runtime attempts to steal work, it generates a random number and uses that to index into the array of worker structs. This is a simple, linear array, which is allocated when the runtime starts. By default, the runtime allocates 4P slots in that array: P-1 for the system workers, and 3P+1 for the user workers. The extra slots allow multiple threads to bind to the runtime, but since your application will only ever have 1 user worker bound to the runtime, the extra 3P slots will never have work that can be stolen. Cutting down on the extra slots improves the chance of finding work to steal quickly.

Finally, there's some indication that you've got insufficient parallelism for 8 workers.

With the above changes (the hoisting is sketched below), I see the following results:

> ./test 64 64 64
Ref : 0.088 sec
Cilk : 0.101 sec

> ./test 128 128 128
Ref : 2.185 sec
Cilk : 2.154 sec

- Barry
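Roughly, the hoisting looks like this (a sketch, not the exact code I ran; conv_row is my name for the helper, not something in your code):

[cpp]
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>   /* for __cilkrts_end_cilk() */

/* Body of one cilk_for iteration, hoisted into its own function so the
   compiler can vectorize the inner loops independently of the outlined
   cilk_for body. */
static void conv_row(unsigned int i, unsigned int n1, unsigned int n2,
                     double *restrict tab_out, const double *restrict tab_in,
                     const double *restrict in_f, int nf)
{
  for(unsigned int j = 0; j < n2; ++j)
    {
      double tmp = 0;
      unsigned int mod = j;            /* local to the iteration: no race */
      for(int k = 0; k < nf; ++k)      /* signed index, per Tim's remark */
        {
          if(mod >= n2)
            mod = 0;                   /* periodic wrap-around */
          tmp += tab_in[i*n2 + mod] * in_f[k];
          ++mod;
        }
      tab_out[j*n1 + i] = tmp;
    }
}

void convSerial_cilk(unsigned int n1, unsigned int n2, double *restrict tab_out,
                     double *restrict tab_in, const double *restrict in_f, int nf)
{
  cilk_for(unsigned int i = 0; i < n1; ++i)
    conv_row(i, n1, n2, tab_out, tab_in, in_f, nf);
}

/* After the last Cilk region, if you want the worker threads shut down
   before timing anything else:
       __cilkrts_end_cilk();
*/
[/cpp]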
Barry_T_Intel
Employee
I've been assured that setting KMP_BLOCKTIME=0 will cause the OpenMP threads to block immediately after the parallel region. The default is that they will spin for 200 milliseconds, waiting for additional work. - Barry
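(In other words, running the test as KMP_BLOCKTIME=0 ./test 64 64 64 should keep the idle OpenMP threads from spinning while the Cilk section is being timed.)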