Hello,
I tried to parallelise a function with Cilk Plus (the function is basically a periodic convolution with transposition).
The function has 3 nested "for" loops. In a first implementation, I only changed "for" to "cilk_for". I tried changing only the outermost loop, or the two outermost loops, but with no change in performance. The function is "convSerial_cilk", printed at the end of this post. The iteration space can be large (the first for loop iterates from 0 to 20000).
Because performance was poor, I tried the "cilkview" tool (from the SDK).
I call my function like this (using the cilkview API to profile my code):
[cpp]
cilkview_data_t d;
__cilkview_query(d);
convSerial_cilk(num_elements_dim1*num_elements_dim3, num_elements_dim2, out_data_cilk, h_data, f_data, fSIZE);
__cilkview_report(&d, NULL, "main_tag", CV_REPORT_WRITE_TO_RESULTS);
[/cpp]
I get these results:
Whole Program Statistics
1) Parallelism Profile
   Work :                     3,280,552,525 instructions
   Span :                     1,512,348,513 instructions
   Burdened span :                 1,513,138,473 instructions
   Parallelism :                 2.17
   Burdened parallelism :             2.17
   Number of spawns/syncs:             84,500
   Average instructions / strand :         12,940
   Strands along span :                 65
   Average instructions / strand on span :     23,266,900
   Total number of atomic instructions :      84,506
   Frame count :                 169,000
2) Speedup Estimate
     2 processors:     1.12 - 2.00
     4 processors:     1.19 - 2.17
     8 processors:     1.23 - 2.17
    16 processors:     1.25 - 2.17
    32 processors:     1.26 - 2.17
    64 processors:     1.27 - 2.17
   128 processors:     1.27 - 2.17
   256 processors:     1.27 - 2.17
Cilk Parallel Region(s) Statistics - Elapsed time: 7.392 seconds
1) Parallelism Profile
   Work :                     1,768,253,582 instructions
   Span :                     49,570 instructions
   Burdened span :                 839,530 instructions
   Parallelism :                 35671.85
   Burdened parallelism :             2106.24
   Number of spawns/syncs:             84,500
   Average instructions / strand :         6,975
   Strands along span :                 32
   Average instructions / strand on span :     1,549
   Total number of atomic instructions :      84,506
   Frame count :                 169,000
   Entries to parallel region :             2
2) Speedup Estimate
     2 processors:     1.90 - 2.00
     4 processors:     3.80 - 4.00
     8 processors:     7.60 - 8.00
    16 processors:     15.20 - 16.00
    32 processors:     30.40 - 32.00
    64 processors:     60.80 - 64.00
   128 processors:     116.10 - 128.00
   256 processors:     212.30 - 256.00
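To help read the two reports: each "Parallelism" line is Work divided by Span. The sketch below recomputes those ratios from the instruction counts above. It also shows that the whole-program span equals the work done outside the parallel region plus the span of the region itself, which would explain why the whole-program estimate tops out at 2.17. The helper names are mine for illustration; they are not part of the cilkview API.

```cpp
#include <cassert>
#include <cstdio>

// Instruction counts copied from the two cilkview reports above.
const long long kWholeWork          = 3280552525LL;
const long long kWholeSpan          = 1512348513LL;
const long long kRegionWork         = 1768253582LL;
const long long kRegionSpan         = 49570LL;
const long long kRegionBurdenedSpan = 839530LL;

// Parallelism = work / span (illustrative helper, not a cilkview function).
double parallelism(long long work, long long span) {
    assert(span > 0);
    return static_cast<double>(work) / static_cast<double>(span);
}

// Instructions executed outside any Cilk parallel region.
long long serial_work() {
    return kWholeWork - kRegionWork;
}

// Prints the recomputed ratios next to the quantities they come from.
void print_summary() {
    std::printf("region parallelism:          %.2f\n",
                parallelism(kRegionWork, kRegionSpan));
    std::printf("region burdened parallelism: %.2f\n",
                parallelism(kRegionWork, kRegionBurdenedSpan));
    std::printf("whole-program parallelism:   %.2f\n",
                parallelism(kWholeWork, kWholeSpan));
    std::printf("work outside the region:     %lld\n", serial_work());
    std::printf("serial work + region span:   %lld\n",
                serial_work() + kRegionSpan);
}
```

Calling `print_summary()` from `main` prints the recomputed values. Roughly half of the instructions counted for the whole program run outside the parallel region, so timing the whole program and timing only the region can give very different speed-ups.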
In the Cilk-specific part, cilkview indicates that I can expect good performance. Nevertheless, the Cilk version of my function is slower than the sequential one! Furthermore, increasing the number of workers has no effect on the performance of my function!
With cilkview, I generated a plot (attached to this post), measured on a dual Xeon E5-2670 where I can use up to 16 CPU cores.
The plot shows that the theoretical speed-up (burdened speed-up) should be good, but the measured speed-up (trials) is very bad.
So why is there such a large difference between the cilkview estimate and my actual measurements? What should I check to improve my Cilk Plus performance?
Thanks,
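The function:
[cpp]
void convSerial_cilk(unsigned int n1, unsigned int n2, double *restrict tab_out, double *restrict tab_in, const double *restrict in_f, int nf)
{
  unsigned int mod;
  // Outer loop over the n1 input rows, parallelised with cilk_for.
  cilk_for(unsigned int i = 0; i < n1; ++i)
    {
      for(unsigned int j = 0; j < n2; ++j)
        {
          double tmp = 0;
          mod = j;
          // Periodic convolution: the read index wraps back to 0 at n2.
          for(unsigned int k = 0; k < nf; ++k)
            {
              if(mod >= n2)
                mod = 0;
              tmp += tab_in[i*n2 + mod] * in_f[k];
              ++mod;
            }
          // Transposed write: row i of the input becomes column i of the output.
          tab_out[j*n1 + i] = tmp;
        }
    }
}
[/cpp]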
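One thing that may be worth checking in the function above: `mod` is declared outside the `cilk_for`, so every worker reads and writes the same variable. That is a data race, and it can also cost time because the shared variable bounces between cores. Below is a minimal sketch of a variant where `mod` is local to each iteration. This is a guess at one contributing factor, not a confirmed diagnosis. The name `convSerial_cilk_local`, the `small_example` helper, and the serial fallback macro are mine. I also assume the compiler defines `__cilk` when Cilk Plus is enabled, and I use `__restrict` because `restrict` is not a C++ keyword.

```cpp
#include <cassert>
#include <vector>

#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_for for  // serial fallback so the sketch builds without Cilk Plus
#endif

// Same periodic convolution with transposed output, but `mod` is declared
// inside the loop body so each iteration works on its own copy.
void convSerial_cilk_local(unsigned int n1, unsigned int n2,
                           double* __restrict tab_out,
                           const double* __restrict tab_in,
                           const double* __restrict in_f,
                           unsigned int nf)
{
    cilk_for (unsigned int i = 0; i < n1; ++i) {
        for (unsigned int j = 0; j < n2; ++j) {
            double tmp = 0.0;
            unsigned int mod = j;  // private to this (i, j) iteration
            for (unsigned int k = 0; k < nf; ++k) {
                if (mod >= n2)
                    mod = 0;       // wrap around: periodic boundary
                tmp += tab_in[i * n2 + mod] * in_f[k];
                ++mod;
            }
            tab_out[j * n1 + i] = tmp;  // transposed write
        }
    }
}

// Tiny hand-checkable case: 2 rows of 3 samples, 2 filter taps.
std::vector<double> small_example() {
    const std::vector<double> in = {1, 2, 3, 4, 5, 6};
    const std::vector<double> f  = {1, 10};
    std::vector<double> out(in.size(), 0.0);
    assert(out.size() == 2u * 3u);
    convSerial_cilk_local(2, 3, out.data(), in.data(), f.data(), 2);
    return out;
}
```

Even with a private `mod`, the transposed write `tab_out[j*n1 + i]` means that neighbouring values of `i` on different workers write to adjacent memory, so cache-line sharing between workers may still limit scaling. That would need to be measured rather than assumed.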