Hi,
I have an application which, according to Intel Advisor, should get a speedup of about 3.5 using 4 threads.
Using OpenMP I was able to get only ~2.35 with 2 threads and ~2.85 with 4 threads.
Now I am learning and applying Cilk to see if I can improve this performance. My compiler is icc (ICC) 14.0.0 20130728.
I am using only cilk_for (similar to what I did with OpenMP). Running the application with one processor I get the same performance with Cilk and OpenMP; however, adding more processors hurts the Cilk performance badly.
Below is the essential part of my code. The parallel loops are inside the preM() and posM() functions, which I have also listed below.
Notice that both preM() and posM() are called from inside a doubly nested loop, which is itself inside a while loop. Could that be the problem? Too much overhead? I have been playing with #pragma cilk grainsize without luck.
Any help and/or comments will be appreciated.
Thanks in advance.
......
while (myBool && iter < 14) {
    myBool = false;
    iter++;
    for (i=0; i<ncol; i++) {
        for (j=nrow-1; j>i; j--) {
            if (j >= ncol ) {
                symmetric(a, 0.0, a
            } else {
                symmetric(a, a
            } // end if //
            c = 1.0;
            s = 0.0;
            preM(a, a
            if (j<ncol) {
                diagonal(a,a
                preM(a, a
                posM(&a[0], &a[0]
                posM(&v[0], &v[0]
            } // end if //
            preM(u, u
        } // end for //
    } // end for //
    .....
} // end while //
.....
++++++++++++++++++++ preM.c +++++++++++++++
#include <stdio.h>
#include <stdlib.h>
#include <cilk/cilk.h>
#include "real.h"
/* Apply a plane (Givens) rotation to two rows of length nCol:
   a[col] <- c*a[col] - s*b[col],  b[col] <- s*a[col] + c*b[col] */
void preM (real *a, real *b, unsigned int nCol, real c, real s) {
    unsigned int col;
    // int gs = (nCol/__cilkrts_get_nworkers());   // needs <cilk/cilk_api.h>
    // #pragma cilk grainsize = 63
    cilk_for (col=0; col<nCol; col++) {
        real ri = a[col];
        real rj = b[col];
        a[col] = c*ri - s*rj;
        b[col] = s*ri + c*rj;
    } // end for //
} // end preM() //
++++++++++++++++++++ posM.c +++++++++++++++
#include <stdio.h>
#include <stdlib.h>
#include <cilk/cilk.h>
#include "real.h"
/* Apply a plane (Givens) rotation to two columns of a row-major nRow x nCol
   matrix; a and b point to the tops of the two columns, so the stride is nCol. */
void posM (real *a, real *b, unsigned int nRow, unsigned int nCol, real c, real s) {
    unsigned int row;
    // int gs = (nCol/__cilkrts_get_nworkers());   // needs <cilk/cilk_api.h>
    // #pragma cilk grainsize = 63
    cilk_for (row=0; row < nRow*nCol; row+=nCol) {
        real ci = a[row];
        real cj = b[row];
        a[row] = c*ci - s*cj;
        b[row] = s*ci + c*cj;
    } // end for //
} // end posM() //
I haven't looked too deeply at your code, but I think you are probably right, in that you may have a problem with parallel overhead. How big are "nRow" and "nCol"? Generally you will get better performance with Cilk Plus if you are able to restructure your code to parallelize the outermost loops.
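To illustrate, here is a minimal toy sketch (made-up function and names, not your code) of what I mean by parallelizing the outermost loop: each worker gets whole rows to process, so the cilk_for scheduling overhead is amortized over the entire inner loop instead of being paid on every small row/column operation.

/* Toy sketch (hypothetical, not the original program): cilk_for on the
   outermost loop hands each worker whole rows, so spawn/steal overhead
   is amortized over many unit-stride inner iterations. */
#include <cilk/cilk.h>
#include <stddef.h>

void scale_matrix(double *a, size_t nrow, size_t ncol, double factor)
{
    size_t i;
    cilk_for (i = 0; i < nrow; i++) {      /* coarse-grained parallel loop   */
        size_t j;                          /* private to each iteration      */
        for (j = 0; j < ncol; j++) {       /* serial, unit-stride inner loop */
            a[i * ncol + j] *= factor;
        }
    }
}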
In a recent article, I summarized some of the common performance issues people run into when using Cilk Plus. For this example, I would probably look at items #1, #2, #4, and possibly #8:
http://software.intel.com/en-us/articles/why-is-cilk-plus-not-speeding-up-my-program-part-2
In this case, it is also possible that OpenMP is getting some cache-locality benefits from its static scheduling of parallel loops. Because Cilk Plus will use work-stealing to redistribute work for every cilk_for loop, it won't get the same benefits. But I'd be interested in doing some additional investigation before jumping to that conclusion, because there are sometimes other less obvious issues that can be at fault.
Would it be possible for you to post a full program (or a simplified kernel) that exhibits the same behavior, so we can try and reproduce the behavior you are seeing?
Cheers,
Jim
The other very common reason is cache inefficiency (and occasionally the TLB). posM() does not look good in that respect. That is checkable using the hardware performance counters.
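To make that concrete, here is a rough sketch (hypothetical code, not the actual program) of the two access patterns in a row-major layout: a row rotation reads consecutive elements, while a column rotation jumps nCol elements per step and so touches a different cache line on almost every access.

/* Hypothetical sketch of the two access patterns (not the poster's code). */
#include <stddef.h>

/* Row rotation: unit stride, good spatial locality. */
void rotate_rows(double *m, size_t ncol, size_t i, size_t j, double c, double s)
{
    size_t k;
    double *ri = m + i * ncol;           /* row i is contiguous in memory */
    double *rj = m + j * ncol;           /* row j is contiguous in memory */
    for (k = 0; k < ncol; k++) {
        double x = ri[k], y = rj[k];
        ri[k] = c * x - s * y;
        rj[k] = s * x + c * y;
    }
}

/* Column rotation: stride of ncol doubles, roughly one useful element
   per cache line loaded. */
void rotate_cols(double *m, size_t nrow, size_t ncol, size_t i, size_t j, double c, double s)
{
    size_t r;
    for (r = 0; r < nrow; r++) {
        double x = m[r * ncol + i], y = m[r * ncol + j];
        m[r * ncol + i] = c * x - s * y;
        m[r * ncol + j] = s * x + c * y;
    }
}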
Thanks for responding. I will try to give some feedback to address your questions and comments:
The program obtains the singular value decomposition of a rectangular dense matrix using an inefficient procedure similar to a Jacobi iteration. It was initially developed as a school project using MPI; the results were not good (a failure).
A CUDA version of the program was developed, and it shows respectable scalability: the speedup grows as the size of the matrix grows. For the largest matrix I was able to test (3500x3500) the speedup was > 6. (I was allowed only one hour of GPU time per run; the serial version taken as the baseline required > 6 hours to run.) There should be a limit to the speedup growth, but I couldn't find it given my limited GPU time.
The OpenMP version of the program produces a speedup < 3 using 2, 4, or 8 processors. Initially I had a problem with OpenMP similar to the one I am having now with Cilk: I was using "omp parallel for" inside the preM() and posM() functions and was not getting any speedup. Then I created a parallel region before the nested loops (inside the while loop), used "omp single" to protect some code that needs to be executed by a single thread, and used the work-sharing "omp for" inside preM() and posM(). The overhead was then reduced and I am getting a speedup, but still < 3.
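The structure I ended up with looks roughly like this (a simplified, self-contained sketch with made-up data, not the real program):

/* Simplified sketch of the OpenMP restructuring described above
   (made-up data and sizes): one parallel region around the whole sweep,
   "omp single" for the serial setup, and an orphaned "omp for"
   work-sharing loop inside the worker function. */
#include <stdio.h>
#include <omp.h>

#define N 1000

typedef double real;

static real row_i[N], row_j[N];
static real c, s;                       /* rotation coefficients (shared) */

/* Called from inside the parallel region: the "omp for" is shared by the
   existing team, so no new threads are created here. */
static void preM_omp(real *a, real *b, unsigned int nCol, real cc, real ss)
{
    unsigned int col;
    #pragma omp for
    for (col = 0; col < nCol; col++) {
        real ri = a[col];
        real rj = b[col];
        a[col] = cc * ri - ss * rj;
        b[col] = ss * ri + cc * rj;
    }                                   /* implicit barrier at the end */
}

int main(void)
{
    unsigned int k;
    for (k = 0; k < N; k++) { row_i[k] = 1.0; row_j[k] = 2.0; }

    #pragma omp parallel                /* threads are created once, outside the sweeps */
    {
        int iter;
        for (iter = 0; iter < 14; iter++) {
            #pragma omp single          /* serial setup runs on one thread only */
            {
                c = 0.8;
                s = 0.6;
            }                           /* implicit barrier: all threads then see c, s */
            preM_omp(row_i, row_j, N, c, s);
        }
    }
    printf("%f %f\n", row_i[0], row_j[0]);
    return 0;
}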
Then I decided to give Cilk a chance, without luck so far.
Jim: I have used square dense matrices ranging from 300x300 to 3500x3500. I have not found a way to move the parallelization to the outermost loops, and I do not know whether that can be done in this case; as I see it, the outermost loops are highly serialized. I will read the article you suggested and see if I can find a solution. Thanks.
Nick: yes, you are right. posM() processes two columns of the matrix, which is not favorable for the cache. preM() processes two rows and therefore performs better. As a matter of fact, the very last line inside the double loop (preM(u, u
Thanks again.
With a structure like that, it's probably hopeless, and you need to go back to the mathematics. This is a very common problem when writing parallel code, and is the main reason that parallelism is still making very slow progress after 30-40 years of it obviously being the direction the future would take. See if you can reorder the mathematical operations to raise the level at which you can use parallelism; that is possible surprisingly often, whereas trying it at the code level is much harder. If not, the only solution is to use a different algorithm. I am not an SVD expert, but SVDs were notoriously hard to parallelise efficiently 20 years ago (relative to the solution of equations), so they will still be hard today; there may, however, be more parallel-friendly algorithms nowadays.
