Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## Question about performance of Intel cilk sample code

Beginner

Dear all:

When I looked at the Cilk sample code at the path "IntelSWTools\samples_2016\en\compiler_c\psxe\cilk.zip\matrix-multiply\matrix-multiply.cpp", I found these comments in the source code:

```
// This is the only Intel(R) Cilk(TM) Plus keyword used in this program
// Note the order of the loops and the code motion of the i * n and k * n
// computation. This gives a 5-10 performance improvment over exchanging
// the j and k loops.
```

But why?

I wrote some code without Cilk and exchanged the order of the j and k loops. My test result is just the opposite: the function with the normal loop order performs about twice as well as the one with the exchanged order.

This confuses me, and I would like to know why. Can anybody help me?

Below is the code without Cilk that I wrote to test the influence of loop order:

```
// i-j-k order: the inner k loop is a dot product.
// A[itn + j] is fixed, B is read with unit stride, C is read with stride n.
void matrix_multiply_without_cilk_with_normal_loop_order(double* A, double* B, double* C, unsigned int n)
{
    for (int i = 0; i < n; ++i) {
        int itn = i * n;
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                int ktn = k * n;
                A[itn + j] += B[itn + k] * C[ktn + j];
            }
        }
    }
}

// i-k-j order: the inner j loop streams along one row of A and one row of C.
// B[itn + k] is fixed, A and C are accessed with unit stride.
void matrix_multiply_without_cilk_with_exchanged_loop_order(double* A, double* B, double* C, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i) {
        int itn = i * n;
        for (unsigned int k = 0; k < n; ++k) {
            int ktn = k * n;
            for (unsigned int j = 0; j < n; ++j) {
                A[itn + j] += B[itn + k] * C[ktn + j];
            }
        }
    }
}
```
1 Solution
Moderator

It may be able to explain what's going on.

3 Replies
Honored Contributor III

Several factors you haven't addressed might enter into this comparison. It's quite possible that a "normal" dot-product organization is the most efficient, particularly for larger problems with thread parallelism. Early versions of Cilk(TM) Plus had a poor implementation of sum_reduction. I still wouldn't bet on cilk_for when thread affinity is needed.
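For anyone who wants to reproduce the comparison, here is a minimal sketch, not the Intel sample itself. It assumes the same `A += B * C` row-major layout as the code in the question. The names `multiply_ijk_dot`, `multiply_ikj`, `max_abs_diff`, `Comparison`, and `compare_loop_orders` are hypothetical helpers made up for this illustration. The dot-product version keeps the running sum in a local scalar, which is one common way to write that organization; the i-k-j version streams along rows of `A` and `C`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot-product (i-j-k) order with a scalar accumulator.
// A[i][j] is updated once per (i, j); C is read with stride n.
void multiply_ijk_dot(double* A, const double* B, const double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += B[i * n + k] * C[k * n + j];
            }
            A[i * n + j] += sum;
        }
    }
}

// Row-streaming (i-k-j) order.
// B[i][k] is hoisted; A and C are both accessed with unit stride.
void multiply_ikj(double* A, const double* B, const double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double b = B[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                A[i * n + j] += b * C[k * n + j];
            }
        }
    }
}

// Largest absolute element-wise difference between two equally sized vectors.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}

struct Comparison {
    double ijk_ms;    // wall-clock time of the dot-product order
    double ikj_ms;    // wall-clock time of the row-streaming order
    double max_diff;  // largest difference between the two results
};

// Runs both loop orders once on the same inputs and reports time and agreement.
Comparison compare_loop_orders(std::size_t n) {
    std::vector<double> B(n * n), C(n * n);
    for (std::size_t idx = 0; idx < n * n; ++idx) {
        // Small integer values, so every partial sum is exact in double
        // and the two summation orders must agree bit for bit.
        B[idx] = static_cast<double>(idx % 7);
        C[idx] = static_cast<double>(idx % 5) - 2.0;
    }
    std::vector<double> A1(n * n, 0.0), A2(n * n, 0.0);

    using timer_clock = std::chrono::steady_clock;
    const auto t0 = timer_clock::now();
    multiply_ijk_dot(A1.data(), B.data(), C.data(), n);
    const auto t1 = timer_clock::now();
    multiply_ikj(A2.data(), B.data(), C.data(), n);
    const auto t2 = timer_clock::now();

    const std::chrono::duration<double, std::milli> d1 = t1 - t0;
    const std::chrono::duration<double, std::milli> d2 = t2 - t1;
    return {d1.count(), d2.count(), max_abs_diff(A1, A2)};
}
```

Calling `compare_loop_orders(n)` for a few sizes and printing the two timings should show which order wins on a given machine. A single run like this is only a rough indication: the outcome depends on the compiler, the optimization flags, whether the inner loop gets vectorized, and how `n` relates to the cache sizes, so it is worth repeating the measurement several times before drawing conclusions.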
