Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Question about performance of Intel Cilk sample code

Raymond_S_
Beginner
487 Views

Dear all:

When I looked at the Cilk sample code under the path "IntelSWTools\samples_2016\en\compiler_c\psxe\cilk.zip\matrix-multiply\matrix-multiply.cpp", I found these comments in the source code:

		// This is the only Intel(R) Cilk(TM) Plus keyword used in this program
		// Note the order of the loops and the code motion of the i * n and k * n
		// computation. This gives a 5-10 performance improvement over exchanging
		// the j and k loops.

But why?

I wrote some code without Cilk and exchanged the order of the j and k loops. My test result is just the opposite: the function with the normal loop order performs better than the one with the exchanged order, by roughly a factor of two.

This confuses me and I would like to know why. Can anybody help?

Below is the code without Cilk that I wrote to test the influence of the loop order:

void matrix_multiply_without_cilk_with_normal_loop_order(double* A, double* B, double* C, unsigned int n)
{
	for (int i = 0; i < n; ++i) {
		int itn = i * n;
		for (int j = 0; j < n; ++j) {
			for (int k = 0; k < n; ++k) {
				int ktn = k * n;
				A[itn + j] += B[itn + k] * C[ktn + j];
			}
		}
	}
}
 
void matrix_multiply_without_cilk_with_exchanged_loop_order(double* A, double* B, double* C, unsigned int n)
{
	for(unsigned int i = 0; i < n; ++i) {
		int itn = i * n;
		for (unsigned int k = 0; k < n; ++k) {
			int ktn = k * n;
			for (unsigned int j = 0; j < n; ++j) {
				A[itn + j] += B[itn + k] * C[ktn + j];
			}
		}
	}
}
1 Solution
3 Replies
TimP
Honored Contributor III
487 Views

Several factors you haven't addressed might enter into this comparison. It's certainly likely that a "normal" dot-product organization is the most efficient, particularly for larger problems with thread parallelism. Early implementations of Cilk(TM) Plus had a poor implementation of sum_reduction. I still wouldn't bet on cilk_for when thread affinity is needed.

MikeP_Intel
Moderator
488 Views

Please check out this article: https://software.intel.com/en-us/articles/putting-your-data-and-code-in-order-optimization-and-memory-part-1

It may be able to explain what's going on.
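
As a rough illustration of the memory-ordering effect in question, here is a minimal sketch of the two loop orders side by side. It is not code from the article or from the Intel sample. The names multiply_ijk, multiply_ikj and time_ms are hypothetical, and the matrices are assumed to be stored row-major in flat arrays of size n * n, as in the test code above. With the k loop innermost, C is read with a stride of n elements; with the j loop innermost, A and C are both walked contiguously.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical name. Inner loop over k: C[k * n + j] jumps n elements per
// iteration, so each step may touch a different cache line once n is large.
void multiply_ijk(std::vector<double>& A, const std::vector<double>& B,
                  const std::vector<double>& C, std::size_t n)
{
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                A[i * n + j] += B[i * n + k] * C[k * n + j];
}

// Hypothetical name. Inner loop over j: A[i * n + j] and C[k * n + j] are
// both accessed with stride 1, and B[i * n + k] is loop-invariant.
void multiply_ikj(std::vector<double>& A, const std::vector<double>& B,
                  const std::vector<double>& C, std::size_t n)
{
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double b = B[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                A[i * n + j] += b * C[k * n + j];
        }
}

// Hypothetical helper: wall-clock time of one call to f, in milliseconds.
template <typename F>
double time_ms(F&& f)
{
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling both functions on the same inputs, each wrapped in time_ms, gives a like-for-like comparison. Which order wins, and by how much, depends on the compiler version, the optimization flags, and the size of n.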

Bradley_K_
New Contributor I
487 Views

What is the compiler version? What are the optimization settings (and other compiler arguments)? How big is n?
