Dear all,
When I looked at the Cilk sample code under the path "IntelSWTools\samples_2016\en\compiler_c\psxe\cilk.zip\matrix-multiply\matrix-multiply.cpp", I found this comment in the source code:
// This is the only Intel(R) Cilk(TM) Plus keyword used in this program
// Note the order of the loops and the code motion of the i * n and k * n
// computation. This gives a 5-10 performance improvment over exchanging
// the j and k loops.
But why?
I wrote some code without Cilk and exchanged the j and k loop order. My test result is the opposite: the function with the normal (i-j-k) order performs about twice as well as the one with the exchanged (i-k-j) order.
This confuses me. Can anybody explain why?
Below is the code (without Cilk) that I wrote to test the influence of loop order:
void matrix_multiply_without_cilk_with_normal_loop_order(double* A, double* B, double* C, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i) {
        unsigned int itn = i * n;
        for (unsigned int j = 0; j < n; ++j) {
            for (unsigned int k = 0; k < n; ++k) {
                unsigned int ktn = k * n;
                A[itn + j] += B[itn + k] * C[ktn + j];
            }
        }
    }
}

void matrix_multiply_without_cilk_with_exchanged_loop_order(double* A, double* B, double* C, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i) {
        unsigned int itn = i * n;
        for (unsigned int k = 0; k < n; ++k) {
            unsigned int ktn = k * n;
            for (unsigned int j = 0; j < n; ++j) {
                A[itn + j] += B[itn + k] * C[ktn + j];
            }
        }
    }
}
Please check out this article: https://software.intel.com/en-us/articles/putting-your-data-and-code-in-order-optimization-and-memory-part-1
It may be able to explain what's going on.
Several factors you haven't addressed might enter into this comparison. It's certainly possible that a "normal" dot-product organization is most efficient, particularly for larger problems with thread parallelism. Early implementations of Cilk(TM) Plus had a poor implementation of sum reduction, and I still wouldn't bet on cilk_for when thread affinity is needed.
What is the compiler version? What are the optimization settings (and other compiler arguments)? How big is n?