Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Question about performance of Intel cilk sample code

Raymond_S_
Beginner

Dear all:

When I looked at the Cilk sample code under the path "IntelSWTools\samples_2016\en\compiler_c\psxe\cilk.zip\matrix-multiply\matrix-multiply.cpp", I found these comments in the source code:

// This is the only Intel(R) Cilk(TM) Plus keyword used in this program
// Note the order of the loops and the code motion of the i * n and k * n
// computation. This gives a 5-10 performance improvement over exchanging
// the j and k loops.

But why?

I wrote some code without Cilk and exchanged the j and k loop order, but my test result is just the opposite: the function with the normal (i, j, k) order performs better than the one with the exchanged (i, k, j) order, by roughly a factor of two.

This confuses me. Can anybody explain why?

Below is the code without Cilk that I wrote to test the influence of the loop order:

void matrix_multiply_without_cilk_with_normal_loop_order(double* A, double* B, double* C, unsigned int n)
{
	for (unsigned int i = 0; i < n; ++i) {
		unsigned int itn = i * n;
		for (unsigned int j = 0; j < n; ++j) {
			for (unsigned int k = 0; k < n; ++k) {
				unsigned int ktn = k * n;
				// inner loop over k: C[ktn + j] advances by n elements per iteration
				A[itn + j] += B[itn + k] * C[ktn + j];
			}
		}
	}
}
 
void matrix_multiply_without_cilk_with_exchanged_loop_order(double* A, double* B, double* C, unsigned int n)
{
	for (unsigned int i = 0; i < n; ++i) {
		unsigned int itn = i * n;
		for (unsigned int k = 0; k < n; ++k) {
			unsigned int ktn = k * n;
			for (unsigned int j = 0; j < n; ++j) {
				// inner loop over j: A[itn + j] and C[ktn + j] are both unit-stride
				A[itn + j] += B[itn + k] * C[ktn + j];
			}
		}
	}
}
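
A minimal timing harness for the two functions above might look like the following sketch; the matrix size, the initialization values, and the use of std::chrono are assumptions for illustration, not details from the original test.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

void matrix_multiply_without_cilk_with_normal_loop_order(double* A, double* B, double* C, unsigned int n);
void matrix_multiply_without_cilk_with_exchanged_loop_order(double* A, double* B, double* C, unsigned int n);

int main()
{
	const unsigned int n = 1024;                 // illustrative size only
	std::vector<double> A(n * n, 0.0), B(n * n, 1.0), C(n * n, 2.0);

	auto t0 = std::chrono::steady_clock::now();
	matrix_multiply_without_cilk_with_normal_loop_order(A.data(), B.data(), C.data(), n);
	auto t1 = std::chrono::steady_clock::now();
	std::printf("normal (i, j, k):    %.3f s  (A[0] = %.0f)\n",
	            std::chrono::duration<double>(t1 - t0).count(), A[0]);

	std::fill(A.begin(), A.end(), 0.0);          // reset the output matrix
	auto t2 = std::chrono::steady_clock::now();
	matrix_multiply_without_cilk_with_exchanged_loop_order(A.data(), B.data(), C.data(), n);
	auto t3 = std::chrono::steady_clock::now();
	std::printf("exchanged (i, k, j): %.3f s  (A[0] = %.0f)\n",
	            std::chrono::duration<double>(t3 - t2).count(), A[0]);
	return 0;
}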
TimP
Honored Contributor III

Several factors you haven't addressed might enter into this comparison. It's certainly likely that a "normal" dot-product organization is most efficient, particularly for larger problems with thread parallelism. Early implementations of Cilk(TM) Plus had a poor implementation of sum reduction, and I still wouldn't bet on cilk_for when thread affinity is needed.
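
For illustration, a "dot product" organization of the same multiplication might look like the sketch below; the function name and the local accumulator are assumptions, not code from the Intel sample.

// Sketch of a dot-product organization: each A[i*n + j] is accumulated in a
// local scalar, which the compiler can keep in a register and vectorize as a
// reduction. Signature mirrors the snippets in the question for comparison.
void matrix_multiply_dot_product_order(double* A, double* B, double* C, unsigned int n)
{
	for (unsigned int i = 0; i < n; ++i) {
		for (unsigned int j = 0; j < n; ++j) {
			double sum = 0.0;
			for (unsigned int k = 0; k < n; ++k) {
				sum += B[i * n + k] * C[k * n + j];   // C is read with stride n here
			}
			A[i * n + j] += sum;
		}
	}
}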

MikeP_Intel
Moderator

Please check out this article: https://software.intel.com/en-us/articles/putting-your-data-and-code-in-order-optimization-and-memory-part-1

It may help explain what's going on.
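
As a rough illustration of the access-order point that article makes (a sketch of mine, not part of the reply): summing the same row-major matrix first by rows and then by columns shows the unit-stride versus stride-n difference directly. The size and names below are illustrative only.

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
	const unsigned int n = 4096;                 // illustrative size only
	std::vector<double> m(n * n, 1.0);
	double sum_rows = 0.0, sum_cols = 0.0;

	auto t0 = std::chrono::steady_clock::now();
	for (unsigned int i = 0; i < n; ++i)         // row order: unit stride, j fastest
		for (unsigned int j = 0; j < n; ++j)
			sum_rows += m[i * n + j];
	auto t1 = std::chrono::steady_clock::now();
	for (unsigned int j = 0; j < n; ++j)         // column order: stride n, i fastest
		for (unsigned int i = 0; i < n; ++i)
			sum_cols += m[i * n + j];
	auto t2 = std::chrono::steady_clock::now();

	std::printf("row order:    %.3f s (sum %.0f)\n",
	            std::chrono::duration<double>(t1 - t0).count(), sum_rows);
	std::printf("column order: %.3f s (sum %.0f)\n",
	            std::chrono::duration<double>(t2 - t1).count(), sum_cols);
	return 0;
}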

Bradley_K_
New Contributor I

What is the compiler version? What are the optimization settings (and other compiler arguments)? How big is n?
