Dear all,
When I looked at the Cilk sample code under the path "IntelSWTools\samples_2016\en\compiler_c\psxe\cilk.zip\matrix-multiply\matrix-multiply.cpp", I found this comment in the source code:
// This is the only Intel(R) Cilk(TM) Plus keyword used in this program
// Note the order of the loops and the code motion of the i * n and k * n
// computation. This gives a 5-10 performance improvment over exchanging
// the j and k loops.
But why?
I wrote some code without Cilk and exchanged the j and k loop order. My test result is the opposite: the function with the normal (i-j-k) order performs about twice as well as the one with the exchanged (i-k-j) order.
This confuses me. Can anybody explain why?
Below is the code (without Cilk) that I wrote to test the influence of loop order:
void matrix_multiply_without_cilk_with_normal_loop_order(double* A, double* B, double* C, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i) {
        unsigned int itn = i * n;
        for (unsigned int j = 0; j < n; ++j) {
            for (unsigned int k = 0; k < n; ++k) {
                unsigned int ktn = k * n;
                A[itn + j] += B[itn + k] * C[ktn + j];
            }
        }
    }
}

void matrix_multiply_without_cilk_with_exchanged_loop_order(double* A, double* B, double* C, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i) {
        unsigned int itn = i * n;
        for (unsigned int k = 0; k < n; ++k) {
            unsigned int ktn = k * n;
            for (unsigned int j = 0; j < n; ++j) {
                A[itn + j] += B[itn + k] * C[ktn + j];
            }
        }
    }
}
Please check out this article: https://software.intel.com/en-us/articles/putting-your-data-and-code-in-order-optimization-and-memory-part-1
It may be able to explain what's going on.
Several factors you haven't addressed might enter into this comparison. It's certainly possible that a "normal" dot-product organization is most efficient, particularly for larger problems with thread parallelism. Early implementations of Cilk(TM) Plus had a poor implementation of sum reduction, and I still wouldn't bet on cilk_for when thread affinity is needed.
What is the compiler version? What are the optimization settings (and other compiler arguments)? How big is n?