Enforcing Loop Vectorization with Array Notations

Jorge_Martinis
Beginner
949 Views
I wonder what is preventing the compiler from vectorizing the innermost loop in, for example, the following function:
template<typename T> inline void MatrixVectorProduct(const matrix<T>& m, const std::vector<T>& rhs, std::vector<T>& lhs)
{
    size_t cols = m.cols();

    const T* restrict pcol = &(*rhs.begin());
    //outer loop (/Qvec-report:3): nonstandard loop is not a vectorization candidate (Fine!)
    _Cilk_for(size_t i=0; i
    {
        const T* prow = &(*(m.begin() + i * cols));
        //inner loop(/Qvec-report:3): modifying order of operation not allowed under given switches (?)
        lhs[i] = __sec_reduce_add( prow[0:cols] * pcol[0:cols] );
    }
}
Under the switches: /O3 /Qstd=c99 /Qopenmp /Qfp-speculation:safe /Qrestrict /arch:SSE2, this function's performance approaches Intel MKL's 'cblas_dgemv()'.
Cheers,
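For readers unfamiliar with the array-notation syntax: a section a[start:length] names length consecutive elements beginning at start, and __sec_reduce_add sums the elements of its argument. The inner statement is therefore a dot product of one matrix row with the vector. A minimal sketch of the same computation in standard C++ follows. The names row_dot and matrix_vector_product and the flat std::vector<double> storage are assumptions made for illustration, not the poster's matrix class.

```cpp
#include <cstddef>
#include <vector>

// Scalar counterpart of __sec_reduce_add(prow[0:cols] * pcol[0:cols]):
// multiply element-wise, then add the products one at a time in index order.
double row_dot(const double* prow, const double* pcol, std::size_t cols)
{
    double sum = 0.0;
    for (std::size_t k = 0; k < cols; ++k)
        sum += prow[k] * pcol[k];
    return sum;
}

// Hypothetical stand-in for the posted function.
// m is a rows x cols matrix stored flat in row-major order; rhs has cols elements.
std::vector<double> matrix_vector_product(const std::vector<double>& m,
                                          std::size_t rows, std::size_t cols,
                                          const std::vector<double>& rhs)
{
    std::vector<double> lhs(rows);
    for (std::size_t i = 0; i < rows; ++i)
        lhs[i] = row_dot(m.data() + i * cols, rhs.data(), cols);
    return lhs;
}
```

A scalar version like this can also serve as a reference when checking the results of a vectorized build.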
12 Replies
Pablo_H_Intel
Employee
When I tried this, instantiated with both float and double and using both the Windows and Linux versions of the compiler, the vectorization report I get says that the inner loop is vectorized. What does your matrix class look like? Are you using the released version of the compiler, or a beta, or something else?

- Pablo
Brandon_H_Intel
Employee
I think we may have figured out the trigger here. Assuming you're building out of the IDE, is /fp:precise specified by default? Try changing to /fp:fast if it is.

The question I'm following up on is whether this behavior of the vectorizer makes sense in the context of array notations.
Jorge_Martinis
Beginner
Indeed, I intentionally specify /fp:precise.
Jorge_Martinis
Beginner
After switching to /fp:fast, the loop is vectorized. However, the program then crashes at runtime, with the thread's call stack stopped right at the loop.
Jorge_Martinis
Beginner
My matrix class uses contiguous storage and row-major layout. I am using the Intel Composer 2011 XE Update 1 (12.1.127).
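With contiguous row-major storage, element (i, j) of a matrix with cols columns sits at offset i * cols + j, which is why m.begin() + i * cols in the posted code points at the start of row i. A row is therefore unit-stride, while a column is reached with a stride of cols. The sketch below illustrates this. RowMajor and its members are hypothetical names, not the poster's class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix, for illustration only.
struct RowMajor {
    std::size_t rows, cols;
    std::vector<double> data;   // rows * cols elements, stored row after row

    RowMajor(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const double* row(std::size_t i) const { return data.data() + i * cols; }
};

// Dot product of row i of a with column j of b.
double row_times_column(const RowMajor& a, std::size_t i,
                        const RowMajor& b, std::size_t j)
{
    assert(a.cols == b.rows && i < a.rows && j < b.cols);
    const double* prow = a.row(i);
    const double* pcol = b.data.data() + j;    // first element of column j
    double sum = 0.0;
    for (std::size_t k = 0; k < a.cols; ++k)
        sum += prow[k] * pcol[k * b.cols];     // unit stride vs. stride b.cols
    return sum;
}
```

The last element read from the column is at offset j + (a.cols - 1) * b.cols, which stays inside the storage only when the dimensions agree, hence the assertion.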
Brandon_H_Intel
Employee
Jorge,

If you turn on /W4, do you get any remarks like the following?

remark #18009: A temporary array is allocated to resolve data dependencies

If so, I think you might have a stack overflow caused by some of the array notation code. Let me know - I have an open problem report on this that I can link this thread to.
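As background on what such a temporary is for, here is a generic sketch in standard C++. It is an assumption about the kind of situation the remark describes, not a description of the compiler's internals. When the source and destination of a whole-array assignment overlap, writing in place can read values that were just overwritten, while copying the source into a temporary first preserves "read everything, then write" behavior. The function names are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// In-place forward loop: each iteration reads a value the previous one just wrote.
void shift_in_place(std::vector<int>& a)
{
    for (std::size_t k = 1; k < a.size(); ++k)
        a[k] = a[k - 1] + 1;
}

// "Evaluate the whole right-hand side first": copy the source into a temporary,
// then write the destination from the copy.
void shift_with_temporary(std::vector<int>& a)
{
    const std::vector<int> tmp(a);   // heap-allocated in this sketch
    for (std::size_t k = 1; k < a.size(); ++k)
        a[k] = tmp[k - 1] + 1;
}
```

Starting from {10, 20, 30, 40}, the in-place loop produces {10, 11, 12, 13} and the version with a temporary produces {10, 11, 21, 31}. If a temporary like this were placed on the stack instead of the heap, a large array could exhaust the stack, which is the failure mode suspected above.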

Jorge_Martinis
Beginner
Brandon,
After turning on /W4, I found no such remarks. Under /fp:fast, the MatrixVectorProduct() function (from the first post) builds and runs.
On the other hand, the following function works under /fp:precise (without innermost loop vectorization), whereas under /fp:fast the innermost loop is vectorized but the program crashes at runtime with an unhandled access violation.
template<typename T> inline void MatrixProduct(const matrix<T>& m, const matrix<T>& rhs, matrix<T>& lhs)
{
    //assert(...) on all dimensions
    size_t mcols = m.cols();
    size_t ncols = rhs.cols();
    const T* pcol = &(*rhs.begin());//restrict pointer candidate
    _Cilk_for(size_t i=0; i
    {
        const T* prow = &(*(m.begin() + i * mcols));
        for(size_t j=0; j
        {
            lhs = __sec_reduce_add(prow[0:mcols] * pcol);//acc violation on vect
        }
    }
}
Compiler:
/c /O2 /Ob2 /Oi /Ot /Oy /Qipo /I "C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\include\ia32" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /Gy /arch:SSE2 /fp:fast /Fo"Release/" /Fd"Release/vc90.pdb" /W4 /nologo /Zi /Qopenmp /Quse-intel-optimized-headers /Qstd=c99 /Qrestrict /Qvec-report3
Linker:
mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /INCREMENTAL:NO /nologo /LIBPATH:"C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\lib\ia32" /NODEFAULTLIB:"libcmt.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /DYNAMICBASE /NXCOMPAT /MACHINE:X86
Cheers,
Brandon_H_Intel
Employee
Hi Jorge,

This definitely looks like a compiler issue from what you've sent me. The vectorizer is doing something improperly, I think. I've created a problem report for our vectorizer team, and I'll update the thread as their investigation proceeds.
Jorge_Martinis
Beginner
Brandon,
I think I've found an answer to our follow-up question in a related article at:
It seems that the behavior makes sense in this context because the /fp:precise model allows only value-safe optimizations. Vectorizing the reduction in __sec_reduce_add() requires reassociating the sums, which is value-unsafe.
The question remains why it fails under /fp:fast, though.
Regards,
Brandon_H_Intel
Employee
Hi Jorge,

Correct. Because /fp:precise is specified, the compiler can't safely vectorize the array notation reduction. However, the crash after vectorization still looks like a real issue to me.
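To make the value-safety point concrete: floating-point addition is not associative, so a reduction split across vector lanes can return a different result than a strict left-to-right sum. The sketch below emulates a 2-lane reduction in plain scalar C++. The function names are made up for illustration, and it assumes IEEE-754 single precision compiled without fast-math style options.

```cpp
#include <cassert>
#include <cstddef>

// Strict left-to-right sum: the order a value-safe model has to preserve.
float sum_in_order(const float* x, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t k = 0; k < n; ++k)
        s += x[k];
    return s;
}

// Two independent partial sums (even and odd elements), combined at the end.
// This mimics, in scalar code, how a 2-lane vector reduction reorders the additions.
float sum_two_lanes(const float* x, std::size_t n)
{
    assert(n % 2 == 0);   // keeps the sketch free of remainder handling
    float lane0 = 0.0f, lane1 = 0.0f;
    for (std::size_t k = 0; k < n; k += 2) {
        lane0 += x[k];
        lane1 += x[k + 1];
    }
    return lane0 + lane1;
}
```

With the input {1e8f, 1.0f, -1e8f, 1.0f}, the in-order sum is 1.0f, because the first 1.0f is absorbed when added to 1e8f, while the two-lane sum is 2.0f. A difference of this kind is what /fp:precise rules out and /fp:fast permits.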
Jorge_Martinis
Beginner
Brandon,
I agree, and it is a very important one.
I look forward to hearing more about it.
Cheers,
Brandon_H_Intel
Employee
Hi Jorge,

We've put a fix for this issue into Update 3. Please try Update 3, and let me know if you still have problems.