Enforcing Loop Vectorization with Array Notations

Jorge_Martinis · ‎12-16-2010

What might be preventing the compiler from vectorizing the innermost loop in the following function?

template <typename T> inline void MatrixVectorProduct(const matrix<T>& m, const std::vector<T>& rhs, std::vector<T>& lhs)
{
    size_t cols = m.cols();
    const T* restrict pcol = &(*rhs.begin());
    //outer loop (/Qvec-report:3): nonstandard loop is not a vectorization candidate (Fine!)
    _Cilk_for(size_t i=0; i
    {
        const T* prow = &(*(m.begin() + i * cols));
        //inner loop(/Qvec-report:3): modifying order of operation not allowed under given switches (?)
        lhs = __sec_reduce_add( prow[0:cols] * pcol[0:cols] );
    }

Under the switches: /O3 /Qstd=c99 /Qopenmp /Qfp-speculation:safe /Qrestrict /arch:SSE2, this function's performance approaches Intel MKL's 'cblas_dgemv()'.

Cheers,

Pablo_H_Intel · ‎12-17-2010

When I tried this, instantiated with both float and double and using both the Windows and Linux versions of the compiler, the vectorization report I get says that the inner loop is vectorized. What does your matrix class look like? Are you using the released version of the compiler, or a beta, or something else?

- Pablo

Brandon_H_Intel · ‎12-17-2010

I think we may have figured out the trigger here. Assuming you're building out of the IDE, is /fp:precise specified by default? Try changing to /fp:fast if it is.

The question I'm following up on is whether this behavior of the vectorizer makes sense in the context of array notations.

Jorge_Martinis · ‎12-17-2010

Indeed, I intentionally specify /fp:precise.

Jorge_Martinis · ‎12-17-2010

After switching to /fp:fast, the loop is vectorized. However, the program crashes at runtime, with the call stack stopped right at the loop.

Jorge_Martinis · ‎12-20-2010

My matrix class uses contiguous storage and row-major layout. I am using the Intel Composer 2011 XE Update 1 (12.1.127).
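For readers without the matrix class at hand, here is a minimal standard C++ sketch of what the array notation reduction computes on a contiguous, row-major layout. `RowMajorMatrix` and `MatrixVectorProductRef` are hypothetical names made up for illustration, not the class or function from this thread, and the loop is deliberately serial (no `_Cilk_for`).

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the matrix class discussed in this thread:
// contiguous storage, row-major layout.
template <typename T>
struct RowMajorMatrix {
    std::size_t nrows, ncols;
    std::vector<T> data;  // element (i, j) lives at data[i * ncols + j]
    RowMajorMatrix(std::size_t r, std::size_t c) : nrows(r), ncols(c), data(r * c) {}
    std::size_t rows() const { return nrows; }
    std::size_t cols() const { return ncols; }
};

// Serial reference: one dot product per matrix row.
template <typename T>
void MatrixVectorProductRef(const RowMajorMatrix<T>& m,
                            const std::vector<T>& rhs,
                            std::vector<T>& lhs)
{
    const std::size_t cols = m.cols();
    const T* pcol = rhs.data();
    for (std::size_t i = 0; i < m.rows(); ++i) {
        const T* prow = m.data.data() + i * cols;  // start of row i
        // Scalar counterpart of __sec_reduce_add(prow[0:cols] * pcol[0:cols]):
        // a strictly left-to-right sum of the element-wise products.
        lhs[i] = std::inner_product(prow, prow + cols, pcol, T(0));
    }
}
```

Because `std::inner_product` accumulates in a fixed left-to-right order, a reference like this can serve as a baseline when comparing results from a vectorized build.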

Brandon_H_Intel · ‎12-21-2010

Jorge,

If you turn on /W4, do you get any remarks like the following?

remark #18009: A temporary array is allocated to resolve data dependencies

If so, I think you might have a stack overflow caused by some of the array notation code. Let me know - I have an open problem report on this that I can link this thread to.

Jorge_Martinis · ‎12-21-2010

Brandon,

After turning on /W4, I found no remarks. Under /fp:fast the MatrixVectorProduct() function (thread #1) builds and runs.

On the other hand, the following function works under /fp:precise (without innermost loop vectorization), whereas under /fp:fast the innermost loop is vectorized but the program crashes at runtime with an unhandled access violation.

template <typename T> inline void MatrixProduct(const matrix<T>& m, const matrix<T>& rhs, matrix<T>& lhs)
{
    //assert(...) on all dimensions
    size_t mcols = m.cols();
    size_t ncols = rhs.cols();
    const T* pcol = &(*rhs.begin());//restrict pointer candidate
    _Cilk_for(size_t i=0; i
    {
        const T* prow = &(*(m.begin() + i * mcols));
        for(size_t j=0; j
        {
            lhs = __sec_reduce_add(prow[0:mcols] * pcol);//acc violation on vect
        }

Compiler:

/c /O2 /Ob2 /Oi /Ot /Oy /Qipo /I "C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\include\ia32" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /Gy /arch:SSE2 /fp:fast /Fo"Release/" /Fd"Release/vc90.pdb" /W4 /nologo /Zi /Qopenmp /Quse-intel-optimized-headers /Qstd=c99 /Qrestrict /Qvec-report3

Linker:

mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /INCREMENTAL:NO /nologo /LIBPATH:"C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\lib\ia32" /NODEFAULTLIB:"libcmt.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /DYNAMICBASE /NXCOMPAT /MACHINE:X86

Cheers,

Brandon_H_Intel · ‎12-23-2010

Hi Jorge,

This definitely looks like a compiler issue from what you've sent me. The vectorizer is doing something improperly, I think. I've created a problem report for our vectorizer team, and I'll update the thread as their investigation proceeds.

Jorge_Martinis · ‎01-04-2011

Brandon,

I think I've found an answer to our follow-up question in a related article at:

http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

It seems that the behavior makes sense in this context because the /fp:precise model allows only value-safe optimizations. Vectorizing the reduction loop in __sec_reduce_add() requires reassociating the sums, which makes it value-unsafe.

The question remains why it fails under /fp:fast, though.

Regards,

Brandon_H_Intel · ‎01-04-2011

Hi Jorge,

Correct. Because /fp:precise is specified, the compiler can't safely vectorize the array notation reduction. However, the crash after vectorization still looks like an issue to me.
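To make the value-safety point concrete, here is a small standard C++ sketch, not taken from the compiler, of why a vectorized sum can differ from a serial one. `SumSequential` and `SumTwoLanes` are hypothetical helper names. The two-lane version only imitates the accumulation order that a 2-wide SSE2 reduction on doubles might use; the code the vectorizer actually generates may differ.

```cpp
#include <cstddef>
#include <vector>

// Strict left-to-right sum: the order a value-safe model such as
// /fp:precise has to preserve.
inline double SumSequential(const std::vector<double>& v)
{
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Two-lane sum imitating a 2-wide vector reduction: even-indexed and
// odd-indexed elements accumulate separately and are combined at the end.
inline double SumTwoLanes(const std::vector<double>& v)
{
    double lane0 = 0.0, lane1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    if (i < v.size()) lane0 += v[i];  // leftover element when the size is odd
    return lane0 + lane1;
}
```

For the input {1e20, 1.0, -1e20, 1.0}, the sequential sum yields 1.0 (the first 1.0 is absorbed by 1e20), while the two-lane sum yields 2.0 (the large terms cancel in one lane and the small terms add up in the other). Both are reasonable answers, but they are not bitwise identical, which is exactly what a value-safe floating-point model forbids.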

Jorge_Martinis · ‎01-04-2011

Brandon,

I agree. A very important one indeed.

I look forward to hearing more about it.

Cheers,

Brandon_H_Intel · ‎04-08-2011

Hi Jorge,

We've put a fix for this issue into Update 3. Try Update 3, and let me know if you still have problems.