selmilab · ‎09-08-2014

Hello, I have this little sample code:

double foo(double **cache, double *prod, int iQ, int l)
{
	double FF = 0;
	for (int iP = 0; iP < l; ++iP) {
		const double * p = cache[iP];
		register double prod1 = prod[iP];
		for (int iP2 = 0; iP2 < l; ++iP2) {
			FF += prod[iP2] * p[iP2] * prod1;
		}
	}
	return FF;
}

The compiler options are: -O3 -std=c99 -fstrict-aliasing -xSSE4.2 -align -qopt-report=5

The optimization report is:

Begin optimization report for: foo(double **, double *, int, int)

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(double **, double *, int, int)) [1/1=100.0%] x.c(4,1)

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at x.c(6,2)
remark #25096: Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)
remark #25452: Original Order found to be proper, but by a close margin
remark #25461: Imperfect Loop Unroll-Jammed by 2 (pre-vector)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10
remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10
remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10
remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10

LOOP BEGIN at x.c(9,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10
remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10
remark #15346: vector dependence: assumed ANTI dependence between FF line 10 and FF line 10
remark #15346: vector dependence: assumed FLOW dependence between FF line 10 and FF line 10
remark #25439: unrolled with remainder by 2
LOOP END

LOOP BEGIN at x.c(9,3)
<Remainder>
LOOP END
LOOP END

LOOP BEGIN at x.c(6,2)
<Remainder>
remark #15542: loop was not vectorized: inner loop was already vectorized

LOOP BEGIN at x.c(9,3)
<Peeled>
LOOP END

LOOP BEGIN at x.c(9,3)
remark #15388: vectorization support: reference prod has aligned access [ x.c(10,4) ]
remark #15388: vectorization support: reference p has aligned access [ x.c(10,4) ]
remark #15399: vectorization support: unroll factor set to 4
remark #15300: LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15448: unmasked aligned unit stride loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 14
remark #15477: vector loop cost: 18.000
remark #15478: estimated potential speedup: 2.950
remark #15479: lightweight vector operations: 8
remark #15480: medium-overhead vector operations: 1
remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at x.c(9,3)
remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at x.c(9,3)
<Remainder>
remark #15389: vectorization support: reference prod has unaligned access [ x.c(10,4) ]
remark #15388: vectorization support: reference p has aligned access [ x.c(10,4) ]
remark #15381: vectorization support: unaligned access used inside loop body [ x.c(10,4) ]
remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at x.c(9,3)
<Remainder>
LOOP END
LOOP END

I have two issues regarding this report:

1) First it says the loop was not vectorized because of dependencies; then it says the loop was vectorized. Why the disagreement?

2) First prod has aligned access, then prod has unaligned access. Why?

Any thoughts?

Thank you.

Patrick_F_Intel1 · ‎09-08-2014

Hello Selmilab,

Can you post your question on the Intel C++ compiler forum at https://software.intel.com/en-us/forums/intel-c-compiler ?

You'll get a quicker response there.

Pat

TimP · ‎09-08-2014

It looks like the compiler may have generated separate versions of the inner loop: one for the case where peeling for alignment aligns both p and prod, and others for the remaining cases. So there is some possibility that you will not hit the vectorized loop at run time.

For remainder vectorization it didn't try to align both p and prod. As the main vector loop takes the data 8 at a time, the compiler judged it worthwhile to optimize the remainder with SIMD. Possibly it may be able to use the remainder loop for cases where the fully aligned version doesn't apply.

While CPUs which support AVX may be able to support both SSE4 and AVX vectorization without losing out on an unaligned operand, the SSE4.2 choice has to support the earliest SSE2 CPUs.
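
To make the peel / main / remainder split concrete, here is a scalar model of how a vectorizer might carve up a single dot-product loop. It is only a sketch: the name dot_split, the 16-byte vector width, and the chunk of 8 (2 doubles per SSE vector times an unroll factor of 4) are assumptions chosen to match the report, not the code the compiler actually emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes: 16-byte SSE vectors, 8 doubles per unrolled trip. */
enum { VEC_BYTES = 16, CHUNK = 8 };

/* Hypothetical scalar model of a peeled, chunked, remaindered loop. */
double dot_split(const double *a, const double *b, int n)
{
	double sum = 0.0;
	int i = 0;

	/* Peel: scalar iterations until a + i sits on a 16-byte boundary. */
	while (i < n && ((uintptr_t)(a + i) % VEC_BYTES) != 0) {
		sum += a[i] * b[i];
		++i;
	}

	/* Main body: stands in for the unrolled vector loop.
	   Only a is known to be aligned here; b may or may not be,
	   which is why a compiler can need more than one version. */
	for (; i + CHUNK <= n; i += CHUNK) {
		for (int k = 0; k < CHUNK; ++k)
			sum += a[i + k] * b[i + k];
	}

	/* Remainder: fewer than CHUNK elements are left. */
	for (; i < n; ++i)
		sum += a[i] * b[i];

	return sum;
}
```

Each of the three loops shows up as its own LOOP BEGIN block for the same source line in the report, which is why one source loop can be reported as both "not vectorized" and "vectorized".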

selmilab · ‎09-08-2014

How can I tell the compiler that the parameters point to 16-byte aligned memory? I've found a directive like __declspec(align(16)), but it doesn't work on parameters.

TimP · ‎09-08-2014

I'd suggest "#pragma vector aligned" right ahead of the inner for(). If you switch to AVX it would mean 32-byte aligned. Also a good alternative with Intel compilers is the __aligned designator but other compilers will complain.

As the compiler has generated code for peel and checking alignment, those pragmas would only simplify the code (and maybe the compiler messages), giving you a slight advantage in starting the loop.
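
A minimal sketch of where the pragma would go, assuming every row of cache and prod itself really is 16-byte aligned. The names foo_aligned and alloc_aligned16 are made up for illustration, the unused iQ parameter is dropped, and aligned_alloc is C11 rather than the C99 used in the original build. Compilers other than Intel's will usually just ignore the pragma, possibly with a warning.

```c
#include <stdlib.h>

/* Hypothetical helper: n doubles on a 16-byte boundary.
   The size is rounded up to a multiple of the alignment,
   as C11 aligned_alloc expects. */
static double *alloc_aligned16(size_t n)
{
	size_t bytes = (n * sizeof(double) + 15) / 16 * 16;
	return aligned_alloc(16, bytes);
}

double foo_aligned(double **cache, double *prod, int l)
{
	double FF = 0;
	for (int iP = 0; iP < l; ++iP) {
		const double *p = cache[iP];
		double prod1 = prod[iP];
		/* Promise (not a check) that p and prod are aligned for the
		   target vector width: 16 bytes with -xSSE4.2. */
#pragma vector aligned
		for (int iP2 = 0; iP2 < l; ++iP2) {
			FF += prod[iP2] * p[iP2] * prod1;
		}
	}
	return FF;
}
```

If the promise is false, the generated aligned loads can fault at run time, so the allocation side has to match.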

selmilab · ‎09-09-2014

Thank you all for your answers. You pointed me in the right direction. To solve my issues I now allocate my arrays with _mm_malloc, enforcing 32-byte alignment, and I use the __assume_aligned directive in the code that works on these arrays. Everything seems to work fine.
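
For anyone finding this later, the approach looks roughly like the sketch below. It assumes an x86 compiler that provides _mm_malloc in xmmintrin.h. The names alloc_row32, foo_assume and the ASSUME_ALIGNED macro are made up for this post, and the fallback to __builtin_assume_aligned for GCC/Clang is an addition for portability, not something I use in my Intel build.

```c
#include <stddef.h>
#include <xmmintrin.h>	/* _mm_malloc, _mm_free on x86 compilers */

/* __assume_aligned is Intel-specific; the GCC/Clang branch is an
   assumption added so the sketch builds elsewhere. */
#if defined(__INTEL_COMPILER)
#define ASSUME_ALIGNED(ptr, n) __assume_aligned((ptr), (n))
#else
#define ASSUME_ALIGNED(ptr, n) ((ptr) = __builtin_assume_aligned((ptr), (n)))
#endif

/* Hypothetical helper: one array of n doubles on a 32-byte boundary.
   Release it with _mm_free, not free. */
static double *alloc_row32(size_t n)
{
	return (double *)_mm_malloc(n * sizeof(double), 32);
}

double foo_assume(double **cache, double *prod, int l)
{
	double FF = 0;
	ASSUME_ALIGNED(prod, 32);
	for (int iP = 0; iP < l; ++iP) {
		const double *p = cache[iP];
		double prod1 = prod[iP];
		/* Each row was allocated separately, so each row start is aligned. */
		ASSUME_ALIGNED(p, 32);
		for (int iP2 = 0; iP2 < l; ++iP2) {
			FF += prod[iP2] * p[iP2] * prod1;
		}
	}
	return FF;
}
```

Note that the assumption covers only the start of each array, which is enough here because both inner-loop accesses begin at index 0.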

McCalpinJohn · ‎09-09-2014

Along the same lines, if you compile for OpenMP you will probably go back to generating multiple versions of the loops again, since the starting points for each OpenMP thread are not known until run-time when the OMP_NUM_THREADS variable is available to determine the distribution of data elements to threads.

(I think that the OpenMP SIMD directive(s) are intended to help the OpenMP compiler to maintain alignment even with an arbitrary number of threads, but I have not played with that (relatively new) feature yet.)

Given multiple versions of the loop you then want to know which one(s) are actually being executed. The easiest way to figure this out is probably with Intel's Amplifier XE (VTune) profiling -- you can click on the hot spots in the GUI to drill down to the assembly code to figure out which version(s) are being executed most of the time. Once you are pointed at the "hot" assembly code, it is typically pretty easy to see how the computations are vectorized and whether the memory accesses are assumed to be aligned. Understanding *why* is sometimes harder, but that is all part of the fun!

TimP · ‎09-10-2014

#pragma omp simd aligned (the OpenMP 4 vectorization pragma) offers some functionality equivalent to Intel proprietary __assume_aligned.
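
A rough sketch of that form, assuming 32-byte aligned data as in the _mm_malloc solution above. The name foo_simd is made up, and the reduction clause is my addition so that the sum into FF is declared explicitly. Without OpenMP (or an OpenMP SIMD option) enabled, the pragma is simply ignored and the loop runs as plain C.

```c
double foo_simd(double **cache, double *prod, int l)
{
	double FF = 0;
	for (int iP = 0; iP < l; ++iP) {
		const double *p = cache[iP];
		double prod1 = prod[iP];
		/* aligned(...) is a promise by the programmer, not a check:
		   both pointers must really be 32-byte aligned. */
#pragma omp simd aligned(p, prod : 32) reduction(+ : FF)
		for (int iP2 = 0; iP2 < l; ++iP2) {
			FF += prod[iP2] * p[iP2] * prod1;
		}
	}
	return FF;
}
```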

As John mentioned, the situation gets complicated with OpenMP parallel (threading). As far as I know, in practice it's not possible to assert alignment unless the product of the number of threads and the hardware SIMD vector width matches the total loop count. This is one of the reasons why OpenMP parallel is more effective with an outer parallel loop and an inner vector loop than when a single loop is compiled as both threaded and SIMD parallel (a situation supported by Intel compilers with #pragma omp parallel for simd). If any thread has to take the time to process misalignment, all threads may as well do so.

Even with Intel compilers, there is a question about the degree of support for multiple OpenMP 4 clauses. gcc seems less likely to produce crashes when more clauses are added, but also less likely to do anything useful with additional clauses.

selmilab · ‎09-10-2014

Thank you both for pointing out these issues with OpenMP, which I had never considered.

The code snippet is at the end of a long function call tree. At the top of this tree there is a for loop which is parallelized via the usual #pragma omp parallel for but the two loops of the snippet are supposed to be executed sequentially by each thread.
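
In other words, the structure is roughly the following. This is a simplified sketch, not my real code: kernel, sum_over_q and the prods array are stand-ins for the real call tree, and the omp simd hint on the inner loop is optional. Threads are split over iQ only, so every inner loop still starts at element 0 of p and prod, and the alignment of the array starts is preserved in each thread.

```c
/* Stand-in for foo: runs sequentially inside each thread.
   Only the innermost loop is a SIMD candidate. */
static double kernel(double **cache, const double *prod, int l)
{
	double FF = 0;
	for (int iP = 0; iP < l; ++iP) {
		const double *p = cache[iP];
		double prod1 = prod[iP];
#pragma omp simd reduction(+ : FF)
		for (int iP2 = 0; iP2 < l; ++iP2) {
			FF += prod[iP2] * p[iP2] * prod1;
		}
	}
	return FF;
}

/* Hypothetical driver standing in for the top of the call tree:
   iterations over iQ are distributed across threads. */
double sum_over_q(double **cache, double **prods, int nQ, int l)
{
	double total = 0;
#pragma omp parallel for reduction(+ : total)
	for (int iQ = 0; iQ < nQ; ++iQ) {
		total += kernel(cache, prods[iQ], l);
	}
	return total;
}
```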

Loop vectorization and how to read optimization report