cilk_for "loop iteration count cannot be computed"

TimP · ‎01-04-2017

Apparently, Advisor cannot analyze results for display in the Survey Report where cilk_for is in use. The "Why no vectorization?" field shows the comment quoted in the title. The times are reported as 0 in the summary, even though reasonable values are attributed to cilkrts_cilk_for in the source and assembly views. I built with -debug:inline-debug-info -Qipo-.

Kevin_O_Intel1 · ‎01-04-2017

Hey Tim,

If each "iteration" of the cilk_for loop is a separate thread then it is possible that the loop is not being vectorized.

I would have to have your example to say for sure, but there may be some vector instructions in the loop; from our analysis, the loop itself is executed in parallel.

Looking at the loop analytics in Advisor would be the way to show how many vector instructions are executed.

You could also try adding an inner loop (i.e., strip mining) to vectorize, and use the cilk_for for parallelism.
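A minimal sketch of the strip-mining suggestion above, not taken from the poster's loopdcp.c. The function name `saxpy_strip_mined`, the `OUTER_FOR` macro, and the strip length are hypothetical. With a Cilk Plus compiler, one would include `<cilk/cilk.h>` and use `cilk_for` for the outer loop; the plain `for` stand-in only keeps the sketch portable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: with a Cilk Plus compiler, replace this with
   cilk_for (from <cilk/cilk.h>) so that strips run in parallel. */
#define OUTER_FOR for

enum { STRIP = 1024 };  /* assumed strip length; tune per platform */

/* Computes y[i] += a * x[i], split into strips. The outer loop is the
   candidate for parallelism; the inner loop is a simple unit-stride
   loop with a computable trip count, which the compiler can vectorize. */
void saxpy_strip_mined(float a, const float *restrict x,
                       float *restrict y, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t nstrips = (n + STRIP - 1) / STRIP;  /* round up */

    OUTER_FOR (size_t s = 0; s < nstrips; ++s) {
        size_t lo = s * STRIP;
        size_t hi = (lo + STRIP < n) ? lo + STRIP : n;  /* last strip may be short */
        for (size_t i = lo; i < hi; ++i)
            y[i] += a * x[i];
    }
}
```

The point of the split is that the inner loop no longer goes through the Cilk runtime's loop splitting, so the vectorizer and the analysis tool both see an ordinary counted loop.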

I'd be happy to look at your example.

Regards,

Kevin

TimP · ‎01-04-2017

Kevin,

I'm not expecting the top-level cilk_for loop to vectorize, except in the cases where a _Simd clause is included (with a demonstrated benefit). In most cases, on a dual-core HSW platform, Cilk(tm) Plus performance is good enough to convince me of effective vectorization. Even on MIC KNC, where Cilk(tm) Plus typically reaches about 30% of plain C99 performance (after the numbers of threads and workers are optimized by trial and error), the vectorization typically shows a 3x speedup.

icl -O3 -Qipo- -debug:inline-debug-info -QxHost -Qunroll:4 -Qopt-report:4 -c loopdcp.c

ifort -O3 -Qipo- -Qopenmp -QxHost -debug:inline-debug-info -fpp -Qopt-report:4 -assume:underscore -names:lowercase loopdcp.obj maind.F forttime.f90

set CILK_NWORKERS=3

set OMP_NUM_THREADS=2

set OMP_PLACES=cores

I don't know of any way to force use of 2+ cores other than to set CILK_NWORKERS=3.

I suppose it may be possible to run on 1 core with an option which serializes cilk_for.

Thanks,

Tim

Kevin_O_Intel1 · ‎01-04-2017

Comparing against the serial performance would give you the precise speedup, but with the loop analytics tab you can make a ballpark estimate based on our static analysis of the instructions in the loop. Running a trip count/FLOPS analysis would also give you another metric to compare.

Kevin_O_Intel1 · ‎01-05-2017

Hi Tim,

Can you send g2c.h?

Kevin

TimP · ‎01-05-2017

Sorry, still some of those f2c translation relics.

ifort -O3 -Qipo- -Qopenmp -QxHost -debug:inline-debug-info -fpp -Qopt-report:4 -assume:underscore -names:lowercase loopdcp.obj maind.F forttime.f90

When /Qcilk-serialize is set in the ICL compilation, a majority of the "iteration count cannot be computed" notations change to "vector dependence prevents vectorization," but 4 more cases are displayed as implementing AVX or AVX2 vectorization. Run time barely increased with cilk_for parallelization suppressed.
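As a rough illustration of what serialization means here: to my understanding, /Qcilk-serialize has the same effect as including `<cilk/cilk_stub.h>`, which maps the Cilk keywords to their serial equivalents. The sketch below imitates that with its own macro; the `SERIALIZE` switch and the `scale` function are hypothetical, and relying on the `__cilk` predefined macro to detect a Cilk Plus compiler is an assumption.

```c
#include <assert.h>

/* Assumption: a Cilk Plus compiler predefines __cilk. SERIALIZE is a
   hypothetical switch that plays the role of /Qcilk-serialize. */
#if defined(__cilk) && !defined(SERIALIZE)
#include <cilk/cilk.h>
#else
#define cilk_for for   /* serial elision: cilk_for becomes a plain for */
#endif

/* Scales v[0..n-1] by f. Serialized, this is an ordinary counted loop,
   so the compiler reports on it like any other for loop. */
void scale(double *v, int n, double f)
{
    assert(n >= 0);
    cilk_for (int i = 0; i < n; ++i)
        v[i] *= f;
}
```

This would be consistent with the change in diagnostics described above: once the loop is a plain `for`, the trip count is visible and other reasons for non-vectorization are reported instead.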

In function s2102, the (int) cast, which is necessary for performance, apparently leads Advisor to assume VL=8, cutting the reported "efficiency" in half (the same effect as in the C or Fortran code).
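To make the halving concrete, here is a small arithmetic sketch. It assumes that the reported efficiency is the estimated vectorization gain divided by the assumed vector length (VL), which is my understanding of how Advisor derives the figure. The function name and the gain value are illustrative, not measurements from this thread.

```c
#include <assert.h>

/* Assumed definition: efficiency = estimated gain / vector length. */
double vector_efficiency(double gain, int vl)
{
    assert(vl > 0);
    return gain / vl;
}

/* Illustrative values only: the same 3x gain reported against
   VL=4 versus VL=8.
   vector_efficiency(3.0, 4) -> 0.75
   vector_efficiency(3.0, 8) -> 0.375 */
```

Under that assumption, a loop whose arithmetic is on 4 elements per vector but which is classified by a 32-bit integer operation (8 per 256-bit vector) would show half the efficiency for the same measured gain.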

TimP · ‎01-05-2017

In the corrected version of function s176 (in the attachment), execution with cilk-serialize is confined to the non-vector remainder loop. The function doesn't show up in the Advisor summary either with or without cilk-serialize.

If the operands of reduce_add are reversed (so as to align one of them), the remainder loop is vectorized. The cilk-serialize build then executes the primary loop version, which shows up in the Advisor summary as 47% efficient.

The sampling interval should be reduced to about 2 ms, as the total run time is not much over 1 second.

C and Fortran versions of s125 and s2102 take advantage of pragma nontemporal, but this appears to be excluded by Cilk(tm) Plus.