meldaproduction · ‎06-14-2015

Hi, I'm fighting ICC to optimize a certain code, something like this:

float Function(float x)
{ 
    // a few lines of code, some short floating point math
};

...

for (int i=0; i<cnt; i++)
{
    // a few lines of code
    dst = Function(x);
};

I also added a "force vectorize" pragma to the loop. If I leave it like this, the test program takes 10 seconds. If I put the body of "Function" directly into the loop, however, it takes 6 seconds, because ICC correctly uses AVX and generates a fairly long vectorized sequence from it. So that's a 40% improvement! I even tried _Pragma("vector always") before the "Function" call, but nothing changed; it's still slow.

Any ideas?

TimP · ‎06-14-2015

Depending on your compile options, a plain inline or forceinline function definition should give you the equivalent of copying the code inline, provided the definition is in the same file or pulled in by an include. For functions in separate files, the options include IPO, elemental functions, or omp declare simd.
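
As a rough illustration of the omp declare simd route, here is a minimal sketch. The body of Function and the Process wrapper are hypothetical stand-ins (the original post does not show them), and the example assumes dst and src never overlap.

```cpp
// Hypothetical stand-in for the short floating-point "Function" in the question.
// "declare simd" asks the compiler to also emit a vector variant of the function,
// so a SIMD loop can call it even when the definition is not inlined.
#pragma omp declare simd
float Function(float x)
{
    return x * x * 0.5f + 1.0f;
}

// Hypothetical driver loop. "omp simd" tells the compiler the iterations are
// independent, so the caller must guarantee that dst and src do not overlap.
void Process(float* dst, const float* src, int cnt)
{
    #pragma omp simd
    for (int i = 0; i < cnt; i++)
    {
        dst[i] = Function(src[i]);
    }
}
```

The pragmas only take effect when OpenMP SIMD support is enabled in the compiler options (for ICC this should be -qopenmp or -qopenmp-simd, but check the documentation for your version); without it they are ignored and the code still runs as a scalar loop.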

jimdempseyatthecove · ‎06-14-2015

Have you looked at...

__declspec(vector) (Windows*)

__attribute__((vector)) (Linux* and OS X*)

Combines with the map operation at the call site to provide the data parallel semantics. When multiple instances of the vector declaration are invoked in a parallel context, the execution order among them is not sequenced.

Attribute your function with that.

Inlining would not require this decoration, but keep in mind that wishing for the code to vectorize does not relieve you of your responsibility to make it vectorizable.

What are the few lines of code inside your function? Something in there must be thwarting your intent.

Jim Dempsey

meldaproduction · ‎06-15-2015

Thank you for the info. I started digging into ICC logs and found something odd - in way too many cases ICC doesn't allow vectorization. Consider this example I have here right now:

template <class type>
static MFORCEINLINE type GetCubic(type y0, type y1, type y2, type y3, type x)
{
    type a = (3 * (y1 - y2) - y0 + y3) * (type)0.5;
    type b = 2 * y2 + y0 - (5 * y1 + y3) * (type)0.5;
    type c = (y2 - y0) * (type)0.5;
    return a * x * x * x + b * x * x + c * x + y1;
};

...
// Loop to be vectorized, inlined function.
		const float y0 = PrecomputedPtr[index-1];
		const float y1 = PrecomputedPtr[index+0];
		const float y2 = PrecomputedPtr[index+1];
		const float y3 = PrecomputedPtr[index+2];
		return MInterpolation::GetCubic(y0, y1, y2, y3, x);
};

This gets vectorized just fine. Now if I change the statement to this:

return MInterpolation::GetCubic(PrecomputedPtr[index-1], PrecomputedPtr[index+0], PrecomputedPtr[index+1], PrecomputedPtr[index+2], x);

which is the exact same thing, I just didn't create new variables, ICC doesn't vectorize it and says this:

remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed ANTI dependence between this_123622 line 174 and a line 174
remark #15346: vector dependence: assumed FLOW dependence between a line 174 and this_123622 line 174

All of the functions are "const", PrecomputedPtr is a member of the object, and I also tried applying __declspec(noalias) to the only output pointer, so the compiler should assume nothing is going to change. Yet it says there's a dependence...

Any ideas?
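
One thing that may be worth trying here, sketched below under several assumptions: copy the member pointer into a local before the loop and mark the loop with omp simd, so the compiler does not have to prove that stores through the output pointer leave this->PrecomputedPtr untouched. The Interpolator struct, the Process signature and the index/x arrays are hypothetical; only GetCubic and PrecomputedPtr come from the post, and whether this actually removes the assumed dependence for a given ICC version is not confirmed.

```cpp
// Same math as the GetCubic shown above, with a plain "inline" in place of MFORCEINLINE.
template <class type>
static inline type GetCubic(type y0, type y1, type y2, type y3, type x)
{
    type a = (3 * (y1 - y2) - y0 + y3) * (type)0.5;
    type b = 2 * y2 + y0 - (5 * y1 + y3) * (type)0.5;
    type c = (y2 - y0) * (type)0.5;
    return a * x * x * x + b * x * x + c * x + y1;
}

// Hypothetical owner of the table; the real class is not shown in the post.
struct Interpolator
{
    const float* PrecomputedPtr;

    // Hypothetical loop. dst must not overlap the table, index or x arrays,
    // because "omp simd" asserts that the iterations are independent.
    void Process(float* dst, const int* index, const float* x, int cnt) const
    {
        // Local copy: the loop body no longer reads through "this".
        const float* table = PrecomputedPtr;

        #pragma omp simd
        for (int i = 0; i < cnt; i++)
        {
            const int k = index[i];
            dst[i] = GetCubic(table[k - 1], table[k], table[k + 1], table[k + 2], x[i]);
        }
    }
};
```

The array arguments are passed directly to GetCubic here, as in the "nicer" version, so the only change being tested is the local copy of the pointer.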

emmanuel_attia · ‎06-17-2015

If you know you want it to be vectorized, you know how you want it to be vectorized, and you are going to use Intel-specific stuff anyway, I would suggest taking the time to vectorize your interpolation kernel by hand once and for all (forceinline is a good idea, so all the messy AVX/SSE stuff is hidden behind a function bound to "vanish" at link time).

meldaproduction · ‎06-17-2015

I don't really follow your answer. My point with the code above is that it doesn't make sense for the compiler to complain and refuse to vectorize something. There's no difference between the two versions; one is just "nicer". But ICC doesn't like the nicer one...

How to force ICC to inline and vectorize a function?