Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

How to force ICC to inline and vectorize a function?

meldaproduction
Beginner

Hi, I'm fighting ICC to optimize a certain code, something like this:

float Function(float x)
{ 
    // a few lines of code, some short floating point math
}

...

for (int i=0; i<cnt; i++)
{
    // a few lines of code
    dst = Function(x);
}

I also added a "force vectorize" pragma to the loop. If I leave it like this, the test program takes 10 seconds. If I put the body of "Function" directly into the loop, however, it takes 6 seconds, because ICC then uses AVX correctly and generates a fairly long vectorized sequence. That's a 40% improvement! I even tried _Pragma("vector always") before the "Function" call, but it made no difference; it is still slow.

Any ideas?

TimP
Honored Contributor III
Depending on your compile options, a plain inline or a forceinline function definition should get you the equivalent of copying the code inline, provided the definition is in the same file or pulled in by an include. For functions in separate files, the possibilities include IPO, elemental functions, or omp declare simd.
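
As a rough illustration of the last option, here is a minimal sketch of omp declare simd. The body of Function and the Process wrapper are made up for the example (the original post does not show them). The pragmas only take effect when OpenMP SIMD support is enabled (for ICC, an option along the lines of -qopenmp-simd; check your compiler version). Otherwise they are ignored and the code runs as plain scalar C++.

```cpp
// Hypothetical stand-in for the poster's "Function": a little floating-point math.
// "omp declare simd" asks the compiler to also generate a vector variant of it;
// compilers without OpenMP SIMD support simply ignore the pragma.
#pragma omp declare simd
inline float Function(float x)
{
    return x * x * 0.5f + 1.0f;
}

// Hypothetical wrapper: applies Function to every element of src.
void Process(const float* src, float* dst, int cnt)
{
    // Ask for vectorization of the loop; the call above can then use the
    // vector variant even when Function is not inlined.
    #pragma omp simd
    for (int i = 0; i < cnt; i++)
        dst[i] = Function(src[i]);
}
```

Calling Process(src, dst, cnt) on two float arrays of length cnt gives the same results with or without the pragmas; whether it actually vectorizes still has to be checked in the vectorization report.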
jimdempseyatthecove
Honored Contributor III

Have you looked at...


__declspec(vector) (Windows*)

__attribute__((vector)) (Linux* and OS X*)
 
Combines with the map operation at the call site to provide the data parallel semantics. When multiple instances of the vector declaration are invoked in a parallel context, the execution order among them is not sequenced.
 

Attribute your function with that.

Inlining would not require this decoration, but keep in mind that wishing for the code to vectorize does not relieve you of your responsibility to make it vectorizable.

What are your few lines of code inside the function? Something in there must be thwarting your intent.

Jim Dempsey

meldaproduction
Beginner

Thank you for the info. I started digging into ICC logs and found something odd - in way too many cases ICC doesn't allow vectorization. Consider this example I have here right now:

template <class type>
static MFORCEINLINE type GetCubic(type y0, type y1, type y2, type y3, type x)
{
    type a = (3 * (y1-y2) - y0 + y3) * (type)0.5;
    type b = 2*y2 + y0 - (5*y1 + y3) * (type)0.5;
    type c = (y2 - y0) * (type)0.5;
    return a * x * x * x + b * x * x + c * x + y1;
}

...
// Loop to be vectorized, inlined function.
		const float y0 = PrecomputedPtr[index-1];
		const float y1 = PrecomputedPtr[index+0];
		const float y2 = PrecomputedPtr[index+1];
		const float y3 = PrecomputedPtr[index+2];
		return MInterpolation::GetCubic(y0, y1, y2, y3, x);
};

This gets vectorized just fine. Now if I change the statement to this:

return MInterpolation::GetCubic(PrecomputedPtr[index-1], PrecomputedPtr[index+0], PrecomputedPtr[index+1], PrecomputedPtr[index+2], x);

which is the exact same thing, I just didn't create new variables, ICC doesn't vectorize it and says this:

   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed ANTI dependence between this_123622 line 174 and a line 174
   remark #15346: vector dependence: assumed FLOW dependence between a line 174 and this_123622 line 174

All of the functions are "const", PrecomputedPtr is a member of the object, and I also tried applying __declspec(noalias) to the only output pointer, so the compiler should be able to assume nothing is going to change. Yet it says there's a dependence...

Any ideas?
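
One possible reading of the remark, and it is only a guess since the full class is not shown, is that the compiler cannot prove the store through the output pointer leaves the object's members (reached through this) unchanged. A workaround along the lines of the version that did vectorize is to copy the member pointer into a local before the loop and load the samples into locals before the call. The sketch below assumes a hypothetical Interpolator class and Render method; only GetCubic and PrecomputedPtr come from the post, and __restrict is a compiler extension rather than standard C++.

```cpp
// Cubic from the post, reproduced so the example is self-contained.
template <class type>
static inline type GetCubic(type y0, type y1, type y2, type y3, type x)
{
    type a = (3 * (y1 - y2) - y0 + y3) * (type)0.5;
    type b = 2 * y2 + y0 - (5 * y1 + y3) * (type)0.5;
    type c = (y2 - y0) * (type)0.5;
    return a * x * x * x + b * x * x + c * x + y1;
}

// Hypothetical class; only the member name PrecomputedPtr is taken from the post.
class Interpolator
{
public:
    explicit Interpolator(const float* table) : PrecomputedPtr(table) {}

    // Writes cnt interpolated values; the table must hold at least cnt + 3 samples.
    void Render(float* __restrict out, int cnt, float x) const
    {
        // Local copy of the member pointer, so the loop never reads through `this`.
        const float* __restrict table = PrecomputedPtr;
        for (int i = 0; i < cnt; i++)
        {
            // Load into locals first, as in the variant that vectorized.
            const float y0 = table[i + 0];
            const float y1 = table[i + 1];
            const float y2 = table[i + 2];
            const float y3 = table[i + 3];
            out[i] = GetCubic(y0, y1, y2, y3, x);
        }
    }

private:
    const float* PrecomputedPtr;
};
```

Whether this removes the assumed ANTI/FLOW dependence in a given ICC version would still need to be confirmed in the vectorization report.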

 

 

emmanuel_attia
Beginner

If you know you want it vectorized, you know how you want it vectorized, and you are going to use Intel-specific features anyway, I would suggest taking the time to vectorize your interpolation kernel by hand once and for all (forceinline is a good idea, so all the messy AVX/SSE stuff is hidden behind a function bound to "vanish" once it is inlined).
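
To give an idea of what that could look like, here is a minimal sketch using 4-wide SSE intrinsics (an AVX version would follow the same pattern with 8-wide types). The names GetCubicSSE, GetCubicScalar, InterpolateBlock and CUBIC_HAVE_SSE are invented for the example, the polynomial is evaluated in Horner form (so rounding may differ slightly from the original expression), and a scalar path handles the tail and non-SSE targets.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define CUBIC_HAVE_SSE 1
#endif

// Scalar version, used for the loop tail and on targets without SSE.
static inline float GetCubicScalar(float y0, float y1, float y2, float y3, float x)
{
    const float a = (3.0f * (y1 - y2) - y0 + y3) * 0.5f;
    const float b = 2.0f * y2 + y0 - (5.0f * y1 + y3) * 0.5f;
    const float c = (y2 - y0) * 0.5f;
    return ((a * x + b) * x + c) * x + y1;
}

#ifdef CUBIC_HAVE_SSE
// Same cubic, evaluated for four independent lanes at once.
static inline __m128 GetCubicSSE(__m128 y0, __m128 y1, __m128 y2, __m128 y3, __m128 x)
{
    const __m128 half = _mm_set1_ps(0.5f);
    // a = (3*(y1-y2) - y0 + y3) * 0.5
    __m128 a = _mm_mul_ps(_mm_set1_ps(3.0f), _mm_sub_ps(y1, y2));
    a = _mm_mul_ps(_mm_add_ps(_mm_sub_ps(a, y0), y3), half);
    // b = 2*y2 + y0 - (5*y1 + y3) * 0.5
    __m128 b = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(2.0f), y2), y0);
    b = _mm_sub_ps(b, _mm_mul_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(5.0f), y1), y3), half));
    // c = (y2 - y0) * 0.5
    const __m128 c = _mm_mul_ps(_mm_sub_ps(y2, y0), half);
    // ((a*x + b)*x + c)*x + y1
    __m128 r = _mm_add_ps(_mm_mul_ps(a, x), b);
    r = _mm_add_ps(_mm_mul_ps(r, x), c);
    return _mm_add_ps(_mm_mul_ps(r, x), y1);
}
#endif

// Writes cnt interpolated values; table must hold at least cnt + 3 samples.
void InterpolateBlock(const float* table, float* out, int cnt, float x)
{
    int i = 0;
#ifdef CUBIC_HAVE_SSE
    const __m128 vx = _mm_set1_ps(x);
    for (; i + 4 <= cnt; i += 4)
    {
        // Four overlapping unaligned loads give y0..y3 for four outputs.
        const __m128 y0 = _mm_loadu_ps(table + i + 0);
        const __m128 y1 = _mm_loadu_ps(table + i + 1);
        const __m128 y2 = _mm_loadu_ps(table + i + 2);
        const __m128 y3 = _mm_loadu_ps(table + i + 3);
        _mm_storeu_ps(out + i, GetCubicSSE(y0, y1, y2, y3, vx));
    }
#endif
    // Remaining elements (or everything, without SSE).
    for (; i < cnt; i++)
        out[i] = GetCubicScalar(table[i], table[i + 1], table[i + 2], table[i + 3], x);
}
```

The trade-off is the usual one: the hand-written kernel no longer depends on the auto-vectorizer's dependence analysis, but it has to be maintained separately for each instruction set.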

meldaproduction
Beginner

I don't really follow your answer. My point about the code above is that it doesn't make sense for the compiler to complain and refuse to vectorize it. There's no difference between the two versions; one is just "nicer". But ICC doesn't like the nicer one...
