Software Archive
Read-only legacy content

Vectorization of a function call

Grzegorz_K_
Beginner

Hello,

I have the following problem: there is a big loop inside my program that I want to parallelize and vectorize. Inside the loop I do a lot of math computations, and there are no dependencies between the iterations. However, inside the loop I call a simple function that returns the minimum of two values, or zero if that minimum is negative. Generally, it looks like this:

__attribute__((vector))
inline int abs_min_int(int arg1, int arg2) {
  int minval, retval;

  minval = (arg1 < arg2) ? arg1 : arg2;
  retval = (minval > 0) ? minval : 0;
  return retval;
}


int main()
{
...
#pragma omp parallel for
#pragma ivdep
	for (int i = 0; i < bufferSize; i++)
	{
	//some calculations
		int a = abs_min_int(b, c);
	}
...

 

The vectorization report says it cannot vectorize that function, and the whole loop falls back to scalar execution. However, I am able to vectorize that function manually using intrinsics:

#include <immintrin.h>

inline void store512intaligned (int *const address, const __m512i vect)
{
	_mm512_store_epi32((void*) address, vect);
}

inline __m512i loadint512aligned (int const * const address)
{
	return _mm512_load_epi32(address);
}

inline __m512i absmin512int(const __m512i v1, const __m512i v2)
{
	const __m512i Zero = _mm512_setzero_epi32();
	__m512i cmpRes = _mm512_min_epi32(v1, v2);
	/* mask of lanes where the minimum is negative */
	__mmask16 resultM = _mm512_cmplt_epi32_mask(cmpRes, Zero);
	/* for those lanes x - x == 0; the other lanes keep the minimum */
	return _mm512_mask_sub_epi32(cmpRes, resultM, cmpRes, cmpRes);
}

int main()
{
#pragma omp parallel for
#pragma ivdep
	for (int i = 0; i < bufferSize; i += 16)
	{
		store512intaligned(&(resultArray[i]),
		                   absmin512int(loadint512aligned(&(A[i])),
		                                loadint512aligned(&(B[i]))));
	}
}

 

In an isolated test like this, I observe a speedup of approximately 2x with the manually vectorized function.

My problem is the following:

Because the original loop is too complicated to rewrite entirely with intrinsics (and portability would suffer), I need the compiler to do the autovectorization, and I cannot mix scalar and vector code inside the same loop. Is it possible to somehow help the compiler vectorize my abs_min_int function, given that I know how to do it by hand? My example may also help the compiler team improve their autovectorization strategies.

Best,

Greg

 

7 Replies
TimP
Honored Contributor III

Intel compilers may have an easier time with

  minval = std::min(arg1 , arg2) ;

  retval = std::max(minval , 0) ;

If you mean

#pragma omp parallel for simd

you must write it as such and use a recent compiler; ivdep would then be unnecessary.
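For instance, a self-contained sketch of that combination (the data is made up so it compiles on its own, and the min/max are spelled as ternaries here, or std::min/std::max in C++):

#include <stdio.h>

#define N 4096

int main(void)
{
    int a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { b[i] = i - N/2; c[i] = 2*i - N; }

    /* combined construct: distribute across threads and vectorize each chunk;
       no separate ivdep is needed */
#pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        int minval = (b[i] < c[i]) ? b[i] : c[i];
        a[i] = (minval > 0) ? minval : 0;
    }

    printf("a[%d] = %d\n", N - 1, a[N - 1]);
    return 0;
}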

If you want this optimization to be portable to g++, you might file a bugzilla.  gfortran already has provisions for min/max optimization, so the methods are known; besides, gcc can optimize fmin/fmax under -ffinite-math-only.  I don't think g++ will do anything with the parallel for simd clause.
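For example, the float analogue (not your int case, but it shows the effect) would be something along these lines, a sketch with a made-up routine name; with gcc it takes -O3 -ffinite-math-only before fminf/fmaxf may collapse into plain min/max operations:

#include <math.h>

/* sketch: float analogue of abs_min_int; under -ffinite-math-only gcc may
   treat fminf/fmaxf as simple min/max and vectorize the loop */
void abs_min_float(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = fmaxf(fminf(a[i], b[i]), 0.0f);
}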

pbkenned1
Employee

As Tim noted, I'd suggest going with '#pragma omp parallel for simd', removing '#pragma ivdep', and declaring the function with 'omp declare simd'.

You also need to remove the inline keyword to vectorize the function:

#pragma omp declare simd
int abs_min_int(int arg1, int arg2) {
 

'declare simd' defaults to vector inputs/outputs, so I modified the main() program accordingly:

#include <iostream>
using namespace std;

int main()
{
   int bufferSize=512;
   int a[bufferSize], b[bufferSize], c[bufferSize];

#pragma omp parallel for simd
   for (int i = 0; i < bufferSize; i++)
   {
      b[i] = i*1;
      c[i] = i*2;
      a[i] = abs_min_int(b[i], c[i]);
   }
   cout << "\n a[" << bufferSize-1 << "] = " << a[bufferSize-1] << endl;
   return 0;
}

 

This now parallelizes and vectorizes the function:

$ icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

$ icc -qopenmp -mmic U558505.cpp -opt-report -opt-report-phase:openmp,vec -opt-report-file:stdout  -o U558505-mmic.exe

Begin optimization report for: main()

    Report from: OpenMP optimizations [openmp]

OpenMP Construct at U558505.cpp(19,1)
   remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Vector optimizations [vec]


LOOP BEGIN at U558505.cpp(21,2)
<Peeled>
   remark #15301: PEEL LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at U558505.cpp(21,2)
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at U558505.cpp(21,2)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at U558505.cpp(21,2)
<Remainder>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END
===========================================================================

Begin optimization report for: abs_min_int..zN16vv.U(int, int)

    Report from: Vector optimizations [vec]

remark #15301: FUNCTION WAS VECTORIZED   [ U558505.cpp(4,37) ]
===========================================================================

Begin optimization report for: abs_min_int..zM16vv.U(int, int)

    Report from: Vector optimizations [vec]

remark #15301: FUNCTION WAS VECTORIZED   [ U558505.cpp(4,37) ]
===========================================================================

 

$ /usr/bin/micnativeloadex ./U558505-mmic.exe

 a[511] = 511

$

 

Patrick

Grzegorz_K_
Beginner

Thank you for your suggestions. Actually, std::min won't do, because the code is in pure C. But with the proper pragmas, the compiler vectorized the isolated, simple code properly. However, in my real code, the compiler still refuses to vectorize the function:

optimization report:

remark #15382: vectorization support: call to function abs_min_int cannot be vectorized   [ push.c(371,18) ]

push.c:

line 371:   a  = abs_min_int(b-1, (int) c);

The abs_min_int function is defined in a different file. I commented out all of its content and left a trivial return:

functions.c:

#pragma omp declare simd
int abs_min_int(int arg1, int arg2) {
    return arg1;
}

So the vectorization problem is not in the function logic. Do you have any suggestions on how to find out what the problem is? As I've mentioned before, isolating the problem doesn't help, because with very simple code the compiler does vectorize it.

Greg

TimP
Honored Contributor III

Traditionally, this optimization depends on inlining, but an inline definition may not be enough to make it happen across source files without some brute-force #include tricks or IPO.  inline may also be subject to the compiler's limits on inlining.  I'm not entirely clear on how the declare simd approach Patrick quotes should deal with this, and I don't expect it to work the same with compilers other than icc (even among those that accept it).
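If I read the declare simd rules correctly, the directive would have to appear on the prototype the calling file sees as well as on the definition, so the caller knows a vector variant exists. Roughly (a sketch only, with a hypothetical header name):

/* abs_min.h (hypothetical name), included by the file containing the loop:
   declare simd on the prototype tells the caller a vector variant exists */
#pragma omp declare simd
int abs_min_int(int arg1, int arg2);

/* functions.c: the same directive on the definition makes the compiler
   emit the vector variant alongside the scalar one */
#pragma omp declare simd
int abs_min_int(int arg1, int arg2)
{
    int minval = (arg1 < arg2) ? arg1 : arg2;
    return (minval > 0) ? minval : 0;
}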

icc 16.0 beta introduced more auto-vectorization of the C analogues of std::min|max.  I agree that it's good to do this without depending on non-portable constructs.  To extend the discussion of portability, I suppose one reason gcc requires fmin|fmax et al. to optimize this (under risky compile flags) may be a desire to make C and C++ work alike.  There's little discussion in the docs about the various extensions used by various compilers (including icc/icpc) to paper over C/C++ differences.

In my examples (not under omp parallel), the icc beta compiler can auto-vectorize the expression (a[i__] > 0.f ? a[i__] : 0.f) without the directives that were needed in previous versions.  This seems a further acknowledgement that the slogan "directive-based vectorization" had been carried further than necessary.  Optimization without directives is helpful, since the OpenMP 4 directives, although understood by gcc, are counter-productive across compilers.  But I expect omp parallel to remove vectorization if the simd clause isn't set.  I wouldn't expect parallel simd to be effective for a loop count of less than several thousand.  You might have a case where #pragma loop count avg(5000) would encourage vectorization without the simd clause.
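In case it helps, the sort of plain loop I mean looks like this (a sketch, with a made-up routine name; the loop count pragma is only worth adding if the average trip count really is that large):

/* sketch: clamp negatives to zero; recent icc can auto-vectorize this
   without any vectorization directive */
void clamp_nonneg(float *a, int n)
{
#pragma loop count avg(5000)
    for (int i = 0; i < n; i++)
        a[i] = (a[i] > 0.f ? a[i] : 0.f);
}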

pbkenned1
Employee

The fundamental problem is that loops with function calls cannot be vectorized unless the function can be inlined, or a vector version of the function is available.  In my last post, I chose the latter implementation, simply because you were using OpenMP in the first place.

You will have to decide which method you want to implement; if you want to go with inlining, you will need to use -ipo to get that to happen between source files, and you may still need to apply a directive such as '#pragma omp simd' or '#pragma simd' at the call site to achieve vectorization (there is a rough sketch of that route after the list below).  Even though either directive is basically a command (not a hint) to the compiler to vectorize the loop, there are still certain criteria to satisfy before vectorization will succeed, for example:

-- the loop must be countable

-- some special non-mathematical operations are not supported

-- loops containing complex array subscripts or pointer arithmetic may not vectorize

-- loops with low trip counts will not vectorize

-- very large loop bodies may not vectorize due to vector register pressure or internal compiler limits
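Here's a rough sketch of the inlining route; the file, function, and variable names below just mirror what you've posted, so treat them as placeholders:

/* push.c (sketch): the directive goes on the loop at the call site, and
   compiling with multi-file IPO, e.g. icc -qopenmp -ipo push.c functions.c,
   allows the out-of-line definition in functions.c to be inlined here */
#include "functions.h"   /* hypothetical header declaring abs_min_int() */

void push_loop(int *a, const int *b, const int *c, int bufferSize)
{
#pragma omp simd
    for (int i = 0; i < bufferSize; i++)
        a[i] = abs_min_int(b[i] - 1, c[i]);
}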

>>>However, in my real code, the compiler still refuses to vectorize the function

Yes, that is common in the real world.  Needless to say, I have no idea exactly what is defeating vectorization in your 'real world' code.  I request that you file an Intel Premier ticket and attach a 'real world' test case that reproduces the issue.  Ask that Patrick investigate, and I'll be glad to do so.

Thank you,

Patrick

Grzegorz_K_
Beginner

I have found a workaround for my problem. Previously, I had the following layout:

file1.c

#include "file2.h"

....
for(int i = 0; i < n; i++)
{
     abs_min_int();
}
...

file2.h

int abs_min_int(int arg1, int arg2);

file3.c

inline int abs_min_int(int arg1, int arg2)
{
  int minval, retval;
  minval = (arg1 < arg2) ? arg1 : arg2;
  retval = (minval > 0) ? minval : 0;
  return retval;
}

The compiler cannot vectorize abs_min_int. But now, when I have:

file1.c

#include "file3.c"

....
for(int i = 0; i < n; i++)
{
     abs_min_int();
}
...

the function and the whole loop are vectorized. I can live with that, but it seems strange to me.
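I suppose the tidier equivalent of including the .c file would be to move the definition itself into the shared header as static inline, so every file that calls it sees the body; a sketch:

file2.h

#ifndef FILE2_H
#define FILE2_H

/* the definition lives in the header, so each .c file that includes it
   gets an inlinable (and therefore vectorizable) copy of the function */
static inline int abs_min_int(int arg1, int arg2)
{
    int minval = (arg1 < arg2) ? arg1 : arg2;
    return (minval > 0) ? minval : 0;
}

#endif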

jimdempseyatthecove
Honored Contributor III

Is multi-file IPO enabled .AND. file3.c available to the compiler and linker when you compile file1.c?

Jim Dempsey
