Richard_H_6
Beginner
446 Views

The real meaning of n in "#pragma loop_count(n)" and usage for PRAGMA


Hi guys,

I'm working on a compiler abstraction to provide more loop information to compilers. In order to get optimized code by using pragmas, we implemented something for TI as follows.
#define PRAGMA(x) _Pragma(#x)
#define LOOP_COUNT_INFO(_min_n, _multiple) \
   PRAGMA(MUST_ITERATE(_min_n, , _multiple))
As a result, we can add the following hint to the loop below if we know the minimum loop count is 32 and it is a multiple of 4.
void test_loop_count(float*__restrict a, float*__restrict b, float*__restrict c, int n)
{
  int i;
  LOOP_COUNT_INFO(32, 4)
  for (i = 0; i < n; i++)
  {
    c[i] = a[i] * b[i];
  }
}
Now we are trying to implement similar things in Intel icc/icl. In order to use PRAGMA, we only implemented as follows:
#define DLB_LOOP_COUNT_INFO(_min_n, _multiple) \
            PRAGMA(loop_count(_multiple))
However, there are quite a lot of discussions in the team about whether it should be PRAGMA(loop_count(_multiple)) or PRAGMA(loop_count(_min_n)). 
From http://d3f8ykwhia686p.cloudfront.net/1live/intel/CompilerAutovectorizationGuide.pdf
#pragma loop count (n) may be used to advise the compiler of the typical trip 
count of the loop. This may help the compiler to decide whether vectorization is 
worthwhile, or whether or not it should generate alternative code paths for the loop.
and:
https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-5C5112...
#pragma loop_count(n)
#pragma loop_count=n
(n) or =n
Non-negative integer value. The compiler will attempt to iterate the next loop the number of times specified in n; however, the number of iterations is not guaranteed.
Could you give us a definitive answer on whether it should be PRAGMA(loop_count(_multiple)) or PRAGMA(loop_count(_min_n))?

From the code:
#pragma loop_count min(n),max(n),avg(n)
#pragma loop_count min=n, max=n, avg=n
It seems there's no "multiple" in this pragma. If that's true, the compiler has to generate more code to cover the cases where the loop count is not a multiple of n. Is there a way to assert the multiple from a pragma?

Thanks,
Richard

1 Solution
Martyn_C_Intel
Employee
446 Views

In icc, #pragma loop count (n) is a hint to the compiler, not a guarantee. n is a "typical" trip count.

When n is small, it may invite the compiler to create special case code for exactly n iterations, as well as general code for any trip count. This might result in a fully unrolled loop version, or a vectorized loop kernel without the need for a remainder loop. Or it may result in an unvectorized, scalar loop.

When n is large, the compiler can assume that in its cost model, when it is trying to decide whether vectorization will be worthwhile. It will be unlikely to generate a separate loop version for the case when n is small.

If you choose the form with min and/or max and avg, then I don't think you can get special case code for particular values of the trip count.
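To make the two forms concrete, here is a minimal sketch. The function names and the trip counts are made up for illustration, and the `__INTEL_COMPILER` guard is only there so that other compilers never see a pragma they do not know. In both cases the values are hints for the cost model, not promises the compiler relies on.

```c
#include <stddef.h>

/* Single-value form: 64 is a "typical" trip count, nothing more. */
void scale_typical(float *dst, const float *src, size_t n)
{
    size_t i;
#if defined(__INTEL_COMPILER)
#pragma loop_count(64)
#endif
    for (i = 0; i < n; i++) {
        dst[i] = 2.0f * src[i];
    }
}

/* min/max/avg form: describes a range of expected trip counts. */
void scale_ranged(float *dst, const float *src, size_t n)
{
    size_t i;
#if defined(__INTEL_COMPILER)
#pragma loop_count min(32), max(4096), avg(256)
#endif
    for (i = 0; i < n; i++) {
        dst[i] = 2.0f * src[i];
    }
}
```

Comparing the vectorization reports of the two functions is probably the quickest way to see which loop versions each form produces with a given compiler version.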

The Intel implementation of the loop count pragma does not have a way to assert that the trip count is a multiple of a given integer. The Intel compiler does support the __assume() language feature, e.g. __assume(n%4==0); the compiler may or may not make use of this information. A typical usage would be in a doubly nested loop over a matrix, where n is the row length; for a matrix of doubles, n%4==0 implies that the row length is a multiple of 32 bytes, and so consecutive rows would have the same alignment for either Intel SSE instructions or Intel AVX instructions. Depending on the context, that might help the compiler to generate code for aligned loads and stores. There are several ways to assert absolute alignment, if it is not otherwise known to the compiler, e.g. __assume_aligned(), #pragma vector aligned, or #pragma omp simd aligned.
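A minimal sketch of the matrix case described above. The wrapper macros `DEMO_ASSUME` and `DEMO_ASSUME_ALIGNED` and the function `row_sums` are hypothetical names; on compilers other than Intel's the wrappers expand to nothing, so the function still builds and gives the same results. Whether the Intel compiler actually uses the information depends on context.

```c
#include <stddef.h>

#if defined(__INTEL_COMPILER)
#  define DEMO_ASSUME(cond)              __assume(cond)
#  define DEMO_ASSUME_ALIGNED(ptr, a)    __assume_aligned(ptr, a)
#else
   /* Assumption: no equivalent is wired up for other compilers here. */
#  define DEMO_ASSUME(cond)              ((void)0)
#  define DEMO_ASSUME_ALIGNED(ptr, a)    ((void)0)
#endif

/* Sums each row of a rows x n matrix of doubles stored row by row.
   The caller promises that n is a multiple of 4 (32 bytes per 4 doubles)
   and that m is 32-byte aligned, so every row starts equally aligned. */
void row_sums(const double *m, double *out, size_t rows, size_t n)
{
    size_t r, j;
    DEMO_ASSUME(n % 4 == 0);
    DEMO_ASSUME_ALIGNED(m, 32);
    for (r = 0; r < rows; r++) {
        double s = 0.0;
        for (j = 0; j < n; j++) {
            s += m[r * n + j];
        }
        out[r] = s;
    }
}
```

Note that the promises are the caller's responsibility: passing a misaligned pointer or a row length that is not a multiple of 4 to the Intel build would break the stated assumptions.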

Incidentally, the "aligned" or "unaligned" annotations in the optimization report may not map directly to aligned and unaligned memory instructions such as vmovaps and vmovups. On some microarchitectures, the unaligned instruction may perform as well as the aligned instruction when the data are aligned; the compiler then always generates the unaligned instruction (safer in case the data are unaligned), but may still generate different code sequences according to the expected alignment. These different sequences correspond to the "aligned" and "unaligned" messages you may see in the optimization report.

It's not clear to me how __assume, which refers to a specific variable, could be hooked up to your pragma, which takes a literal value and doesn't know about individual variable names.

In your sample code above, you have one set of parentheses too many. Try

#define DLB_LOOP_COUNT_INFO(_min_n, _avg) \
            __pragma(loop_count min(_min_n), avg(_avg))
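For the `_Pragma` spelling used with icc on Linux, a one-argument wrapper such as `PRAGMA(x)` cannot take `loop_count min(2), avg(100)` directly, because the comma outside the parentheses splits the text into two macro arguments. A minimal sketch of one workaround, assuming a C99 preprocessor, is a variadic wrapper that stringizes everything it receives. `DLB_PRAGMA` and `double_ints` are hypothetical names, and the branch selection assumes that icl defines both `__INTEL_COMPILER` and `_MSC_VER`; this has not been checked against a particular icc or icl release.

```c
#include <stddef.h>

#if defined(__INTEL_COMPILER) && defined(_MSC_VER)
   /* icl on Windows: __pragma is a keyword, so the comma is harmless. */
#  define DLB_LOOP_COUNT_INFO(_min_n, _avg) \
       __pragma(loop_count min(_min_n), avg(_avg))
#elif defined(__INTEL_COMPILER)
   /* icc: stringize all arguments, commas included, for _Pragma. */
#  define DLB_PRAGMA(...) _Pragma(#__VA_ARGS__)
#  define DLB_LOOP_COUNT_INFO(_min_n, _avg) \
       DLB_PRAGMA(loop_count min(_min_n), avg(_avg))
#else
   /* Other compilers: expand to nothing. */
#  define DLB_LOOP_COUNT_INFO(_min_n, _avg)
#endif

void double_ints(int *a, int *b, int n)
{
    int i;
    DLB_LOOP_COUNT_INFO(2, 100)
    for (i = 0; i < n; i++) {
        b[i] = 2 * a[i];
    }
}
```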

>icl /Qopt-report-file=stderr /Qopt-report-phase=vec /Qopt-report:3 test.cpp /c /Qstd=c++11 /Qalias-args-
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.1.148 Build 20141023
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

test.cpp

Begin optimization report for: test(int *, int *, int)

    Report from: Vector optimizations [vec]


LOOP BEGIN at ...\loop_count\test.cpp(11,5)
<Peeled>
LOOP END

LOOP BEGIN at ...\loop_count\test.cpp(11,5)
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 7
   remark #15477: vector loop cost: 1.250
   remark #15478: estimated potential speedup: 4.320
   remark #15479: lightweight vector operations: 5
   remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at ...\loop_count\test.cpp(11,5)
   remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at ...\loop_count\test.cpp(11,5)
<Remainder>
LOOP END
===========================================================================

In my tests, #pragma vector aligned made the peel loop and the alternate loop version for alignment go away.

However, __assume(n%4==0) did not make the remainder loop go away, which is presumably what you were hoping for.

 


18 Replies
Richard_H_6
Beginner
446 Views

Is there anyone who can help on this topic? Thanks!

jimdempseyatthecove
Black Belt
446 Views

If you want a specific number of iterations then use #pragma unroll(n). You can have your macro expand multi-line to incorporate multiple pragmas.

Note, your macros are not including hints for vectorization or simd. These may be more valuable (performance wise) than loop count/unroll directives.
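A minimal sketch of a macro that expands to more than one pragma, as suggested above. `DEMO_PRAGMA`, `DEMO_LOOP_HINTS` and `axpy` are hypothetical names, the values 4 and 1024 are arbitrary, and only the Intel branch emits anything; treat the combination of `unroll` and `loop_count` on one loop as something to verify in the optimization report rather than a recommendation.

```c
#include <stddef.h>

#define DEMO_PRAGMA(x) _Pragma(#x)

#if defined(__INTEL_COMPILER)
   /* Two pragmas from one macro: an unroll factor and a typical trip count. */
#  define DEMO_LOOP_HINTS(unroll_n, typical_n) \
       DEMO_PRAGMA(unroll(unroll_n)) \
       DEMO_PRAGMA(loop_count(typical_n))
#else
   /* Assumption: no hints are emitted for other compilers in this sketch. */
#  define DEMO_LOOP_HINTS(unroll_n, typical_n)
#endif

/* y[i] += alpha * x[i] for i in [0, n). */
void axpy(float *y, const float *x, float alpha, size_t n)
{
    size_t i;
    DEMO_LOOP_HINTS(4, 1024)
    for (i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}
```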

Jim Dempsey

Richard_H_6
Beginner
446 Views

Thanks Jim. The reasons I don't want to specify the number of iterations are: a) I need an abstraction for different kinds of processors. Different processors have different optimal unroll factors. We could certainly specify one for x86-specific optimization, but it would be safer to give loop count info for all targets. b) ICC always gives warnings if the number of iterations is given.

Can someone from the Intel compiler team give the answer?

TimP
Black Belt
446 Views

Having discussed this with the compiler team: loop count(44) requests code optimized for exactly 44 trips, likely reducing performance at other counts. With a variable loop count, the min and avg settings may be more useful.

With the vectorized remainder generation in current Intel compilers, it's no longer critical to adjust unrolling to fit the trip count. unroll(4) with Intel compilers or --param max-unroll-times=4 for gnu are frequently effective for i7 CPUs, if the trip count is large enough (proportional to register width).

In the past, vectorization of 32-bit data types strongly favored a trip count that is a multiple of 16.

Richard_H_6
Beginner
446 Views

Great! That's a very helpful suggestion. Is there a way to set the min and avg settings with a macro, as below, instead of writing "#pragma" directly? That would be easier to use, as we need to make it work across different compilers.

#define PRAGMA(x) _Pragma(#x)
#define LOOP_COUNT_INFO(_min_n, _multiple) \
    PRAGMA(loop_count(_multiple))
   
Thanks,
Richard

Marián__VooDooMan__M
New Contributor II
446 Views

Richard H. wrote:

Great! That's quite helpful suggestion. Is there a way to set the min and avg settings by using macro as below instead of using "#pragma"? That will be easy to use as we need to make it work across different compilers.

#define PRAGMA(x) _Pragma(#x)
#define LOOP_COUNT_INFO(_min_n, _multiple) \
    PRAGMA(loop_count(_multiple))
   
Thanks,
Richard

Try to define your own (let's call it) PRAGMA(x) macro for each compiler with #ifdef guards, testing predefined macros like __GNUC__ or _MSC_VER, etc...

btw. in MSVC (with or without ICC) it is "__pragma(your pragma argument goes here)". NB: the doubled underscore!

sadly, there is no standard for #pragma directives... so different vendors have their own pragma directives... But NB as usual, ICC under MSVC tries hard to be extremely compatible with MSVC, and on Linux/Mac OS X with GCC. Therefore e.g. ICC under MSVC defines "_MSC_VER" even if that sounds like nonsense... but it also defines an extra "__INTEL_COMPILER" (again, NB the doubled underscore) set to the compiler version number (the build date is in "__INTEL_COMPILER_BUILD_DATE").
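A minimal sketch of such guards, using the macro from the original question. `DEMO_PRAGMA` and `multiply_arrays` are hypothetical names. The sketch assumes that TI compilers predefine `__TI_COMPILER_VERSION__`, and it maps the minimum count to `loop_count min(...)` for Intel because the Intel pragma has no way to state a multiple; both choices are assumptions, not something confirmed in this thread.

```c
#include <stddef.h>

#if defined(_MSC_VER)
   /* MSVC, and icl which also defines _MSC_VER: keyword form. */
#  define DEMO_PRAGMA(x) __pragma(x)
#else
   /* C99 operator form: needs the pragma text as a string literal. */
#  define DEMO_PRAGMA(x) _Pragma(#x)
#endif

#if defined(__INTEL_COMPILER)
   /* Intel: only the minimum can be expressed; the multiple is dropped. */
#  define LOOP_COUNT_INFO(_min_n, _multiple) \
       DEMO_PRAGMA(loop_count min(_min_n))
#elif defined(__TI_COMPILER_VERSION__)
#  define LOOP_COUNT_INFO(_min_n, _multiple) \
       DEMO_PRAGMA(MUST_ITERATE(_min_n, , _multiple))
#else
   /* Unknown compiler: emit no hint at all. */
#  define LOOP_COUNT_INFO(_min_n, _multiple)
#endif

void multiply_arrays(const float *a, const float *b, float *c, int n)
{
    int i;
    LOOP_COUNT_INFO(32, 4)
    for (i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}
```

`__INTEL_COMPILER` is tested first because the Intel compiler also defines the macros of the host compiler it emulates.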

TimP
Black Belt
446 Views

Another apparent problem with loop_count (n) would be difficulty in taking peeling for alignment into account.

I don't know if the recent addition of pragma  vector unaligned could help in making loop count numbers apply more exactly. With Haswell CPU it may perform well.

Marián__VooDooMan__M
New Contributor II
446 Views

I'm on Haswell, and I am using

#pragma aligned...

when I am 100% sure the array is aligned (at 16 bytes) and the loop count, both as set with the pragma and as used in the for() loop, is divisible by 16 without remainder.

But the compiler still generates unaligned AVX2 instructions. I think performance would be better if it generated the aligned versions when the above pragma is used. But I am not able to measure the clock tick difference between the aligned and unaligned versions of the same AVX2 instruction.

Intel, a feature request: when "#pragma aligned" is used, do not generate unaligned AVX2 instructions; generate the aligned ones, so that a misaligned array raises a segmentation fault or some other fault, since this would help in debugging performance issues caused by processing non-aligned arrays...

TimP
Black Belt
446 Views
The compiler team did extensive testing to assure that the unaligned moves don't give up performance. Agreed it removes a means for checking alignment. This isn't as big an issue for avx2 CPU as it was before.
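Since the unaligned move instructions no longer fault on misaligned data, one way to get that check back in debug builds is an explicit assertion before the loop. A minimal sketch; `is_aligned` and `add_one` are hypothetical names, and the 32-byte requirement is just an example value for AVX2-sized vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is aligned to `alignment` bytes (must be a power of two). */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}

/* Adds 1 to every element; debug builds abort if data is misaligned. */
void add_one(int32_t *data, size_t n)
{
    size_t i;
    assert(is_aligned(data, 32));
    for (i = 0; i < n; i++) {
        data[i] += 1;
    }
}
```

Building with NDEBUG defined removes the check, so release builds pay nothing for it.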
Marián__VooDooMan__M
New Contributor II
446 Views

Even when I use MSVC's intrinsics like:

#   define MY_STORE_SI              _mm256_store_si256
#   define MY_LOAD_SI               _mm256_load_si256
#   define MY_ZERO_SI               _mm256_setzero_si256

instead of:

_mm256_loadu_si256()

and friends, compiler generates unaligned instructions instead of aligned ones (as originally stated in the source code).

My question is: is there some performance penalty when using unaligned instructions with aligned memory reference and trip count divisible by 16 without remainder (i.e. everything is prepared for aligned access)?

Richard_H_6
Beginner
446 Views

Marián "VooDooMan" Meravý wrote:

Quote:

Richard H. wrote:

 

Great! That's quite helpful suggestion. Is there a way to set the min and avg settings by using macro as below instead of using "#pragma"? That will be easy to use as we need to make it work across different compilers.

#define PRAGMA(x) _Pragma(#x)
#define LOOP_COUNT_INFO(_min_n, _multiple) \
    PRAGMA(loop_count(_multiple))
   
Thanks,
Richard

 

 

Try to define your own (let's call it) PRAGMA(x) macro for each compiler with #ifdef guards, testing predefined macros like __GNUC__ or _MSC_VER, etc...

btw. in MSVC (with or without ICC) it is "__pragma(your pragma argument goes here)". NB: the doubled underscore!

sadly, there is no standard for #pragma directives... so different vendors have their own pragma directives... But NB as usual, ICC under MSVC tries hard to be extremely compatible with MSVC, and on Linux/Mac OS X with GCC. Therefore e.g. ICC under MSVC defines "_MSC_VER" even if that sounds like nonsense... but it also defines an extra "__INTEL_COMPILER" (again, NB the doubled underscore) set to the compiler version number (the build date is in "__INTEL_COMPILER_BUILD_DATE").

Right. We'll use a macro for each compiler with #ifdef guards. The real difficulty for us is that we can't specify "min", "max", "avg" in __pragma(loop_count(**)), whether it is __pragma in icl or _Pragma() in icc. Refer to https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-5C5112FB-898C-41E4-86EA-0CFC27591F39.htm#GUID-5C5112FB-898C-41E4-86EA-0CFC27591F39.
Any idea?
Thanks,
Richard

TimP
Black Belt
446 Views

Could you clarify your further questions? Apparently, the latest versions of Intel C++ should work with _Pragma when std=c++11 or c11 is set.

Richard_H_6
Beginner
446 Views

Hi Tim,

I meant that I can't use min and avg with __pragma if icl is used. It works for MSVC cl. See the following code. Can you give any advice?

#define DLB_LOOP_COUNT_INFO(_min_n, _avg) \
            __pragma(loop_count(min(_min_n), avg(_avg)))
void test(int *a, int *b, int n)
{
    int i;
    
//    __pragma(loop_count(20))
//    __pragma(loop_count(min=2, max=100))
    //__pragma(loop_count(min(2), max(100)))
    DLB_LOOP_COUNT_INFO(2, 100)
    for (i=0; i<n; i++)
    {
        b[i] = 2 * a[i];
    }
}

Thanks,
Richard

Richard_H_6
Beginner
446 Views

Hi Tim,

Do you have any suggestions?

Thanks,

Richard

Kittur_G_Intel
Employee
446 Views

Thanks Martyn for the detailed elaboration to Richard's question.

Richard, let us know if Martyn's response answered your question? If you need any further clarifications, please let us know.

_Kittur 

Richard_H_6
Beginner
446 Views

Hi Martyn and Kittur,

Thanks a lot! That's quite clear. It's very helpful. Maybe we need to add "typical" in our macro abstraction.

Cheers,
Richard

Kittur_G_Intel
Employee
446 Views

Thanks Richard, good to know it's very clear now.

Kittur 
