Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

__builtin_expect() for switch() statement

Marián__VooDooMan__M
New Contributor II
2,418 Views

Greetings,

I am using hints for ICC regarding branch prediction, i.e.:

#ifndef my_likely
#   if defined(__GNUC__) || defined(__INTEL_COMPILER)
#       define my_likely(x)     __builtin_expect(!!(x),1)
#       define my_unlikely(x)   __builtin_expect(!!(x),0)
#   else
#       define my_likely(x)     (x)
#       define my_unlikely(x)   (x)
#   endif
#endif

and it is used as e.g.:

signed int x=...;
if(my_likely(x>0)) {
    ....
}

but what about "switch(...)" branching statement + "case:" "labels" ? is there any way to instruct compiler that "case:" label is "likely" more than others?

This is crucial for my application to enforce branch prediction at compile time.

If Intel have no support for this, then, please, take this as a "feature support request".

Best,

VooDooMan

0 Kudos
21 Replies
TimP
Honored Contributor III
2,236 Views

Conconceptually, icc profile guided optimization should often be more useful here. Without any explicit designation, the rule about priority of conditional jump not taken should apply.

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Tim Prince wrote:

Conconceptually, icc profile guided optimization should often be more useful here. Without any explicit designation, the rule about priority of conditional jump not taken should apply.

Yes, I do /Qpio on windows. But VTune reported branch mispredistion in

switch(...)

with 3 times inside switch(...) 's ->

case:

statements. Please, take this post as an improvement case. Thank you very much.

The user of my application is encouraged use 2nd case-switch on application start-up that will affect whole program to execute, so I would like Intel ICC get second "case XXX:" as a default prediction, and not iterating through all 3 "case:" switches (or 1, and 2nd successful "case XXX:" one.)

NB: if user select precision of FP floats 1 or 3 case from:

switch(...)

they will get cost of processing. and my application performance is "case XXX:" where "XXX" is 2.

So I'd like to ICC generate code with explicit branch prediction default to 2nd case out of 3.

I guess ICC is generating branching firstly for case 1, and SECONDLY for case 2, which my users are encouraged to select on application start-up.

example code (did not tried to build it):

typedef enum : unsigned  {
     FP_1 = 1,
     FP_2,
     FP_3,
} e_precision;

switch(m_precision) {
    case FP_1:
        do_processsing_fp_1();
        break;
    case FP_2:
        do_processsing_fp_2();
        break;
    case FP_3:
        do_processsing_fp_3();
        break;
}

 

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Marián "VooDooMan" Meravý wrote:

Quote:

Tim Prince wrote:

 

Conconceptually, icc profile guided optimization should often be more useful here. Without any explicit designation, the rule about priority of conditional jump not taken should apply.

"icc profile guided optimization" is not a good option for me, since user on application start-up could chose precision "float", "double" or "long double", (see above) i.e. "FP_1", "FP_2", or "FP_3".

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

I just want to know how to use

__builtin_expect(...)

in switch-case block in C++. i.e. see example above in my post.

If this is not implemented in ICC, please (Intel staff) issue feature request.

TIA!

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Greetings,

My application is x64 MSVC 2013, ICC 15.0 Update 1.

Of course user at application start-up setup dialog will choose "FP_2" that is "double".

FP_1 is float

FP_2 is double

FP_3 is "long double"

regarding "FP_3" i.e. "long double" there is post from @ilyapolak https://software.intel.com/en-us/forums/topic/536591#comment-1806554

so the user will chose "FP_2" i.e. double .

The problem is I have global variable (from FP_1 to FP_3) and VTune reported branch misprediction about

switch(m_preecision)

since I have

switch {
    case FP_1:
    ....; break;
    case FP_2:
    ....; break;
    case FP_3:
    ....; break;
}

And I like to prediction to FP_2(C++ "double"). If user chooses FP_1 or FP_3, it is their performance fault. I do not matter for this case, since I recommend FP_2 i.e. c++ "double" type.

The problem is VTune reported branch misprediction, and I WANT TO AS A DEFAULT BRANCH FOR switch(...) TO JUMP TO "FP_2"

If above isn't possible, take this as a support request, since VTune reported slow-down...

If you file support request, please respond by tracing number.

I know fix might be incompatible with _MSC_VER / MSVC, but I really would appreciate ICC extension ((though not compatible with _MSC_VER).

PS: I use MSVC 2013 Ultimate, ICC x64 15.0 Update 1

TIA!

best,

vdm

.

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

I am concerned about branch mis-prediction, since most of users will choose "FP_2" (==double) and not FP_1  (==float)  nor FP_2 (== long double).

I'd like to modify the code that branch prediction will not flush cache and jump to "case FP_2: ",which is "double" type, and jump x64 instruction I guess flushes Instruction cache, and maybe, but not limited to IMHO Data cache too...

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,236 Views

At some point early in your program you can validate m_precision. IOW it is 1, 2, or 3.

Then wherever you have a switch(m_precision), replace it with a dispatcher?

#include "stdafx.h"

typedef enum : unsigned  {
     FP_1 = 1,
     FP_2,
     FP_3,
} e_precision;

e_precision m_precision;

void do_processing_fp_1()
{
 printf("do_processing_fp_1()\n");
}
void do_processing_fp_2()
{
 printf("do_processing_fp_2()\n");
}
void do_processing_fp_3()
{
 printf("do_processing_fp_3()\n");
}

int _tmain(int argc, _TCHAR* argv[])
{
 static void(*dispatch[])(void) = { NULL, do_processing_fp_1, do_processing_fp_2, do_processing_fp_3};

 m_precision = FP_1;
 dispatch[m_precision]();

 m_precision = FP_2;
 dispatch[m_precision]();

 m_precision = FP_3;
 dispatch[m_precision]();

 return 0;
}
do_processing_fp_1()
do_processing_fp_2()
do_processing_fp_3()

There may be more elegant ways of doing this. This should give you an idea.

Jim Dempsey

0 Kudos
Feilong_H_Intel
Employee
2,236 Views

Hi VooDooMan,

Re-orgnizing switch-case would help only when a case block is hotest case and is executed many times.  If the case in your application was not executed quite frequently, you may not see performance gain by moving it  to the 1st place.  You could verify that by manually moving the case and recompiling your application.  And see if there is any perf gain.

AFAIK, Neither MSVC or gcc has such a feature (likely/unlikely hints for switch-case).  It would be better that we could come up with a proposal on what the hints for switch-case look like.  And then I'll submit it to icc engineering team as a feature request.

Thanks.

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Feilong H (Intel) wrote:

Re-orgnizing switch-case would help only when a case block is hotest case and is executed many times.  If the case in your application was not executed quite frequently

Yes, my SW is real-time generating sound from visual programming interface, so switch() is executed few thousand times a second in every effect or mathematical function the user is using.

Feilong H (Intel) wrote:

, you may not see performance gain by moving it  to the 1st place. You could verify that by manually moving the case and recompiling your application.  And see if there is any perf gain.

Thank you for the suggestion, I will give it a try and update this topic with results.

Feilong H (Intel) wrote:

AFAIK, Neither MSVC or gcc has such a feature (likely/unlikely hints for switch-case).  It would be better that we could come up with a proposal on what the hints for switch-case look like.  And then I'll submit it to icc engineering team as a feature request.

AFAIK on my side, agreed.

IMO the syntax could be:

#if defined(__INTEL_COMPILER)
#    define LIKELY(x) __builtin_expect(x)
#else
#    define LIKELY(x) x /* nevermind */
#endif

switch(y) {
    case 1:
        ...;
        break;
    case LIKELY(2):
        ...;
        break;
    case 3:
    default:
        ...;
        break;
}

Please could you file a feature request on my behalf?

Thanks to all,

best,

vdm

.

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Greetings,

I can confirm that reordering "case XXX:" labels improved performance. Before, my unit "stress" test was eating circa 12.30% of CPU, now it is circa 11.60%.

So I guess __builtin_expect()-like feature for explicit branch prediction would be welcome and efficient for final binary.

best,

vdm

.

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Marián "VooDooMan" Meravý wrote:

IMO the syntax could be:

#if defined(__INTEL_COMPILER)
#    define LIKELY(x) __builtin_expect(x)
#else
#    define LIKELY(x) x /* nevermind */
#endif

switch(y) {
    case 1:
        ...;
        break;
    case LIKELY(2):
        ...;
        break;
    case 3:
    default:
        ...;
        break;
}

Thinking about this proposed solution more throughly, "__builtin_expect(x)" accepts boolean-like (or to-bool-convertible) expression for "x", so passing a "2" into it is a bit confusing and inconsistent with unofficial "standard". I would recommend to introduce new built-in keyword.

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

jimdempseyatthecove wrote:

At some point early in your program you can validate m_precision. IOW it is 1, 2, or 3.

Then wherever you have a switch(m_precision), replace it with a dispatcher?

#include "stdafx.h"

typedef enum : unsigned  {
     FP_1 = 1,
     FP_2,
     FP_3,
} e_precision;

e_precision m_precision;

void do_processing_fp_1()
{
 printf("do_processing_fp_1()\n");
}
void do_processing_fp_2()
{
 printf("do_processing_fp_2()\n");
}
void do_processing_fp_3()
{
 printf("do_processing_fp_3()\n");
}

int _tmain(int argc, _TCHAR* argv[])
{
 static void(*dispatch[])(void) = { NULL, do_processing_fp_1, do_processing_fp_2, do_processing_fp_3};

 m_precision = FP_1;
 dispatch[m_precision]();

 m_precision = FP_2;
 dispatch[m_precision]();

 m_precision = FP_3;
 dispatch[m_precision]();

 return 0;
}
do_processing_fp_1()
do_processing_fp_2()
do_processing_fp_3()

There may be more elegant ways of doing this. This should give you an idea.

Jim Dempsey

Hi Jim,

I really like your idea, I have partially implemented such dispatcher, too bad, without major refactoring I need to use C++11 lambdas in "dispatch[...]();" static array. I guess lambdas will prevent Intel compiler from inlining them, since I guess they are implemented as global static functions, and it will pose additional overhead.

I will report my findings back.

TIA!

vdm

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,236 Views

In the old days (late 80's early 90's) Borland C and C++ compilers would examine the switch statement to see if it contained case statements where the values were sequential. If so, there was a quick range check, and if the numbers were in range, it would perform a computed goto. This resulted in all cases having a uniform and very low overhead switch time. Check the disassembly code of the optimized output to see if your "modern" compiler performs this optimization. From your reported time differences, it appears it does not.

The (your) switch statement "should" optimize into a computed goto statement.

Jim Dempsey

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

jimdempseyatthecove wrote:

The (your) switch statement "should" optimize into a computed goto statement.
 

Do you mean use "goto label_1;" and "label_1:" syntax?

Because I think "switch(...)" and subsequent "case label_1:" is EXACTLY the same as "goto label_1;", though it uses other syntax, but at the assembly level they are exactly the same IMO, except "switch(...)" executes conditional "jmp" instruction/branching, same as "if(XXX) goto label_1;".

Or am I missing something?

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Feilong H (Intel) wrote:

AFAIK, Neither MSVC or gcc has such a feature (likely/unlikely hints for switch-case).  It would be better that we could come up with a proposal on what the hints for switch-case look like.  And then I'll submit it to icc engineering team as a feature request.

Yes, AFAIK for me too as well. Please submit feature request to engineering team.

Thanks again,

best,

vdm

.

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

Greetings,

maybe:

#if defined(__INTEL_COMPILER)
#    define LIKELY(x,y) __builtin_expect(x,y) /* with 2 arguments only for
                                                 "case x"/"default" labels,
                                                 where 'x' is case "label"'s
                                                 value prefixed by "case" keyword,
                                                 and 'y' is a priority as 'signed int':
                                                 the bigger number, the higher
                                                 priority of ordering 'case' labels:
                                                 i.e. explicit and automatic reordering
                                                 of 'case' labels by proposed weights */
#else
#    define LIKELY(x,y) x /* nevermind */
#endif
 
switch(y) {
    case 1: /* "do not care", i.e. if *NO* "LIKELY(...):" is present in switch block,
               treat as as a priority of '0', i.e. as thought it were: "LIKELY(case 1,0):" */
        ...;
        break;
    LIKELY(case 2,30): // much more lesser priority (priority weight 30)
        ...;
        break;
    LIKELY(default,-10): // the lowest priority (priority weight -10)
        ...;
        break;
    LIKELY(case 3,100): // the highest priority (priority weight 100)
        break;
}

so there will be re-ordering the case/default syntax keywords, which allows ICC to predict more, and construct a much better way of branching and other optimizations including auto-inlining, etc... Of course, ICC should take into account of internal logic, when fall-thru (i.e. missing "break;") is present between case/default labels.

What do you think? If you agree, and there are no other objections, nor better solutions, please take this as a feature request.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,236 Views

Marián "VooDooMan" Meravý wrote:

Quote:

jimdempseyatthecove wrote:

 

The (your) switch statement "should" optimize into a computed goto statement.
 

 

 

Do you mean use "goto label_1;" and "label_1:" syntax?

Because I think "switch(...)" and subsequent "case label_1:" is EXACTLY the same as "goto label_1;", though it uses other syntax, but at the assembly level they are exactly the same IMO, except "switch(...)" executes conditional "jmp" instruction/branching, same as "if(XXX) goto label_1;".

Or am I missing something?

Yes you are missing something. pseudo code

index = switchVar - lowRange
if(index < 0) jmp overSwitch
if(index > lastIndex) jmp overSwitch
jmp @table[index]

When the switchVar is in range (shouldn't it be preponderantly in range), then you have two branch not taken's followed by an indirect jump. Each entry in your dispatch table takes the same amount (and very low) time.

If the compiler optimizations doesn't do this for you, then you can roll your own dispatch table. In C or C++ you can use a table of function pointers and dispatch on that.

Jim Dempsey

0 Kudos
Marián__VooDooMan__M
New Contributor II
2,236 Views

jimdempseyatthecove wrote:

Quote:

Marián "VooDooMan" Meravý wrote:

 

Quote:

jimdempseyatthecove wrote:

 

The (your) switch statement "should" optimize into a computed goto statement.
 

 

 

Do you mean use "goto label_1;" and "label_1:" syntax?

Because I think "switch(...)" and subsequent "case label_1:" is EXACTLY the same as "goto label_1;", though it uses other syntax, but at the assembly level they are exactly the same IMO, except "switch(...)" executes conditional "jmp" instruction/branching, same as "if(XXX) goto label_1;".

Or am I missing something?

 

 

Yes you are missing something. pseudo code

index = switchVar - lowRange
if(index < 0) jmp overSwitch
if(index > lastIndex) jmp overSwitch
jmp @table[index]

When the switchVar is in range (shouldn't it be preponderantly in range), then you have two branch not taken's followed by an indirect jump. Each entry in your dispatch table takes the same amount (and very low) time.

If the compiler optimizations doesn't do this for you, then you can roll your own dispatch table. In C or C++ you can use a table of function pointers and dispatch on that.

Jim Dempsey

yes, of course it would be best to use dispatcher, which I have implemented for switch() conditions. It's performance gain was *exceptional*! But I had bad results, when the code after "case X:" was too complex (not even telling you about fall-trhu to next "case X:"/"default:" label) so I needed to use C++11 lamdba-embedded functions to encapsulate it, where there was *GREAT* performance *impact*, I guess too low L1I cache size; and maybe L1D too...

My propose is to give a hints to ICC how to reorder "case X:"/"default:" while still accepting fall-thru to next label w/o excessive branching.

If it were implemented as ordinary "goto lbl"/"lbl:", there is no conditional branching and everything is dependent on "if(...) goto lbl;" branching, which is conditional, but except one thing, it will not allow duplication/inlining the code for 2 different "case X:"/"default:" branches, and imagine fall-thru case statement.

My proposion is about to get more comfort of the code, which will be better readable for human, and better writtable for human without need of thinking too much and profiling, where L1 caches flushes/missed event are.

Let the compiler to its better logic, when we need not to write crazy complicated and hard to read code, let the compiler write hard to read code into the assembler for us.

0 Kudos
Feilong_H_Intel
Employee
2,236 Views

jimdempseyatthecove wrote:

In the old days (late 80's early 90's) Borland C and C++ compilers would examine the switch statement to see if it contained case statements where the values were sequential. If so, there was a quick range check, and if the numbers were in range, it would perform a computed goto. This resulted in all cases having a uniform and very low overhead switch time. Check the disassembly code of the optimized output to see if your "modern" compiler performs this optimization. From your reported time differences, it appears it does not.

The (your) switch statement "should" optimize into a computed goto statement.

Jim Dempsey

Jim,

Thank you for sharing this info.  It reminds me that icc does such an optimization (converting swith-case to indirect jump) for switch-case statements that meet some criterias (similar to your description).

0 Kudos
Feilong_H_Intel
Employee
1,888 Views

Marián "VooDooMan" Meravý wrote:

Yes, AFAIK for me too as well. Please submit feature request to engineering team.

Thanks again,

best,

vdm

I've entered this feature request to our problem-tracking database.  Engineering team will look into it.

Thanks.

0 Kudos
Reply