Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Source code conventions for optimizing branch prediction?

McCalpinJohn
Honored Contributor III
800 Views

I have a code for which I know the statistics of a random branch, but I am struggling to understand how to convince the compiler to generate conditional branches with the correct parity so that the static prediction is correct as often as possible.

On the web page https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts the advice says:

In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common.

I wrote the code this way, but the compiler does not generate branches in the correct direction to optimize the accuracy of the static branch prediction.

The code I am trying to optimize involves saving the indices of all of the elements in a 256-bit register for which a compare operation returns "true".   The results of the compare are random and uncorrelated, with a probability of ~1/16 per element.  Elements are 32 bits, so there are 8 elements in a 256-bit register.   If I did my arithmetic correctly, this means that ~60% of the time none of the compares return true, ~32% of the time there is one compare that returns true, ~7% of the time there are two compares that return true, and ~1% of the time there are more than two compares that return true.

The implementation uses lots of sneaky bit manipulation tricks to optimize for the common case.  The compiler does not vectorize the original code, so I was forced to use intrinsics.    The compiler does mostly what I want with the intrinsics, but some of the branches are set up the wrong way.

In pseudo-code, what I am doing looks like:

for (i=0; i<IMAX; i+=8) {                   // process 8 elements at a time
   // load some data into a ymm register
   // load some values for comparison into another ymm register
   vmask = _mm256_cmp_ps(...);              // do 8 compares in a 256-bit register (operands elided)
   mask = _mm256_movemask_ps(vmask);        // move the 8 sign bits of the compare result to a GPR
   num_flagged = __builtin_popcount(mask);  // count the 1 bits to see how many compares returned true

   if (num_flagged == 0) {            // 60% probability
          continue;                   // no compares are true -- move to next "i" loop iteration
   } else if (num_flagged == 1) {     // 32% probability
          flag_indices[index] = i + __builtin_ctz(mask);      // only 1 bit set, so trailing zero count gives the index
          index +=1;
   } else if (num_flagged == 2) {     // 7% probability
          flag_indices[index] = i + __builtin_ctz(mask);      // trailing zero count gives the smaller of the two indices
          index +=1;
          flag_indices[index] = i + (31-__builtin_clz(mask)); // leading zero count gives the larger of the two indices
          index +=1;
    } else {                          // less than 1% probability
          for (j=0; j<8; j++) {       // check bits one at a time
               if ((mask & (1<<j)) != 0) {
                     flag_indices[index] = i + j;
                     index +=1;
                }
            }
      }
}

The compiler generates a forward conditional branch if zero ("JE") immediately after the POPCOUNT instruction.  This will be statically predicted to be "not taken", even though it is the most common case (and I put it first so the compiler would know this).  Should I have written the code some other way so that the compiler would generate code that is consistent with the static branch prediction mechanism for this most common case?

The assembly code looks like:

        popcnt    %ebx, %edx                                    #3015.19
        je        ..END_I_LOOP     # Prob 25%                   #3024.9    BAD STATIC PREDICTION
..CODE_FOR_NONZERO_VALUES:
     // do each of the other if tests
..END_OF_I_LOOP 
       [end of "i" loop increment, compare, branch]

The first jump ("je") will be predicted not taken, which is the wrong choice for the most common case.    This could be fixed by changing the parity of the conditional branch and adding an unconditional branch to the end-of-loop processing.  E.g.,

        popcnt    %ebx, %edx
        jne        ..CODE_FOR_NONZERO_VALUES   // if not zero, look at next if test, but predict that this will not happen
        jmp        ..END_OF_I_LOOP      // unconditional branch to processing at end of "i" loop -- predicted taken
..CODE_FOR_NONZERO_VALUES:                      # Preds ..B17.162
       // do the other IF tests
..END_OF_I_LOOP 
       [end of "i" loop increment, compare, branch]

Sometimes the compiler will generate code like this (inverted comparison jumping over an unconditional branch), but I can't find a way to tell the compiler that I want it to do this.  (It does it correctly for the rest of the cases, as I describe below -- just not for this first case.)

For this first conditional branch it may not make much difference -- after some iterations the dynamic branch predictor should notice that the branch is usually taken (~60%) and change the prediction.   I don't know how many iterations it takes before the predictor changes states with a conditional branch that has a 60% random probability, but I do know that this loop only executes for a few hundred cycles, so a couple of 20-cycle branch mispredictions during warm-up will still be noticeable.   There are lots of short routines crammed together in a pipeline, and I don't know whether the branch predictor will "remember" this loop when I return to it, or whether it will have to do the "warm-up" over again.

For the next IF test ("num_flagged == 1"), the compiler generates code that works properly with the static predictor.  It compares "num_flagged" to 1 (which will be true about 80% of the time once you get to this line of code), then does a forward conditional jump if *not equal* to the next IF test.   That branch will be predicted *not taken*, and execution will fall through the branch and begin processing the code for "num_flagged == 1".  This code ends in an unconditional branch to the END_OF_I_LOOP processing, so this will also be predicted correctly.  (All these double-negatives make my brain hurt!)

The compiler also does the right thing in exactly the same way for "num_flagged == 2".

For the final case, the compiler fully unrolls the loop and tests the 8 bits sequentially.   The compiler can't know this, but when "num_flagged" is greater than 2, the most common value is 3.  This means that each of these 8 tests should be assumed to be false -- then in the most common case you would predict 5 correctly and 3 incorrectly.   It would be very nice to have a way to tell the compiler that the tests are less likely to be true than to be false, so that it would know how to generate the best possible code.

As it happens, the compiler generates conditional branches in such a way that the static branch prediction is correct if the comparison (mask & (1<<j)) is true, so the compiler-generated code gets 3 correctly-predicted branches and 5 incorrectly-predicted branches (using static branch prediction) -- adding about 40 cycles to the loop execution for the ~1% of the time this code is executed.  This is not catastrophic, but it is actually a significant contributor to the overall execution time.

Eventually the dynamic branch predictor will change these static predictions, but I am not sure that it will respond quickly enough to be useful.  For a vector length of 1000, the outer loop only executes 125 times, and with the final IF test only occurring about 1% of the time, it will often only be executed once per call to this routine.   Lots of other code will be executed before coming back here, and there is no guarantee that the branch predictor will remember this loop, or that the weak 3/8 vs 5/8 ratio will be strong enough to train the branch predictor in only a few calls.

Back in the olden days, I seem to recall that compilers supported pragmas that would allow me to give hints about branch probabilities.   I know that the branch predictors are much better now, but there is still code that has random, data-dependent branches and knowing something about the probabilities could be helpful. 

Any ideas?

0 Kudos
4 Replies
SergeyKostrov
Valued Contributor II
>>...Should I have written the code some other way so that the compiler would generate code that is consistent with the static branch prediction mechanism for this most common case?

Here is my more-than-15-year-old solution for elimination of very complex if-else-if-else constructions:

   _inline void ( *g_pProcessing[4] )( void );

   _inline void Processing0( void ) { /* ... */ }
   _inline void Processing1( void ) { /* ... */ }
   _inline void Processing2( void ) { /* ... */ }
   _inline void Processing3( void ) { /* ... */ }

   void main( void )
   {
      g_pProcessing[0] = Processing0;
      g_pProcessing[1] = Processing1;
      g_pProcessing[2] = Processing2;
      g_pProcessing[3] = Processing3;

      for ( int i = 0; i < IMAX; i += 8 )
      {
         // ...
         g_pProcessing[ __builtin_popcount( mask ) ]();
         // ...
      }
   }

Add a verification that there are no out-of-bound values (that is, values exceeding 3) returned from the __builtin_popcount( mask ) function. Of course, it does not get rid of jmp instructions when the C/C++ compiler generates binary code...
TimP
Honored Contributor III

Not having time to study the situation fully today: one possibility is to employ profile-guided optimization with sufficient training cases to replicate the expected branch probabilities, then generate the asm code and see whether that exposes the internal directives.
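A sketch of that workflow with the Intel compiler's documented -prof-gen/-prof-use options (the source and data file names here are hypothetical):

```shell
# 1. Build with profiling instrumentation
icc -O2 -prof-gen filter.c -o filter_train

# 2. Run on inputs whose branch statistics match production (~1/16 hit rate)
./filter_train training_data.bin

# 3. Rebuild using the collected .dyn profile data and inspect the asm
icc -O2 -prof-use -S filter.c
```

(gcc's equivalents are -fprofile-generate and -fprofile-use.)  With a profile available, the "Prob nn%" annotations in the generated .s file should reflect the measured branch probabilities rather than static guesses.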

SergeyKostrov
Valued Contributor II
>>...one possibility is to employ profile guided optimization with sufficient training cases...

Plus investigation with VTune...
McCalpinJohn
Honored Contributor III

I have done some modeling of simple branch predictors, and they are probably good enough to override the (statistically) incorrect static branch predictions pretty quickly, so that the overall overhead is small.   I would still like to be able to annotate the code with probabilities, even if it is just to remind me of the analysis that I did, but I don't think that I would get a significant improvement in performance on current hardware.

After doing the more detailed analysis, I was surprised to see how much of the execution time of this code is likely attributable to unavoidably mispredicted branches.   For highly vectorized code that can sustain 2 8-wide vector instructions per cycle, a single 20 cycle branch mispredict penalty corresponds to ~40 vector instructions or 320 element operations.   This completely overwhelms the small number of instructions required to implement the "num_flagged ==1" processing (about 8 instructions) and the "num_flagged==2" processing (about 12 instructions).

I see that AVX-512F includes a VCOMPRESSPS instruction that can be used to compress the flagged elements of a vector register into a set of contiguous register locations or a set of contiguous memory locations.  I will probably revisit this once our KNL systems start to show up....
