branch prediction pattern

may_ka · ‎06-01-2014

Hi all,

Say I have the code below

Program Main
use ModSubroutines
Implicit None
Integer :: A, Stat
call SubA(A,Stat)
if(Stat==0) call SubB(A,Stat)
if(Stat==0) call SubC(A,Stat)
if(Stat==0) Then
write(*,*) "Success"
Else
write(*,*) "Error"
End If
End Program

In order to avoid uncontrolled termination of the program, a status variable "Stat" is carried all along the way. If its value is different from zero, the program will give an informative error message and close down gently. Since "Stat" changes from zero to one only if an error has occurred that leads to program termination, it will always be zero in a normal program run. Thus "Stat" has a very predictable pattern, and a severe slowdown when this pattern breaks is not an issue because the program will then terminate. Thus, if the compiler predicts the pattern of "Stat" from "Stat" only, all these if statements, and there can be thousands, will not slow down the run time of the program because the pattern will always be "FFFFFF....". However, if the compiler mixes the pattern of "Stat" with those of other branching variables such as input parameters, the "Stat" if statements will slow down the program. I read somewhere on the web that the compiler will not store a pattern for every single branch, thus mixing branches and variables. If this is correct for Ifort too, the question is whether there is any possibility of telling the compiler to create a predictive pattern for "Stat" branches only.

Thanks

Karl
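
For anyone who wants to compile the snippet above: the module ModSubroutines is not shown in the thread, so the following is only a minimal sketch of what it might look like. The routine bodies and the failure condition (a negative A) are assumptions made up for illustration; only the names SubA, SubB, SubC and the (A,Stat) argument list come from the original post.

```fortran
! Hypothetical stand-in for the ModSubroutines module (not shown in the thread).
! Convention assumed here: Stat = 0 means success, Stat = 1 means failure.
Module ModSubroutines
  Implicit None
Contains

  Subroutine SubA(A, Stat)
    Integer, Intent(Out) :: A, Stat
    A = 1          ! arbitrary starting value
    Stat = 0
  End Subroutine SubA

  Subroutine SubB(A, Stat)
    Integer, Intent(InOut) :: A
    Integer, Intent(Out)   :: Stat
    Stat = 0
    If (A < 0) Then      ! assumed failure condition, for illustration only
      Stat = 1
      Return
    End If
    A = A + 1
  End Subroutine SubB

  Subroutine SubC(A, Stat)
    Integer, Intent(InOut) :: A
    Integer, Intent(Out)   :: Stat
    Stat = 0
    If (A < 0) Then      ! same assumed failure condition
      Stat = 1
      Return
    End If
    A = 2 * A
  End Subroutine SubC

End Module ModSubroutines
```

With these stand-ins every call succeeds, so every Stat test takes the same direction on each run, which is the "always zero" pattern described in the question.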

jimdempseyatthecove · ‎06-02-2014

Program Main
  use ModSubroutines
  Implicit None
  Integer :: A, Stat
  call SubA(A,Stat)
  if(Stat/=0) goto 999
  call SubB(A,Stat)
  if(Stat/=0) goto 999
  call SubC(A,Stat)
999 continue
  if(Stat==0) Then
    write(*,*) "Success"
  Else
    write(*,*) "Error"
  End If
End Program

You will have to check the generated code to see whether, in a release build, it produces a single conditional branch or a branch over a jump. If the branches are too far (on IA32), try

Program Main
  use ModSubroutines
  Implicit None
  Integer :: A, Stat
  goto 111
666 continue
    write(*,*) "Error"
  goto 999
111 continue
  call SubA(A,Stat)
  if(Stat/=0) goto 666
  call SubB(A,Stat)
  if(Stat/=0) goto 666
  call SubC(A,Stat)
  if(Stat/=0) goto 666
  write(*,*) "Success"
999 continue
End Program

Fortran (IVF) does not have a !DEC$ to aid in branch prediction (at least not that I am aware of).

Note, if both cases generate a branch over a jump, then you will be unable to attain what you seek.

Jim Dempsey

Steven_L_Intel1 · ‎06-02-2014

Leave branch prediction to the processor - it's very good at it. I suggest not spending any time worrying about nano-optimizations of this nature. Write the program in a clear manner and let the compiler and processor do the rest. Run the program through Intel VTune Amplifier XE and look for hotspots if you're dissatisfied with performance. You can ask VTune to show you branch mispredicts, but I predict (!) you won't find much of interest here.

may_ka · ‎06-02-2014

Thanks Jim and Lionel.

Will have a look at the VTune output.

FortranFan · ‎06-04-2014

Steve Lionel (Intel) wrote:

Leave branch prediction to the processor - it's very good at it. I suggest not spending any time worrying about nano-optimizations of this nature. Write the program in a clear manner and let the compiler and processor do the rest. Run the program through Intel VTune Amplifier XE and look for hotspots if you're dissatisfied with performance. You can ask VTune to show you branch mispredicts, but I predict (!) you won't find much of interest here.

Steve,

Can you comment on the Fortran 2008 BLOCK construct that has been introduced with the 2015 Beta compiler version and whether you'd recommend it in situations like these? That is, do you think performance will be just as good, better, or worse compared to the traditional styles of nested IFs, GOTOs, etc.? I personally prefer the clarity offered by the BLOCK construct, see code snippet below, and have started using it in new code with your maxim in mind i.e., let the compiler do the rest. It'll be great to know if your compiler team has looked at any efficiency aspects when it comes to BLOCK constructs and if there are any caveats you can share here.

Thanks,

   PROGRAM p
   
      REAL :: A
      REAL :: B
      REAL :: C
      INTEGER :: Istat
      
      A = 1.0
      B = -1.0
      C = 0.0
      CalcBlock: BLOCK
      
         !.. Call with A
         CALL Sub(A, Istat)
         WRITE(6,*) " After call with A: Istat = ", Istat
         IF (Istat /= 0) EXIT CalcBlock
         
         !.. Call with B
         CALL Sub(B, Istat)
         WRITE(6,*) " After call with B: Istat = ", Istat
         IF (Istat /= 0) EXIT CalcBlock
         
         !.. Call with C
         CALL Sub(C, Istat)
         WRITE(6,*) " After call with C: Istat = ", Istat
         IF (Istat /= 0) EXIT CalcBlock
   
      END BLOCK CalcBlock
      
      STOP
      
   CONTAINS
   
      SUBROUTINE Sub(X, ErrorCode)
      
         !.. Argument list
         REAL, INTENT(INOUT)    :: X
         INTEGER, INTENT(INOUT) :: ErrorCode
         
         !..
         ErrorCode = 0
         IF (X < 0.0) THEN
            ErrorCode = 1
         END IF
         
         !..
         RETURN
         
      END SUBROUTINE Sub
   
   END PROGRAM p

Steven_L_Intel1 · ‎06-04-2014

I don't see the connection between BLOCK and efficiency. BLOCK is a scoping construct, it has no relationship I can divine to IF-THEN, etc. In your example, you simply used BLOCK as a way to label a group of code so that you could exit it by name. You could have used DO for that, I suppose. The main purpose of BLOCK is to be able to declare variables that are local to the block - it's very helpful in DO CONCURRENT to avoid loop-carried dependencies.

FortranFan · ‎06-04-2014

Steve,

Yes, I understand BLOCK is a scoping construct with the ability to create locally scoped variables.

But what I'm trying to show is that from a programmer's point of view, a BLOCK construct can ALSO be used to simplify code sections that involve a series of checks. And yes, a single-pass DO loop can also achieve the same result. But IMHO, the BLOCK construct appears more elegant and clearer.

Now consider the code in OP and in Jim's suggestions in Quote #1 and what you'll find is this:

The code in OP is traditional style with a series of checks involving IF.. THEN.. ELSE statements which can lead to a lot of code indentation and the logic can become convoluted.
Jim's suggestions involve GOTOs and ALSO smart placement of the GOTOs to create a predictive pattern for the compiler and thereby avoid branch-overs and jumps (or far branches).

To which you responded simply write clear code and let the compiler take care of the rest.

Now with the BLOCK construct (or with a single-pass DO loop), the program can be made to behave the same as, say, the code in Quote #1 by Jim. But when such code involving the BLOCK construct is compiled with optimizations turned on (/O2 or /O3), will it execute as well as the code given by Jim? Or will the use of the BLOCK scoping construct lead to some slow down due to some internal aspects of how IFORT implements this feature?

That is, is there any penalty for the programming simplicity that can be gained by BLOCK construct for the "series of checks" case given by the OP?
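
For comparison, the single-pass DO loop mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the program name p_single_pass and the label CalcOnce are made up here, and Sub is a trimmed version of the routine from the earlier BLOCK example. EXIT from a named DO construct is long-standing standard Fortran, so this form does not depend on the Fortran 2008 extensions to EXIT.

```fortran
PROGRAM p_single_pass   ! hypothetical name, for illustration only

   IMPLICIT NONE
   REAL :: A, B, C
   INTEGER :: Istat

   A = 1.0
   B = -1.0
   C = 0.0

   ! The DO is used purely as a named region that EXIT can leave early.
   CalcOnce: DO
      CALL Sub(A, Istat)
      IF (Istat /= 0) EXIT CalcOnce
      CALL Sub(B, Istat)
      IF (Istat /= 0) EXIT CalcOnce
      CALL Sub(C, Istat)
      EXIT CalcOnce      ! required: without it the DO would repeat forever
   END DO CalcOnce

   IF (Istat == 0) THEN
      WRITE(*,*) "Success"
   ELSE
      WRITE(*,*) "Error, Istat = ", Istat
   END IF

CONTAINS

   SUBROUTINE Sub(X, ErrorCode)
      REAL, INTENT(IN)     :: X
      INTEGER, INTENT(OUT) :: ErrorCode
      ErrorCode = 0
      IF (X < 0.0) ErrorCode = 1   ! negative input is treated as an error
   END SUBROUTINE Sub

END PROGRAM p_single_pass
```

With B negative, the second call sets Istat to 1 and control leaves the construct before Sub is called with C. The cost relative to BLOCK is the extra unconditional EXIT at the bottom, which is easy to forget.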

Steven_L_Intel1 · ‎06-04-2014

There is no penalty I know of for using BLOCK in this manner. I agree it is clean to read.

FortranFan · ‎06-04-2014

Oh well, the example I gave in Quote #5 is not yet functional in Intel Fortran because the EXIT statement does not work inside the BLOCK construct. But note this example works correctly in gfortran 4.9.

Steve, can you indicate whether the EXIT statement aspect will make it into the compiler 2015 version along with BLOCK?

Here's an explanation of these two features by John Reid in "The New Features of Fortran 2008" document:

Steven_L_Intel1 · ‎06-04-2014

We're aware of this feature - it's on our list, but not currently implemented and probably won't get done for 15.0. I have reminded the developers about it.

jimdempseyatthecove · ‎06-04-2014

Hack(barf)

outer: block
outer_block: do
  do i=1,num_in_set
    if(x==a(i)) exit outer_block
  end do
  call r
  exit outer_block
end do outer_block
end block outer

I am aware that an IF statement can be named. The user guide I have does not state if the EXIT name can be used to exit a named IF statement/block. You could try it if you want (it will look cleaner than the dummy DO loop and you won't need the extra exit at the bottom of the dummy loop).

Jim Dempsey

Steven_L_Intel1 · ‎06-04-2014

We don't yet support the enhanced uses of EXIT as specified in F2008, including an IF block.

IanH · ‎06-07-2014

I think BLOCK is awesome, but...

Steve Lionel (Intel) wrote:
The main purpose of BLOCK is to be able to declare variables that are local to the block - it's very helpful in DO CONCURRENT to avoid loop-carried dependencies.

Drifting off topic, but don't you just exchange a loop-carried dependency for an undefined variable access?

It certainly can help prevent accessing a variable that is undefined after the DO CONCURRENT terminates.

Steven_L_Intel1 · ‎06-09-2014

Ian,

Here's an example from the standard that shows how BLOCK is useful in DO CONCURRENT:

! A variable that is effectively local to each iteration of a 
! DO CONCURRENT construct can be declared in
! a BLOCK construct within it. For example:

DO CONCURRENT (I = 1:N)
  BLOCK
    REAL :: T
    T = A(I) + B(I)
    C(I) = T + SQRT(T)
  END BLOCK
END DO

If you didn't have BLOCK to declare a "threadprivate" T, then all iterations would be sharing the same T.

jimdempseyatthecove · ‎06-09-2014

FWIW T is on the stack of the local thread and not in the threadprivate area of the thread. I am sure that is why Steve quoted "threadprivate"

Jim Dempsey

Steven_L_Intel1 · ‎06-09-2014

Yes, exactly.

IanH · ‎06-09-2014

I don't think block makes any difference in that example, bar making the intent of the programmer absolutely clear (which is still important - is that what you meant?). The alternative without block:

REAL :: T
DO CONCURRENT (I = 1:N)
    T = A(I) + B(I)
    C(I) = T + SQRT(T)
END DO

No loop carried dependency there - in each iteration T is defined before it is referenced. If the compiler wants to execute the iterations in parallel then it needs to create a temporary for T for each iteration, but that's the compiler's problem, not the programmer's.

In the block case T doesn't exist after the construct, which is an easily diagnosed error (unless it also exists in the parent scope, in which case the programmer deserves whatever they get) - here it is undefined - so BLOCK does help with that.

Edit to add... here's an example that tries to show what I was thinking - as presented it is an erroneous DO CONCURRENT construct because, assuming N > 1, in iteration N T is referenced without having been previously defined in that iteration and it is defined in another iteration (the loop dependency stuff banned by the requirements on things inside DO CONCURRENT constructs). Uncomment the BLOCK bits and the local declaration - now T is undefined in iteration N.

Perhaps it is easier for the compiler to diagnose one or other of the errors (I haven't checked) - but neither of the errors is called out by constraint, so there's no requirement that they even be diagnosed. And if N == 1 then both forms are legal (I think?).

Edit**2 to add the actual example... oops.

REAL :: T
DO CONCURRENT (i = 1:N)
! BLOCK
!   REAL :: T
    A(i) = some_pure_function(B(i))
    
    IF (i == 1) THEN
      T = A(i)
    ELSE IF (i == N) THEN
      A(i) = A(i) + T
    END IF
! END BLOCK
END DO

Perhaps I got the subject of the "helpful" comment wrong - were you perhaps saying it is more helpful for the compiler (more so than more helpful for programmers) because the loop independent nature of the variable is obvious so it is more likely that their compiler will generate something that can and will be executed in parallel? Fair enough in that case, though my recollection from playing with this some time back was that ifort had analysis capability well beyond that anyway.

Steven_L_Intel1 · ‎06-09-2014

That's not the way the language is defined. Without BLOCK, there is a single instance of T for the procedure. If DO CONCURRENT is executed in parallel, each of the iterations is referencing the same T and they step on each other. The compiler isn't free to invent its own iteration-local copies.

FortranFan · ‎06-09-2014

Steve Lionel (Intel) wrote:

That's not the way the language is defined. Without BLOCK, there is a single instance of T for the procedure. If DO CONCURRENT is executed in parallel, each of the iterations is referencing the same T and they step on each other. The compiler isn't free to invent its own iteration-local copies.

Steve,

While your comments in general are valid about the BLOCK construct and the applicability of BLOCK constructs in DO CONCURRENT, I agree with Ian in that the specific example you provide in Quote #14 above (from Intel 2015 Beta documentation) is incorrect. In that specific example, as explained by Ian, there is no loop carried dependency and the DO CONCURRENT loop should work in parallel even without the BLOCK construct; as to how a compiler would do it, as stated by Ian, is the compiler's problem - per the Fortran 2008 standard, the programmer is not required to do anything to get the two statements T = A(I) + B(I) and C(I) = T + SQRT(T) to work in parallel.

A different, correct example for BLOCK construct is called for; in addition, I think Intel documentation should convey the broader applicability of BLOCK instead of the narrower view portrayed by documentation in 2015 beta. It'll be great if the team working on 2015 release can follow-up on these two suggestions.

By the way, an implementation of BLOCK without EXIT comes across as quite lame, IMHO. I think it would be better to hold off on releasing BLOCK until the work on EXIT is completed.

IanH · ‎06-09-2014

With some trepidation - I don't think that's quite right. (This isn't my view in isolation - there's been discussion about this elsewhere (perhaps c.l.f.) that informed my thinking, but I have botched a few concepts up in the last week or two since returning from a road trip during which I had to endure about 60 hours of two children under the age of five yelling at me from the back seat.)

I think it is a bit the other way around. If a single storage location in memory is used for T (noting that's at the implementation level of detail) then the compiler must not arrange for the iterations to be executed in parallel, because then the observable behaviour wouldn't match what the language specifies. DO CONCURRENT doesn't actually mean "the compiler must do it concurrently", but the restrictions on what can go inside make it easier (...relatively... something that is difficult is still easier than something that is nearly impossible) for the compiler to set things up such that it can do stuff concurrently. The behaviour is specified as execution in "any order", which still implies that there is an order, just not one that you can know ahead of time, and bar an exception for sequential output, that's all the restrictions really support.

If the intended concept was concurrent-execution-possible-without-additional-compiler-analysis-and-transformation (such as introducing temporaries), then the restrictions on the program associated with DO CONCURRENT in 8.1.6.7 are inadequate and misdirected at the same time. You would need to always ban the definition of a variable in more than one iteration... but the standard doesn't do this - the bit around variable definition has an "or". The statement "a variable that is defined or becomes undefined by more than one iteration becomes undefined when the loop terminates" acknowledges the possibility of definition in multiple iterations and I think that requirement exists to allow for transformations such as temporaries.
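
The "any order" reading can be illustrated with ordinary DO loops. The sketch below is not from the thread: it runs the body of the disputed example once in forward and once in reverse order and compares the results. The program name, the array size and the fill values are arbitrary assumptions. It shows only that sequential execution order does not matter for this body, because T is defined before it is referenced in every iteration. It says nothing about what happens if iterations truly run in parallel on a single shared T, which is the point under dispute.

```fortran
PROGRAM any_order   ! hypothetical name, for illustration only

   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 8        ! arbitrary size
   REAL :: A(N), B(N), C_fwd(N), C_rev(N), T
   INTEGER :: I

   ! Arbitrary non-negative inputs so that SQRT is well defined
   DO I = 1, N
      A(I) = REAL(I)
      B(I) = 2.0 * REAL(I)
   END DO

   ! Same loop body as in the disputed example, executed in forward order
   DO I = 1, N
      T = A(I) + B(I)
      C_fwd(I) = T + SQRT(T)
   END DO

   ! Same loop body, executed in reverse order
   DO I = N, 1, -1
      T = A(I) + B(I)
      C_rev(I) = T + SQRT(T)
   END DO

   ! Largest elementwise difference between the two orderings
   WRITE(*,*) "Max abs difference: ", MAXVAL(ABS(C_fwd - C_rev))

END PROGRAM any_order
```

By contrast, the later example that assigns T in iteration 1 and reads it in iteration N would give order-dependent results under the same test, which is why it violates the DO CONCURRENT restrictions.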

IanH · ‎06-09-2014

No - BLOCK is awesome. I want it now! exit from block can wait!!