branch prediction pattern

may_ka · ‎06-01-2014

Hi all,

Say I have the code below

Program Main
  use ModSubroutines
  Implicit None
  Integer :: A, Stat
  call SubA(A,Stat)
  if(Stat==0) call SubB(A,Stat)
  if(Stat==0) call SubC(A,Stat)
  if(Stat==0) Then
    write(*,*) "Success"
  Else
    write(*,*) "Error"
  End If
End Program

To avoid uncontrolled termination of the program, a status variable "Stat" is carried all along the way. If its value is different from zero, the program gives an informative error message and shuts down gracefully. Since "Stat" changes from zero to one only if an error has occurred that leads to program termination, it will always be zero in a normal program run. Thus "Stat" has a very predictable pattern, and a severe slowdown when that pattern is broken is not an issue because the program will then terminate.

So, if the compiler predicts the pattern of "Stat" from "Stat" only, all these if statements, and there can be thousands, will not slow down the run time of the program because the pattern will always be "FFFFFF....". However, if the compiler mixes the pattern of "Stat" with those of other branching variables such as input parameters, the "Stat" if statements will slow down the program. I read somewhere on the web that the compiler will not store a pattern for every single branch and thus mixes branches and variables. If this is correct for Ifort too, the question is whether there is any way of telling the compiler to create a predictive pattern for the "Stat" branches only.
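For anyone who wants to compile the snippet above, here is a minimal self-contained sketch. The module body is not shown in the thread, so the three stub routines below are assumptions made purely for illustration; each simply sets Stat to 0 to mimic a normal run.

```fortran
! Hypothetical stand-ins for the routines in ModSubroutines (not shown in
! the thread). Each sets Stat to 0 on success; nonzero would mean failure.
module ModSubroutines
  implicit none
contains
  subroutine SubA(A, Stat)
    integer, intent(out) :: A, Stat
    A = 1
    Stat = 0
  end subroutine SubA

  subroutine SubB(A, Stat)
    integer, intent(inout) :: A
    integer, intent(out)   :: Stat
    A = A + 1
    Stat = 0
  end subroutine SubB

  subroutine SubC(A, Stat)
    integer, intent(inout) :: A
    integer, intent(out)   :: Stat
    A = 2*A
    Stat = 0
  end subroutine SubC
end module ModSubroutines

program Main
  use ModSubroutines
  implicit none
  integer :: A, Stat

  call SubA(A, Stat)
  if (Stat == 0) call SubB(A, Stat)   ! skipped once an error has occurred
  if (Stat == 0) call SubC(A, Stat)
  if (Stat == 0) then
    write(*, '(a,i0)') 'Success, A = ', A
  else
    write(*, '(a,i0)') 'Error, Stat = ', Stat
  end if
end program Main
```

In a normal run every Stat test takes the same direction, which is the "always the same" pattern the question is about.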

Thanks

Karl

jimdempseyatthecove · ‎06-10-2014

There is a nuance (implementation detail) with respect to DO CONCURRENT.

At what scope is the stack reservation for the block-contained T made? Note this is not the same as the scope in which the block-contained T is visible/accessible. The additional stack space for T could be reserved once outside the block (but only be visible inside the block). If so, then the stack location of T lies outside the scope of the block(s). On the other hand, if the stack reservation is made inside the scope of the block, and the block is inside the DO CONCURRENT, then the stack allocation is made on each pass through the block .AND. the stack placement is local to the thread (assuming DO CONCURRENT was made parallel).

Note, in the former case (T on stack outside scope of block) T is like a dummy reference (and shared), in the latter T is like a local variable (and private).

I would like to know if the behavior of where T is located is specified in the Fortran specification.

Jim Dempsey

Steven_L_Intel1 · ‎06-10-2014

The Fortran standard never talks about "where" a variable is allocated - instead it describes the effect of using a language feature.

"Except for the ASYNCHRONOUS and VOLATILE statements, specifications in a BLOCK construct declare construct entities whose scope is that of the BLOCK construct (16.4)."

"Actions on a variable local to a BLOCK construct do not affect any variable of the same name outside the construct."

It's just like declaring a variable in a {} section in C - variables declared there are local to the construct, come into existence when the BLOCK is entered (BLOCK is an executable construct) and become undefined when it exits.
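A small sketch of the second quoted sentence, using an illustrative variable name t that is not from the thread: the t declared in the BLOCK is a separate construct entity, so assigning to it leaves the outer t alone.

```fortran
program block_scope
  implicit none
  integer :: t

  t = 1
  block
    integer :: t   ! construct entity: a different variable from the outer t
    t = 99
    print '(a,i0)', 'inside block:  t = ', t
  end block
  print '(a,i0)', 'outside block: t = ', t   ! outer t is unchanged
end program block_scope
```

This shows only the effect the standard describes; it says nothing about where either t is stored.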

It is absolutely true that DO CONCURRENT does not require parallelization - rather, it specifies semantics that allow for parallelization (or vectorization.)

jimdempseyatthecove · ‎06-10-2014

Consider

subroutine foo
  integer(4) :: local
  ...
  do i=1, bazillion
    ...
    block
      integer(4) :: X
      ...
    end block
    ...
    block
      integer(4) :: Y
      ...
    end block
    ...
  end do
  ...
end subroutine foo

The compiler could determine foo requires 4 bytes for variable local and 4 bytes for the union of the requirements for each block. And then perform the stack reservation once at program start. This places X and Y at the same location of the stack external to both blocks. And this also saves 4 x bazillion stack adjustments.

.OR. the compiler could reserve and release the 4 bytes for X or Y at each entry to and exit from a block. IOW, incur 4 x bazillion subtractions/additions of 4 from/to the stack pointer.

The conscientious programmer might be tempted to implement the former over the latter.

Then consider

subroutine foo
  integer(4) :: local
  ...
  do i=1, bazillion
    ...
    do concurrent (j=1:n)   ! j and n are illustrative; the original elided the header
      block
        integer(4) :: X
        ...
      end block
    end do
    ...
    do concurrent (j=1:n)
      block
        integer(4) :: Y
        ...
      end block
    end do
    ...
  end do
  ...
end subroutine foo

Now then, assuming parallelization is performed: if the first stack allocation strategy is chosen, the X's of all threads in the first DO CONCURRENT are shared. Same with Y, but the X's and Y's are not cross-shared, due to the implicit barrier at the end of each DO CONCURRENT.

Now the nuance

"Actions on a variable local to a BLOCK construct do not affect any variable of the same name outside the construct."

In the first block, X is declared inside the block and meets the above qualification, yet it requires a means of privatization to produce code that runs deterministically.

If the 2nd stack allocation strategy is chosen (and the threads are not sharing the same stack space, which they normally are not), then it satisfies both the quoted qualification .AND. the "requires a means of privatization" condition. However, it also comes at the (minor) expense of the bazillion stack reservations/releases.

I know this is "picky" but this is why we have standards.

From what you quoted in #23, it is undetermined (unspecified) where the reservation is made.

Jim Dempsey

Izaak_Beekman · ‎06-10-2014

Steve, FortranFan and IanH,

I must admit that I was convinced that IanH and FortranFan were correct, and that a BLOCK construct was not required to create scalars that are not in danger of being stepped on by other iterations if the do concurrent construct is executed in parallel. I’m not sure if this is the final document defining the standard, but it seems to be the closest thing I can find for free on the internet: the J3 F2008 latest draft at: Fortran 2008 (latest draft)

The language referenced by both MFE and 8.1.6.7 seems quite convincing that no BLOCK construct is required; however, reading further down the page, Note 8.11 explicitly states:

A variable that is effectively local to each iteration of a DO CONCURRENT construct can be declared in a BLOCK construct within it. For example:

So, if the content and language of the final official F2008 specification is the same, and Note 8.11 is included, this leads me to believe that Steve is correct: a BLOCK is required inside DO CONCURRENT to declare a scalar variable that is defined in more than one loop iteration, even if no dependencies are carried between iterations.

I agree with IanH’s assessment that 8.1.6.7 does not do a good job explaining this:

If the intended concept was concurrent-execution-possible-without-additional-compiler-analysis-and-transformation (such as introducing temporaries), then the restrictions on the program associated with DO CONCURRENT in 8.1.6.7 are inadequate and misdirected at the same time. You would need to always ban the definition of a variable in more than one iteration... but the standard doesn't do this - the bit around variable definition has an "or". The statement "a variable that is defined or becomes undefined by more than one iteration becomes undefined when the loop terminates" acknowledges the possibility of definition in multiple iterations and I think that requirement exists to allow for transformations such as temporaries.

Note 8.11 seems to definitively imply that a BLOCK construct is required to define iteration local scalar variables. I’m not sure what the official procedure is, but an official request to the standards committee for clarification and to revisit the wording of 8.1.6.7 could prove quite helpful to clear this up definitively.
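The pattern that Note 8.11 describes can be sketched as a complete program. This is not the standard's own example text; the array names, the size n and the fill values are assumptions chosen only so that the result is easy to check.

```fortran
program do_concurrent_block
  implicit none
  integer, parameter :: n = 5   ! illustrative size, not from the thread
  real    :: a(n), b(n), c(n)
  integer :: i

  a = [(real(i), i = 1, n)]     ! 1.0, 2.0, ..., 5.0
  b = 10.0

  do concurrent (i = 1:n)
    block
      real :: t                 ! local to the BLOCK, hence to each iteration
      t = a(i) + b(i)
      c(i) = t
    end block
  end do

  print '(5f6.1)', c
end program do_concurrent_block
```

Because t is a construct entity of the BLOCK, no iteration can observe another iteration's t, whether or not the compiler chooses to run the iterations in parallel.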

Steven_L_Intel1 · ‎06-10-2014

This section of the standard has been examined extensively - it is the subject of three interpretations as of Corrigendum 2. (None in Corrigendum 3.) However, the changes have all been to strengthen or clarify 8.1.6.7 Restrictions on DO CONCURRENT.

Note 8.10 says "The restrictions on the statements in the range of a DO CONCURRENT construct are designed to ensure that there are no data dependencies between iterations of the loop. This permits code optimizations that might otherwise be difficult or impossible because they would depend on properties of the program not visible to the compiler."

I don't agree with Ian's interpretation of 8.1.6.7, though I can see where he is coming from. But I will discuss it with other members of the committee when I see them later this month.

FortranFan · ‎06-10-2014

It'll be great if the standard, or some supporting document or a new appendix/edition to the MFE book, could provide a better interpretation of the "execute in parallel" context of DO CONCURRENT - that's where I'm getting lost. That was the theme behind Steve's comment in Quote #18 that initiated this round of discussion and which I am struggling to understand.

Steven_L_Intel1 · ‎06-10-2014

Please don't misunderstand - the standard doesn't say anything about DO CONCURRENT executing in parallel. Rather, the standard specifies semantics that are conducive to parallelization. It's not like an OpenMP PARALLEL DO directive.

Izaak_Beekman · ‎06-10-2014

Steve,

Note that one can use a scalar without introducing a data dependency between loop iterations. It can be defined and then referenced within the same loop iteration; therefore, its value is unambiguous during that iteration. Note 8.10 misses the mark if its intent was to clarify this topic. There is no data dependency between iterations because the scalar variable never appears on the right hand side of an assignment until it has been defined in that iteration. However, note 8.11 does at least imply that a block construct is required to prevent issues when defining scalars inside do concurrent. I don’t know where the right place to look for official documents is, but judging by http://www.nag.com/sc22wg5/ it looks as though the “do iteration local scalars require block construct?” question has not been addressed by any of these interpretations or corrigenda 1-3. A formal interpretation and clearer language would be nice.

Steven_L_Intel1 · ‎06-10-2014

Given that the standard includes an explicit example, which I quoted above, I think the standard's authors thought it clear.

IanH · ‎06-10-2014

(For clarity - my post #20 was in reply to Steve's #18. This might be the discussion elsewhere that I referred to... https://groups.google.com/d/topic/comp.lang.fortran/bHTjFSj-LJo/discussion but given I seem reasonably confident of my opinion there it is possible that there was one earlier than that.)

While it is correct (because it says "can", not "must") I think note 8.11 is a bit misleading.

One for the thinkers and philosophers... inside block-like-constructs (execution constructs that have an opening statement and a closing statement, with executable code in between - like IF...END IF or DO...END DO) why is it necessary to have a "duplicate" BLOCK construct to wrap the specification part and the executable part - why couldn't the language rules for block-like-constructs simply be changed to allow them to have a specification part before their execution part?

i.e. rather than requiring the first example, permit the second example.

DO i = 1, 10
  BLOCK
    ! Specification part.
    REAL :: T
    !***
    ! Execution part.
    T = A(i) + B(i)
    C(i) = T
  END BLOCK
END DO

DO i = 1, 10
  ! Specification part - currently not permitted here.
  REAL :: T
  !***
  ! Execution part.
  T = A(i) + B(i)
  C(i) = T
END DO

Perhaps it gets a bit hairy with things like the labelled form of DO construct?

may_ka · ‎06-10-2014

Sorry guys, although I did the original post, I got lost.

Cheers

PS: have a look at my next post!

jimdempseyatthecove · ‎06-11-2014

IanH,

I agree with your observation. FWIW, I'd have preferred they chose {}'s; either would avoid assigning a keyword.

I suspect that doing so (IanH's suggestion) would have "broken" a conformity test program looking for specifications in the wrong place.

Jim Dempsey

FortranFan · ‎06-11-2014

may.ka wrote:

Sorry guys, although I did the original post, I got lost.

Cheers

PS: have a look at my next post!

Sorry Karl, it was I with postings on BLOCK construct in Quotes #5 and #9 that threw this topic off course.

How about we get back to your original question? Especially since DO CONCURRENT in the context of concurrency and parallelism and the general aspects of BLOCK construct are worthy of separate, major discussion threads instead of getting buried here.

As I mentioned earlier, based on my investigation, the BLOCK construct introduced in the Fortran 2008 standard can be very helpful in simplifying code sections that involve a series of checks. And yes, a single-pass DO loop can also achieve the same result. But in my opinion, the BLOCK construct appears more elegant and clearer. Have you taken a look at the code snippet in Quote #5? How do you find it? Do you think that if you were to use such a construct, it would simplify your code and read more easily than what you have currently? Your comment will be useful feedback for me as I'm wondering whether to propose it as a "recommended programming practice" to my colleagues at work. By the way, you will not be able to test it out because the EXIT aspect has not yet been implemented in Intel Fortran (however such a capability is available in gfortran 4.9, the open-source, "free" compiler).
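The snippet from Quote #5 is not on this page, so what follows is only a sketch of the general idea and not that snippet. It assumes a hypothetical routine step that reports which check ran and can be told to fail, and it needs a compiler that supports EXIT from a named BLOCK.

```fortran
module checks_mod
  implicit none
contains
  ! Hypothetical check: prints its id and returns Stat = 1 if told to fail.
  subroutine step(id, fail, Stat)
    integer, intent(in)  :: id
    logical, intent(in)  :: fail
    integer, intent(out) :: Stat
    print '(a,i0)', 'step ', id
    Stat = merge(1, 0, fail)
  end subroutine step
end module checks_mod

program exit_from_block
  use checks_mod
  implicit none
  integer :: Stat

  checks: block
    call step(1, .false., Stat)
    if (Stat /= 0) exit checks
    call step(2, .true., Stat)    ! simulated failure
    if (Stat /= 0) exit checks
    call step(3, .false., Stat)   ! not reached in this run
  end block checks

  if (Stat == 0) then
    print '(a)', 'Success'
  else
    print '(a,i0)', 'Error, Stat = ', Stat
  end if
end program exit_from_block
```

Compared with testing Stat before every call, the first failing step jumps straight past the remaining ones, and the error handling sits in one place after the block.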