Vectorisation of type bound procedures

AThar2 · ‎08-23-2019

I have a type-bound procedure, and within the same module I do:

!$omp simd
do i = 1, N
   (...)
   call this%AN_ELEMENTAL_INLINE_ROUTINE
   (..)
enddo

My optimisation report says that it was unable to inline indirect call list.

1) I would understand this complaint if my type-bound procedure could be overridden. However, I tried explicitly declaring it `non_overridable`, and that did not help either.

2) If I have a procedure which is deferred/overridden, I obviously cannot ask the compiler to inline it. However, does anybody here know whether such loops can still benefit from vectorisation? I initially thought of putting "!$OMP DECLARE SIMD(ROUTINE_NAME)" on every routine that could potentially override the type-bound procedure.

e.g.

type test
contains
   procedure :: init => init_1
end type

type, extends(test) :: test_2
contains
   procedure :: init => init_2
end type

subroutine init_2(....)
!$omp declare simd(init_2)
end subroutine

subroutine init_1(....)
!$omp declare simd(init_1)
end subroutine

jimdempseyatthecove · ‎08-26-2019

Can you show what you are attempting to SIMD?

IOW what are the argument declarations to init_1, init_2, ... and how are they expected to be used?

On a subroutine with pass (e.g. this), declaring elemental (and/or simd) would imply the "this" is elemental (inclusive of the additional arguments if any).

In the case where a member variable is an array, and that particular array is to be manipulated in a SIMD/elemental manner, I suggest you declare a private subroutine that does not take the this argument. Instead, pass in this%array as a traditional array reference.
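A minimal sketch of that suggestion follows. The names particle_mod, particle_t, advance, advance_kernel, x and dt are hypothetical and not taken from the original post. The type-bound procedure only forwards the member array to a private module subroutine, so the loop the compiler is asked to vectorise sees a plain REAL array and no passed-object argument.

```fortran
module particle_mod
  implicit none
  private
  public :: particle_t

  ! Hypothetical type for illustration only.
  type :: particle_t
     real, allocatable :: x(:)
   contains
     procedure :: advance
  end type particle_t

contains

  ! Type-bound wrapper: called once, outside any SIMD loop.
  subroutine advance(this, dt)
    class(particle_t), intent(inout) :: this
    real, intent(in) :: dt
    call advance_kernel(this%x, dt)
  end subroutine advance

  ! Private kernel: no passed-object argument, just a plain array.
  subroutine advance_kernel(x, dt)
    real, intent(inout) :: x(:)
    real, intent(in) :: dt
    integer :: i
    !$omp simd
    do i = 1, size(x)
       x(i) = x(i) + dt
    end do
  end subroutine advance_kernel

end module particle_mod

program demo_kernel
  use particle_mod
  implicit none
  type(particle_t) :: p

  allocate(p%x, source=[0.0, 1.0, 2.0, 3.0])
  call p%advance(0.5)
  ! Values are exactly representable, so exact comparison is safe here.
  if (any(p%x /= [0.5, 1.5, 2.5, 3.5])) error stop "advance failed"
  print *, "ok"
end program demo_kernel
```

Whether the loop is actually vectorised still depends on the compiler and its flags; the optimisation report is the place to confirm it.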

Jim Dempsey

AThar2 · ‎08-27-2019

Hello Jim,

I think my concern is not only targeted at elemental routines. I am generally concerned with how vectorisation works for function/subroutine calls, both when they are dynamically polymorphic (the compiler does not know the actual routine until runtime) and when they are type bound but not polymorphic (the compiler does know it).

For the case with a static type-bound procedure, I don't understand why the optimisation report says it cannot inline that routine because of an indirect call.

For the second case, if my procedure is dynamically polymorphic, would the loop still vectorise if I have declared all the routines that could possibly be called with an !$omp declare simd(routine_name)?

If these two cases are not clear, I am happy to make an example for what I mean.

jimdempseyatthecove · ‎08-27-2019

In your type declaration, declare a nopass subroutine, iow one where the this pointer is .NOT. passed.

!$omp simd
do I=1, N
   ...
   CALL AN_INLINE_ELEMENTAL_ROUTINE(this%memberArray(I), ...)
   ...
end do

Vector instructions (SSE, AVX...) generally function with arrays of fundamental types (INTEGER, REAL, COMPLEX of 1, 2, 4, 8 bytes), but not with arrays of user defined types. While your post #1 is not showing an array of user defined type, it has a subroutine dispatch (the call) based on the user defined type. What you need to do is lift the type bound dispatch outside the loop.

You may need to expand on the above in the event that memberArray has a different type for each of the different UDTs. IOW this may require a SELECT/CASE and optionally use of ASSOCIATE array=>this%memberArray
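As a rough sketch of lifting the dispatch out of the loop, the example below resolves the dynamic type once with SELECT TYPE and then runs a separate, unambiguous loop per type. The names shapes_mod, base_t, scaled_t, update, a and factor are hypothetical and only serve the illustration.

```fortran
module shapes_mod
  implicit none
  private
  public :: base_t, scaled_t, update

  ! Hypothetical types for illustration; not from the original post.
  type :: base_t
     real, allocatable :: a(:)
  end type base_t

  type, extends(base_t) :: scaled_t
     real :: factor = 2.0
  end type scaled_t

contains

  subroutine update(this)
    class(base_t), intent(inout) :: this
    integer :: i

    ! Resolve the dynamic type once, outside the loops.
    select type (this)
    type is (scaled_t)
       associate (a => this%a, f => this%factor)
         !$omp simd
         do i = 1, size(a)
            a(i) = a(i) * f
         end do
       end associate
    type is (base_t)
       associate (a => this%a)
         !$omp simd
         do i = 1, size(a)
            a(i) = a(i) + 1.0
         end do
       end associate
    end select
  end subroutine update

end module shapes_mod

program demo_dispatch
  use shapes_mod
  implicit none
  type(base_t) :: b
  type(scaled_t) :: s

  allocate(b%a, source=[1.0, 2.0, 3.0])
  allocate(s%a, source=[1.0, 2.0, 3.0])
  call update(b)
  call update(s)
  ! Values are exactly representable, so exact comparison is safe here.
  if (any(b%a /= [2.0, 3.0, 4.0])) error stop "base_t update failed"
  if (any(s%a /= [2.0, 4.0, 6.0])) error stop "scaled_t update failed"
  print *, "ok"
end program demo_dispatch
```

Each loop body now contains no call and no type test, which is the form a vectoriser can reason about. The price is one loop per concrete type.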

Jim Dempsey

AThar2 · ‎08-27-2019

Hello Jim,

Thanks for the reply.

Is there any specific reason why the routine cannot stay part of the type, i.e. doing the "call this% AN_INLINE_ELEMENTAL_ROUTINE"?

Extending this further: what about the case where the call has to be to a type-bound procedure because that procedure is polymorphic, i.e. the procedure is only determined at run-time? How does vectorisation cope with this?

jimdempseyatthecove · ‎08-28-2019

In computer programming, there are two conceptual entities called a vector. One is the abstract mathematical concept, generally a Fortran array of something, which can be a polymorphic type. The other is a CPU entity built from CPU intrinsic types (8-bit, 16-bit, 32-bit, 64-bit signed/unsigned integer or 32-bit, 64-bit floating point): the intrinsic type is replicated into a CPU Small Vector of contiguous 8, 16, 32 or 64-bit elements that fill or partially fill the Small Vector, whose width is 128, 256 or 512 bits. An example is 16 REAL(4)'s on a CPU with AVX512 support.

The code generated by the compiler (IOW a specific instruction such as "vector add packed single precision floating point") is not dependent on the data type. That is the CPU instruction itself does not look at the type, the compiler does and chooses which instruction to insert into the binary output. The compiler must generate a type dispatch. In C++-speak this would be a vtable dispatch. This would be a small section of code, which would be conditionally executed with tests and branches and such, making it all but impossible to use a collection of undisclosed types that are required to be adjacent for the single SIMD instruction to act upon.

While your loop in the application may reference only one of your types (across all iterations of a loop instance), the compiler cannot generate code under this assumption. To resolve this (attain vectorization), you must code in an unambiguous manner. For example, when the compiler can unambiguously know it is iterating across an array of REAL(4)'s, the loop can potentially be vectorized. Whether it can or cannot be vectorized will depend on the other statements in the loop.

While in C++ you can facilitate this using templates, Fortran unfortunately does not have templates. You will have to write individual SELECT/CASE/DO/ENDDO sections of code and/or specific instances of GENERIC procedures, which can be hacked together using the FPP and #define for your specific types.

If you can present a complete, and simple case that can be compiled by others, you might get a recommendation sooner.

Jim Dempsey

AThar2 · ‎09-03-2019

@Jim

Thanks for the reply. Are you essentially saying that the compiler must know which function/subroutine we are calling within a SIMD loop before it can vectorise, and hence that any polymorphic call is not possible? So far I thought you could do that as long as you declared those routines with a declare simd(Function_name).

If this is correct (that it is not possible), can you then advise me on how to deal with dynamically changing conditions within a SIMD loop?

For example,

do i = 1,no_part
   (...)
   if( st% ip(i) == 1) then
      call ROUTINE1
   else
      call ROUTINE2
   endif
enddo

I know you once showed me a nice example using the FPP and #define to attain vectorisation, which works very nicely for me when the outcome of the ifs does not change over the course of the loop itself.

However, in the above example, you might have i = 1 calling ROUTINE1 while i = 2 calls ROUTINE2.

I think I know the answer by now, but please confirm it for me. If I want to SIMD a loop, the compiler must execute the same code for every iteration i, no matter what. Hence, there is no way around accepting the overhead of letting it execute all branches, after which it automatically merges the results for me depending on my condition.

Thanks again.

jimdempseyatthecove · ‎09-04-2019

A loop with a conditional section as above is normally not capable of being vectorized, even if ROUTINE1 and ROUTINE2 are inlined.

Only in very restrictive cases might the compiler figure out how to vectorize.

In order to vectorize, typically both the ROUTINE1 and ROUTINE2 paths will be executed, and then the conditional test will be made to select what/where to store the result. For example, should the inlining of the two routines produce something like

do i = 1,no_part 
   if( st% ip(i) == 1) then 
        st%foo(i) = st%foo(i)**2
   else
        st%foo(i) = st%foo(i) / 2
   endif 
enddo

the compiler will produce something like this:

do i = 1,no_part 
   tempA = st%foo(i)**2
   tempB = st%foo(i) / 2
   if( st% ip(i) == 1) then 
     st%foo(i) = tempA
   else
     st%foo(i) = tempB
   endif 
enddo

*** where the IF(...) branches are replaced with conditional moves. IOW the loop can then run on each element of the array without branching.

The cost is increased by the unnecessary computation, but also reduced by the fact the loop can be vectorized.
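The same idea can also be written by hand. The sketch below, with made-up data and hypothetical names (masked_update, temp_a, temp_b, expected), computes both results for every element and selects one with the MERGE intrinsic, so the loop body has no branch. It checks the result against the branching form.

```fortran
program masked_update
  implicit none
  integer, parameter :: n = 8
  integer :: ip(n), i
  real :: foo(n), expected(n)
  real :: temp_a, temp_b

  ! Made-up test data; the flag alternates from one element to the next.
  ip  = [1, 0, 1, 0, 1, 0, 1, 0]
  foo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

  ! Reference result computed with the branching form.
  do i = 1, n
     if (ip(i) == 1) then
        expected(i) = foo(i)**2
     else
        expected(i) = foo(i) / 2
     end if
  end do

  ! Branch-free form: both paths are evaluated, MERGE selects one.
  !$omp simd private(temp_a, temp_b)
  do i = 1, n
     temp_a = foo(i)**2
     temp_b = foo(i) / 2
     foo(i) = merge(temp_a, temp_b, ip(i) == 1)
  end do

  ! Values are exactly representable, so exact comparison is safe here.
  if (any(foo /= expected)) error stop "masked result differs"
  print *, "ok"
end program masked_update
```

This only pays off when both paths are cheap and free of side effects. If one path is expensive or rarely taken, sorting or partitioning the data by the flag before the loop may be the better trade.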

You have provided insufficient detail for anyone here to provide you with the information you seek.

Jim Dempsey