Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Segmentation fault only when vectorization is enabled

AThar2
Beginner
2,148 Views

Part of my code has been vectorized using !$omp simd. Whenever I have vectorization enabled I get an error saying "array index out of bounds". The code line it points to seems quite random, since when I comment out that line the error persists and refers to another line.

In my loop I have a condition which, if true, calls function A and, if false, calls function B (both functions also have a function call inside them). All of these functions have been inlined and declared simd. The point I want to make is that if I comment out one of these function calls (the part of the condition which I KNOW the code won't take at run time because of my flag settings), the segmentation fault is delayed. If I comment out the other function (B, the one that is being called), then the following two scenarios happen:

1) If I also comment out function A, EVEN THOUGH it is not being called, my program runs!

2) If I DON'T comment out function A (EVEN THOUGH IT IS NOT BEING CALLED), my program complains about an "array index out of bounds".

I did have -traceback enabled, but its output was not helpful.

I even wrote a check saying that if the index gets larger than the array size, then skip that iteration (CYCLE). However, I am 100% sure that my array index does not go out of bounds, unless the vectorization is doing something I am not aware of.

I don't know if this is useful, but when running with Valgrind I get numerous error messages which are nearly identical (only when running the case where the program actually fails).

 

I first get this error:

 

==2883== Invalid read of size 8
==2883==    at 0x44C200: lpt_particles_mp_displu_ (in lpt.x)
==2883==    by 0x43AC6E: lpt_marching_mp_unsteady_spray_steady_flow_ (in lpt.x)
==2883==    by 0x41ED25: MAIN__ (in lpt.x)
==2883==    by 0x403761: main (in lpt.x)
==2883==  Address 0xc001077a90872154 is not stack'd, malloc'd or (recently) free'd

 

Then I get the following errors (which are probably because some allocations were never deallocated before the program crashed; correct me if I am wrong):

 

==31838== 262,144 bytes in 1 blocks are still reachable in loss record 137 of 137
==31838==    at 0x4C2C1E0: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==31838==    by 0xDDF8C9A: ???
==31838==    by 0xDDF707B: ???
==31838==    by 0xDDEF425: ???
==31838==    by 0xDDEF797: ???
==31838==    by 0x5421EDB: fi_endpoint (fi_endpoint.h:156)
==31838==    by 0x5421EDB: ??? (ofi_init.h:1733)
==31838==    by 0x5429F08: MPIDI_NM_mpi_init_hook (ofi_init.h:1117)
==31838==    by 0x5429F08: MPID_Init (ch4_init.h:855)
==31838==    by 0x5429F08: MPIR_Init_thread (initthread.c:647)
==31838==    by 0x541DD1B: PMPI_Init (init.c:284)
==31838==    by 0xC611CFA: MPI_INIT (initf.c:275)
==31838==    by 0x4481EF: lpt_parallel_mp_parallel_init_ (in lpt.x)
==31838==    by 0x41ED0D: MAIN__ (in lpt.x)
==31838==    by 0x403761: main (in lpt.x)

 

Then I get several of these :

==31838== 28,517,032 bytes in 1 blocks are possibly lost in loss record 137 of 137
==31838==    at 0x4C2A0B0: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==31838==    by 0x53465F: _mm_malloc (in lpt.x)
==31838==    by 0x4B9680: for_alloc_allocatable (in lpt.x)
==31838==    by 0x429CCF: lpt_geom_mp_tri_normals_ (in lpt.x)
==31838==    by 0x436FF6: lpt_init_mp_spray_init_ (in lpt.x)
==31838==    by 0x466EFC: lpt_preprocessor_mp_preproc_ (in lpt.x)
==31838==    by 0x41ED17: MAIN__ (in lpt.x)
==31838==    by 0x403761: main (in lpt.x)

 

 

I know that not showing the code makes this much more difficult, but it would be insane for me to post the entire code here, as it is very large. Trying to reduce the problem to a small reproducer has not been successful yet; it is very difficult to do when you have no idea where the error is.

 

A suggestion: could it be that I am exhausting my vector registers?

I have tried to compile with -xCORE-AVX2 -align array32byte -qopt-zmm-usage=high and with -xCORE-AVX512 -align array64byte -qopt-zmm-usage=high.

 

I would really appreciate it if somebody who has experienced a similar issue could comment, or could point out potential causes of this error.

Please note again: I have run this in full debug mode and also fully optimised (-O3) but without vectorisation. In neither case did the compiler complain, nor did anything show up when running with Valgrind. Everything seemed fine.

 


0 Kudos
33 Replies
jimdempseyatthecove
Honored Contributor III
1,628 Views

>>Part of my code has been vectorized using !$omp simd...(Both functions also have a function call inside them). But all these functions have been inlined and declared simd.

See: https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-simd-directive-openmp-api

"No other OpenMP* Fortran construct can appear in a SIMD directive."

IOW the (inlined) functions called from within an !$OMP SIMD do loop, cannot themselves contain !$OMP directives.

What you possibly could consider is declare those functions (A, B, and what they call) with

!DIR$ ATTRIBUTES VECTOR [:clause...] :: routine-name

Check the documentation as to appropriate clauses that you may wish to use.

Jim Dempsey

 

0 Kudos
AThar2
Beginner
1,628 Views

Thanks very much Jim.

I already have a !$omp declare simd for my functions A and B.

 

Is that not the same as the directive you mentioned? 

 

So what you (and the article) are saying is that my functions A and B cannot themselves call a third function which has directives like inline and omp declare simd, since functions A and B are already inside a simd loop?

 

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views

>>I already have a !$omp simd declare for my functions A and B.

*** NO - I stated: "IOW the (inlined) functions called from within an !$OMP SIMD do loop, cannot themselves contain !$OMP directives."

You also mentioned that these functions also call functions that are also inlined (and I presume !$OMP SIMD'd as well - take it off).

When a function/subroutine IS inlined into an !$OMP SIMD loop it behaves as if the code was pasted into the loop.
*** However, declaring a function as inlineable does not assure it is inlined.
*** Also, inlining a function/subroutine that is not suitable for vectorization will either bung-up the !$OMP SIMD execution .OR. result in the compiler .NOT. SIMD-izing the loop.

A function/subroutine that has !DIR$ ATTRIBUTES VECTOR [:clause...] :: routine-name

.AND. in fact is vectorizable to the vector width of the !$OMP SIMD, may be suitable for inlining.

>>What you/article are/is saying is that my function A and B cannnot themselves have a third function which got a directive like inline and omp declare simd , since functions A and B already are inside a simd loop?

That is how I interpret the rules.

Assume you use your IDE to locate function A, and COPY the code into the paste buffer, then Paste the code directly into the !$OMP SIMD, this code cannot have !$OMP directives, it is just code (presumably compatible to be within an !$OMP SIMD loop). This code now in place in the !$OMP SIMD loop is SIMD-ized (assuming it is capable of being so). This also holds true for the inline-d functions within (formerly) A and B as they too are brought into the !$OMP SIMD loop (and they must not have !$OMP SIMD or other !$OMP directives).

Also, the code in the outer (only) !$OMP SIMD loop cannot contain flow control statements that cannot be performed as SIMD-equivalent statements. This means that if the flow control in your code (including the inlined A and B, and their inlined functions) cannot be rewritten by the compiler as straight-line code, then these functions are not suitable for a SIMD loop.

What a SIMD loop can do is convert simple flow control such as:

     if (Array(I) < 0.0) then
        OtherArray(I) = expression
     else
        OtherArray(I) = somethingElse
     endif

While the source code above has flow control, the SIMD instructions are capable of masked moves, meaning some of the vector lanes will get expression and other vector lanes will get somethingElse. Note, your expression and somethingElse can be your additional inlined functions *** provided they stay within their lane and do not contain flow control that cannot be SIMD-ised.

Your current construction is equivalent to:

!$OMP SIMD
...
      !$OMP SIMD
      ...
            !$OMP SIMD
             ....
            !$OMP END SIMD
            ...
       !$OMP END SIMD
...
!$OMP END SIMD

And that is not valid.

Jim Dempsey

0 Kudos
AThar2
Beginner
1,628 Views

Jim, thanks very much. It does really make sense with respect to having omp declare simd when the call is already within a simd do-loop. But I still have two points of confusion about what you mentioned:

1) The article below motivated me to decorate my functions (used in a simd loop) with declare simd. But are you telling me that I should either inline a function or declare it simd, and cannot use both? This is what I understand by now. Second, are you then saying that I still CAN have inline attributes together with !DIR$ ATTRIBUTES VECTOR, or shall I just keep the latter? I have always understood that !DIR$ ATTRIBUTES VECTOR and !$OMP DECLARE SIMD are equivalent, one being the Intel directive and the other being from OpenMP?

https://software.intel.com/en-us/articles/explicit-vector-programming-in-fortran

 

2) If you can, please also help me with my confusion about the branches. Provided that all my functions can be inlined (I am mostly using FORCEINLINE, as these routines are not large) or declared simd, would there be a problem with vectorisation? Isn't it only when these routines are incapable of being inlined and vectorized that the SIMD loop will be serialized around the call? PS: whenever I can, I am using procedure pointer calls, which are set at program initialisation to the routine they should point to; that way I can avoid the branches in my loop. Do you reckon that makes it better/easier/more efficient in general terms?

0 Kudos
TimP
Honored Contributor III
1,628 Views

If you have chosen your inner loop well and in-lined everything inside it, the compiler should do a good job of auto-vectorization without the directives.   In-lining outside vectorizable loops might be counter-productive.  If you require the compiler to perform loop interchanges for effective vectorization, you will need to keep the code inside the loops small, and omp directives may prevent interchanges.

Vectorization of conditionals normally is done by forcing speculative execution of both sides of a conditional.  If there is much code with results to be discarded, this will become counter-productive.  Likewise, in the case where you need to perform outer loop vectorization because the inner loop is not vectorizable by itself, and it is impossible to interchange loops, the size of code which will benefit and the vector speedup resulting will be less than you would hope for.  In that case, you would still have only one level of !$omp simd.

On some long past Intel CPUs, forceinline could be effective in avoiding stalls incurred by a large number of arguments in a procedure call. You would require detailed analysis e.g. by VTune to uncover such a situation.  On more recent architectures, inlining might be needed only to the extent needed to remove procedure calls from inside the vectorizable loop.   The omp simd directive on the inner loop, among other things, tells the compiler to ignore hidden dependencies which the compiler might be able to eliminate with more levels of inlining.  Proprietary directives such as the !dir$ VECTOR family can be more targeted to the specific requirements for vectorization in your application (ignore dependencies, ignore possibility of faults in speculative execution....)

I haven't investigated, but I doubt the mechanism for declaring simd procedures would be useful when you have already achieved the necessary inlining.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views
do i=1, N
   Out(i) = ...              ! scalar (single element)
   ...
   ... = A(i) * B(i) + C(i)  ! scalar (single elements)
   ...
end do

! When in the above loop ...(i)
! is not dependent upon ...(i+j) or (i-j)
! Then you can

!$omp do simd
do i=1, N	! an implicit stride of SIMD width is used
   Out(i) = ...              ! (i) becomes (i:MIN(i+SIMDwidth-1, N))
   ...
   ... = A(i) * B(i) + C(i)  ! (i) becomes (i:MIN(i+SIMDwidth-1, N))
   ...
end do

Note that the MIN clamp in (i:MIN(i+SIMDwidth-1, N)) only takes effect on the last iteration.

In looking at the original (scalar) loop above, when the entire loop has:

a) No inter-iteration dependencies (i) <-> (i+offset)
b) When the loop is known to execute SIMD serially, offset is not less than SIMD width (or greater than -SIMD width)
c) No subroutine/function calls are made that cannot be inlined by IPO (inlinable ones are typically PURE or ELEMENTAL)
d) no IF[THEN, ELSE] ENDIF that cannot be efficiently speculatively executed (Tim P's) second paragraph
e) All explicitly inline code is scalar and conforms to a), b), c), d)

Use the above qualification measures to determine if and how you should use !$OMP SIMD
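As a minimal sketch of a loop that satisfies criteria a) through e) above (the array names and sizes here are made up for illustration, not taken from the poster's code):

```fortran
program simd_checklist_demo
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n), c(n), out(n)
  integer :: i

  call random_number(a)
  call random_number(b)
  call random_number(c)

  ! No inter-iteration dependencies, no procedure calls,
  ! no flow control: each iteration touches only element i.
  !$omp simd
  do i = 1, n
     out(i) = a(i)*b(i) + c(i)
  end do

  print *, sum(out)
end program simd_checklist_demo
```

Such a loop would very likely auto-vectorize even without the directive; the !$omp simd here simply makes the contract explicit.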

It appears that at the time of your initial post, your thoughts were along the lines of:

SIMD... good!, Use it everywhere.

You cannot throw in the directives without forethought.

Keep in mind that

!$omp simd

is a binding contract between you and the compiler that the code contained within conforms to the requirements of a SIMD loop.

Jim Dempsey

0 Kudos
AThar2
Beginner
1,628 Views

@Jim and @Tim Again thank you very much for your time.

 

It appears that the time of your initial post, your thoughts were along the line of

SIMD... good!, Use it everywhere.

I am aware of the dependency rules; however, I was not, and I am afraid still am not, sure when and when not to use !$omp declare simd and/or !DIR$ ATTRIBUTES VECTOR.

The article I referred to in quote 5 seems to suggest that the two directives are similar. It also seems to suggest that one should apply such a directive to a function/subroutine if it appears in a simd do-loop. Meanwhile, you (Jim) mentioned that in that case I get an !$OMP directive within an !$OMP directive, which means we break one of the rules. That also makes sense to me.
What I am actually understanding from this is that you either use INLINE directives or !$omp declare simd. Is that correctly understood?

 


d) no IF[THEN, ELSE] ENDIF that cannot be efficiently speculatively executed (Tim P's) second paragraph

Is there a way to know whether my branch cannot be speculatively executed? What are the obvious killers of simd loops, apart from early returns/exits?

 

Thanks again

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views

>>What I am actually understanding from this is either you use INLINE directives or !$OMP simd. Is that correctly understood.

Starting from PROGRAM, as you nest into your application, when execution enters an !$OMP SIMD region it should not enter an additional !$OMP SIMD prior to exiting the first (outer) one. Any/all code inlined, explicitly with INLINE or implicitly by way of Inter-Procedural Optimizations, should be written AS-IF it were scalar code. Non-inlined functions/subroutines should be decorated with !DIR$ ATTRIBUTES VECTOR... .AND. should the compiler determine one is suitable for its purpose in your !$OMP SIMD section, then, and only then, will you get what you want.

>>Is there a way to know if my clause  cannot be speculatively executed.

It would help if you show us the code. Compiler diagnostics may (may stressed) be able to tell you this.

Also:

Tim P>> If there is much code with results to be discarded, this will become counter-productive. 

Tim's point is that your code CAN be speculatively executed, but it can also be counter-productive (too many statements in each/every branch of speculation).

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views

Speculative execution:

IF(condition) THEN
  ... some scalar work
  Output(I) = prior scalar work
ELSE
  ... some other scalar work
  Output(I) = prior other scalar work
ENDIF

Becomes

  ... some scalar work
  tempTrue = prior scalar work
  ... some other scalar work
  tempFalse = prior other scalar work
  IF(condition) THEN
    Output(I) = tempTrue
  ELSE
    Output(I) = tempFalse
  ENDIF

Or if you wish to view it as:

  ... some scalar work
  tempTrue = prior scalar work
  ... some other scalar work
  tempFalse = prior other scalar work
  Output(I) = MERGE(tempTrue, tempFalse, condition)

It is done this way because the SIMD instructions have the ability to conditionally insert (or not insert) into individual lanes of the vector.

This turns code that would otherwise have been incapable of executing on vectors into code that can be executed on vectors.
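For completeness, a small self-contained sketch of the MERGE form inside a simd loop (the array values here are made up):

```fortran
program masked_merge_demo
  implicit none
  integer, parameter :: n = 8
  real :: a(n), out(n)
  integer :: i

  a = [ -2.0, 1.0, -0.5, 3.0, 2.0, -4.0, 5.0, -1.0 ]

  !$omp simd
  do i = 1, n
     ! Both "branches" are evaluated for every element;
     ! MERGE picks the first value where a(i) < 0, else the second.
     out(i) = merge(2.0*a(i), -a(i), a(i) < 0.0)
  end do

  print *, out   ! -4.0 -1.0 -1.0 -3.0 -2.0 -8.0 -5.0 -2.0
end program masked_merge_demo
```

The compiler is free to implement the MERGE as a masked blend, which is exactly the speculative-execution pattern described above.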

The trade-off is: when

The sum of the work of both paths in vector mode
    is less than
The (sum of the scalar work of each path taken) * number of lanes in the SIMD vector

Then it is beneficial to do the extra work.

Jim Dempsey

0 Kudos
AThar2
Beginner
1,628 Views

Jim, thanks very much. After adopting your advice about not having omp simd within omp simd, the seg fault seems to be solved.

I always thought that I had to add

!$OMP DECLARE SIMD( routine-name )

for the function calls I have within my simd loop, and did not realize the clash it would cause when also inlining the routine on top of it.

 

I guess as a developer one must look at the routine being called from a simd loop and choose whether to apply omp declare simd or to inline the function.

Also, just to make sure for myself: !$OMP DECLARE SIMD( routine-name ) and !DIR$ ATTRIBUTES VECTOR are analogous, just as !$omp simd and !DIR$ SIMD are?
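As a sketch of the pairing being asked about (the function name and body are made up; the commented !DIR$ line is the Intel-specific near-equivalent discussed in this thread, not something from the poster's code):

```fortran
module kernels
contains
  pure real function scale_add(x, y)
    !$omp declare simd(scale_add)
    ! Intel-specific near-equivalent (per this thread's discussion):
    ! !DIR$ ATTRIBUTES VECTOR :: scale_add
    real, intent(in) :: x, y
    scale_add = 2.0*x + y
  end function scale_add
end module kernels

program declare_simd_demo
  use kernels
  implicit none
  real :: out(4)
  integer :: i

  !$omp simd
  do i = 1, 4
     out(i) = scale_add(real(i), 1.0)
  end do

  print *, out   ! 3.0  5.0  7.0  9.0
end program declare_simd_demo
```

With the declare simd decoration the compiler can generate a vector variant of scale_add for use inside the loop, instead of requiring the call to be inlined first.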

 

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views

You also need to ensure that any intrinsic math functions used are vectorizable (this may depend upon the targeted host CPU). I do not have a handy list for this. The compiler vectorization diagnostic may be able to tell you when a chosen function will not vectorize.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views

The !$omp SIMD directive is intended not only to declare that the loop is (suitable and) to be SIMD;
it also assures that, should this loop be in a parallel region, the iteration space is divided at SIMD vector boundaries.

!DIR$ ATTRIBUTES VECTOR makes no such requirement (and no provisions) that the vector(s) received do or do not contain partial SIMD vectors that may be in contention with another thread (this is not to say it is not multi-thread safe).

Jim Dempsey

0 Kudos
AThar2
Beginner
1,628 Views

@Jim, just a thought regarding what you mentioned on If statements.

 

Would it help if I made a procedure pointer, set only once during the initialisation of my program depending on user inputs, so that in my simd loop I do not need an if statement but rather just call through the pointer?

 

So for example

 

! init 


if(apply_turb) then 
    proc_ptr => routine_A
else 
    proc_ptr => routine_B
end if


! simd loop

!$OMP SIMD 
do i = 1, N 
   (....) ! other calculations etc.

   call proc_ptr(A,B,C)

end do

 

 

Provided that my routines A and B are vectorizable, of course.

 

Would that avoid the need for the simd loop to do speculative execution?

 

Thanks!

 

0 Kudos
AThar2
Beginner
1,628 Views

Oh, but then it no longer makes sense to apply an inline directive to the procedure, since the compiler does not know which routine to inline.

 

 

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views

It would be better to:

! YourSIMDloop.inc
! simd loop
!$OMP SIMD 
do i = 1, N 
   (....) ! other calculations etc.
! *** proc_ptr is either routine_A or routine_B as substituted by FPP
!dir$ attributes forceinline :: proc_ptr
   call proc_ptr(A,B,C)
end do
! end YourSIMDloop.inc

-----

! in your main code
...
if(apply_turb) then
! Compile with PreProcess file
! use FPP #define and #include
#define proc_ptr routine_A
#include "YourSIMDloop.inc"
#undef proc_ptr
else
#define proc_ptr routine_B
#include "YourSIMDloop.inc"
#undef proc_ptr
endif
...

*** Note, if your loop is not too long, I would suggest a Copy and Paste then change the routine line(s)

Fortran itself does not provide macro substitution; FPP does. That is why #include, #define and #undef are used.

Jim Dempsey

0 Kudos
AThar2
Beginner
1,628 Views

Thanks very much Jim!

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,628 Views

I forgot to mention, do this only after fully debugging your code. Stepping through this with the debugger may present you with incorrect line numbers.

A third alternative is to write two variations of the code in the !$OMP SIMD loop. One using routine_A, the other using routine_B.

Then maintain the code using a good difference program. IOW update one, then synch the files sans the routine_x statement.

I use Beyond Compare (http://www.scootersoftware.com/download.php). It provides a side-by-side view of the differences between files, and lets you replace a line or group of lines left to right or right to left. I find it a really great tool when comparing yesterday's working copy of the code (100s of files) to today's broken code.

Jim Dempsey

0 Kudos
AThar2
Beginner
1,629 Views

Thanks Jim!

I will look at this .

 

I just wondered whether it is possible to keep the "*.inc" code in the same file as the main code. Sometimes it would be convenient to have it be part of the file, although not part of a routine/module etc.

Is that possible with macros / fpp ?

0 Kudos
AThar2
Beginner
1,629 Views

Jim, I am getting a link error when I put the inline attribute before the call to PROC_PTR (line 7 in your code).

Is inlining not possible here?

 

Just an update:

I can apply the inline directive at routine A's and B's interfaces, something like
 

!DIR$ ATTRIBUTES FORCEINLINE :: A

pure subroutine A( ....) 

 

Does this not essentially have the same effect as your demo code intended?

 

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,484 Views

Inline-ing a proc_ptr is meaningless. It will still point to an out-of-line function.

The "!DIR$ ATTRIBUTES FORCEINLINE :: A" goes in front of the CALL statement.

ivf doc>>You should place the directive option in the procedure whose inlining you want to influence.

The procedure that incorporates the inlining is that which performs the CALL.
Alternatively, you may be able to place this in a module in which you USE and declare the interface. (test and verify)

The file containing SUBROUTINE A may not be visible at the point of the CALL A, so any attributes in the file containing A may not be known to the compiler when it compiles the CALL A. Attributing at the CALL instructs the compiler to go look for it.
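A minimal sketch of this caller-side placement (the routine name and body are made up; compilers other than ifort simply treat the !DIR$ line as a comment, so the attribute is a no-op there):

```fortran
module work_mod
contains
  pure subroutine routine_a(x, y)
    real, intent(in)  :: x
    real, intent(out) :: y
    y = 2.0*x
  end subroutine routine_a
end module work_mod

program forceinline_demo
  use work_mod
  implicit none
  ! Declared in the CALLER, per the doc quote above: the directive
  ! goes in the procedure whose inlining you want to influence.
  !DIR$ ATTRIBUTES FORCEINLINE :: routine_a
  real :: y

  call routine_a(3.0, y)
  print *, y   ! 6.0
end program forceinline_demo
```

The behaviour of the program is unchanged either way; the attribute only influences whether the call is inlined at this site.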

In the #include example in #16, the text "proc_ptr" is replaced by the text "routine_A" (or just A in what you describe in #20) _prior_ to the Fortran compiler seeing the source code. A temp .f90 file is created with the #include-ed and macro-expanded text, and the temp file is compiled.

Jim Dempsey

0 Kudos
Reply