Hello everybody,
I have the following code for a vectorized computation in a numerical simulation code of mine:
```fortran
! (various definitions) ...
      integer, parameter :: Nvec1 = 1024
! ...
!============================================================ begin code
      iel = 0
      do while (iel.lt.nval)
      Nvec = nval - iel
      if (Nvec.gt.Nvec1) Nvec = Nvec1
!$OMP PARALLEL num_threads(nthrea2)
!$OMP DO
      do J = 1,Nvec
      U11(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+1)
     *         - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+1)
      end do !J
!$OMP END DO
!$OMP DO
      do J = 1,Nvec
      epsN(iel+J) = nx(iel+J)*U21(J) + ny(iel+J)*U22(J)
     *            + nz(iel+J)*U23(J)
      end do !J
!$OMP END DO
!$OMP DO
      do J = 1,Nvec
      epsN(iel+J) = epsN(iel+J) - nx(iel+J)*U11(J) - ny(iel+J)*U12(J)
     *            - nz(iel+J)*U13(J)
      end do !J
!$OMP END DO
!$OMP DO
      do J = 1,Nvec
      epsM(iel+J) = mx(iel+J)*U21(J) + my(iel+J)*U22(J)
     *            + mz(iel+J)*U23(J)
      end do !J
!$OMP END DO
!$OMP DO
      do J = 1,Nvec
      epsM(iel+J) = epsM(iel+J) - mx(iel+J)*U11(J) - my(iel+J)*U12(J)
     *            - mz(iel+J)*U13(J)
      end do !J
!$OMP END DO
!$OMP DO
      do J = 1,Nvec
      epsM(iel+J) = epsM(iel+J) - A1(J)*U14(J) - A2(J)*U15(J)
     *            - A3(J)*U16(J)
      end do !J
!$OMP END DO
!$OMP END PARALLEL
      iel = iel + Nvec
      end do ! while loop
!================================================================ end code
```
All vectors used in the above code have been declared with the directive `!dir$ attributes align:64`.
One detail that has caught my attention is that when building for enhanced instruction sets (whether SSE2, SSE3, or AVX), I see no difference at all in the speed of my code.
From my understanding, these enhanced instruction sets can significantly speed up execution, which is why I was surprised (and disappointed, given that writing loops in a form that allows vectorization can take a lot of time).
Is there anything that could be explaining this behavior?
In case it matters, many of the vectors have been declared (together with the `!dir$ attributes align:64` directive) in a module which is used by my routine. I include the directive `!DIR$ ASSUME_ALIGNED ...` for these vectors.
Thanks in advance for any help/input on this.
---
I don't see any vectorisation directives in your code. Can you say what is expected to vectorise?
---
Why would I need explicit vectorization directives? Shouldn't the code automatically vectorize if I use the appropriate optimization flag?
---
Is your build Release or Debug?
What is the value of nthrea2?
...LDPMends(1,iel+J)... will cause strided loads. When targeting AVX512 this may use gather instructions.
Nvec1 is only 1024; this may be too little work per (parallel) loop.
I suggest you check for store order dependencies and collapse whatever loops are safe to collapse.
The following is free-form (untested):
```fortran
!$OMP PARALLEL num_threads(nthrea2)
!$OMP DO
do J = 1,Nvec
   U11(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+1) - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+1)
   epsN(iel+J) = nx(iel+J)*U21(J) + ny(iel+J)*U22(J) + nz(iel+J)*U23(J) - nx(iel+J)*U11(J) - ny(iel+J)*U12(J) - nz(iel+J)*U13(J)
   epsM(iel+J) = mx(iel+J)*U21(J) + my(iel+J)*U22(J) + mz(iel+J)*U23(J) - mx(iel+J)*U11(J) - my(iel+J)*U12(J) - mz(iel+J)*U13(J) - A1(J)*U14(J) - A2(J)*U15(J) - A3(J)*U16(J)
end do !J
!$OMP END DO
!$OMP END PARALLEL
```
It is up to you to insert fixed-form line breaks and continuation lines.
When pasting code, use the "..." on the tool bar, then </>, then select Markup=Fortran, then paste your code into the edit box.
Jim Dempsey
---
Thanks Jim.
Sorry for the sloppy code, I will paste properly in the future so that it is easier to read.
I also noticed that I had not included the code for initializing U21, U22, etc.
I have collapsed the code as follows, and still do not see improvement with AVX...
```fortran
      iel = 0
      do while(iel.lt.nval)
      Nvec = nval - iel
      if(Nvec.gt.Nvec1) Nvec = Nvec1
!$OMP PARALLEL DO num_threads(nthrea2)
      do J = 1,Nvec
      U11(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+1)
     *         - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+1)
      U12(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+2)
     *         - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+2)
      U13(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+3)
     *         - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+3)
      U14(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+4)
     *         - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+4)
      U15(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+5)
     *         - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+5)
      U16(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+6)
     *         - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+6)
      U21(J) = (1.d0+aHHT)*U((LDPMends(2,iel+J)-1)*ndof+1)
     *         - aHHT*Uprev((LDPMends(2,iel+J)-1)*ndof+1)
      U22(J) = (1.d0+aHHT)*U((LDPMends(2,iel+J)-1)*ndof+2)
     *         - aHHT*Uprev((LDPMends(2,iel+J)-1)*ndof+2)
      U23(J) = (1.d0+aHHT)*U((LDPMends(2,iel+J)-1)*ndof+3)
     *         - aHHT*Uprev((LDPMends(2,iel+J)-1)*ndof+3)
      U24(J) = (1.d0+aHHT)*U((LDPMends(2,iel+J)-1)*ndof+4)
     *         - aHHT*Uprev((LDPMends(2,iel+J)-1)*ndof+4)
      U25(J) = (1.d0+aHHT)*U((LDPMends(2,iel+J)-1)*ndof+5)
     *         - aHHT*Uprev((LDPMends(2,iel+J)-1)*ndof+5)
      U26(J) = (1.d0+aHHT)*U((LDPMends(2,iel+J)-1)*ndof+6)
     *         - aHHT*Uprev((LDPMends(2,iel+J)-1)*ndof+6)
      epsN(iel+J) = nx(iel+J)*U21(J) + ny(iel+J)*U22(J)
     *            + nz(iel+J)*U23(J) + A4n(iel+J)*U24(J)
     *            + A5n(iel+J)*U25(J) + A6n(iel+J)*U26(J)
     *            - nx(iel+J)*U11(J) - ny(iel+J)*U12(J)
     *            - nz(iel+J)*U13(J)
     *            - A1n(iel+J)*U14(J) - A2n(iel+J)*U15(J)
     *            - A3n(iel+J)*U16(J)
      epsM(iel+J) = mx(iel+J)*U21(J) + my(iel+J)*U22(J)
     *            + mz(iel+J)*U23(J) + A4m(iel+J)*U24(J)
     *            + A5m(iel+J)*U25(J) + A6m(iel+J)*U26(J)
     *            - mx(iel+J)*U11(J) - my(iel+J)*U12(J)
     *            - mz(iel+J)*U13(J) - A1m(iel+J)*U14(J)
     *            - A2m(iel+J)*U15(J) - A3m(iel+J)*U16(J)
      epsL(iel+J) = lx(iel+J)*U21(J) + ly(iel+J)*U22(J)
     *            + lz(iel+J)*U23(J) + A4l(iel+J)*U24(J)
     *            + A5l(iel+J)*U25(J) + A6l(iel+J)*U26(J)
     *            - lx(iel+J)*U11(J) - ly(iel+J)*U12(J)
     *            - lz(iel+J)*U13(J) - A1l(iel+J)*U14(J)
     *            - A2l(iel+J)*U15(J) - A3l(iel+J)*U16(J)
      end do !J
      iel = iel + Nvec
      end do ! do while
```
Additional information:
- The build is always Release.
- The number of threads, nthrea2, is between 1 and 16 (most often between 4 and 8).
- I will try /QxHost, but I would prefer not to use it, because this executable is to be used by others, too.
- I do not have a processor that uses AVX-512. All the processors that I use have AVX2.
- Yes, I agree that !DIR$ ASSUME_ALIGNED does not enforce vectorization, I only use it because some of the vectors are declared in a module which in turn is used by my routine; I have the !dir$ attributes align:64 directive in the module, when declaring the various vectors.
- Something else I forgot to mention is that all the vectors in the code, apart from U11, U12, ..., U26, are allocated at runtime. Could this be a factor?
- I agree that the terms involving LDPMends() cause strided loads. Would it help if lines 7-30 of the above code were placed in a separate PARALLEL DO loop from the rest of the code? I am asking because the code below line 30 does not use LDPMends(), so it should not have strided loads...
---
LDPMends
While the indexing of LDPMends requires strided loads, the values retrieved are then used as indices into U, and the locations they point to are not contiguous (i.e., dispersed memory access). This section of code presents a double whammy for vectorization.
Try building the indices into a separate array (non-threaded loop):
```fortran
do J = 1,Nvec
   UIndex(J) = (LDPMends(1,iel+J)-1)*ndof ! add +1, +2, ... later
end do
```
Constructing the Unn tables will likely be memory-access bound (depending on the size of U). Unless your target CPU is a Xeon, you will likely have two memory channels, so building the Unn tables may be limited by memory (or L3) bandwidth. Experiment with building the Unn tables with and without parallelization, and when parallelized, limit to 2 threads.
The eps?(iel+J) = ... statements should now be able to be vectorized, since the arrays are now indexed sequentially.
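As an illustration, here is a minimal sketch (untested) of what the gather step could look like; the subroutine name `gather_u` is hypothetical, while `U`, `Uprev`, `aHHT`, and `UIndex` follow the names used in the posts above:

```fortran
! Hypothetical sketch: gather the nodal displacements once via the
! precomputed UIndex array, so that the subsequent eps loops touch
! only unit-stride arrays.
subroutine gather_u(Nvec, aHHT, UIndex, U, Uprev, U11, U12, U13)
   implicit none
   integer, intent(in)  :: Nvec
   integer, intent(in)  :: UIndex(*)      ! (LDPMends(1,iel+J)-1)*ndof
   real(8), intent(in)  :: aHHT, U(*), Uprev(*)
   real(8), intent(out) :: U11(*), U12(*), U13(*)
   integer :: J
   do J = 1, Nvec
      U11(J) = (1.d0+aHHT)*U(UIndex(J)+1) - aHHT*Uprev(UIndex(J)+1)
      U12(J) = (1.d0+aHHT)*U(UIndex(J)+2) - aHHT*Uprev(UIndex(J)+2)
      U13(J) = (1.d0+aHHT)*U(UIndex(J)+3) - aHHT*Uprev(UIndex(J)+3)
   end do
   ! U14..U16 and the node-2 tables U21..U26 follow the same pattern
end subroutine gather_u
```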
Additional notes:
1) You can compile targeting AVX or AVX2 (/Qx...) and add an additional alternate path with /QaxAVX512.
2) CPUs (hw threads) have a limited number of registers. I am not concerned about the SIMD registers, but rather the integer registers that hold the base addresses of the rhs arrays. To reduce register pressure, you may need to compute each eps? in a separate loop.
Consider:
```fortran
!$omp parallel num_threads(3) private(J)
!$omp sections
!$omp section
do J=1,Nvec
   epsN(iel+J) = ...
end do
!$omp section
do J=1,Nvec
   epsM(iel+J) = ...
end do
!$omp section
do J=1,Nvec
   epsL(iel+J) = ...
end do
!$omp end sections
!$omp end parallel
```
Jim Dempsey
---
Use VTune, look at disassembly for vector usage.
!DIR$ ASSUME_ALIGNED does not enforce vectorization.
It means if the code is suitable for vectorization, the arrays listed are assumed to be aligned (you are required to have made that so).
The LDPMends(1,iel+J) access, as stated earlier, cannot be vector aligned (but can be gathered on AVX-512).
Add /QxHost for your system (this may not be advisable if your code will run on a CPU that does not have AVX-512).
Jim Dempsey
---
Thank you, I agree that VTune is required to look into this in more detail.
I have one more question (and I am embarrassed to ask, as it must be something obvious; I had previously used VTune without any issue).
For some reason, when I run a profiling job with VTune, it refuses to print the actual function/routine names when showing the results. For instance, it may show names such as "func@0x14000d210".
When building the executable for my code, I set the Debug option to Full for both the compiler and linker, I use the /Zi compiler option, and I also have /traceback. Finally, I load the executable from the folder where it is built. I also use the /O2 optimization level.
Is there any other reason/setting which could lead to VTune not listing the actual function/routine names in the results of the profiling job?
Thank you for all your help.
---
Your loops are all threaded; are you expecting them to be vectorised as well? Why not remove all the OpenMP DO directives, and either leave the vectorisation to the compiler or specify it with SIMD? Then use threading at a higher level. Threading such small amounts of work as you have will be slower than serial.
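For example, one of the inner loops could be written as follows (an untested sketch; the subroutine name `compute_epsN` is hypothetical, and for simplicity it assumes full-length arrays rather than the block-local U21..U23 of the original code):

```fortran
! Hypothetical sketch: let !$omp simd (or the auto-vectorizer)
! handle the inner loop, and apply OpenMP threading once, at a
! higher level, around calls to routines like this one.
subroutine compute_epsN(n, nx, ny, nz, U21, U22, U23, epsN)
   implicit none
   integer, intent(in)  :: n
   real(8), intent(in)  :: nx(n), ny(n), nz(n)
   real(8), intent(in)  :: U21(n), U22(n), U23(n)
   real(8), intent(out) :: epsN(n)
   integer :: J
   !$omp simd
   do J = 1, n
      epsN(J) = nx(J)*U21(J) + ny(J)*U22(J) + nz(J)*U23(J)
   end do
end subroutine compute_epsN
```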
---
@Ioannis_K_
As @jimdempseyatthecove said, VTune would be a good start to see if and what you are able to gain. @Andrew_Smith mentioned that the loop body you parallelize is too small; did you check how that scales with the number of threads?
IEL is constant in the J loop, so you might try moving the J loop to a separate subroutine (which will likely be inlined again) and passing all arrays with the IEL offset.
Also, if the array references are different in each while iteration, you might want to move the OpenMP parallelization into the while loop (not easy).
What is the leading dimension of LDPMends? Is the leading dimension known at compile time? Maybe it is worth transposing LDPMends.
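A minimal sketch (untested; the names `LDPMendsT` and `transpose_ends` are hypothetical) of such a transpose, which would make the accesses inside the J loop unit-stride:

```fortran
! Hypothetical sketch: build a transposed copy LDPMendsT(nval,2),
! so that LDPMendsT(iel+J,1) is a unit-stride access in the J loop,
! instead of the strided LDPMends(1,iel+J).
subroutine transpose_ends(nval, LDPMends, LDPMendsT)
   implicit none
   integer, intent(in)  :: nval
   integer, intent(in)  :: LDPMends(2, nval)
   integer, intent(out) :: LDPMendsT(nval, 2)
   integer :: i
   do i = 1, nval
      LDPMendsT(i, 1) = LDPMends(1, i)
      LDPMendsT(i, 2) = LDPMends(2, i)
   end do
end subroutine transpose_ends
```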
Last but not least, if this loopnest does not account for a significant percentage of your execution time, don't spend time on it.
---
Thank you and everyone else for the suggestions.
The loop bodies are not small. The actual arrays epsN(), epsM(), etc. may have tens of thousands, up to millions of elements. I sweep them little by little, because otherwise I would have to define additional arrays (e.g., U11(), U12(), etc.) with the full dimension, which would in turn increase memory requirements.
The "large array" dimension is NOT KNOWN at compile time. This is another reason why I want to work on it "little-by-little", as this allows me to have DO loops with known counter bounds.
The code does get faster when increasing the number of threads.
I also had a question. You mention:
"IEL is constant in the J loop, so you might want to try to move the j loop to a separate subroutine which will be inlined again but only pass all arrays with the IEL offset"
Do you mean that I should define a routine that operates on "local, small matrices" and simply pass the corresponding offset for the large matrices? Would this lead to a speedup or a slowdown? How would this help, if the routine is actually inlined in the code? Also, would creating/destroying local arrays on the stack of a separate routine help with speed, or slow things down?
---
>>The actual size of the arrays epsN(), epsM() etc. may include tens of thousands, up to millions of elements. I try to sweep it little by little
Then consider something like:
```fortran
!$omp parallel do
do iel=1,size(epsN),Nvec
   call doWork(epsN(iel:min(iel+Nvec-1,size(epsN))), &
               epsM(iel:min(iel+Nvec-1,size(epsM))), ...)
end do
!$omp end parallel do
...
contains
subroutine doWork(epsN, epsM, ...)
   implicit none
   real :: epsN(:), epsM(:), ... ! assumed-shape dummies
   integer :: I
   ...
   do i=1,size(epsN) ! extract size from actual argument
      ...
   end do
end subroutine doWork
end subroutine whatever
```
In doWork:
- do not have parallel regions.
- do not use (modify) the caller's variables that are intended to be private (pass them as arguments or make them local to doWork).
Experiment, as stated earlier, to reduce register pressure and/or improve cache utilization.
Adage: Parallel Outer - Vector Inner
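That structure could be sketched as follows (untested; the subroutine name `sweep` is hypothetical, and only one representative term of the eps computation is shown):

```fortran
! Hypothetical sketch of Parallel Outer - Vector Inner:
! thread over blocks of Nvec1 elements, vectorize within a block.
subroutine sweep(nval, Nvec1, nx, U21, epsN)
   implicit none
   integer, intent(in)  :: nval, Nvec1
   real(8), intent(in)  :: nx(nval), U21(nval)
   real(8), intent(out) :: epsN(nval)
   integer :: iel, J
   !$omp parallel do private(J)        ! parallel outer
   do iel = 0, nval-1, Nvec1
      !$omp simd                       ! vector inner
      do J = iel+1, min(iel+Nvec1, nval)
         epsN(J) = nx(J)*U21(J)        ! one representative term
      end do
   end do
   !$omp end parallel do
end subroutine sweep
```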
Jim Dempsey
---
I suggest you do this:
Write a simple PROGRAM that calls two external subroutines that do some simple computational work (say 5 seconds of work each).
Build that with same options, and run VTune on it.
If this exhibits the same behavior (missing subroutine names), then post the code here along with your compiler options, folder locations, current directory, version of the compiler, O/S name and version, and screenshot of VTune output showing lack of procedure names.
Also, any settings you had for VTune.
With this, perhaps we can reproduce the issue.
If the test shows the procedure names, then you should be able to note what is different between the two VTune runs.
Jim Dempsey
