Intel® Fortran Compiler

Intel Fortran in Visual Studio not multithreading

rudydiaz
New Contributor I
19,303 Views

I have a computer with two AMD EPYC Rome 7552 sockets, 48 cores per socket, for a total of 96 cores. Windows sees the 96 cores and says I have access to 192 threads. But when I run a FORTRAN code in Visual Studio windows 11, and then check the Resource Monitor, it only sees 96 cpus, not the 192 threads. PLUS, the code is running only as fast as it runs on another computer with 16 cores and 32 threads.

 

The manufacturer of the computer claims it must be a problem with the FORTRAN, since Windows sees all the available threads. Any idea what I can do?

1 Solution
rudydiaz
New Contributor I
17,643 Views

OK.

We might as well close this thread.

Thanks again for your help.

jimdempseyatthecove
Honored Contributor III
3,966 Views

>>Where do you see it calls it Visual Studio 11?

Your first post has: "But when I run a FORTRAN code in Visual Studio windows 11, ..."

Does your version of Visual Studio show the icons for VTune?

[screenshot: jimdempseyatthecove_0-1687901603017.png]

 

Jim Dempsey

rudydiaz
New Contributor I
3,963 Views

Doesn't mean I know how to use it correctly.

jimdempseyatthecove
Honored Contributor III
3,943 Views

VTune has a bit of a learning curve. The basic first step is the Hotspots analysis.

Outline of what to do:

 

*** Note: you may need to launch MS VS in Administrator mode.
Use a modified version of your Junetest (with OpenMP directives):

Program Junetest
    use omp_lib

    implicit none
    integer*4 i,inx,iny,inz,inmax,ipoints,icounting,ic,icend
    integer*4 j
    integer*4 k
    PARAMETER (INX=1000,INY=1000,INZ=1000,inmax=1000,ipoints=400000)
    character (10) t1,t2
    character (50) z

    real, ALLOCATABLE :: XD(:,:,:)
    real, ALLOCATABLE :: AD(:,:,:)
    ALLOCATE(XD(0:(INX+1),0:(INY+1),0:(INZ+1)))
    ALLOCATE(AD(0:(INX+1),0:(INY+1),0:(INZ+1)))
    XD(0:(INX+1),0:(INY+1),0:(INZ+1))=0
    AD(0:(INX+1),0:(INY+1),0:(INZ+1))=0
    icounting=0
    ic=0
    icend=0
    print *, 'how many steps '
    read *, icend
    call DATE_AND_TIME(TIME = t1, ZONE =z)
    print *, 'this is the time ',t1
    print *, 'will tell progress every 10 steps'
    DO IC=1,ICEND
        ICOUNTING=ICOUNTING+1
        IF(ICOUNTING.EQ.10) THEN
            ICOUNTING=0
            print *, ic, 'last =',AD(1000,1000,1000)
        end if
        !$omp parallel do private(i, j, k)
        do k=1,1000
            do j=1,1000
                do i=1,1000
                    XD(i,j,k)=sqrt(AD(i,j,k)+i+j+k)
                end do
            end do
        end do
        !$omp end parallel do

        !$omp parallel do private(i, j, k)
        do k=1,1000
            do j=1,1000
                do i=1,1000
                    AD(i,j,k)=sqrt(XD(i,j,k)+i+j+k)
                end do
            end do
        end do
        !$omp end parallel do

    end do ! iteration loop
    call DATE_AND_TIME(TIME = t2, ZONE =z)
    print *, 'start and end time ',t1,' ',t2

END program Junetest


Use Configuration Manager to add/select x64
Select Release Build
Right-Click on the project in the solution explorer
Properties | Fortran | Debugging | Full (/debug:full)  (VTune needs debug information if you want source references)
Apply
Properties | Fortran | Optimization | Maximum Speed (or whatever you use in your application)
| Fortran | Parallelization | No (disable auto-parallelization as we are using explicit parallelization)
Apply
Properties | Fortran | Language | Process OpenMP Directives | Generate Parallel Code (/Qopenmp)
Apply
Properties | Linker | Debugging | Generate Debug Information | Yes
Apply
Ok

Build
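(For reference, the project settings above correspond roughly to this ifort command line; the source file name is an assumption:)

ifort /debug:full /O2 /Qopenmp Junetest.f90 /link /DEBUG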

Assuming build worked without error...
Click on tool bar [VT]
(first time use will take a while to load)
Click on Configure Analysis
(it will issue /!\ Detecting Local Machine Configuration Please Wait)
Pull-Down Performance Snapshot
Click on Hotspots
Click on the Play button (right triangle)
Your program starts, wait for allocation and prompt, enter 1 (without too much delay)
VTune will report "Collecting Hotspot data" (wait for collection)
When done the summary tab will open
Click on Bottom-up tab
Double-Click on the top item in Function/Call Stack (this will show the source code of the hottest loop)
You can then click on the Assembly button and you will find that the K, J, I loops were fused .AND. vectorized.

 

Your CPU is an AMD CPU and does not have the Intel hardware performance-monitoring capabilities, so you will be limited to timer-based sampling.

*** If VTune failed to launch with MS VS in user mode, you will need to launch MS VS in Administrator mode:

Start | (*** Right-Click) Visual Studio | More | Run as Administrator.

Then open your solution and re-run VTune (this time use the VTune play-button icon)

 

After a few uses, you will get the hang of it.

 

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III
3,945 Views

EDIT last post:

In your program, add the following after the declarations of XD and AD:

!dir$ attributes align: 64 :: XD, AD
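In context, the directive sits with the declarations (a minimal sketch using the Junetest source above):

    real, ALLOCATABLE :: XD(:,:,:)
    real, ALLOCATABLE :: AD(:,:,:)
    !dir$ attributes align: 64 :: XD, AD   ! request 64-byte (cache-line) alignment of the allocations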

Jim Dempsey

 

rudydiaz
New Contributor I
3,916 Views

Jim,

This is very helpful. I followed you until you said that the k,j,i loop is the issue. As I understood it, you just wanted me to tell the program to run one step, which I did. I took screenshots as I went through the VTune output. This is my summary:

[screenshot: rudydiaz_0-1687984998778.png]

This is the rest of the summary:

[screenshot: rudydiaz_1-1687985032155.png]

Now I go to bottom-up:

[screenshot: rudydiaz_2-1687985062905.png]

Now, after I double-click on the top item, which is a kmp fork barrier:

[screenshot: rudydiaz_3-1687985165671.png]

It says it cannot find a source file. I click on Assembly:

[screenshot: rudydiaz_4-1687985197738.png]

And now I cannot tell what I am looking at. A note in it said to check threading. Here is the threading summary:

[screenshot: rudydiaz_5-1687985261015.png]

and the bottom of the threading summary:

[screenshot: rudydiaz_6-1687985294863.png]

Now I click on threading bottom-up and it tells me the main culprit is a manual reset event:

[screenshot: rudydiaz_7-1687985364533.png]

I don't think I am seeing in my computer what you see in yours.

 

 

Barbara_P_Intel
Employee
3,929 Views

Here's a link to a Getting Started Guide for VTune. That may help! Click on the Windows version.

 

jimdempseyatthecove
Honored Contributor III
3,904 Views

The omission of the Source view is due to not generating debug information at build time and/or not keeping debug information at link time (IOW, you missed one or both of these steps in my instructions):

...
Properties | Fortran | Debugging | Full (/debug:full)  (VTune needs debug information if you want source references)
Apply

...

Properties | Linker | Debugging | Generate Debug Information | Yes
Apply

 

The time in kmp_fork_barrier seems excessive. Is something else running on the system?

 

Jim Dempsey

rudydiaz
New Contributor I
3,897 Views

Jim,

I know I turned those debugging options on.

One thing that confused me at first was that I was used to setting the project properties from the Visual Studio toolbar at the top, but those are different from the properties I accessed following your instructions. I tried to make them all match your instructions.

I will try again. 

As to whether something else is running in the system, the answer is "there shouldn't be."

But that is hard to tell from Task Manager because it shows so much going on. However, when I have looked, of all the processes running, the only one consuming more than 80% of CPU time is the Fortran program while it is running.

rudydiaz
New Contributor I
3,882 Views

Jim,

I started fresh with a new project. Performed all the steps in your instructions.

The results are the same. I compared the summary, bottom-up, etc. to the snapshots I sent last time. All the comments are the same. There was a slight difference in the elapsed time in the summary. Here they are side by side:

[screenshot: rudydiaz_3-1688053246707.png]

Still, the first item in the bottom-up is that kmp_fork_barrier.

 

Now, as Hotspots is about to display the results, it flashes a screen that I could not capture with Print Screen, so I took a couple of snapshots with my phone:

[screenshot: rudydiaz_2-1688052924687.png]

 

and then a moment later

[screenshot: rudydiaz_1-1688052890322.png]

Do these give a clue as to what is going on?

 

jimdempseyatthecove
Honored Contributor III
3,875 Views

Maybe some insight can be gleaned by reducing the thread count.

Click on the VTune right triangle (Configure Analysis)

Pull down the Advanced control.

In there you can set user defined environment variables. Define and set

OMP_NUM_THREADS      4

(note, name and value are in different fields).

You may want to experiment with different values (start with 4, 8, 16, ...)
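(Equivalently, outside of VTune, the thread count can be requested in code; a minimal standalone sketch, not your program:)

! omp_set_num_threads overrides OMP_NUM_THREADS for subsequent parallel regions
program set_threads_demo
    use omp_lib
    implicit none
    call omp_set_num_threads(4)   ! try 4, then 8, 16, ...
    !$omp parallel
    if (omp_get_thread_num() == 0) print *, 'running with', omp_get_num_threads(), 'threads'
    !$omp end parallel
end program set_threads_demo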

Sometimes the Resolving Information... step takes quite a while to complete.

Your runtime isn't so long, but the number of threads you are using may be slowing things down.

I haven't fired up my KNL system in a while (64 cores, 256 hardware threads) so I cannot test with as many threads.

I had no issues with earlier versions of VTune and 256 threads.

I haven't tested with an AMD CPU.

Jim Dempsey

 

 

rudydiaz
New Contributor I
3,864 Views

I modified my program so that the two loops to be parallelized would be better representations of the typical loop the actual application uses. There are two matrices holding values; in each cycle of the outer (time step) loop, the first 3D matrix is updated according to values in the second, and once that is done, the second is updated according to values of the first. The two loops in question are then like this:
...

[screenshot: rudydiaz_0-1688074706488.png]

...

Running VTune and successively increasing the number of threads as you suggested gives these results:

[screenshot: rudydiaz_1-1688074771911.png]

 

Below 128 threads, the bottom-up tab shows one of the parallel loops as the top time consumer, but the fork barrier rises in the results.
Above 128 threads, the fork barrier is the top time consumer.

Is there an easy explanation of what this means?
Given that I have 1 billion elements per matrix to be addressed, and that the elements of a given matrix are independent of each other (that is, the method is explicit, although neighboring elements of a matrix may address (read) the same element in the other matrix), I would have assumed the OMP parallelization would simply divide the 3D space to be looped among as many threads as available.

This must be too naive to be right. But this kind of method is known for being embarrassingly parallel.

Do you have any suggestion what I should do at this point?

jimdempseyatthecove
Honored Contributor III
3,858 Views

In your (shown) code, each k,j,i loop has temporary variables (kp, km, jp, jm, ip, im) that need to be declared private.

However....

As structured, with the kp=k+1, km=k-1, the compiler optimization might not collapse the k and j loops into a single loop.

If the collapse does not occur, then the parallelization will split the k=1,1000 loop into 128 chunks, where 104 threads get 8 iterations and 24 threads get 7 iterations. IOW, the workload will not be balanced (24 threads will be waiting while 104 threads complete the additional iteration).

Collapsing the three nested loops into two or one loop will provide better balance amongst the threads.

 

To aid the compiler optimization in collapsing two, or possibly all three, loops (k,j,i), remove the temporary variables and explicitly use k+1, k-1, j+1, j-1, i+1, i-1 in the indexing.

*** Note, while this may seem to perform unnecessary additional addition/subtraction, the compiler optimizer will (should) generate offsets from the bases of AD and XD for the +/- 1's as opposed to performing the addition/subtraction.
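(A minimal sketch of the restructured loop — the stencil body here is illustrative, borrowed from the style of the code discussed later in this thread, not your actual application:)

!$omp parallel do collapse(2) private(i, j, k)
do k=1,1000
    do j=1,1000
        do i=1,1000
            ! explicit +/-1 indexing, no kp/km/jp/jm/ip/im temporaries;
            ! collapse(2) spreads the 10^6 (k,j) pairs across all threads
            XD(i,j,k)=XD(i,j,k)+(AD(i+1,j,k)-AD(i-1,j,k)-AD(i,j+1,k)+AD(i,j-1,k))/50.
        end do
    end do
end do
!$omp end parallel do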

 

Jim Dempsey

rudydiaz
New Contributor I
3,853 Views

Ok. I just did it and reran the case with 32 threads. The results in terms of time and average effective CPU utilization are the same as in the previous case. Same with 128 threads, except I hit the fork barrier at 128.

 

jimdempseyatthecove
Honored Contributor III
3,840 Views

It is odd that the elapsed time is relatively consistent.

Increase the count of the timed loop (the one above the do k=) such that any initialization is amortized.

Note, if this outer loop (before do k=) performs I/O, then add do l=1,10 around the two nested loops (the entire section of code shown three postings above).

What I suspect is that your code may be memory bandwidth limited in this section of code. 

In order to improve performance, you will need to do a little more coding work.

The strategy (for the above algorithm) is to construct 3D tiles, for example a 5x5x5 arrangement of 200x200x200-element tiles. This will produce 125 3D tiles.

!$omp parallel do private(k,j,i,kb,ke,jb,je,ib,ie)
do iTile = 0, 124
! use iTile to compute the begin and end values for the k, j, i loops
...
do k=kb,ke
  do j=jb,je
    do i=ib,ie
      ...
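(A hedged completion of the "..." above — one way to decode iTile, assuming the 5x5x5 arrangement of 200x200x200 tiles and the stencil from the postings above:)

!$omp parallel do private(k,j,i,kb,ke,jb,je,ib,ie)
do iTile = 0, 124
    ! decode iTile into its position on the 5x5x5 tile grid
    ib = 1 + modulo(iTile, 5) * 200
    jb = 1 + modulo(iTile / 5, 5) * 200
    kb = 1 + (iTile / 25) * 200
    ie = ib + 199
    je = jb + 199
    ke = kb + 199
    do k = kb, ke
        do j = jb, je
            do i = ib, ie
                XD(i,j,k) = XD(i,j,k) + (AD(i+1,j,k) - AD(i-1,j,k) - AD(i,j+1,k) + AD(i,j-1,k)) / 50.
            end do
        end do
    end do
end do
!$omp end parallel do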

Also, when you paste code examples, it helps a lot if you paste the text as opposed to an image (we can then copy your text and paste it into our response).

You might find this article of interest:

http://www.techenablement.com/plesiochronous-loosely-synchronous-phasing-barriers-to-avoid-thread-inefficiencies/

Jim Dempsey

rudydiaz
New Contributor I
3,826 Views

Jim,

If I understood what you are saying, you want me to compare two programs.

Program A, like the latest version I had before, is here below. (I have changed the quantity being evaluated in the loops to be more like the final application.)

== old version ==

      Program Junetest
      use omp_lib

      implicit none
      integer*4 i,inx,iny,inz,inmax,ipoints,icounting,ic,icend
      integer*4 j,im,ip,jm,jp,km,kp,II,JJ,KK
      integer*4 k
      PARAMETER (INX=1000,INY=1000,INZ=1000,inmax=1000,ipoints=400000)
      character (10) t1,t2
      character (50) z

      real, ALLOCATABLE :: XD(:,:,:)
      real, ALLOCATABLE :: AD(:,:,:)
!dir$ attributes align: 64::XD,AD
      ALLOCATE(XD(0:(INX+1),0:(INY+1),0:(INZ+1)))
      ALLOCATE(AD(0:(INX+1),0:(INY+1),0:(INZ+1)))

c     print *, 'how many steps '
c     read *, icend
      icend=100
      icounting=0
      call DATE_AND_TIME(TIME = t1, ZONE =z)
      print *, 'this is the time ',t1
      print *, 'will tell progress every 10 steps'

      AD(500,500,500)=1.

      DO IC=1,ICEND
        ICOUNTING=ICOUNTING+1
        IF(ICOUNTING.EQ.10) THEN
          ICOUNTING=0
          print *, ic, 'last =',AD(500,500,500)
        end if
!$omp parallel do PRIVATE(I,J,K)
        do k=1,1000
          do j=1,1000
            do i=1,1000
              XD(i,j,k)=XD(i,j,k)+(AD(i+1,j,k)-AD(i-1,j,k)-
     &                  AD(i,j+1,k)+AD(i,j-1,k))/50.
            end do
          end do
        end do
!$omp end parallel do

!$omp parallel do PRIVATE(I,J,K)
        do k=1,1000
          do j=1,1000
            do i=1,1000
              AD(i,j,k)=AD(i,j,k)-(XD(i+1,j,k)+XD(i-1,j,k)-
     &                  XD(i,j+1,k)+XD(i,j-1,k))/50.
            end do
          end do
        end do
!$omp end parallel do

      end do ! iteration loop
      call DATE_AND_TIME(TIME = t2, ZONE =z)
      print *, 'start and end time ',t1,' ',t2

      END program Junetest

== == to be compared with the new version below:


      Program Junetest
      use omp_lib

      implicit none
      integer*4 i,inx,iny,inz,inmax,ipoints,icounting,ic,icend
      integer*4 j,im,ip,jm,jp,km,kp,II,JJ,KK
      integer*4 k,IB,JB,KB,IE,JE,KE,ITILE
      PARAMETER (INX=1000,INY=1000,INZ=1000,inmax=1000,ipoints=400000)
      character (10) t1,t2
      character (50) z

      real, ALLOCATABLE :: XD(:,:,:)
      real, ALLOCATABLE :: AD(:,:,:)
!dir$ attributes align: 64::XD,AD
      ALLOCATE(XD(0:(INX+1),0:(INY+1),0:(INZ+1)))
      ALLOCATE(AD(0:(INX+1),0:(INY+1),0:(INZ+1)))

c     print *, 'how many steps '
c     read *, icend
      icend=100
      icounting=0
      call DATE_AND_TIME(TIME = t1, ZONE =z)
      print *, 'this is the time ',t1
      print *, 'will tell progress every 10 steps'

      AD(500,500,500)=1.

      DO IC=1,ICEND
        ICOUNTING=ICOUNTING+1
        IF(ICOUNTING.EQ.10) THEN
          ICOUNTING=0
          print *, ic, 'last =',AD(500,500,500)
        end if
!$omp parallel do PRIVATE(k,j,i,kb,ke,jb,je,ib,ie)
        DO ITILE=0,7
          IB=1+500*MODULO(ITILE,2)
          JB=1+500*MODULO(INT(ITILE/2),2)
          KB=1+500*MODULO(INT(ITILE/4),2)
          IE=IB+499
          JE=JB+499
          KE=KB+499
          do k=KB,KE
            do j=JB,JE
              do i=IB,IE
                XD(i,j,k)=XD(i,j,k)+(AD(i+1,j,k)-AD(i-1,j,k)-
     &                    AD(i,j+1,k)+AD(i,j-1,k))/50.
              end do
            end do
          end do
        END DO
!$omp end parallel do

!$omp parallel do PRIVATE(k,j,i,kb,ke,jb,je,ib,ie)
        DO ITILE=0,7
          IB=1+500*MODULO(ITILE,2)
          JB=1+500*MODULO(INT(ITILE/2),2)
          KB=1+500*MODULO(INT(ITILE/4),2)
          IE=IB+499
          JE=JB+499
          KE=KB+499
          do k=KB,KE
            do j=JB,JE
              do i=IB,IE
                AD(i,j,k)=AD(i,j,k)-(XD(i+1,j,k)+XD(i-1,j,k)-
     &                    XD(i,j+1,k)+XD(i,j-1,k))/50.
              end do
            end do
          end do
        END DO
!$omp end parallel do

      end do ! iteration loop
      call DATE_AND_TIME(TIME = t2, ZONE =z)
      print *, 'start and end time ',t1,' ',t2

      END program Junetest

===

Both programs are hardwired to run 100 steps without asking. So no I/O to amortize.

The second one divides the 3D space into 8 tiles. Yes, I know it isn't 125 tiles, but if tiling is going to make a difference, it should be noticeable.

The second one, the tiled one, takes longer than the first one.

With numthreads at 32, VTune gives an elapsed time of 36 seconds for the first but 50 seconds for the second. Furthermore, at 32 threads the first one's largest time consumer is one of the loops; for the second one it is that fork barrier.

 

And I am still at a loss why my old computer with 16 cpus takes only 1.33 times longer to run the same code. I mean, if I were running out of bandwidth on this brand-new one with 96 cpus, I would certainly expect to be worse off on the old one. Now, the old one is running Visual Studio 2019 on Windows 10 and doesn't have a working VTune (I may have to repair that program). However, its thread analyzer works, and it doesn't complain about anything except to tell me that some improvements could be made. It says that the loops are approximately 100% vectorized.

 

TobiasK
Moderator
3,724 Views

@rudydiaz 

In your last two examples the arrays AD and XD are not initialized and thus may contain random data.

As Jim mentioned, the loops here are memory bound, so it is essential to handle locality correctly.

Depending on your BIOS settings, you may have a lot of NUMA domains enabled in your system.

Be aware that a simple AD=0, XD=0 initializes the arrays on the master thread only; hence you lose all the memory bandwidth of the other NUMA domains, at least that is the case on Linux. (If you have NUMA balancing enabled and you run many more iterations, the memory pages might migrate to other NUMA domains.)
To make use of all NUMA domains, you have to also parallelize the array initialization using the same parallelization as the subsequent computations (e.g. also over the K loop).

Your second example uses the tile loop for parallelization; this loop has only 8 iterations, so you only have 8 threads busy. If you run with more than 8 threads, the extra threads will jump directly into the barrier at !$omp end parallel do. This is why you see such a huge time in the fork barrier.

If you really want to put some time into parallelization of your application, you may find some helpful analysis in some research papers, e.g.:

https://www.researchgate.net/publication/266856549_Multicore-Optimized_Wavefront_Diamond_Blocking_for_Optimizing_Stencil_Updates


So in short: this application is way too complicated to get significant performance improvement from auto-parallelization; you have to do it on your own.

Best
Tobias

 

rudydiaz
New Contributor I
3,714 Views

Thank you, Tobias. I will go to that link and study that.

I understand now why the auto-parallelization is not the way to go. This is why Mr. Dempsey has been patiently walking me through fixing my misconceptions using OMP directives.

jimdempseyatthecove
Honored Contributor III
3,713 Views

Here is a free-form reworked version of your 2nd version.

Note the comments about the missing k-1 and k+1 terms.

This is free-form (.f90)

Program Junetest
    use omp_lib

    implicit none
    integer*4 i,inx,iny,inz,inmax,ipoints,icounting,ic,icend
    integer*4 j,II,JJ,KK
    integer*4 k,IB,JB,KB,IE,JE,KE,ITILE
    PARAMETER (INX=1000,INY=1000,INZ=1000,inmax=1000,ipoints=400000)
    character (10) t1,t2
    character (50) z

    real, ALLOCATABLE :: XD(:,:,:)
    real, ALLOCATABLE :: AD(:,:,:)
    !dir$ attributes align: 64::XD,AD
    ALLOCATE(XD(0:(INX+1),0:(INY+1),0:(INZ+1)))
    ALLOCATE(AD(0:(INX+1),0:(INY+1),0:(INZ+1)))

    ! On a NUMA system, perform the 1st touch on the same NUMA node that will do the computation.
    ! Tile the two larger-stride indices by the number of threads (Y/Z plane);
    ! each thread will have a contiguous section of X's to work on.
    ! First touch the interior:
    !$omp parallel do collapse(2) PRIVATE(k,j,i)
    do k=1, INZ     ! not 0:INZ+1
        do j=1, INY ! not 0:INY+1
            do i=0, INX+1
                XD(i,j,k)=0.0
                AD(i,j,k)=0.0
            end do
        end do
    end do
    ! next, first touch the perimeter planes (AD needs the same first touch as XD)
    !$omp parallel do collapse(2) PRIVATE(k,j,i)
    do k=1, INZ     ! not 0:INZ+1
        do j=1, INY ! not 0:INY+1
            do i=0, INX+1
                if(k==1) XD(i,j,k-1)=0.0
                if(k==1) AD(i,j,k-1)=0.0
                if(j==1) XD(i,j-1,k)=0.0
                if(j==1) AD(i,j-1,k)=0.0
                if(k==INZ) XD(i,j,k+1)=0.0
                if(k==INZ) AD(i,j,k+1)=0.0
                if(j==INY) XD(i,j+1,k)=0.0
                if(j==INY) AD(i,j+1,k)=0.0
                ! edges where two perimeter planes meet
                if(k==1 .and. j==1) XD(i,j-1,k-1)=0.0
                if(k==1 .and. j==1) AD(i,j-1,k-1)=0.0
                if(k==1 .and. j==INY) XD(i,j+1,k-1)=0.0
                if(k==1 .and. j==INY) AD(i,j+1,k-1)=0.0
                if(k==INZ .and. j==1) XD(i,j-1,k+1)=0.0
                if(k==INZ .and. j==1) AD(i,j-1,k+1)=0.0
                if(k==INZ .and. j==INY) XD(i,j+1,k+1)=0.0
                if(k==INZ .and. j==INY) AD(i,j+1,k+1)=0.0
            end do
        end do
    end do



    ! print *, 'how many steps '
    ! read *, icend
    icend=100
    icounting=0
    call DATE_AND_TIME(TIME = t1, ZONE =z)
    print *, 'this is the time ',t1
    print *, 'will tell progress every 10 steps'

    AD(500,500,500)=1.

    DO IC=1,ICEND
        ICOUNTING=ICOUNTING+1
        IF(ICOUNTING.EQ.10) THEN
            ICOUNTING=0
            print *, ic, 'last =',AD(500,500,500)
        end if
        
        !$omp parallel do collapse(2) PRIVATE(k,j,i)
        do k = 1, INZ
            do j = 1, INY
                do i = 1, INX
                    ! ?? missing K-1 and k+1
                    XD(i,j,k)=XD(i,j,k)+(AD(i+1,j,k)-AD(i-1,j,k)-AD(i,j+1,k)+ &
                    & AD(i,j-1,k))/50.
                end do
            end do
        end do
        !$omp end parallel do

        !$omp parallel do collapse(2) PRIVATE(k,j,i)
        do k = 1, INZ
            do j = 1, INY
                do i = 1, INX
                        ! ?? missing K-1 and k+1
                    AD(i,j,k)=AD(i,j,k)-(XD(i+1,j,k)+XD(i-1,j,k)-XD(i,j+1,k)+ &
                    & XD(i,j-1,k))/50.
                end do
            end do
        end do
        !$omp end parallel do

    end do ! iteration loop
    call DATE_AND_TIME(TIME = t2, ZONE =z)
    print *, 'start and end time ',t1,' ',t2

END program Junetest

Note the tiling strategy.

Jim Dempsey

rudydiaz
New Contributor I
3,703 Views

Jim,

Thank you again. I will try this immediately.

But I feel really stupid. I must have completely misunderstood what you meant by tiling strategy. 

When you say at the end "Note the tiling strategy," are you referring to the two loops you added just after the ALLOCATE commands, where you initialize the variables to zero? I think I understand now, based also on Tobias' reply that "a simple AD=0 XD=0 initializes the arrays on the master thread only." So you are certainly fixing that.

But would you mind confirming that those loops are also doing the tiling you are referring to? Or do I still need to do something like my 2nd program, except, say, using the modulo function to separate the space into 125 sub-spaces?

jimdempseyatthecove
Honored Contributor III
3,679 Views

Tiling does not necessarily mean cubic or square sections. The design goal is to partition the arrays such that each thread can process (as much as possible) contiguous sections of memory. Further, for NUMA-capable systems, try to code such that each thread is affinity-pinned, constrained to a single CPU and perhaps to one core. (L1 and L2 caches are per-core, L3 is typically per-CPU, and each CPU is a NUMA node on NUMA-configured systems.)
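For example, with the Intel OpenMP runtime, pinning can be requested through environment variables before the run (a sketch; the best settings depend on the system):

set OMP_PLACES=cores
set OMP_PROC_BIND=spread
rem or, using the Intel-runtime-specific control:
set KMP_AFFINITY=granularity=fine,compact,1,0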

 

On multi-socket systems, the BIOS can configure the memory system in at least two ways:

a) NUMA, where each CPU's directly connected RAM is isolated to its own region of physical address space. Each CPU has shorter latencies accessing its own RAM and longer latencies accessing RAM attached to the other CPU(s).

b) UMA, aka non-NUMA, aka interleaved, where each sequential cache-line-sized block (or 2x this size) is located on a different CPU's memory system.

 

The optimal choice of configuration will depend on the nature of the application(s) run on the system.

When running many small applications, where each process can run within a single CPU's connected memory, the NUMA configuration may be best (you may need/want to bind each such process to its startup CPU).

When running a large application, larger than one CPU's connected memory, and IIF (if and only if) that application is .NOT. NUMA-aware, then you might see benefit in selecting the UMA configuration. IIF the application is NUMA-aware, then configure for NUMA.

 

Note, the virtual memory granularity is the CPU's page size. For large arrays, in NUMA-aware applications, consider using an alignment equal to the page size (usually 4KB).
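(A minimal sketch, reusing the directive shown earlier in this thread but with page-size alignment; 4096 assumes the typical 4KB page:)

    real, ALLOCATABLE :: XD(:,:,:)
    real, ALLOCATABLE :: AD(:,:,:)
    !dir$ attributes align: 4096 :: XD, AD   ! align allocations to the page size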

Or look at example 2 here.

 

Jim Dempsey
