Intel® Fortran Compiler

Intel Fortran in Visual Studio not multithreading

rudydiaz
New Contributor I

I have a computer with dual-socket AMD EPYC Rome 7552 processors, with 48 cores per socket for a total of 96 cores. Windows sees the 96 cores and says I have access to 192 threads. But when I run a Fortran code in Visual Studio on Windows 11 and then check the Resource Monitor, it only shows 96 CPUs, not the 192 threads. Plus, the code runs only about as fast as it does on another computer with 16 cores and 32 threads.

 

The manufacturer of the computer claims it must be a problem with the Fortran, since Windows sees all the threads available. Any idea what I can do?

1 Solution
rudydiaz
New Contributor I

OK.

We might as well close this thread.

Thanks again for your help.


66 Replies
Steve_Lionel
Honored Contributor III

How are you expressing parallelism in your program?  It's an unusual program that can make optimal use of that many cores. How are you determining the number of threads being used in Task Manager?

rudydiaz
New Contributor I

Thank you for your reply.

I have been using Intel Parallel Fortran for many years, and I have continued to use it since it became available for free in Visual Studio. All my programs have the original !DEC parallel line before the loops to be parallelized.

But then, in the Visual Studio Project properties, I go to the fortran tab and in the General subtab I click on optimization "Maximize speed plus higher level optimizations [/O3]". I also allow multiprocessor compilations.

 

Then I go to the Optimization sub tab and under Parallelization, click "Yes [/Qparallel]". I set the threshold for parallelization at 90 and that for vectorization at 90. I also allow aggressive prefetch [/Qopt-prefetch=3].

I also set Debugging to "line numbers only" and turn off all run time checks except for Traceback information.

These are the settings I have used before in  my other computers (4 cpu, 8cpu, 16 cpu) and they have always worked.

 

When I start the program running, I open Task Manager and look at the details of the processes. All CPUs are always operating between 80 and 100%, but as I scroll down the tab on the right, the last CPU is 95. That is, it claims to see only 96 CPUs, not twice the number of cores as I see on my other computers. Yet the previous CPU screen of Task Manager says I have 96 cores and 192 logical processors.

 

I can tell it is running slowly because I start the same program on this computer and my other one at the same time. The program prints to the screen every time it has cycled through 10 steps. From what I see on the two screens, the 16-core / 32-thread computer is keeping pace with this 96-core one; the 96-core machine may be 5% faster.

 

The program is a number cruncher using multiple 3D loops but basically doing arithmetic. No matrix inversions or linear algebra.

I hope I have answered your questions.

 

jimdempseyatthecove
Honored Contributor III

Try this program:

program Console17
    use omp_lib
    implicit none
    !$omp parallel
    !$omp single
    print *, "max threads", omp_get_max_threads()
    print *, "num threads", omp_get_num_threads()
    !$omp end single
    print *, "Hello from", omp_get_thread_num()
    !$omp end parallel
end program Console17

Jim Dempsey

 

 

rudydiaz
New Contributor I

Jim,

Thank you. I replaced my code with the above in the same visual studio project which has all the previous settings.

It ran and returned

 

max threads        192

num threads            1

Hello from                0

jimdempseyatthecove
Honored Contributor III

Check your environment variables OMP_xxx, GOMP_xxx, KMP_xxx to see if one or more of them are restricting the thread count.

Jim Dempsey

 

rudydiaz
New Contributor I

Thank you. This will take me a bit. I am truly an L-user. So I first have to figure out how to check environment variables.
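One way to check these without leaving Fortran is the standard intrinsic get_environment_variable. The following is only a minimal sketch: the program name and the list of variable names are illustrative choices (a few common OMP_ and KMP_ settings), not a complete list of what the OpenMP runtime reads.

```fortran
program show_omp_env
    implicit none
    ! Illustrative selection of variable names (an assumption, not exhaustive)
    character(len=16), parameter :: names(6) = [character(len=16) :: &
        'OMP_NUM_THREADS', 'OMP_THREAD_LIMIT', 'OMP_DYNAMIC', &
        'OMP_PROC_BIND', 'KMP_AFFINITY', 'KMP_HW_SUBSET']
    character(len=256) :: val
    integer :: n, vlen, stat

    do n = 1, size(names)
        ! stat = 0: found, 1: variable does not exist, -1: value truncated
        call get_environment_variable(trim(names(n)), val, vlen, stat)
        if (stat == 1) then
            print '(a,a)', trim(names(n)), ' is not set'
        else if (stat == 0 .or. stat == -1) then
            print '(a,a,a)', trim(names(n)), ' = ', trim(val)
        else
            print '(a,a)', trim(names(n)), ' could not be queried'
        end if
    end do
end program show_omp_env
```

If every line reports "is not set", none of the listed variables is restricting the thread count. At a Windows command prompt, typing set OMP or set KMP lists any variables whose names start with that prefix.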

rudydiaz
New Contributor I

I just saw the question from John Nichols and realized my project settings did not have OpenMP enabled in the Language sub tab.

As soon as I set that, my output looks like yours

max threads   192

num threads   192

and the Hello message came back from all the 192 threads.
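The difference between the two runs comes from whether the compiler processes the !$omp lines or treats them as comments. A small sketch of how a program can report this itself, using the OpenMP conditional compilation sentinel "!$" (the program and variable names here are hypothetical):

```fortran
program openmp_enabled_check
    implicit none
    logical :: openmp_on

    openmp_on = .false.
    ! A line starting with the "!$" sentinel is compiled only when OpenMP
    ! processing is enabled; otherwise it is an ordinary comment.
    !$ openmp_on = .true.

    if (openmp_on) then
        print *, 'OpenMP directives are being processed'
    else
        print *, 'OpenMP directives are being ignored as comments'
    end if
end program openmp_enabled_check
```

Built with the OpenMP option off, the assignment on the "!$" line is skipped, which mirrors why the Hello program reported a single thread before the setting was changed.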

 

So, does this mean that my code would run fine using OpenMP directives but not when I just let it autoparallelize?

I need to answer that question myself. I wrote a simple code with two 3D loops that has the same problem as my large code (taking as long as on an old 32-thread computer). So, I am going to insert OpenMP directives... I hope... using your Hello code as guidance.
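As a rough illustration of what inserting OpenMP directives into two 3D loops could look like, here is a minimal sketch. The array names follow the ones used in this thread, but the reduced size n = 200 and the choice to parallelize only the outermost k loop are assumptions made for a quick trial, not a tuned solution. It needs the OpenMP option enabled in the Language sub tab.

```fortran
program omp_loop_sketch
    use omp_lib
    implicit none
    integer, parameter :: n = 200   ! reduced size for a quick trial (assumption)
    real, allocatable :: xd(:,:,:), ad(:,:,:)
    integer :: i, j, k

    allocate(xd(0:n+1, 0:n+1, 0:n+1), ad(0:n+1, 0:n+1, 0:n+1))
    xd = 0.0
    ad = 0.0

    ! Split the outermost loop across threads; each thread needs its own
    ! copy of the inner loop counters.
    !$omp parallel do private(i, j)
    do k = 1, n
        do j = 1, n
            do i = 1, n
                xd(i,j,k) = ad(i,j,k) + i + j + k
            end do
        end do
    end do
    !$omp end parallel do

    !$omp parallel do private(i, j)
    do k = 1, n
        do j = 1, n
            do i = 1, n
                ad(i,j,k) = xd(i,j,k) + i + j + k
            end do
        end do
    end do
    !$omp end parallel do

    print *, 'max threads', omp_get_max_threads()
    print *, 'last =', ad(n,n,n)
end program omp_loop_sketch
```

Keeping i as the innermost index preserves contiguous memory access and leaves the inner loop available for vectorization, while the k loop supplies the work that is divided among threads.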

 

Thank you.

jimdempseyatthecove
Honored Contributor III

Output on my system:

 max threads           8
 num threads           8
 Hello from           5
 Hello from           1
 Hello from           2
 Hello from           7
 Hello from           0
 Hello from           3
 Hello from           6
 Hello from           4

Jim Dempsey

JohnNichols
Valued Contributor III

@jimdempseyatthecove 

 

Where do I add the OpenMP option in VS to avoid this error? I do not use OpenMP, but I was trying your code.

 

Thanks 

John

[Attached screenshot: Capture7.PNG]

jimdempseyatthecove
Honored Contributor III

[Attached screenshot: jimdempseyatthecove_0-1687888316672.png]

Jim Dempsey

Steve_Lionel
Honored Contributor III

/Qparallel uses OpenMP underneath, but it is not guaranteed to be optimal. It is a quick way of getting some parallelism without diving into OpenMP directives. If making the best use of all the threads matters to you (and keep in mind that threads compete for a core's resources), you'll want to learn about OpenMP.

rudydiaz
New Contributor I

Steve,

I understand what you are saying. My program is not elegant at all.

The issue I am trying to resolve is that the same program in MS Visual Studio Fortran executes only a bit slower on an older computer that has only 16 cores and 32 threads than on this one with 96 cores and 192 threads. The improvement on the new computer should have been a factor of 6. I am getting only a factor of 1.33.

 

So, I am trying to figure out what has changed: Is it something with Windows 11 and the newest MS Visual Studio?

rudydiaz
New Contributor I

This is the small program I am testing on both computers.

Both Visual Studio installations have the same settings to autoparallelize.

The 16-core computer takes about 36 seconds, while the 96-core computer takes about 27.

 

program starts below >>>>>>>>

program Junetest
    use omp_lib

    implicit none
    integer*4 i, inx, iny, inz, inmax, ipoints, icounting, ic, icend
    integer*4 j
    integer*4 k
    parameter (inx=1000, iny=1000, inz=1000, inmax=1000, ipoints=400000)
    character (10) t1, t2
    character (50) z

    ! Two single-precision 3D arrays with one extra cell on each side
    real, allocatable :: XD(:,:,:)
    real, allocatable :: AD(:,:,:)
    allocate(XD(0:(inx+1), 0:(iny+1), 0:(inz+1)))
    allocate(AD(0:(inx+1), 0:(iny+1), 0:(inz+1)))
    XD(0:(inx+1), 0:(iny+1), 0:(inz+1)) = 0
    AD(0:(inx+1), 0:(iny+1), 0:(inz+1)) = 0
    icounting = 0
    ic = 0
    icend = 0
    print *, 'how many steps '
    read *, icend
    call date_and_time(time = t1, zone = z)
    print *, 'this is the time ', t1
    print *, 'will tell progress every 10 steps'

    do ic = 1, icend
        ! Report progress every 10 steps
        icounting = icounting + 1
        if (icounting .eq. 10) then
            icounting = 0
            print *, ic, 'last =', AD(1000,1000,1000)
        end if

        do k = 1, 1000
            do j = 1, 1000
                do i = 1, 1000
                    XD(i,j,k) = AD(i,j,k) + i + j + k
                end do
            end do
        end do

        do k = 1, 1000
            do j = 1, 1000
                do i = 1, 1000
                    AD(i,j,k) = XD(i,j,k) + i + j + k
                end do
            end do
        end do

    end do ! iteration loop
    call date_and_time(time = t2, zone = z)
    print *, 'start and end time ', t1, ' ', t2

end program Junetest
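For reference, the timings quoted above for this test (about 36 seconds and about 27 seconds) can be set against the ratio of core counts. This is plain arithmetic on the numbers in this post, with the core ratio used only as a naive upper bound:

```latex
\[
S_{\text{observed}} = \frac{36\ \text{s}}{27\ \text{s}} \approx 1.33,
\qquad
S_{\text{core ratio}} = \frac{96}{16} = 6,
\qquad
\frac{S_{\text{observed}}}{S_{\text{core ratio}}} \approx \frac{1.33}{6} \approx 0.22
\]
```

So the new machine delivers roughly 22% of what the core count alone would suggest for this test.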

jimdempseyatthecove
Honored Contributor III

In the loops of your test program, there is no computation to speak of.

Each iteration adds three integers, converts the sum to real, adds it to a value read from memory, then writes the result.

These loops should vectorize; however, due to the lack of computation, the test program will be limited by memory access.

With two CPUs, each with eight-channel memory (assuming all 16 sticks are installed), this program should saturate the memory subsystem at 16 threads (on separate cores). It may reach saturation at 8 threads.
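One way to see where such a loop stops scaling is to time it with an increasing number of threads. The sketch below is an illustration under stated assumptions, not a calibrated benchmark: the array size n = 400, the repeat count, and the program name are arbitrary choices, and the GB/s figure counts only one 4-byte read and one 4-byte write per element. It requires the OpenMP option to be enabled.

```fortran
program thread_scaling_sketch
    use omp_lib
    implicit none
    integer, parameter :: n = 400      ! assumed size, about 256 MB per array
    integer, parameter :: nreps = 10   ! assumed number of repeats per timing
    real, allocatable :: xd(:,:,:), ad(:,:,:)
    integer :: i, j, k, rep, nthreads, maxthreads
    double precision :: t0, t1, gbytes

    allocate(xd(n,n,n), ad(n,n,n))
    xd = 0.0
    ad = 0.0

    ! Record the limit once, before any thread count is changed
    maxthreads = omp_get_max_threads()
    ! Approximate data moved per timing run: one read plus one write per element
    gbytes = 2.0d0 * 4.0d0 * dble(n)**3 * nreps / 1.0d9

    nthreads = 1
    do while (nthreads <= maxthreads)
        t0 = omp_get_wtime()
        do rep = 1, nreps
            !$omp parallel do num_threads(nthreads) private(i, j)
            do k = 1, n
                do j = 1, n
                    do i = 1, n
                        xd(i,j,k) = ad(i,j,k) + i + j + k
                    end do
                end do
            end do
            !$omp end parallel do
        end do
        t1 = omp_get_wtime()
        print '(a,i4,a,f8.3,a,f8.1,a)', 'threads =', nthreads, &
            '  time =', t1 - t0, ' s  approx ', gbytes / (t1 - t0), ' GB/s'
        nthreads = nthreads * 2
    end do

    ! Use the result so the loop cannot be discarded as dead code
    print *, 'check value:', xd(n,n,n)
end program thread_scaling_sketch
```

If the time stops dropping once the thread count passes some point, the loop is limited by memory rather than by arithmetic at that point, which is the behavior described above.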

 

The code generation might benefit from declaring the arrays to be aligned.

What might be a better test is to introduce additional computation such as

...

XD(i,j,k)=sqrt(AD(i,j,k)+i+j+k)

...

AD(i,j,k)=sqrt(XD(i,j,k)+i+j+k)

...

Though that might not add enough processing time between memory reads and writes.

Jim Dempsey

rudydiaz
New Contributor I

Are you telling me that the code runs slow because the program's arithmetic operations are too simple? 

While the program is running, Task Manager, in the details of the resource monitor, shows 96 cpus (with the annotation (Node1)). In the front summary of the performance tab it says 192 threads are running. In the real time processes list, it indeed shows 192 threads consuming over 80% of cpu while the program is running.

I don't understand how this compares with your statement that it will "saturate" at 16 threads.

Besides, this is a "toy" version of the real program that showed me there was a problem. That one has many more matrices, much more computation going on inside the do loops, and the problem is exactly the same: The old computer with 16 cores is just slightly slower than this one, running Visual Studio Fortran.

jimdempseyatthecove
Honored Contributor III

Auto-parallelization is likely at fault. Set up a VTune session to see what is happening.

Also note that more threads means finer granularity. When the granularity gets too small, efficiency goes down. VTune will provide you with the information you need to improve the parallelism of your program.

Jim Dempsey

 

rudydiaz
New Contributor I

Hmmm...

I have never used VTune

But in the performance snapshot it says: "this analysis type is not applicable to the current machine microarchitecture"

My processor is a dual socket AMD Epyc Rome 7552.

Anyway, I ran the code, let VTune attach to the process and it created a result file. But I do not know how to interpret the Result file.

jimdempseyatthecove
Honored Contributor III

You will have to use timer based sampling as opposed to hardware based sampling.

I notice you have MS Visual Studio 11. I do not think the newest version of oneAPI integrates with a version of MS VS that old.

On the versions that integrate (mine is MS VS 2019, Version 16.10.3), you launch VTune from the IDE

This is a GUI interface and you do not see html text.

Jim Dempsey

 

rudydiaz
New Contributor I

The computer that is running this is using what it calls Visual Studio 2022 (although I downloaded the 2023.0.0 latest version.) Where do you see it calls it Visual Studio 11?

What I posted is a cut and paste from notepad looking at the file VTune produced (Console1.vtuneproj).

rudydiaz
New Contributor I

ok here is the result file from vtune

It knows I have an Epyc cpu and that I have 192 logical cpus from 48 cpu processors.

But I don't know what the rest of the file tells me about why it runs so slow.

 

file starts below>>>>>

<?xml version="1.0" encoding="UTF-8"?>
<root>
<rdmgr>
<timestamp type="u64_t">1687882736</timestamp>
<hostname>DESKTOP-M4VSB29</hostname>
<os>windows</os>
<product>Intel® VTune™ Profiler 2023.0.0</product>
<buildNumber type="s32_t">624757</buildNumber>
<logicalCPUCount type="s32_t">192</logicalCPUCount>
<physicalCoreCount type="s32_t">96</physicalCoreCount>
<processorPackageCount type="s32_t">2</processorPackageCount>
<CPUFrequency type="s64_t">2200000000</CPUFrequency>
<CPUFamily type="s32_t">23</CPUFamily>
<CPUModel type="s32_t">49</CPUModel>
<CPUStepping type="s32_t">0</CPUStepping>
<CPUBrandName>AMD EPYC 7552 48-Core Processor </CPUBrandName>
<isa>avx2</isa>
</rdmgr>
<vs_search_dirs>
<searchDirs>
<searchCategory>
<category type="u8_t">1</category>
<searchDirectory>
<name>C:\Users\User\Documents\FORTRAN 2023\Testing Fortran\Console1</name>
<recursive type="bool">false</recursive>
<priority type="bool">false</priority>
</searchDirectory>
</searchCategory>
<searchCategory>
<category type="u8_t">2</category>
<searchDirectory>
<name>C:\Users\User\Documents\FORTRAN 2023\Testing Fortran\Console1</name>
<recursive type="bool">true</recursive>
<priority type="bool">false</priority>
</searchDirectory>
</searchCategory>
<searchCategory>
<category type="u8_t">3</category>
<searchDirectory>
<name>C:\Users\User\Documents\FORTRAN 2023\Testing Fortran\Console1</name>
<recursive type="bool">true</recursive>
<priority type="bool">false</priority>
</searchDirectory>
</searchCategory>
<searchCategory>
<category type="u8_t">4</category>
</searchCategory>
</searchDirs>
</vs_search_dirs>
</root>
