Intel® Fortran Compiler

Difference in performance of a COARRAY Fortran example on two similar PCs

avinashs

I ran a simple COARRAY Fortran example on two Windows 10 machines using the same project files (and therefore exactly the same compiler settings). However, the output differs between the two machines, as shown below. Is this expected behavior? Any help will be appreciated, especially regarding settings that need to be changed.

Computer 1: Intel i7 4770K,    16 GB RAM, Cores = 4, Threads = 8, special order through an engineering software provider
Computer 2: Intel i7 6820HQ, 32 GB RAM, Cores = 4, Threads = 8, special order directly from Dell

The program code is:

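program main
  ! Test of Fortran 2008 coarrays: every image reports in.
  implicit none
  ! Image 1 announces how many images are running.
  if (this_image() == 1) then
     write(*,'(1x,a,1x,i0,1x,a)') 'Coarray Fortran program running with', num_images(), 'images'
  end if
  sync all   ! barrier: no image continues until all images have reached this point
  write(*,'(1x,a,1x,i0)') 'Hello from image', this_image()
  ! Image 1 waits for Enter so the console window stays open.
  if (this_image() == 1) read *
end program main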

The output on Computer 1 is as advertised in the tutorial:

         Coarray Fortran program running with 8 images
         Hello from image 1
         Hello from image 5
         Hello from image 2
         Hello from image 6
         Hello from image 3
         Hello from image 4
         Hello from image 7
         Hello from image 8

However, the output on Computer 2 is different, as seen below: the program's output appears 8 times, and each copy reports that only 1 image is in use.

         Coarray Fortran program running with 1 images
         Hello from image 1
         Coarray Fortran program running with 1 images
         Hello from image 1
         Coarray Fortran program running with 1 images
         Hello from image 1
         Coarray Fortran program running with 1 images
         Hello from image 1
         Coarray Fortran program running with 1 images
         Hello from image 1
         Coarray Fortran program running with 1 images
         Hello from image 1
         Coarray Fortran program running with 1 images
         Hello from image 1
         Coarray Fortran program running with 1 images
         Hello from image 1
         
I will add that Computer 2 in general runs slower than Computer 1 on all Fortran applications although it is a newer and potentially superior computer.
 

JAlexiou

I just want to note that to maximize parallel performance, and certainly if you are doing engineering simulations (FEA, CFD), you need to turn off hyperthreading so that your 4 cores have 4 threads. This is usually done in the BIOS.

As far as Fortran goes, I can reproduce output 1 on my machine. It seems that if coarrays are not turned on, the program produces output 2. So check your compile settings again.

 

 

 

jimdempseyatthecove
CPU        i7-4770K(Desktop)       i7-6820HQ(Mobile)
Clock      3.5GHz (Turbo 3.9GHz)   2.7GHz(Turbo 3.6GHz)
Mem Bandw  25.6GB/s                34.1 GB/s
Mem Type   DDR3-1333/1600          DDR4-2133,LPDDR3-1866,DDR3L-1600
Mem Ch     2                       2

For highly compute-bound code (in other words, not significantly memory-bound), expect Desktop/Mobile ~3.5/2.7 = 1.3, based on the base clocks.

For highly memory-bound streaming access, expect Desktop/Mobile ~25.6/34.1 = 0.75 (Mobile/Desktop ~1.33).

For somewhat random memory access, the comparison may favor the Desktop.

Both systems have the same instruction set extensions (AVX2 being the latest).
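
The rough estimates above can be reproduced with a few lines of Fortran. This is only a sketch of the arithmetic: the constants are the base clocks and peak bandwidths from the table, and the program and parameter names are made up for illustration.

```fortran
program clock_and_bandwidth_ratios
  ! Reproduces the rough Desktop/Mobile estimates from the spec table.
  implicit none
  real, parameter :: desktop_ghz = 3.5,  mobile_ghz = 2.7    ! base clocks (GHz)
  real, parameter :: desktop_gbs = 25.6, mobile_gbs = 34.1   ! peak memory bandwidth (GB/s)

  ! Compute-bound estimate: ratio of base clocks.
  write(*,'(1x,a,f5.2)') 'Compute-bound, Desktop/Mobile: ', desktop_ghz / mobile_ghz
  ! Streaming memory-bound estimate: ratio of peak bandwidths, both ways round.
  write(*,'(1x,a,f5.2)') 'Memory-bound,  Desktop/Mobile: ', desktop_gbs / mobile_gbs
  write(*,'(1x,a,f5.2)') 'Memory-bound,  Mobile/Desktop: ', mobile_gbs / desktop_gbs
end program clock_and_bandwidth_ratios
```

The three ratios come out at about 1.30, 0.75 and 1.33. They ignore Turbo, cache behavior and thermal limits, so they are only a first-order guide.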

Jim Dempsey

jimdempseyatthecove

It should also be noted that Mobile configurations generally do not have cooling as good as Desktop configurations (in other words, active programs spend less time at Turbo).

Highly compute-intensive multithreaded programs tend to run at base frequency.

Jim Dempsey

jimdempseyatthecove

You can experiment with Hyper-Threading. Typical parallel compute applications may see a ~15% improvement. I think the main objection to using HT comes from looking at the scaling factor alone rather than at the total throughput of the application.
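
To make the distinction between scaling factor and total throughput concrete, here is a small sketch of the arithmetic. It uses the ~15% figure from above and assumes, purely for illustration, that 4 threads on 4 cores scale perfectly; the program and variable names are hypothetical.

```fortran
program ht_throughput_vs_scaling
  ! Sketch: HT can look poor on per-thread scaling yet still raise total throughput.
  ! Assumption (not from the thread): 4 threads on 4 cores scale perfectly.
  implicit none
  integer, parameter :: cores = 4, ht_threads = 8
  real, parameter :: ht_gain = 0.15     ! ~15% improvement quoted in the post
  real :: speedup_no_ht, speedup_ht

  speedup_no_ht = real(cores)                       ! 4 threads: speedup over one thread
  speedup_ht    = speedup_no_ht * (1.0 + ht_gain)   ! 8 threads: 15% more total work done

  write(*,'(1x,a,f4.2)') 'Speedup per thread without HT:   ', speedup_no_ht / cores
  write(*,'(1x,a,f4.2)') 'Speedup per thread with HT:      ', speedup_ht / ht_threads
  write(*,'(1x,a,f4.2)') 'Total throughput ratio HT/no HT: ', speedup_ht / speedup_no_ht
end program ht_throughput_vs_scaling
```

Under these assumptions the per-thread figure drops from 1.0 to a little under 0.6, which looks like poor scaling, while the total throughput is still 1.15 times higher with HT enabled.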

Jim Dempsey

Lorri_M_Intel

How did you build and run your programs?

avinashs

Lorri Menard (Intel) wrote:

How did you build and run your programs?

The programs were built with Intel(R) Visual Fortran Compiler 19.0.1.144 [IA-32] and Microsoft Visual Studio (MSVC\14.15.26726) on both machines. Effectively, the project was first built on computer 1 and then the project file was simply copied to computer 2, which maintains the same relative path to the project files.

The only real setting other than the default was Configuration Properties -> Fortran -> Language -> Enable Coarrays -> For Shared Memory (/Qcoarray:shared).

avinashs

John Alexiou wrote:

I just want to note that to maximize parallel performance, and certainly if you are doing engineering simulations (FEA, CFD), you need to turn off hyperthreading so that your 4 cores have 4 threads. This is usually done in the BIOS.

As far as Fortran goes, I can reproduce output 1 on my machine. It seems that if coarrays are not turned on, the program produces output 2. So check your compile settings again.

 

 

 

Thanks for your comment. I am using CFD. Computer #2 through Dell was recommended by ANSYS. The Dell Precision Optimizer has a settings profile that is tailored for ANSYS. The IVF is used to develop other standalone in-house codes that we use. Since Computer #2 is not as "fast" as #1 on most IVF simulations, I am doing some debugging tests to understand why.

avinashs

Note: If coarrays are turned off, i.e. Configuration Properties -> Fortran -> Language -> Enable Coarrays -> No, the output is

 Coarray Fortran program running with 1 images
 Hello from image 1

Again, this is as advertised in the tutorial. Therefore, on Computer 2 in coarray mode, the program is being started 8 times (matching the 8 hardware threads), and each copy runs as image 1 of 1. So the output of Computer 2 is not the result of coarrays being turned off (@John Alexiou).

Lorri_M_Intel

Are you running the application inside Visual Studio, or from a command line?

If it's from a command line, please issue this command, and then run your application again:

set FOR_COARRAY_DEBUG_STARTUP=TRUE

You'll see a line something like this:

      Generated MPI command line is 'mpiexec.exe -localonly -n 8 testme.exe '.

 

 

j0e

Is the MPI service running on computer 2?

Steve_Lionel

The Coarray "Hello World" sample is a terrible sample, as it doesn't use coarrays at all. In version 19 there is, in the same folder, a mcpi_coarray_final.f90 source that goes along with a tutorial. Unfortunately, parts of the tutorial were not included - I've asked the Intel folks to fix this. (I wrote this sample before I left Intel in 2016.) I've attached the source here if you want to try it.
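
For contrast with the Hello World sample, here is a minimal sketch of a program that does declare and reference a coarray. It is not one of the Intel samples and not the attached source; the program name coarray_sum and the variable names are made up for illustration.

```fortran
program coarray_sum
  ! Sketch: each image stores a value in its own copy of a coarray,
  ! then image 1 reads every image's copy and adds them up.
  implicit none
  integer :: partial[*]   ! scalar coarray: one copy per image
  integer :: i, total

  partial = this_image()  ! each image writes only its local copy
  sync all                ! all writes must complete before any remote read

  if (this_image() == 1) then
     total = 0
     do i = 1, num_images()
        total = total + partial[i]   ! remote read from image i
     end do
     write(*,'(1x,a,1x,i0,1x,a,1x,i0)') 'Sum over', num_images(), 'images is', total
  end if
end program coarray_sum
```

Built with /Qcoarray:shared and run as 8 images, this should report a sum of 36 (1 + 2 + ... + 8). If 8 independent single-image copies are started instead, each copy would report a sum of 1 over 1 images.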

That said, the results shown in the first post are very strange. I think Lorri is on the right track in asking for the debug info, as there are 8 copies of the program running as single images. I wonder if some non-Intel MPI is in PATH before the Intel MPI.
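
One way to look into the PATH theory is to list, in search order, every PATH directory that contains an mpiexec.exe. The sketch below does this in standard Fortran. It assumes Windows conventions (';' between PATH entries, '\' in file names) and that the launcher is called mpiexec.exe, as in the generated command line quoted earlier; the program name find_mpiexec is hypothetical.

```fortran
program find_mpiexec
  ! Sketch: print each PATH directory that contains mpiexec.exe, in PATH order.
  ! Assumes Windows: ';' separates PATH entries and '\' separates path components.
  implicit none
  character(len=:), allocatable :: path, dir
  integer :: path_len, stat, start, sep, hits
  logical :: found

  ! First call asks only for the length so the buffer can be sized exactly.
  call get_environment_variable('PATH', length=path_len, status=stat)
  if (stat /= 0 .or. path_len == 0) then
     print '(a)', 'PATH is not set or could not be read'
     stop
  end if
  allocate(character(len=path_len) :: path)
  call get_environment_variable('PATH', value=path)

  hits = 0
  start = 1
  do while (start <= path_len)
     sep = index(path(start:), ';')       ! position of next separator, 0 if none
     if (sep == 0) then
        dir = path(start:)                ! last entry
        start = path_len + 1
     else
        dir = path(start:start+sep-2)     ! text before the separator (may be empty)
        start = start + sep
     end if
     if (len_trim(dir) == 0) cycle        ! skip empty entries such as ';;'
     inquire(file=trim(dir)//'\mpiexec.exe', exist=found)
     if (found) then
        hits = hits + 1
        print '(i0,a,a)', hits, ': ', trim(dir)
     end if
  end do
  if (hits == 0) print '(a)', 'No mpiexec.exe found on PATH'
end program find_mpiexec
```

If more than one directory is listed, the first one is the likely candidate for what gets launched, though Windows may also look in other locations before PATH, so treat the list as a hint rather than proof.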

avinashs

Thanks everyone for the informative responses. As a first step, I switched the example to the Monte Carlo integration as suggested by @Steve Lionel (Ret.). Since the Windows shell closes automatically when the program ends, I added a PAUSE statement after the final print statement. I will note that a READ * statement did not work in coarray mode in this example.

The broad results are as follows:

I tried 4 configurations with a project set up in MSVSC 2017 (Debug x86, Debug x64, Release x86, Release x64) and 2 further builds directly from the command line shells for IA-32 and Intel 64.

Results for Computer 1:

1. Computer 1 was able to solve the MCI example in all cases, except that the x64 Debug and x64 Release builds did not work in MSVSC 2017 (the shell froze and had to be forcibly closed).

2. However, on Computer 1, the Intel 64 command line shell was able to run the case.

3. For clarity, the output from Computer 1 is posted below in a separate comment. From slowest to fastest, the order is x86 Debug, x86 Release, IA-32 command line, Intel 64 command line.

Results for Computer 2:

4. All the cases failed (either did not run or crashed) on Computer 2 except the Intel 64 shell, where the performance was similar to Computer 1 on Intel 64.

5. I also took the executables that successfully worked on Computer 1 and ran them on Computer 2. See results below.
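
For readers without the attachment, the general shape of a coarray Monte Carlo estimate of pi can be sketched as follows. This is not the mcpi_coarray_final.f90 sample: the program name, variable names and seeding scheme are assumptions made for illustration, and only the trial count is taken from the output reported in this thread.

```fortran
program mc_pi_sketch
  ! Sketch of a coarray Monte Carlo estimate of pi (not the Intel sample source).
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64), parameter :: num_trials = 1800000000_int64  ! count reported in the thread
  integer(int64) :: hits[*]            ! per-image hit count, readable by image 1
  integer(int64) :: my_trials, my_hits, k, total_hits
  integer :: i, n
  integer, allocatable :: seed(:)
  real(real64) :: x, y

  ! Give each image a different, arbitrary seed so the random streams differ.
  call random_seed(size=n)
  allocate(seed(n))
  seed = [(1000003*this_image() + 7919*i, i = 1, n)]
  call random_seed(put=seed)

  ! Split the work evenly across images; any remainder is ignored in this sketch.
  my_trials = num_trials / num_images()

  ! Count random points in the unit square that fall inside the quarter circle.
  my_hits = 0
  do k = 1, my_trials
     call random_number(x)
     call random_number(y)
     if (x*x + y*y <= 1.0_real64) my_hits = my_hits + 1
  end do
  hits = my_hits
  sync all                             ! every image's count is now available

  if (this_image() == 1) then
     total_hits = 0
     do i = 1, num_images()
        total_hits = total_hits + hits[i]
     end do
     write(*,'(1x,a,f10.7)') 'Estimated pi:', &
          4.0_real64 * real(total_hits, real64) / real(my_trials * num_images(), real64)
  end if
end program mc_pi_sketch
```

The point of the sketch is the division of work: with 8 images each image runs one eighth of the trials, whereas a copy that sees num_images() == 1 runs all of them by itself.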

avinashs
Results from Computer 1:

Computer 1 Results: x86 Debug
	Computing pi using 1800000000 trials across 8 images
	Computed value of pi is 3.1415677, Relative Error: .794E-05
	Elapsed time is 20.5 seconds
	Fortran Pause - Enter command<CR> or <CR> to continue.
			
Computer 1 Results: x86 Release
	Computing pi using 1800000000 trials across 8 images
	Computed value of pi is 3.1415817, Relative Error: .349E-05
	Elapsed time is 17.0 seconds
	Fortran Pause - Enter command<CR> or <CR> to continue.
	
Computer 1 Results: x64 Debug
	No results were produced - not even the first line "Computing pi using 1800000000 trials across 8 images"
	
Computer 1 Results: x64 Release
	No results were produced - not even the first line "Computing pi using 1800000000 trials across 8 images"
	
Computer 1 Results: ia32 command line
	Computing pi using 1800000000 trials across 8 images
	Computed value of pi is 3.1415979, Relative Error: .167E-05
	Elapsed time is 15.1 seconds
	Fortran Pause - Enter command<CR> or <CR> to continue.
	
Computer 1 Results: Intel 64 command line
	MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
	MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
	
	Computing pi using 1800000000 trials across 8 images
	Computed value of pi is 3.1416074, Relative Error: .470E-05
	Elapsed time is 10.5 seconds
	Fortran Pause - Enter command<CR> or <CR> to continue.

avinashs
Results from Computer 2: IA32 command line (note: only the final PAUSE statement was actually active)

Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computed value of pi is 3.1416665, Relative Error: .235E-04
Elapsed time is 133. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416268, Relative Error: .109E-04
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416263, Relative Error: .107E-04
Elapsed time is 135. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416074, Relative Error: .470E-05
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416522, Relative Error: .189E-04
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416160, Relative Error: .742E-05
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415499, Relative Error: .136E-04
Elapsed time is 135. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416143, Relative Error: .690E-05
Elapsed time is 135. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
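
The elapsed times above are roughly what one would expect if each of the 8 copies performs all of the trials alone, assuming the sample divides the total trials among images as the "across N images" message suggests. A quick sketch of that comparison follows; it uses the 15.1 second ia32 command line time from Computer 1 and a typical 134 second time from this run, and the names are hypothetical.

```fortran
program single_image_slowdown
  ! Sketch: compare the per-copy time here with the 8-image time on Computer 1.
  ! Both timings are taken from the outputs posted in this thread.
  implicit none
  integer, parameter :: images = 8
  real, parameter :: t_eight_images = 15.1    ! Computer 1, ia32 command line, 8 images
  real, parameter :: t_single_image = 134.0   ! Computer 2, IA32 command line, per copy

  ! A lone image does all the trials, i.e. 8 times the work of one image out of 8.
  write(*,'(1x,a,i0)')   'Work per copy relative to one of 8 images: ', images
  write(*,'(1x,a,f0.1)') 'Observed elapsed-time ratio: ', t_single_image / t_eight_images
end program single_image_slowdown
```

The observed ratio is roughly 8.9, close to the factor of 8 in work per copy. The comparison is across two different machines, so it is only indicative.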

 

avinashs
Results from Computer 2: Intel 64 command line

MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
Computing pi using 1800000000 trials across 8 images
Computed value of pi is 3.1415707, Relative Error: .698E-05
Elapsed time is 10.2 seconds
Fortran Pause - Enter command<CR> or <CR> to continue.

 

avinashs
Results from Computer 2: Using the x86 Debug executable from Computer 1
Note: towards the end, the shell froze and the computer fan went into overdrive.

Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images

Computed value of pi is 3.1415241, Relative Error: .218E-04
Elapsed time is 147. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416645, Relative Error: .229E-04
Elapsed time is 147. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416850, Relative Error: .294E-04
Elapsed time is 148. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415485, Relative Error: .140E-04
Elapsed time is 149. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415784, Relative Error: .455E-05
Elapsed time is 150. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416506, Relative Error: .184E-04
Elapsed time is 150. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415736, Relative Error: .605E-05
Elapsed time is 153. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415370, Relative Error: .177E-04
Elapsed time is 155. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.

[proxy:0:0@ComputerName] ..\windows\src\hydra_sock.c (379): write error (errno = 0)
[proxy:0:0@ComputerName] proxy_cb.c (256): error writing data
[proxy:0:0@ComputerName] ..\windows\src\hydra_demux.c (203): callback returned error
[proxy:0:0@ComputerName] proxy.c (989): error waiting for event

 

Steve_Lionel

Ah, a clue! Computer 2 has some other MPI installed that is being invoked. 

On computer 2, open a Command Prompt (not PowerShell) window. One way to do this is to click on the search icon in the lower left, type cmd, then when Command Prompt appears, click that.

In the window, type:

set path > c:\path.txt

Paste the contents of path.txt into a reply here.

avinashs

Attached is the path.txt file as requested by @Steve Lionel (Ret.).

 

Steve_Lionel

Hmm - well, so much for my theory... Now I have no idea. I think I had best leave this to Lorri, who is THE expert on this.

 

jimdempseyatthecove

>>Results from Computer 2: Using the x86 Debug

x86 is not a proper Platform. Use either Win32 or x64.

Jim Dempsey
