I ran a simple coarray Fortran example on two Windows 10 machines with the same project files (so exactly the same compiler settings). However, the output differs between the two machines, as shown below. Is this expected behavior? Any help will be appreciated, especially regarding settings that need to be changed.
Computer 1: Intel i7 4770K, 16 GB RAM, Cores = 4, Threads = 8, special order through an engineering software provider
Computer 2: Intel i7 6820HQ, 32 GB RAM, Cores = 4, Threads = 8, special order directly from Dell
The program code is:
   program main   ! Test COARRAY Fortran 2008
      if (this_image() == 1) then
         write(*,'(1x,a,1x,i0,1x,a)') 'Coarray Fortran program running with', num_images(), 'images'
      end if
      sync all
      write(*,'(1x,a,1x,i0)') 'Hello from image', this_image()
      if (this_image() == 1) read *
   1  continue
   end program main
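For reference, an equivalent command-line build (assuming the source file is named main.f90, which is my assumption, not stated above) would be something like:

   ifort /Qcoarray:shared main.f90

With /Qcoarray:shared, the shared-memory coarray runtime picks the number of images itself; on Computer 1 it started 8 images, matching the 8 hardware threads.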
The output on Computer 1 is as advertised in the tutorial:
Coarray Fortran program running with 8 images
Hello from image 1
Hello from image 5
Hello from image 2
Hello from image 6
Hello from image 3
Hello from image 4
Hello from image 7
Hello from image 8
However, the output on Computer 2 is different as seen below and reports that only 1 image is used.
Coarray Fortran program running with 1 images
Hello from image 1
Coarray Fortran program running with 1 images
Hello from image 1
Coarray Fortran program running with 1 images
Hello from image 1
Coarray Fortran program running with 1 images
Hello from image 1
Coarray Fortran program running with 1 images
Hello from image 1
Coarray Fortran program running with 1 images
Hello from image 1
Coarray Fortran program running with 1 images
Hello from image 1
Coarray Fortran program running with 1 images
Hello from image 1
I will add that Computer 2 in general runs slower than Computer 1 on all Fortran applications, although it is a newer and potentially superior computer.
I just want to note that to maximize parallel performance, and certainly if you are doing engineering simulations (FEA, CFD), you should turn off Hyper-Threading so that your 4 cores run 4 threads. This is usually done in the BIOS.
As far as Fortran goes, I can reproduce output 1 on my machine. It seems that if coarrays are not turned on, it produces output 2. So check your compile settings again.
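One quick way to confirm whether Hyper-Threading is currently enabled, from a Windows Command Prompt (using the stock wmic tool; just a suggestion, not part of the original advice):

   wmic cpu get NumberOfCores,NumberOfLogicalProcessors

If NumberOfLogicalProcessors is twice NumberOfCores (8 vs. 4 here), HT is on.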
CPU: i7-4770K (Desktop) vs. i7-6820HQ (Mobile)
Clock: 3.5 GHz (Turbo 3.9 GHz) vs. 2.7 GHz (Turbo 3.6 GHz)
Memory bandwidth: 25.6 GB/s vs. 34.1 GB/s
Memory type: DDR3-1333/1600 vs. DDR4-2133, LPDDR3-1866, DDR3L-1600
Memory channels: 2 vs. 2
For highly compute-bound work (in other words, not significantly memory-access bound), expect a Desktop/Mobile performance ratio of about 3.5/2.7 = 1.3.
For highly memory-bound streaming access, Desktop/Mobile is about 25.6/34.1 = 0.75 (Mobile/Desktop about 1.33).
For somewhat random memory access, it may favor the Desktop.
Both systems have the same instruction set extensions (AVX2 being the latest).
Jim Dempsey
It should also be noted that mobile configurations generally do not have cooling as good as desktop configurations (in other words, less time at Turbo for active programs).
Highly compute-bound multi-threaded programs tend to run at the base frequency.
Jim Dempsey
You can experiment with Hyper-Threading. Typical parallel compute applications may see roughly a 15% improvement. I think the main objection to using HT comes from looking at the scaling factor alone rather than at the total throughput of the application.
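For example (illustrative numbers only, not measurements): if 4 cores without HT give 4.0x the single-core throughput and 8 HT threads give 4.6x (the ~15% above), the per-thread scaling factor drops from 1.0 to about 0.58, yet the application still finishes roughly 13% sooner.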
Jim Dempsey
How did you build and run your programs?
Lorri Menard (Intel) wrote: How did you build and run your programs?
The programs were built with Intel(R) Visual Fortran Compiler 19.0.1.144 [IA-32] and Microsoft Visual Studio (MSVC\14.15.26726) on both machines. Effectively, the project was first built on Computer 1 and then the project file was simply copied to Computer 2, maintaining the same relative paths to the project files.
The only setting changed from the defaults was Configuration Properties -> Fortran -> Language -> Enable Coarrays -> For Shared Memory (/Qcoarray:shared).
John Alexiou wrote: I just want to note that to maximize parallel performance, and certainly if you are doing engineering simulations (FEA, CFD), you should turn off Hyper-Threading so that your 4 cores run 4 threads. This is usually done in the BIOS.
As far as Fortran goes, I can reproduce output 1 on my machine. It seems that if coarrays are not turned on, it produces output 2. So check your compile settings again.
Thanks for your comment. I am using CFD. Computer #2, bought through Dell, was recommended by ANSYS, and the Dell Precision Optimizer has a settings profile tailored for ANSYS. IVF is used to develop other standalone in-house codes that we use. Since Computer #2 is not as "fast" as #1 on most IVF simulations, I am doing some debugging tests to understand why.
Note: if coarrays are turned off (i.e. Configuration Properties -> Fortran -> Language -> Enable Coarrays -> No), the output is
Coarray Fortran program running with 1 images
Hello from image 1
Again, this is as advertised in the tutorial. Therefore, on Computer 2 in coarray mode, the program appears to be launched 8 times (once per hardware thread), with each copy seeing only image 1. So the output on Computer 2 is not the result of coarrays being turned off (@John Alexiou).
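For what it's worth, a small startup guard like the sketch below (my own addition, with an assumed expected image count of 8) would make this failure mode obvious immediately instead of producing interleaved single-image output:

   program check_images
      implicit none
      integer, parameter :: expected_images = 8   ! assumed value for this 8-thread machine
      ! Abort early if the coarray runtime did not launch the expected number of images.
      if (num_images() /= expected_images) then
         write(*,'(1x,a,i0,a,i0)') 'Expected ', expected_images, ' images but got ', num_images()
         error stop 'coarray startup did not produce the expected image count'
      end if
      if (this_image() == 1) write(*,'(1x,a,i0,a)') 'Running with ', num_images(), ' images as expected'
   end program check_images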
Are you running the application inside Visual Studio, or from a command line?
If it's from a command line, please issue this command, and then run your application again:
set FOR_COARRAY_DEBUG_STARTUP=TRUE
You'll see a line something like this:
Generated MPI command line is 'mpiexec.exe -localonly -n 8 testme.exe '.
Is the MPI service running on computer 2?
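A quick way to check from a Command Prompt (the exact service/display name varies with the Intel MPI version, so treat the filter string as an assumption):

   sc query state= all | findstr /i hydra

If nothing shows up, the Hydra process manager service is probably not installed or not registered on that machine.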
The Coarray "Hello World" sample is a terrible sample, as it doesn't use coarrays at all. In version 19 there is, in the same folder, a mcpi_coarray_final.f90 source that goes along with a tutorial. Unfortunately, parts of the tutorial were not included - I've asked the Intel folks to fix this. (I wrote this sample before I left Intel in 2016.) I've attached the source here if you want to try it.
That said, the results shown in the first post are very strange. I think Lorri is on the right track in asking for the debug info, as there are 8 copies of the program running as single images. I wonder if some non-Intel MPI is in PATH before the Intel MPI.
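For anyone reading along without the compiler samples installed, here is a minimal sketch of the idea (this is not the actual mcpi_coarray_final.f90 source, just an illustration; the trial count and the seeding are placeholders): each image counts random points that land inside the quarter circle, and image 1 sums the counts through a coarray.

   program mcpi_sketch
      implicit none
      integer, parameter :: ik = selected_int_kind(15)
      integer(ik), parameter :: total_trials = 1800000000_ik   ! same total as the runs reported below
      integer(ik) :: my_trials, i
      integer(ik) :: hits[*]
      integer :: img
      real :: x, y
      real(kind(1.0d0)) :: pi_est

      my_trials = total_trials / num_images()
      hits = 0_ik
      call random_seed()   ! placeholder; a proper version would give each image a distinct seed
      do i = 1, my_trials
         call random_number(x)
         call random_number(y)
         if (x*x + y*y <= 1.0) hits = hits + 1_ik
      end do
      sync all
      if (this_image() == 1) then
         do img = 2, num_images()
            hits = hits + hits[img]   ! gather the counts from the other images
         end do
         pi_est = 4.0d0 * hits / (my_trials * num_images())
         write(*,'(1x,a,f9.7)') 'Computed value of pi is ', pi_est
      end if
   end program mcpi_sketch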
Thanks everyone for the informative responses. As a first step, I switched the example to the Monte Carlo integration as suggested by @Steve Lionel (Ret.). Since the Windows shell closes automatically when the program ends, I added a PAUSE statement after the final print statement. I will note that a READ * statement did not work in coarray mode in this example.
The broad results are as follows:
I tried 4 options with a project set up in Visual Studio 2017 (Debug x86, Debug x64, Release x86, Release x64) and a further 2 options directly from the command-line shells for IA-32 and Intel 64.
Results for Computer 1:
1. Computer 1 was able to solve the MCI example in all cases except that x64 Debug/Release did not work in Visual Studio 2017 (the shell froze and had to be forcibly closed).
2. However, with Computer 1, the Intel 64 shell was able to run the case.
3. For clarity, the output from Computer 1 is posted below in a separate comment. The performance ordering was x86 Debug < x86 Release < IA-32 command line < Intel 64 command line.
4. All the cases failed (either did not run or crashed) on Computer 2 except the Intel 64 shell, where the performance was similar to Computer 1 on Intel 64.
5. I also took the executables that successfully worked on Computer 1 and ran them on Computer 2. See results below.
Results from Computer 1:

Computer 1 Results: x86 Debug
Computing pi using 1800000000 trials across 8 images
Computed value of pi is 3.1415677, Relative Error: .794E-05
Elapsed time is 20.5 seconds
Fortran Pause - Enter command<CR> or <CR> to continue.

Computer 1 Results: x86 Release
Computing pi using 1800000000 trials across 8 images
Computed value of pi is 3.1415817, Relative Error: .349E-05
Elapsed time is 17.0 seconds
Fortran Pause - Enter command<CR> or <CR> to continue.

Computer 1 Results: x64 Debug
No results were produced - not even the first line "Computing pi using 1800000000 trials across 8 images"

Computer 1 Results: x64 Release
No results were produced - not even the first line "Computing pi using 1800000000 trials across 8 images"

Computer 1 Results: ia32 command line
Computing pi using 1800000000 trials across 8 images
Computed value of pi is 3.1415979, Relative Error: .167E-05
Elapsed time is 15.1 seconds
Fortran Pause - Enter command<CR> or <CR> to continue.

Computer 1 Results: Intel 64 command line
MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
Computing pi using 1800000000 trials across 8 images
Computed value of pi is 3.1416074, Relative Error: .470E-05
Elapsed time is 10.5 seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Results from Computer 2: IA-32 command line (note: only the final PAUSE statement was actually active)
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computed value of pi is 3.1416665, Relative Error: .235E-04
Elapsed time is 133. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416268, Relative Error: .109E-04
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416263, Relative Error: .107E-04
Elapsed time is 135. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416074, Relative Error: .470E-05
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416522, Relative Error: .189E-04
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416160, Relative Error: .742E-05
Elapsed time is 134. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415499, Relative Error: .136E-04
Elapsed time is 135. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416143, Relative Error: .690E-05
Elapsed time is 135. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Results from Computer 2: Intel 64 command line
MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
Computing pi using 1800000000 trials across 8 images
Computed value of pi is 3.1415707, Relative Error: .698E-05
Elapsed time is 10.2 seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Results from Computer 2: Using the x86 Debug executable from Computer 1
Note: towards the end, the shell froze and the computer fan went into overdrive.
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computing pi using 1800000000 trials across 1 images
Computed value of pi is 3.1415241, Relative Error: .218E-04
Elapsed time is 147. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416645, Relative Error: .229E-04
Elapsed time is 147. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416850, Relative Error: .294E-04
Elapsed time is 148. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415485, Relative Error: .140E-04
Elapsed time is 149. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415784, Relative Error: .455E-05
Elapsed time is 150. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1416506, Relative Error: .184E-04
Elapsed time is 150. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415736, Relative Error: .605E-05
Elapsed time is 153. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
Computed value of pi is 3.1415370, Relative Error: .177E-04
Elapsed time is 155. seconds
Fortran Pause - Enter command<CR> or <CR> to continue.
[proxy:0:0@ComputerName] ..\windows\src\hydra_sock.c (379): write error (errno = 0)
[proxy:0:0@ComputerName] proxy_cb.c (256): error writing data
[proxy:0:0@ComputerName] ..\windows\src\hydra_demux.c (203): callback returned error
[proxy:0:0@ComputerName] proxy.c (989): error waiting for event
Ah, a clue! Computer 2 has some other MPI installed that is being invoked.
On computer 2, open a Command Prompt (not PowerShell) window. One way to do this is to click on the search icon in the lower left, type cmd, then when Command Prompt appears, click that.
In the window, type:
set path > c:\path.txt
Paste the contents of path.txt into a reply here.
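A complementary check (assuming mpiexec.exe is the launcher involved, as in the generated command line shown earlier):

   where mpiexec

This lists every mpiexec.exe found on PATH in search order, so the first entry shows which MPI would actually be launched.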
Hmm - well, so much for my theory... Now I have no idea. I think I had best leave this to Lorri, who is THE expert on this.
>>Results from Computer 2: Using the x86 Debug
x86 is not a proper Platform. Use either Win32 or x64.
Jim Dempsey