Using parallelization

groupw_bench
I'm working with the latest Intel Fortran compiler and a legacy Fortran program written as a single-threaded application, and I got VTune for a trial. It showed, not surprisingly, that only one or two of the 8 logical CPUs (in a 4-core i7 machine) are being used. After turning on the compiler's parallelization feature and dropping the parallelization threshold to 25, VTune shows much better utilization of multiple cores (the average went from 2.35 to 6.39).

However, the analysis also shows a great deal of time spent in kmp_fork_call and NtDelayExecution, neither of which is called explicitly by the program. I haven't been able to find out much about what these are, why they're being called, or what's calling them. But I do know that execution time has increased by about 50%. Setting the parallelization threshold to anything other than 100 results in a performance hit, and setting it to 100 gives the same results as turning parallelization off.

Can I assume that this means there's no way to take advantage of the multiple processors except by reorganizing the program code for multiple thread operation -- which isn't practical? It's evident that the compiler's attempt at identifying and implementing multiple threads is doing more harm than good.

Please let me know if this would be more appropriately posted in the VTune, Fortran compiler, or some other sub-forum.

Thanks!
2 Replies
TimP
If you're willing to give more detail (or sample source code), a follow-up on a Fortran compiler forum would be appropriate (Windows or Linux/Mac, as the case may be).
A likely explanation is that the loops aren't long enough to benefit from so many threads, or that the application isn't suitable for HyperThreading. You could set OMP_NUM_THREADS to at most 4, and set KMP_AFFINITY=compact,1 (to spread the threads across separate cores).
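As a rough sketch of the settings suggested above, the two environment variables could be set in the shell that launches the program. This assumes a POSIX-style shell and the Intel OpenMP runtime; the executable name `legacy_app` is hypothetical and is left commented out.

```shell
# Limit the runtime to one thread per physical core (4 cores on the i7 described above).
export OMP_NUM_THREADS=4

# Intel OpenMP runtime setting quoted from the reply: place threads on separate cores.
export KMP_AFFINITY=compact,1

# Confirm what the program will inherit from this shell.
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS KMP_AFFINITY=$KMP_AFFINITY"

# ./legacy_app   # hypothetical executable name; run it from this same shell
```

In a Windows command prompt, `set OMP_NUM_THREADS=4` and `set KMP_AFFINITY=compact,1` take the place of the `export` lines. Comparing run times with and without these settings should indicate whether HyperThreading or thread placement is part of the slowdown.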
The openmp-profile report (at a reduced number of threads) might be useful to quote. It should show you parallel library performance data by parallel region.
If the program is structured so as to make OpenMP impractical, it may not be surprising that the compiler's attempt to do it automatically is disappointing.
SergeyKostrov
Quoting groupw_bench
...Can I assume that this means there's no way to take advantage of the multiple processors except by reorganizing the program code for multiple thread operation -- which isn't practical? It's evident that the compiler's attempt at identifying and implementing multiple threads is doing more harm than good...

You've experienced an absolutely expected problem. There is no magic here, and maybe a next generation of compilers will finally do this better. I've experienced a completely different problem, when a single-threaded application stopped working on a computer with four CPUs and the only solution was to call the 'SetThreadAffinityMask( ... )' Win32 API function to enforce execution on one CPU. So, you'll need to consider a multi-threaded version of your legacy Fortran program in order to take advantage of multiple CPUs.