/nologo /O3 /Qparallel /Qopenmp /Qvec-report0 /Qzero /module:"x64\\Release\\\\" /object:"x64\\Release\\\\" /traceback /libs:static /threads /winapp /c
With these options, the dual Xeon execution time is 76.7 hrs and the single Xeon time is 56.7 hrs.
Then I turned off the parallelization options like this:
/nologo /O3 /Qpar-threshold:0 /Qvec-threshold:0 /assume:nocc_omp /Qvec-report0 /Qzero /module:"x64\\Release\\\\" /object:"x64\\Release\\\\" /traceback /libs:static /threads /winapp /c
Then the execution times are:
Dual Xeon: 72 hrs, single Xeon: 52 hrs.
In all tests the single processor is consistently faster than the dual processor, even when parallelization is turned off.
I understand the parallelization is not helping anyway, but the thing is, this is legacy code migrated from F77 that was not very efficiently programmed in the first place. We are trying to optimize it a bit, but the dual processor is hurting instead of helping.
Any ideas?
Thanks,
R//G
I think there's a misconception here. /Qpar-threshold:0 does not turn auto-parallelization off; it enables more parallelization, including loops for which performance is not expected to improve (such as loops with low trip counts). To disable auto-parallelization, remove /Qparallel or use /Qparallel-. Likewise, /Qvec-threshold:0 vectorizes loops that really shouldn't be vectorized, so I recommend that you remove it. There's no reason to disable vectorization.
I wouldn't normally use /Qparallel and /Qopenmp together. If you don't have OpenMP directives, don't use /Qopenmp. If you do, then don't use /Qparallel.
Then, you can compare a build without /Qparallel to a build with /Qparallel, and to a second build with /Qparallel /Qpar-threshold:99. Those are very long run times. Can you test on smaller workloads or data sets?
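As a rough sketch of that comparison, the three builds would differ only in the options below. This assumes every other option from the original command line (/nologo, /Qzero, /module, /object, /traceback and so on) stays unchanged, and that /Qopenmp, /Qvec-threshold:0 and /assume:nocc_omp are left out. The labels A, B and C are only for illustration.

```text
Build A (no auto-parallelization):      /O3
Build B (auto-parallelization):         /O3 /Qparallel
Build C (conservative auto-parallel):   /O3 /Qparallel /Qpar-threshold:99
```

Timing all three on the same machine and the same reduced data set should show whether auto-parallelization is contributing anything at all.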
Whether /Qparallel helps at all will depend on how your code is structured. If you are willing to spend the time, there's usually a lot more potential in OpenMP, because you can thread at a higher level, unless you're fortunate enough to have one or two hot and simple kernels (such as matrix multiplications).
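To illustrate what threading at a higher level can look like, here is a minimal sketch rather than anything taken from the original code. The routine solve_case and the sizes n_cases and n_steps are hypothetical stand-ins for an expensive, independent unit of work in a legacy program. The directive goes on the outer loop, so each thread handles whole cases instead of sharing out the iterations of a small inner loop.

```fortran
program omp_outer_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_cases = 8        ! hypothetical number of independent cases
  integer, parameter :: n_steps = 100000   ! hypothetical amount of work per case
  real(dp) :: results(n_cases)
  integer :: i

  ! Thread the OUTER loop: every iteration is an independent case that
  ! writes only to its own element of results, so there is no data race.
  ! The loop index i is private to each thread automatically.
  !$omp parallel do schedule(dynamic)
  do i = 1, n_cases
     results(i) = solve_case(i, n_steps)
  end do
  !$omp end parallel do

  print *, 'sum of results:', sum(results)

contains

  ! Hypothetical stand-in for an expensive, self-contained legacy routine.
  ! It is pure: no shared state is modified, which makes it safe to call
  ! from several threads at once.
  pure function solve_case(case_id, steps) result(acc)
    integer, intent(in) :: case_id, steps
    real(dp) :: acc
    integer :: k
    acc = 0.0_dp
    do k = 1, steps
       acc = acc + sin(real(case_id * k, dp)) / real(k, dp)
    end do
  end function solve_case

end program omp_outer_loop
```

Built with /Qopenmp (and without /Qparallel), the directive is honored. Built without /Qopenmp, the directive lines are treated as comments and the program runs serially, which makes it easy to compare the two results. Legacy F77 routines that use COMMON blocks or SAVE variables are usually not thread-safe as they stand, so real code would need that checked before a loop like this is threaded.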
Have you tried to identify your hotspots?
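A profiler is the usual tool for this. As a crude first step, wall-clock timers around the major phases can already show where the hours go. The sketch below uses the standard system_clock intrinsic; the loop is only a placeholder for a real phase of the program, and the names t0, t1 and total are illustrative.

```fortran
program time_sections
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer(kind=8) :: t0, t1, rate
  real(dp) :: total
  integer :: k

  ! Ask for the clock resolution once (ticks per second).
  call system_clock(count_rate=rate)

  call system_clock(t0)
  ! Placeholder workload standing in for one phase of the real program.
  total = 0.0_dp
  do k = 1, 5000000
     total = total + sqrt(real(k, dp))
  end do
  call system_clock(t1)

  print '(a, f10.4, a)', 'phase 1 wall time: ', &
        real(t1 - t0, dp) / real(rate, dp), ' s'
  ! Print the result so the compiler cannot discard the loop.
  print *, 'checksum:', total
end program time_sections
```

system_clock measures elapsed wall time, which is the quantity of interest when comparing serial and threaded builds; cpu_time would add up the time of all threads.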
Some useful (I hope) references:
http://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/
http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/
http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/
http://software.intel.com/en-us/articles/requirements-for-vectorizable-loops/