Intel® Fortran Compiler

Using OMP on Inner Loops

ash1
Hello everyone,

I use OpenMP on my outer do loops. Can I also apply OpenMP to my inner loops in the same way, or is OpenMP limited to outer loops? How can I parallelize my inner loops without affecting the OpenMP parallelization of the outer loops?

I would appreciate any advice on this matter.

Best regards.
TimP
This is possible with nested parallelism, which is enabled through OMP_NESTED. The reasons for doing so would be unusual and specialized. The more usual way to parallelize an inner loop is vectorization rather than threading.
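
For illustration only, here is a minimal sketch of what threading both loop levels might look like. The program name, array name, sizes, and thread counts are all hypothetical. It assumes an OpenMP 3.0 or later runtime, where omp_set_max_active_levels is available; older runtimes may additionally need nesting enabled through OMP_NESTED as described above. The `!$` lines are compiled only when OpenMP is enabled, so the program also builds and runs serially.

```fortran
program nested_sketch
  !$ use omp_lib
  implicit none
  integer, parameter :: n_outer = 4, n_inner = 1000   ! hypothetical sizes
  real :: a(n_inner, n_outer)
  integer :: i, j

  ! Allow two active levels of parallelism (assumes OpenMP 3.0 or later).
  !$ call omp_set_max_active_levels(2)

  ! Outer level: each thread takes some of the columns.
  !$omp parallel do private(i) num_threads(2)
  do j = 1, n_outer
    ! Inner level: each outer thread may start its own team for the rows.
    !$omp parallel do num_threads(2)
    do i = 1, n_inner
      a(i, j) = real(i) * real(j)
    end do
    !$omp end parallel do
  end do
  !$omp end parallel do

  print *, 'sum =', sum(a)
end program nested_sketch
```

Each inner parallel region adds its own startup and synchronization cost, which is one reason this arrangement is rarely worthwhile compared with vectorizing the inner loop.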
ash1
Is it possible to use vectorization of inner loops in conjunction with OpenMP on the outer loops, and if so, how? The best improvement I got from using OpenMP on the outer loop was a factor of two in speed. What can I do to gain further speed, and what is the limit?

Thank you.
TimP
Even more so than 20 years ago, combining inner-loop vectorization with OpenMP threading of the outer loop depends on arranging for stride-1 inner loops and removing sequential dependencies at both loop levels. The -opt-report options report how well the compiler succeeded at parallelization and auto-vectorization; the -openmp-report and -vec-report options give subsets of that information.
The compiler has little freedom under -openmp to swap loop nest levels. If auto-vectorization succeeds without -openmp but fails with it, look for ways to restructure the source so that both optimizations apply together.
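
As a rough sketch of the loop arrangement described above (the program name, array names, and sizes are hypothetical), the outer loop over columns is threaded with OpenMP, while the inner loop runs over the first array index. In Fortran's column-major layout that makes it stride 1, and it has no dependence between iterations, so the compiler is free to vectorize it:

```fortran
program omp_outer_vec_inner
  implicit none
  integer, parameter :: n = 1000, m = 500   ! hypothetical sizes
  real, allocatable :: a(:,:), b(:,:)
  real :: s
  integer :: i, j

  allocate(a(n, m), b(n, m))
  b = 1.0
  s = 2.0

  ! Outer loop: iterations are independent, so they are shared among threads.
  !$omp parallel do private(i)
  do j = 1, m
    ! Inner loop: first index varies fastest (stride 1), no dependence
    ! between iterations, so it is a candidate for auto-vectorization.
    do i = 1, n
      a(i, j) = s * b(i, j) + real(j)
    end do
  end do
  !$omp end parallel do

  print *, a(1, 1), a(n, m)
end program omp_outer_vec_inner
```

Whether the inner loop was actually vectorized can be checked with the report options mentioned above.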
Memory bandwidth often sets a limit on performance. When it is not a factor, you may aim for linear speedup in both the number of cores and the width of the vector instructions (4 for single precision, 2 for double precision with 128-bit SSE registers), provided the loop lengths are suitable. For example, four cores with single-precision vectors give an ideal bound of 4 x 4 = 16 times the scalar, single-threaded speed. If memory bandwidth is a factor, performance will depend strongly on minimizing extra data movement.
The compiler assumes a loop length of 100 when deciding on vectorization, unless the source code gives sufficient information about the actual length. Outer loop lengths of several hundred may give the best OpenMP speedup with vectorized inner loops.
As you will see in my examples at http://sites.google.com/site/tprincesite/levine-callahan-dongarra-vectors, you may need the OpenMP if clause to restrict threaded parallel execution to cases where the outer loop is sufficiently long.
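
A minimal sketch of the if clause follows. It is not taken from the linked examples; the subroutine name, array sizes, and the threshold of 100 are hypothetical, and a real threshold would have to be found by measurement. When the condition is false, the region runs with a team of one thread, so short outer loops avoid the threading overhead:

```fortran
program omp_if_sketch
  implicit none
  real :: small(64, 8), large(64, 400)   ! hypothetical sizes

  call fill(small)   ! 8 columns: below the threshold, one thread
  call fill(large)   ! 400 columns: above the threshold, threaded
  print *, small(64, 8), large(64, 400)

contains

  subroutine fill(a)
    real, intent(out) :: a(:,:)
    integer, parameter :: min_outer = 100   ! hypothetical threshold
    integer :: i, j

    ! Thread the outer loop only when it has enough iterations.
    !$omp parallel do private(i) if(size(a, 2) >= min_outer)
    do j = 1, size(a, 2)
      do i = 1, size(a, 1)   ! stride-1 inner loop, left to the vectorizer
        a(i, j) = real(i + j)
      end do
    end do
    !$omp end parallel do
  end subroutine fill

end program omp_if_sketch
```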
For the -parallel option, the compiler assumes a small outer loop trip count, which typically leads it not to parallelize when a vectorizable inner loop is assumed to have length 100.