Intel® Fortran Compiler

Question on Optimization: Maximize Speed plus Higher Level Optimizations (/O3)

LTthink
Novice

Hi All,

I am trying to improve the performance of a large body of old Fortran code. I identified the bottleneck (Function A) using Intel VTune and changed Optimization from disabled to Maximize Speed plus Higher Level Optimizations (/O3), along with other optimization flags. The new profile shows that the execution time of Function A is significantly reduced, but a new bottleneck has appeared: [OpenMP worker] from BaseThreadInitThunk. Because of this, the total execution time has not improved.

Am I correct that this OpenMP worker is overhead caused by /O3? Is there any way to reduce it so that I can achieve an overall performance gain?

BTW, Function A is a routine that integrates differential equations (DEs). Is there any other way to improve its performance?

Thanks!

5 Replies
Steve_Lionel
Black Belt Retired Employee

I don't think there is any connection between O3 and OpenMP usage. My guess is that this "new bottleneck" was masked by the longer execution time of the now optimized code.

I am not an OpenMP expert so I will leave it to others to offer suggestions.

LTthink
Novice

You are right, Dr. Steve. It was introduced by turning on Parallelization under Optimization.

This helps a lot, but the OpenMP worker it introduced costs too much. I am thinking that maybe the execution time shown for the OpenMP worker is not overhead but the actual execution of the now-parallelized calculation for Function A.

Thanks!

jimdempseyatthecove
Black Belt

>> the execution time of Function A is significantly reduced, but a new bottleneck has appeared

Did the overall runtime go down by the estimated savings in Function A optimizations?

Be aware that you will always have a worst section of code: once one hotspot is fixed, the next one moves to the top of the profile.

I am not sure about the BaseThreadInitThunk in the report. In a well-structured program, OpenMP threads tend to be initialized once (or at least seldom). After initialization, the threads are "parked" in a pool (or pools).

If your application is not well structured for threading, you may have OpenMP thread initialization issues.

Example of a poorly structured program (C# or C++):

loop:
     ...
     create thread to run Fortran library with OpenMP
     ...
end loop

In other words, a different system thread enters the first OpenMP parallel region in the library on each iteration
(thus a new thread pool is created each time).

Jim Dempsey

LTthink
Novice

Thank you Jim.

I found out that this OpenMP Worker was introduced by turning on Parallelization under Optimization, so it was created automatically.

The overall runtime is similar (maybe slightly better). The cost of initializing the OpenMP threads seems to offset the time saved by executing Function A in parallel. Or maybe there is no time saving for Function A at all, and VTune just reports its time under the OpenMP worker.

BTW, this is different from my other question about calling Fortran from C. There, I am trying to use threading in C to call Fortran. Here, I am trying to optimize the Fortran code itself.

Thanks!  

 

jimdempseyatthecove
Black Belt

>>I found out that this OpenMP Worker was introduced by turning on Parallelization under Optimization

Use either OpenMP or Auto Parallelization, not both.

>>The cost of initializing the OpenMP threads seems to offset the time saved by executing Function A in parallel.

For (representative) testing of performance, at the start of the PROGRAM, insert a "warmup" parallel region that basically does nothing. Note, it must do at least something, else compiler optimizations might elide the code.
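A minimal sketch of such a warm-up region, assuming the program is built with OpenMP enabled (for example /Qopenmp with Intel Fortran). The program name and the variable nthreads are made up for illustration. Each thread adds 1 to a reduction variable, so the region has an observable result and should not be optimized away:

```fortran
program warmup_demo
  use omp_lib                      ! requires compiling with OpenMP enabled
  implicit none
  integer :: nthreads              ! illustrative name, not from the thread
  double precision :: t0, t1

  nthreads = 0
  t0 = omp_get_wtime()

  ! Warm-up region: forces creation of the OpenMP thread pool.
  ! Every thread contributes 1, so the result is used and the
  ! region cannot simply be removed by the optimizer.
  !$omp parallel reduction(+:nthreads)
  nthreads = nthreads + 1
  !$omp end parallel

  t1 = omp_get_wtime()
  print '(a,i0,a,f10.6,a)', 'Warm-up started ', nthreads, &
        ' thread(s) in ', t1 - t0, ' s'

  ! ... the timed work (e.g. the call to Function A) would follow here ...
end program warmup_demo
```

With the pool already created, later timings of parallel code should no longer include the one-time thread start-up cost.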

Secondly, enclose the function in a loop that takes a few representative timings. The two numbers of interest are the first time and the least time. Use at least 3 or 4 iterations.
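The timing loop could look roughly like the sketch below. Here function_a is only a stand-in (a parallel sum) for the real integration routine, and n_trials, t_first and t_best are illustrative names. It assumes OpenMP is enabled at compile time:

```fortran
program time_function_a
  use omp_lib                      ! requires compiling with OpenMP enabled
  implicit none
  integer, parameter :: n_trials = 4
  integer :: i
  double precision :: t0, elapsed, t_first, t_best, answer

  t_first = 0.0d0
  t_best  = huge(1.0d0)

  do i = 1, n_trials
    t0 = omp_get_wtime()
    answer = function_a(2000000)
    elapsed = omp_get_wtime() - t0
    if (i == 1) t_first = elapsed  ! includes one-time thread start-up
    t_best = min(t_best, elapsed)  ! closest to steady-state cost
  end do

  print '(a,es14.6)', 'answer     = ', answer
  print '(a,f10.6,a)', 'first time = ', t_first, ' s'
  print '(a,f10.6,a)', 'least time = ', t_best, ' s'

contains

  ! Placeholder for the real Function A: a simple parallel reduction.
  function function_a(n) result(s)
    integer, intent(in) :: n
    double precision :: s
    integer :: k
    s = 0.0d0
    !$omp parallel do reduction(+:s)
    do k = 1, n
      s = s + 1.0d0 / (dble(k) * dble(k))
    end do
    !$omp end parallel do
  end function function_a

end program time_function_a
```

A large gap between the first time and the least time points at one-time start-up cost rather than at the routine itself.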

As a general practice, apply parallelization at the outermost practical level, then address everything inside the parallel code with vectorization.
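As a rough illustration of that layering (not the poster's actual code), the sketch below threads the outer loop with OpenMP and asks for vectorization of the inner loop. The array names and sizes are made up, and the inner loop runs over the first array index so that memory access is contiguous in Fortran's column-major layout:

```fortran
program outer_parallel_inner_simd
  implicit none
  integer, parameter :: n_inner = 1024, n_outer = 512   ! illustrative sizes
  double precision, allocatable :: a(:,:), col_sum(:)
  integer :: i, j
  double precision :: s

  allocate(a(n_inner, n_outer), col_sum(n_outer))
  a = 1.0d0

  ! Threads split the outer loop: one column per iteration.
  !$omp parallel do private(s)
  do i = 1, n_outer
    s = 0.0d0
    ! The inner loop is left to the vector units.
    !$omp simd reduction(+:s)
    do j = 1, n_inner
      s = s + a(j, i) * a(j, i)
    end do
    col_sum(i) = s
  end do
  !$omp end parallel do

  print '(a,f12.1)', 'total = ', sum(col_sum)
end program outer_parallel_inner_simd
```

Without OpenMP enabled the directives are treated as comments and the program still runs serially, which makes it easy to compare the two builds.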

>> There, I am trying to use threading in C to call Fortran

When doing that, and Fortran is also parallelized, either:

a) Use the same C thread to make the call
or
b) Use the same collection of C threads and tune the number of Fortran threads to avoid over-subscription.
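For option (b), one possible way to cap the Fortran thread count is a num_threads clause on the parallel region. The sketch below is an assumption-laden illustration: run_kernel, max_threads and the figure of 4 C caller threads are hypothetical, and the driver program merely stands in for the C side. It assumes OpenMP is enabled at compile time:

```fortran
module fortran_kernel
  use iso_c_binding, only: c_int, c_double
  use omp_lib                      ! requires compiling with OpenMP enabled
  implicit none
contains

  ! Hypothetical entry point that each C thread would call.
  ! C prototype: void run_kernel(int n, int max_threads, double *total);
  subroutine run_kernel(n, max_threads, total) bind(C, name='run_kernel')
    integer(c_int), value :: n, max_threads
    real(c_double), intent(out) :: total
    real(c_double) :: s
    integer(c_int) :: k

    s = 0.0d0
    ! Cap the team size so that (C threads) x (Fortran threads)
    ! does not exceed the number of available processors.
    !$omp parallel do reduction(+:s) num_threads(max_threads)
    do k = 1, n
      s = s + real(k, c_double)
    end do
    !$omp end parallel do
    total = s
  end subroutine run_kernel

end module fortran_kernel

program driver
  use iso_c_binding, only: c_int, c_double
  use omp_lib
  use fortran_kernel
  implicit none
  integer(c_int), parameter :: c_threads = 4   ! assumed number of C caller threads
  integer(c_int) :: per_caller
  real(c_double) :: total

  ! Share the processors among the callers, with at least 1 thread each.
  per_caller = max(1_c_int, int(omp_get_num_procs(), c_int) / c_threads)
  call run_kernel(1000_c_int, per_caller, total)
  print '(a,i0,a,f10.1)', 'threads per caller = ', per_caller, ', total = ', total
end program driver
```

The same cap could instead be set through the OMP_NUM_THREADS environment variable or omp_set_num_threads; which is more convenient depends on how the C side is organized.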

Jim Dempsey
