I am trying to improve the performance of a large body of old Fortran code. Using Intel VTune, I identified the bottleneck (Function A) and changed Optimization from none to Maximize Speed plus Higher Level Optimizations (/O3), along with other optimization flags. The new profile shows that the execution time of Function A is significantly reduced, but a new bottleneck has appeared: [OpenMP worker] from BaseThreadInitThunk. Because of this, the total execution time has not improved.
Am I correct that this OpenMP worker is overhead from /O3? Is there any way to reduce it so that I can achieve an overall performance gain?
BTW, Function A is a routine that integrates differential equations (DEs). Is there any other way to improve its performance?
I don't think there is any connection between /O3 and OpenMP usage. My guess is that this "new bottleneck" was masked by the longer execution time of the code that is now optimized.
I am not an OpenMP expert, so I will leave it to others to offer suggestions.
You are right, Dr. Steve. It was introduced by turning on Parallelization under Optimization.
This helps a lot, but the OpenMP worker that was introduced costs too much. I am thinking that the execution time shown for OpenMP may not be overhead but the actual execution of the now-parallelized calculation in Function A.
>> the execution time of Function A is significantly reduced, but there is this new bottleneck appeared
Did the overall runtime go down by the estimated savings in Function A optimizations?
Keep in mind that some section of code will always be the worst one.
Not sure about the BaseThreadInitThunk in the report. OpenMP threads (for a well structured program) tend to get initialized once (or at least seldom). Subsequent to initialization, the threads are "parked" into a pool (or pools).
If your application is poorly structured for threading, then OpenMP thread initialization may be issued repeatedly.
Example of a poorly structured program, in C# or C++:
create a thread to run a Fortran library that uses OpenMP
In other words, a different system thread enters the first OpenMP parallel region in the library
(thus a new thread pool is created each time)
Thank you Jim.
I found out that this OpenMP worker was introduced by turning on Parallelization under Optimization, so it was created automatically.
The overall runtime is similar (maybe slightly better). The cost of initializing the OpenMP threads seems to cancel out the time saved by executing Function A in parallel. Or maybe there is no time saving for Function A at all, and VTune just reports its time under the OpenMP worker.
BTW, this is different from my other question about calling Fortran from C. There, I am trying to use threading in C to call Fortran. Here, I am trying to optimize the Fortran code itself.
>>I found out that this OpenMP Worker was introduced by turning on Parallelization under Optimization
Use either OpenMP or Auto Parallelization, not both.
>>The cost for initializing the MP threads seems fill in the time saved executing Function A parallel.
For (representative) testing of performance, at the start of the PROGRAM, insert a "warmup" parallel region that basically does nothing. Note, it must do at least something, else compiler optimizations might elide the code.
Secondly, enclose the function in a loop that takes a few representative timings. The timings of interest are the first time and the least time. Use at least 3 or 4 iterations.
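A minimal sketch of the warmup region and the timing loop described above, assuming the code is built with OpenMP enabled (for example /Qopenmp with Intel Fortran on Windows). The subroutine function_a here is only a hypothetical stand-in for the real Function A, and n_trials and the array size are arbitrary choices for illustration.

```fortran
program time_function_a
  use omp_lib
  implicit none
  integer, parameter :: n_trials = 4, n = 1000000
  double precision, allocatable :: y(:)
  double precision :: t0, elapsed, first_time, least_time
  integer :: trial, n_started

  allocate(y(n))

  ! Warmup: make the OpenMP runtime create its thread pool before any timing.
  ! Each thread adds 1, and the result is printed so the region has a visible
  ! effect and cannot be optimized away.
  n_started = 0
  !$omp parallel reduction(+:n_started)
  n_started = n_started + 1
  !$omp end parallel
  print *, 'threads started in warmup:', n_started

  ! Timing loop: record the first time and the least time over a few trials.
  first_time = 0.0d0
  least_time = huge(least_time)
  do trial = 1, n_trials
     t0 = omp_get_wtime()
     call function_a(y)
     elapsed = omp_get_wtime() - t0
     if (trial == 1) first_time = elapsed
     least_time = min(least_time, elapsed)
  end do

  print *, 'first time (s):', first_time
  print *, 'least time (s):', least_time
  print *, 'checksum:', sum(y)   ! use the result so the work is not removed

contains

  ! Hypothetical stand-in for Function A (not the original routine).
  subroutine function_a(y)
    double precision, intent(out) :: y(:)
    integer :: i
    !$omp parallel do
    do i = 1, size(y)
       y(i) = sin(dble(i))
    end do
    !$omp end parallel do
  end subroutine function_a

end program time_function_a
```

If the first time is much larger than the least time even with the warmup in place, the difference is probably not thread creation and is worth looking at separately (for example first-touch memory effects).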
As a general practice, parallelize at the outermost practical level, then address everything inside the parallel code with vectorization.
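A small sketch of that layering, assuming a hypothetical case where many independent systems are each advanced by one explicit Euler step. The array names, sizes, and the right-hand side are made up for illustration and are not taken from Function A. The outer loop over systems is threaded, and the inner loop over state variables is left to vectorization.

```fortran
program outer_parallel_inner_simd
  implicit none
  integer, parameter :: n_systems = 1000, n_state = 256
  double precision, parameter :: dt = 1.0d-3
  double precision, allocatable :: y(:,:), dydt(:,:)
  integer :: i, j

  allocate(y(n_state, n_systems), dydt(n_state, n_systems))
  y = 1.0d0
  dydt = -y            ! hypothetical right-hand side: dy/dt = -y

  ! Outer level: one system (column) per loop iteration, shared among threads.
  !$omp parallel do
  do j = 1, n_systems
     ! Inner level: contiguous in memory, so it is a good target for SIMD.
     !$omp simd
     do i = 1, n_state
        y(i, j) = y(i, j) + dt * dydt(i, j)
     end do
  end do
  !$omp end parallel do

  print *, 'y(1,1) after one step:', y(1, 1)
end program outer_parallel_inner_simd
```

The inner index runs over the first array dimension because Fortran stores arrays column-major, which keeps the vectorized accesses at unit stride.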
>> There, I am trying to using Threading in C to call Fortran
When doing that while the Fortran code is also parallelized, either:
a) Use the same C thread to make the call
b) Use the same collection of C threads and tune the number of Fortran threads to avoid over-subscription.
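For option (b), one possible way to tune the number of Fortran threads is to let the caller pass a per-call thread limit. The sketch below assumes a hypothetical C-callable routine named sum_squares; the name, the arguments, and the work it does are illustrative only. The demo program calls it from Fortran so that the example runs on its own.

```fortran
module fortran_worker
  use iso_c_binding, only: c_int, c_double
  implicit none
contains

  ! Hypothetical entry point for a C caller. n_threads is the number of
  ! OpenMP threads this particular call is allowed to use.
  function sum_squares(n, n_threads) result(total) bind(C, name='sum_squares')
    integer(c_int), value :: n, n_threads
    real(c_double) :: total
    real(c_double) :: acc
    integer(c_int) :: i

    acc = 0.0d0
    ! num_threads limits the team size for this region only.
    !$omp parallel do num_threads(n_threads) reduction(+:acc)
    do i = 1, n
       acc = acc + dble(i)**2
    end do
    !$omp end parallel do
    total = acc
  end function sum_squares

end module fortran_worker

program demo
  use iso_c_binding, only: c_int
  use fortran_worker
  implicit none
  ! For example, 4 C threads on a 8-core machine could each pass n_threads = 2.
  print *, sum_squares(1000_c_int, 2_c_int)
end program demo
```

The idea is that (number of C threads) times (n_threads per call) stays at or below the number of available cores, which is what avoiding over-subscription amounts to.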