Hi,
I'm using Intel Visual Fortran Compiler Pro 11.1 to compile my code on an Intel Core i5 architecture.
Because I would like to parallelize the execution of the program, I use the "-c /Qparallel" options at the compilation step, and the "/Qpar-report" option reports that almost all the loops have been parallelized.
But when I execute my program, only 25% of the total CPU resource is allocated to the corresponding process, even though all the processors seem to work simultaneously. I've tried setting the priority of the process to "/high" when I execute the program, with no effect, and the affinity is set by default to all 4 processors.
I don't know what is going wrong, thanks in advance for any help.
JB
This thread has been running for quite a while!
If your process reports 25% CPU in Task Manager on a Core i5, then no parallel threads are being effectively utilised; just one stream is fully committed.
There are two possibilities:
a) If the parallelisation is achieved via !$OMP SECTION directives, then either one of the sections runs for a significantly longer time than the others, or there is a clash and only one section runs at a time, or
b) the !$OMP directives are being ignored and there is only one running stream.
You should run the program with and without OpenMP selected and see what is different.
Is it possible to estimate the run times of each of the sections? Elapsed time from QueryPerformanceCounter or RDTSC might give the precision you require to identify what is happening. Ignore the complaints about these timing routines being inaccurate; as bad as they are, they are probably the best you have available.
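For instance, here is a minimal sketch of per-section timing using the standard SYSTEM_CLOCK intrinsic (a portable alternative to QueryPerformanceCounter; do_section_a and do_section_b are placeholders for your actual sections):
[fortran]
SUBROUTINE timeSections()
    implicit none
    integer(8):: t0, t1, rate

    CALL SYSTEM_CLOCK(count_rate = rate)   ! ticks per second

    CALL SYSTEM_CLOCK(t0)
    CALL do_section_a()                    ! placeholder: your first section
    CALL SYSTEM_CLOCK(t1)
    WRITE(*,*) 'Section A: ', dble(t1 - t0) / dble(rate), ' s'

    CALL SYSTEM_CLOCK(t0)
    CALL do_section_b()                    ! placeholder: your second section
    CALL SYSTEM_CLOCK(t1)
    WRITE(*,*) 'Section B: ', dble(t1 - t0) / dble(rate), ' s'
END SUBROUTINE timeSections
[/fortran]
If one section dominates the total, that explains a single busy core while the others idle.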
John
QueryPerformanceCounter and RDTSC are good alternatives to Task Manager. At least with RDTSC you will measure performance in CPU cycles.
Hello everybody!
Sorry I wasn't able to follow these numerous posts this weekend; thanks again for your help.
A lot of issues have been raised as the posts went on, and I tried different leads:
1) Compiler options (/Qparallel, /Qax, /Qopenmp);
2) Environment variable adjustment;
3) CPU load control;
4) I/O control.
1): I managed (many thanks to Sergey) to run auto-parallelization, which is a bit more efficient than the OpenMP directives I had chosen to put in, but it is still slower than compiling without any option!
Using /Qpar-report2 yields several kinds of messages:
- existence of parallel dependence (nothing to be done about that!)
- insufficient computational work (if somebody could shed light on this, I'd be glad)
- LOOP WAS AUTO-PARALLELIZED immediately followed by: loop was not parallelized: insufficient inner loop
Is a loop really parallelized in the third case? Moreover, the indicated lines don't match any meaningful line in my code (sometimes they even point at comment lines).
2): I tried setting KMP_AFFINITY to another configuration; the results got worse.
3): Yesterday I implemented an RDTSC counter (thanks to this example), which allowed me to measure the time between each step and to notice that the stream step is one of the heaviest, but not the only one. Fortunately, the longest loops are the ones /Qparallel tends to parallelize.
4): I tried to understand why I/O could interact with auto-parallelization, but I didn't find any information on this. Sergey, why do you think it can prohibit parallelization? How can I solve this problem?
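For reference, the option combinations discussed in point 1) would look something like this on the command line (the file name mysolver.f90 is illustrative; /Qpar-threshold tells the auto-parallelizer to only thread loops it considers very likely to pay off):
[plain]
ifort /Qparallel /Qpar-report2 mysolver.f90
ifort /Qopenmp mysolver.f90
ifort /Qparallel /Qpar-threshold:99 /Qpar-report2 mysolver.f90
[/plain]
Note that /Qparallel and /Qopenmp are usually tried separately, so it is clear which mechanism produced the threads.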
Regarding I/O: at least you can use xperf and identify the thread(s) performing I/O by call-stack examination. You can then check whether there are interdependencies between the I/O thread(s) and the thread(s) performing the calculation.
I suppose that full-scale debugging coupled with call-stack analysis of every thread (in the case where parallelism was achieved) could reveal the root cause of your problem. While running under the debugger you will need to single-step and trace calls, and observe the execution of your code. It is not an easy task, but it can be helpful for understanding the failure of auto-parallelism. Before using the debugger, if you are interested, please inspect your code's import table with dumpbin.
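For example, the import table can be listed with (the executable name is illustrative):
[plain]
dumpbin /imports mysolver.exe
[/plain]
If the output lists the Intel OpenMP runtime DLL (libiomp5md.dll on recent compiler versions; the exact name depends on the version), the binary was linked against the threaded runtime, which is a quick sanity check that threading was built in at all.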
<<<so that my process isn't constrained by any other process>>>
A process is only a memory-mapped container and cannot be constrained by another process. What can be preempted are your process's threads.
Alright, thanks for your answer!
@Sergey: I think this can be suitable for me; I just have to learn how to interface my code with a Fortran DLL! (I'm a bit new to programming, and especially to Fortran.)
@iliyapolak: I'll try what you said about XPerf at work as soon as my IT department gets around to installing it on my workstation. My own computer is not powerful enough to run my code properly, and XPerf would not give accurate information.
I still wonder what the lines LOOP WAS AUTO-PARALLELIZED immediately followed by loop was not parallelized: insufficient inner loop mean when I compile with /Qpar-report2. Is a loop really parallelized when these comments appear?
Hi JB D,
Before running Xperf, please verify that no other program is using the Kernel Logger.
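A typical capture session looks like this (trace.etl is an illustrative output file name; DiagEasy is a predefined group of kernel providers):
[plain]
xperf -on DiagEasy
    ... run the Fortran program here ...
xperf -d trace.etl
xperf trace.etl
[/plain]
The first command starts the kernel logger, the second stops it and merges the trace into trace.etl, and the last opens the trace in the viewer, where CPU usage and I/O can be examined per thread.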
Does your loop have interdependencies, and does it use compile-time constant values? Judging by the compiler message, is it possible that your inner loop runs for a very short time, and the overhead needed to parallelise it is simply too large?
JB D. wrote:
I still wonder what the lines LOOP WAS AUTO-PARALLELIZED immediately followed by loop was not parallelized: insufficient inner loop mean when I compile with /Qpar-report2. Is a loop really parallelized when these comments appear?
Without seeing the actual report, there are two possibilities:
- The compiler generated multiple code paths for this region. One was parallelized and the other was not.
- There is an inner and outer loop, and only the outer loop was parallelized. This includes loops created by the compiler for arrays.
Hi Annalee,
what can "insufficient inner loop" mean?
@Sergey: here is a subroutine which generates this kind of warning and should qualify for auto-parallelization (I guess):
[fortran]
SUBROUTINE computeMacros(f, rho, u, uSqr)
    USE simParam, ONLY: xDim, yDim
    use omp_lib
    implicit none
    double precision, INTENT(IN)   :: f(yDim,xDim,0:8)
    double precision, INTENT(INOUT):: u(yDim,xDim,0:1), rho(yDim,xDim), uSqr(yDim,xDim)
    integer:: x, y

    do x = 1, xDim
        do y = 1, yDim
            rho(y,x)  = f(y,x,0) + f(y,x,1) + f(y,x,2) + f(y,x,3) + f(y,x,4) &
                      + f(y,x,5) + f(y,x,6) + f(y,x,7) + f(y,x,8)
            u(y,x,0)  = (f(y,x,1) - f(y,x,3) + f(y,x,5) - f(y,x,6) - f(y,x,7) + f(y,x,8)) / rho(y,x)
            u(y,x,1)  = (f(y,x,2) - f(y,x,4) + f(y,x,5) + f(y,x,6) - f(y,x,7) - f(y,x,8)) / rho(y,x)
            uSqr(y,x) = u(y,x,0) * u(y,x,0) + u(y,x,1) * u(y,x,1)
        end do
    end do
END SUBROUTINE computeMacros
[/fortran]
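For comparison, an explicitly threaded version of the same loop nest with an OpenMP directive might look like the following sketch (compiled with /Qopenmp; the schedule is left at its default, and there are no loop-carried dependencies since each (y,x) is written independently):
[fortran]
SUBROUTINE computeMacrosOMP(f, rho, u, uSqr)
    USE simParam, ONLY: xDim, yDim
    implicit none
    double precision, INTENT(IN)   :: f(yDim,xDim,0:8)
    double precision, INTENT(INOUT):: u(yDim,xDim,0:1), rho(yDim,xDim), uSqr(yDim,xDim)
    integer:: x, y

    ! Parallelize the outer loop; x and y must be private to each thread
    !$OMP PARALLEL DO PRIVATE(x, y)
    do x = 1, xDim
        do y = 1, yDim
            rho(y,x)  = f(y,x,0) + f(y,x,1) + f(y,x,2) + f(y,x,3) + f(y,x,4) &
                      + f(y,x,5) + f(y,x,6) + f(y,x,7) + f(y,x,8)
            u(y,x,0)  = (f(y,x,1) - f(y,x,3) + f(y,x,5) - f(y,x,6) - f(y,x,7) + f(y,x,8)) / rho(y,x)
            u(y,x,1)  = (f(y,x,2) - f(y,x,4) + f(y,x,5) + f(y,x,6) - f(y,x,7) - f(y,x,8)) / rho(y,x)
            uSqr(y,x) = u(y,x,0) * u(y,x,0) + u(y,x,1) * u(y,x,1)
        end do
    end do
    !$OMP END PARALLEL DO
END SUBROUTINE computeMacrosOMP
[/fortran]
This mirrors what the auto-parallelizer does here: the threads split the x range, and each thread runs its inner y loop serially.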
@Annalee: Is this problem related to the algorithm or to the compiler? I can't give you more detail on this report because I'm away until Monday, but as far as I remember, for this routine for example, the first line of the related report was LOOP WAS AUTO-PARALLELIZED and the inner-loop warning appeared 3 or 4 times.
>>The compiler generated multiple code paths for this region. One was parallelized and the other was not.>>
Does the number just before the warning (for instance: (649)) refer to these paths? Because I can't link it with any line number of my source.
>>what insufficient inner loop can mean?
I wonder too!
Your program was successfully parallelized. In this case, what it means is that the compiler parallelized the outer loop, "do x = 1, xDim", and the inner loop is left as is. This is exactly what should happen.
The number before the warning just tells us which warning is being displayed.
"insufficient inner loop" means there was not enough work within the inner loop for it to be efficient to parallelize it.
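To illustrate the general idea of insufficient work (the array names, bounds, and bodies below are illustrative, not taken from JB's code): a loop like the first one has so little work per iteration that thread start-up and synchronization overhead would dominate, while the second has enough work to be worth threading.
[fortran]
! Too little work: tiny trip count, trivial body
do i = 1, 8
    a(i) = b(i) + 1.0d0
end do

! Enough work: large trip count, non-trivial body
do i = 1, 1000000
    a(i) = sqrt(b(i)) * exp(-b(i))
end do
[/fortran]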
Thank you Annalee
Yes, a thousand thanks! You definitely shed light on this point! I will try to follow Sergey's advice about the dynamic-link library. I hope this will improve my computation, because parallelization still slows down my execution, which is not really the expected behaviour...
