Hi,
I'm using Intel Visual Fortran Compiler Pro 11.1 to compile my code on an Intel Core i5 architecture.
Because I would like to parallelize the execution of the program, I use the "-c /Qparallel" options at the compilation step, and the "/Qpar-report" option reports that almost all the loops have been parallelized.
But when I execute my program, only 25% of the total CPU resource is allocated to the corresponding process, even though all the processors seem to work simultaneously. I've tried setting the priority of the process to "/high" when I execute the program, with no effect, and the affinity is set by default to all 4 processors.
I don't know what is going wrong, thanks in advance for any help.
JB
This thread has been running for quite a while!
If your process reports 25% CPU in Task Manager on a Core i5, then no parallel threads are being effectively utilised; just one stream is fully committed.
There are two possibilities:
a) If the parallelisation is achieved via !$OMP SECTION directives, then either one of the sections runs for a significantly longer time than the others, or there is a clash and only one section runs at a time, or
b) the !$OMP directives are being ignored and there is only one running stream.
You should run the program with and without OpenMP selected and see what is different.
Is it possible to estimate the run times of each of the sections? Elapsed time from QueryPerformanceCounter or RDTSC might give the precision you require to identify what is happening. Ignore the complaints about these timing routines being inaccurate; as bad as they are, they are probably the best you have available.
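For instance, here is a minimal sketch of per-section timing using the standard SYSTEM_CLOCK intrinsic (a portable alternative to QueryPerformanceCounter; do_section_a and do_section_b are placeholders for your actual sections):
[fortran]
SUBROUTINE timeSections()
    implicit none
    integer(8):: t0, t1, rate

    CALL SYSTEM_CLOCK(count_rate = rate)   ! ticks per second

    CALL SYSTEM_CLOCK(t0)
    CALL do_section_a()                    ! placeholder: your first section
    CALL SYSTEM_CLOCK(t1)
    WRITE(*,*) 'Section A: ', dble(t1 - t0) / dble(rate), ' s'

    CALL SYSTEM_CLOCK(t0)
    CALL do_section_b()                    ! placeholder: your second section
    CALL SYSTEM_CLOCK(t1)
    WRITE(*,*) 'Section B: ', dble(t1 - t0) / dble(rate), ' s'
END SUBROUTINE timeSections
[/fortran]
If one section dominates the total, that explains a single busy core while the others idle.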
John
QueryPerformanceCounter and RDTSC are good alternatives to Task Manager. At least with RDTSC you will measure performance in CPU cycles.
Hello everybody!
Sorry I wasn't able to follow these numerous posts this weekend; thanks again for your help.
A lot of issues have been raised as the posts went on, and I tried different leads:
1) Compiler options (/Qparallel, /Qax, /Qopenmp);
2) Environment variable adjustment;
3) CPU load control;
4) I/O control.
1): I managed (many thanks to Sergey) to run auto-parallelization, which is a bit more efficient than the OpenMP directives I had chosen to put in, but it is still slower than compiling without any option!
Using /Qpar-report2 yields several kinds of messages:
- existence of parallel dependence (nothing to be done about that!)
- insufficient computational work (if somebody could shed light on this, I'd be glad)
- LOOP WAS AUTO-PARALLELIZED immediately followed by: loop was not parallelized: insufficient inner loop
Is a loop really parallelized in the third case? Moreover, the indicated lines don't match any meaningful line in my code (sometimes they even point at comment lines).
2): I tried setting KMP_AFFINITY to another configuration; the results got worse.
3): Yesterday I implemented an RDTSC counter (thanks to this example), which allowed me to measure the time between each step and to notice that the stream step is one of the heaviest, but not the only one. Fortunately, the longest loops are the ones /Qparallel tends to parallelize.
4): I tried to understand why I/O could interact with auto-parallelization, but I didn't find any information on this. Sergey, why do you think it can prohibit parallelization? How can I solve this problem?
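For reference, the option combinations discussed in point 1) would look something like this on the command line (the file name mysolver.f90 is illustrative; /Qpar-threshold tells the auto-parallelizer to only thread loops it considers very likely to pay off):
[plain]
ifort /Qparallel /Qpar-report2 mysolver.f90
ifort /Qopenmp mysolver.f90
ifort /Qparallel /Qpar-threshold:99 /Qpar-report2 mysolver.f90
[/plain]
Note that /Qparallel and /Qopenmp are usually tried separately, so it is clear which mechanism produced the threads.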
Regarding I/O: at least you can use xperf and identify the thread(s) performing I/O by call-stack examination. You can then check whether there are interdependencies between the I/O thread(s) and the thread(s) performing the calculation.
I suppose that full-scale debugging coupled with call-stack analysis of every thread (in the case where parallelism was achieved) could reveal the root cause of your problem. While running under the debugger you will need to single-step and trace calls, and observe the execution of your code. It is not an easy task, but it can be helpful for understanding the failure of auto-parallelism. Before using the debugger, if you are interested, please inspect your code's import table with dumpbin.
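For example, the import table can be listed with (the executable name is illustrative):
[plain]
dumpbin /imports mysolver.exe
[/plain]
If the output lists the Intel OpenMP runtime DLL (libiomp5md.dll on recent compiler versions; the exact name depends on the version), the binary was linked against the threaded runtime, which is a quick sanity check that threading was built in at all.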
<<<so that my process isn't constrained by any other process>>>
A process is only a memory-mapped container and cannot be constrained by another process. What can be preempted are your process's threads.
Alright, thanks for your answer!
@Sergey: I think this can be suitable for me; I just have to learn how to interface my code with a Fortran DLL! (I'm a bit new to programming, and especially to Fortran.)
@iliyapolak: I'll try what you said about XPerf at work as soon as my IT department gets around to installing it on my workstation. My own computer is not powerful enough to run my code properly, and XPerf would not give accurate information.
I still wonder what the lines LOOP WAS AUTO-PARALLELIZED immediately followed by loop was not parallelized: insufficient inner loop mean when I compile with /Qpar-report2. Is a loop really parallelized when these comments appear?
Hi JB D,
Before running Xperf, please verify that no other program is using the Kernel Logger.
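A typical capture session looks like this (trace.etl is an illustrative output file name; DiagEasy is a predefined group of kernel providers):
[plain]
xperf -on DiagEasy
    ... run the Fortran program here ...
xperf -d trace.etl
xperf trace.etl
[/plain]
The first command starts the kernel logger, the second stops it and merges the trace into trace.etl, and the last opens the trace in the viewer, where CPU usage and I/O can be examined per thread.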
Does your loop have interdependencies, and does it use compile-time constant values? Judging by the compiler message, is it possible that your inner loop runs for a very short time, and the overhead needed to parallelise it is simply too large?
JB D. wrote:
I still wonder what the lines LOOP WAS AUTO-PARALLELIZED immediately followed by loop was not parallelized: insufficient inner loop mean when I compile with /Qpar-report2. Is a loop really parallelized when these comments appear?
Without seeing the actual report, there are two possibilities:
- The compiler generated multiple code paths for this region. One was parallelized and the other was not.
- There is an inner and outer loop, and only the outer loop was parallelized. This includes loops created by the compiler for arrays.
Hi Annalee,
what can "insufficient inner loop" mean?
@Sergey: here is a subroutine which generates this kind of warning and should qualify for auto-parallelization (I guess):
[fortran]
SUBROUTINE computeMacros(f, rho, u, uSqr)
    USE simParam, ONLY: xDim, yDim
    use omp_lib
    implicit none
    double precision, INTENT(IN)   :: f(yDim,xDim,0:8)
    double precision, INTENT(INOUT):: u(yDim,xDim,0:1), rho(yDim,xDim), uSqr(yDim,xDim)
    integer:: x, y

    do x = 1, xDim
        do y = 1, yDim
            rho(y,x)  = f(y,x,0) + f(y,x,1) + f(y,x,2) + f(y,x,3) + f(y,x,4) &
                      + f(y,x,5) + f(y,x,6) + f(y,x,7) + f(y,x,8)
            u(y,x,0)  = (f(y,x,1) - f(y,x,3) + f(y,x,5) - f(y,x,6) - f(y,x,7) + f(y,x,8)) / rho(y,x)
            u(y,x,1)  = (f(y,x,2) - f(y,x,4) + f(y,x,5) + f(y,x,6) - f(y,x,7) - f(y,x,8)) / rho(y,x)
            uSqr(y,x) = u(y,x,0) * u(y,x,0) + u(y,x,1) * u(y,x,1)
        end do
    end do
END SUBROUTINE computeMacros
[/fortran]
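For comparison, an explicitly threaded version of the same loop nest with an OpenMP directive might look like the following sketch (compiled with /Qopenmp; the schedule is left at its default, and there are no loop-carried dependencies since each (y,x) is written independently):
[fortran]
SUBROUTINE computeMacrosOMP(f, rho, u, uSqr)
    USE simParam, ONLY: xDim, yDim
    implicit none
    double precision, INTENT(IN)   :: f(yDim,xDim,0:8)
    double precision, INTENT(INOUT):: u(yDim,xDim,0:1), rho(yDim,xDim), uSqr(yDim,xDim)
    integer:: x, y

    ! Parallelize the outer loop; x and y must be private to each thread
    !$OMP PARALLEL DO PRIVATE(x, y)
    do x = 1, xDim
        do y = 1, yDim
            rho(y,x)  = f(y,x,0) + f(y,x,1) + f(y,x,2) + f(y,x,3) + f(y,x,4) &
                      + f(y,x,5) + f(y,x,6) + f(y,x,7) + f(y,x,8)
            u(y,x,0)  = (f(y,x,1) - f(y,x,3) + f(y,x,5) - f(y,x,6) - f(y,x,7) + f(y,x,8)) / rho(y,x)
            u(y,x,1)  = (f(y,x,2) - f(y,x,4) + f(y,x,5) + f(y,x,6) - f(y,x,7) - f(y,x,8)) / rho(y,x)
            uSqr(y,x) = u(y,x,0) * u(y,x,0) + u(y,x,1) * u(y,x,1)
        end do
    end do
    !$OMP END PARALLEL DO
END SUBROUTINE computeMacrosOMP
[/fortran]
This mirrors what the auto-parallelizer does here: the threads split the x range, and each thread runs its inner y loop serially.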
@Annalee: Is this problem related to the algorithm or to the compiler? I can't give you more detail on this report because I'm away until Monday, but as far as I remember, for this routine for example, the first line of the related report was LOOP WAS AUTO-PARALLELIZED and the inner-loop warning appeared 3 or 4 times.
>>The compiler generated multiple code paths for this region. One was parallelized and the other was not.>>
Does the number just before the warning (for instance: (649)) refer to these paths? Because I can't link it with any line number of my source.
>>what insufficient inner loop can mean?
I wonder too!
Your program was successfully parallelized. In this case, what it means is that the compiler parallelized the outer loop, "do x = 1, xDim", and the inner loop is left as is. This is exactly what should happen.
The number before the warning just tells us which warning is being displayed.
"insufficient inner loop" means there was not enough work within the inner loop for it to be efficient to parallelize it.
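To illustrate the general idea of insufficient work (the array names, bounds, and bodies below are illustrative, not taken from JB's code): a loop like the first one has so little work per iteration that thread start-up and synchronization overhead would dominate, while the second has enough work to be worth threading.
[fortran]
! Too little work: tiny trip count, trivial body
do i = 1, 8
    a(i) = b(i) + 1.0d0
end do

! Enough work: large trip count, non-trivial body
do i = 1, 1000000
    a(i) = sqrt(b(i)) * exp(-b(i))
end do
[/fortran]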
Thank you Annalee
Yes, a thousand thanks! You definitely shed light on this point! I will try to follow Sergey's advice about the dynamic-link library. I hope this will improve my computation, because parallelization still slows down my execution, which is not really the expected behaviour...
