Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Use of only 25% of CPU with Auto-Parallelization

JB_D_
Beginner
5,093 Views

Hi,

I'm using Intel Visual Fortran Compiler Pro 11.1 to compile my code on an Intel Core i5 architecture.

Because I would like to parallelize the execution of the program, I use the "-c /Qparallel" options at the compilation step, and the "/Qpar-report" option reports that almost all the loops have been parallelized.

But when I execute my program, only 25% of the total CPU resource is allocated to the corresponding process, even though all the processors seem to work simultaneously. I've tried setting the priority of the process to "/high" when I execute the program, with no effect, and the affinity is set by default to all 4 processors.

I don't know what is going wrong. Thanks in advance for any help.

JB

0 Kudos
42 Replies
TimP
Honored Contributor III
2,488 Views

Did you examine the output of /Qpar-report to see whether the important parts of your program are parallelized, or to get diagnostics on why not?

If your objective is simply to max out your multiple-thread meter, you might add /Qpar-threshold0. This asserts that you want to maximize parallelism at the expense of performance.

0 Kudos
JB_D_
Beginner
2,488 Views

Thank you for your answer,

I actually tried the /Qpar-threshold0 option to ensure that all the loops are parallelized, but it doesn't change the CPU usage, even though all the loops are parallelized according to /Qpar-report.

It is as if everything were computed on a single core: no processor is fully used, and the computation seems spread out over the 4 processors, but with a maximum use of 25% of the total CPU capacity...

Many thanks for your help !

0 Kudos
Anonymous66
Valued Contributor I
2,488 Views

What percentage of your program's run time is spent in the loops? There could be memory bottlenecks or other issues preventing your program from fully utilizing each core.

Annalee

0 Kudos
JB_D_
Beginner
2,488 Views

The program is a sequence of nested loops (at least 5 steps of 2-level loops). I guess this scheme fits auto-parallelization well, doesn't it?

Do you think that using OpenMP could greatly increase the efficiency of the parallelization? What is weird is that the CPU allocation of my process is always stuck at precisely 25%!

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,488 Views

Identify a processing-intensive loop that has been reported as being parallelized. Run in Debug mode, place a breakpoint in the loop, and run to the breakpoint. Open the Debug window for Threads: how many threads are listed?

Jim Dempsey

0 Kudos
TimP
Honored Contributor III
2,488 Views

Applying OpenMP may give you more insight; among other things you can check the number of threads assigned within a parallel region, and see whether your loops can be successfully parallelized without hidden transformations used by -Qparallel.

I suspect you must set /O explicitly along with /Qparallel for it to operate in a debug build.
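The thread-count check mentioned above can be sketched as follows. This is a minimal example, assuming the code is built with OpenMP enabled (/Qopenmp for Intel Fortran on Windows); the program name `count_threads` is hypothetical and not part of the original code.

[fortran]
program count_threads
    use omp_lib
    implicit none
    integer :: nThreads

    nThreads = 0
!$OMP PARALLEL SHARED(nThreads)
    ! One thread of the team records the team size.
    !$OMP SINGLE
    nThreads = omp_get_num_threads()
    !$OMP END SINGLE
!$OMP END PARALLEL

    print '(A,I0)', 'threads in parallel region: ', nThreads
    print '(A,I0)', 'processors available:       ', omp_get_num_procs()
end program count_threads
[/fortran]

If the first number is 1 on a 4-core machine, the region is running serially, for example because OMP_NUM_THREADS or OMP_THREAD_LIMIT restricts the team to one thread.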

0 Kudos
JB_D_
Beginner
2,488 Views

Thank you for your answer, I'm going to check that.

0 Kudos
JB_D_
Beginner
2,488 Views

Hi all,

First I would like to thank Jim and iliyapolak a lot: the debugger and xperf helped me to find that there was no parallelization in my code. I found in this forum that I had to check the data dependencies in my loops before using /Qparallel blindly :), and I realized that there's no magic tool for parallelization.
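To illustrate what a data dependency in a loop looks like, here is a small sketch with hypothetical arrays `a` and `b` (not taken from the original program). The first loop has independent iterations; the second carries a value from one iteration to the next, which is the kind of pattern that generally prevents a loop from being parallelized as written.

[fortran]
program dependency_demo
    implicit none
    integer, parameter :: n = 8
    double precision :: a(n), b(n)
    integer :: i

    b = 1.0d0
    a = 0.0d0

    ! Independent iterations: a(i) depends only on b(i), so the
    ! iterations can run in any order.
    do i = 1, n
        a(i) = 2.0d0 * b(i)
    end do

    ! Loop-carried dependency: a(i) needs a(i-1) from the previous
    ! iteration, so the iterations must run in order (a running sum).
    do i = 2, n
        a(i) = a(i-1) + b(i)
    end do

    print '(8F6.1)', a
end program dependency_demo
[/fortran]

The /Qpar-report output mentioned earlier in the thread is the place to check which loops the compiler actually parallelized.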

Because my code is fairly small, I tried to use OpenMP directives in it, mostly to parallelize independent implicit loops (whole-array assignments) in a subroutine. The parallelization works fine, but my program is slower than before. Here is the code of this routine:

[fortran]

!    ========================================================
!    Streaming step: the population functions are shifted
!        one site along their corresponding lattice direction
!        (no temporary memory is needed)
!    ========================================================
SUBROUTINE stream(f)
    USE simParam
    implicit none

    double precision, INTENT(INOUT):: f(yDim,xDim,0:8)
    double precision:: periodicHor(yDim), periodicVert(xDim)

!$OMP PARALLEL SHARED(f,xDim,yDim) PRIVATE(periodicHor,periodicVert)
 !$OMP SECTIONS
    !$OMP SECTION
    !    -------------------------------------
    !    right direction
    periodicHor   = f(:,xDim,1)
    f(:,2:xDim,1) = f(:,1:xDim-1,1)
    f(:,1,1)      = periodicHor
   
    !$OMP SECTION
    !    -------------------------------------
    !    up direction
    periodicVert    = f(1,:,2)
    f(1:yDim-1,:,2) = f(2:yDim,:,2)
    f(yDim,:,2)     = periodicVert
   
    !$OMP SECTION
    !    -------------------------------------
    !    left direction
    periodicHor     = f(:,1,3)
    f(:,1:xDim-1,3) = f(:,2:xDim,3)
    f(:,xDim,3)     = periodicHor
   
    !$OMP SECTION
    !    -------------------------------------
    !    down direction
    periodicVert  = f(yDim,:,4)
    f(2:yDim,:,4) = f(1:yDim-1,:,4)
    f(1,:,4)      = periodicVert
   
    !$OMP SECTION
    !    -------------------------------------
    !    up-right direction
    periodicVert         = f(1,:,5)
    periodicHor          = f(:,xDim,5)
    f(1:yDim-1,2:xDim,5) = f(2:yDim,1:xDim-1,5)
    f(yDim,2:xDim,5)     = periodicVert(1:xDim-1)
    f(yDim,1,5)          = periodicVert(xDim)
    f(1:yDim-1,1,5)      = periodicHor(2:yDim)
   
    !$OMP SECTION
    !    -------------------------------------
    !    up-left direction
    periodicVert           = f(1,:,6)
    periodicHor            = f(:,1,6)
    f(1:yDim-1,1:xDim-1,6) = f(2:yDim,2:xDim,6)
    f(yDim,1:xDim-1,6)     = periodicVert(2:xDim)
    f(yDim,xDim,6)         = periodicVert(1)
    f(1:yDim-1,xDim,6)     = periodicHor(2:yDim)
       
    !$OMP SECTION
    !    -------------------------------------
    !    down-left direction
    periodicVert         = f(yDim,:,7)
    periodicHor          = f(:,1,7)
    f(2:yDim,1:xDim-1,7) = f(1:yDim-1,2:xDim,7)
    f(1,1:xDim-1,7)      = periodicVert(2:xDim)
    f(1,xDim,7)          = periodicVert(1)
    f(2:yDim,xDim,7)     = periodicHor(1:yDim-1)
   
    !$OMP SECTION
    !    -------------------------------------
    !    down-right direction
    periodicVert       = f(yDim,:,8)
    periodicHor        = f(:,xDim,8)
    f(2:yDim,2:xDim,8) = f(1:yDim-1,1:xDim-1,8)
    f(1,2:xDim,8)      = periodicVert(1:xDim-1)
    f(1,1,8)           = periodicVert(xDim)
    f(2:yDim,1,8)      = periodicHor(1:yDim-1)

  !$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL

END SUBROUTINE stream
[/fortran]

I think this must be caused by a scheduling issue, but I don't know what kind of directive is really efficient in this case. Thank you so much for your help!

JB

0 Kudos
Bernard
Valued Contributor I
2,488 Views

>>>It is as if everything were computed on a single core: no processor is fully used, and the computation seems spread out over the 4 processors, but with a maximum use of 25% of the total CPU capacity...>>>

What load was reported by Xperf? Was the Idle thread consuming the remaining 75% of CPU time?

0 Kudos
JB_D_
Beginner
2,488 Views


Hello everybody,

Sorry, I guess I messed up: my first post wasn't published immediately, so I posted a new one. That's why there are two conversations on this topic.

@Annalee:
>>>If your code sections are small, the overhead involved in running in parallel may be higher than the performance gains>>>

I think you must be right: this routine is just one of the 8 steps within a main loop. But I assumed that this step was the heaviest because there are nested implicit loops and xDim and yDim are almost equal to 1000. By the way, is there a specific directive for this kind of array operation? Will OMP_NESTED=.TRUE. improve this kind of loop?
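On the question of a directive for whole-array operations: OpenMP provides the WORKSHARE construct, which divides array assignments among the threads of a team. The sketch below applies it to a "right direction" shift like the one in stream, using a hypothetical small array `g` in place of one slice of `f`, and checks the result against the CSHIFT intrinsic. Whether it is any faster is not established here: the operation is a memory copy, and some compilers may execute WORKSHARE on a single thread.

[fortran]
program workshare_demo
    implicit none
    integer, parameter :: yDim = 4, xDim = 5
    double precision :: g(yDim,xDim), expected(yDim,xDim)
    double precision :: periodicHor(yDim)
    integer :: i, j

    do j = 1, xDim
        do i = 1, yDim
            g(i,j) = dble(10*i + j)
        end do
    end do
    ! Reference result: circular shift by one column to the right.
    expected = cshift(g, shift=-1, dim=2)

!$OMP PARALLEL SHARED(g, periodicHor)
    ! Each array assignment is split among the threads; a statement
    ! completes before the next one starts.
    !$OMP WORKSHARE
    periodicHor = g(:,xDim)
    g(:,2:xDim) = g(:,1:xDim-1)
    g(:,1)      = periodicHor
    !$OMP END WORKSHARE
!$OMP END PARALLEL

    if (any(g /= expected)) error stop 'shift mismatch'
    print *, 'right shift matches cshift'
end program workshare_demo
[/fortran]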

@TimP:
I think the tasks are quite well balanced because there is only 1 heavy operation in each section, for instance: f(2:ny,2:nx,8) = f(1:ny-1,1:nx-1,8). So according to you KMP_AFFINITY may help, but I think I should know my processor architecture better to use this parameter efficiently, shouldn't I? I tried OMP_SCHEDULE without any improvement.

@iliyapolak:
I'm at work at the moment and I still don't have access to XPerf, although I asked my IT department to install it. I tried on my PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.

To better see how parallelization slows my execution, I tried setting OMP_THREAD_LIMIT from 4 down to 1, and I noticed that the speed decreases linearly as the number of threads increases.
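A measurement of this kind can also be made from inside the program. The following is a rough sketch, not the original code: it times a copy-heavy loop (similar in spirit to stream) for 1 up to the maximum number of threads, using omp_set_num_threads instead of the OMP_THREAD_LIMIT environment variable. The array size `n`, the repetition count `reps` and the program name are assumptions chosen for illustration.

[fortran]
program thread_scaling
    use omp_lib
    implicit none
    integer, parameter :: n = 1000, reps = 20
    double precision, allocatable :: src(:,:), dst(:,:)
    double precision :: t0, t1
    integer :: nThreads, maxThreads, r, j

    allocate(src(n,n), dst(n,n))
    src = 1.0d0
    dst = 0.0d0
    maxThreads = omp_get_max_threads()

    do nThreads = 1, maxThreads
        call omp_set_num_threads(nThreads)
        t0 = omp_get_wtime()
        do r = 1, reps
            ! Column shift: each iteration copies one independent column.
            !$OMP PARALLEL DO SHARED(src, dst)
            do j = 2, n
                dst(:,j) = src(:,j-1)
            end do
            !$OMP END PARALLEL DO
        end do
        t1 = omp_get_wtime()
        print '(I3,A,F10.5,A)', nThreads, ' thread(s): ', t1 - t0, ' s'
    end do

    ! Use the result so the copies cannot be optimized away.
    print *, 'checksum: ', sum(dst)
end program thread_scaling
[/fortran]

If the time stops dropping, or rises, as threads are added, the loop is probably limited by memory bandwidth or threading overhead rather than by arithmetic.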

Many thanks. I am asking more and more questions not really related to the first topic; may I begin a new conversation?

0 Kudos
SergeyKostrov
Valued Contributor II
2,488 Views

>>...But when I execute my program, only 25% of the total CPU resource is allocated to the corresponding process, even though all the processors seem to work simultaneously...
>>
>>... there was no parallelization in my code...

Did you check with Task Manager (I assume you use Windows) how many threads are used?

Another question: are there any I/O operations with the file system during processing?

0 Kudos
JB_D_
Beginner
2,488 Views

Hi Sergey,

Thanks to the debugger, I managed to see that there was only one thread running. I don't know how to check it with the Task Manager. Anyway, I'm working on OpenMP directives, and the Task Manager clearly shows me that the 4 cores are running.

Second, your question about I/O is interesting. I actually write data to a file at each global iteration (my code is a main loop including 8 steps, at the heart of which there are nested loops). Does it influence parallelization? The step in which my program writes data to a file is not enclosed in parallelization directives.

Thank you so much for your help!

JB

0 Kudos
SergeyKostrov
Valued Contributor II
2,488 Views

>>... I don't know how to check it with the Task Manager?..

- Start Task Manager
- Select the Processes property page
- Select View in the main menu
- Select Select Columns... and check Thread Count

>>...I actually write data to a file at each global iteration (my code is a main loop including 8 steps, at the heart of which there are
>>nested loops). Does it influence parallelization?

In that case I would simply comment out that part of the code, rebuild the sources and repeat all tests / verifications.

0 Kudos
JB_D_
Beginner
2,488 Views

Bravo! Auto-parallelization works fine when I comment out the output step!!

So how can I keep the output step and get auto-parallelization working fine too?

Another question: why is execution not faster (and even a little bit slower than mono-processing)?
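One arrangement that may be worth trying, offered here as an assumption rather than something confirmed in this thread, is to keep the file output in its own routine so that the compute loops contain no I/O at all, and to write only every few iterations. In the sketch below, all names (`compute_step`, `write_field`, `outEvery`, `field.dat`) are hypothetical.

[fortran]
program main_loop_sketch
    implicit none
    integer, parameter :: n = 200, nSteps = 20, outEvery = 5
    double precision :: u(n,n)
    integer :: step

    u = 1.0d0
    do step = 1, nSteps
        call compute_step(u)                  ! pure computation, no I/O
        if (mod(step, outEvery) == 0) call write_field(u, step)
    end do

contains

    subroutine compute_step(a)
        double precision, intent(inout) :: a(:,:)
        integer :: i, j
        ! Independent iterations and no I/O inside the loop nest.
        do j = 1, size(a, 2)
            do i = 1, size(a, 1)
                a(i,j) = 0.5d0 * a(i,j) + 1.0d0
            end do
        end do
    end subroutine compute_step

    subroutine write_field(a, step)
        double precision, intent(in) :: a(:,:)
        integer, intent(in) :: step
        integer :: iu
        ! Append one unformatted record per output step.
        open(newunit=iu, file='field.dat', form='unformatted', &
             position='append', action='write')
        write(iu) step, a
        close(iu)
    end subroutine write_field

end program main_loop_sketch
[/fortran]

Checking /Qpar-report again after such a change would show whether the compute loops are still reported as parallelized with the output step present.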

0 Kudos
Bernard
Valued Contributor I
2,488 Views

>>>I'm at work at the moment and I still don't have access to XPerf, although I asked my IT department to install it. I tried on my PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.>>>

Can you post the screenshot from your PC (from when you ran Xperf)?

I would not recommend relying on the percentage description of CPU load. Xperf and Process Explorer provide better and clearer information about the CPU load of your thread(s). They do this by counting CPU cycles instead of sampling at the timer interval (~10 ms).

0 Kudos
Bernard
Valued Contributor I
2,488 Views

>>>I don't know how to check it with the Task Manager. Anyway, I'm working on OpenMP directives, and the Task Manager clearly shows me that the 4 cores are running.>>>

If you want to make sure that the running threads belong to your application, you can also use Process Explorer with its detailed view (including the per-thread call stack). More advanced information can be obtained with the debugger.

0 Kudos
SergeyKostrov
Valued Contributor II
2,488 Views

Hi JB,

>>Bravo! Auto-parallelization works fine when I comment out the output step!!
>>
>>So how can I keep the output step and get auto-parallelization working fine too?
>>
>>Another question: why is execution not faster (and even a little bit slower than mono-processing)?

Thanks for the update; it looks like a light at the end of the tunnel. Regarding the performance problems, I won't make any comments because there are too many unknowns for me. A verification with performance utilities such as Intel VTune or Inspector could show you why it happens.

Note: Is it possible to do a couple of tests with smaller data sets?

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,488 Views

JB D

Looking at your stream(f) subroutine, it essentially rotates sections of an array. This is memory-bandwidth heavy. I cannot see the outer levels of your program, so I will throw something out for you to consider.

Rotation can be accomplished by using modular arithmetic on the indices.

[fortran]
! xBase and yBase are integer offsets kept between calls
xBase = xBase + 1 ! rotate in +x
yBase = yBase + 1 ! rotate in +y
do yRing = 1, yDim
  do xRing = 1, xDim
    x = MOD(xBase + xRing - 1, xDim) + 1
    y = MOD(yBase + yRing - 1, yDim) + 1
    ! use x and y as indices as before
  end do
end do
[/fortran]
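As a self-contained check of this idea, the sketch below reads a small hypothetical integer array `a` through offset indices and compares the result with the CSHIFT intrinsic. With these formulas the logical view equals a circular shift of the data by xBase and yBase, and no array element is moved. The array, its size and the program name are assumptions for illustration only.

[fortran]
program virtual_rotation
    implicit none
    integer, parameter :: yDim = 3, xDim = 4
    integer :: a(yDim,xDim), view(yDim,xDim)
    integer :: xBase, yBase, xRing, yRing, x, y

    do x = 1, xDim
        do y = 1, yDim
            a(y,x) = 10*y + x
        end do
    end do

    xBase = 0
    yBase = 0
    xBase = xBase + 1   ! advance the x offset; no data moves
    yBase = yBase + 1   ! advance the y offset

    do xRing = 1, xDim
        x = mod(xBase + xRing - 1, xDim) + 1
        do yRing = 1, yDim
            y = mod(yBase + yRing - 1, yDim) + 1
            view(yRing,xRing) = a(y,x)   ! logical (yRing,xRing) -> physical (y,x)
        end do
    end do

    ! The offset view must match an explicit circular shift of the data.
    if (any(view /= cshift(cshift(a, yBase, 1), xBase, 2))) error stop 'mismatch'
    print *, 'index offsets reproduce cshift'
end program virtual_rotation
[/fortran]

The trade-off is that every access pays for the MOD arithmetic, so whether this beats copying the data would have to be measured on the real program.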

Jim Dempsey

0 Kudos
Bernard
Valued Contributor I
2,402 Views

>>>They do this by counting CPU cycles instead of sampling at the timer interval (~10 ms).>>>

This is a follow-up.

Sorry if it is not directly related to the topic, but I thought it could shed some light on measuring CPU load as the percentage of time during which the CPU was executing some thread. Because of the quoted timer interval, which can be measured with the clockres tool and is around 10 ms, some tools will report the usage as 0% when a thread has in fact run for a period shorter than the timer interval, so its contribution is not counted. A better option is to use a monitoring tool that counts CPU cycles.

0 Kudos