Intel® Fortran Compiler

Use of only 25% of CPU with Auto-Parallelization

JB_D_
Beginner

Hi,

I'm using Intel Visual Fortran Compiler Pro 11.1 to compile my code on an Intel Core i5 machine.

Because I would like to parallelize the execution of the program, I compile with the "-c /Qparallel" options, and the "/Qpar-report" output shows that almost all of the loops have been parallelized.

But when I execute my program, only 25% of the total CPU resources are allocated to the corresponding process, even though all the processors seem to work simultaneously. I've tried setting the priority of the process to "high" when executing the program, with no effect, and the affinity is set by default to all 4 processors.

I don't know what is going wrong; thanks in advance for any help.

JB

TimP
Honored Contributor III

Did you examine the /Qpar-report output to see whether the important parts of your program are parallelized, or get diagnostics on why they were not?

If your objective is simply to max out your multi-thread meter, you might add /Qpar-threshold0. This asserts that you want to maximize parallelism at the expense of performance.

JB_D_
Beginner

Thank you for answering,

I actually tried the threshold0 option to make sure that all the loops are parallelized, but it doesn't change the CPU usage, even though all the loops are parallelized according to /Qpar-report.

It is as if everything were calculated on a single core: no processor is fully used, the computation seems to be spread over the 4 processors, but with a maximum use of 25% of the total CPU capacity...

Many thanks for your help!

Anonymous66
Valued Contributor I

What percentage of your program's time is spent in the loops? There could be memory bottlenecks or other issues preventing your program from fully utilizing each core.

Annalee

JB_D_
Beginner

The program is a sequence of nested loops (at least 5 steps of 2-level loops). I guess this scheme fits auto-parallelism well, doesn't it?

Do you think that using OpenMP may greatly increase the efficiency of the parallelization? What is weird is that the CPU allocation of my process is always stuck at exactly 25%!

jimdempseyatthecove
Honored Contributor III

Identify a processing-intensive loop that has been reported as parallelized. Run in Debug mode, place a breakpoint in the loop, and run to the breakpoint. Open the Debug window for Threads: how many threads are listed?

Jim Dempsey

TimP
Honored Contributor III

Applying OpenMP may give you more insight; among other things you can check the number of threads assigned within a parallel region, and see whether your loops can be successfully parallelized without hidden transformations used by -Qparallel.
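
For instance, a minimal sketch to check the thread count in a parallel region (assuming you build with /Qopenmp) could look like:

[fortran]
! Minimal sketch: report how many threads a parallel region actually gets
program check_threads
    use omp_lib
    implicit none
!$OMP PARALLEL
    if (omp_get_thread_num() == 0) then
        print *, 'threads in parallel region:', omp_get_num_threads()
    end if
!$OMP END PARALLEL
end program check_threads
[/fortran]

On a 4-core i5 you would expect this to report 4 threads when OpenMP is getting all the cores.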

I suspect you must set /O explicitly along with /Qparallel for it to operate in debug build.

JB_D_
Beginner

Thank you for your answer, I'm going to check that.

JB_D_
Beginner

Hi all,

First, I would like to thank jim and iliyapolak a lot; the debugger and xperf helped me find that there was no parallelization in my code. I found in this forum that I had to check the data dependencies in my loops before using /Qparallel blindly :), and I realized that there's no magic tool for parallelization.

Because my code is fairly light, I tried to use OpenMP directives, mostly to parallelize independent implicit loops in a subroutine. The parallelization works, but my program is slower than before. Here is the code of this routine:

[fortran]

!    ========================================================
!    Streaming step: the population functions are shifted
!        one site along their corresponding lattice direction
!        (no temporary memory is needed)
!    ========================================================
SUBROUTINE stream(f)
    USE simParam
    implicit none

    double precision, INTENT(INOUT):: f(yDim,xDim,0:8)
    double precision:: periodicHor(yDim), periodicVert(xDim)

!$OMP PARALLEL SHARED(f,xDim,yDim) PRIVATE(periodicHor,periodicVert)
 !$OMP SECTIONS
    !$OMP SECTION
    !    -------------------------------------
    !    right direction
    periodicHor   = f(:,xDim,1)
    f(:,2:xDim,1) = f(:,1:xDim-1,1)
    f(:,1,1)      = periodicHor
   
    !$OMP SECTION
    !    -------------------------------------
    !    up direction
    periodicVert    = f(1,:,2)
    f(1:yDim-1,:,2) = f(2:yDim,:,2)
    f(yDim,:,2)     = periodicVert
   
    !$OMP SECTION
    !    -------------------------------------
    !    left direction
    periodicHor     = f(:,1,3)
    f(:,1:xDim-1,3) = f(:,2:xDim,3)
    f(:,xDim,3)     = periodicHor
   
    !$OMP SECTION
    !    -------------------------------------
    !    down direction
    periodicVert  = f(yDim,:,4)
    f(2:yDim,:,4) = f(1:yDim-1,:,4)
    f(1,:,4)      = periodicVert
   
    !$OMP SECTION
    !    -------------------------------------
    !    up-right direction
    periodicVert         = f(1,:,5)
    periodicHor          = f(:,xDim,5)
    f(1:yDim-1,2:xDim,5) = f(2:yDim,1:xDim-1,5)
    f(yDim,2:xDim,5)     = periodicVert(1:xDim-1)
    f(yDim,1,5)          = periodicVert(xDim)
    f(1:yDim-1,1,5)      = periodicHor(2:yDim)
   
    !$OMP SECTION
    !    -------------------------------------
    !    up-left direction
    periodicVert           = f(1,:,6)
    periodicHor            = f(:,1,6)
    f(1:yDim-1,1:xDim-1,6) = f(2:yDim,2:xDim,6)
    f(yDim,1:xDim-1,6)     = periodicVert(2:xDim)
    f(yDim,xDim,6)         = periodicVert(1)
    f(1:yDim-1,xDim,6)     = periodicHor(2:yDim)
       
    !$OMP SECTION
    !    -------------------------------------
    !    down-left direction
    periodicVert         = f(yDim,:,7)
    periodicHor          = f(:,1,7)
    f(2:yDim,1:xDim-1,7) = f(1:yDim-1,2:xDim,7)
    f(1,1:xDim-1,7)      = periodicVert(2:xDim)
    f(1,xDim,7)          = periodicVert(1)
    f(2:yDim,xDim,7)     = periodicHor(1:yDim-1)
   
    !$OMP SECTION
    !    -------------------------------------
    !    down-right direction
    periodicVert       = f(yDim,:,8)
    periodicHor        = f(:,xDim,8)
    f(2:yDim,2:xDim,8) = f(1:yDim-1,1:xDim-1,8)
    f(1,2:xDim,8)      = periodicVert(1:xDim-1)
    f(1,1,8)           = periodicVert(xDim)
    f(2:yDim,1,8)      = periodicHor(1:yDim-1)

  !$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL

END SUBROUTINE stream
[/fortran]

I think this must be caused by a scheduling issue, but I don't know what kind of directive is really efficient in this case. Thank you so much for your help!

JB

Bernard
Valued Contributor I

>>>It is as if everything were calculated on a single core: no processor is fully used, the computation seems to be spread over the 4 processors, but with a maximum use of 25% of the total CPU capacity...>>>

What load was reported by Xperf? Was the Idle process consuming the remaining 75% of CPU time?

JB_D_
Beginner


Hello everybody,

Sorry, I guess I messed up: I thought my first post hadn't been released, so I posted a new one. That's why there are two conversations on this topic.

@Annalee:
>>>If your code sections are small, the overhead involved in running in parallel may be higher than the performance gains>>>

I think you must be right; this routine is just one of the 8 steps within a main loop. But I assumed that this step was the heaviest one because there are nested implicit loops and xDim and yDim are both close to 1000. By the way, is there a specific directive for this kind of array operation? Will setting OMP_NESTED=.TRUE. improve this kind of loop?

@TimP:
I think the tasks are quite well balanced because there is only 1 heavy operation in each section, for instance: f(2:ny,2:nx,8) = f(1:ny-1,1:nx-1,8). So according to you KMP_AFFINITY may help, but I think I would need to know my processor architecture better to use this parameter efficiently, wouldn't I? I tried OMP_SCHEDULE without any improvement.

@iliyapolak:
I'm at work at the moment and I still don't have access to XPerf, despite having asked my IT department to install it. I tried on my own PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.

To see more clearly how parallelization slows my execution, I tried setting OMP_THREAD_LIMIT from 4 down to 1 and noticed that the speed decreases linearly as the number of threads increases.

Many thanks. I'm asking more and more questions not really related to the first topic; should I begin a new conversation?

SergeyKostrov
Valued Contributor II
>>...But when I execute my program, only 25% of the total CPU resources are allocated to the corresponding process, even though all the processors seem to work simultaneously...
>>... there was no parallelization in my code...

Did you check with Task Manager (I assume you use Windows) how many threads are used?

Another question: are there any I/O operations with the file system during processing?
JB_D_
Beginner

Hi Sergey,

I managed to see that there was only one thread running thanks to the debugger; I don't know how to check it with the task manager. Anyway, I'm now working with OpenMP directives, and the task manager clearly shows that the 4 cores are running.

Second, your question about I/O is interesting. I actually write data to a file on each global iteration (my code is a main loop including 8 steps, at the heart of which there are nested loops). Does this influence parallelization? The step in which my program writes data to a file is not enclosed between parallelization directives.

Thank you so much for your help!

JB

SergeyKostrov
Valued Contributor II
>>... I don't know how to check it with the task manager?..

- Start Task Manager
- Select the Processes property page
- Select View in the main menu
- Choose Select Columns... and check Thread Count

>>...I actually write data to a file on each global iteration (my code is a main loop including 8 steps, at the heart of which there are nested loops). Does it influence parallelization?

In that case I would simply comment out that part of the code, rebuild the sources, and repeat all the tests / verifications.
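
If the I/O turns out to be the culprit, you could also keep the change switchable instead of commenting it out; a sketch (the writeOutput flag and the unit number are hypothetical, your existing output statement goes inside):

[fortran]
logical, parameter :: writeOutput = .false.  ! hypothetical switch: set to .true. to restore output
if (writeOutput) then
    write(10,*) f   ! placeholder for your existing per-iteration output
end if
[/fortran]

That way you can rerun the same build with and without the output step and compare.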
JB_D_
Beginner

Bravo! Auto-parallelization works fine when I comment out the output step!!

So how can I keep this output and get auto-parallel working fine too?

Another question: why is execution not faster (and even a little bit slower than single-core processing)?

Bernard
Valued Contributor I

>>>I'm at work at the moment and I still don't have access to XPerf, despite having asked my IT department to install it. I tried on my own PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.>>>

Can you post a screenshot from your PC (from when you executed Xperf)?

I would not recommend relying on the percentage description of CPU load. Xperf and Process Explorer provide better and clearer information about the load your thread(s) put on the CPU. This is done by counting CPU cycles instead of measuring a timer interval (~10 ms).

Bernard
Valued Contributor I

>>>I don't know how to check it with the task manager? Anyway, I'm working on OpenMP directives, and the task manager clearly shows that the 4 cores are running.>>>

If you want to make sure that the running threads belong to your application, you can also use Process Explorer with its detailed view (including a per-thread call stack); more advanced information can be obtained with the debugger.

SergeyKostrov
Valued Contributor II
Hi JB,

>>Bravo! Auto-parallelisation works fine when I comment out the output step!!
>>So how can I keep this and get auto-parallel working fine too?
>>Another question, why is execution not faster (and even a little bit slower than mono-processing)?

Thanks for the update; it looks like there is light at the end of the tunnel. Regarding the performance problems, I wouldn't make any comments because there are too many unknowns for me; a verification with performance utilities like Intel VTune or Intel Inspector could show you why it happens.

Note: Is it possible to do a couple of tests with smaller data sets?
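
For those comparison runs, a simple wall-clock harness around the main loop is usually enough. A sketch (stream and f are from your code; the iteration count is illustrative) using the standard SYSTEM_CLOCK intrinsic:

[fortran]
! Sketch: time the main loop with the standard SYSTEM_CLOCK intrinsic
integer :: t0, t1, rate, it
call system_clock(count_rate=rate)
call system_clock(t0)
do it = 1, 100                 ! or however many global iterations you run
    call stream(f)             ! plus the other steps of your main loop
end do
call system_clock(t1)
print *, 'elapsed seconds:', real(t1 - t0) / real(rate)
[/fortran]

Run it once built with /Qparallel (or OpenMP) and once without, on the same data set, and compare the elapsed times directly.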
jimdempseyatthecove
Honored Contributor III

JB D

Looking at your stream(f) function, it essentially rotates sections of an array, which is memory-bandwidth heavy. I cannot see the outer levels of your program, so I will throw something out for you to consider.

Rotation can also be accomplished by using modulus arithmetic on the indices, so the data never has to move:

[fortran]
xBase = xBase + 1 ! rotate in +x
yBase = yBase + 1 ! rotate in +y
do yRing = 1, yDim
  do xRing = 1, xDim
    x = MOD(xBase + xRing - 1, xDim) + 1
    y = MOD(yBase + yRing - 1, yDim) + 1
    ! use x and y as indices as before
  end do
end do
[/fortran]

Jim Dempsey

Bernard
Valued Contributor I

>>>This is done by counting cpu cycles instead of measuring timer interval(~10ms).>>>

This is a follow-up.

Sorry if it is not directly related to the topic, but I thought it could shed some light on measuring CPU load as a percentage of the time the CPU was executing some thread. Because of the mentioned timer interval, which can be measured with the clockres tool and is around ~10 ms, some tools will report the usage as 0% even though the thread may have run for a period shorter than the timer interval, so its contribution is not counted. The better option is a monitoring tool which counts CPU cycles.
