topic JB D in Intel® Fortran Compiler

Use of only 25% of CPU with Auto-Parallelization

JB_D_ — Tue, 30 Apr 2013 09:12:20 GMT

Hi,

I'm using Intel Visual Fortran Compiler Pro 11.1 to compile my code on an Intel Core i5 architecture.

Because I would like to parallelize the execution of the program, I use the "-c /Qparallel" option at the compilation step, and the "/Qpar-report" option reports that almost all the loops have been parallelized.

But when I run my program, only 25% of the total CPU resource is allocated to the corresponding process, even though all the processors seem to work simultaneously. I've tried setting the process priority to "/high" when I run the program, with no effect, and the affinity is set by default to all 4 processors.

I don't know what is going wrong, thanks in advance for any help.

TimP — Tue, 30 Apr 2013 11:47:46 GMT

Did you examine the output of /Qpar-report to see whether the important parts of your program are parallelized, or to get diagnostics on why not?

If your objective is simply to max out your multiple-thread meter, you might add /Qpar-threshold0. This asserts that you want to maximize parallelism at the expense of performance.

JB_D_ — Tue, 30 Apr 2013 13:51:00 GMT

Thank you for your answer,

I actually tried the /Qpar-threshold0 option to ensure that all the loops are parallelized, but it doesn't change the CPU usage, even though all the loops are parallelized according to /Qpar-report.

It is as if everything were calculated on a single core: no processor is fully used, and the computation seems spread over the 4 processors, but with a maximum use of 25% of the total CPU capacity...

Many thanks for your help !

Anonymous66 — Tue, 30 Apr 2013 14:25:24 GMT

What percentage of your program's run time is spent in the loops? There could be memory bottlenecks or other issues preventing your program from fully utilizing each core.

Annalee

JB_D_ — Tue, 30 Apr 2013 14:52:00 GMT

The program is a sequence of nested loops (at least 5 steps of 2-level loops). I guess this scheme fits auto-parallelization well, doesn't it?

Do you think that using OpenMP might greatly increase the efficiency of the parallelization? What is weird is that the CPU allocation of my process is always stuck at precisely 25%!

jimdempseyatthecove — Tue, 30 Apr 2013 15:33:06 GMT

Identify a processing-intensive loop that has been reported as parallelized. Run in Debug mode, place a breakpoint in the loop, and run to the breakpoint. Open the Debug window for Threads: how many threads are listed?

Jim Dempsey

TimP — Tue, 30 Apr 2013 16:09:11 GMT

Applying OpenMP may give you more insight; among other things you can check the number of threads assigned within a parallel region, and see whether your loops can be successfully parallelized without hidden transformations used by -Qparallel.

I suspect you must set /O explicitly along with /Qparallel for it to operate in debug build.
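
A minimal sketch of the thread-count check described above, assuming the program is built with OpenMP enabled (for example with /Qopenmp). The function name count_team_threads is hypothetical and not part of the original program.

[fortran]
! Hypothetical helper: returns the number of threads that actually
! form the team inside an OpenMP parallel region.
integer function count_team_threads()
    use omp_lib
    implicit none
    integer :: n

    n = 1
    !$omp parallel shared(n)
    !$omp single
    ! omp_get_num_threads() reports the team size only when it is
    ! called from inside a parallel region
    n = omp_get_num_threads()
    !$omp end single
    !$omp end parallel
    count_team_threads = n
end function count_team_threads
[/fortran]

If this returns 1 on a 4-core machine, the parallel regions are running serially, which would match a process pinned at 25% CPU.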

JB_D_ — Tue, 30 Apr 2013 16:35:00 GMT

Thank you for your answer, I'm going to check that.

JB_D_ — Thu, 02 May 2013 12:33:00 GMT

Hi all,

First I would like to thank Jim and iliyapolak very much: the debugger and xperf helped me find that there was no parallelization in my code. I found in this forum that I had to check data dependencies in my loops before using /Qparallel blindly :), and I realized that there's no magic tool for parallelization.

Because my code is fairly light, I tried using OpenMP directives, mostly to parallelize independent implicit loops in a subroutine. The parallelization works, but my program is slower than before. Here is the code of this routine:

[fortran]

!    ========================================================
!    Streaming step: the population functions are shifted
!        one site along their corresponding lattice direction
!        (no temporary memory is needed)
!    ========================================================
SUBROUTINE stream(f)
    USE simParam
    implicit none

    double precision, INTENT(INOUT):: f(yDim,xDim,0:8)
    double precision:: periodicHor(yDim), periodicVert(xDim)

!$OMP PARALLEL SHARED(f,xDim,yDim) PRIVATE(periodicHor,periodicVert)
!$OMP SECTIONS
    !$OMP SECTION
    !    -------------------------------------
    !    right direction
    periodicHor   = f(:,xDim,1)
    f(:,2:xDim,1) = f(:,1:xDim-1,1)
    f(:,1,1)      = periodicHor

    !$OMP SECTION
    !    -------------------------------------
    !    up direction
    periodicVert    = f(1,:,2)
    f(1:yDim-1,:,2) = f(2:yDim,:,2)
    f(yDim,:,2)     = periodicVert

    !$OMP SECTION
    !    -------------------------------------
    !    left direction
    periodicHor     = f(:,1,3)
    f(:,1:xDim-1,3) = f(:,2:xDim,3)
    f(:,xDim,3)     = periodicHor

    !$OMP SECTION
    !    -------------------------------------
    !    down direction
    periodicVert = f(yDim,:,4)
    f(2:yDim,:,4) = f(1:yDim-1,:,4)
    f(1,:,4)      = periodicVert

    !$OMP SECTION
    !    -------------------------------------
    !    up-right direction
    periodicVert         = f(1,:,5)
    periodicHor          = f(:,xDim,5)
    f(1:yDim-1,2:xDim,5) = f(2:yDim,1:xDim-1,5)
    f(yDim,2:xDim,5)     = periodicVert(1:xDim-1)
    f(yDim,1,5)          = periodicVert(xDim)
    f(1:yDim-1,1,5)      = periodicHor(2:yDim)

    !$OMP SECTION
    !    -------------------------------------
    !    up-left direction
    periodicVert           = f(1,:,6)
    periodicHor            = f(:,1,6)
    f(1:yDim-1,1:xDim-1,6) = f(2:yDim,2:xDim,6)
    f(yDim,1:xDim-1,6)     = periodicVert(2:xDim)
    f(yDim,xDim,6)         = periodicVert(1)
    f(1:yDim-1,xDim,6)     = periodicHor(2:yDim)

    !$OMP SECTION
    !    -------------------------------------
    !    down-left direction
    periodicVert         = f(yDim,:,7)
    periodicHor          = f(:,1,7)
    f(2:yDim,1:xDim-1,7) = f(1:yDim-1,2:xDim,7)
    f(1,1:xDim-1,7)      = periodicVert(2:xDim)
    f(1,xDim,7)          = periodicVert(1)
    f(2:yDim,xDim,7)     = periodicHor(1:yDim-1)

    !$OMP SECTION
    !    -------------------------------------
    !    down-right direction
    periodicVert       = f(yDim,:,8)
    periodicHor        = f(:,xDim,8)
    f(2:yDim,2:xDim,8) = f(1:yDim-1,1:xDim-1,8)
    f(1,2:xDim,8)      = periodicVert(1:xDim-1)
    f(1,1,8)           = periodicVert(xDim)
    f(2:yDim,1,8)      = periodicHor(1:yDim-1)

!$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL

END SUBROUTINE stream
[/fortran]

I think this must be caused by a scheduling issue, but I don't know what kind of directive is really efficient in that case. Thank you so much for your help!

Bernard — Thu, 02 May 2013 19:32:53 GMT

>>>It is as if everything were calculated on a single core: no processor is fully used, and the computation seems spread over the 4 processors, but with a maximum use of 25% of the total CPU capacity...>>>

What load was reported by Xperf? Was the Idle thread consuming the remaining 75% of CPU time?

JB_D_ — Fri, 03 May 2013 10:16:34 GMT

Hello everybody,

Sorry, I think I caused some confusion: my first post wasn't published immediately, so I posted a new one. That's why there are two conversations on this topic.

@Annalee:
>>>If your code sections are small, the overhead involved in running in parallel may be higher than the performance gains>>>

I think you must be right; this routine is just one of the 8 steps within a main loop. But I assumed that this step was the heaviest because there are nested implicit loops and xDim and yDim are almost equal to 1000. By the way, is there a specific directive for this kind of array operation? Would setting OMP_NESTED=.TRUE. improve this kind of loop?

@TimP:
I think the tasks are quite well balanced because there is only 1 heavy operation in each section, for instance: f(2:ny,2:nx,8) = f(1:ny-1,1:nx-1,8). So according to you KMP_AFFINITY may help, but I think I should know my processor architecture better to use this parameter efficiently, shouldn't I? I tried OMP_SCHEDULE without any improvement.

@iliyapolak:
I'm at work at the moment and I still don't have access to XPerf, although I asked my IT department to install it. I tried on my PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.

To better see how parallelization slows my execution, I varied OMP_THREAD_LIMIT from 4 down to 1 and noticed that speed decreases linearly as the number of threads increases.

Many thanks. I keep asking more questions not really related to the original topic; may I begin a new conversation?
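
On the question of a directive for whole-array operations: OpenMP provides WORKSHARE, which divides Fortran array assignments among the threads of a team. A minimal sketch follows, assuming a hypothetical routine shift_right that moves a single 2-D population one site in +x with periodic wrap (the same effect as the "right direction" section of stream). The names shift_right, g, ny and nx are illustrative and not taken from the original program.

[fortran]
! Hypothetical routine: shift g one column to the right with periodic
! wrap-around, letting OpenMP divide the array assignment among threads.
subroutine shift_right(g, ny, nx)
    implicit none
    integer, intent(in) :: ny, nx
    double precision, intent(inout) :: g(ny, nx)

    ! cshift with shift=-1 along dim=2 moves column j to column j+1
    ! and wraps the last column around to column 1
    !$omp parallel workshare shared(g)
    g = cshift(g, shift=-1, dim=2)
    !$omp end parallel workshare
end subroutine shift_right
[/fortran]

How well WORKSHARE is actually parallelized depends on the compiler, and the operation only moves memory, so a speedup is not guaranteed; it is worth timing against the serial version.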

SergeyKostrov — Fri, 03 May 2013 13:33:07 GMT

>>...But when I run my program, only 25% of the total CPU resource is allocated to the corresponding process, even though all the processors seem to work simultaneously...
>>
>>... there was no parallelization in my code...

Did you check with Task Manager (I assume you use Windows) how many threads are used?

Another question: are there any I/O operations with the file system during processing?

JB_D_ — Fri, 03 May 2013 14:20:26 GMT

Hi Sergey,

I managed to see that there was only one thread running thanks to the debugger; I don't know how to check it with the Task Manager. Anyway, I'm working on OpenMP directives, and the Task Manager clearly shows me that the 4 cores are running.

Second, your question about I/O is interesting. I actually write data to a file at each global iteration (my code is a main loop including 8 steps at the heart of which there are nested loops). Does it influence parallelization? The step in which my program writes data to a file is not enclosed in parallelization directives.

Thank you so much for your help!

SergeyKostrov — Fri, 03 May 2013 14:31:59 GMT

>>... I don't know how to check it with the Task Manager?..

- Start Task Manager
- Select the Processes property page
- Select View in the main menu
- Select "Select Columns..." and check Thread Count

>>...I actually write data to a file at each global iteration (my code is a main loop including 8 steps at the heart of which there are nested loops). Does it influence parallelization?

In that case I would simply comment out that part of the code, rebuild the sources, and repeat all tests / verifications.

JB_D_ — Fri, 03 May 2013 14:58:26 GMT

Bravo! Auto-parallelization works fine when I comment out the output step!!

So how can I keep the output and still get auto-parallelization working?

Another question: why is execution not faster (and even a little slower than single-threaded execution)?

Bernard — Fri, 03 May 2013 16:29:10 GMT

>>>I'm at work at the moment and I still don't have access to XPerf, although I asked my IT department to install it. I tried on my PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.>>>

Can you post the screenshot from your PC (from when you ran Xperf)?

I would not recommend relying on the percentage figure for CPU load. Xperf and Process Explorer provide better and clearer information about the CPU load of your thread(s). This is done by counting CPU cycles instead of sampling at a timer interval (~10 ms).

Bernard — Fri, 03 May 2013 16:36:53 GMT

>>>don't know how to check it with the Task Manager. Anyway, I'm working on OpenMP directives, and the Task Manager clearly shows me that the 4 cores are running.>>>

If you want to ensure that the running threads belong to your application, you can also use Process Explorer with its detailed view (including the per-thread call stack). More advanced information can be obtained with the debugger.

SergeyKostrov — Sat, 04 May 2013 01:32:00 GMT

Hi JB,

>>Bravo! Auto-parallelization works fine when I comment out the output step!!
>>
>>So how can I keep the output and still get auto-parallelization working?
>>
>>Another question: why is execution not faster (and even a little slower than single-threaded execution)?

Thanks for the update; it looks like light at the end of the tunnel. Regarding the performance problems I wouldn't make any comments, because there are too many unknowns for me; a verification with performance utilities such as Intel VTune or Inspector could show you why it happens.

Note: Is it possible to do a couple of tests with smaller data sets?
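
One way to run small-data-set tests like those suggested above is to time a simple array copy for different thread counts. The sketch below assumes OpenMP is enabled; the function time_copy and its arguments are hypothetical and stand in for whichever routine of the real program is being measured.

[fortran]
! Hypothetical timing helper: elapsed wall-clock seconds for `reps`
! column-wise copies of an n-by-n array using `nthreads` threads.
double precision function time_copy(nthreads, n, reps)
    use omp_lib
    implicit none
    integer, intent(in) :: nthreads, n, reps
    double precision, allocatable :: a(:,:), b(:,:)
    double precision :: t0
    integer :: j, r

    allocate(a(n,n), b(n,n))
    a = 1.0d0
    b = 0.0d0
    call omp_set_num_threads(nthreads)

    t0 = omp_get_wtime()
    do r = 1, reps
        ! each column is copied independently, so the loop is safe to split
        !$omp parallel do shared(a, b) private(j)
        do j = 1, n
            b(:,j) = a(:,j)
        end do
        !$omp end parallel do
    end do
    time_copy = omp_get_wtime() - t0

    ! read the result so the copy cannot be optimized away;
    ! -1 flags an unexpected value
    if (b(1,1) /= 1.0d0) time_copy = -1.0d0
    deallocate(a, b)
end function time_copy
[/fortran]

Calling it with nthreads = 1, 2 and 4 for a small and a large n gives a rough picture: if the time does not drop as threads are added, the loop is more likely limited by memory bandwidth or threading overhead than by computation.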

jimdempseyatthecove — Sat, 04 May 2013 12:29:10 GMT

JB D

Looking at your stream(f) subroutine, it essentially rotates sections of an array. This is memory-bandwidth heavy. I cannot see the outer levels of your program, so I will throw something out for you to consider.

Rotation can be accomplished by using modulus arithmetic on the indices.

[fortran]
xBase = xBase + 1 ! rotate in +x
yBase = yBase + 1 ! rotate in +y
do yRing = 1, yDim
    do xRing = 1, xDim
        ! map the ring position to a storage index in 1..xDim / 1..yDim
        x = MOD(xBase + xRing - 1, xDim) + 1
        y = MOD(yBase + yRing - 1, yDim) + 1
        ! use x and y as indices as before
    end do
end do
[/fortran]

Jim Dempsey
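
The index mapping above can be wrapped in a small function, so that the data stay in place and only an offset changes at each step. This is a sketch under assumptions: ring_index is a hypothetical name, and the offset is assumed to be non-negative.

[fortran]
! Hypothetical helper: map a logical ring position pos (1..n) to a
! storage index in 1..n, given the accumulated rotation offset base.
! Assumes base >= 0 so that MOD never returns a negative value.
pure integer function ring_index(base, pos, n)
    implicit none
    integer, intent(in) :: base, pos, n
    ring_index = mod(base + pos - 1, n) + 1
end function ring_index
[/fortran]

For example, with n = 4 and base = 1, logical positions 1, 2, 3, 4 map to storage indices 2, 3, 4, 1, so reading a(ring_index(1, pos, 4)) sees the array rotated by one element without any copy.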