Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

trouble getting parallel do to work

John_Paine
Beginner
520 Views
I've been working on using multiple cores to speed up some calculations and have failed to work out how to get my code to use OMP to perform the calculations in parallel. The cut-down code block looks like this:


c loop over the x cells
do ii=1, mx+1

c set up the x coordinates
....

c loop over the y cells
do jj=1, my+1

c set up the y coordinates
...

c loop over the z cells
!OMP PARALLEL

!OMP SINGLE
if(OMP_IN_PARALLEL()) then
write(*,'(a,3i8)')'# of threads: ',OMP_GET_MAX_THREADS(),OMP_GET_NUM_THREADS(),OMP_GET_THREAD_NUM()
endif
!OMP END SINGLE

!OMP DO SHARED(gmod) PRIVATE(nface,i,ztop,zbot,FieldValue) schedule(dynamic,10) COLLAPSE(2) REDUCTION(+-:gmod)
do kk=k0, mz+1

c set up z coordinates
polyCoord(3,1)=ztop
...
polyCoord(3,8)=zbot

c loop over the faces
do nface=1,3

c initialise the calculated values
do i=1,6
FieldValue(1,i)=0
end do

c check which face
if(nface.eq.1) then

c top face
call calculate_face_response_v2(sus,dens , geomag , cxI , cyI , czI , magx , magy , magz ,
3 , polyCoord, (nface-1)*2+1, indFace,0, 1 , locCoord, FieldValue, ierr )

c add it to the relevant cells
if(kk.le.mz) then
gmod(ii,jj,kk) = gmod(ii,jj,kk)-FieldValue(1,inds)
endif
if(kk.gt.1) then
gmod(ii,jj,kk-1) = gmod(ii,jj,kk-1)+FieldValue(1,inds)
endif

elseif(nface.eq.2) then

c front face
call calculate_face_response_v2(sus,dens , geomag , cxI , cyI , czI , magx , magy , magz ,
4 , polyCoord, 2, indFaceRect,0, 1 , locCoord, FieldValue, ierr )

c add it to the relevant cells
...

elseif(nface.eq.3) then

c left face
call calculate_face_response_v2(sus,dens , geomag , cxI , cyI , czI , magx , magy , magz ,
4 , polyCoord, 3, indFaceRect,0, 1 , locCoord, FieldValue, ierr )

c add it to the relevant cells
...

endif

end do
enddo
!OMP END DO
!OMP END PARALLEL

enddo
enddo

The code is calculating the response of a block model and storing the results in the real*4 gmod array dimensioned (mx,my,mz). The real*4 polyCoord array is dimensioned (3,8) and holds the xyz coords of the 8 corners of a rectangular prism. The real*4 FieldValue array is dimensioned (1,6) and holds the values calculated for the face. The other parameters to the call to calculate_face_response_v2 are not changed within the loop. Some of the parameters are held in common blocks and others are passed in as parameters. The response calculation routine does not use any common blocks.

The problem is that the code compiles fine using /Qopenmp, and the compiled program reports that there are 8 threads available (hyperthreaded quad-core processor), but the do loop runs on only a single thread. I've checked this with Process Monitor, Process Explorer and the Intel concurrency checker, and it definitely only uses one thread.

I added the Intel sample OMP code just before my loops, and the Intel code operates as expected and uses 8 threads, but mine does not. I've checked the asm code generated: the Intel code has calls to routines with names like ___kmpc_global_thread_num just after the !OMP PARALLEL line, but there are no such calls associated with my !OMP PARALLEL line.

I've tried dropping various parts of my code to see if I can isolate the problem, but nothing I change allows the code to use multiple cores.

Does anyone have any suggestions as to what is causing the compiler to ignore my OMP directives?

Thanks in advance
John
0 Kudos
10 Replies
jimdempseyatthecove
Honored Contributor III
520 Views

Begin by changing all "!OMP" to "!$OMP",
i.e. you are missing the $ in front of OMP.
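
For illustration, here is a minimal self-contained sketch (compile with /Qopenmp; the names are made up, not your actual code) of what the corrected sentinels look like:

      program sentinel_demo
      use omp_lib
      implicit none
      real gmod
      integer kk
      gmod = 0.0
c !$OMP in columns 1-5 makes a directive; plain !OMP is just a
c comment, which is why no __kmpc calls appeared for the loop
!$OMP PARALLEL
!$OMP SINGLE
      write(*,'(a,i8)') '# of threads: ', OMP_GET_NUM_THREADS()
!$OMP END SINGLE
!$OMP DO SCHEDULE(DYNAMIC,10) REDUCTION(+:gmod)
      do kk = 1, 1000
      gmod = gmod + real(kk)
      end do
!$OMP END DO
!$OMP END PARALLEL
      write(*,*) 'gmod =', gmod
      end

As a side note, SHARED is a clause of PARALLEL rather than of DO, and the reduction operator should be a plain + (a subtraction is just the addition of a negated value), but the missing $ is why nothing ran in parallel.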

Jim Dempsey
0 Kudos
John_Paine
Beginner
520 Views
Doh!

I had already worked around the problem by shifting my code into the Intel OMP block and found that it worked, but still couldn't see the obvious syntax problem. Need more coffee!

The next problem is that the code doesn't speed up by much, even though it is using all 8 threads at about 80-90%. Can the overhead of the OMP PARALLEL setup really be that large? Of course I first need to do some more testing to ensure that the code is working properly (and to avoid having to stuff both my feet into my mouth at the same time).

Many thanks
John

0 Kudos
jimdempseyatthecove
Honored Contributor III
520 Views

John,

You have a substantial amount of code inside the parallel region's OMP DO block. You should see good speedup. Something must account for this.

I notice the code sample is using COLLAPSE(2) on the OMP DO, however I only see one loop prior to the initialization code. Could be a problem with your posting a code snippet, but maybe not.

Look inside your CALLed subroutines to see if they are using or calling something that uses critical sections (e.g. RANDOM enters/exits a critical section, ALLOCATE/DEALLOCATE enters/leaves a critical section, you may...), and !$OMP ATOMIC also slows down code.
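
For example, a pattern like this inside a called routine will make the threads queue up on the heap lock even though the enclosing loop is parallel (a hypothetical sketch, not your routine):

      subroutine face_work(n, result)
      implicit none
      integer n
      real result
      real, allocatable :: work(:)
c ALLOCATE/DEALLOCATE enter and leave a critical section in the
c runtime heap, so per-call allocation serializes the threads
      allocate(work(n))
      work = 1.0
      result = sum(work)
      deallocate(work)
      end

Hoisting the allocation out of the parallel loop, or using a fixed-size local array, removes that contention.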

Jim
0 Kudos
John_Paine
Beginner
520 Views

Quoting - jimdempseyatthecove

John,

You have a substantial amount of code inside the parallel region's OMP DO block. You should see good speedup. Something must account for this.

I notice the code sample is using COLLAPSE(2) on the OMP DO, however I only see one loop prior to the initialization code. Could be a problem with your posting a code snippet, but maybe not.

Look inside your CALLed subroutines to see if they are using or calling something that uses critical sections (e.g. RANDOM enters/exits a critical section, ALLOCATE/DEALLOCATE enters/leaves a critical section, you may...), and !$OMP ATOMIC also slows down code.

Jim

Hi Jim,

The COLLAPSE was just part of my testing and once I corrected the OMP line, I removed those extra bits. The called routine is pretty simple but computationally expensive and doesn't do any allocating or have critical sections (as far as I know). But it does seem odd that the parallel code is using all 8 threads but is no faster than the serial version.

I'll test out the various options and let you know once I figure it out.

One other problem I encountered is that when I try compiling an x64 version the linker complains that it cannot find the libiomp5md.lib library. This file does exist in the ia32 folder but not in the Intel64 folder (should I have an x64 folder, or is Intel64 the same as x64?). Do you know where the lib file is, or how to get around this problem?

Many thanks
John
0 Kudos
jimdempseyatthecove
Honored Contributor III
520 Views

John,

You might be experiencing a memory bandwidth problem.

In Fortran, Array(I,J,K) and Array(I+1,J,K) are in adjacent memory (and candidates for vectorization, and candidates for co-residency in the same cache line).

Restructure your data or loops such that the innermost loop indexes the leftmost index.
In C/C++, the rightmost index varies with adjacent data.
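
A sketch of the cache-friendly ordering (bounds and names are illustrative):

      program stride_demo
      implicit none
      integer, parameter :: mx=64, my=64, mz=64
      real gmod(mx,my,mz)
      integer ii, jj, kk
c Fortran is column-major, so the innermost loop should run over
c the leftmost index to touch adjacent memory (unit stride)
      do kk = 1, mz
      do jj = 1, my
      do ii = 1, mx
      gmod(ii,jj,kk) = 0.0
      end do
      end do
      end do
      write(*,*) gmod(mx,my,mz)
      end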

Perform these edits first with OpenMP disabled. Verify correctness before you re-enable OpenMP.

Jim

0 Kudos
John_Paine
Beginner
520 Views
Quoting - tim18
The ia32 (32-bit) and Intel64 (64-bit) installers come separately and in a combined version. Either way, you can add the one you didn't install and run under the same license. If both are installed, you will have both ia32 and intel64 (which matches the Microsoft amd64) folders inside both the Intel lib and bin folders.

I had the 64-bit version installed, but obviously I'd missed an update somewhere along the line. After running the latest Intel64 update, the linking now works fine.

Many thanks
John
0 Kudos
John_Paine
Beginner
520 Views

Quoting - jimdempseyatthecove

John,

You might be experiencing a memory bandwidth problem.

In Fortran, Array(I,J,K) and Array(I+1,J,K) are in adjacent memory (and candidates for vectorization, and candidates for co-residency in the same cache line).

Restructure your data or loops such that the innermost loop indexes the leftmost index.
In C/C++, the rightmost index varies with adjacent data.

Perform these edits first with OpenMP disabled. Verify correctness before you re-enable OpenMP.

Jim


Hi Jim,

Thanks for the suggestions. I approached it a little differently, by using a small local array to hold the results for the innermost (k) loop and then storing the results after the parallel section ended. The resulting speedup in elapsed time was about 3 times on a hyperthreaded quad-core processor compared to the serial version on the same computer. There is a fair bit of overhead outside the main parallelised loop, so this speedup is probably close to the maximum I can expect.
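
In outline, the pattern I used is roughly this (a sketch with made-up names and bounds, not the actual code):

      program column_demo
      implicit none
      integer, parameter :: mz = 100
      real gtmp(mz), gmod(mz)
      integer kk
      gmod = 0.0
      gtmp = 0.0
c each thread accumulates into its own copy of the small array;
c OpenMP sums the private copies when the loop ends (Fortran
c permits array reduction variables)
!$OMP PARALLEL DO REDUCTION(+:gtmp)
      do kk = 1, mz
      gtmp(kk) = gtmp(kk) + real(kk)
      end do
!$OMP END PARALLEL DO
c fold the column of results into the big array serially
      do kk = 1, mz
      gmod(kk) = gmod(kk) + gtmp(kk)
      end do
      write(*,*) gmod(mz)
      end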

Interestingly, the total CPU time for the multicore version was about 3 times greater than for the single-core version. It looks to me like the hyperthreading is giving the multicore version 8 threads to work with, but each thread is doing twice as much work as the single-threaded code. To test this out I set the number of threads to 4 and changed the affinity settings so that the exe would only use one thread per physical core. The elapsed time for the 4-thread version was almost identical to the 8-thread version, but the 4-thread version used only half the amount of CPU time. I think that this pretty much conclusively shows that the 2 threads/physical core provided by hyperthreading is not providing any speedup for this particular application. Presumably the doubling of the total CPU time indicates that on each core one thread is spinning half the time while the other one is actually calculating.

I did test out the reordering of the loops as you suggested (k then j then i rather than the i then j then k that I originally wrote), but the elapsed time was almost identical to just using a temporary array to hold the results of the innermost k loop. So I'll stick with the logical order I started with rather than the more storage-sensitive contiguous order.

I'm pretty happy with the speedup obtained using the OMP directives and can't really see any other big gains to be had in the remaining code. One thing that would be good would be to limit the threads so that there was only one per physical core, but I can't see any simple way to do that at the moment.

Many thanks
John

0 Kudos
jimdempseyatthecove
Honored Contributor III
520 Views

Good work John,

If you are using Intel C++ (or a compiler that supports OpenMP 3.x), try setting the environment variable

KMP_AFFINITY=granularity=fine,compact,1,0

Then set the number of threads to 1/2 the number on the system.
Don't do this for systems without HT.
An easy way might be to create your own environment variable setting to indicate whether HT is on the system.
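
For example, from the command prompt before launching (myapp.exe is a placeholder for your program):

set KMP_AFFINITY=granularity=fine,compact,1,0
set OMP_NUM_THREADS=4
myapp.exe

The compact,1,0 placement assigns one thread per physical core before using the HT siblings.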

Jim Dempsey
0 Kudos
John_Paine
Beginner
520 Views

Quoting - jimdempseyatthecove

Good work John,

If you are using Intel C++ (or a compiler that supports OpenMP 3.x), try setting the environment variable

KMP_AFFINITY=granularity=fine,compact,1,0

Then set the number of threads to 1/2 the number on the system.
Don't do this for systems without HT.
An easy way might be to create your own environment variable setting to indicate whether HT is on the system.

Jim Dempsey

Hi Jim,

Thanks for that. I'm using the Intel Fortran compiler and it does support the KMP_AFFINITY interface. That should help me restrict the threads to the physical cores, but I'm leaning towards deferring getting down to that level until I get more experience and feedback from users. My test system seems to be equally responsive whether the application is using 8 threads or 4 threads. So even though the total CPU time is doubled for the 8-thread case, it doesn't appear to be impacting on system performance. My next test will be trying to run two 8-thread apps and comparing that with two 4-thread apps, as that might provide better information as to whether setting the affinity is worthwhile or not.

On the environmental side of the problem, I guess that the 8- and 4-thread cases will use pretty much the same amount of energy, as in both cases the cores are running at full power for about the same length of time. The fact that the total CPU time for the 8-thread case is twice that of the 4-thread case indicates to me that CPU time isn't a good proxy for energy consumption. Probably I should track the temperature in the cores to check this.

Many thanks for your help
John
0 Kudos
jimdempseyatthecove
Honored Contributor III
520 Views
John,

When you run the two copies with 4 threads each and you are playing with the affinities, remember not to put both apps on the same HT siblings. Running two apps will (may) have the effect of listening to a dual-prop airplane with engines at different RPMs (WoW, WoW, WoW, ...). At some points both HT siblings will be fighting for the same resources, while at other times they will not. When they do not, you will see a performance boost. In the single-app situation the HT siblings tend to get in the way of each other (since you start and stop the same functional loops). Read the doc on KMP_AFFINITY; each process can override KMP_AFFINITY with a local copy (i.e. one app can be set to use the even-numbered logical processors, the other app the odd-numbered ones).
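
For example (a sketch; the logical processor numbering depends on how your system enumerates the HT siblings):

rem first app pinned to the even-numbered logical processors
set KMP_AFFINITY=granularity=fine,proclist=[0,2,4,6],explicit
start app1.exe

rem second app pinned to the odd-numbered ones
set KMP_AFFINITY=granularity=fine,proclist=[1,3,5,7],explicit
start app2.exe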

The energy consumption in the system null task is much lower than in an HT shoulder-shoving contest. And newer series of processors (you will upgrade at some point in the future) have energy-saving speed-stepping technology.

Jim Dempsey
0 Kudos
Reply