Intel® Fortran Compiler

Interpreting AMD CodeAnalyst, OMP speed

gib
New Contributor II
1,154 Views
I know this is heresy - Steve, don't read any further!

I've been doing a bit of testing with CodeAnalyst, trying to get some insight into why my program runs only about twice as fast with 4 processors as with 1. The most time-consuming parts of the code have been parallelised with OpenMP.

I notice that in the 4-thread case, the timer-based profile shows that my program is using about 60% of the CPU, while about 30% is split between libiomp5md.dll and ntkrnlpa.exe. Can someone tell me if this non-program CPU usage is unusually high?

As far as I can see my parallel section is quite straightforward. I am doing computations on arrays that are defined on a 3D lattice, and my approach is to divide the region up into 2N slices (N = number of CPUs), and process the odd-numbered slices in one sweep and the even-numbered slices in the next sweep. For example, if a typical array is dimensioned A(100,100,100), and N = 2, the slices will correspond to A(1:100,1:100,z_lo:z_hi) where

slice   z_lo   z_hi
  1       1     25
  2      26     50
  3      51     75
  4      76    100

I do this consistently, and expect to avoid any simultaneous writes by different processors to adjacent memory locations.

Here is how I do the parallelisation:

real :: c1(100,100,100), c2(100,100,100)
...
do sweep = 0,1
!$omp parallel do
   do kpar = 0,N-1
      call par_sub(c1,c2,sweep,kpar)
   enddo
!$omp end parallel do
enddo

...

subroutine par_sub(c1,c2,sweep,kpar)
real :: c1(:,:,:), c2(:,:,:)
integer :: sweep, kpar
...
z_lo = zoffset(sweep+2*kpar) + 1
z_hi = zoffset(sweep+2*kpar+1)
do z = z_lo,z_hi
   do y = 1,NY
      do x = 1,NX
         ! do processing involving c1, c2 and other arrays at (x,y,z)
         ! and neighbouring lattice points
      enddo
   enddo
enddo
...
end subroutine
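
The zoffset array used in par_sub is not shown above. A minimal sketch of how it could be filled for the equal-thickness slices in the table, assuming N is a named constant and the z-extent NZ is divisible by 2*N (an illustration only, not the actual code):

integer :: zoffset(0:2*N)   ! zoffset(i) = last z index of slice i, zoffset(0) = 0
integer :: i

do i = 0, 2*N
   zoffset(i) = i*NZ/(2*N)
enddo
! For N = 2 and NZ = 100 this gives zoffset = (/ 0, 25, 50, 75, 100 /),
! i.e. slice 1 is z = 1..25, slice 2 is z = 26..50, and so on.

Note also that because par_sub declares c1 and c2 as assumed-shape arrays, the caller needs an explicit interface for it (e.g. par_sub placed in a module, or made an internal procedure with contains).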
0 Kudos
12 Replies
TimP
Honored Contributor III
1,154 Views
Why not run openmp_profile? That should confirm whether the time spent in libiomp5 is due to imbalance, whether it is spread evenly between the threads, and which code region is involved.
As to explaining it, the identity of your platform could be important, and whether you have taken steps to permit each processor to keep its data locally.
What I make of your explanation is that the data regions of the 2 threads are interleaved. It may be better to allow each CPU to work on 1 contiguous block.
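
For illustration only (not code from this thread): if the work inside par_sub has no dependence between neighbouring slices within a sweep, one contiguous block per thread could be processed in a single pass, using a hypothetical variant of par_sub that takes explicit z bounds:

! Sketch: one contiguous z-block per thread, single parallel pass.
! par_sub_zrange is hypothetical: like par_sub, but taking z_lo and z_hi directly.
! Assumes NZ is divisible by N and that writes near block boundaries do not
! conflict between threads.
!$omp parallel do
do kpar = 0, N-1
   call par_sub_zrange(c1, c2, kpar*NZ/N + 1, (kpar+1)*NZ/N)
enddo
!$omp end parallel do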
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,154 Views

Place a loop around your do sweep loop so that you get a reasonable computational run time. Then look at the percentage of time spent in libiomp5md.dll and ntkrnlpa.exe. If it is still high, do you have any Atomic or Critical sections?
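
For example, something along these lines (a sketch; the repetition count and the omp_get_wtime calls are illustrative additions, not part of the original code):

! Sketch: repeat the two-sweep update enough times that the parallel work
! dominates the run, then compare the wall-clock time with the profile.
use omp_lib                      ! provides omp_get_wtime()
double precision :: t_start
integer :: rep
...
t_start = omp_get_wtime()
do rep = 1, 100                  ! pick a count that gives a run of many seconds
   do sweep = 0, 1
!$omp parallel do
      do kpar = 0, N-1
         call par_sub(c1, c2, sweep, kpar)
      enddo
!$omp end parallel do
   enddo
enddo
print *, 'elapsed seconds:', omp_get_wtime() - t_start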

Jim Dempsey
0 Kudos
gib
New Contributor II
1,154 Views
Quoting - tim18
Why not run openmp_profile? That should confirm whether the time spent in libiomp5 is due to imbalance, whether it is spread evenly between the threads, and which code region is involved.
As to explaining it, the identity of your platform could be important, and whether you have taken steps to permit each processor to keep its data locally.
What I make of your explanation is that the data regions of the 2 threads are interleaved. It may be better to allow each CPU to work on 1 contiguous block.
Thanks. How should I make each CPU work on a contiguous block? BTW the code I presented is slightly simplified - in fact the arrays c1 and c2 each have a 4th dimension (small size, between 1 and 4, e.g. c1(100,100,100,2)). I wasn't sure whether to use (100,100,100,2) or (2,100,100,100) or whether it matters.

I have never used openmp_profile, will try this.

The machine is an Intel quad core.

I haven't taken steps to permit each processor to keep its data locally, and I don't know what that would entail. Could you explain please? Currently c1, c2 and other big arrays used in par_sub are all global.
0 Kudos
gib
New Contributor II
1,154 Views

Place a loop around your do sweep loop so that you get a reasonable computational run time. Then look at the percentage of time spent in libiomp5md.dll and ntkrnlpa.exe. If it is still high, do you have any Atomic or Critical sections?

Jim Dempsey
Not sure if I understand, Jim. The sweep code is called repeatedly, and the run time for the timings I showed was CodeAnalyst's default, about 10 sec I think. There are no parallel instructions beyond what I showed, i.e. just the parallel do.
0 Kudos
TimP
Honored Contributor III
1,154 Views
If the entire extent of this small dimension is always processed together, particularly if the short loop can be unrolled completely, it may be better to put it as the first dimension.
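
In Fortran the first subscript varies fastest in memory, so with the extra dimension of extent 2 mentioned earlier the difference is roughly this (a sketch, with hypothetical array names):

real :: a(2,100,100,100)   ! the 2 components at a given (x,y,z) sit next to
                           ! each other in memory; the short inner loop is
                           ! contiguous and easily unrolled
real :: b(100,100,100,2)   ! the 2 components at a given (x,y,z) are
                           ! 100*100*100 elements (about 4 MB) apart
...
do z = 1, NZ
   do y = 1, NY
      do x = 1, NX
         do m = 1, 2
            ! work on a(m,x,y,z): contiguous, cache-friendly
         enddo
      enddo
   enddo
enddo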
It seemed from your description that the threads may be operating on alternate blocks of memory. At each boundary between threads, there may be cache lines of the shared arrays which must be fetched by both threads, or even modified by one while the other has accessed it, so you may get delays which you would like to minimize.
If your platform has any NonUniformMemory characteristics, assuming you use default OpenMP static scheduling, and set KMP_AFFINITY, you could give the OpenMP library a chance to distribute the memory assignments efficiently by using the same OpenMP parallel scheme for the first initialization of the global data as will be used in the bulk of the calculation. You may not spend enough time in "first touch" for parallelization to matter for performance at that stage, but it could gain later on, particularly if your platform has a NUMA option in the BIOS, and you select that option. On such platforms, the non-NUMA option is intended to slow down all memory accesses equally, removing much of the advantage of correct memory placement.
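
As an illustration of the first-touch idea (a sketch only, reusing the decomposition from the earlier posts): initialize each slice in parallel with the same loop structure and default static scheduling that the compute loops use, so each thread's pages are first touched, and hence placed, by the thread that will later work on them:

! Sketch: parallel first-touch initialization with the same decomposition as
! the compute loops (this only matters on NUMA platforms).
do sweep = 0, 1
!$omp parallel do
   do kpar = 0, N-1
      c1(:, :, zoffset(sweep+2*kpar)+1 : zoffset(sweep+2*kpar+1)) = 0.0
      c2(:, :, zoffset(sweep+2*kpar)+1 : zoffset(sweep+2*kpar+1)) = 0.0
   enddo
!$omp end parallel do
enddo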
If you read the news, you may have seen that Intel will join the others in supporting primarily NUMA multi-socket platforms over the next month.
0 Kudos
gib
New Contributor II
1,154 Views
Quoting - tim18
If the entire extent of this small dimension is always processed together, particularly if the short loop can be unrolled completely, it may be better to put it as the first dimension.
It seemed from your description that the threads may be operating on alternate blocks of memory. At each boundary between threads, there may be cache lines of the shared arrays which must be fetched by both threads, or even modified by one while the other has accessed it, so you may get delays which you would like to minimize.
If your platform has any NonUniformMemory characteristics, assuming you use default OpenMP static scheduling, and set KMP_AFFINITY, you could give the OpenMP library a chance to distribute the memory assignments efficiently by using the same OpenMP parallel scheme for the first initialization of the global data as will be used in the bulk of the calculation. You may not spend enough time in "first touch" for parallelization to matter for performance at that stage, but it could gain later on, particularly if your platform has a NUMA option in the BIOS, and you select that option. On such platforms, the non-NUMA option is intended to slow down all memory accesses equally, removing much of the advantage of correct memory placement.
If you read the news, you may have seen that Intel will join the others in supporting primarily NUMA multi-socket platforms over the next month.
"At each boundary between threads, there may be cache lines of the shared arrays which must be fetched by both threads" I thought that the way I define the slices would avoid this. Am I wrong in thinking that, for example, in A(100,100,100) the memory locations of the block A(:,:,25) will be very remote from those in the block A(:,:,51)? If so I am more confused than I realised.
0 Kudos
TimP
Honored Contributor III
1,154 Views
Quoting - gib
"At each boundary between threads, there may be cache lines of the shared arrays which must be fetched by both threads" I thought that the way I define the slices would avoid this. Am I wrong in thinking that, for example, in A(100,100,100) the memory locations of the block A(:,:,25) will be very remote from those in the block A(:,:,51)? If so I am more confused than I realised.
Yes, those boundaries are far enough apart, but it seems you may have more of them than necessary. I'm just pointing out some possibilities, as you haven't seen fit to tell us anything about your platform.
0 Kudos
gib
New Contributor II
1,154 Views
Quoting - tim18
Yes, those boundaries are far enough apart, but it seems you may have more of them than necessary. I'm just pointing out some possibilities, as you haven't seen fit to tell us anything about your platform.
I did see fit to say it was an Intel quad core box. Maybe that isn't enough info.

Intel Core 2 Quad CPU Q6600. OS is Windows XP Pro SP2. IVF 11.0.072. I am not

I can't see how I could get by with fewer boundaries and still preclude simultaneous memory accesses on adjacent locations, but I'm open to suggestions.

BTW, I have set KMP_AFFINITY = compact, but it didn't seem to make any difference. With the "verbose" option I get the following messages, which don't mean much to me:

OMP: Warning #2: Cannot open message catalog "5129libiomp5ui.dll":
OMP: System error #126: The specified module could not be found
OMP: Info #3: Default messages will be used.
OMP: Info #157: KMP_AFFINITY: Affinity capable, using global cpuid instr info
OMP: Info #162: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3}
OMP: Info #164: KMP_AFFINITY: 4 available OS procs
OMP: Info #165: KMP_AFFINITY: Uniform topology
OMP: Info #167: KMP_AFFINITY: 1 packages x 4 cores/pkg x 1 threads/core (4 total cores)
OMP: Info #168: KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
OMP: Info #176: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
OMP: Info #176: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]
OMP: Info #176: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]
OMP: Info #176: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]
OMP: Info #155: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #155: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
OMP: Info #155: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
OMP: Info #155: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
0 Kudos
TimP
Honored Contributor III
1,154 Views
That looks OK, except maybe for the indication that it failed when trying to put out another message. Q6600 isn't a uniform topology, as there are 2 cores on each L2 cache. With your scheme, it doesn't look like the normal advantage would accrue anyway from ensuring that threads 0 and 1 are on the same cache.
Applications I work with most often would be expected to see 80% speedup from 1 to 2 threads, and 45% from 2 to 4 threads, on such a CPU. The drop-off in scaling is due mainly to frequently running up against memory bus capacity. On other applications, cache capacity might limit scaling from 2 to 4 threads.
There shouldn't be any memory placement effects with a single socket.
0 Kudos
gib
New Contributor II
1,154 Views
Quoting - tim18
That looks OK, except maybe for the indication that it failed when trying to put out another message. Q6600 isn't a uniform topology, as there are 2 cores on each L2 cache. With your scheme, it doesn't look like the normal advantage would accrue anyway from ensuring that threads 0 and 1 are on the same cache.
Applications I work with most often would be expected to see 80% speedup from 1 to 2 threads, and 45% from 2 to 4 threads, on such a CPU. The drop-off in scaling is due mainly to frequently running up against memory bus capacity. On other applications, cache capacity might limit scaling from 2 to 4 threads.
There shouldn't be any memory placement effects with a single socket.
There is another aspect of my program that I left out of the example in order to simplify the explanation, but I now see that the bit I left out is crucial. While I am working on a cubic lattice, the region of interest is restricted to a roughly spherical region (let's call it the blob), centred at the centre of the lattice. I have an array of derived type, one component of which indicates if (x,y,z) is in the blob. In par_sub() this array is checked whenever a computation involves a point (x,y,z) and the loop cycles if the point is not in the blob.

My (naive) approach to load sharing has been to set up the zoffset() array that defines the slices to ensure equal numbers of blob points in each slice - each slice has roughly equal volume of blob. Of course, the flaw in my reasoning is that in par_sub() there is still a significant amount of time spent scanning the non-blob points in the slice, so the CPU load for each slice is in fact unbalanced.

I have confirmed this by running a test case in which the blob spans the whole cube. In this case there are no non-blob points and the slices are truly equal. The timings for this case are:

nthreads time
1 45.5
2 23.3
4 13.1

i.e. very nice quasi-linear speedup. So the poor speedup I'm seeing is the result of poor load sharing. Back to the drawing board to figure out a way to make efficient use of multiple processors. One complicating factor is that in production runs the blob changes size over the course of the program's execution.

(Edit later)
I am now setting up a list of blob sites for each slice before entering the parallel section (a sketch of this is below, after the timings), and getting much better load sharing with the blob.

nthreads time
1 31.1
2 18.1
4 12.1

This can probably be further improved.
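
For the record, here is a sketch of the kind of pre-pass described above (occupancy, in_blob, nsites and site_list are illustrative names, not the actual ones):

! Sketch: build per-slice lists of blob sites before the parallel section,
! and rebuild them whenever the blob grows or shrinks.
do islice = 1, 2*N
   nsites(islice) = 0
   do z = zoffset(islice-1)+1, zoffset(islice)
      do y = 1, NY
         do x = 1, NX
            if (occupancy(x,y,z)%in_blob) then
               nsites(islice) = nsites(islice) + 1
               site_list(:, nsites(islice), islice) = (/ x, y, z /)
            endif
         enddo
      enddo
   enddo
enddo

! Inside par_sub, each slice then loops only over its own list:
do n = 1, nsites(islice)
   x = site_list(1, n, islice)
   y = site_list(2, n, islice)
   z = site_list(3, n, islice)
   ! computation at (x,y,z) as before
enddo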

Thanks for your help Tim.

0 Kudos
peter_poliski
Beginner
1,154 Views
Quoting - gib
There is another aspect of my program that I left out of the example in order to simplify the explanation, but I now see that the bit I left out is crucial. While I am working on a cubic lattice, the region of interest is restricted to a roughly spherical region (let's call it the blob), centred at the centre of the lattice. I have an array of derived type, one component of which indicates if (x,y,z) is in the blob. In par_sub() this array is checked whenever a computation involves a point (x,y,z) and the loop cycles if the point is not in the blob.

My (naive) approach to load sharing has been to set up the zoffset() array that defines the slices to ensure equal numbers of blob points in each slice - each slice has roughly equal volume of blob. Of course, the flaw in my reasoning is that in par_sub() there is still a significant amount of time spent scanning the non-blob points in the slice, so the CPU load for each slice is in fact unbalanced.

I have confirmed this by running a test case in which the blob spans the whole cube. In this case there are no non-blob points and the slices are truly equal. The timings for this case are:

nthreads time
1 45.5
2 23.3
4 13.1

i.e. very nice quasi-linear speedup. So the poor speedup I'm seeing is the result of poor load sharing. Back to the drawing board to figure out a way to make efficient use of multiple processors. One complicating factor is that in production runs the blob changes size over the course of the program's execution.

(Edit later)
I am now setting up a list of blob sites for each slice before entering the parallel section, and getting much better load sharing with the blob.

nthreads time
1 31.1
2 18.1
4 12.1

This can probably be further improved.

Thanks for your help Tim.

Gib, I followed all your experiments in this thread and enjoyed every bit of them. You have conducted some amazing ones and reported the results in enough detail for even the dumbest of us to follow.

I look forward to more of your threads to cure my noobiness.

Regards,
0 Kudos
gib
New Contributor II
1,154 Views
Quoting - peter_poliski
Gib, I followed all your experiments in this thread and enjoyed every bit of them. You have conducted some amazing ones and reported the results in enough detail for even the dumbest of us to follow.

I look forward to more of your threads to cure my noobiness.

Regards,
:-) You are too kind. Noobiness is a relative designation - with respect to many here I am a noob too.

Best
Gib
0 Kudos