Intel® Fortran Compiler

Tips for improving nested loops + complex vars + OMP?

jason_kenney
Beginner
I'm trying to parallelize simulation code using OpenMP. After adding some basic "!$OMP PARALLEL DO PRIVATE" blocks and interchanging loops, I found a series of similarly structured blocks that dominate the run time and speed up little, if at all, as OMP_NUM_THREADS increases. The general form is:

...
!$OMP PARALLEL DO PRIVATE (ii,jj,it,kk)
do ii=1,max_i
  do it=1,max_t
    do kk=1,max_k
      do jj=1,max_j
        cmplx_01(jj,kk,it,ii,2)=cmplx_01(jj,kk,it,ii,2) &
                               +cmplx_00(kk)*dble_01(jj,it,ii,1)
        cmplx_02(jj,kk,it,ii,2)=cmplx_02(jj,kk,it,ii,2) &
                               +cmplx_00(kk)*dble_02(jj,it,ii,1)
        cmplx_03(jj,kk,it,ii,2)=cmplx_03(jj,kk,it,ii,2) &
                               +cmplx_00(kk)*dble_03(jj,it,ii,1)
      end do
    end do
  end do
end do
!$OMP END PARALLEL DO
...

All the arrays except cmplx_00 are in a module and allocated elsewhere. max_i and max_j are typically between 50 and 300, max_t is typically either 1 or ~300, and max_k is usually 2. The subroutine containing these blocks is called thousands of times; it is one of 5 subroutines that are called in series and iterated.

I typically run on machines with 2 packages x 6 cores (Xeon 5670s or 5680s) under Windows XP-64. Some machines have hyperthreading and show 24 CPUs in task manager, others only show 12 CPUs. These workstations have 16+ GB RAM and the test problem is about 7 GB.


For compilation, the best results so far have come from `ifort -c /O2 /Qopenmp /QxSSSE3` (ifort 11.1.065). /O3 is sometimes faster on one core but generally slower at 2 cores. I've tried things like `/Qpar-affinity=verbose,granularity=thread,proclist=[0,2],explicit` and setting KMP_AFFINITY, but I haven't seen an improvement (possibly because I don't know the optimal settings). I've also used `/Qvec-report:3` and found that the inner loops above aren't being vectorized, presumably because complex data types aren't supported?

I've used VTune Amplifier XE 2011 to examine hotspots and hardware issues. With OMP_NUM_THREADS=1, I'm seeing CPI ~1.6 with 0.893 Retire Stalls, 0.514 LLC Miss, and 0.406 LLC...Remote DRAM. With OMP_NUM_THREADS=2, those go to 2.014, 0.863, 0.525, and 0.357, respectively. I assume the first step is to improve the LLC Misses, but beyond having things in column-major order, I'm not sure how to improve. I see 'blocking data to fit into the LLC' mentioned, but I haven't seen example code that would give me some idea about how to go about this and what the block size should be.
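
In case it helps clarify what I'm asking, here is my naive guess at what blocking the jj loop might look like (jj0 and jj_blk are names I made up, and 64 is an arbitrary block size; I have no idea what would actually fit the LLC):

integer, parameter :: jj_blk = 64   ! arbitrary block size, purely a guess
integer :: jj0

!$OMP PARALLEL DO PRIVATE (ii,jj,it,kk,jj0)
do ii=1,max_i
  do it=1,max_t
    do jj0=1,max_j,jj_blk              ! step through jj in blocks
      do kk=1,max_k
        do jj=jj0,min(jj0+jj_blk-1,max_j)
          cmplx_01(jj,kk,it,ii,2)=cmplx_01(jj,kk,it,ii,2) &
                                 +cmplx_00(kk)*dble_01(jj,it,ii,1)
          cmplx_02(jj,kk,it,ii,2)=cmplx_02(jj,kk,it,ii,2) &
                                 +cmplx_00(kk)*dble_02(jj,it,ii,1)
          cmplx_03(jj,kk,it,ii,2)=cmplx_03(jj,kk,it,ii,2) &
                                 +cmplx_00(kk)*dble_03(jj,it,ii,1)
        end do
      end do
    end do
  end do
end do
!$OMP END PARALLEL DO

Is that the right idea, or should the blocking go over a different index?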

Thanks in advance for any advice.

Jason
TimP
Honored Contributor III
The vectorization report doesn't count SSE3 double-complex operations as vectorized. If the compiler is using SSE3 packed instructions to handle the real and imaginary parts in parallel, that's all you can expect. I've questioned whether vec-report needs to use such pessimistic wording. I don't see anything here to indicate that you would need /Qcomplex-limited-range for effective optimization, but you haven't provided enough detail (such as at least a compilable code fragment). You could build with /QxAVX simply to see whether you get a VECTORIZED report, for confidence that the compiler is optimizing your source code.

It's difficult to get full performance with hyperthreading on Windows, and KMP_AFFINITY for Westmere is potentially rather complicated. Setting OMP_NUM_THREADS=12 and KMP_AFFINITY=compact,1,1 (or equivalent) is one of the options you should be trying. Since you appear to suspect cache performance issues, and your loop trip counts aren't large, you should also try 8 threads: one on each of the first 2 pairs of cores, and one each on the 5th and 6th cores of each socket, something like KMP_AFFINITY="proclist=[3,7-11:2,15,19-23:2],explicit,verbose", where verbose gives the full diagnostic output so you can check that your command line was interpreted as intended. This attempts to deal with the fact that the first 2 pairs of cores each have a total of 4 hyperthreads sharing a single path to L3, so one of your experiments should be to emulate a Nehalem with hyperthreading disabled. I know of no documentation which says that the cores sharing cache paths will be at the head of the list, as I have assumed.
On the machines with hyperthreading disabled you will need KMP_AFFINITY values appropriate for that case (e.g. compact,0,0 for 12 threads, proclist=[1,3-5,7,9-11] for 8 threads).
I find it potentially confusing that the stride which selects every other thread context is 1 with compact but 2 with proclist, but I've never heard anyone else complain.
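
If you want to confirm what the runtime actually gives you, a trivial check like the sketch below (my own example, not from your code) prints the thread count; with verbose in KMP_AFFINITY the runtime also reports where each thread was bound.

[bash]program check_threads
   use omp_lib
   implicit none
!$omp parallel
!$omp single
   ! report the number of threads the parallel region actually received
   print *, 'OpenMP threads in use: ', omp_get_num_threads()
!$omp end single
!$omp end parallel
end program check_threads[/bash]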
John_Campbell
New Contributor II
As an alternative approach, if the arrays are suitably sized (their declarations aren't shown) and the index order can be changed, I would move the kk index so that cmplx_i and dble_i have a similar structure.
I would also use a subroutine wrapper to reduce the rank and use array syntax.
I have provided this example below.

This might not suit the !$OMP PARALLEL approach, but it should still be suitable for vectorisation. Could the 3 calls to cmplx_group be managed in parallel? (A sketch of what I mean follows the code below; I'd like to know whether this is a good or an unsuitable idea.)

These approaches would definitely suit older compilers, but perhaps not ifort.
[bash]
      subroutine test (max_i, max_t, max_j, max_k, dble_01, dble_02, dble_03, &
                       cmplx_00, cmplx_01, cmplx_02, cmplx_03)
!
      integer*4 max_i, max_t, max_j, max_k, ii, it, jj, kk
      double precision dble_01(max_j,max_t,max_i,2), &
                       dble_02(max_j,max_t,max_i,2), &
                       dble_03(max_j,max_t,max_i,2)
      complex cmplx_01(max_j,max_t,max_i,max_k,2), &
              cmplx_02(max_j,max_t,max_i,max_k,2), &
              cmplx_03(max_j,max_t,max_i,max_k,2), &
              cmplx_00(max_k)
!...
!$OMP PARALLEL DO PRIVATE (ii,jj,it,kk)
!
!  one option is to re-order the array arguments so cmplx_i and dble_i are the same
      do kk=1,max_k
!
        do ii=1,max_i
          do it=1,max_t
            do jj=1,max_j
              cmplx_01(jj,it,ii,kk,2) = cmplx_01(jj,it,ii,kk,2) + cmplx_00(kk)*dble_01(jj,it,ii,1)
              cmplx_02(jj,it,ii,kk,2) = cmplx_02(jj,it,ii,kk,2) + cmplx_00(kk)*dble_02(jj,it,ii,1)
              cmplx_03(jj,it,ii,kk,2) = cmplx_03(jj,it,ii,kk,2) + cmplx_00(kk)*dble_03(jj,it,ii,1)
            end do
          end do
        end do
!
!  alternative option is to simplify the array addressing
        call cmplx_group (cmplx_01(1,1,1,kk,2), dble_01(1,1,1,1), cmplx_00(kk), max_i*max_t*max_j)
        call cmplx_group (cmplx_02(1,1,1,kk,2), dble_02(1,1,1,1), cmplx_00(kk), max_i*max_t*max_j)
        call cmplx_group (cmplx_03(1,1,1,kk,2), dble_03(1,1,1,1), cmplx_00(kk), max_i*max_t*max_j)
!
      end do
!$OMP END PARALLEL DO
!...
      end

      subroutine cmplx_group (cmplx_i, dble_i, cmplx_00, n)
      integer n
      complex cmplx_i(n), cmplx_00
      double precision dble_i(n)
!
      cmplx_i = cmplx_i + cmplx_00 * dble_i
!
      end
! This could be simplified and be conforming as:
!...
!$OMP PARALLEL DO PRIVATE (kk)
!
!  one option is to re-order the array arguments so cmplx_i and dble_i are the same
        do kk=1,max_k
!
        cmplx_01(:,:,:,kk,2) = cmplx_01(:,:,:,kk,2) + dble_01(:,:,:,1) * cmplx_00(kk)
        cmplx_02(:,:,:,kk,2) = cmplx_02(:,:,:,kk,2) + dble_02(:,:,:,1) * cmplx_00(kk)
        cmplx_03(:,:,:,kk,2) = cmplx_03(:,:,:,kk,2) + dble_03(:,:,:,1) * cmplx_00(kk)
!
        end do
!$OMP END PARALLEL DO
!...[/bash]
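
For the parallel-sections idea mentioned above, the untested sketch below is roughly what I had in mind. It relies on the cmplx_group wrapper from the listing above and caps the parallelism at three threads, which is part of why I'm unsure it is a good idea.

[bash]!...
      do kk=1,max_k
!$OMP PARALLEL SECTIONS
!$OMP SECTION
        call cmplx_group (cmplx_01(1,1,1,kk,2), dble_01(1,1,1,1), cmplx_00(kk), max_i*max_t*max_j)
!$OMP SECTION
        call cmplx_group (cmplx_02(1,1,1,kk,2), dble_02(1,1,1,1), cmplx_00(kk), max_i*max_t*max_j)
!$OMP SECTION
        call cmplx_group (cmplx_03(1,1,1,kk,2), dble_03(1,1,1,1), cmplx_00(kk), max_i*max_t*max_j)
!$OMP END PARALLEL SECTIONS
      end do
!...[/bash]

Each section updates a different array, so there are no shared writes, but with only three sections most of your 12 cores would sit idle.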


jason_kenney
Beginner
Thanks to you both for the tips. I'll try to find time to test them this week and see if I can come up with a simplified code fragment that exhibits the same issues and is compilable.