
Optimizing indirect array access on Xeon Phi

james_B_8
Beginner

Hi all

I'm currently porting an application to Xeon Phi. By offloading the OpenMP regions to MIC I have already gained nearly a 2x speed-up over 8 Xeon cores, and I'm now trying to tune it further. The bottleneck in the code is this loop:

[fortran]

do n = lower, upper
   ! Full Bessel integration
   a2 = aa(n)
   bes_ix = bes_index(n)

   J_l = a2*ajl(bes_ix,j) + (1-a2)*(ajl(bes_ix+1,j) - ((a2+1) &
         *ajlpr(bes_ix,j) + (2-a2)*ajlpr(bes_ix+1,j))*fac(n))   ! cubic spline
   J_l = J_l*pTimeSteps_dpoints(n)

   ! The unwrapped form is faster
   sum1 = sum1 + IV%Source_q(n,1)*J_l
   sum2 = sum2 + IV%Source_q(n,2)*J_l
   sum3 = sum3 + IV%Source_q(n,3)*J_l
end do

[/fortran]

This loop sits inside a parallel loop, so every thread executes the entire inner loop by itself. The problem is that the loop won't vectorize automatically, most likely because of the reduction (reordering the sums changes the rounding). I've used a SIMD reduction, which the compiler accepts, but it's the J_l= line that kills all performance.
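The reduction directive I tried is along these lines (a sketch using Intel's !DIR$ SIMD form; the exact directive spelling may vary by compiler version):

[fortran]
!DIR$ SIMD REDUCTION(+:sum1,sum2,sum3)
do n = lower, upper
   ! ... same loop body as above ...
end do
[/fortran]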

ajl and ajlpr are look-up tables for some Bessel functions and they are being accessed indirectly, which results in lots of L1 cache misses. Array indirection is a dirty business, but I've read that Xeon Phi lets the programmer do manual prefetching, which may help reduce the cache misses and allow some performance gain from vectorising.
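From what I've read, I imagine something roughly like this (an untested sketch: mm_prefetch is the Intel compiler's prefetch intrinsic, and the look-ahead distance of 8 iterations is a guess):

[fortran]
do n = lower, upper
   ! issue prefetches for the table rows a few iterations ahead of use
   call mm_prefetch(ajl(bes_index(min(n+8,upper)),j), 1)
   call mm_prefetch(ajlpr(bes_index(min(n+8,upper)),j), 1)
   ! ... same loop body as above ...
end do
[/fortran]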

How should I go about implementing the manual prefetching?

James_C_Intel2
Employee

p.s. You wrote "Should the load reference L3, it is likely that the skew in arrival time to load exceeds that of moving data from L3 to L1." Don't forget that KNC has no L3, only L2 and L1. That probably doesn't affect your argument, but since we're debating u-architectural nits, this is quite a big one!

(I'll shut up now)

jimdempseyatthecove
Honored Contributor III

James,

RE: no L3... only L2 and L1

Though there is nothing labeled "L3", the other cores have their own L2s. The other cores' L2s are accessible with varying degrees of additional overhead over that of the in-core L2, yet faster than RAM. Conceptually, an inter-core L2 fetch is a cache hit that lies between an intra-core L2 fetch and a RAM fetch. This could conceivably be referred to as L2.1, L2.2, L2.3, ..., depending on the ring path (latency) to the remote L2. Alternately, it could be referred to as an L3 with variable latency. A matter of semantics.

RE: adding prefetch instructions

Prefetch instructions in the producer thread will add clock cycles to the producer code (work performed that is not part of the actual computation). Additionally, they may consume a general-purpose register which could be better used as a base or index.

Nailing down cache latencies for Xeon Phi is somewhat difficult due to design changes between models. Using rough numbers (cycles):

L1=3, L2=22, L2.1=230, L2.n=280, RAM=333

with longer times on the respective TLB misses.

More later...

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

I guess the proof of the pudding is to taste it. Run the code. Unfortunately I do not have a MIC.

To set up a grossly simplified "proof of the pudding" test, without recoding into C++ and using the QuickThread threading toolkit, you could set up an OpenMP configuration with 2 threads isolated to the same core and then use !$OMP SECTIONS.

Attached is a mockup that compiles. The arrays have to be sized and filled in with valid data before running the test.
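In outline, the test looks like this (a sketch only; BesselProducer, BesselFetcher, and the volatile n_v referred to below are from the attached mockup, with argument lists omitted):

[fortran]
! both threads pinned to one core: one does the real work,
! the HT sibling touches the same table entries slightly ahead
!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
      call BesselProducer()   ! compute loop; publishes progress via volatile n_v
!$OMP SECTION
      call BesselFetcher()    ! reads ajl/ajlpr at indices ahead of n_v
!$OMP END PARALLEL SECTIONS
[/fortran]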

Compile (on Windows x64):

/nologo /debug:full /Ob0 /QxHost /Qip /Qopenmp /warn:interfaces /module:"x64\Debug\\" /object:"x64\Debug\\" /Fd"x64\Debug\vc100.pdb" /traceback /check:none /libs:static /threads /dbglibs /c

You can use Release build if you wish.

The Debug build permits me to break on the write(*,*) and peek at the optimized code (outputting to an ASM listing does not enable full optimizations).

I am setting inlining off while IPO is on, which seems contradictory for this program; however, higher optimizations do not occur with IPO off, and turning inlining off improved register usage.

Jim Dempsey

james_B_8
Beginner

Wow thanks Jim.

I'll give that a shot tomorrow. The real headache is that this Bessel part is itself called from within an OpenMP loop, which is why I was trying to focus on serial optimisation. I'll disable that OpenMP loop and run your version with the helper threads anyway, just to see if it can help on MIC. If so, then perhaps I could redesign the parallelism, since that section has no parallel dependencies in it anyway. I'm only using 2 threads per core on the MIC, so in theory the other two could be used as helper threads with nested parallelism. But I'm doubtful that it's possible to make that work with OpenMP, tbh.

Cheers,
James B 

jimdempseyatthecove
Honored Contributor III

James B.

The code provided is for proof-of-concept only; it will not be suitable for you to expand for use in your OpenMP application. The reason being:

In the proof-of-concept configuration you configure for 2 threads only. In addition, using KMP_AFFINITY or other means, you force the affinity of both threads to the same core (either 2 specific threads of the same core, or all 4 specific threads of the same core). In this setup the two threads, one for each section, will run on the same core. This will produce a valid test of the HT-sibling helper-thread prefetcher. Though it will not reach peak performance for one core of Xeon Phi, the results will indicate how helpful the technique is.
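For example, pinning the two threads might look like this (assuming the typical KNC numbering, in which logical processors 1-4 map to the first physical core):

KMP_AFFINITY=granularity=fine,proclist=[1,2],explicit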

A second proof-of-concept test is to restrict the app to the 4 threads of the same core, run your outer parallel region with 2 threads, and run the nested inner region (SECTIONS) with 2 threads each. This should reach peak performance for one core.
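In OpenMP terms, that second test is roughly the following (a sketch; BesselProducer/BesselFetcher as in the mockup, and nested parallelism must be enabled):

[fortran]
! (needs: use omp_lib)
call omp_set_nested(.true.)
!$OMP PARALLEL NUM_THREADS(2)            ! two compute threads on one core
!$OMP PARALLEL SECTIONS NUM_THREADS(2)   ! each pairs with an HT-sibling fetcher
!$OMP SECTION
      call BesselProducer()
!$OMP SECTION
      call BesselFetcher()
!$OMP END PARALLEL SECTIONS
!$OMP END PARALLEL
[/fortran]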

Under actual production usage, you will have no control over which helper threads run on which core. Therefore this is not a valid setup for your purposes using Fortran with OpenMP on MIC.

Not knowing your application it is hard for me to offer meaningful suggestions.

The test results may be important for James Cownie (Intel).

Your data should include: restricted to one core, 4-thread OpenMP, 4 threads in the outer loop, non-nested, using the BesselProducer _without_ the setting of the volatile n_v;

compared with: restricted to one core, 4-thread OpenMP, 2 threads in the outer loop, nested, using the BesselProducer _with_ the setting of the volatile n_v, together with the BesselFetcher.

You would like to use varying sizes of data and produce a set of charts. It will be interesting to see the effects as you cross the boundaries: L1, L2, neighboring L2s, RAM.

Should this produce significant benefits, then you can decide if it is worth the effort to follow this path. I am available to consult with you.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Sketch of potential workable solution:

CoreFound = .false. ! clear array indicating which cores were found
!$OMP PARALLEL ... ! all threads start (4 threads/core, n cores)
iCore = FunctionToGetCoreNumber()
CoreFound(iCore) = .true.
iHThread = FunctionToGetHThreadSiblingNumberInCore()
!$OMP BARRIER
nCores = NumberOfTruesIn(CoreFound)
! without using nested parallelism, use iCore and iHThread to divide work
! e.g. IAND(iHThread,1) == 0 -> producer, == 1 -> fetcher
! this will work on a host with 2 HT siblings/core
! *** keeping in mind the possibility of missing cores in CoreFound
call doWork(CoreFound, nCores, iCore, iHThread, ...)
!$OMP END PARALLEL
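The two Function... calls above are placeholders. Under a stated affinity assumption (compact affinity, so that OpenMP thread i is pinned to logical processor i, 4 threads per core), hypothetical implementations could be as simple as:

[fortran]
integer function FunctionToGetCoreNumber()
   use omp_lib
   ! assumes OpenMP thread i is pinned to logical processor i (compact affinity)
   FunctionToGetCoreNumber = omp_get_thread_num() / 4 + 1
end function FunctionToGetCoreNumber

integer function FunctionToGetHThreadSiblingNumberInCore()
   use omp_lib
   ! sibling number 0..3 within the core, under the same assumption
   FunctionToGetHThreadSiblingNumberInCore = mod(omp_get_thread_num(), 4)
end function FunctionToGetHThreadSiblingNumberInCore
[/fortran]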

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

The above requires a KMP_AFFINITY setting to fix threads to one logical processor (for each thread).

And you will have to add sanity code in the event that you have unexpected availability of HT threads in core.

e.g. thread 0 not present but thread 1 present (the core is present, but the producer won't run and the fetcher may hang).

The system administrator may set up something goofy.

Jim Dempsey

James_C_Intel2
Employee

"The above requires a KMP_AFFINITY setting to fix threads to one logical processor (for each thread)."

The default affinity on KNC binds threads to the logical processors, so you don't need any extra KMP_AFFINITY to force that.

"And you will have to add sanity code in the event that you have unexpected availability of HT threads in core."

If you use KMP_PLACE_THREADS (http://software.intel.com/en-us/blogs/2013/02/15/new-kmp-place-threads-openmp-affinity-variable-in-update-2-compiler) you can explicitly choose the number of hardware threads/CPU that you want to use.
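For the one-core proof-of-concept discussed above, that would be something like (syntax per the linked blog post):

KMP_PLACE_THREADS=1c,4t

i.e. one core with all four of its hardware threads.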

McCalpinJohn
Honored Contributor III

Something to consider:

My experiments with OpenMP on Xeon Phi have shown that the synchronization overheads are quite high even across the four threads running on a single physical core. 

Using the EPCC OpenMP Benchmarks in C, version 2, I consistently see barrier overheads as follows:

  • 1 thread (null operation): 0.081 microseconds: ~90 cycles
  • 2 threads (same core): 1.21 microseconds: ~1325 cycles
  • 3 threads (same core): 1.81 microseconds: ~1985 cycles
  • 4 threads (same core): 2.48 microseconds: ~2728 cycles

This increases to >10,000 cycles using 64 threads on 16 cores, and to almost 15,000 cycles using 244 threads on 61 cores.

So I would recommend that you "roll your own" synchronization mechanisms in shared memory for the four threads on one core if you need fine-grained synchronization. Given the shared L1 cache, this should be a lot faster. Performance will depend on the store-to-load forwarding latency, which may not be as fast as one would like, but between two threads on the same core I would guess that a synchronization latency on the order of a few tens of cycles should be possible.
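A minimal sketch of such a hand-rolled mechanism in Fortran (illustrative names only; VOLATILE forces every reference to go through memory rather than a register):

[fortran]
module sync_mod
   integer, volatile :: n_v = 0   ! loop index most recently published by the producer
end module sync_mod

subroutine fetcher_wait(n_needed)
   use sync_mod
   integer, intent(in) :: n_needed
   ! spin on the shared flag; with both threads on one core this
   ! stays in the shared L1 instead of paying an OpenMP barrier
   do while (n_v < n_needed)
   end do
end subroutine fetcher_wait
[/fortran]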

jimdempseyatthecove
Honored Contributor III

James,

KMP_PLACE_THREADS assumes (requires) that the O/S has not placed any restrictions on thread placement. The sanity code is to test that assumption and take corrective action if desired.

John,

You have good points. Realize that the suggested code is proof-of-concept programming using existing programming models. Ideally, you would like to have something like:

!DEC$ HT PREFETCH
DO I=1,N
...
END DO

Where the prefetch requirement of the code in the loop is relatively simple to interpret. In the above case, a hidden loop is auto-generated that consists of the memory fetches only: no substantial computation, no use of SSE/AVX. The prefetch loop runs subject to the availability of an HT sibling (in other words, the code will still run on a non-HT processor).
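For the Bessel loop in this thread, the auto-generated fetch-only companion might amount to no more than this (hypothetical, since no such directive exists today):

[fortran]
do n = lower, upper
   ! loads only; the sum is discarded, the point is to populate L1
   dummy = dummy + ajl(bes_index(n),j) + ajlpr(bes_index(n),j)
end do
[/fortran]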

For complicated cases

!DEC$ HT PREFETCH SECTIONS
... code to perform work
!DEC$ HT PREFETCH SECTION
... code to perform prefetch (subject to availability of HT sibling)
!DEC$ END HT PREFETCH SECTIONS

Similar features for C++ (#pragma HT ...)

As you have stated in your post, these would use the lightest-weight form of barriers and task-context setup.

Jim Dempsey
