Software Archive
Read-only legacy content
17061 Discussions

Optimizing indirect array access on Xeon Phi

james_B_8
Beginner
3,043 Views

Hi all

I'm currently porting an application to Xeon Phi. By offloading the OpenMP regions to the MIC I have already gained nearly a 2x speed-up over 8 Xeon cores, and I'm now trying to tune it further. The bottleneck in the code is this loop:

[fortran]

do n = lower, upper
                !Full Bessel integration
                a2=aa(n)
                bes_ix=bes_index(n)

                J_l=a2*ajl(bes_ix,j)+(1-a2)*(ajl(bes_ix+1,j) - ((a2+1) &
                     *ajlpr(bes_ix,j)+(2-a2)*ajlpr(bes_ix+1,j))* fac(n)) !cubic spline
                J_l = J_l*pTimeSteps_dpoints(n)

                !The unwrapped form is faster

                sum1 = sum1 + IV%Source_q(n,1)*J_l
                sum2 = sum2 + IV%Source_q(n,2)*J_l
                sum3 = sum3 + IV%Source_q(n,3)*J_l

 end do

[/fortran]

This occurs inside a parallel loop, so every thread executes this entire inner loop by itself. The problem is that this loop won't vectorize automatically, most likely because of the reduction (the compiler can't reorder the sums without changing the rounding). I've added a SIMD reduction directive, which the compiler accepts, but it's the J_l= line that kills all the performance.

ajl and ajlpr are lookup tables for some Bessel functions, and they are accessed indirectly, which results in lots of L1 cache misses. Array indirection is a dirty business, but I've read that Xeon Phi allows the programmer to do manual prefetching, which may help reduce the cache misses and allow some performance gain from vectorising.

How should I go about implementing the manual prefetching?

0 Kudos
30 Replies
TimP
Honored Contributor III
2,326 Views

software.intel.com/sites/default/files/article/.../5.3-prefetching-on-mic-4.pdf

discusses prefetching for such cases.

I agree that the prefetching question is likely to be your first priority.  If you are limited only by L1 misses and not by L2, it shouldn't be too difficult.

If you solve the prefetch issue, promoting vectorization ought to give a significant further boost:

#pragma [omp] simd reduction(+: sum1,sum2,sum3) private(a2,bes_ix,J_l)

It's probably safer to specify the reduction clause, although I've seen cases that work better without it when there's no need to promote more aggressive batching of the sums.

The integer data types are likely to work best as plain signed int (you leave us guessing).
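In Fortran form, applied to your loop, that would look something like the sketch below (a sketch only; I'm assuming sum1, sum2 and sum3 are zeroed beforehand, and you can use !DIR$ SIMD instead if your compiler predates OpenMP 4.0):

[fortran]
!$omp simd reduction(+: sum1, sum2, sum3) private(a2, bes_ix, J_l)
do n = lower, upper
   a2     = aa(n)
   bes_ix = bes_index(n)
   J_l = a2*ajl(bes_ix,j)+(1-a2)*(ajl(bes_ix+1,j) - ((a2+1) &
        *ajlpr(bes_ix,j)+(2-a2)*ajlpr(bes_ix+1,j))* fac(n)) !cubic spline
   J_l = J_l*pTimeSteps_dpoints(n)
   sum1 = sum1 + IV%Source_q(n,1)*J_l
   sum2 = sum2 + IV%Source_q(n,2)*J_l
   sum3 = sum3 + IV%Source_q(n,3)*J_l
end do
[/fortran]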

0 Kudos
james_B_8
Beginner
2,326 Views

Some insight into the bes_index array that may provide a way in: from inspecting the array in DDT I see that it's like a descending staircase, i.e. its values go [10,10,10,10,10,9,9,9,9,8,8,8,8,8,7,7,7,7,7,...] and so on down to 1. At least it's not random access!

My understanding of the L1 cache is that ajl will miss every time bes_index steps down. I figure that if it were going in the other direction this wouldn't be a problem, right? The next entry of ajl would still be in the cache line. I ran the loop backwards to see if that would help; alas, it made the performance worse.

I'm trying to prefetch backwards using:

[fortran]

             call mm_prefetch( ajl(0,j), 1)
             call mm_prefetch( ajlpr(0,j), 1)
             do n = lower,upper
                !Full Bessel integration
                a2=aa(n)
                bes_ix=bes_index(n)

                J_l=a2*ajl(bes_ix,j)+(1-a2)*(ajl(bes_ix+1,j) - ((a2+1) &
                     *ajlpr(bes_ix,j)+(2-a2)*ajlpr(bes_ix+1,j))* fac(n)) !cubic spline
                J_l = J_l*TimeSteps%dpoints(n)

                call mm_prefetch( ajl(bes_ix-4,j), 0)
                call mm_prefetch( ajlpr(bes_ix-4,j), 0)

                !The unwrapped form is faster

                sum1 = sum1 + IV%Source_q(n,1)*J_l
                sum2 = sum2 + IV%Source_q(n,2)*J_l
                sum3 = sum3 + IV%Source_q(n,3)*J_l
             end do

[/fortran]

But the number of L1 cache misses still remains the same. Please correct me if I am misunderstanding the caches / prefetching.

0 Kudos
TimP
Honored Contributor III
2,326 Views

My previous response didn't show up. Assuming you want to prefetch the next lower cache line, something like:

call mm_prefetch( ajl(bes_ix-8,j), 0)

call mm_prefetch( ajlpr(bes_ix-8,j), 0)

If I read the doc correctly, it would be better form to use the defined macro names for the L0 hint.

http://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-4.pdf

0 Kudos
james_B_8
Beginner
2,326 Views

Thanks for that PDF. I tried those prefetches; they make no appreciable difference to the performance.
I honestly have no idea what I'm doing here or how the prefetching machinery works, but it should be possible to exploit it, since I know what the access pattern on that array will look like and so could tell it what to load into the cache.

I'm stumped. :)
Any more ideas? Do you know anyone else who has experience of actually doing manual prefetching?

0 Kudos
TimP
Honored Contributor III
2,326 Views

If you're certain that these prefetches go in the correct direction, you may need to prefetch several cache lines ahead, rather than just one.  I'm sure some readers of this list have worked successfully with indirect prefetch.

In my own cases, I've only succeeded in moving cache misses to different locations, without significant net benefit.  You can try to infer from VTune analysis where the prefetches actually occur, whether they are discarded, etc.  I've had cases where no L1 prefetch was just as good.
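For what it's worth, here is a sketch of what "several lines ahead" could look like in your loop, assuming 8-byte reals (8 elements per 64-byte cache line). The distances of 8 and 32 elements are guesses that need tuning, the hint values simply follow the earlier posts in this thread, and you would want to guard against the index dropping below the start of the table:

[fortran]
do n = lower, upper
   a2     = aa(n)
   bes_ix = bes_index(n)
   ! pull a line ~4 cache lines below into L2, and the next line down into L1
   call mm_prefetch( ajl(bes_ix-32,j),   1)
   call mm_prefetch( ajlpr(bes_ix-32,j), 1)
   call mm_prefetch( ajl(bes_ix-8,j),    0)
   call mm_prefetch( ajlpr(bes_ix-8,j),  0)
   ! ... rest of the loop body unchanged ...
end do
[/fortran]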

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,326 Views

Untested code sketch to help with vectorization:

[fortran]
real(8) :: sum1v(4), sum2v(4), sum3v(4), a2(4), j_l(4)
integer :: bes_index(upper+4)   ! padded with 4 extra entries
...
bes_index(upper+1:upper+4) = -1   ! sentinels so the look-ahead compares always fail past the end
...
sum1v = 0.0D0
sum2v = 0.0D0
sum3v = 0.0D0
n = lower
do
  bes_ix=bes_index(n)
  if(((n+1) < upper) .and. (bes_ix == bes_index(n+1))) then
    if(((n+2) < upper) .and. (bes_ix == bes_index(n+2))) then
      if(((n+3) < upper) .and. (bes_ix == bes_index(n+3))) then
        ! 4-wide calculation
        a2(1:4)=aa(n:n+3)
        J_l(1:4)=a2(1:4)*ajl(bes_ix,j)+(1-a2(1:4))*(ajl(bes_ix+1,j) - ((a2(1:4)+1) & 
          *ajlpr(bes_ix,j)+(2-a2(1:4))*ajlpr(bes_ix+1,j))* fac(n:n+3)) !cubic spline 
        J_l(1:4) = J_l(1:4)*TimeSteps%dpoints(n:n+3) 
        sum1v(1:4) = sum1v(1:4) + IV%Source_q(n:n+3,1)*J_l(1:4) 
        sum2v(1:4) = sum2v(1:4) + IV%Source_q(n:n+3,2)*J_l(1:4) 
        sum3v(1:4) = sum3v(1:4) + IV%Source_q(n:n+3,3)*J_l(1:4) 
        n = n+4
      else
        ! 3-wide calculation
        a2(1:3)=aa(n:n+2)
        J_l(1:3)=a2(1:3)*ajl(bes_ix,j)+(1-a2(1:3))*(ajl(bes_ix+1,j) - ((a2(1:3)+1) & 
          *ajlpr(bes_ix,j)+(2-a2(1:3))*ajlpr(bes_ix+1,j))* fac(n:n+2)) !cubic spline 
        J_l(1:3) = J_l(1:3)*TimeSteps%dpoints(n:n+2) 
        sum1v(1:3) = sum1v(1:3) + IV%Source_q(n:n+2,1)*J_l(1:3) 
        sum2v(1:3) = sum2v(1:3) + IV%Source_q(n:n+2,2)*J_l(1:3) 
        sum3v(1:3) = sum3v(1:3) + IV%Source_q(n:n+2,3)*J_l(1:3) 
        n = n+3
      endif
    else
      ! 2-wide calculation
      a2(1:2)=aa(n:n+1)
      J_l(1:2)=a2(1:2)*ajl(bes_ix,j)+(1-a2(1:2))*(ajl(bes_ix+1,j) - ((a2(1:2)+1) & 
        *ajlpr(bes_ix,j)+(2-a2(1:2))*ajlpr(bes_ix+1,j))* fac(n:n+1)) !cubic spline 
      J_l(1:2) = J_l(1:2)*TimeSteps%dpoints(n:n+1) 
      sum1v(1:2) = sum1v(1:2) + IV%Source_q(n:n+1,1)*J_l(1:2) 
      sum2v(1:2) = sum2v(1:2) + IV%Source_q(n:n+1,2)*J_l(1:2) 
      sum3v(1:2) = sum3v(1:2) + IV%Source_q(n:n+1,3)*J_l(1:2) 
      n = n+2
    endif
  else
    ! 1-wide calculation
    a2(1)=aa(n)
    J_l(1)=a2(1)*ajl(bes_ix,j)+(1-a2(1))*(ajl(bes_ix+1,j) - ((a2(1)+1) & 
      *ajlpr(bes_ix,j)+(2-a2(1))*ajlpr(bes_ix+1,j))* fac(n)) !cubic spline 
    J_l(1) = J_l(1)*TimeSteps%dpoints(n) 
    sum1v(1) = sum1v(1) + IV%Source_q(n,1)*J_l(1) 
    sum2v(1) = sum2v(1) + IV%Source_q(n,2)*J_l(1) 
    sum3v(1) = sum3v(1) + IV%Source_q(n,3)*J_l(1) 
    n = n+1
  endif
  if(n > upper) exit
end do
sum1 = sum1v(1)+sum1v(2)+sum1v(3)+sum1v(4)
sum2 = sum2v(1)+sum2v(2)+sum2v(3)+sum2v(4)
sum3 = sum3v(1)+sum3v(2)+sum3v(3)+sum3v(4)
[/fortran]

Jim Dempsey

0 Kudos
james_B_8
Beginner
2,326 Views

Hi Jim,

Thanks for that code. I tried it out and it didn't improve things, but running it under VTune did show me that the majority of the time is spent in the 1-wide calculation. It seems that later in the calculation the bes_index values step down faster. I managed to produce a graph of bes_index (attached) from later in the calculation; at its steepest it decreases by about 2-3 per element. My thinking was that because it's going backwards, every time it decreases we have to go to the L2 cache to get the next ajl and ajlpr values. Hence I was trying to prefetch a couple of lines backwards, but with no luck. I'm not sure if my diagnosis is even correct.

Cheers,
James

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,326 Views

Try this. It is essentially the same code, but with some helpers for the code optimizer.

[fortran]

real(8) :: sum1v(4), sum2v(4), sum3v(4), a2(4), j_l(4)
integer :: bes_index(upper+4)   ! padded with 4 extra entries
real(8), pointer :: ajl_p(:), ajlpr_p(:) ! pointers to (:,j) slice
real(8) :: ajl_ix_ix1(2) ! prefetch {ajl(bes_ix),ajl(bes_ix+1)} to register not immediately used
real(8) :: ajlpr_ix_ix1(2) ! prefetch {ajlpr(bes_ix),ajlpr(bes_ix+1)} to register not immediately used

...
bes_index(upper+1:upper+4) = -1   ! sentinels so the look-ahead compares always fail past the end
...
ajl_p => ajl(:,j)
ajlpr_p => ajlpr(:,j)
sum1v = 0.0D0
sum2v = 0.0D0
sum3v = 0.0D0
n = lower
do
  bes_ix=bes_index(n)
  ajl_ix_ix1 = ajl_p(bes_ix:bes_ix+1)  ! prefetch {ajl(bes_ix),ajl(bes_ix+1)} to register not immediately used
  ajlpr_ix_ix1 = ajlpr_p(bes_ix:bes_ix+1) ! prefetch {ajlpr(bes_ix),ajlpr(bes_ix+1)} to register not immediately used
  if(((n+1) < upper) .and. (bes_ix == bes_index(n+1))) then
    if(((n+2) < upper) .and. (bes_ix == bes_index(n+2))) then
      if(((n+3) < upper) .and. (bes_ix == bes_index(n+3))) then
        ! 4-wide calculation
        a2(1:4)=aa(n:n+3)
        J_l(1:4)=a2(1:4)*ajl_ix_ix1(1)+(1-a2(1:4))*(ajl_ix_ix1(2) - ((a2(1:4)+1) & 
          *ajlpr_ix_ix1(1)+(2-a2(1:4))*ajlpr_ix_ix1(2))* fac(n:n+3)) !cubic spline 
        J_l(1:4) = J_l(1:4)*TimeSteps%dpoints(n:n+3) 
        sum1v(1:4) = sum1v(1:4) + IV%Source_q(n:n+3,1)*J_l(1:4) 
        sum2v(1:4) = sum2v(1:4) + IV%Source_q(n:n+3,2)*J_l(1:4) 
        sum3v(1:4) = sum3v(1:4) + IV%Source_q(n:n+3,3)*J_l(1:4) 
        n = n+4
      else
        ! 3-wide calculation
        a2(1:3)=aa(n:n+2)
        J_l(1:3)=a2(1:3)*ajl_ix_ix1(1)+(1-a2(1:3))*(ajl_ix_ix1(2) - ((a2(1:3)+1) & 
          *ajlpr_ix_ix1(1)+(2-a2(1:3))*ajlpr_ix_ix1(2))* fac(n:n+2)) !cubic spline 
        J_l(1:3) = J_l(1:3)*TimeSteps%dpoints(n:n+2) 
        sum1v(1:3) = sum1v(1:3) + IV%Source_q(n:n+2,1)*J_l(1:3) 
        sum2v(1:3) = sum2v(1:3) + IV%Source_q(n:n+2,2)*J_l(1:3) 
        sum3v(1:3) = sum3v(1:3) + IV%Source_q(n:n+2,3)*J_l(1:3) 
        n = n+3
      endif
    else
      ! 2-wide calculation
      a2(1:2)=aa(n:n+1)
      J_l(1:2)=a2(1:2)*ajl_ix_ix1(1)+(1-a2(1:2))*(ajl_ix_ix1(2) - ((a2(1:2)+1) & 
        *ajlpr_ix_ix1(1)+(2-a2(1:2))*ajlpr_ix_ix1(2))* fac(n:n+1)) !cubic spline 
      J_l(1:2) = J_l(1:2)*TimeSteps%dpoints(n:n+1) 
      sum1v(1:2) = sum1v(1:2) + IV%Source_q(n:n+1,1)*J_l(1:2) 
      sum2v(1:2) = sum2v(1:2) + IV%Source_q(n:n+1,2)*J_l(1:2) 
      sum3v(1:2) = sum3v(1:2) + IV%Source_q(n:n+1,3)*J_l(1:2) 
      n = n+2
    endif
  else
    ! 1-wide calculation
    a2(1)=aa(n)
    J_l(1)=a2(1)*ajl_ix_ix1(1)+(1-a2(1))*(ajl_ix_ix1(2) - ((a2(1)+1) & 
      *ajlpr_ix_ix1(1)+(2-a2(1))*ajlpr_ix_ix1(2))* fac(n)) !cubic spline 
    J_l(1) = J_l(1)*TimeSteps%dpoints(n) 
    sum1v(1) = sum1v(1) + IV%Source_q(n,1)*J_l(1) 
    sum2v(1) = sum2v(1) + IV%Source_q(n,2)*J_l(1) 
    sum3v(1) = sum3v(1) + IV%Source_q(n,3)*J_l(1) 
    n = n+1
  endif
  if(n > upper) exit
end do
sum1 = sum1v(1)+sum1v(2)+sum1v(3)+sum1v(4)
sum2 = sum2v(1)+sum2v(2)+sum2v(3)+sum2v(4)
sum3 = sum3v(1)+sum3v(2)+sum3v(3)+sum3v(4)
[/fortran]

Jim Dempsey

0 Kudos
james_B_8
Beginner
2,326 Views

Thanks again, but it still runs slightly worse than the original. VTune says there are lots of L1 cache misses, but looking at the hardware counters there are apparently also L2 misses. I ran VTune with General Exploration for MIC for all three cases. I can't seem to get a formatted table on here, so I just screenshotted it.
Since there were L2 misses, I tried tiling the outer j loop and the inner n loop, but this made things doubly slow. I'm out of ideas. Do the statistics from VTune give you any further insight into the problem, Jim?

Thanks,
James

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,326 Views

I have to preface this by saying that my experience interpreting this report is somewhat limited. That said, some of the numbers do not make sense in relation to CPU time, e.g. CPU time going up as the L1 hit ratio goes up. Are you sure you did not make a typographical error (like not advancing n by +4, +3, +2, +1 in the different sections)?

I have an idea: because CPU time goes up as the L1 hit ratio goes up, this would indicate that the xxxx(4) and xx(2) local arrays were not registerized. Two suggestions: a) extract the inner loop (starting at sum1v=0.0) and place it into a separate callable subroutine; b) add an ALIGN:32 attribute to the xxxx(4) arrays and an ALIGN:16 attribute to the xx(2) arrays. See if these two changes migrate the xxxx(4) and xx(2) locals into vector registers.
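Something like the following declarations is what I have in mind (a sketch only; I'm assuming Intel Fortran's !DIR$ ATTRIBUTES ALIGN syntax and the local names from the earlier code):

[fortran]
! in the separate subroutine that holds the inner loop
real(8) :: sum1v(4), sum2v(4), sum3v(4), a2(4), J_l(4)
!DIR$ ATTRIBUTES ALIGN: 32 :: sum1v, sum2v, sum3v, a2, J_l
real(8) :: ajl_ix_ix1(2), ajlpr_ix_ix1(2)
!DIR$ ATTRIBUTES ALIGN: 16 :: ajl_ix_ix1, ajlpr_ix_ix1
[/fortran]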

Jim Dempsey

0 Kudos
james_B_8
Beginner
2,326 Views

Hi again,

Yeah Jim, none of your suggestions helped; they just moved the cache misses around. I did manage to get a 20% speed-up by specifying the LOOP COUNT for the inner and outer loops. It is still cache-nasty, though.

It would seem that software prefetching is the only way I can reduce these misses, given that I have a rough idea of the access pattern on the ajl and ajlpr arrays. However, I don't really understand when prefetching happens in a loop. Are prefetches issued every iteration, or every n iterations, or something else? Intel's examples don't come with much explanation... Any advice?

0 Kudos
TimP
Honored Contributor III
2,326 Views

I think you must run VTune and look at both the source and asm views to get much understanding of how the prefetches are working in practice.  There is a compile option to skip prefetches which would extend beyond the end of the loop, but normally all the prefetches are issued on every iteration, as each prefetch brings in one cache line and, ideally, an entire cache line is consumed per iteration.

The opt-report statistics on prefetch can give some idea whether you have an excessive number of prefetches which can result in many of them being dropped.

The way you have written it, what you call prefetches don't necessarily accomplish the same thing as explicit prefetch intrinsics.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,326 Views

Here is a minor tweak. The compiler may already rearrange the code this way, but it is worth an experimental run:

[fortran]
do n = lower, upper
  !Full Bessel integration 
  a2=aa(n) 
  bes_ix=bes_index(n)
  fac_n = fac(n)
  ts_n = TimeSteps%dpoints(n)
  J_l=a2*ajl(bes_ix,j)+(1-a2)*(ajl(bes_ix+1,j) - ((a2+1) & 
    *ajlpr(bes_ix,j)+(2-a2)*ajlpr(bes_ix+1,j))* fac_n) !cubic spline 
  J_l = J_l*ts_n
  !The unwrapped form is faster 
  sum1 = sum1 + IV%Source_q(n,1)*J_l 
  sum2 = sum2 + IV%Source_q(n,2)*J_l 
  sum3 = sum3 + IV%Source_q(n,3)*J_l 
end do
[/fortran]

What this tries to do is to force future memory fetches into the memory wait time after fetching of bes_ix. IOW eliminate memory stall on:

*ajl(bes_ix,j)
*fac(n)
*TimeSteps%dpoints(n)

Also consider:

real :: sum(3)
...
sum = sum + IV%Source_Q(n,1:3)*J_l

I am not certain that the above is faster or even works as expected. Have you considered re-arranging index order on Source_q?
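By re-arranging the index order I mean something like this (a sketch only; Source_qT is a hypothetical transposed copy built once outside the hot loops, so that the three values used in each iteration sit in one contiguous location):

[fortran]
real(8), allocatable :: sums(:), Source_qT(:,:)
allocate(sums(3), Source_qT(3, lower:upper))
Source_qT = transpose(IV%Source_q(lower:upper, 1:3))   ! build once, reuse for every j
sums = 0.0d0
do n = lower, upper
   ! ... J_l evaluated as before ...
   sums = sums + Source_qT(:, n)*J_l
end do
[/fortran]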

Due to the scatter nature, it is difficult to perform the prefetch.

On other threading models you could do a tag-team approach using HT siblings, with one sibling performing the fetches into L1 in advance of the other thread. Note that on MIC each core has 4 HT siblings, and the general rule of thumb is that the vector unit is maxed out using 2 of them. The two prefetcher threads (assuming you can set this up in Fortran, which I don't think you can) would have the arrays redefined as integer arrays (to avoid using the vector unit); they would then read through bes_index, then indirectly and innocuously use the memory in an integer (or character) operation, bypassing the AVX/SSE instructions.

Intel has the means to add a non-standard feature that enables teaming of two HT siblings (meaning that on MIC each core would have two teams when the feature is enabled). The programmer would then use something like an OMP SECTIONS construct, say !$OMP HT SECTIONS, where the first section is guaranteed to run and the second section only runs when the feature is enabled/available.

********* the following is NOT a feature of IVF -- it is a suggestion

[fortran]
! *** hypothetical - don't try this at home ***
integer, volatile :: n_v
!$OMP HT SECTIONS PRIVATE(n, a2, bes_ix)
do n = lower,upper 
  !Full Bessel integration
  n_v = n
  a2=aa(n) 
  bes_ix=bes_index(n) 
  J_l=a2*ajl(bes_ix,j)+(1-a2)*(ajl(bes_ix+1,j) - ((a2+1) & 
       *ajlpr(bes_ix,j)+(2-a2)*ajlpr(bes_ix+1,j))* fac(n)) !cubic spline 
  J_l = J_l*TimeSteps%dpoints(n) 
  !The unwrapped form is faster 
  sum1 = sum1 + IV%Source_q(n,1)*J_l 
  sum2 = sum2 + IV%Source_q(n,2)*J_l 
  sum3 = sum3 + IV%Source_q(n,3)*J_l 
end do
!$OMP HT SECTION
do n = lower,upper
  bes_ix=bes_index(n) ! integer fetch of indirect index
  ! perform integer operations on memory locations to be used by other HT sibling
  if( ISNAN(aa(n)) &
      .or. ISNAN(ajl(bes_ix,j)) &
      .or. ISNAN(ajl(bes_ix+1,j)) &
      .or. ISNAN(ajlpr(bes_ix,j)) &
      .or. ISNAN(ajlpr(bes_ix+1,j)) &
      .or. ISNAN(fac(n)) &
      .or. ISNAN(TimeSteps%dpoints(n)) ) then
    ! if you want, do something
  endif
  DO WHILE(n .GT. (n_v + 1000))
     ! Don't overflow L1
     CALL SLEEPQQ(0)
  END DO
end do
!$OMP END HT SECTIONS
[/fortran]

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,326 Views

TRANSFER(fac(n),1_8) may be more effective; I haven't checked the code generated.

Jim Dempsey

0 Kudos
james_B_8
Beginner
2,326 Views

Sorry, but more effective for what? What's your reasoning behind using TRANSFER?

James

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,326 Views

What you need to do is use integer-register references to read the memory locations into L1. Fortran does not have a simple-to-use CAST operator; however, TRANSFER, when not required to reorganize the data, behaves like a cast. The assumption, and something that needs checking, is whether TRANSFER as used in the above code actually avoids the SSE/AVX registers. IOW you want to be sure it does not use an SSE xmm, AVX ymm, or MIC zmm register in the process of "transferring" the data. The intrinsic ISNAN will likely (my assumption) fetch and examine the float/double using GP registers (e.g. eax, rax, ...), because the tests for a NaN are bit tests, though ISNAN will have more overhead than a simple fetch-and-discard.

N.B. though you could use MM_PREFETCH, you are not assured that the data actually reaches L1: page faults may interfere with your intentions, and the implementation is permitted to ignore the request, whereas an actual read of the data guarantees the transfer into L1.

It is important to realize that the (pre)fetching thread must be an HT sibling of the processing thread, as these reside in the same core and share the same L1 (and L2).
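As a sketch of what the fetch-and-discard loop for the helper thread could look like (assuming the arrays are 8-byte reals; trash is a hypothetical accumulator that exists only so the compiler cannot remove the reads, n_v is the shared volatile counter from the earlier sketch, and SLEEPQQ comes from IFPORT):

[fortran]
integer(8) :: trash
...
trash = 0_8
do n = lower, upper
   bes_ix = bes_index(n)
   ! integer-register reads whose only purpose is to pull the lines into L1
   trash = ieor(trash, transfer(aa(n),                1_8))
   trash = ieor(trash, transfer(ajl(bes_ix,j),        1_8))
   trash = ieor(trash, transfer(ajl(bes_ix+1,j),      1_8))
   trash = ieor(trash, transfer(ajlpr(bes_ix,j),      1_8))
   trash = ieor(trash, transfer(ajlpr(bes_ix+1,j),    1_8))
   trash = ieor(trash, transfer(fac(n),               1_8))
   trash = ieor(trash, transfer(TimeSteps%dpoints(n), 1_8))
   do while (n > n_v + 1000)   ! don't run too far ahead of the worker / overflow L1
      call sleepqq(0)
   end do
end do
if (trash == 0_8) call sleepqq(0)   ! reference trash so the loop is not optimized away
[/fortran]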

FWIW this technique could be performed in C++ using a threading toolkit that I produced. I haven't checked out the toolkit on MIC, but it will run on the host system. The code would look something like this:

[cpp]
volatile int n_v = lower;
parallel_task(
  L1$, // enqueue task to L1 sibling
  [&](){ // Lambda function
    for(int n=lower; n <= upper; ++n)
    {
      int bes_ix = bes_index[n];
      int64_t trash =
         *((int64_t*)&aa[n])
         | *((int64_t*)&ajl[bes_ix]) // row,col swapped from Fortran, also using array of double*
         ...
      if(trash == 0) // reference trash to keep the optimizer from removing the reads above
        _mm_pause(); // only issued in the unlikely event that all reads were 0.0
      while(n > (n_v + 1000))
        _mm_pause(); // or sched_yield()
    } // for
  } // end lambda
); // end parallel_task
// while above is running... perform following code
// Full Bessle integration as implemented in C++
for(int n=lower; n <= upper; ++n)
{
  n_v = n;
  int bes_ix = bes_index[n];
  ...
} // for
[/cpp]

Jim Dempsey

0 Kudos
James_C_Intel2
Employee
2,326 Views

@JimDempsey are you forgetting that KNC (MIC, Xeon Phi) is an in-order core?

Surely that means that you stall on the cache miss at the load instruction that misses, not at the point of use of the loaded register. Therefore your attempt to hide the latencies of some misses behind others (by having multiple missing loads in flight all at once) is doomed to failure, because you can't issue the subsequent loads once the first one stalls.

Being able to set up multiple moves from memory to the cache is the point of the prefetch instructions, which don't stall, so you can have many of them in flight at once.

This is also one of the reasons for having four hardware threads in the core; the other threads give the core something to do in the periods when one thread is stalled on a cache miss. (Which is more like the other approach you are advocating, of having a prefetching helper thread.)

So, your "do all the misses at once" approach might work on an out-of-order Xeon processor, but I don't see how it can on a KNC.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,326 Views

James,

I think you have missed the functionality of the code, or I am making an incorrect assumption about the operation of the Xeon Phi in-order core. Let's resolve my assumption:

My assumption is: when one thread within a core issues a load instruction that misses the cache, the remaining threads of that core that are not themselves stalled on their own cache-missing loads can continue to run on cache hits.

IOW it satisfies: "This is also one of the reasons for having four hardware threads in the core; the other threads give the core something to do in the periods when one thread is stalled on a cache miss."

This is to say we are both in agreement that stalls on one thread of a core do not (materially) affect the non-stalled threads of the same core.

The intent of the code is for the HT-sibling advance-fetch thread to run the gauntlet of cache misses ahead of the HT producer thread performing the computations. Should there be a large number of cache misses, it is conceivable that the producer thread will catch up to the advance-fetch thread, in which case both threads stall until the fetch completes, and both resume when it does. Note that the producer thread will generally proceed more slowly to the next load, because it is performing computations in the vector unit (possibly in competition with the other producer thread in the core). The difference in run time between the advance-fetch thread reaching the next stalling load and the producer thread reaching the same load is time available to start the load operation. Should the load be satisfied from a further-out cache level, it is likely that this skew in arrival times exceeds the time needed to move the data into L1. In the event of a load from RAM, the skew is still recovered, because the load was initiated that much earlier.

Jim Dempsey

0 Kudos
James_C_Intel2
Employee
2,326 Views

Yes, we agree about the KNC architecture. A cache-miss only affects the hardware thread that caused it.

And yes, I had missed some of what you're doing :-( 

I'm still not clear, though, why you think the complexity of the prefetch threads is beneficial compared with adding prefetch instructions at some suitable (OK, "suitable" is a weasel word :-)) point in the loop to prefetch data for the iteration n ahead. (Since we know we're on KNC, the argument that "prefetch instructions can be ignored" won't fly.) That seems much simpler, avoids the need to manage extra threads in the code, and avoids the danger that the prefetching gets out of sync with the worker (for instance, if it's too far ahead it could be evicting data before it has been used).

Of course, measured performance is the decider!
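i.e. something along these lines (a sketch only; PF_DIST is a hypothetical look-ahead distance in iterations that needs tuning, bes_pf is a scratch integer, and the min() just keeps the look-ahead inside the array bounds):

[fortran]
integer, parameter :: PF_DIST = 16
integer :: bes_pf
...
do n = lower, upper
   ! prefetch the table entries that iteration n+PF_DIST will need,
   ! using the future value of the indirect index
   bes_pf = bes_index(min(n+PF_DIST, upper))
   call mm_prefetch( ajl(bes_pf,j),   0)
   call mm_prefetch( ajlpr(bes_pf,j), 0)
   ! ... original loop body (spline evaluation and sums) unchanged ...
end do
[/fortran]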

0 Kudos
James_C_Intel2
Employee
2,207 Views

FWIW it was this comment that confused me

"What this tries to do is to force future memory fetches into the memory wait time after fetching of bes_ix. IOW eliminate memory stall on:

*ajl(bes_ix,j)
*fac(n)
*TimeSteps%dpoints(n)"

Which doesn't involve your prefetch thread, and does seem to require that multiple loads from a single thread be outstanding at once to achieve the claimed effect.

0 Kudos