Optimizing Scatter Intrinsics

Michael_J_5 · ‎06-11-2016

Hello all,

I'm using Intel intrinsics instructions for a specific function in my code and it is working wonderfully. I'm able to utilize all the SIMD lanes on my Xeon Phi (5110P) in the exact manner I had planned and the -O3 flag on the compiler is working with my code to speed it up to where I expect it to be. However, I'm experiencing major slowdowns when I use the scatter intrinsic (specified below, at the end of intrinsic_function). What I want the code to do is pass in B[64], modify it within the code, and then place it back into B[64]. This is the spot that I'm getting most of my problems.

You see, there are 3 problems that occur at this part. I either get super fast code, as if nothing compiled (I have a solution, which strangely works and doesn't detriment performance to my knowledge); very fast code, but no way to pass the value along (before adding in scatter), this is the performance I was expecting, if a bit faster than I thought; very slow (albeit faster than the SISD code), on the order of 70x slower when I add in a single scatter at the end (each subsequent scatter slows it down by a little, which I'd normally expect anyway).

The solution to the first problem was adding in "printf("");" in the function. I'm assuming this works because ICC might be optimizing my code to the point where it sees that the function isn't doing any work that is contributing somewhere else, and so it's optimizing by getting rid of what it thinks is useless busy work. This printf() keeps the compiler from simply throwing the function away. I get the same result as using printf if I add in a variable that gets modified and passed back. The second and third problem I haven't quite figured out. Problem two could be easily solved if I had a way to pass B[64] or the _m512i equivalent to the next function for use since the value as it leaves the function remains the same between them. It should be in the registers already, or at most in the L1 cache, so the biggest performance hit I'd expect would maybe be up to 3 clocks IIRC. I figured the scatter instruction wouldn't cause too many problems, but even with prefetching this doesn't seem to be the case.

To test my function, I'm using the same 2 variable inputs in a for-loop. This is similar to how the function will actually be used, so it's important that I understand why I'm experiencing so much slowdown here.

Thanks!

__attribute__ ((target(mic))) int intrinsic_function(uint32_t B[64], const uint32_t Bx[64])
{
    // Special load indexes for our operations
    __m512i reg0_index = {0,5,10,15, 16,21,26,31, 32,37,42,47, 48,53,58,63};
    __m512i reg1_index = {12,1,6,11, 28,17,22,27, 44,33,38,43, 60,49,54,59};
    __m512i reg2_index = {8,13,2,7,  24,29,18,23, 40,45,34,39, 56,61,50,55};
    __m512i reg3_index = {4,9,14,3,  20,25,30,19, 36,41,46,35, 56,61,50,55};

    // SIMD Registers for B, Bx, and temps
    __m512i reg0, reg0x, B0;
    __m512i reg1, reg1x, B1;
    __m512i reg2, reg2x, B2;
    __m512i reg3, reg3x, B3;
    __m512i t0, t1, t2, t3;


    reg0 = _mm512_i32gather_epi32(reg0_index, &B[0], 4);
    reg0x= _mm512_i32gather_epi32(reg0_index, &Bx[0], 4);

    reg1 = _mm512_i32gather_epi32(reg1_index, &B[0], 4);
    reg1x= _mm512_i32gather_epi32(reg1_index, &Bx[0], 4);

    reg2 = _mm512_i32gather_epi32(reg2_index, &B[0], 4);
    reg2x= _mm512_i32gather_epi32(reg2_index, &Bx[0], 4);

    reg3 = _mm512_i32gather_epi32(reg3_index, &B[0], 4);
    reg3x= _mm512_i32gather_epi32(reg3_index, &Bx[0], 4);

    #ifdef __MIC__

    reg0 = _mm512_xor_epi32(reg0, reg0x);
    B0 = reg0;
    reg1 = _mm512_xor_epi32(reg1, reg1x);
    B1 = reg1;
    reg2 = _mm512_xor_epi32(reg2, reg2x);
    B2 = reg2;
    reg3 = _mm512_xor_epi32(reg3, reg3x);
    B3 = reg3;


    for(int i = 0; i < 4; i++){
    //Majority of function operations in here.
    // Plenty of time between gather and scatter functions
    }

    B0 = _mm512_add_epi32(reg0, B0);
    B1 = _mm512_add_epi32(reg1, B1);
    B2 = _mm512_add_epi32(reg2, B2);
    B3 = _mm512_add_epi32(reg3, B3);

    _mm512_i32scatter_epi32(&B[0],reg0_index,B0, 4);
    _mm512_i32scatter_epi32(&B[0],reg1_index,B1, 4);
    _mm512_i32scatter_epi32(&B[0],reg2_index,B2, 4);
    _mm512_i32scatter_epi32(&B[0],reg3_index,B3, 4);
    //printf("");
    #endif

    return 0;
}

TimP · ‎06-11-2016

I believe a scatter instruction can hit at most 2 cache lines in a single execution on KNC. If you generate a scatter from C source, you will see an iteration loop which continues until all cache lines have been updated, before it can issue the next one. So it is not much pipelined.

Even compiler generated prefetch is ineffective for scatter. If you are certain the data are already in cache, scatter should be faster than serial code with prefetching, otherwise likely not.

Michael_J_5 · ‎06-11-2016

So, what you're saying is that if I have a list of data items I wish to store into memory and I use a scatter instruction, then it's just serially sending the items to memory?

jimdempseyatthecove · ‎06-12-2016

Rather, if you have a list of data items to be stored into different cache line located memory locations...

The next generation KNL may be better at scatter (to 16 different cache lines).

On KNC, if you wish to do the work, you could store the packed outputs of the final add

B0 = _mm512_add_epi32(reg0, B0);

into shared memory locations, shared by a hardware thread running in the same core. The additional thread could then read the packed data (from L1) and perform the scatter operations.

An alternative to try is when your intrinsic_function is used to process an array of B[64] and/or Bx[64] items, is to construct teams of the 4 hardware threads per core, each thread of each team processing every 4th item (different items per thread), of the adjacent items for the section assigned to the team. This may yield better L1 cache utilization. And this code can effectively be reused on KNL.

Jim Dempsey

Michael_J_5 · ‎06-12-2016

I think that makes sense.

For your first suggestion, what you're saying is to place the items for another thread to work on and then continue with the next array of items?

The second suggestion sounds like it might work, but I have a swizzle in the for loop that might complicate things. I end up needing to swap every element around twice. Once in the middle of the loop, and again at the end before the loop starts again. So it might be possible, but I'm not sure how that might complicate what you're suggesting.

        /***********
        //Swizzle-Step
        ************/
        reg2 = _mm512_shuffle_epi32(reg2, 0x4e);
        //Swap reg1 and reg3; do this with XOR or temp variable.
        //Temp Swap
        t0 = reg1;
        reg1 = reg3;
        reg3 = t0;
        //Xor Swap - Slower because of extra operations.
        /*
        reg1 = _mm512_xor_epi32(reg1, reg3);
        reg3 = _mm512_xor_epi32(reg1, reg3);
        reg1 = _mm512_xor_epi32(reg1, reg3);
        */
        //Swizzle for reg1 and reg3
        reg1 = _mm512_shuffle_epi32(reg1, 0x93);
        reg3 = _mm512_shuffle_epi32(reg3, 0x39);

jimdempseyatthecove · ‎06-12-2016

Michael,

You effectively create a single producer-single consumer queue. You will also need to have a progress indicator, most likely a counter that you increment after the producer inserts 4 new entries (what were stored in the final B0:B3 prior to store in memory). The queue would have to have overflow protection so you do not fill over the top of entries prior to them being fetched for the scatter store by the other thread. As to if you use 2 threads per core or 4 threads per core, this will require you to test both. If your compute section is fairly long they possibly 3 producers (to separate queues) and one consumer working on storing the produced data (less competition for the memory controller)

The producer thread will trade off 4 multi-cache line scatter writes for 4 single adjacent cache line writes plus a counter increment and store. Overwrite protection can cost you an additional operation. IIF your computation section (not shown) is fairly long, then you may only need one set of 4 buffers. But you may need more. IOW construct a ring buffer for the queue. Report back on your progress (or lack thereof).

Jim Dempsey