Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Question about divergence and branch instructions when coding for Intel Graphics.

Ben_Rush
Beginner

I looked and I'm not sure if there is a better forum for this, so I'll ask here. If there is, please let me know and I'll move my question. 

I'm quite new to developing for the GPU in general, so some topics are still a bit new/confusing to me. One of them is the notion of divergence when it comes to branch instructions: how a SIMD EU can become "stalled" when different kernel instances take different paths (please check my terminology here too). It's my understanding that, with NVIDIA, threads are grouped into a "warp" (AMD's term is "wavefront") that executes in lockstep: the individual threads are fine as long as they're all executing the same instruction, but once that's no longer the case, some threads are masked off and stall. This means the SIMD lanes aren't maximally used, and performance suffers.

My high-level question is this: when programming for Intel's GPGPU, is this still the case? Or does it become more of an issue of the instruction cache not being used optimally? Or both? Or neither? This article seems to indicate divergence is an "issue" on the Intel GPGPU as well: https://software.intel.com/en-us/node/540425. And I understand it's not really an issue, just the way GPUs work, possibly.
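
To make sure I'm describing the right phenomenon, here's a minimal sketch in plain C++ (the data/out arrays and the arithmetic are made-up stand-ins, not my real code) of the kind of per-element branch I mean, along with a branch-free version of the same computation:

#include <cstddef>

// Sketch of a divergent per-element branch. On SIMD hardware, when lanes
// within the same execution group take different sides of the branch, both
// paths are typically executed with the inactive lanes masked off.
void divergentKernel(const int* data, int* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
    {
        if (data[i] > 0)            // lanes can split here
            out[i] = data[i] * 3;   // executed by some lanes
        else
            out[i] = -data[i];      // executed by the rest
    }
}

// The same computation with the branch folded into a select: every lane
// runs an identical instruction stream, so there is nothing to diverge on.
void uniformKernel(const int* data, int* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        out[i] = (data[i] > 0) ? data[i] * 3 : -data[i];
}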

I ask because I'm trying to put a decision tree (a machine learning technique) on the GPU. Not the training, just the evaluation of an already-trained tree. I'm using it for computer vision, so I run each pixel of an image down the decision tree. The tree is binary; at each node a question is asked of the pixel, the answer causes slightly different logic to execute, and we branch left or right until we reach a leaf node. I figured I would gain massive speed increases by putting this on the GPU. I was wrong; I see little speed improvement. The problem, as far as I can tell, is that this is precisely a worst-case scenario for the GPU, because different threads execute slightly different logic depending on the particular pixel they're working on, causing massive stalling as divergence and re-merging occur. But the worlds of CPU and GPU seem to be blurring more and more every day, so perhaps I'm wrong and this isn't why my code runs much slower than expected. A simplified sketch of the traversal follows below.
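
Here is that sketch (again, plain C++ and not my real code; the threshold/child/label arrays are hypothetical stand-ins for my actual tree layout):

// Simplified sketch of per-pixel decision-tree evaluation.
// Hypothetical layout: a node is a leaf when leftChild[node] < 0.
void classifyPixels(const unsigned char* pixels, int numPixels,
                    const int* threshold, const int* leftChild,
                    const int* rightChild, const int* leafLabel,
                    int* labels)
{
    for (int i = 0; i < numPixels; i++)
    {
        int node = 0; // every pixel starts at the root
        while (leftChild[node] >= 0)
        {
            // The comparison outcome differs pixel to pixel, so SIMD lanes
            // working on neighboring pixels can diverge at every level.
            if (pixels[i] <= threshold[node])
                node = leftChild[node];
            else
                node = rightChild[node];
        }
        labels[i] = leafLabel[node];
    }
}

// Aside: for a complete tree stored level-by-level (left child at 2n+1,
// right at 2n+2), the same walk can be written branch-free, e.g.
//     node = 2 * node + 1 + (pixels[i] > threshold[node]);
// so every lane executes the same instructions and only the data differs.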

Thoughts? Again, I apologize if this is the wrong forum. 

KitturGanesh
Employee

Hi Ben,
Just letting you know I've passed your question on to a peer of mine who specializes in offload support, so you can get a correct response to your questions. I appreciate your patience until then.
Kittur

Ben_Rush
Beginner

Thanks Kittur, 

Actually, I have a more detailed question regarding the exact same problem: detailed in the sense that I can give you more to chew on, since I've isolated the "issue" to sample code and have debugging info I can share. I'm sure it's a product of me not understanding some internals, but I think this forum is the best opportunity for me to learn, so I kind of have to ask it here. (Please note: I'd still be interested in your engineers' response regarding branch divergence too, as I believe that's also an issue.)

Consider the following code (also, please note, and pass along to anyone who may be looking: I understand vector addition can be done more simply; the reason I'm doing it this way is to coax out of the compiler the quirk I'm seeing):

#define RUNSLOW

int* AddRandomVectorsNTimesGPU(int numberOfTimes, int arrayLength, int* arrayA, int* arrayB)
{
    int* result = new int[arrayLength];
    result[0:arrayLength] = 0;

#pragma offload target(gfx) pin(arrayA,arrayB,result:length(arrayLength))
    cilk_for(int c = 0; c < arrayLength; c++)
    {
#ifdef RUNSLOW
        int localA = arrayA[c]; // copy the element into a local (register) once
#endif
        int localB = arrayB[c];
        int localResult = 0; 
        for (int d = 0; d < numberOfTimes; d++)
        {
#ifdef RUNSLOW
            localResult += localA + localB; // both operands live in locals
#else
            localResult += arrayA[c] + localB; // re-reads the pinned array each iteration
#endif
        }
        result[c] = localResult;
    }

    return result;
}

Okay. So if I comment out RUNSLOW it runs fast; if I put RUNSLOW back in, it runs much slower. Here is the difference in output when I set GFX_LOG_OFFLOAD = 3.

Run slower: 

GFX(15:37:07): LOOP0 - lower:0 upper:1000 iterations:32 per_thread:1 stride:32
GFX(15:37:07): whole groups per s/slice: 7, threads in group: 8
GFX(15:37:07): threads per s/slice: 56, wasted threads per s/slice: 0
GFX(15:37:07): whole scheduling 'rounds': 0, wasted threads per round: 0
GFX(15:37:07): wasted threads in last 'round': 80, total: 80
GFX(15:37:07): threads wasted due to iter/thread space incongruence: x:0, y:0, total:0
GFX(15:37:07): H/W waste is 49.7% total: 49.7% - scheduling, 0.0% - iter/thread space incongruence
GFX(2/15:37:07): Kernel parameters for level 1 loop nest (L__AddRandomVectorsNTimesGPU2__YAPEAHHHPEAH0_Z_DecisionTreeSimulatorIntel_cpp_289_289__par_region0_2):
GFX(2/15:37:07):   utcnt_0(0)=32  start_0(1)=0  limit_0(2)=1000  iclen_0(3)=32

Run faster: 

GFX(15:48:30): LOOP0 - lower:0 upper:1000 iterations:1000 per_thread:1 stride:1
GFX(15:48:30): whole groups per s/slice: 7, threads in group: 8
GFX(15:48:30): threads per s/slice: 56, wasted threads per s/slice: 0
GFX(15:48:30): whole scheduling 'rounds': 8, wasted threads per round: 0
GFX(15:48:30): wasted threads in last 'round': 8, total: 8
GFX(15:48:30): threads wasted due to iter/thread space incongruence: x:0, y:0, total:0
GFX(15:48:30): H/W waste is 0.6% total: 0.6% - scheduling, 0.0% - iter/thread space incongruence
GFX(2/15:48:30): Kernel parameters for level 1 loop nest (L__AddRandomVectorsNTimesGPU__YAPEAHHHPEAH0_Z_DecisionTreeSimulatorIntel_cpp_272_272__par_region0_2):
GFX(2/15:48:30):   utcnt_0(0)=1000  start_0(1)=0  limit_0(2)=1000  iclen_0(3)=1

I have a feeling it has to do with how the compiler is able to vectorize, no? What I'm trying to do is put the local variables used in the loop above into GPU registers instead of continually accessing pinned memory, so at first it seemed like the right idea. Apparently not? Regardless, can you explain the reason behind the change in iteration count? It literally looks as though more threads are being dispatched for the faster version than for the slower one.

Here is another part of the output which seems to indicate as much:

Run faster: 

GFX(2/15:48:30): Thread group space created (w x h): [1 x 8] threads, [1 x 125] groups

Run slower: 

GFX(2/15:37:07): Thread group space created (w x h): [1 x 8] threads, [1 x 4] groups

Again, I'm trying to actually understand what's going on under the hood here; that's why I'm asking all of these questions. Also, are "groups" EUs in this sense? And does s/slice mean subslice?

KitturGanesh
Employee

Thanks Ben. I could respond to some of the questions, but I'd rather wait for a complete, properly detailed response from my peer, to whom I've passed on your new update as well. Stay tuned...
Kittur
