Ben_Rush
Beginner
64 Views

Example code for GfxImage2D, etc?

We have a situation where we're operating on data structures larger than the 4KB of per-thread storage the GPU affords. In particular, I'm trying to implement something similar to this: 

http://www.msr-waypoint.com/pubs/71445/ForestFire.pdf

What I've got is slow. I did some profiling to determine where the bottleneck was, and if this link is to be trusted, then the following sentence tells me that I need some memory optimization work: 

If the EU Array Stalled metric value is non-zero and correlates with the GPU L3 Misses, and if the algorithm is not memory bandwidth-bound, you should try to optimize memory accesses and layout.

Here is the output from the profiler (I was having trouble uploading the image to this forum post, so I uploaded it to my server): http://www.ben-rush.net/output.png

I'm hoping there may be salvation in the GfxImage2D function(s), enabling me to work more intelligently with the GPGPU's memory when using large data structures (images, for example). 

Or am I barking up the wrong tree? 

3 Replies
Anoop_M_Intel
Employee

Hi Ben,

Since your data structure is larger than the 4KB available to each individual hardware thread, are you using tiling (cache blocking) to make sure the data loaded into the General Register File (GRF) is used by all of the computation? If you don't tile the data to 4KB, a memory spill will happen, and the spilled data goes to system RAM, which is far slower than the GRF. Could you please share the kernel you are offloading to the integrated graphics so I can suggest some tuning options?
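As a rough CPU-side illustration of what cache blocking means here (a sketch only — the names and the 4KB budget are illustrative, not your actual kernel):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: walk a large buffer in tiles sized so each tile's
// working set fits within a ~4KB-per-thread budget (like the GRF).
constexpr std::size_t kTileBytes = 4096;
constexpr std::size_t kTileElems = kTileBytes / sizeof(float); // 1024 floats

float sum_tiled(const std::vector<float>& data) {
    float total = 0.0f;
    for (std::size_t base = 0; base < data.size(); base += kTileElems) {
        const std::size_t end = std::min(base + kTileElems, data.size());
        // The inner loop touches only one tile, so its working set stays
        // resident instead of spilling out to slower memory.
        for (std::size_t i = base; i < end; ++i)
            total += data[i];
    }
    return total;
}
```

The same idea applies on the GPU: structure the kernel so each hardware thread's inner loop touches at most ~4KB at a time.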

Thanks
Anoop

Ben_Rush
Beginner

Just a heads up, part of the issue that I'm experiencing here may be that I'm still quite green at programming for GPUs in general; so some of these issues I'm facing may be self-inflicted. But onto your question...

Anoop Madhusoodhanan Prabha (Intel) wrote:

are you using tiling (cache blocking)


No. The problem is that it's an entire decision tree, so it's not a "tilable" chunk of memory that can be broken into discrete pieces and operated on in parallel (the way an image can, for example). Tiling seems useful for things like large image structures, but for the particular task at hand I'm thinking it's not suitable. Maybe I'm wrong. The tree itself is approximately one megabyte in memory and consists of over 100,000 nodes. We can probably optimize that, but I doubt it would ever be small enough to fit within the per-thread cache. 

It works like this: for each pixel in an image, we use the decision tree to classify it. Each node encodes a "question" we ask of a pixel. So we start at the root node, ask a question, then branch left or right, ask another question, then branch left or right, and so on until we hit a leaf node. The leaf node encodes the "class" of the pixel. So sort of like this (but much much larger)....
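In plain C++ terms, the per-pixel traversal looks something like this (a hypothetical flattened node layout — the names are illustrative, not our actual structure):

```cpp
#include <vector>

// Hypothetical flattened layout for the tree; not our actual structure.
struct Node {
    int feature;     // index of the pixel "question" to ask; -1 marks a leaf
    float threshold; // branch left if the feature value is below this
    int left, right; // indices of the child nodes in the flat array
    int label;       // class of the pixel, meaningful only at leaves
};

// Classify one pixel by walking from the root down to a leaf.
int classify(const std::vector<Node>& tree, const float* features) {
    int idx = 0; // start at the root
    while (tree[idx].feature >= 0) {
        const Node& n = tree[idx];
        idx = (features[n.feature] < n.threshold) ? n.left : n.right;
    }
    return tree[idx].label;
}
```

Every pixel takes a data-dependent path through that one-megabyte array, which is why I don't see how to tile it.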

(attached image: tree_0.png — a sketch of the decision tree)

One burning question I have right now: if there are a lot of if/else conditionals in the kernel, there will be a performance penalty, right? I think this might also be an issue for us, because our kernel does have a lot of conditionals. Using the VTune Analyzer, is there a metric I can view to see how much this may be hurting us? And a sanity check: if/else is bad because divergent code defeats the SIMD nature of the EUs (branch divergence), and also because GPUs don't have as much instruction cache as CPUs, right? 
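For what it's worth, one common workaround I've read about (a general sketch, not specific to the Intel offload compiler) is to compute both sides of a conditional and blend with a select, so all SIMD lanes execute the same instructions:

```cpp
// Branchy form: SIMD lanes that disagree on the condition serialize.
float shade_branchy(float x) {
    if (x > 0.0f)
        return x * 2.0f;
    return -x;
}

// Branch-free form: both sides are evaluated and then blended; compilers
// typically lower a ternary on floats to a conditional select rather than
// a jump, so the lanes never diverge.
float shade_branchless(float x) {
    const float pos = x * 2.0f;
    const float neg = -x;
    return x > 0.0f ? pos : neg;
}
```

Whether that trade (extra arithmetic for no divergence) pays off presumably depends on how expensive each side is.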

Another question: would you recommend I start trying to leverage Shared Local Memory? The best documentation I've found thus far is this: 
https://software.intel.com/en-us/node/583701

I imagine I have more questions, but I'd rather not flood you right now. And I can share the kernel with you privately, but it's internal code and I don't feel comfortable sharing it publicly. 

Ben_Rush
Beginner

Anoop Madhusoodhanan Prabha (Intel) wrote:

Hi Ben,

Since your data structure is larger than the 4KB available to each individual hardware thread, are you using tiling (cache blocking) to make sure the data loaded into the General Register File (GRF) is used by all of the computation? If you don't tile the data to 4KB, a memory spill will happen, and the spilled data goes to system RAM, which is far slower than the GRF. Could you please share the kernel you are offloading to the integrated graphics so I can suggest some tuning options?

Thanks
Anoop

Anoop, 

I have some code for you to look at. I have absolutely no idea why my second version is slower than the original; I have to imagine there's a fundamental misunderstanding on my part. It's from a different area of the code than what started this thread, but the result is the same: it's slower than I expect it ought to be.  

In this kernel I'm doing "point conversion", or converting to different camera perspectives (again, computer vision). 

 

#pragma offload target(gfx) pin(projectivePoints,depthPixels,realWorldPoints,bedPoints,floorPoints:length(depthPixelsLength)) pin(affines:length(24))
    cilk_for(int i = 0; i < depthPixelsLength; i++)
    {
        projectivePoints[i].Y = i / width;
        projectivePoints[i].X = i - (projectivePoints[i].Y * width);
        projectivePoints[i].Z = depthPixels[i];
        projectivePoints[i].Perspective = 2;

        realWorldPoints[i].X = projectivePoints[i].Z * (projectivePoints[i].X - rotatedCx) * inverseRotatedFx;
        realWorldPoints[i].Y = projectivePoints[i].Z * (rotatedCy - projectivePoints[i].Y) * inverseRotatedFy;
        realWorldPoints[i].Z = projectivePoints[i].Z;
        realWorldPoints[i].Perspective = 1;

        bedPoints[i].X = (affines[0] * realWorldPoints[i].X + affines[1] * realWorldPoints[i].Y + affines[2] * realWorldPoints[i].Z) + affines[3];
        bedPoints[i].Y = (affines[4] * realWorldPoints[i].X + affines[5] * realWorldPoints[i].Y + affines[6] * realWorldPoints[i].Z) + affines[7];
        bedPoints[i].Z = (affines[8] * realWorldPoints[i].X + affines[9] * realWorldPoints[i].Y + affines[10] * realWorldPoints[i].Z) + affines[11];
        bedPoints[i].Perspective = 3;

        floorPoints[i].X = (affines[12] * realWorldPoints[i].X + affines[13] * realWorldPoints[i].Y + affines[14] * realWorldPoints[i].Z) + affines[15];
        floorPoints[i].Y = (affines[16] * realWorldPoints[i].X + affines[17] * realWorldPoints[i].Y + affines[18] * realWorldPoints[i].Z) + affines[19];
        floorPoints[i].Z = (affines[20] * realWorldPoints[i].X + affines[21] * realWorldPoints[i].Y + affines[22] * realWorldPoints[i].Z) + affines[23];
        floorPoints[i].Perspective = 4;
    }
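(As an aside, the first two statements in the loop body just recover 2D pixel coordinates from the flat index. As a plain C++ sanity check, with a hypothetical helper name:)

```cpp
struct Coord { int x, y; };

// Recover (x, y) from a flat row-major index, exactly as the loop body does.
// i - (i / width) * width is i % width written without the modulo operator.
Coord unflatten(int i, int width) {
    const int y = i / width;
    const int x = i - y * width;
    return {x, y};
}
```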

Now when I do the following, which as far as I can tell ought to be faster (because I'm vectorizing the "bedPoints" and "floorPoints" computations), it actually runs much slower. Why? The "realWorldPoint" array should be in the GRF, no? I'm using array notation, so it should be vectorized. I must be doing something wrong. 

 

#pragma offload target(gfx) pin(projectivePoints,depthPixels,realWorldPoints,bedPoints,floorPoints:length(depthPixelsLength)) pin(affines:length(24))
    cilk_for(int i = 0; i < depthPixelsLength; i++)
    {
        projectivePoints[i].Y = i / width;
        projectivePoints[i].X = i - (projectivePoints[i].Y * width);
        projectivePoints[i].Z = depthPixels[i];
        projectivePoints[i].Perspective = 2;

        realWorldPoints[i].X = projectivePoints[i].Z * (projectivePoints[i].X - rotatedCx) * inverseRotatedFx;
        realWorldPoints[i].Y = projectivePoints[i].Z * (rotatedCy - projectivePoints[i].Y) * inverseRotatedFy;
        realWorldPoints[i].Z = projectivePoints[i].Z;
        realWorldPoints[i].Perspective = 1;

        float realWorldPoint[3];
        realWorldPoint[0] = realWorldPoints[i].X;
        realWorldPoint[1] = realWorldPoints[i].Y;
        realWorldPoint[2] = realWorldPoints[i].Z;

        bedPoints[i].X = __sec_reduce_add(affines[0:3] * realWorldPoint[0:3]) + affines[3];
        bedPoints[i].Y = __sec_reduce_add(affines[4:3] * realWorldPoint[0:3]) + affines[7];
        bedPoints[i].Z = __sec_reduce_add(affines[8:3] * realWorldPoint[0:3]) + affines[11];
        bedPoints[i].Perspective = 3;

        floorPoints[i].X = __sec_reduce_add(affines[12:3] * realWorldPoint[0:3]) + affines[15];
        floorPoints[i].Y = __sec_reduce_add(affines[16:3] * realWorldPoint[0:3]) + affines[19];
        floorPoints[i].Z = __sec_reduce_add(affines[20:3] * realWorldPoint[0:3]) + affines[23];
        floorPoints[i].Perspective = 4;
    }
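For context, each `__sec_reduce_add(...)` above is just a 3-element dot product. Written out as straight-line scalar code (a sketch, with a hypothetical helper name), it needs no temporary array at all:

```cpp
// Scalar equivalent of __sec_reduce_add(a[0:3] * v[0:3]): a 3-wide dot
// product. The straight-line form avoids the realWorldPoint[3] temporary
// and leaves the compiler free to vectorize across loop iterations instead
// of within each tiny 3-element reduction.
float dot3(const float* a, float x, float y, float z) {
    return a[0] * x + a[1] * y + a[2] * z;
}
```

So one theory I have is that the per-iteration reduction and temporary array are what's forcing the slower code generation, and the fully scalar first version gives the compiler more freedom.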

The biggest difference I can find in the logging is that the faster version has a stride value > 1. Here is the slow version's logging. Note the stride: 1. 

GFX(13:39:32): LOOP0 - lower:0 upper:217088 iterations:217088 per_thread:1 stride:1
GFX(13:39:32): whole groups per s/slice: 7, threads in group: 8
GFX(13:39:32): threads per s/slice: 56, wasted threads per s/slice: 0
GFX(13:39:32): whole scheduling 'rounds': 1938, wasted threads per round: 0
GFX(13:39:32): wasted threads in last 'round': 80, total: 80
GFX(13:39:32): threads wasted due to iter/thread space incongruence: x:0, y:0, total:0
GFX(13:39:32): H/W waste is 0.0% total: 0.0% - scheduling, 0.0% - iter/thread space incongruence
GFX(2/13:39:32): Kernel parameters for level 1 loop nest (L__CreatePoints_CreatePointsIntel__SAHPEAGHHPEAX111PEAMNNNN_Z_CreatePointsIntel_cpp_39_39__par_region0_2):
GFX(2/13:39:32):   utcnt_0(0)=217088  start_0(1)=0  limit_0(2)=217088  iclen_0(3)=1


And here is the faster version. Note the stride: 32. 

GFX(13:40:18): LOOP0 - lower:0 upper:217088 iterations:6784 per_thread:1 stride:32
GFX(13:40:18): whole groups per s/slice: 7, threads in group: 8
GFX(13:40:18): threads per s/slice: 56, wasted threads per s/slice: 0
GFX(13:40:18): whole scheduling 'rounds': 60, wasted threads per round: 0
GFX(13:40:18): wasted threads in last 'round': 48, total: 48
GFX(13:40:18): threads wasted due to iter/thread space incongruence: x:0, y:0, total:0
GFX(13:40:18): H/W waste is 0.5% total: 0.5% - scheduling, 0.0% - iter/thread space incongruence
GFX(2/13:40:18): Kernel parameters for level 1 loop nest (L__CreatePoints_CreatePointsIntel__SAHPEAGHHPEAX111PEAMNNNN_Z_CreatePointsIntel_cpp_70_70__par_region0_2):
GFX(2/13:40:18):   utcnt_0(0)=6784  start_0(1)=0  limit_0(2)=217088  iclen_0(3)=32

So I thought maybe loop unrolling would help. When I tried that it went even SLOWER. 

 

#pragma offload target(gfx) pin(projectivePoints,depthPixels,realWorldPoints,bedPoints,floorPoints:length(depthPixelsLength)) pin(affines:length(24)) 
    cilk_for(int m = 0; m < depthPixelsLength; m += Offset)
    {
#pragma unroll (Offset)
        for (int j = 0; j < Offset; j++)
        {
            int i = j + m; 

            projectivePoints[i].Y = i / width;
            projectivePoints[i].X = i - (projectivePoints[i].Y * width);
            projectivePoints[i].Z = depthPixels[i];
            projectivePoints[i].Perspective = 2;

            realWorldPoints[i].X = projectivePoints[i].Z * (projectivePoints[i].X - rotatedCx) * inverseRotatedFx;
            realWorldPoints[i].Y = projectivePoints[i].Z * (rotatedCy - projectivePoints[i].Y) * inverseRotatedFy;
            realWorldPoints[i].Z = projectivePoints[i].Z;
            realWorldPoints[i].Perspective = 1;

            float realWorldPoint[3];
            realWorldPoint[0] = realWorldPoints[i].X;
            realWorldPoint[1] = realWorldPoints[i].Y;
            realWorldPoint[2] = realWorldPoints[i].Z;

            bedPoints[i].X = __sec_reduce_add(affines[0:3] * realWorldPoint[0:3]) + affines[3];
            bedPoints[i].Y = __sec_reduce_add(affines[4:3] * realWorldPoint[0:3]) + affines[7];
            bedPoints[i].Z = __sec_reduce_add(affines[8:3] * realWorldPoint[0:3]) + affines[11];
            bedPoints[i].Perspective = 3;

            floorPoints[i].X = __sec_reduce_add(affines[12:3] * realWorldPoint[0:3]) + affines[15];
            floorPoints[i].Y = __sec_reduce_add(affines[16:3] * realWorldPoint[0:3]) + affines[19];
            floorPoints[i].Z = __sec_reduce_add(affines[20:3] * realWorldPoint[0:3]) + affines[23];
            floorPoints[i].Perspective = 4;
        }
    }

 
