Hi,
I have a question about the compiler-generated information.
I compiled my kernel and see the following information. I am working on a Gen9 GT2 system.
Hi,
Could you please respond?
Thanks
Rajesh
Thanks Ben.
Let me give you some background on what I am trying to achieve.
I am processing a 1D vector of length 64, performing MAC operations on its samples. The algorithm is quite lengthy: the 1D vector has to pass through many feed-forward stages. My plan was to assign a single work item and process the 1D vector with SIMD additions and multiplications, i.e. perform 8 additions or multiplications at a time inside one work item instead of spreading the work across 8 work items. In other words, I want the other code path you suggested above: "(In theory our compiler could compile a kernel where a single work item executes in one EU thread, but this would be a separate codepath through our compiler, and these kernels don't generally run well on a GPU)".
When you compile 8 different work items per EU thread, will that not limit the program size of the kernel? And if I break my algorithm into smaller kernels, will that not incur the VFE->TSG->TDG latencies of spawning new threads and reading the data again?
Please suggest a better kernel programming approach for the scenario I have described.
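For concreteness, here is a minimal sketch of the single-work-item, vectorized approach described above. The kernel name, buffer names, and the float8 chunking are illustrative assumptions, not code from this thread:

// Hypothetical sketch: one work item walks the whole 64-element vector
// as eight float8 chunks, so each statement below is an 8-wide SIMD
// operation inside a single work item.
__kernel void process_vector(__global const float *in,
                             __global const float *coeff,
                             __global float *out)
{
    for (int c = 0; c < 8; c++)        // 8 chunks x 8 lanes = 64 samples
    {
        float8 v = vload8(c, in);      // 8 loads in one operation
        float8 k = vload8(c, coeff);
        float8 r = v * k + v;          // 8 multiply-accumulates at once
        vstore8(r, c, out);            // 8 stores in one operation
    }
}

Whether a kernel like this gets mapped to a single EU thread is up to the compiler, which is exactly the code-path question raised above.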
Thanks
Rajesh
rajesh k. wrote:
"Please suggest a better kernel programming approach for the scenario I have described."
Are the operations on your 1D vector all component-wise operations? If so, the usual way to do this is to have a work group process one (or possibly more than one) vector, and each work item in the work group process one (or possibly more than one) component of the vector. I'd start with one vector per work group and one component per work item, and play around with the partitioning from there once it's working, since different partitioning might give different performance. If they aren't component-wise operations then there are options for cross-work-item sharing, but this is more complicated.
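A minimal sketch of the partitioning described here, assuming the operations really are component-wise; the kernel name, buffer names, and the squaring operation are hypothetical:

// One work group per vector, one work item per component.
// Enqueue with global size = num_vectors * 64 and local size = 64.
__kernel void per_component(__global const float *in,
                            __global float *out)
{
    const int vec  = get_group_id(0);   // which vector this group owns
    const int comp = get_local_id(0);   // which component this item owns
    const int idx  = vec * 64 + comp;
    out[idx] = in[idx] * in[idx];       // any component-wise operation
}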
Hi Ben,
Just sharing more information about my algorithm.
My kernel is a set of functions; the output of one function flows down to the next, like this:
MyKernel()
{
    func_1();
    func_2();
    func_3();
    func_4();
    func_5();
}
One of the functions looks like the code below.
It takes two input vectors, v1[17] and v2[80], and generates an output vector v3[64].
To generate one component of the output v3, I shift the input v2 by one, multiply by v1, and sum up the products.
Once generated, v3 becomes the input to the next function, and so on.
Generating v3 using multiple work items may not be straightforward.
This is the simplest function in the kernel; the others are a little more complex.
What I implemented is this: I use vector operations to generate v3, vload16 for v1 and v2, and a dot product ("dotp") to produce each component of v3. I run all of my kernel functions in a single work item, and I was expecting it to be mapped to a single EU thread so that I would have sufficient registers to implement this functionality without a memory spill.
My question: is there any way I can map one single work item to a single EU thread, while at the same time having the compiler generate SIMD8 additions and multiplications? Please let me know.
/* 17-tap filter. The [i] subscripts on the first and last statements of
 * the loop body were apparently eaten by the forum formatting and are
 * restored here. Note that v2[i - 16] implies the first 16 elements of
 * the 80-element buffer are history samples, i.e. v2 as passed in points
 * 16 elements past the start of that buffer. */
void func_1(float *v1, float *v2, float *v3)
{
    float sum;
    int i;
    for (i = 0; i < 64; i++)
    {
        sum = v2[i];                /* v1[0] appears unused (implicitly 1) */
        sum += v1[1] * v2[i - 1];
        sum += v1[2] * v2[i - 2];
        sum += v1[3] * v2[i - 3];
        sum += v1[4] * v2[i - 4];
        sum += v1[5] * v2[i - 5];
        sum += v1[6] * v2[i - 6];
        sum += v1[7] * v2[i - 7];
        sum += v1[8] * v2[i - 8];
        sum += v1[9] * v2[i - 9];
        sum += v1[10] * v2[i - 10];
        sum += v1[11] * v2[i - 11];
        sum += v1[12] * v2[i - 12];
        sum += v1[13] * v2[i - 13];
        sum += v1[14] * v2[i - 14];
        sum += v1[15] * v2[i - 15];
        sum += v1[16] * v2[i - 16];
        v3[i] = sum;                /* posted as "v2 = sum"; per the
                                       description the output vector is v3 */
    }
    return;
}
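A minimal sketch of the vload16 + dot-product formulation described above. OpenCL's built-in dot() only handles vectors up to 4 components, so the 16-wide reduction is written out explicitly; the helper names and the tap-reversal trick are assumptions, not code from this thread:

/* Horizontal dot product of two float16 vectors. */
float dot16(float16 a, float16 b)
{
    float16 p = a * b;           /* 16 multiplications in one operation */
    float8 s8 = p.lo + p.hi;     /* pairwise reduction ...              */
    float4 s4 = s8.lo + s8.hi;
    float2 s2 = s4.lo + s4.hi;
    return s2.x + s2.y;          /* ... down to a scalar                */
}

void func_1_vec(const float *v1, const float *v2, float *v3)
{
    /* Reverse the taps once so the shifted window v2[i-16..i-1] lines
     * up with a contiguous vload16: v1r[k] = v1[16 - k]. */
    float v1r[16];
    for (int k = 0; k < 16; k++)
        v1r[k] = v1[16 - k];
    float16 taps = vload16(0, v1r);
    for (int i = 0; i < 64; i++)
    {
        /* Assumes, as in func_1 above, that v2 points 16 history
         * samples into the 80-element buffer, so v2 + i - 16 is valid. */
        float16 win = vload16(0, v2 + i - 16);
        v3[i] = v2[i] + dot16(taps, win);
    }
}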
Hi Ben,
Could you please share your comments?
One more observation I would like to bring to your notice:
I created a work group of size 8, but I did not otherwise change my kernel, which is basically written for a single work item; I now read and write the data based on local_x rather than global_id_x.
My old thread space looks like this:
size_t globalWorkSize[2] = { 1024, 1 };
On the kernel side I read the data like this:
-----------------------------------------------------------------
const int global_x = get_global_id(0);
const int local_x = get_local_id(0);
__global float* x;
float16 temp;
temp = vload16(0, x + global_x * 16);   // vloadn returns its result;
                                        // it takes no destination argument
-----------------------------------------------------------
My new thread space looks like this:
size_t globalWorkSize[2] = { 1024, 1 };
const int global_x = (get_global_id(0) >> 3) * 8;
const int local_x = get_local_id(0);
__global float* x;
float16 temp;
temp = vload16(0, x + (global_x + local_x) * 16);
-------------------------------------------------------------------
This is functionally correct: instead of reading the data based on the global index, I read it based on local_x and a modified global_x.
I see a huge performance improvement with this change, but I am not able to understand why.
Could you please explain why this is happening?
Best Regards,
Rajesh
Hi,
Could you please respond?
Best Regards,
Rajesh
Hi Rajesh, it sounds like you're on the right track and I don't have too much more to add. Nice job with the performance improvements so far!
Going from a work group size of one to a work group size of eight (or even larger) will almost always result in a performance improvement because our Execution Unit (EU) ALUs are SIMD ALUs, and our IO instructions can load or store up to one cache line (64 bytes) per EU thread per clock. With only one work item per EU thread it's very difficult to keep either of these resources busy.
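For reference, the work-group size discussed here is what the host passes as the local size when enqueuing the kernel. A minimal sketch using the sizes from this thread (the queue and kernel handles are assumed to already exist):

// 1024 work items in total, grouped 8 per work group, so the
// compiler's SIMD8 code path can fill each EU thread's lanes.
size_t globalWorkSize[2] = { 1024, 1 };
size_t localWorkSize[2]  = { 8, 1 };
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                    globalWorkSize, localWorkSize,
                                    0, NULL, NULL);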
Thank you Ben!
