Hi,
I have a question about the compiler-generated information.
I compiled my kernel and see the following information. I am working on a Gen9 GT2 system.
Hi,
Could you please respond?
Thanks
Rajesh
Thanks Ben.
Let me give you some background on what I am trying to achieve.
I am processing a 1D vector of length 64, performing MAC operations on its samples. The algorithm is quite lengthy: the 1D vector has to pass through many feed-forward stages. My plan was to assign a single work item and process the 1D vector with SIMD additions and multiplications, i.e. perform 8 additions or multiplications at a time inside one work item instead of spreading the work across 8 work items. In other words, I want the other code path you suggested above: "(In theory our compiler could compile a kernel where a single work item executes in one EU thread, but this would be a separate codepath through our compiler, and these kernels don't generally run well on a GPU)".
When you compile 8 different work items per EU thread, will that not limit the program size of the kernel? And if I break my algorithm into smaller kernels, will that not incur the VFE->TSG->TDG latencies of spawning new threads and reading the data again?
Please suggest a better kernel programming approach for the scenario I have described.
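For concreteness, here is a minimal sketch of the single-work-item, vectorized approach described above. The kernel name, buffer names, and the float8 chunking are illustrative assumptions, not code from this thread:

// Hypothetical sketch: one work item walks the whole 64-element vector
// as eight float8 chunks, so each statement below is an 8-wide SIMD
// operation inside a single work item.
__kernel void process_vector(__global const float *in,
                             __global const float *coeff,
                             __global float *out)
{
    for (int c = 0; c < 8; c++)        // 8 chunks x 8 lanes = 64 samples
    {
        float8 v = vload8(c, in);      // 8 loads in one operation
        float8 k = vload8(c, coeff);
        float8 r = v * k + v;          // 8 multiply-accumulates at once
        vstore8(r, c, out);            // 8 stores in one operation
    }
}

Whether a kernel like this gets mapped to a single EU thread is up to the compiler, which is exactly the code-path question raised above.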
Thanks
Rajesh
rajesh k. wrote:
"Please suggest a better kernel programming approach for the scenario I have described."
Are the operations on your 1D vector all component-wise operations? If so, the usual way to do this is to have a work group process one (or possibly more than one) vector, and each work item in the work group process one (or possibly more than one) component of the vector. I'd start with one vector per work group and one component per work item, and play around with the partitioning from there once it's working, since different partitioning might give different performance. If they aren't component-wise operations then there are options for cross-work-item sharing, but this is more complicated.
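A minimal sketch of the partitioning described here, assuming the operations really are component-wise; the kernel name, buffer names, and the squaring operation are hypothetical:

// One work group per vector, one work item per component.
// Enqueue with global size = num_vectors * 64 and local size = 64.
__kernel void per_component(__global const float *in,
                            __global float *out)
{
    const int vec  = get_group_id(0);   // which vector this group owns
    const int comp = get_local_id(0);   // which component this item owns
    const int idx  = vec * 64 + comp;
    out[idx] = in[idx] * in[idx];       // any component-wise operation
}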
Hi Ben,
Just sharing more information about my algorithm.
My kernel is a set of functions; the output of one function flows down to the next, like this:
MyKernel()
{
    func_1();
    func_2();
    func_3();
    func_4();
    func_5();
}
One of the functions looks like the code below.
It takes two input vectors, v1[17] and v2[80], and generates an output vector v3[64].
To generate one component of the output v3, I shift the input v2 by one, multiply by v1, and sum up the products.
Once generated, v3 becomes the input to the next function, and so on.
Generating v3 using multiple work items may not be straightforward.
This is the simplest function in the kernel; the others are a little more complex.
What I implemented is this: I use vector operations to generate v3, vload16 for v1 and v2, and a dot product ("dotp") to produce each component of v3. I run all of my kernel functions in a single work item, and I was expecting it to be mapped to a single EU thread so that I would have sufficient registers to implement this functionality without a memory spill.
My question: is there any way I can map one single work item to a single EU thread, while at the same time having the compiler generate SIMD8 additions and multiplications? Please let me know.
/* 17-tap filter. The [i] subscripts on the first and last statements of
 * the loop body were apparently eaten by the forum formatting and are
 * restored here. Note that v2[i - 16] implies the first 16 elements of
 * the 80-element buffer are history samples, i.e. v2 as passed in points
 * 16 elements past the start of that buffer. */
void func_1(float *v1, float *v2, float *v3)
{
    float sum;
    int i;
    for (i = 0; i < 64; i++)
    {
        sum = v2[i];                /* v1[0] appears unused (implicitly 1) */
        sum += v1[1] * v2[i - 1];
        sum += v1[2] * v2[i - 2];
        sum += v1[3] * v2[i - 3];
        sum += v1[4] * v2[i - 4];
        sum += v1[5] * v2[i - 5];
        sum += v1[6] * v2[i - 6];
        sum += v1[7] * v2[i - 7];
        sum += v1[8] * v2[i - 8];
        sum += v1[9] * v2[i - 9];
        sum += v1[10] * v2[i - 10];
        sum += v1[11] * v2[i - 11];
        sum += v1[12] * v2[i - 12];
        sum += v1[13] * v2[i - 13];
        sum += v1[14] * v2[i - 14];
        sum += v1[15] * v2[i - 15];
        sum += v1[16] * v2[i - 16];
        v3[i] = sum;                /* posted as "v2 = sum"; per the
                                       description the output vector is v3 */
    }
    return;
}
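A minimal sketch of the vload16 + dot-product formulation described above. OpenCL's built-in dot() only handles vectors up to 4 components, so the 16-wide reduction is written out explicitly; the helper names and the tap-reversal trick are assumptions, not code from this thread:

/* Horizontal dot product of two float16 vectors. */
float dot16(float16 a, float16 b)
{
    float16 p = a * b;           /* 16 multiplications in one operation */
    float8 s8 = p.lo + p.hi;     /* pairwise reduction ...              */
    float4 s4 = s8.lo + s8.hi;
    float2 s2 = s4.lo + s4.hi;
    return s2.x + s2.y;          /* ... down to a scalar                */
}

void func_1_vec(const float *v1, const float *v2, float *v3)
{
    /* Reverse the taps once so the shifted window v2[i-16..i-1] lines
     * up with a contiguous vload16: v1r[k] = v1[16 - k]. */
    float v1r[16];
    for (int k = 0; k < 16; k++)
        v1r[k] = v1[16 - k];
    float16 taps = vload16(0, v1r);
    for (int i = 0; i < 64; i++)
    {
        /* Assumes, as in func_1 above, that v2 points 16 history
         * samples into the 80-element buffer, so v2 + i - 16 is valid. */
        float16 win = vload16(0, v2 + i - 16);
        v3[i] = v2[i] + dot16(taps, win);
    }
}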
Hi Ben,
Could you please share your comments?
One more observation I would like to bring to your notice:
I created a work group of size 8, but I did not otherwise change my kernel, which is basically written for a single work item; I now read and write the data based on local_x rather than global_id_x.
My old thread space looks like this:
size_t globalWorkSize[2] = { 1024, 1 };
On the kernel side I read the data like this:
-----------------------------------------------------------------
const int global_x = get_global_id(0);
const int local_x = get_local_id(0);
__global float* x;
float16 temp;
temp = vload16(0, x + global_x * 16);   // vloadn returns its result;
                                        // it takes no destination argument
-----------------------------------------------------------
My new thread space looks like this:
size_t globalWorkSize[2] = { 1024, 1 };
const int global_x = (get_global_id(0) >> 3) * 8;
const int local_x = get_local_id(0);
__global float* x;
float16 temp;
temp = vload16(0, x + (global_x + local_x) * 16);
-------------------------------------------------------------------
This is functionally correct: instead of reading the data based on the global index, I read it based on local_x and a modified global_x.
I see a huge performance improvement with this change, but I am not able to understand why.
Could you please explain why this is happening?
Best Regards,
Rajesh
Hi,
Could you please respond?
Best Regards,
Rajesh
Hi Rajesh, it sounds like you're on the right track and I don't have too much more to add. Nice job with the performance improvements so far!
Going from a work group size of one to a work group size of eight (or even larger) will almost always result in a performance improvement because our Execution Unit (EU) ALUs are SIMD ALUs, and our IO instructions can load or store up to one cache line (64 bytes) per EU thread per clock. With only one work item per EU thread it's very difficult to keep either of these resources busy.
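For reference, the work-group size discussed here is what the host passes as the local size when enqueuing the kernel. A minimal sketch using the sizes from this thread (the queue and kernel handles are assumed to already exist):

// 1024 work items in total, grouped 8 per work group, so the
// compiler's SIMD8 code path can fill each EU thread's lanes.
size_t globalWorkSize[2] = { 1024, 1 };
size_t localWorkSize[2]  = { 8, 1 };
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                    globalWorkSize, localWorkSize,
                                    0, NULL, NULL);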
Thank you Ben!
