Hi,
I have a question related to the compiler-generated information.
I have compiled the kernel and see the following information. I am working on a Gen9 GT2 system.
Hi,
Could you please respond?
Thanks
Rajesh
Thanks Ben.
Let me give you some background on what I am trying to achieve.
I am processing a 1D vector of length 64 and doing MAC operations on its samples. The algorithm is quite lengthy: the 1D vector has to pass through many feed-forward stages. What I thought of doing was to assign a single work item and process the 1D vector with SIMD additions and multiplications, meaning I would process 8 additions or multiplications within a single work item instead of spreading them across 8 work items. In other words, I want to use the other code path you mentioned above: "(In theory our compiler could compile a kernel where a single work item executes in one EU thread, but this would be a separate codepath through our compiler, and these kernels don't generally run well on a GPU)".
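Roughly, I have something like this in mind (just a sketch, not my real kernel; the kernel name and the placeholder arithmetic stand in for my actual feed-forward stages):

__kernel void process_vector(__global const float *in, __global float *out)
{
    /* A single work item walks the whole 64-sample vector,
       8 samples at a time, relying on float8 SIMD arithmetic. */
    for (int i = 0; i < 64; i += 8)
    {
        float8 a = vload8(0, in + i);   /* load 8 samples */
        float8 b = a * 2.0f + 1.0f;     /* placeholder MAC-style stage */
        vstore8(b, 0, out + i);         /* store 8 results */
    }
}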
When you compile 8 different work items per EU thread, won't that limit the program code of the kernel? And if I break my algorithm into smaller kernels, won't that incur the VFE->TSG->TDG latencies of spawning new threads and reading the data again?
Please suggest a better way to program the kernel for the scenario I have described.
Thanks
Rajesh
rajesh k. wrote: please suggest a better kernel programming to process the scenario i have described.

Are the operations on your 1D vector all component-wise operations? If so, the usual way to do this is to have a work group process one (or possibly more than one) vector, and each work item in the work group process one (or possibly more than one) component of the vector. I'd start with one vector per work group and one component per work item, and play around with the partitioning from there once it's working, since different partitionings might give different performance. If they aren't component-wise operations then there are options for cross-work-item sharing, but this is more complicated.
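For a purely component-wise stage, the shape I have in mind is roughly this (a sketch only; the kernel name, the fixed vector length of 64, and the placeholder arithmetic are assumptions, not your actual algorithm):

/* One work group per vector, one work item per component.
   Enqueue with global size = num_vectors * 64 and local size = 64. */
__kernel void stage_componentwise(__global const float *in, __global float *out)
{
    const int vec  = get_group_id(0);   /* which vector this work group owns   */
    const int comp = get_local_id(0);   /* which component this work item owns */
    const int idx  = vec * 64 + comp;

    out[idx] = in[idx] * 2.0f + 1.0f;   /* placeholder component-wise operation */
}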
Hi Ben,
Just sharing more information about my algorithm.
My kernel is a set of functions; the output of one function flows into the next, like below.

Mykernel()
{
    func_1();
    func_2();
    func_3();
    func_4();
    func_5();
}
One of the functions looks like the code below.
It takes two input vectors, v1[17] and v2[80], and generates an output v3[64].
To generate one component of the output v3, I shift the input v2 by one, multiply by v1, and sum up the products.
Once the output is generated, v3 becomes the input to the next function, and so on.
Generating v3 using multiple work items may not be straightforward.
This is the simplest function of the kernel; the other functions are a little more complex.
What I have implemented is this: I used vector operations to generate v3 (vload16 for v1 and v2, then a "dotp" to produce each component). I do all of my kernel functions in a single work item, and I was expecting it to be mapped to a single EU thread so that I would have sufficient registers to implement this without a memory spill.
My question: is there any way I can map one single work item to a single EU thread, and at the same time have the compiler generate SIMD8 additions and multiplications? Please let me know.
void func_1(float *v1, float *v2, float *v3)
{
    float sum;
    int i;

    /* 17-tap MAC: v3[i] = sum over k = 0..16 of v1[k] * v2[i - k].
       v2 is assumed to point 16 samples into its 80-element buffer,
       so v2[i - 16] is valid even for i = 0. */
    for (i = 0; i < 64; i++)
    {
        sum  = v1[0] * v2[i];
        sum += v1[1] * v2[i - 1];
        sum += v1[2] * v2[i - 2];
        sum += v1[3] * v2[i - 3];
        sum += v1[4] * v2[i - 4];
        sum += v1[5] * v2[i - 5];
        sum += v1[6] * v2[i - 6];
        sum += v1[7] * v2[i - 7];
        sum += v1[8] * v2[i - 8];
        sum += v1[9] * v2[i - 9];
        sum += v1[10] * v2[i - 10];
        sum += v1[11] * v2[i - 11];
        sum += v1[12] * v2[i - 12];
        sum += v1[13] * v2[i - 13];
        sum += v1[14] * v2[i - 14];
        sum += v1[15] * v2[i - 15];
        sum += v1[16] * v2[i - 16];
        v3[i] = sum;
    }
    return;
}
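For reference, the vectorised version I described looks roughly like this (only a sketch, not my exact code; OpenCL has no 16-wide dot() built-in, so my "dotp" is written out as a multiply followed by a reduction, and I again assume v2 points 16 samples into its buffer):

void func_1_vec(const float *v1, const float *v2, float *v3)
{
    /* taps v1[1..16], loaded once */
    const float16 coef = vload16(0, v1 + 1);

    for (int i = 0; i < 64; i++)
    {
        /* v2[i - 16 .. i - 1], reversed so it lines up with v1[1..16] */
        float16 hist = vload16(0, v2 + i - 16);
        float16 rev  = hist.sfedcba9876543210;

        float16 prod = coef * rev;

        /* tree reduction of the 16 products */
        float8 r8 = prod.lo + prod.hi;
        float4 r4 = r8.lo + r8.hi;
        float2 r2 = r4.lo + r4.hi;

        v3[i] = r2.lo + r2.hi + v1[0] * v2[i];
    }
}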
Hi Ben,
Could you please share your comments?
One more observation I would like to bring to your notice:
I have created a work group of size 8, but I didn't change my kernel, which is basically written for a single work item; the only change is that I now read and write the data based on local_x (and a modified global_x) rather than the original global id.
My old thread space looks like this:
size_t globalWorkSize[2] = { 1024, 1 };

On the kernel side I read the data like the following:
-----------------------------------------------------------------
const int global_x = get_global_id(0);
const int local_x  = get_local_id(0);
__global float* x;   /* kernel argument in the real code */
float16 temp = vload16(0, x + global_x * 16);
-----------------------------------------------------------------
My new thread space looks like this:
size_t globalWorkSize[2] = { 1024, 1 };
-----------------------------------------------------------------
const int global_x = (get_global_id(0) >> 3) * 8;
const int local_x  = get_local_id(0);
__global float* x;   /* kernel argument in the real code */
float16 temp = vload16(0, x + (global_x + local_x) * 16);
-----------------------------------------------------------------
This is functionally correct: instead of reading the data based on the global index alone, I read it based on local_x and a modified global_x.
But I see a huge performance improvement with this change, and I am not able to understand why there is an improvement.
Could you please explain why this is happening?
Best Regards,
Rajesh
Hi,
Could you please respond?
Best Regards,
Rajesh
Hi Rajesh, it sounds like you're on the right track and I don't have too much more to add. Nice job with the performance improvements so far!
Going from a work group size of one to a work group size of eight (or even larger) will almost always result in a performance improvement because our Execution Unit (EU) ALUs are SIMD ALUs, and our IO instructions can load or store up to one cache line (64 bytes) per EU thread per clock. With only one work item per EU thread it's very difficult to keep either of these resources busy.
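To put rough numbers on the IO point (assuming 32-bit floats):

  1 work item  per EU thread:  1 x 4 bytes =  4 bytes per load (1/16 of a cache line)
  8 work items per EU thread:  8 x 4 bytes = 32 bytes per load (half a cache line)
 16 work items per EU thread: 16 x 4 bytes = 64 bytes per load (one full 64-byte cache line)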
