// data allocation
int* w_device = malloc_device<int>(M * N, Q);
int* b_device = malloc_device<int>(M, Q);
int* temp = malloc_device<int>(SIZE, Q);
int* x_shared = malloc_shared<int>(N * K, Q);
int* result_shared = malloc_shared<int>(M * K, Q);
//two events: copy data to device
auto e1 = ... cgh.memcpy(w_device...)..;
auto e2 = ..cgh.memcpy(b_device...)...;
// compute graph dependencies
cgh.depends_on({e1, e2});
// two for-loops in kernel function
// dimension: nd_range<2>{{M, K}, {1, GROUP_SIZE}}
int m = item.get_global_id(0);
int n = item.get_global_id(1);
int i = item.get_local_id(1);
// first: tiled loop, staging one tile of row m of w_device in temp
for (int k = 0; k < N; k += GROUP_SIZE) {
    temp[i] = w_device[m * N + k + i];
    item.barrier();
    for (int kk = 0; kk < GROUP_SIZE; kk++) {
        sum += temp[kk] * x_shared[(k + kk) * K + n];
        item.barrier();
    }
}
//second
for (int k = 0; k < N; k++) {
sum += w_device[m * N + k] * x_shared[k * K + n];
}
// write back, adding the bias
result_shared[m * K + n] = sum + b_device[m];
I am trying to learn SYCL and decided to write matrix multiplication from scratch. While testing my code I ran into behavior I cannot explain. I am not sure whether it is a bug or a mistake on my side, so I have posted the key part of the code above.
With the first for-loop, the result occasionally contains wrong elements (only some of them, not all), while the second for-loop never gives a wrong result.
If I have not made a mistake, could there be a problem with using item.barrier() to synchronize data accessed through USM?
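One thing worth checking, offered as a hypothesis and not a confirmed diagnosis: in the snippet, temp is allocated with malloc_device, so it is a single buffer in global memory visible to every work-group, whereas a barrier on the nd_item only synchronizes the work-items of one work-group. If several work-groups run concurrently, they could overwrite each other's temp[i] between the load and the inner loop. A tile that is private to each work-group (for example work-group local memory via sycl::local_accessor in SYCL 2020) would avoid that. The sketch below is plain host-side C++ with no SYCL calls; it models the tiled loop with one private tile per work-group and compares it against the naive loop. The function names matmul_naive and matmul_tiled are hypothetical, and it assumes N and K are multiples of the group size, as the posted loop also does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference, same arithmetic as the "second" loop in the post:
// result[m*K + n] = sum_k w[m*N + k] * x[k*K + n] + b[m].
std::vector<int> matmul_naive(const std::vector<int>& w, const std::vector<int>& x,
                              const std::vector<int>& b, int M, int N, int K) {
    std::vector<int> result(static_cast<std::size_t>(M) * K, 0);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < K; ++n) {
            int sum = 0;
            for (int k = 0; k < N; ++k) {
                sum += w[m * N + k] * x[k * K + n];
            }
            result[m * K + n] = sum + b[m];
        }
    }
    return result;
}

// Host model of the "first" (tiled) loop. Each (m, g) pair stands for one
// work-group of nd_range<2>{{M, K}, {1, group_size}} and owns its own tile,
// which is the property a single global temp buffer does not have.
std::vector<int> matmul_tiled(const std::vector<int>& w, const std::vector<int>& x,
                              const std::vector<int>& b, int M, int N, int K,
                              int group_size) {
    // Assumption: sizes divide evenly, otherwise w[m*N + k + i] reads out of range.
    assert(group_size > 0 && N % group_size == 0 && K % group_size == 0);
    std::vector<int> result(static_cast<std::size_t>(M) * K, 0);
    for (int m = 0; m < M; ++m) {
        for (int g = 0; g < K; g += group_size) {   // one iteration = one work-group
            std::vector<int> tile(group_size);      // private to this work-group
            std::vector<int> sums(group_size, 0);   // one accumulator per work-item
            for (int k = 0; k < N; k += group_size) {
                // Load phase: work-item i copies one element of row m.
                for (int i = 0; i < group_size; ++i) {
                    tile[i] = w[m * N + k + i];
                }
                // On a device, a work-group barrier would go here.
                for (int i = 0; i < group_size; ++i) {
                    int n = g + i;                  // global column of work-item i
                    for (int kk = 0; kk < group_size; ++kk) {
                        sums[i] += tile[kk] * x[(k + kk) * K + n];
                    }
                }
                // On a device, a second barrier would go here, before the
                // next iteration overwrites the tile.
            }
            for (int i = 0; i < group_size; ++i) {
                result[m * K + g + i] = sums[i] + b[m];
            }
        }
    }
    return result;
}
```

Copying result_shared back and comparing it element by element with the output of matmul_naive on the same inputs would show which entries differ, and whether the differences disappear once the tile is no longer shared across work-groups.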
Hi,
Could you please share the complete source code if possible?
Also, specify your oneAPI base toolkit version and OS.
Thanks,
Rahul
Hi,
I have not heard back from you. So, I will go ahead and close this thread from my end. Feel free to post a new query if you require further assistance from Intel.
Thanks,
Rahul