#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <CL/sycl.hpp>
namespace sycl = cl::sycl;
// Bit-cast helpers (memcpy avoids the strict-aliasing undefined behavior of pointer casts).
unsigned as_uint(const float x) {
    unsigned u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}
float as_float(const unsigned x) {
    float f;
    std::memcpy(&f, &x, sizeof f);
    return f;
}
// Converts a float to its IEEE-754 half-precision bit pattern.
unsigned short float_to_half(const float x) {
    const unsigned b = as_uint(x) + 0x00001000;   // round mantissa
    const unsigned e = (b & 0x7F800000) >> 23;    // exponent
    const unsigned m = b & 0x007FFFFF;            // mantissa
    return (b & 0x80000000) >> 16                                                 // sign
         | (e > 112) * ((((e - 112) << 10) & 0x7C00) | m >> 13)                   // normal
         | ((e < 113) & (e > 101)) * ((((0x007FF000 + m) >> (125 - e)) + 1) >> 1) // subnormal
         | (e > 143) * 0x7FFF;                                                    // saturate
}
int main() {
    sycl::half* v1 = (sycl::half*)malloc(1024 * sizeof(sycl::half));
    sycl::half* v2 = (sycl::half*)malloc(1024 * sizeof(sycl::half));
    sycl::half* out = (sycl::half*)malloc(1024 * sizeof(sycl::half));
    float val;
    for (int i = 0; i < 1024; i++) {
        val = (float)i;
        v1[i] = float_to_half(val);
        v2[i] = float_to_half(val + 1);
        out[i] = 0;
    }
    sycl::gpu_selector device_selector;
    sycl::queue d_queue(device_selector);
    sycl::buffer<sycl::half, 1> a_device(v1, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> b_device(v2, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> c_device(out, sycl::range<1>(1024));
    {
        d_queue.submit([&](sycl::handler& cgh) {
            auto A = a_device.get_access<sycl::access::mode::read>(cgh);
            auto B = b_device.get_access<sycl::access::mode::read>(cgh);
            // note: += below reads C as well, so access::mode::read_write would be more correct
            auto C = c_device.get_access<sycl::access::mode::write>(cgh);
            cgh.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> idx) {
                C[idx] += A[idx] * B[idx];
            });
        });
    }
    for (int i = 0; i < 20; i++) {
        // half cannot be passed through printf's varargs; convert to unsigned first
        printf("%u,%u,%u,\n", (unsigned)v1[i], (unsigned)v2[i], (unsigned)out[i]);
    }
    free(v1);
    free(v2);
    free(out);
    return 0;
}
out is all zeros; I don't think the output is being calculated correctly.
If I use "float" instead of half, I get the correct values.
Best regards,
Hi,
The zeros you see in the output are due to the device part of the code not being synchronized with the host part.
I can see that you have not enclosed the buffer declarations within the scope (the DPC++ scope); in your code, the scope starts after the buffers are declared. As a result, the program does not wait for the device kernel to execute.
In other words, the program skips past the device kernel execution and proceeds with the rest of the program (printing the values).
When you enclose the buffers within the scope, program execution is bound to wait until the buffer destructors are called (since the buffers are declared within the scope). Once the device kernel execution completes, the data is copied back from device to host implicitly and the buffer destructors run.
sycl::gpu_selector device_selector;
sycl::queue d_queue(device_selector);
{
    sycl::buffer<sycl::half, 1> a_device(v1, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> b_device(v2, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> c_device(out, sycl::range<1>(1024));
    d_queue.submit([&](sycl::handler& cgh) {
        auto A = a_device.get_access<sycl::access::mode::read>(cgh);
        auto B = b_device.get_access<sycl::access::mode::read>(cgh);
        auto C = c_device.get_access<sycl::access::mode::write>(cgh);
        cgh.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> idx) {
            C[idx] += A[idx] * B[idx];
        });
    });
}
Let me know if you face any issue.
Note: In some cases the device kernel runs extremely fast (depending on the size/type of the application). In such cases, synchronization might happen even if the buffers aren't enclosed within the scope. However, it is good practice to enclose the buffers within a scope.
--Rahul
Hello,
Thank you for your answer.
I fixed my code according to your advice, and now I get a calculation result.
But I think the result is incorrect.
For example, performing the following calculation with parallel_for:
C[idx] = A[idx] * B[idx];
<expected value>
out = v1 * v2
v1, v2, out
0, 1, 0
1, 2, 2 (half:29728)
2, 3, 6 (half:29808)
3, 4, 12 (half:29864)
However, the calculation result is as follows.
v1, v2, out(half)
0, 1, 0
1, 2, 31744
2, 3, 31744
3, 4, 31744
...
What am I doing wrong?
Please advise me.
Hi,
I'd suggest comparing the results of the serial version of your code with its DPC++ version.
Perform the serial computation on the CPU (a simple C++ loop should do) and compare its results with the parallel (DPC++) computation.
Let me know the results after comparison.
Thanks,
Rahul
Hi,
Since the results are similar for both the serial and parallel versions, I suspect the issue is with the float_to_half(const float x) function.
The "half" type is part of the OpenCL/SYCL specification; it is not present in standard C++. A float occupies 4 bytes, whereas a half occupies only 2. For more information on the half type, kindly refer to the OpenCL/SYCL specifications.
I don't see the point of the float_to_half(const float x) function, because you can use the half type directly, which already gives you half-precision floats. My suggestion would be to try the computation without calling this function (on the host) and see if it gives the right results.
As for speed-up, there could be some performance improvement with half (logically speaking). However, I'd suggest you modify your code to print the time taken for the computation in both cases and let me know.
Regards,
Rahul
Hi,
Could you kindly confirm if the solution provided helps?
--Rahul
Hi,
I have not heard back from you, so I will close this thread on my end. If you still have issues, feel free to post a new question. Intel will no longer monitor this thread, but it will remain open for community discussion.