#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <CL/sycl.hpp>
namespace sycl = cl::sycl;
// Bit-cast helpers (memcpy avoids the strict-aliasing undefined behavior of pointer casts).
unsigned as_uint(const float x) {
    unsigned u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}
float as_float(const unsigned x) {
    float f;
    std::memcpy(&f, &x, sizeof f);
    return f;
}
// Converts a float to its IEEE-754 half-precision bit pattern.
unsigned short float_to_half(const float x) {
    const unsigned b = as_uint(x) + 0x00001000;   // round mantissa
    const unsigned e = (b & 0x7F800000) >> 23;    // exponent
    const unsigned m = b & 0x007FFFFF;            // mantissa
    return (b & 0x80000000) >> 16                                                 // sign
         | (e > 112) * ((((e - 112) << 10) & 0x7C00) | m >> 13)                   // normal
         | ((e < 113) & (e > 101)) * ((((0x007FF000 + m) >> (125 - e)) + 1) >> 1) // subnormal
         | (e > 143) * 0x7FFF;                                                    // saturate
}
int main() {
    sycl::half* v1 = (sycl::half*)malloc(1024 * sizeof(sycl::half));
    sycl::half* v2 = (sycl::half*)malloc(1024 * sizeof(sycl::half));
    sycl::half* out = (sycl::half*)malloc(1024 * sizeof(sycl::half));
    float val;
    for (int i = 0; i < 1024; i++) {
        val = (float)i;
        v1[i] = float_to_half(val);
        v2[i] = float_to_half(val + 1);
        out[i] = 0;
    }
    sycl::gpu_selector device_selector;
    sycl::queue d_queue(device_selector);
    sycl::buffer<sycl::half, 1> a_device(v1, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> b_device(v2, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> c_device(out, sycl::range<1>(1024));
    {
        d_queue.submit([&](sycl::handler& cgh) {
            auto A = a_device.get_access<sycl::access::mode::read>(cgh);
            auto B = b_device.get_access<sycl::access::mode::read>(cgh);
            // note: += below reads C as well, so access::mode::read_write would be more correct
            auto C = c_device.get_access<sycl::access::mode::write>(cgh);
            cgh.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> idx) {
                C[idx] += A[idx] * B[idx];
            });
        });
    }
    for (int i = 0; i < 20; i++) {
        // half cannot be passed through printf's varargs; convert to unsigned first
        printf("%u,%u,%u,\n", (unsigned)v1[i], (unsigned)v2[i], (unsigned)out[i]);
    }
    free(v1);
    free(v2);
    free(out);
    return 0;
}
out is all zeros; I don't think the output is being calculated correctly.
If I use "float" instead of half, I get the correct values.
Best regards,
Hi,
The zeros you see in the output are due to the device part of the code not being synchronized with the host part.
I can see that you have not enclosed the buffer declarations within the scope (the DPC++ scope); in your code, the scope starts after the buffers are declared. As a result, the program does not wait for the device kernel to execute.
In other words, the program skips past the device kernel execution and proceeds with the rest of the program (printing the values).
When you enclose the buffers within the scope, program execution is bound to wait until the buffer destructors are called (since the buffers are declared within the scope). Once the device kernel execution completes, the data is copied back from device to host implicitly and the buffer destructors run.
sycl::gpu_selector device_selector;
sycl::queue d_queue(device_selector);
{
    sycl::buffer<sycl::half, 1> a_device(v1, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> b_device(v2, sycl::range<1>(1024));
    sycl::buffer<sycl::half, 1> c_device(out, sycl::range<1>(1024));
    d_queue.submit([&](sycl::handler& cgh) {
        auto A = a_device.get_access<sycl::access::mode::read>(cgh);
        auto B = b_device.get_access<sycl::access::mode::read>(cgh);
        auto C = c_device.get_access<sycl::access::mode::write>(cgh);
        cgh.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> idx) {
            C[idx] += A[idx] * B[idx];
        });
    });
}
Let me know if you face any issue.
Note: In some cases the device kernel runs extremely fast (depending on the size/type of the application). In such cases, synchronization might happen even if the buffers aren't enclosed within the scope. However, it is good practice to enclose the buffers within a scope.
--Rahul
Hello,
Thank you for your answer.
I fixed my code according to your advice, and now I get a calculation result.
But I think the result is incorrect.
For example, performing the following calculation with parallel_for:
C[idx] = A[idx] * B[idx];
<expected value>
out = v1 * v2
v1, v2, out
0, 1, 0
1, 2, 2 (half:29728)
2, 3, 6 (half:29808)
3, 4, 12 (half:29864)
However, the calculation result is as follows.
v1, v2, out(half)
0, 1, 0
1, 2, 31744
2, 3, 31744
3, 4, 31744
...
What am I doing wrong?
Please advise me.
Hi,
I'd suggest comparing the results of the serial version of your code with its DPC++ version.
Perform the serial computation on the CPU (a simple C++ loop should do) and compare its results with the parallel (DPC++) computation.
Let me know the results after comparison.
Thanks,
Rahul
Hi,
Since the results are similar for both the serial and parallel versions, I suspect the issue is with the float_to_half(const float x) function.
The "half" type is part of the OpenCL/SYCL specification; it is not present in standard C++. A float occupies 4 bytes, whereas a half occupies only 2. For more information on the half type, kindly refer to the OpenCL/SYCL specifications.
I don't see the point of the float_to_half(const float x) function, because you can use the half type directly, which already gives you half-precision floats. My suggestion would be to try the computation without calling this function (on the host) and see if it gives the right results.
As for speed-up, there could be some performance improvement with half (logically speaking). However, I'd suggest you modify your code to print the time taken for the computation in both cases and let me know.
Regards,
Rahul
Hi,
Could you kindly confirm if the solution provided helps?
--Rahul
Hi,
I have not heard back from you, so I will close this thread on my end. If you still have issues, feel free to post a new question. Intel will no longer monitor this thread, but it will remain open for community discussion.