Solved: Opencl: incorrect results on phi

Dave_O_ · ‎04-01-2014

This kernel runs on host cpu but produces wrong outputs when run Phi. What's the issue witht the Phi here?

//KERNEL_SIMPLE
__kernel void Convolve(const __global float * pInput,
__constant float * pFilter,
__global float * pOutput,
const int nInWidth,
const int nFilterWidth)
{
const int nWidth = get_global_size(0);

const int xOut = get_global_id(0);
const int yOut = get_global_id(1);

const int xInTopLeft = xOut;
const int yInTopLeft = yOut;

float sum = 0;
for (int r = 0; r < nFilterWidth; r++)
{
const int idxFtmp = r * nFilterWidth;

const int yIn = yInTopLeft + r;
const int idxIntmp = yIn * nInWidth + xInTopLeft;

for (int c = 0; c < nFilterWidth; c++)
{
const int idxF = idxFtmp + c;
const int idxIn = idxIntmp + c;
sum += pFilter[idxF]*pInput[idxIn];
}
}
const int idxOut = yOut * nWidth + xOut;
pOutput[idxOut] = sum;
}
//KERNEL_SIMPLE

Yuri_K_Intel · ‎04-03-2014

Well, the "correctness" problem is in the host part of the code. The output buffer (oclBuffers.outputCL) is created with CL_MEM_USE_HOST_PTR flag. And in case of CPU OpenCL implementation uses exactly the same memory region (specified by hostBuffers.pOutputCL) - so we have results in this buffer right away. But in case of Xeon Phi it is not so - OpenCL implementation allocates another buffer on the device and we should get the data back to host. In this case (when CL_MEM_USE_HOST_PTR flag is used) it's sufficient to call clEnqueueMapBuffer and clEnqueueUnmapMemObject functions for the output buffer. With this modification I get "Passed" on Xeon Phi. Please, look at some samples for the similar code, for example - http://software.intel.com/en-us/vcsource/samples/hdr-tone-mapping. As for the performance question... This is quite a broad topic and each application should be analyzed/tuned separately to achieve maximum performance. Here is just general comment. Xeon Phi is an accelerator device (like a GPU) and there are a number of factors (workload algorithm, device architecture, the working size, etc) which determine if the acceleration is possible or not. Please, use optimization guide http://software.intel.com/sites/products/documentation/ioclsdk/2013XE/OG/index.htm. Thanks, Yuri

View solution in original post

Yuri_K_Intel · ‎04-01-2014

Hi Dave, Could you please provide a full reproducer (including the host part)? And the short description of expected results. So I will be able to check it quickly. Thanks, Yuri

David_O_ · ‎04-02-2014

Thanks. I sent you a private message with the source including makefiles. Anyone from intel could please explain to me why the performance of the xeon phi is much less than that of the phi.

also I forgot. when you run the program, use --help to see commands.

Dave_O_ · ‎04-02-2014

*correction: ...than that of the *host*

Yuri_K_Intel · ‎04-03-2014

Well, the "correctness" problem is in the host part of the code. The output buffer (oclBuffers.outputCL) is created with CL_MEM_USE_HOST_PTR flag. And in case of CPU OpenCL implementation uses exactly the same memory region (specified by hostBuffers.pOutputCL) - so we have results in this buffer right away. But in case of Xeon Phi it is not so - OpenCL implementation allocates another buffer on the device and we should get the data back to host. In this case (when CL_MEM_USE_HOST_PTR flag is used) it's sufficient to call clEnqueueMapBuffer and clEnqueueUnmapMemObject functions for the output buffer. With this modification I get "Passed" on Xeon Phi. Please, look at some samples for the similar code, for example - http://software.intel.com/en-us/vcsource/samples/hdr-tone-mapping. As for the performance question... This is quite a broad topic and each application should be analyzed/tuned separately to achieve maximum performance. Here is just general comment. Xeon Phi is an accelerator device (like a GPU) and there are a number of factors (workload algorithm, device architecture, the working size, etc) which determine if the acceleration is possible or not. Please, use optimization guide http://software.intel.com/sites/products/documentation/ioclsdk/2013XE/OG/index.htm. Thanks, Yuri

Dmitry_K_Intel · ‎04-03-2014

According to OpenCL spec paragraph 5.4.2:

clEnqueueMapBuffer, and clEnqueueMapImage act as synchronization points for a region of the buffer object being mapped.

This means that actual data from device is transferred to the host only during Map operations and transfered back during Unmap. The actual data transfer direction depends on clEnqueueMapXXXX parameters. If buffer or image is currently mapped it is considered owned by host and any access to this buffer or image on device produces undefined results. The same is valid for unmapped buffers or images - if buffer or image is unmapped it is considered owned by device and any access to apropriate memory region from the host results in undefined behavior.

Dave_O_ · ‎04-03-2014

Yuri, thanks. I will re-check the link. For the meantime, what is the Number of Compute Units used by opencl on xeon phi? 60 or 240? That is, how does opencl deal with the 4-way hyperthreading on the device?

Dave_O_ · ‎04-03-2014

I have read the documentation again. Map and unmap has better performance than read and write.

Do you agree with the following

// Intialization:
cldistances = clCreateBuffer(clcontext, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, ...
...

// Calculation loop:
for (i = 0; i < numsteps;...
{
// Enqueing kernel for execution:
clerror = clEnqueueNDRangeKernel(clqueue, cldistkernel,...

// Enqueing buffer mapping to read result from device
clEnqueueMapBuffer(clqueue, cldistances,...
...
}

and so at what point do I need the clEnqueueUnMapBuffer?

Dave

Dave_O_ · ‎04-03-2014

Dmittry! thanks!! It works now. Have been digging into the problem and reading the link you sent and followed the intel sample. The host and accelerator results are okay now! Using ClEnqueueMapBuffer and ClEnqueueUnMapBuffer corretly fixed the problem. I know understand the point you were trying to make about using those commands for synchronization (or shared memory authority b/w host and accelerator). Also, the Phi now seems slightly better (by a narrow margin) after running a few more benchmarks with larger problem sizes. I will look into the workgorup sizes now (local work group set to 16 as per phi optimization guide) to see if any more performance gain could be achieved.

Thanks buddy! :)