This kernel runs on host cpu but produces wrong outputs when run Phi. What's the issue witht the Phi here?
__kernel void Convolve(const __global float * pInput,
__constant float * pFilter,
__global float * pOutput,
const int nInWidth,
const int nFilterWidth)
const int nWidth = get_global_size(0);
const int xOut = get_global_id(0);
const int yOut = get_global_id(1);
const int xInTopLeft = xOut;
const int yInTopLeft = yOut;
float sum = 0;
for (int r = 0; r < nFilterWidth; r++)
const int idxFtmp = r * nFilterWidth;
const int yIn = yInTopLeft + r;
const int idxIntmp = yIn * nInWidth + xInTopLeft;
for (int c = 0; c < nFilterWidth; c++)
const int idxF = idxFtmp + c;
const int idxIn = idxIntmp + c;
sum += pFilter[idxF]*pInput[idxIn];
const int idxOut = yOut * nWidth + xOut;
pOutput[idxOut] = sum;
Thanks. I sent you a private message with the source including makefiles. Anyone from intel could please explain to me why the performance of the xeon phi is much less than that of the phi.
also I forgot. when you run the program, use --help to see commands.
According to OpenCL spec paragraph 5.4.2:
clEnqueueMapBuffer, and clEnqueueMapImage act as synchronization points for a region of the buffer object being mapped.
This means that actual data from device is transferred to the host only during Map operations and transfered back during Unmap. The actual data transfer direction depends on clEnqueueMapXXXX parameters. If buffer or image is currently mapped it is considered owned by host and any access to this buffer or image on device produces undefined results. The same is valid for unmapped buffers or images - if buffer or image is unmapped it is considered owned by device and any access to apropriate memory region from the host results in undefined behavior.
Yuri, thanks. I will re-check the link. For the meantime, what is the Number of Compute Units used by opencl on xeon phi? 60 or 240? That is, how does opencl deal with the 4-way hyperthreading on the device?
I have read the documentation again. Map and unmap has better performance than read and write.
Do you agree with the following
cldistances = clCreateBuffer(clcontext, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, ...
// Calculation loop:
for (i = 0; i < numsteps;...
// Enqueing kernel for execution:
clerror = clEnqueueNDRangeKernel(clqueue, cldistkernel,...
// Enqueing buffer mapping to read result from device
and so at what point do I need the clEnqueueUnMapBuffer?
Dmittry! thanks!! It works now. Have been digging into the problem and reading the link you sent and followed the intel sample. The host and accelerator results are okay now! Using ClEnqueueMapBuffer and ClEnqueueUnMapBuffer corretly fixed the problem. I know understand the point you were trying to make about using those commands for synchronization (or shared memory authority b/w host and accelerator). Also, the Phi now seems slightly better (by a narrow margin) after running a few more benchmarks with larger problem sizes. I will look into the workgorup sizes now (local work group set to 16 as per phi optimization guide) to see if any more performance gain could be achieved.
Thanks buddy! :)