OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

OpenCL: incorrect results on Phi

Dave_O_
Beginner

This kernel runs correctly on the host CPU but produces wrong outputs when run on the Xeon Phi. What's the issue with the Phi here?

 

//KERNEL_SIMPLE
__kernel void Convolve(const __global  float * pInput,
                        __constant float * pFilter,
                        __global  float * pOutput,
                        const int nInWidth,
                        const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    // Accumulate the filter response over an nFilterWidth x nFilterWidth
    // window whose top-left corner in the input image is (xInTopLeft, yInTopLeft).
    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        for (int c = 0; c < nFilterWidth; c++)
        {
            const int idxF  = idxFtmp  + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
        }
    }
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}
//KERNEL_SIMPLE

8 Replies
Yuri_K_Intel
Employee
Hi Dave,

Could you please provide a full reproducer (including the host part) and a short description of the expected results? Then I will be able to check it quickly.

Thanks, Yuri
David_O_
Innovator

Thanks. I sent you a private message with the source, including makefiles. Could someone from Intel please explain to me why the performance of the Xeon Phi is much less than that of the Phi?

 

Also, I forgot: when you run the program, use --help to see the available commands.

Dave_O_
Beginner

*correction: ...than that of the *host*

Yuri_K_Intel
Employee
Well, the "correctness" problem is in the host part of the code. The output buffer (oclBuffers.outputCL) is created with the CL_MEM_USE_HOST_PTR flag. With the CPU OpenCL implementation, the runtime uses exactly the same memory region (the one specified by hostBuffers.pOutputCL), so the results appear in this buffer right away. On Xeon Phi this is not the case: the OpenCL implementation allocates another buffer on the device, and the data has to be brought back to the host. When the CL_MEM_USE_HOST_PTR flag is used, it is sufficient to call clEnqueueMapBuffer and clEnqueueUnmapMemObject on the output buffer. With this modification I get "Passed" on Xeon Phi. Please look at the samples for similar code, for example http://software.intel.com/en-us/vcsource/samples/hdr-tone-mapping.

As for the performance question: this is quite a broad topic, and each application should be analyzed and tuned separately to achieve maximum performance. As a general comment, Xeon Phi is an accelerator device (like a GPU), and a number of factors (workload algorithm, device architecture, working size, etc.) determine whether acceleration is possible. Please use the optimization guide: http://software.intel.com/sites/products/documentation/ioclsdk/2013XE/OG/index.htm.

Thanks, Yuri
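For illustration, a minimal sketch of the readback step described above, assuming the output buffer was created with CL_MEM_USE_HOST_PTR over hostBuffers.pOutputCL; the helper name and the size parameter are illustrative, and error handling is minimal:

#include <CL/cl.h>

/* Illustrative helper: map the output buffer so the runtime copies the Phi
 * results back into the host pointer supplied at clCreateBuffer time, then
 * return ownership of the buffer to the device. */
static cl_int read_back_output(cl_command_queue queue, cl_mem outputCL, size_t outputBytes)
{
    cl_int err = CL_SUCCESS;

    /* Blocking map with CL_MAP_READ: once this returns, the device results
     * are visible through the mapped pointer (which is hostBuffers.pOutputCL
     * when CL_MEM_USE_HOST_PTR was used). */
    float *pMapped = (float *)clEnqueueMapBuffer(queue, outputCL, CL_TRUE,
                                                 CL_MAP_READ, 0, outputBytes,
                                                 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return err;

    /* ... verify/consume the results through pMapped here ... */

    /* Unmap before the buffer is used on the device again. */
    return clEnqueueUnmapMemObject(queue, outputCL, pMapped, 0, NULL, NULL);
}

Because the map is blocking and requests CL_MAP_READ, the runtime guarantees the device results are visible to the host once the call returns.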
Dmitry_K_Intel
Employee

According to the OpenCL spec, paragraph 5.4.2:

clEnqueueMapBuffer, and clEnqueueMapImage act as synchronization points for a region of the buffer object being mapped.

This means that the actual data from the device is transferred to the host only during Map operations and transferred back during Unmap. The actual transfer direction depends on the clEnqueueMapXXXX parameters. If a buffer or image is currently mapped, it is considered owned by the host, and any access to that buffer or image on the device produces undefined results. The same holds in the other direction: if a buffer or image is unmapped, it is considered owned by the device, and any access to the corresponding memory region from the host results in undefined behavior.

 

Dave_O_
Beginner

Yuri, thanks. I will re-check the link. In the meantime, what is the number of compute units used by OpenCL on the Xeon Phi: 60 or 240? That is, how does OpenCL deal with the 4-way hyperthreading on the device?
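One way to check this directly is to ask the runtime itself. A minimal sketch, assuming a cl_device_id for the Phi has already been obtained (e.g. via clGetDeviceIDs with CL_DEVICE_TYPE_ACCELERATOR); the helper name is illustrative:

#include <stdio.h>
#include <CL/cl.h>

/* Illustrative helper: print the number of compute units the OpenCL runtime
 * exposes for a given device. */
static void print_compute_units(cl_device_id device)
{
    cl_uint units = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);
    printf("CL_DEVICE_MAX_COMPUTE_UNITS = %u\n", units);
}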

Dave_O_
Beginner

I have read the documentation again. Map and unmap have better performance than read and write.

Do you agree with the following?

// Initialization:
cldistances = clCreateBuffer(clcontext, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, ...
...

// Calculation loop:
for (i = 0; i < numsteps;...
{
// Enqueuing kernel for execution:
clerror = clEnqueueNDRangeKernel(clqueue, cldistkernel,...

// Enqueuing buffer mapping to read the result from the device
clEnqueueMapBuffer(clqueue, cldistances,...
...
}

and so at what point do I need the clEnqueueUnmapMemObject call?

 

Dave
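A minimal sketch of one possible placement, assuming the names from the snippet above (clqueue, cldistkernel, cldistances) and that the host only reads the mapped results between iterations; the NDRange size and byte count are illustrative and error checks are omitted:

size_t globalsize = npoints;                  /* illustrative NDRange size */
size_t distbytes  = npoints * sizeof(float);  /* illustrative buffer size */

for (int i = 0; i < numsteps; i++)
{
    /* The device owns cldistances while the kernel executes. */
    clerror = clEnqueueNDRangeKernel(clqueue, cldistkernel, 1, NULL,
                                     &globalsize, NULL, 0, NULL, NULL);

    /* Blocking map with CL_MAP_READ: the host now owns the buffer and the
     * device results are visible through the returned pointer. */
    float *p = (float *)clEnqueueMapBuffer(clqueue, cldistances, CL_TRUE,
                                           CL_MAP_READ, 0, distbytes,
                                           0, NULL, NULL, &clerror);

    /* ... consume the distances on the host ... */

    /* Unmap before the next iteration's kernel enqueue, returning ownership
     * of the buffer to the device. */
    clEnqueueUnmapMemObject(clqueue, cldistances, p, 0, NULL, NULL);
}

The key point is that every map is paired with an unmap before the device touches the buffer again.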

Dave_O_
Beginner

Dmitry, thanks!! It works now. I have been digging into the problem, reading the link you sent, and following the Intel sample. The host and accelerator results agree now! Using clEnqueueMapBuffer and clEnqueueUnmapMemObject correctly fixed the problem. I now understand the point you were making about using those calls for synchronization (i.e., ownership of the shared memory between host and accelerator). Also, the Phi now seems slightly better (by a narrow margin) after running a few more benchmarks with larger problem sizes. I will look into the workgroup sizes next (the local work-group size is set to 16 as per the Phi optimization guide) to see if any more performance gain can be achieved.

Thanks buddy! :) 
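For reference, a minimal sketch of requesting an explicit work-group size in clEnqueueNDRangeKernel, assuming a 2D NDRange like the convolution kernel above; the names and sizes here are illustrative, and each global size must be divisible by the corresponding local size:

size_t globalWS[2] = { nOutWidth, nOutHeight };  /* illustrative output dimensions */
size_t localWS[2]  = { 16, 16 };                 /* candidate work-group shape */

cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                    globalWS, localWS, 0, NULL, NULL);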
