
MSimm2 · ‎06-06-2013

http://clbenchmark.com/device-info.jsp?config=15887974

*Cough*... Not so impressive, but I'm guessing it's a beta-version problem(?). Anyway, something looks broken.

If you go to the results page http://clbenchmark.com/result.jsp and untick GPU, you can see that it just beats an i7-3770K.

LLess · ‎06-06-2013

I hope so too because compared with a GeForce GTX Titan it really hurts...

ARNON_P_Intel · ‎06-12-2013

With respect to these benchmark results: as you know, OpenCL provides a low-level programming environment for writing portable code for a diverse mix of platforms and devices. The standard ensures that this portable code will be functionally correct on different devices. However, performance portability is not guaranteed. Specifically, OpenCL code designed for one target device will not necessarily run well on another type of target device unless it is optimized for the underlying hardware. The performance and efficiency gains from this kind of optimization effort can be significant for multicore and many-core applications, and that might be the case here.

In their article “Demonstrating Performance Portability of a Custom OpenCL Data Mining Application to the Intel Xeon Phi Coprocessor”, A. Heinecke et al. showcase how a developer can generate optimal code on the fly with only slight modifications for each target device. See: http://iwocl.org/wp-content/uploads/2013/06/Dmitry.pdf. Their results compare OpenCL on Xeon Phi versus other devices, and they refer you to the code itself.
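One common way to specialize a single kernel source per device at run time is to pass preprocessor definitions in the options string given to clBuildProgram. This is not necessarily the approach taken in the paper. The sketch below only formats such a string. The helper name format_build_options and the macro names WG_SIZE and USE_LOCAL_MEM are hypothetical, and the kernel source would have to test those macros itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats a build-options string such as
 * "-DWG_SIZE=16 -DUSE_LOCAL_MEM=0" into buf.
 * WG_SIZE and USE_LOCAL_MEM are illustrative macro names, not part of OpenCL.
 * Returns 0 on success, -1 if the buffer is too small or formatting fails. */
int format_build_options(char *buf, size_t len,
                         size_t wg_size, int use_local_mem)
{
    assert(buf != NULL && len > 0);

    int n = snprintf(buf, len, "-DWG_SIZE=%zu -DUSE_LOCAL_MEM=%d",
                     wg_size, use_local_mem ? 1 : 0);

    /* snprintf returns the length it wanted to write; n >= len means truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string would be passed as the options argument of clBuildProgram, so the same kernel file can compile into a small-work-group, no-local-memory variant on one device and a different variant on another.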

Arnon

MSimm2 · ‎06-12-2013

Thanks Arnon,

If I optimise my code (http://sourceforge.net/projects/openclsolarsyst) for a Haswell CPU (with AVX2) and use a work-group size of 16, will this go most of the way to optimising for the Xeon Phi?

Other than a lookup table using the device name, is there some way of detecting which optimisations suit the Xeon Phi? E.g. CL_DEVICE_LOCAL_MEM_SIZE reports 32768 and CL_DEVICE_MAX_WORK_GROUP_SIZE reports 1024.
The paper you linked suggests not using local memory and making the work groups small.

Perhaps we need

CL_DEVICE_PREFERRED_WORK_GROUP_SIZE 16

and

CL_DEVICE_PREFERRED_LOCAL_MEM_SIZE 0
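For comparison, OpenCL 1.1 already exposes two related hints: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (via clGetKernelWorkGroupInfo) and CL_DEVICE_LOCAL_MEM_TYPE (via clGetDeviceInfo, returning CL_LOCAL or CL_GLOBAL). The sketch below shows how a host program might turn such values into a tuning choice without a device-name lookup table. The struct and function names, and the cap of 256, are assumptions for illustration only. What the Xeon Phi actually reports for these two queries is not stated in this thread, so the caller has to supply the queried values.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-value copy of properties the host would query beforehand.
 * All type and field names here are hypothetical. */
typedef struct {
    int    local_mem_is_dedicated; /* 1 if CL_DEVICE_LOCAL_MEM_TYPE == CL_LOCAL */
    size_t max_work_group_size;    /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
    size_t preferred_multiple;     /* CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE */
} device_hints;

typedef struct {
    size_t work_group_size;
    int    use_local_memory;
} tuning_choice;

/* Assumed heuristic, not a vendor recommendation:
 * - local memory emulated in global memory: skip it and use one
 *   preferred multiple as the (small) work-group size;
 * - dedicated local memory: use it, with the largest multiple of the
 *   preferred size that fits under an arbitrary cap of 256. */
tuning_choice choose_tuning(device_hints h)
{
    assert(h.max_work_group_size > 0);

    tuning_choice t;
    size_t mult = h.preferred_multiple > 0 ? h.preferred_multiple : 1;

    if (!h.local_mem_is_dedicated) {
        t.use_local_memory = 0;
        t.work_group_size =
            mult < h.max_work_group_size ? mult : h.max_work_group_size;
    } else {
        size_t cap = h.max_work_group_size < 256 ? h.max_work_group_size : 256;
        size_t wg = (cap / mult) * mult;   /* round down to a multiple */
        t.use_local_memory = 1;
        t.work_group_size = wg > 0 ? wg : cap;
    }
    return t;
}
```

With a maximum work-group size of 1024 as reported above, and assuming the device returned a preferred multiple of 16 and CL_GLOBAL, this would pick a work-group size of 16 and no local memory, which matches what the paper suggests.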

Xeon Phi, Opencl and clbenchmark