I am running code that uses your latest implementation of the Intel OpenCL SDK on Linux, on both Xeon and Xeon Phi, and I am profiling it with Intel VTune Amplifier. According to the analysis, the main limiting factor is some unidentified TBB scheduling issue. I would expect the kernels ([Dynamic Code]) to be the main hotspots instead. I would like to know why so much execution time is spent in the highlighted parts, which are library calls rather than my own code. Because VTune Amplifier provides no stack trace for them, I cannot optimize the code further: I cannot identify what leads to those TBB scheduling functions being called.
I observe the same behaviour on both Xeon and Xeon Phi, but on Xeon Phi it escalates significantly. In the case of Xeon Phi, does the analysis result mean that the host CPU is limiting performance? The difference grows with the number of workgroups. According to the OpenCL Optimization Guide this might indicate issues with workgroup scheduling, but I let the OpenCL kernel compiler pick the local workgroup size when I call clEnqueueNDRangeKernel. In fact, when I specify the local workgroup size explicitly, I get a 10-15% performance boost, but the issue described above remains by far the main limiting factor.
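To illustrate what specifying the local workgroup size explicitly involves, here is a minimal sketch in C. It is not taken from the original code: the helper names `round_up_global` and `num_work_groups`, the local size of 64, and the `queue`, `kernel`, and `n` placeholders are all assumptions for illustration. It relies on the OpenCL 1.x rule that the global work size must be a multiple of the local work size when the latter is passed explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Round the global work size up to the next multiple of the local size,
 * as OpenCL 1.x requires when local_work_size is passed explicitly. */
static size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    return ((global + local - 1) / local) * local;
}

/* Number of workgroups the runtime has to schedule in one dimension. */
static size_t num_work_groups(size_t global, size_t local)
{
    return round_up_global(global, local) / local;
}

/* Hypothetical usage (queue, kernel and n are placeholders; error
 * handling omitted):
 *
 *   size_t local  = 64;   // assumed value, should be tuned per device
 *   size_t global = round_up_global(n, local);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &global, &local, 0, NULL, NULL);
 *
 * Because global may exceed n, the kernel has to skip work-items
 * whose get_global_id(0) >= n.
 */
```

Since the scheduling overhead is reported to grow with the number of workgroups, computing that count for a few candidate local sizes may help when comparing profiler runs.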
I do not observe anything resembling this issue on Windows with Core i7 processors. Overall, due to the above issue, the application performs better on an Intel(R) Xeon(R) CPU E5-2665 than on a Xeon Phi 5110P. Hyperthreading is turned off on the Xeon. The analysis shows that all cores are used on both platforms.
I would like to ask for guidance with this issue. I attach the screenshots from the profiler. VTune analysis: Basic Hotspots for Xeon, General Exploration for Xeon Phi.
Thanks for this report. Would it be possible to send a short piece of code illustrating the problem? I suspect that part of the issue may be in setting up the algorithm in a way that can be efficiently scheduled on the larger number of cores. From our end, I'll check if there are any examples showcasing optimizations for Xeon Phi.