While trying to develop a standalone reproduction for a prior question, I noticed that offline compilation seems to behave differently for CPU and GPU. Per the OpenCL spec, my understanding is that I should be able to reuse compiled kernels (produced either through ioc32/64 or through clCreateProgramWithSource/clBuildProgram). When using a GPU device I can load a precompiled kernel through clCreateProgramWithBinary and it is ready to use. The CPU device, however, requires me to call clBuildProgram yet again, which from a performance standpoint defeats the purpose of precompiling my kernels.
I've attached an MSVC 2012 project to reproduce what I'm seeing. Under the release directory are some precompiled kernels that I generated using ioc32. The executable prints its usage when run with no arguments. The only thing it doesn't mention is that it checks the file extension to determine whether the input is a binary file. If the file name doesn't end in .bin, it assumes it's a text .cl file.
offlineCompileBug.exe CPU Template.cl 0 - Fails (Expected)
offlineCompileBug.exe CPU Template.cl 1 - Succeeds (Expected)
offlineCompileBug.exe GPU Template.gpu.bin 0 - Succeeds (Expected)
offlineCompileBug.exe GPU Template.gpu.bin 1 - Succeeds (Not expected, why does compiling twice work?)
offlineCompileBug.exe CPU Template.cpu.bin 0 - Fails (Unexpected, and I believe is a problem)
offlineCompileBug.exe CPU Template.cpu.bin 1 - Succeeds (Not expected, why does compiling twice work?)
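For anyone reading along without the attachment, the extension check described above could look roughly like the following. This is only a sketch, not the code from the attached project; the function name is_binary_kernel_file is made up for illustration, and the comparison is assumed to be case-sensitive.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not taken from the attached project).
   Returns 1 if `path` ends in ".bin" and should be treated as a
   precompiled binary, 0 if it should be treated as OpenCL C source
   text. The comparison is case-sensitive. */
static int is_binary_kernel_file(const char *path)
{
    static const char ext[] = ".bin";
    const size_t ext_len = sizeof(ext) - 1;
    size_t path_len;

    if (path == NULL)
        return 0;
    path_len = strlen(path);
    if (path_len < ext_len)
        return 0;
    /* Compare only the last ext_len characters of the path. */
    return strcmp(path + path_len - ext_len, ext) == 0;
}
```

With this check, Template.cpu.bin and Template.gpu.bin are loaded as binaries, while Template.cl is read as source text.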
Funny that I just created this post:
In general, clCreateProgramWithBinary should probably be followed by clBuildProgram, since the binary could be SPIR, in which case it is not fully built. In our GPU case, when you fully prebuild the binary (generate .ir), clBuildProgram does not do much. It is basically a no-op, as evidenced by the build log, which will be empty. In the case of the CPU binary, some linking is still involved at the clBuildProgram step, but the compilation step is saved. I will ask the CPU device team whether that is necessary.
Doubly funny, just read it yesterday and was wishing that I had that article two weeks ago. Nicely written and much needed since good SPIR examples are a bit sparse.
If it helps, the goal is to cache my kernel compilations so I only need to compile the first time my software runs. clBuildProgram is accounting for roughly half of my execution time, so it would be great if the CPU team knows how to avoid it.
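To make the caching idea concrete, here is a minimal sketch of the file side of such a cache. It is an illustration under assumptions, not my actual code: save_kernel_binary and load_kernel_binary are hypothetical names, and the OpenCL calls are left out so the snippet compiles without an SDK. On the first run the bytes would come from clGetProgramInfo with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES; on later runs the loaded buffer would be passed to clCreateProgramWithBinary and then to clBuildProgram.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes `size` bytes of a device binary to `path`.
   Returns 0 on success, -1 on failure. */
static int save_kernel_binary(const char *path,
                              const unsigned char *data, size_t size)
{
    FILE *f = fopen(path, "wb");
    size_t written;

    if (f == NULL)
        return -1;
    written = fwrite(data, 1, size, f);
    if (fclose(f) != 0)
        return -1;
    return written == size ? 0 : -1;
}

/* Hypothetical helper: reads the whole cache file into a malloc'd buffer.
   Returns NULL on a cache miss (missing, empty, or unreadable file).
   On success stores the byte count in *size_out; the caller frees
   the returned buffer. */
static unsigned char *load_kernel_binary(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    unsigned char *buf = NULL;
    long len = 0;

    if (f == NULL)
        return NULL;
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        buf = (unsigned char *)malloc((size_t)len);
        if (buf != NULL && fread(buf, 1, (size_t)len, f) != (size_t)len) {
            free(buf);
            buf = NULL;
        }
    }
    fclose(f);
    if (buf != NULL)
        *size_out = (size_t)len;
    return buf;
}
```

A real cache would presumably also need to key the file name on the device and driver version, since a binary built for one device is not guaranteed to load on another.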
I checked with our standards and driver folks: the right thing to do is to always follow clCreateProgramWithBinary with clBuildProgram. They think the current behavior on the GPU is actually a bug.
Thanks, I'll adjust my code accordingly.
Any word though on what to do with the long CPU kernel build times? GPU, even in the presence of said bug, generates the right results quickly. Why can't I do that with CPU, or is there something I'm missing?