Hello, I'm having a problem building one of my OpenCL kernels. I'm trying to build for Intel HD Graphics 4000 on a 64-bit Windows machine, driver version 10.18.10.5069 (the latest I can find). I'm building the solution with Visual Studio 2019 and using the cl2.hpp wrapper from the Khronos page. When I call cl::Program::build, the (host) memory usage of the program increases greatly, to over 3 gigabytes. After several seconds of this, the build fails. The build log ends with the following:
fcl build 1 succeeded. fcl build 2 succeeded. Error: internal error.
This code was building just fine on a different Windows machine using a different Intel card, but after moving to this machine it fails as described. The kernel code compiles without any problems, it just doesn't build. After deleting and changing certain aspects of the code, I can get it to build, but it still uses way too much memory. Could you guys tell me what I'm doing wrong or if there's some sort of bug?
I will attach code that has around 15 lines repeated about a hundred times, which reproduces the issue (although my original kernel doesn't repeat 15 lines a hundred times), along with the C++ code I'm using to load and build the kernel. Also worth noting: I'm not linking against the Intel SDK files, but downloaded and compiled my own ICD loader from the Khronos repository.
Thanks for sending the info and a representative reproducer.
- Driver 10.18.10.5069 and that HW model use driver/implementation branches that have been superseded by the NEO implementation. Client hardware models from 2012 have limited support on these branches. If Linux is a possibility, it may be worth trying the Beignet implementation. It's possible that defects for this branch cannot be triaged.
- Intel's SDK has a utility called ioc64 that can give kernel compile pass feedback through the command line. It may be useful for you.
- I doubt the ICD Loader library from Khronos makes a difference here.
I took a quick look at the code, and nothing immediately jumps out... I'll try it on a Skylake-based system. I'm not aware of any length restrictions on kernels, but FWIW this is longer than most code people ask us to review.
What system did it work fine on? Can you describe that configuration?
I tried this reproducer on i5-6770HQ graphics. It uses the NEO implementation branch on Windows 10 with MSVS 2017. I didn't observe any compilation issues with your reproducer.
Since I don't have access to the legacy system, I'll pass the kernel on to the development team to see if they have any feedback. Fortunately, the error strings you provided may serve as useful hints. Unfortunately, it may prove difficult to root cause or triage issues with the legacy configuration. We'll see.
On previous speculation:
- Out of curiosity, did you use a recent ICD Loader library? I still don't think this should have an effect.
- Also: What system did it work fine on? Can you describe that configuration?
It may be useful to pass in explicit build options to ensure an apples-to-apples comparison between the two systems... see the -cl-std option: https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/clBuildProgram.html
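As a rough illustration of the build-options suggestion, here is a minimal sketch of assembling the options string. The helper name `join_build_options` is hypothetical (it is not part of cl2.hpp), and the choice of `-cl-std=CL1.2` is only an assumption about which language version you would want to pin on both machines. The commented usage relies on `cl::Program::build` accepting an options string, and is not compiled here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: joins individual build options into the single
// space-separated string that clBuildProgram / cl::Program::build expects.
std::string join_build_options(const std::vector<std::string>& opts) {
    std::string out;
    for (const auto& opt : opts) {
        if (!out.empty()) {
            out += ' ';  // options are separated by single spaces
        }
        out += opt;
    }
    return out;
}

// Sketch of the intended use with the cl2.hpp wrapper (assumes `program`
// is an existing cl::Program; not compiled as part of this example):
//
//   const std::string options = join_build_options({"-cl-std=CL1.2"});
//   cl_int err = program.build(options.c_str());
```

Using the same options string on both machines removes one variable when comparing a build that works against one that fails.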
Intel HD graphics 4000 is a bit old and had some limitations, specifically:
- It doesn't support native 64-bit arithmetic and needs to emulate 64-bit (e.g. "long") operations using 32-bit arithmetic. As I recall, emulating 64-bit division was especially non-trivial and results in a fair amount of code.
- It doesn't support "unstructured control flow". This makes it difficult to compile some programs with complicated control flow. In some cases, compound boolean expressions with short-circuited control flow would cause some control flow blocks to be replicated.
I suspect that these two limitations taken together cause your program to grow quite large, and the compiler eventually runs out of memory, generating the "internal error".
Some suggestions to work around these limitations:
- Can you switch any compound boolean expressions to use bitwise operations ("&") instead of boolean operations ("&&")? The bitwise operations do not short-circuit and hence simplify control flow. I don't think this will be enough on its own, but it's a very easy first thing to try.
- Can you restructure your algorithm to use 32-bit arithmetic instead of 64-bit arithmetic, or at least to avoid 64-bit division? Note, division by a power-of-two is OK, since the compiler will turn these into bit shifts.
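To make the two suggestions concrete, here is a small sketch written as plain C++ so it can be compiled on the host; the same expressions carry over to scalar code in an OpenCL C kernel. All function names and the bounds-check scenario are hypothetical, since the original kernel isn't shown here. The bitwise rewrite is only safe when every operand is a 0/1 comparison result with no side effects and is safe to evaluate unconditionally.

```cpp
#include <cassert>
#include <cstdint>

// Short-circuit form: each && may become its own conditional branch,
// which is what can cause control flow blocks to be replicated.
inline bool in_box_logical(int x, int y, int w, int h) {
    return x >= 0 && x < w && y >= 0 && y < h;
}

// Bitwise form: every comparison is always evaluated and the 0/1 results
// are combined with &, so no branches are needed for the expression.
inline bool in_box_bitwise(int x, int y, int w, int h) {
    return (x >= 0) & (x < w) & (y >= 0) & (y < h);
}

// 64-bit division by a power of two (16 here) written as a shift.
// Unsigned operands keep the shift and the division exactly equivalent.
inline std::uint64_t div_by_16(std::uint64_t v) {
    return v >> 4;
}

// Index math kept in 32 bits instead of 64 bits.
// Assumes width != 0 and that the flat index fits in 32 bits.
inline void split_index_32(std::uint32_t idx, std::uint32_t width,
                           std::uint32_t& row, std::uint32_t& col) {
    row = idx / width;  // 32-bit division, no 64-bit emulation needed
    col = idx % width;
}
```

Both `in_box_` variants return the same result for any inputs; the difference is only in how much control flow the compiler has to generate.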
Hope this helps!
I hope BenA's comment can get you through the compilation step. If possible, can you share any results, and where you got your driver, in this thread?
Thanks for all the replies!
MichaelC: sorry, before I said that the driver was "the latest I can find" but I now realize it's not, and I'm not sure where the driver is from.
The previous system it worked on was also 64-bit Windows 10, with Intel HD Graphics 520, driver version 220.127.116.1173.
Also, yes, the ICD loader was the most recent one as of a few months ago at most.
I will try to compile using your suggestions, and report what I find.
Yay! After trying BenA's two simple fixes my original kernel and the reproducer seem to build and run just fine, using a reasonable amount of memory.
Changing boolean to bitwise operators seemed to help it the most, but using both suggestions fixed it completely, thanks!