I have an Intel HD 4600 GPU and noticed some performance discrepancies when running a microbenchmark with a significant number of loop iterations for built-in math functions (arithmetic operators are fine). The results are compared against results from running the microbenchmark on the CPU, and from running the standard C math functions in a loop (vectorisation and optimizations are avoided). So my question is this: is there a large loop or math-function overhead when executing a kernel on an Intel HD GPU?
Could you please clarify what you mean by performance discrepancies? What did you expect and why? What did you actually get?
Built-in math functions can be fairly expensive. What you can try is using their native equivalents, e.g. native_sqrt instead of sqrt. Typically this involves relaxed precision (not IEEE 754 compliant). You can also try the -cl-fast-relaxed-math build flag to improve performance.
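To illustrate, a hypothetical kernel fragment using the native variant (kernel and argument names are made up for the example) might look like this:

```c
// Hypothetical OpenCL C kernel: native_* variants trade IEEE 754
// accuracy for speed on hardware that supports a fast approximation.
__kernel void sqrt_bench(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = native_sqrt(in[gid]);   // instead of sqrt(in[gid])
}
```

Building the program with -cl-fast-relaxed-math can have a similar effect, as it permits the compiler to substitute faster, less precise implementations for the standard built-ins.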
You also need to be careful how you distribute work to the GPU. Big loops with expensive math functions inside can make software threads run for a long time, making GPU utilization less than ideal: instead of the fairly fast ramp-down at the end of the kernel run that you want, the kernel will have a fairly long tail. To achieve good GPU utilization, you want a relatively fast ramp-up, a fast ramp-down, and a fairly long run at full capacity.
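As a sketch of what spreading such a loop across work-items might look like (a hypothetical kernel; names and the iteration scheme are assumptions, not your benchmark):

```c
// Hypothetical: instead of one work-item running all N iterations,
// each of G work-items runs N/G iterations on its own element,
// so no single software thread has a long tail.
__kernel void bench_chunked(__global const float *in,
                            __global float *out,
                            const int iters_per_item)
{
    int gid = get_global_id(0);
    float acc = in[gid];               // one global read up front
    for (int i = 0; i < iters_per_item; ++i)
        acc = sqrt(acc + 1.0f);        // expensive built-in under test
    out[gid] = acc;                    // one global write at the end
}
```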
Thanks for replying. In this case, work distribution isn't too much of a concern, as there are certain criteria that need to be fulfilled for the microbenchmark, where a loop is executed within a single thread on the OpenCL device and compared against a sequential loop executing a standard math function. For arithmetic operators, the results are as expected, though when using certain math functions that don't have native equivalents, there's a definite slowdown on the Intel GPU, for example with an input size of 1000000 for float data types:
While big loops are understandably expensive, I doubt the loop overhead alone could cause such a major discrepancy. Hence I wanted to know whether certain built-in math functions are expensive when executed on the Intel HD GPU. If anything, I am simply curious about that matter, as I was running some predefined microbenchmark tests (hence I don't have control over optimizing the test).
Couple of things:
1. Could you please list the graphics driver version where you are observing this behavior?
2. Could you try replacing fmod(x,y) with x - y * trunc(x/y) and see what performance you get? (It is close enough, but much faster.)
3. Could you please try to measure the performance with -cl-fast-relaxed-math?
1. The driver version is 10.18.15.4256
2. I've already done so and there's an obvious increase in performance (40.00 ms), though that means that the sequential benchmark would need to reflect the same implementation, which sees an increase in performance too. The scaling ends up being very similar at that point.
3. If I use the build option with fmod, the performance is similar to x - y * trunc(x/y). However, if the formula approach is used in conjunction with the build option, no further performance gains are observed. Unfortunately, the tool is not meant to have any math optimizations enabled.
4. The kernel is straightforward: take a value (which represents the number of loop iterations) and repeat a math operation on a private variable, after which the variable is written to the output. All of this is done on a single thread. Global memory access is already limited (the input value from global memory is assigned to a private variable before the loop starts).
Though right now, I have reason to believe that the issue could be driver-related, as I don't see any such problems on the CPU or on an Nvidia GPU.