OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

GTexels/s of the HD 4000



I am checking out the Intel HD 4000 (Ivy Bridge GT2, integrated in a Core i5-3320M) with OpenCL.

The test kernel samples a unified 1024x1250 CL_R, CL_FLOAT image2D 262144x1250 times with CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR.
The sampling coordinates follow a curve. Calculating the curve requires only a few add and mul operations. The inner loop is unrolled and contains 4 read_imagef() calls.
The kernel writes 262144 floats as its result into a global buffer.

The kernel is called several times in a test loop. Every kernel call works on different input data (image2D content).

GPU-Z 0.6.8 reports a texture fillrate of 2.6 GTexels/s for the HD 4000.
According to GPU-Z, the GPU frequency is up to 1200 MHz after the 10th kernel call.

(a) If the image2D becomes wider than 1536 pixels, then the GTexels/s drop significantly below the 2.6 GTexels/s.
(b) With the image2D dimensions given above, the result is more than 2.6 GTexels/s.

I lack an official specification of the HD 4000 from Intel. I just found

May I ask for an explanation of how to calculate the maximum GTexels/s for the image2D format specified above?

From my understanding, (b) indicates a higher GTexels/s than reported by GPU-Z. But under which conditions?
Any hint on how I could avoid the drop of GTexels/s seen in (a)?



Not sure what dimensions you are using for the kernels, or whether you are tiling.

But I am pretty sure that, depending on the image size and the local work-item sizes, you will see varying results due to the cache access patterns for the texture.

I have parameterized some of my code before to be able to try different local work-item sizes and let the code find the best one for a given kernel. You might want to try something like that too.



I agree with Laurent. You can use the KernelBuilder's analyze feature to "guess" the best global and local sizes, instead of modifying the host code. Hope that helps.



Hi Laurent and Raghu,

Thanks for your replies. Yes, by testing I found the best-fit global and local sizes (together with loop unrolling and local memory).

I upgraded last week to OpenCL 1.2 with the SDK 3.0 and driver (Graphic Driver

I found that with the (old) driver supporting OpenCL 1.1, linear interpolation within a float image2D was only possible with normalized coordinates. But my algorithm uses the image2D like an image1D_array. Hence, hitting a row of the array is not exact, as the row has to be expressed as a normalized fraction with inherent limited precision. Instead of the desired linear interpolation I got a bilinear interpolation, as I have to assume from further studies:

The driver supporting OpenCL 1.2 supports non-normalized coordinates for linear interpolation. Now the algorithm can hit the row perfectly (i.e. y.5f in the float2 coordinate). To my surprise I got a 50% performance improvement! Hence, for such *.5f addressing, apparently only a linear interpolation is performed by the driver / GPU (?), instead of a bilinear one.

Utilizing the image1D_array available in OpenCL 1.2 brought another surprise: it shows the same (slower) performance as the bilinear interpolation within an image2D. Hence I conclude that either I have not set up the array or the coordinates correctly, or the image1D_array is internally mapped to a 2D normalized image.

I thought an image1D_array would load less data into the texture cache, but that thought seems to be wrong. I now assume that a texture fetch loads the same amount in both setups, but as a 1D line instead of a 2D "area". As I cannot walk linearly through the cache, I have to accept wasted bandwidth here.

So I tried CL_HALF_FLOAT, CL_R, expecting a gain in the texture cache due to the smaller memory footprint. But the performance improvement was negligible.



PS: With the OpenCL 1.1 driver, I could already access the result from the HD 4000 in host memory without clEnqueueReadBuffer(). The OpenCL 1.2 driver now seems to create device memory, even though the HD 4000 has no dedicated device memory, and the result has to be read into host memory at the end, following the general pattern.


Just as an update: the HD 4000 documentation is public at