GTexels/s of the HD 4000

Stephan1 · 03-25-2013

Hi

I am testing the Intel HD 4000 (Ivy Bridge GT2, integrated in the i5-3320M) with OpenCL.

The test kernel samples a unified 1024x1250 image2D (CL_R, CL_FLOAT) 262144x1250 times, using a sampler with CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR.
The sampling coordinates follow a curve. Calculating the curve takes only a few add and mul operations. The inner loop is unrolled and contains 4 read_imagef() calls.
The kernel writes 262144 floats as its result to a global buffer.

The kernel is called several times in a test loop. Every kernel call works on different input data (image2D content).

GPU-Z 0.6.8 reports a texture fillrate of 2.6 GTexels/s for the HD 4000.
According to GPU-Z, the GPU frequency reaches up to 1200 MHz after the 10th kernel call.

(a) If the image2D becomes wider than 1536 pixels, the rate drops significantly below 2.6 GTexels/s.
(b) With the image2D as given above, the result is more than 2.6 GTexels/s.
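For readers who want to reproduce the rates in (a) and (b), here is a minimal sketch of the arithmetic. The sample counts come from the post; the elapsed time is a made-up placeholder, and the factor of 4 texels per sample is an assumption about bilinear filtering, not a measured property of the HD 4000. It is also not known from the thread whether GPU-Z counts filtered samples or individual texels, which may be part of why (b) exceeds the reported figure.

```python
# Numbers from the post; the elapsed time below is a hypothetical placeholder.
OUTPUTS_PER_CALL = 262144    # floats written per kernel call
SAMPLES_PER_OUTPUT = 1250    # read_imagef() calls per output value


def gsamples_per_second(elapsed_s: float) -> float:
    """Filtered samples (read_imagef calls) per second, in units of 1e9."""
    samples = OUTPUTS_PER_CALL * SAMPLES_PER_OUTPUT
    return samples / elapsed_s / 1e9


def gtexels_touched_per_second(elapsed_s: float, texels_per_sample: int = 4) -> float:
    """Texels touched per second if every sample blends texels_per_sample texels.

    texels_per_sample=4 assumes a full bilinear footprint (assumption).
    """
    return gsamples_per_second(elapsed_s) * texels_per_sample


hypothetical_elapsed_s = 0.100  # replace with the measured time of one kernel call
print("GSamples/s:", gsamples_per_second(hypothetical_elapsed_s))
print("GTexels/s touched:", gtexels_touched_per_second(hypothetical_elapsed_s))
```

The elapsed time would normally come from OpenCL event profiling of a single kernel call, taken after the GPU has reached its full clock.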

I lack an official specification of the HD 4000 from Intel. I only found http://www.realworldtech.com/ivy-bridge-gpu/1/

Hence:
Could someone explain how to calculate the maximum GTexels/s for the image2D format specified above?

As I understand it, (b) indicates a higher GTexels/s than the figure given by GPU-Z. But under which conditions?
Any hint on how I could avoid the drop in GTexels/s seen in (a)?

Stephan

LLess · 04-01-2013

I am not sure what dimensions you are using for the kernels, or whether you are tiling the work or not.

But I am pretty sure that, depending on the image size and the local work size, you will see varying results because of how the texture accesses hit the cache.

I have parameterized some of my code before so that it can try different local work sizes and find the best one for a given kernel. You might want to try something like that too.
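A minimal sketch of such a parameterized search, not Laurent's actual code. The time_kernel callback is hypothetical: in a real host program it would enqueue the kernel with the given local size and return the measured time. The limit of 512 work-items per group is also only an assumed value; the real limit should be queried from the device. The divisibility filter reflects the OpenCL 1.x rule that the global size must be a multiple of the local size.

```python
from typing import Callable, List, Optional, Tuple


def candidate_local_sizes(global_size: int, max_work_group_size: int) -> List[int]:
    """Local sizes that evenly divide global_size and fit the device limit."""
    return [n for n in range(1, max_work_group_size + 1) if global_size % n == 0]


def find_best_local_size(
    global_size: int,
    max_work_group_size: int,
    time_kernel: Callable[[int], float],
) -> Optional[Tuple[int, float]]:
    """Time every candidate and return (local_size, elapsed) of the fastest."""
    best = None
    for local_size in candidate_local_sizes(global_size, max_work_group_size):
        elapsed = time_kernel(local_size)
        if best is None or elapsed < best[1]:
            best = (local_size, elapsed)
    return best


def fake_time_kernel(local_size: int) -> float:
    """Stand-in for a real measurement; pretends 64 is the sweet spot."""
    return abs(local_size - 64) + 1.0


# 262144 work-items as in the original post, assumed limit of 512 per group.
print(find_best_local_size(262144, 512, fake_time_kernel))
```

With a real timing callback it is worth repeating each measurement a few times, since the first calls run before the GPU has clocked up.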

Laurent

Raghupathi_M_Intel · 04-03-2013

I agree with Laurent. You can use the KernelBuilder's analyze feature to "guess" the best global and local sizes, instead of modifying the host code. Hope that helps.

Raghu

Stephan1 · 04-08-2013

Hi Laurent and Raghu

Thanks for your replies. Yes, I found the best-fitting global and local sizes by testing (together with loop unrolling and local memory).

I upgraded last week to OpenCL 1.2 with the SDK 3.0 and driver 15.31.3.64.3071 (Graphic Driver 9.18.10.3071)

I found that with the (old) driver supporting OpenCL 1.1, linear interpolation within a float image2D was only possible with normalized coordinates. But my algorithm uses the image2D like an image1D_array. Hitting a row of the array exactly is therefore not reliable, because the row has to be expressed as a normalized fraction with its inherent limited precision. Instead of the desired linear interpolation I got a bilinear interpolation, as I have to assume from further tests:
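To illustrate the precision concern, here is a rough sketch that emulates single-precision rounding with Python's struct module. It assumes the row centre is computed as (row + 0.5) / height, stored as a 32-bit float, and later scaled back by the height. How the HD 4000 actually converts normalized coordinates internally is not known from this thread, so this only shows that the round trip need not be exact.

```python
import struct

HEIGHT = 1250  # image height from the original post


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def row_error_normalized(row: int, height: int = HEIGHT) -> float:
    """Error in the row centre after a normalized round trip in float32."""
    v = to_f32((row + 0.5) / height)  # normalized coordinate as passed to the sampler
    y = to_f32(v * height)            # scaled back to texel units (assumed model)
    return y - (row + 0.5)


def row_error_unnormalized(row: int) -> float:
    """Error when the row centre row + 0.5 is passed directly as a float32."""
    return to_f32(row + 0.5) - (row + 0.5)


worst = max(abs(row_error_normalized(r)) for r in range(HEIGHT))
inexact = sum(1 for r in range(HEIGHT) if row_error_normalized(r) != 0.0)
print("rows not hit exactly via normalized coordinates:", inexact)
print("largest deviation in texel units:", worst)
```

Any nonzero deviation means the sample point no longer sits exactly on the row centre, so the neighbouring row receives a small nonzero weight in the filter.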

The driver supporting OpenCL 1.2 supports non-normalized coordinates for linear interpolation. Now the algorithm can hit the row exactly (i.e. y.5f in the float2 coordinate). To my surprise I got a 50% performance improvement! Hence, for such *.5f addressing, the driver / GPU (?) apparently performs only a linear interpolation instead of a bilinear one.
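The effect of *.5f addressing on the filter weights can be sketched in software. The function below follows the CLK_FILTER_LINEAR formula from the OpenCL specification for unnormalized coordinates with clamp-to-edge addressing; the tiny test image and the helper names are made up for illustration. At y = row + 0.5 the vertical weight is exactly 0, so the result equals a 1D interpolation along that row. Whether the hardware then actually skips fetching the second row is Stephan's hypothesis and is not shown by this sketch.

```python
import math


def clamp(i: int, n: int) -> int:
    """CLK_ADDRESS_CLAMP_TO_EDGE for an axis with n texels."""
    return max(0, min(i, n - 1))


def sample_bilinear(image, x: float, y: float) -> float:
    """Software model of CLK_FILTER_LINEAR on a 2D image, unnormalized coords."""
    height, width = len(image), len(image[0])
    i0 = math.floor(x - 0.5)
    j0 = math.floor(y - 0.5)
    a = (x - 0.5) - i0  # horizontal weight of the right-hand texel
    b = (y - 0.5) - j0  # vertical weight of the lower row
    i0c, i1c = clamp(i0, width), clamp(i0 + 1, width)
    j0c, j1c = clamp(j0, height), clamp(j0 + 1, height)
    return ((1 - a) * (1 - b) * image[j0c][i0c] + a * (1 - b) * image[j0c][i1c]
            + (1 - a) * b * image[j1c][i0c] + a * b * image[j1c][i1c])


def sample_linear_row(image, x: float, row: int) -> float:
    """1D linear interpolation along a single row."""
    width = len(image[row])
    i0 = math.floor(x - 0.5)
    a = (x - 0.5) - i0
    return (1 - a) * image[row][clamp(i0, width)] + a * image[row][clamp(i0 + 1, width)]


# Hypothetical 4x3 test image, one list per row.
image = [[0, 10, 20, 30], [100, 110, 120, 130], [200, 210, 220, 230]]

print(sample_bilinear(image, 1.75, 1.5))  # y on the centre of row 1
print(sample_linear_row(image, 1.75, 1))  # same value from row 1 alone
print(sample_bilinear(image, 1.75, 1.6))  # off-centre: row 2 contributes too
```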

Using image1D_array, available in OpenCL 1.2, brought another surprise: it shows the same (slower) performance as bilinear interpolation within an image2D. Hence I conclude that either I have not set up the array or the coordinates correctly, or the image1D_array is internally mapped to a 2D normalized image.

I thought an image1D_array would load less data into the texture cache, but that seems to be wrong. I now assume that a texture fetch loads the same amount of data in both setups, just as a 1D line instead of a 2D "area". As I cannot walk linearly through the cache, I have to accept wasted bandwidth here.

So I tried CL_HALF_FLOAT, CL_R, expecting a gain in the texture cache due to the smaller memory footprint. But the performance improvement was negligible.
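For reference, the footprint difference between the two formats is simple arithmetic on the image size from the first post. The sketch assumes tightly packed single-channel texels with no row padding, which the driver may not guarantee.

```python
WIDTH, HEIGHT = 1024, 1250  # image size from the original post

BYTES_PER_TEXEL = {
    "CL_FLOAT": 4,       # 32-bit float, single channel (CL_R)
    "CL_HALF_FLOAT": 2,  # 16-bit float, single channel (CL_R)
}


def image_bytes(channel_type: str) -> int:
    """Size of the image in bytes, assuming tight packing (assumption)."""
    return WIDTH * HEIGHT * BYTES_PER_TEXEL[channel_type]


for name in BYTES_PER_TEXEL:
    print(name, image_bytes(name), "bytes")
```

Halving the footprint only helps if the access pattern was limited by cache capacity or bandwidth in the first place, which would be consistent with the negligible gain reported here.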

Stephan

PS: With the OpenCL 1.1 driver, I could already access the HD 4000's result in host memory without clEnqueueReadBuffer(). The OpenCL 1.2 driver now seems to create device memory, even though the HD 4000 has none, and the result has to be read into host memory at the end, following the general pattern.

Stephan1 · 04-09-2013

Just as an update: The HD 4000 documentation is public at intellinuxgraphics.org