SIMD threads on GPU

Mohamed_Amine_BERGAC · ‎08-20-2013

Hello,

I'm wondering how threads are dispatched over SIMD units of the intel Ivy Bridge HD 4000 GPU, I tested many configurations and I'm blocked by some strange behaviours:

I use a simple kernel that compute N times the same "MAD" operation, I launch this kernel with global_size=local_size=1 , for the best to my knowledge I assume that the GPU will launch one thread on one EU ? is it correct ? the strange behaviours that I'm encountering : when I use the computation in my kernel as a scalar (float) I have about 2GFlops of performance, But when I try to use "MAD" as a vector (float2,float4, float8 or foat16) the performance falls dramatically to 0.1 Gflops , am i missing something ? can any one help me to understand ?

Thanks,

Mohamed

Ben_A_Intel · ‎08-22-2013

Hi Mohamed,

I gave a talk at SIGGRAPH that talked about how OpenCL workloads are executed on Intel Iris Graphics. The same concepts also apply to Intel HD 4000 GPUs. Might be worth a look:

http://software.intel.com/en-us/siggraph2013, specifically

http://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

To answer your specific question, you should get very similar performance with scalar data types as you do for vectors. You might see slightly better performance with vectors than scalars due to fewer back-to-back instruction dependencies, but the difference will be minimal. Can you describe your kernel in more detail? I suspect there's a bug in your code.

-- Ben

Mohamed_Amine_BERGAC · ‎08-23-2013

Hi Ben,

Thank you for your reply and for your slides.

My kernel is:

__kernel void MaxGFLOPS()
{
__private float x,y,z,w,a[512];
//__private float2 x,y,z,w,a[512];
//__private float4 x,y,z,w,a[512];
//__private float8 x,y,z,w,a[512];
//__private float16 x,y,z,w,a[512];

x=0.0f;
y=3.0f;
z=2.0f;
w=1.0f;

ushort i=0,j=0;
ushort loop=40000;

for(j=0;j<loop;j++)
{
for(i=0;i<512;i++)
{
a=mad(w,y,z);
}
}

for(i=0;i<512;i++)
{
x+=a;
}

}

The aim of this code is to perform mad computations as much as we can in private memory (EU registers),

Thank you for your help.

THX,

Mohamed

Ben_A_Intel · ‎08-23-2013

To get the maximum number of mad computations you want something like this:

[cpp]__kernel void foo( __global float4* buffer )

{

float4 x = buffer[ get_local_id(0) ];

float4 y = get_local_id(0);

float4 z = x;

// Repeat n times:

z = mad( z, x, y );

...

buffer[ get_local_id(0) ] = z;

}[/cpp]

The important changes are:

- Unroll your loop. The compiler should do this for you, but only up to a point, and you don't want to measure loop operations.

- Don't use constants as your input to mad() or the compiler will constant fold them.

- Make sure to write your result to memory at some point, or the compiler may dead code eliminate your entire kernel. :-)

-- Ben

Mohamed_Amine_BERGAC · ‎08-27-2013

Thank you Ben for your reponse,

I'm curious about what happens when I sets the Global_size=1 and local_size=1, I assume the kernel will be launched on only one EU, if this assumption was true, the compiler will allow the kernel to use the full width of the SIMD unit (for example : mad operation on float4 or float8) or will he restrict the kernel to use just one lane of the SIMD unit ?

I'm very confused about this point, and all explanations are welcomed

Thanks in advance,

Mohamed

Ben_A_Intel · ‎08-27-2013

Local Size = 1 will restrict you to one lane of the SIMD unit. Note also that Global Size = 1 means you'll only launch one thread on the EU array, so almost all of the machine will be unutilized and you won't see any benefits from co-issue.

Mohamed_Amine_BERGAC · ‎08-28-2013

Aha! this explains the strange behaviour that I got when I compared vector operations and scalar ones by setting Local_size and global_size to 1, in this case the vector operations will be "scalarized", and I got a poor performance compared with the scalar version.

This doesn't help me to do what I'm looking for, I'm searching to use just one thread per EU and hard coded SIMD operations to use the full width of the SIMD unit, I assume this can't be possible, am I right?

Thanks,

Mohamed

Ben_A_Intel · ‎09-03-2013

"One thread per EU" may be difficult, and it will prevent co-issue, so you really don't want to do that. "Using the full width of the SIMD unit" is doable though. Here's how to do it:

The number of occupied SIMD lanes is a function of the compiled SIMD width and your local work size. To determine the compiled SIMD size, use clGetKernelWorkGroupInfo( CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE ). This will probably return 8 for SIMD8 or 16 for SIMD16, but it might return 32 for SIMD32 in some rare cases. Then, so long as your local work size is a multiple of this value, you'll be using the full width of the SIMD unit.

Mohamed_Amine_BERGAC · ‎09-04-2013

So many thanks Ben for your explanations and your time :)

I'm a little bit disapointed about what I'm searching to do, I have a big amount of data to compute and I was hoping that I can benefit from all the space provided by the registers in one EU (one thread ensures to be in one EU) to store and compute my data.

Thanks,

Mohamed