
SIMD threads on GPU

Mohamed_Amine_BERGAC

Hello,

I'm wondering how threads are dispatched over the SIMD units of the Intel Ivy Bridge HD Graphics 4000 GPU. I've tested many configurations and I'm stuck on some strange behaviour:

I use a simple kernel that computes the same "MAD" operation N times, and I launch it with global_size = local_size = 1. To the best of my knowledge, the GPU will then launch one thread on one EU; is that correct? The strange behaviour I'm encountering: when the computation in my kernel is scalar (float), I get about 2 GFLOPS, but when I use "MAD" on a vector (float2, float4, float8, or float16), performance falls dramatically to 0.1 GFLOPS. Am I missing something? Can anyone help me understand?

Thanks,

Mohamed

Ben_A_Intel
Employee

Hi Mohamed,

I gave a talk at SIGGRAPH about how OpenCL workloads are executed on Intel Iris Graphics. The same concepts also apply to the Intel HD Graphics 4000 GPU. It might be worth a look:

http://software.intel.com/en-us/siggraph2013, specifically

http://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

To answer your specific question: you should get very similar performance with scalar data types as you do with vectors.  You might see slightly better performance with vectors than scalars due to fewer back-to-back instruction dependencies, but the difference should be minimal.  Can you describe your kernel in more detail?  I suspect there's a bug in your code.
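To picture the dependency point, a toy pair of kernels like the following (a sketch, not code from this thread; the kernel names and iteration count are arbitrary) run the same chain of dependent mads, but the float4 variant retires four times the flops per dependent instruction:

[cpp]// Illustrative only: the same dependent mad chain, scalar vs. float4.
__kernel void mad_scalar( __global float* buf )
{
    float x = buf[ get_global_id(0) ];
    float y = x + 1.0f;
    float z = x;
    for( int i = 0; i < 1024; i++ )
        z = mad( z, x, y );     // each mad depends on the previous one
    buf[ get_global_id(0) ] = z;
}

__kernel void mad_vector( __global float4* buf )
{
    float4 x = buf[ get_global_id(0) ];
    float4 y = x + 1.0f;        // the scalar broadcasts across the vector
    float4 z = x;
    for( int i = 0; i < 1024; i++ )
        z = mad( z, x, y );     // same chain length, 4x the flops
    buf[ get_global_id(0) ] = z;
}[/cpp]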

  -- Ben

Mohamed_Amine_BERGAC

Hi Ben,

Thank you for your reply and for your slides.

My kernel is:

[cpp]__kernel void MaxGFLOPS()
{
    __private float x, y, z, w, a[512];
    //__private float2  x, y, z, w, a[512];
    //__private float4  x, y, z, w, a[512];
    //__private float8  x, y, z, w, a[512];
    //__private float16 x, y, z, w, a[512];

    x = 0.0f;
    y = 3.0f;
    z = 2.0f;
    w = 1.0f;

    ushort i = 0, j = 0;
    ushort loop = 40000;

    for( j = 0; j < loop; j++ )
    {
        for( i = 0; i < 512; i++ )
        {
            a[i] = mad( w, y, z );
        }
    }

    for( i = 0; i < 512; i++ )
    {
        x += a[i];
    }
}[/cpp]

The aim of this code is to perform as many mad computations as possible in private memory (the EU registers).

Thank you for your help.

Thanks,

Mohamed

Ben_A_Intel
Employee

To get the maximum number of mad computations you want something like this:

[cpp]__kernel void foo( __global float4* buffer )
{
    float4 x = buffer[ get_local_id(0) ];
    float4 y = get_local_id(0);
    float4 z = x;

    // Repeat n times:
    z = mad( z, x, y );
    ...

    buffer[ get_local_id(0) ] = z;
}[/cpp]

The important changes are:

- Unroll your loop.  The compiler should do this for you, but only up to a point, and you don't want to measure loop operations.

- Don't use constants as your input to mad() or the compiler will constant fold them.

- Make sure to write your result to memory at some point, or the compiler may dead code eliminate your entire kernel.  :-)
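Putting those three points together, a sketch of what such a kernel could look like (illustrative only; the MAD8 macro and the repetition count are just one way to do the unrolling):

[cpp]// Sketch: input read from memory (no constant folding), mad chain
// unrolled via a macro (no loop overhead to measure), and the result
// written back (no dead-code elimination).
#define MAD8( z, x, y )                       \
    z = mad( z, x, y );  z = mad( z, x, y );  \
    z = mad( z, x, y );  z = mad( z, x, y );  \
    z = mad( z, x, y );  z = mad( z, x, y );  \
    z = mad( z, x, y );  z = mad( z, x, y );

__kernel void foo( __global float4* buffer )
{
    float4 x = buffer[ get_local_id(0) ];
    float4 y = (float)get_local_id(0);
    float4 z = x;

    MAD8( z, x, y );    //  8 mads
    MAD8( z, x, y );    // 16 mads
    MAD8( z, x, y );    // 24 mads
    MAD8( z, x, y );    // 32 mads

    buffer[ get_local_id(0) ] = z;
}[/cpp]

With this structure the flop count per work-item is exact (number of mads x 2 flops x vector width), which makes the GFLOPS arithmetic straightforward.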

    -- Ben

Mohamed_Amine_BERGAC

Thank you Ben for your response.

I'm curious about what happens when I set global_size = 1 and local_size = 1. I assume the kernel will be launched on only one EU. If that assumption is true, will the compiler allow the kernel to use the full width of the SIMD unit (for example, a mad operation on float4 or float8), or will it restrict the kernel to just one lane of the SIMD unit?

I'm very confused about this point, and any explanation is welcome.

Thanks in advance,

Mohamed

Ben_A_Intel
Employee

Local Size = 1 will restrict you to one lane of the SIMD unit.  Note also that Global Size = 1 means you'll launch only one thread on the EU array, so almost all of the machine will sit idle and you won't see any benefit from co-issue.

Mohamed_Amine_BERGAC

Aha! That explains the strange behaviour I saw when comparing vector and scalar operations with local_size and global_size set to 1: in that case the vector operations are effectively "scalarized", which is why I got poor performance compared with the scalar version.

This still doesn't let me do what I'm after, though. I want to use just one thread per EU, with hard-coded SIMD operations that use the full width of the SIMD unit. I assume this isn't possible; am I right?

Thanks,

Mohamed

Ben_A_Intel
Employee

"One thread per EU" may be difficult, and it will prevent co-issue, so you really don't want to do that.  "Using the full width of the SIMD unit" is doable though.  Here's how to do it:

The number of occupied SIMD lanes is a function of the compiled SIMD width and your local work size.  To determine the compiled SIMD size, use clGetKernelWorkGroupInfo( CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE ). This will probably return 8 for SIMD8 or 16 for SIMD16, but it might return 32 for SIMD32 in some rare cases.  Then, so long as your local work size is a multiple of this value, you'll be using the full width of the SIMD unit.
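In host code, the query and a launch that uses it could look something like this (a minimal sketch; error checking is omitted, the earlier creation of the kernel, device, and queue is assumed, and the global size is illustrative):

[cpp]#include <CL/cl.h>

// Sketch: launch 'kernel' with a local size that fills the compiled
// SIMD width.  Error checking omitted for brevity.
void launchFullWidth( cl_command_queue queue, cl_device_id device,
                      cl_kernel kernel )
{
    // Query the SIMD width the kernel was compiled to (8, 16, or 32):
    size_t simdWidth = 0;
    clGetKernelWorkGroupInfo( kernel, device,
        CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof( simdWidth ), &simdWidth, NULL );

    // Any local size that is a multiple of simdWidth occupies all lanes:
    size_t localSize  = simdWidth;
    size_t globalSize = 1024 * localSize;   // illustrative global size

    clEnqueueNDRangeKernel( queue, kernel, 1, NULL,
        &globalSize, &localSize, 0, NULL, NULL );
}[/cpp]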

Mohamed_Amine_BERGAC

So many thanks Ben for your explanations and your time :)

I'm a little disappointed about what I was trying to do: I have a large amount of data to compute, and I was hoping to benefit from all of the register space in one EU (one thread is guaranteed to be on one EU) to store and compute my data.

Thanks,

Mohamed
