I'm writing OpenCL kernel code for HD Graphics. I know each EU has a 4-way SIMD FPU, but I don't know how to make use of SIMD operations.
Is a kernel compiled for HD Graphics vectorized automatically? And how can I tell whether the kernel code has been vectorized?
When I compile my kernel with the Intel Kernel Builder targeting the CPU, it displays "Kernel <~~> was successfully vectorized", so I assume that build is vectorized. When I compile for the GPU, however, no such message appears.
Is kernel code vectorized automatically when compiled for HD Graphics? If it isn't, how do I vectorize it?
Hi Naoki,
Yes, on the GPU the kernel code is automatically vectorized. Code will be compiled SIMD32, SIMD16, or SIMD8, depending on how long your kernel is and how much private memory each work item uses (typically you have 128 bytes of private memory per work item in the SIMD32 case, 256 bytes in the SIMD16 case, or 512 bytes in the SIMD8 case). Most regular kernels are compiled SIMD16.
Additionally, you can perform more work per item by using vector types like uchar4, uchar16, or float4. You can watch my videos on Optimizing Simple OpenCL kernels here https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels. You can also try our tools like Intel(R) VTune Amplifier XE https://software.intel.com/en-us/intel-vtune-amplifier-xe/try-buy - they will show you whether your kernel was built SIMD32, SIMD16, or SIMD8.
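As a sketch of the vector-type suggestion above (hypothetical kernel and argument names, not taken from this thread), a scalar kernel and its float4 counterpart might look like the following OpenCL C. Both are vectorized across work items by the compiler; the float4 version simply processes four elements per work item, so it would be launched with a global size of N/4:

```c
// Scalar version: one float per work item.
__kernel void add_scalar(__global const float* a,
                         __global const float* b,
                         __global float* c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

// Vector version: four floats per work item via the float4 type.
__kernel void add_vec4(__global const float4* a,
                       __global const float4* b,
                       __global float4* c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];   // one float4 add = four scalar adds
}
```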
Hi Robert.
Thanks for your reply.
The HD Graphics 4600 EU has two 4-wide SIMD floating-point units, so I think each EU can perform 8 floating-point FMAs simultaneously. Does my kernel execute 8 operations at once?
I want to compare the theoretical computation time with the actual time, so I want to know whether all the units are being used.
Hi Naoki,
Generally: (MUL + ADD) x Physical SIMD width x Num FPUs x Num EUs x Clock Speed
So, for HD Graphics 4600 you have 2 (MUL + ADD) x 4 x 2 x 20 x 1.3 GHz = 416 GFLOPS
If you can achieve 70 to 80% of that number on a real kernel, you did very well :)
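The peak-throughput formula above can be checked with a few lines of arithmetic (the hardware figures are the ones quoted in this thread for HD Graphics 4600):

```python
# Peak single-precision throughput estimate for Intel HD Graphics 4600,
# using the figures from the formula above.
flops_per_fma = 2    # MUL + ADD counted as two FLOPs
simd_width = 4       # 4-wide physical SIMD per FPU
fpus_per_eu = 2      # two FPUs per EU
num_eus = 20         # EUs on HD Graphics 4600
clock_ghz = 1.3      # max clock speed

peak_gflops = flops_per_fma * simd_width * fpus_per_eu * num_eus * clock_ghz
print(peak_gflops)  # 416.0
```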
Hi Robert.
I tried using float4 instead of float, with vload4 and vstore4, but the kernel didn't get faster. On the contrary, it got slower...
Is this because my kernel code is already automatically vectorized? I use float, so does each EU execute two float4 operations?
Hi Naoki,
There could be a number of reasons why your kernel got slower. The three main issues are generally:
1. You are thread launch limited (highly unlikely, when switching from float to float4, since you are reducing the number of threads launched)
2. You are compute limited (switching to float4 could have that effect if you are performing complex calculations)
3. You are bandwidth limited.
It is hard for me to tell what is going on without looking at the kernel. Could you post a sample workload with your float and float4 kernels or maybe send me a private message with an attachment, so I can take a look at what's going on?
BTW, you don't need to use vload4 and vstore4 unless you have unaligned loads and stores; otherwise you can use regular array reads and writes. Also, you can try downloading Intel(R) VTune(TM) Amplifier 2016 https://software.intel.com/en-us/intel-vtune-amplifier-xe/ - you can get a free 30-day evaluation copy and run your application there. Check this article on using VTune for OpenCL analysis: https://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-performance-analysis-on-intel-hd-graphics?language=en
Hi Robert!
I sent you a private message; please check it.
When I checked my code with VTune Amplifier, it is SIMD-16, while similar but simpler code is SIMD-32.
What makes the difference? How can I get SIMD-32?
Hi Naoki,
Typically, simpler kernels (fewer than about 150 assembly instructions) are compiled SIMD32 (meaning 32 work items fit on one hardware thread), more complex kernels are compiled SIMD16, and very complicated kernels are compiled SIMD8. Currently, the compiler decides this automatically.
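Besides VTune, one hedged way to check the compiled SIMD width programmatically is the standard clGetKernelWorkGroupInfo query: on Intel GPUs, the preferred work-group size multiple generally reflects the SIMD width (8, 16, or 32) the kernel was compiled for. This is a fragment, not a complete program; the `kernel` and `device` handles are assumed to come from an already-built OpenCL program:

```c
/* Query the preferred work-group size multiple; on Intel HD Graphics
 * this typically equals the compiled SIMD width (8, 16, or 32). */
size_t simd = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(simd), &simd, NULL);
printf("Compiled SIMD width (likely): %zu\n", simd);
```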
Hi Robert.
I understand why my kernel code is compiled SIMD16.
I'll have to write simpler code, reducing the assembly instruction count, in order to get it compiled SIMD32.
