Intel Gen8 architecture: calculating total kernel instances per execution unit

Manish_K_
Beginner

I am using the intel_gen8_arch document as my reference.

A few sections are confusing my understanding of the SIMD engine concept.

5.3.2 SIMD FPUs

"Within each EU, the primary computation units are a pair of SIMD floating-point units (FPUs). Although called FPUs, they support both floating-point and integer computation. These units can SIMD execute up to four 32-bit floating-point (or integer) operations, or SIMD execute up to eight 16-bit integer or 16-bit floating-point operations. Each SIMD FPU can complete simultaneous add and multiply (MAD) floating-point instructions every cycle. Thus each EU is capable of 16 32-bit floating-point operations per cycle: (add + mul) x 2 FPUs x SIMD-4."

The lines above clearly state the maximum number of floating-point operations that can be performed per cycle on each execution unit.

First question: I think this figure refers to each hardware thread of an execution unit rather than to the whole execution unit. Am I right?

Section 5.3.5 says:

"On Gen8 compute architecture, most SPMD programming models employ this style code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-Width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for a SIMD-32 compile of a compute kernel, 32 x 7 threads = 224 kernel instances could be executing concurrently on a single EU."

This illustration seems to contradict section 5.3.2.

Specifically, if each hardware thread of an EU has two SIMD-4 units, how does SIMD-32 work? How do we arrive at 224 kernel instances on 7 threads? Do we combine some hardware threads?

Also, how do we compile a kernel in SIMD-16 or SIMD-32 mode?

Robert_I_Intel
Employee

Hi Manish,

1. It is referring to the whole execution unit.

2. Under the hood our hardware is SIMD4, but what is exposed to the programmer is a SIMD32, SIMD16 or SIMD8 software view of the world. So a SIMD8 operation takes 2 cycles to execute, a SIMD16 operation takes 4 cycles, and a SIMD32 operation takes 8 cycles.
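
The arithmetic in this thread can be checked with a short Python sketch. It only restates the figures quoted from sections 5.3.2 and 5.3.5 and the cycle counts given in this reply; the constant and function names are made up for illustration and are not part of any Intel tool or API.

```python
# Figures taken from the quoted Gen8 document and the reply above.
FPUS_PER_EU = 2        # pair of SIMD FPUs in each EU (section 5.3.2)
NATIVE_SIMD_WIDTH = 4  # each FPU processes four 32-bit lanes per cycle
OPS_PER_MAD = 2        # a MAD counts as one add plus one multiply
THREADS_PER_EU = 7     # hardware threads per EU (section 5.3.5)


def peak_fp32_ops_per_cycle():
    """Peak 32-bit floating-point operations per cycle for a whole EU."""
    return OPS_PER_MAD * FPUS_PER_EU * NATIVE_SIMD_WIDTH


def kernel_instances_per_eu(simd_width):
    """Kernel instances that can be resident on one EU for a compiled SIMD width."""
    return simd_width * THREADS_PER_EU


def cycles_per_operation(simd_width):
    """Cycles for one SIMD operation issued in SIMD4 chunks, as described in the reply."""
    return simd_width // NATIVE_SIMD_WIDTH


print("peak FP32 ops per cycle per EU:", peak_fp32_ops_per_cycle())
for width in (8, 16, 32):
    print(
        f"SIMD{width}: {kernel_instances_per_eu(width)} kernel instances per EU, "
        f"{cycles_per_operation(width)} cycles per operation"
    )
```

The point of the sketch is that the two quoted sections measure different things: 16 operations per cycle is the arithmetic throughput of the whole EU, while 112 or 224 is the number of kernel instances that can be in flight across the 7 hardware threads, each instruction being executed over several cycles.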

Manish_K_
Beginner

Thanks Robert for this!

The next thing I could not relate to is warp size. Warp is NVIDIA's terminology, but what is the corresponding term in Intel/OpenCL, and how does it correlate with the SIMD width (8/16/32)?

As I understand it, all work items in a warp execute in parallel, and the warp size on NVIDIA is 32. If, as you say, Intel has only two SIMD-4 units under the hood, how does a warp get executed on an Intel GPGPU? I hope you understand my question.

Manish_K_
Beginner

Also, please let me know how to compile a kernel in SIMD-8/16/32 mode.

Robert_I_Intel
Employee

Manish,

OpenCL has a subgroups extension, and on Intel architecture we have the cl_intel_subgroups extension: see https://www.khronos.org/registry/cl/extensions/intel/cl_intel_subgroups.txt and, for some sample code, this article: https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics

So for Intel, the "warp size" is 8, 16 or 32 depending on how your kernel is compiled :) And every subgroup is executed in SIMD4 chunks.

Currently, there is no way for external customers to control the SIMD width of the compilation: it is a heuristic in the compiler. We explored a couple of extensions to do that in the past, but nothing is in production yet. Typically, small kernels (fewer than 150 assembly instructions) are compiled SIMD32, medium-sized kernels that require less than 256 bytes of private memory per work item are compiled SIMD16, and everything else is SIMD8.
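
As a rough illustration only, the rule of thumb above could be written as the Python sketch below. This is not the compiler's actual logic: the function name, its parameters and the order in which the two thresholds are checked are assumptions made for this example, and the real heuristic may weigh other factors.

```python
def guess_simd_width(instruction_count, private_bytes_per_work_item):
    """Hypothetical restatement of the rule of thumb described in the post.

    instruction_count: number of assembly instructions in the compiled kernel.
    private_bytes_per_work_item: private memory needed by each work item.
    """
    # Small kernels: fewer than 150 assembly instructions.
    if instruction_count < 150:
        return 32
    # Medium kernels: less than 256 bytes of private memory per work item.
    if private_bytes_per_work_item < 256:
        return 16
    # Everything else.
    return 8


# Example inputs are invented purely to exercise each branch.
for instructions, private_bytes in ((100, 64), (500, 128), (500, 1024)):
    width = guess_simd_width(instructions, private_bytes)
    print(f"{instructions} instructions, {private_bytes} B private -> SIMD{width}")
```

Since the width cannot be set from outside, a sketch like this is only useful for reasoning about why a given kernel might have ended up at a particular subgroup size.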

Manish_K_
Beginner

Thanks Robert!

Can I say that Intel's subgroups, NVIDIA's warps and AMD's wavefronts are all the same thing?

So OpenCL doesn't provide any name for this functionality?

Robert_I_Intel
Employee

"Subgroups" is actually an OpenCL term: there were two proposals, and it looks like the Intel proposal is winning in the longer term. The alternative proposal is here: https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/cl_khr_subgroups.html

Subgroups and warps are the same thing. I do not know anything about wavefronts.
