Questions about kernel instances & the analysis methods for kernel code

Xinyan_S_ · ‎12-25-2016

Hi there & every expert:
Sorry to bother you! I have a question about the document <The Compute Architecture of Intel® Processor Graphics Gen7.5>. I am using the Iris Pro 5200, so I think this is the right paper to read. On page 6 there is this sentence: "generate SIMD code to map multiple kernel instances to be executed simultaneously within a given hardware thread". The term "kernel instances" is defined as: "We use the generic term kernel instance as equivalent to OpenCL work-item, or DirectX ComputeShader thread."
My questions are:
1. Why are the work-items executed simultaneously within a given hardware thread? What is the benefit?
2. When these work-items are executed in one thread, in what order do they run? How does the hardware control it? Which rule does it depend on?
3. If the questions above touch on confidential design details of specific GPUs, I apologize.

4. Are there any analysis methods on the web for kernel code? I have installed code debugger for Windows and develop in VS 2013. I have collected a lot of data by clicking the Analyze button, but I don't know how to use it. Please give me a hint.

Thank you very much for reading. I would appreciate any answers!

Jeffrey_M_Intel1 · ‎12-27-2016

These are important topics, and I hope we are able to extend documentation on them in the future. As a short answer (from section 5.2 in the architecture guide):

Each EU has 2 FPUs.
Each FPU can SIMD-execute four 32-bit (float or int) operations per cycle, but instructions can be up to 32 wide. (The wider instructions take more cycles to complete.)
If there are enough registers, we will schedule as if there are 32 separate "SIMD lanes". There is a scalarization pass in the compiler to break up vector operations before vectorizing again. (Important exception: vector loads and stores are not scalarized.) This means scalar and vector code can often result in very similar Gen instructions. This approach is used since it can often give the compiler more opportunities to optimize. If 32 lanes means too many registers are needed, the compiler will drop to 16 lanes, or 8 if the need for registers is high. Usually the compiler makes this decision for you based on the register needs of the code. You can influence this decision with the reqd_work_group_size attribute.
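
To make the width trade-off concrete, here is a toy Python model. It is only a sketch and not the real compiler heuristic. The register file size used (128 registers of 32 bytes per hardware thread) is an assumption for illustration, and the function names are hypothetical. It shows why a kernel that keeps more values live per work-item ends up at a narrower SIMD width, and how many cycles one instruction takes on a 4-lane FPU at each width.

```python
# Toy model of SIMD width selection; NOT the real compiler heuristic.
GRF_REGISTERS = 128      # assumption: registers available per hardware thread
GRF_BYTES = 32           # assumption: bytes per register
FPU_LANES_PER_CYCLE = 4  # from the answer above: four 32-bit ops per cycle


def registers_needed(live_values, simd_width, bytes_per_value=4):
    """Registers used when every work-item keeps `live_values` 32-bit values live."""
    bytes_total = live_values * simd_width * bytes_per_value
    return -(-bytes_total // GRF_BYTES)  # ceiling division


def pick_simd_width(live_values):
    """Prefer the widest width that fits in the register file, else fall back to 8."""
    for width in (32, 16, 8):
        if registers_needed(live_values, width) <= GRF_REGISTERS:
            return width
    return 8  # a real compiler would have to spill registers at this point


def cycles_per_instruction(simd_width):
    """Wider instructions take more cycles on a 4-lane FPU."""
    return simd_width // FPU_LANES_PER_CYCLE


for live in (20, 50, 100):
    width = pick_simd_width(live)
    print(f"{live} live values -> SIMD-{width}, "
          f"{cycles_per_instruction(width)} cycles per instruction")
```

In this model a 32-bit value costs one register at SIMD-8, two at SIMD-16, and four at SIMD-32, which is why register pressure pushes the width down.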

One of the things Code Builder kernel analysis provides is "what if" analysis to automatically run experiments on size configurations to check what works best with the compiler and hardware. If you select "auto" for your local size the kernel analyzer will try several combinations to make it easier to find the best execution time, as well as to optimize occupancy and latency. A good place to start is with the "execution analysis" screen which appears after running Code Builder->Kernel Development->Analyze. This shows a "Best Configuration" at the top, with global and local sizes for running the kernel optimally which you can add to your code. Other options are shown below that -- there may be other size combinations which work as well (or close) which may fit better with your plans for other kernels, or with the optimal size found across a range of devices.
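
The idea behind this kind of "what if" sweep can be sketched in a few lines of Python. This is not how Code Builder is implemented. `fake_run_kernel` is a hypothetical stand-in that returns made-up times; in real code it would be replaced by a timed kernel enqueue. The sketch only shows the experiment: try each local size that evenly divides the global size, keep the best time for each, and report the winner.

```python
def candidate_local_sizes(global_size, max_local=256):
    """Local sizes that evenly divide the global size (required by OpenCL 1.x)."""
    return [n for n in range(1, max_local + 1) if global_size % n == 0]


def best_configuration(global_size, run_kernel, repeats=3):
    """Time every candidate local size and return (best_local_size, all_results)."""
    results = {}
    for local in candidate_local_sizes(global_size):
        # Keep the fastest of several runs to reduce timing noise.
        results[local] = min(run_kernel(global_size, local) for _ in range(repeats))
    best = min(results, key=results.get)
    return best, results


def fake_run_kernel(global_size, local_size):
    """Hypothetical stand-in for a timed enqueue; returns a made-up time in ms."""
    groups = global_size // local_size
    return groups * 0.01 + abs(local_size - 64) * 0.002


best, results = best_configuration(1024, fake_run_kernel)
print("best local size:", best)
for local, ms in sorted(results.items()):
    print(f"  local={local:4d}  time={ms:.3f} ms")
```

Keeping the full `results` table, and not only the winner, mirrors the point above: a size that is close to the best may fit better with other kernels or other devices.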

Xinyan_S_ · ‎12-27-2016

Hello Jeffrey:

Thanks for your detailed reply. Based on your explanation, I have some further questions.

1. What is a "lane"? Are the lanes executed concurrently?

2. Assuming that the kernel is compiled to SIMD-32, is the assembly code generated for it executed once or 32 times? I don't know much about the "send" instruction; is there anywhere I can find information on it?

3. About EU state: as long as at least one thread is dispatched, the EU is not idle, right? And we can't say an EU is stalled until all of its threads are stalled, for example while reading data from memory, right?

Thanks again!!!

Jeffrey_M_Intel1 · ‎12-29-2016

1 and 2: SIMD lanes are (at least conceptually) executed concurrently. If each kernel only works on 1 array element then for a SIMD-32 compile work should be arranged so up to 32 kernel instances appear to work independently -- though this is implemented by the compiler arranging SIMD instructions. If the kernels use vector types then more "lanes" will be used by each kernel. This reference on Gen assembly might help: https://software.intel.com/en-us/articles/introduction-to-gen-assembly
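
As a rough mental model (a sketch only, not actual Gen behavior), the Python below emulates one hardware thread running a SIMD-32 compile of the kernel body `out[i] = data[i] * 2 + 1`. The class and function names are hypothetical, and the mapping of work-items to threads and lanes is an illustrative assumption. The point is that the instruction stream is issued once per hardware thread, and each instruction operates on all 32 lanes.

```python
SIMD_WIDTH = 32  # assumed compile width


def thread_and_lane(global_id, simd_width=SIMD_WIDTH):
    """Illustrative mapping of a work-item to (hardware thread, lane)."""
    return divmod(global_id, simd_width)


class SimdThread:
    """One hardware thread: a single instruction stream applied to many lanes."""

    def __init__(self, width=SIMD_WIDTH):
        self.width = width
        self.instructions_issued = 0

    def mul(self, a, b):
        self.instructions_issued += 1  # one instruction covers every lane
        return [x * y for x, y in zip(a, b)]

    def add(self, a, b):
        self.instructions_issued += 1
        return [x + y for x, y in zip(a, b)]


def run_kernel(thread, data):
    """Kernel body `out[i] = data[i] * 2 + 1` for all lanes of one thread."""
    two = [2] * thread.width
    one = [1] * thread.width
    return thread.add(thread.mul(data, two), one)


t = SimdThread()
out = run_kernel(t, list(range(SIMD_WIDTH)))
print("work-items processed:", len(out))
print("instructions issued: ", t.instructions_issued)
print("work-item 70 lands in (thread, lane):", thread_and_lane(70))
```

Here 32 work-items are served by 2 issued instructions, not 64, which is the sense in which the lanes "appear to work independently" while sharing one instruction stream.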

3. There are 2 FPUs per EU but 7 threads. In theory 2 threads could be enough to feed the 2 FPUs. 7 is a compromise between the HW resources needed to hold thread state and keeping the FPUs busy. If a thread is waiting for memory we hope we can switch to another. You're right -- if a tool such as Code Builder, Graphics Performance Analyzers, or VTune Amplifier reports EUs as stalled, this means no threads were scheduled to them or the EUs were idle for at least part of the sampling period.
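
The latency-hiding argument can be illustrated with a small Python simulation. It is a toy model under made-up assumptions: every thread alternates 4 cycles of compute with 12 cycles of waiting on memory, and the lowest-numbered ready threads are picked first. Real EU thread arbitration and real latencies differ, and the function name is hypothetical. The sketch only shows the trend: with more resident threads, the 2 FPUs are more likely to find a thread that is ready to run.

```python
def fpu_utilization(num_threads, compute_cycles=4, memory_cycles=12,
                    num_fpus=2, total_cycles=10_000):
    """Fraction of FPU cycles that do useful work in this toy model."""
    compute_left = [compute_cycles] * num_threads  # cycles left in current burst
    wait_left = [0] * num_threads                  # cycles left waiting on memory
    busy = 0
    for _ in range(total_cycles):
        # Threads that are not waiting on memory can be picked this cycle.
        ready = [t for t in range(num_threads) if wait_left[t] == 0]
        # Memory waits progress whether or not the FPUs are busy.
        for t in range(num_threads):
            if wait_left[t] > 0:
                wait_left[t] -= 1
        # Each FPU runs at most one ready thread per cycle.
        for t in ready[:num_fpus]:
            busy += 1
            compute_left[t] -= 1
            if compute_left[t] == 0:
                compute_left[t] = compute_cycles
                wait_left[t] = memory_cycles  # thread now stalls on memory
    return busy / (num_fpus * total_cycles)


for n in (1, 2, 4, 7):
    print(f"{n} threads -> FPU utilization {fpu_utilization(n):.3f}")
```

With these numbers, 2 threads leave the FPUs empty for most of each memory wait, while 7 threads keep them busy far more of the time, at the cost of holding state for 7 threads.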