Hi there, experts:
Sorry to bother you! I have a question about the specification titled "The Compute Architecture of Intel® Processor Graphics Gen7.5". Since I am using the Iris Pro 5200, I believe this is the right paper to read. On page 6 there is a sentence: "generate SIMD code to map multiple kernel instances to be executed simultaneously within a given hardware thread", and the term "kernel instances" is defined as: "We use the generic term kernel instance as equivalent to OpenCL work-item, or DirectX ComputeShader thread."
My questions are:
1. Why are work-items executed simultaneously within a given hardware thread? What is the benefit?
2. When these work-items are executed in one thread, in what order do they run? How does the hardware control this, and by which rule?
3. If the questions above touch on confidential design details of specific GPUs, I apologize in advance.
4. Are there any analysis methods available on the web for kernel code? I have installed the code debugger for Windows and develop in Visual Studio 2013. I have acquired a lot of data by clicking the Analyze button, but I don't know how to make use of it. Please give me a hint.
Thank you very much for reading, and I would appreciate any answers to these questions!
These are important topics, and I hope we are able to extend the documentation on them in the future. As a short answer (from section 5.2 in the architecture guide):
One of the things Code Builder kernel analysis provides is "what if" analysis to automatically run experiments on size configurations to check what works best with the compiler and hardware. If you select "auto" for your local size the kernel analyzer will try several combinations to make it easier to find the best execution time, as well as to optimize occupancy and latency. A good place to start is with the "execution analysis" screen which appears after running Code Builder->Kernel Development->Analyze. This shows a "Best Configuration" at the top, with global and local sizes for running the kernel optimally which you can add to your code. Other options are shown below that -- there may be other size combinations which work as well (or close) which may fit better with your plans for other kernels, or with the optimal size found across a range of devices.
Thanks for your reply; your answer is detailed. Based on your explanation, I have some other questions.
1. What is a "lane"? Are the lanes executed concurrently?
2. Assuming the kernel is compiled to SIMD-32, is the generated assembly code executed once or 32 times? I don't know much about the "send" instruction; is there anywhere I can find information about it?
3. About EU state: as long as at least one thread is dispatched, the EU is not idle, right? And we can't say an EU is stalled until all of its threads are stalled, for example while reading data from memory, right?
1 and 2: SIMD lanes are (at least conceptually) executed concurrently. If each kernel instance only works on one array element, then for a SIMD-32 compile the work is arranged so that up to 32 kernel instances appear to execute independently -- though this is implemented by the compiler emitting SIMD instructions. If the kernels use vector types, then each kernel instance will occupy more lanes. This reference on Gen assembly might help: https://software.intel.com/en-us/articles/introduction-to-gen-assembly
3. There are 2 FPUs per EU but 7 threads. In theory 2 threads could be enough to feed the 2 FPUs. 7 is a compromise between the hardware resources needed to hold thread state and keeping the FPUs busy: if a thread is waiting for memory, we hope we can switch to another. You're right -- if a tool such as Code Builder, Graphics Performance Analyzers, or VTune Amplifier reports EUs as stalled, this means no threads were scheduled to them or the EUs were idle for at least part of the sampling period.