I have a basic question about the number of work groups that can run in parallel. According to the definition of a compute unit, each compute unit can run only one work group, so the number of work groups that can run concurrently should depend "only" on the number of CUs present. But the "OpenCL* Applications - Optimization Guide" states that the number of work groups depends on the number of work items in a group.
Each Execution Unit (EU) in our integrated graphics has seven hardware threads, and each hardware thread is capable of running 8, 16, or 32 work items, depending on whether the compiler chose to build your kernel SIMD8, SIMD16, or SIMD32. There are six or eight execution units per subslice, two or more subslices per slice, and in some higher SKUs two slices. So you can run quite a large number of work groups in parallel, depending indeed on the size of your work group. If you are using local memory and barriers, they will limit the number of work groups that can run concurrently, since both are limited resources (64 KB of local memory per subslice and 16 barriers per subslice).
See https://software.intel.com/en-us/file/compute-architecture-of-intel-processor-graphics-gen8pdf and https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Proce... for more details.
I am using HD 4600 graphics, which has 20 EUs. In my application the work group size is 64, so in SIMD32 mode will only two hardware threads be utilized per EU? Or can multiple work groups execute on a single EU?
In SIMD16 mode you only have 256 bytes of private memory per work item. So if your kernel exceeds that amount, it will be compiled SIMD8, where you will have 512 bytes of private memory per work item.
I have two kernels with the same number of loads from host memory, but they use different amounts of private memory. Will this have an impact on the EU stalls?
Other than loads from host memory, what are the other factors that cause EU stalls?
The amount of private memory can affect whether the kernel is compiled SIMD8 or SIMD16 (the kernel that uses more private memory is more likely to end up compiled SIMD8). So even though both kernels have the same number of loads, the one compiled SIMD16 can end up with twice as many loads per hardware thread as the one compiled SIMD8, and hence you will see more stalling from the SIMD16 kernel.
Stalls are caused by global memory reads and writes, local memory reads and writes and image reads and writes.
This time both kernels are compiled in SIMD8 mode, and both implement the same algorithm in different ways. Because of the difference in implementation, one kernel uses almost twice as much private memory as the other. I am observing 21% stalls in the kernel that uses less private memory and 50% stalls in the kernel that uses more.
Which OS are you using? What version of the graphics driver? What tool are you using to get the stall info? Which version of the tool?
Also, could you provide a small reproducer for this issue? You can send it in a private message if you are uncomfortable sharing it on a public forum. At the very least, could you list all the read and write operations in both kernels?
One thing that comes to mind without seeing your kernels: could it be that you are exceeding 512 bytes of private memory in your second kernel? If that is the case, your kernel will start spilling to global memory, which will result in additional memory traffic and therefore stalls.
I am using the VTune analyzer to analyze the performance of the HD 4600 GPU. One of the kernels exceeds the 512 bytes of private memory and causes the degradation.
Previously I could not understand why the memory read and write traffic increased tenfold when I used the kernel with more private memory. Thanks for the information that the kernel starts spilling to global memory once private memory usage exceeds 512 bytes.
I have some interpolation kernels; some execute on the GPU and some on the CPU (platform: i5 processor with HD 4600 graphics). I see performance degradation on the GPU side, i.e., the stall percentage and average execution time on the GPU increase when the GPU and CPU cores run in parallel.
When I serialize these executions, i.e., keep the CPU cores idle while the GPU is running, I get better GPU performance. So what are the usual factors affecting the performance here?
The SOC (System on a Chip, which includes both CPU and GPU) is designed to run within a certain power envelope, meaning that the device can run generating a certain amount of heat that the cooling system is capable of safely dissipating (see https://en.wikipedia.org/wiki/Thermal_design_power ). Let's say it is 28 Watts. When you run CPU only, or GPU only part of the SOC, let's say you generate 26 Watts and you are within your power envelope. If you start using both CPU and GPU at peak frequencies, what may happen is that you start generating 30 Watts and exceed your power envelope. At this point, the SOC power circuits will kick in and lower the frequency of the GPU or CPU or both to keep you under 28 Watts, so obviously your performance will go down.
The way to observe this is to use Vtune (see for example this article https://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-perfo... ), and one of the things that it shows on the timeline is the GPU frequency and you should notice that the GPU frequency could go down when you use both CPU and GPU simultaneously.
Yes, if the buffer fits in the eDRAM, it will reside there. eDRAM is part of the global memory subsystem.
Yes to the second question as well. Just make sure to align the buffer on a 4096-byte boundary, size it in multiples of 64 bytes, and use the CL_MEM_USE_HOST_PTR flag when creating the buffer.
Currently, if your buffer is bigger than the LLC (2-8 MB depending on the processor) and smaller than 128 MB (the size of the eDRAM; note that on Skylake - 6th generation - chips there are 64 MB and 128 MB configurations), it will end up in eDRAM. Typically, if your kernel has an input and an output buffer, the sum of their sizes should be less than 128 MB. The only way to observe the eDRAM boundary is to gradually increase the buffer sizes and watch performance drop in a step-function fashion as you exceed the size of the eDRAM.
Let's say we try to allocate a buffer of 140 MB. Will the allocation be partly in eDRAM (128 MB) and partly in system DRAM (12 MB), or will the full 140 MB be in system DRAM?
In another case, if my buffer size is, say, 16 MB, will the allocation be partly in the LLC and partly in eDRAM, or fully in eDRAM?