I am trying to work out how many instructions execute in parallel in an OpenCL kernel, as a function of the kernel launch parameters. For instance, on a 4-core Xeon, I launch 8 workgroups of 32 threads (1 workgroup per HW thread). The overall degree of parallelism is therefore 8 x the degree of parallelism of one workgroup.
What is the degree of parallelism of a workgroup? I know that the code is scalarized and then vectorized to fit the width of the xmm registers, and that pipelining must also be taken into account.
Thanks, Michael, for your question.
You are definitely right. Each workgroup is implemented as a loop over its work-items. The loop is then unrolled to the "float" SIMD width of the CPU, so double-precision operations need 2 SIMD registers for each argument. This gives a parallelism of 8 on today's CPUs (4 for doubles).
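To make the loop picture concrete, here is a rough C sketch of how one workgroup could be modelled on the CPU. It is only an illustration, not the actual code the OpenCL compiler generates: `SIMD_WIDTH`, `kernel_body` and `run_workgroup` are hypothetical names, and the width of 8 float lanes is an assumption taken from the figure quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8 float lanes per SIMD register, as quoted in the reply. */
enum { SIMD_WIDTH = 8 };

/* Hypothetical per-work-item kernel body: out = 2 * in. */
static float kernel_body(float x)
{
    return 2.0f * x;
}

/*
 * Model of one workgroup: a loop over the work-items that advances
 * one SIMD packet (SIMD_WIDTH work-items) per iteration.
 * Returns the number of packet iterations executed.
 */
static size_t run_workgroup(const float *in, float *out, size_t local_size)
{
    /* This sketch only handles sizes that are a multiple of the width. */
    assert(local_size % SIMD_WIDTH == 0);

    size_t packets = 0;
    for (size_t base = 0; base < local_size; base += SIMD_WIDTH) {
        /* The inner loop stands in for a single vector instruction. */
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            out[base + lane] = kernel_body(in[base + lane]);
        ++packets;
    }
    return packets;
}
```

With the 32 work-items from the question, the outer loop runs 4 times, each iteration covering 8 work-items. With 8 such workgroups on 8 HW threads, that would be up to 8 x 8 = 64 float lanes in flight, before counting any instruction-level parallelism.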
In addition, each CPU core can issue multiple different instructions per cycle. The level of instruction-level parallelism depends on the CPU model (generation), on the combination of instructions ready to execute in any given cycle, and on the CPU resources available in that clock cycle. However, the OpenCL compiler does not expose such parallelism through additional loop unrolling.
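For illustration only, this is roughly what "additional loop unrolling" to expose instruction-level parallelism looks like when written by hand in plain C. The function names are hypothetical, and whether the second version actually runs faster depends on the CPU model and on what else is in flight, as described above.

```c
#include <stddef.h>

/* One dependency chain: every add has to wait for the previous one. */
float sum_single(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/*
 * Unrolled by two with independent accumulators: the two adds in each
 * iteration do not depend on each other, so a core that can issue
 * several instructions per cycle may overlap them.
 */
float sum_unrolled2(const float *x, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i];
        acc1 += x[i + 1];
    }
    if (i < n)          /* leftover element when n is odd */
        acc0 += x[i];
    return acc0 + acc1;
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point data the result can differ slightly in rounding from the single-chain version.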