Hi,
I have a question about the Knights Corner architecture. How many Vector Processing Units (VPUs) are there in one physical core? Since the hardware supports 4 threads per core, does that mean there are four VPUs per core? Or is there only one VPU per core, shared by the four threads? I am not familiar with this. Can someone help?
Thank you!
Qiang
Each core has a single VPU, shared by its four hardware threads. The VPU can issue an instruction from any thread on each clock cycle, but can take an instruction from an individual thread only at an interval of 2 or more cycles, so a single thread cannot keep it busy on its own. Two threads can keep the VPU running at 90% of maximum throughput, so if you are creating a simplified model, don't let it become over-optimistic. Current compilers do an excellent job of generating code that is efficient over the full range of threads per core, so there's little incentive to spend time trying to second-guess them.
Hand-coded MKL functions that take full advantage of 4 threads per core use at least one thread for data shuffling rather than for driving the VPU.
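If it helps to picture the issue rule, here is a minimal toy model in Python. It is only a sketch under idealized assumptions that are mine, not from any hardware documentation: every thread always has a vector instruction ready, nothing ever stalls, and the scheduler is plain round-robin. The name `vpu_utilization` and its parameters are hypothetical.

```python
def vpu_utilization(num_threads, cycles=1000, min_gap=2):
    """Fraction of cycles in which the one shared VPU issues an instruction.

    Toy model only: threads are always ready, there are no stalls, and
    scheduling is round-robin. min_gap is the minimum number of cycles
    between two issues from the same thread.
    """
    last_issue = [-min_gap] * num_threads  # cycle of each thread's last issue
    issued = 0
    next_thread = 0
    for cycle in range(cycles):
        # Find the first thread (in round-robin order) allowed to issue now.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if cycle - last_issue[t] >= min_gap:
                last_issue[t] = cycle
                issued += 1
                next_thread = (t + 1) % num_threads
                break
    return issued / cycles


for n in (1, 2, 3, 4):
    print(n, vpu_utilization(n))
# 1 thread  -> 0.5 (the VPU idles every other cycle)
# 2+ threads -> 1.0 in this idealized model
```

The 1.0 for two threads is exactly the kind of over-optimistic number warned about above: real code has dependencies and memory stalls, which is why the practical figure quoted for two threads is closer to 90%.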
