I'm trying to understand the number of ports that are available for the vector instructions being executed on my processor, an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (Cascade Lake).
WikiChip points me to the Skylake (server) page for the microarchitecture, and as I understand it I should have two 512-bit FMA units: one "fused" unit formed by combining Port 0 and Port 1, and one dedicated unit on Port 5 (https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition).
For full context, I'm comparing single-precision float vs. int16 matrix-vector multiplication, and the float version edges out my int16 version even though the smaller data type should give me more data parallelism. I'm comparing two assembly sequences that execute repeatedly. The first uses these floating-point vector instructions, including FMA:
And the second calls these integer vector instructions that mimic the fused multiply-accumulate but with int16_t:
My two questions are:
- Is there any difference between the "fused" FMA unit on Port 0 + Port 1 and the "dedicated" FMA unit on Port 5? If so, what is it?
- The WikiChip section says that operations unrelated to FMA can still execute in parallel on Port 0/1. Would this mean that vfmadd231ps and vbroadcastss could execute at the same time on the same port, for example? That could explain why the float operations are faster.
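For reference, the lane arithmetic behind my expectation of more data parallelism can be sketched roughly as below. This is only a back-of-the-envelope estimate, not a measurement: the helper names `lanes` and `peak_macs_per_cycle` are made up for illustration, the count of two 512-bit units comes from my reading of the WikiChip page, and the int16 figure additionally assumes that the integer multiply instructions can issue on two 512-bit units as well, which I have not verified.

```python
# Back-of-the-envelope lane arithmetic. All figures are assumptions taken
# from the question text, not measured values.

VECTOR_BITS = 512  # width of one AVX-512 (zmm) register


def lanes(element_bits: int, vector_bits: int = VECTOR_BITS) -> int:
    """Number of elements of the given width that fit in one vector register."""
    if vector_bits % element_bits != 0:
        raise ValueError("element width must divide the vector width")
    return vector_bits // element_bits


def peak_macs_per_cycle(element_bits: int, units: int) -> int:
    """Upper bound on multiply-accumulates per cycle, assuming `units`
    full-width vector units that each start one instruction per cycle."""
    return units * lanes(element_bits)


if __name__ == "__main__":
    fma_units = 2  # assumption: fused Port 0+1 unit plus the Port 5 unit
    print("float32 lanes per register:", lanes(32))
    print("int16 lanes per register:  ", lanes(16))
    print("float32 peak MACs/cycle:   ", peak_macs_per_cycle(32, fma_units))
    print("int16 peak MACs/cycle:     ", peak_macs_per_cycle(16, fma_units))
```

Under these assumptions the int16 kernel has twice as many lanes per register as the float kernel, which is why I expected it to be at least as fast.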
Thanks so much in advance.
Thank you for joining the Intel community.
Please allow me some time to research this, and I will get back to you as soon as I have an update.
In the meantime, you could take a look at the Xeon Processors resource site:
Intel Customer Support