Difference between "fused" FMA port 0 + port 1 vs "dedicated" FMA port 5?

I'm trying to understand the number of ports that are available for the vector instructions being executed on my processor, an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (Cascade Lake).

WikiChip points me to Skylake (server) for the microarchitecture. As I understand it, I should have two 512-bit FMA units: one "fused" unit formed from Port 0 and Port 1, and one dedicated unit on Port 5. (https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_additi...

For full context, I'm comparing single-precision float and int16 matrix-vector multiplication. The float version edges out the int16 version, even though the smaller data type should give me more data parallelism. I'm comparing two assembly sequences that execute repeatedly. The first uses these floating-point vector instructions, including FMA:

vmovups zmm28, zmmword ptr [rsi+0x180]
vmovups zmm27, zmmword ptr [rsi+0x1C0]
vbroadcastss zmm18, dword ptr [rdx+0x10]
vfmadd231ps zmm26, zmm30, zmm18
vfmadd231ps zmm25, zmm29, zmm18
vbroadcastss zmm18, dword ptr [rdx+0x14]
vfmadd231ps zmm22, zmm30, zmm18
vfmadd231ps zmm21, zmm29, zmm18

The second uses these integer vector instructions, which mimic the fused multiply-add with int16_t:

vpbroadcastw zmm1,WORD PTR [rsi+0xe]
vmovdqu16 zmm13,ZMMWORD PTR [rdi+0x100]
vpmullw zmm1,zmm1,zmm14
vpaddw zmm0,zmm0,zmm1
vpbroadcastw zmm1,WORD PTR [rsi+0x12]
vmovdqu16 zmm12,ZMMWORD PTR [rdi+0x140]
vpmullw zmm1,zmm1,zmm13
vpaddw zmm0,zmm0,zmm1
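
To make the integer sequence concrete, here is a minimal scalar sketch in C of what one vpmullw + vpaddw pair does per 16-bit lane: vpmullw keeps only the low 16 bits of each product, and vpaddw wraps on overflow. The function names and the column-major layout (32 rows, one 64-byte column per broadcast scalar) are assumptions made for illustration, inferred from the 0x40 stride between the loads; they are not taken from the original program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES16 = 32 }; /* 512 bits / 16 bits per lane */

/* Reinterpret a 16-bit pattern as a signed value without relying on
   implementation-defined narrowing conversions. */
int16_t wrap16(uint32_t v)
{
    v &= 0xFFFFu;
    return (int16_t)(v < 0x8000u ? (int32_t)v : (int32_t)v - 0x10000);
}

/* One lane of vpmullw followed by vpaddw: acc + low16(a * b), wrapping. */
int16_t mac16_lane(int16_t acc, int16_t a, int16_t b)
{
    uint32_t prod = (uint32_t)(uint16_t)a * (uint32_t)(uint16_t)b;
    return wrap16((uint32_t)(uint16_t)acc + (prod & 0xFFFFu));
}

/* Hypothetical helper: y += A * x for a 32-row, column-major matrix.
   Each outer iteration corresponds to one broadcast of x[j], one
   column load, one multiply, and one add in the listing. */
void matvec_i16(int16_t y[LANES16], const int16_t *a_cols,
                const int16_t *x, size_t ncols)
{
    assert(y != NULL && a_cols != NULL && x != NULL);
    for (size_t j = 0; j < ncols; ++j)
        for (size_t i = 0; i < LANES16; ++i)
            y[i] = mac16_lane(y[i], x[j], a_cols[j * LANES16 + i]);
}
```

Unlike vfmadd231ps, which rounds once after a full-precision multiply-add, this integer pair silently discards the high half of every product, so the two kernels are only comparable when the int16 results are known to stay in range.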

 

My two questions are:

  1. What is the difference between the "fused" FMA unit on Port 0 + Port 1 and the "dedicated" FMA unit on Port 5? Is there any?
  2. The WikiChip section says that operations unrelated to FMA can still execute in parallel on Port 0/1. Does this mean that vfmadd231ps and vbroadcastss could execute at the same time on the same port, for example? That could explain why the float operations are faster.
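
As a rough sanity check on the "more data parallelism" expectation, the sketch below counts multiply-accumulates per vector arithmetic instruction, using only the two listings above. It counts instructions, not ports, latencies, or loads, so it does not answer either question by itself. The function names are hypothetical.

```c
#include <assert.h>

/* Number of lanes in one 512-bit zmm register for a given element width. */
unsigned lanes_per_zmm(unsigned elem_bits)
{
    assert(elem_bits != 0 && 512 % elem_bits == 0);
    return 512 / elem_bits;
}

/* Multiply-accumulates completed per vector arithmetic instruction,
   given how many arithmetic instructions one vector-wide MAC needs. */
double macs_per_arith_instr(unsigned elem_bits, unsigned instrs_per_vector_mac)
{
    assert(instrs_per_vector_mac != 0);
    return (double)lanes_per_zmm(elem_bits) / instrs_per_vector_mac;
}

/* Float listing: one vfmadd231ps per 16-lane MAC.
     macs_per_arith_instr(32, 1)
   Int16 listing: vpmullw + vpaddw per 32-lane MAC.
     macs_per_arith_instr(16, 2) */
```

Under this count, both listings complete the same number of multiply-accumulates per arithmetic instruction, because the doubled lane count of int16 is offset by needing two instructions instead of one. The listings also differ in their dependency structure: both vpaddw instructions accumulate into zmm0, while the four vfmadd231ps instructions write four different accumulators (zmm26, zmm25, zmm22, zmm21), which may also contribute to the gap.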

Thanks so much in advance.

 

1 Reply

Hello brandon,


Thank you for joining the Intel community.


Please allow me some time to research this, and I will get back to you as soon as I have an update.


In the meantime, you could take a look at the Xeon Processors resource site:

https://www.intel.com/content/www/us/en/products/processors/xeon/scalable/gold-processors.html


Regards


Jose A.

Intel Customer Support

