Hi,
To experiment with roofline charts for 32-bit integer code, I am struggling to find the theoretical peak integer performance of Ivy Bridge, Haswell and Knights Corner processors and coprocessors.
For floating point, that's "easy": vector length / type length * 2 (for FMA) * #cores * freq
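A minimal sketch of that floating-point formula, just to make the arithmetic concrete. The function name and the example chip (256-bit vectors, 8 cores, 2.5 GHz) are made-up placeholders, not the specification of any real processor.

```python
def peak_gflops(vector_bits, type_bits, cores, freq_ghz, fma=True):
    """Peak = (vector length / type length) * 2 (for FMA) * #cores * freq."""
    lanes = vector_bits // type_bits   # elements processed per vector instruction
    ops_per_lane = 2 if fma else 1     # an FMA counts as one multiply plus one add
    return lanes * ops_per_lane * cores * freq_ghz

# Hypothetical chip: 256-bit vectors, 32-bit floats, 8 cores at 2.5 GHz.
example = peak_gflops(256, 32, cores=8, freq_ghz=2.5)
print(example)  # 320.0
```

The result is in GFLOP/s because the frequency is given in GHz.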
Now, for integers, that's another story:
So altogether, what should my theoretical 32-bit integer peak performance be for these 3 architectures (and possibly others) for a 32-bit integer matrix-matrix multiplication kind of workload? And why?
Thank you very much for any help on that.
Gilles
It is difficult to discuss peak integer performance without being more specific about what types of multiplication are required and whether the integers are signed or unsigned.
For Ivy Bridge the peak 32-bit integer performance looks like 8 ops/cycle: a 4-wide add (128-bit SSE or 128-bit AVX with packed doublewords) plus a 4-wide multiply (SSE4.1 PMULLD or 128-bit AVX VPMULLD on signed packed doublewords, keeping only the low half of each result). If you need to keep all 64 bits of the multiply result, then the multiplication rate is halved: the PMULDQ/PMULUDQ instructions multiply 2 of the 4 elements in a 128-bit register and store the two 64-bit products in the output register. It looks like all of these are fully pipelined and can issue once per cycle.
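To illustrate the difference between the two multiply flavors, here is a small pure-Python emulation of their lane semantics. It is only a sketch of what the results look like, not how the hardware computes them, and the helper names are invented for this example.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mul_low32(a, b):
    """PMULLD-style: multiply every lane, keep only the low 32 bits of each product."""
    return [to_signed(x * y, 32) for x, y in zip(a, b)]

def mul_even_full64(a, b):
    """PMULDQ-style: multiply only the even-numbered lanes, keep the full 64-bit products."""
    return [to_signed(x * y, 64) for x, y in zip(a[0::2], b[0::2])]

# Four signed 32-bit lanes, as in one 128-bit register.
a = [100000, 3, -70000, 5]
b = [100000, 4, 70000, 6]

low = mul_low32(a, b)         # 4 results, large products wrap around
full = mul_even_full64(a, b)  # 2 results, nothing is lost

print(low)   # [1410065408, 12, -605032704, 30]
print(full)  # [10000000000, -4900000000]
```

The full-width variant produces half as many results per instruction, which is where the halved multiplication rate comes from.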
For Haswell the peak 32-bit integer performance looks like 12 ops/cycle: an 8-wide AVX2 packed integer add plus either an 8-wide 32-bit packed integer multiply (VPMULLD) that saves the low-order 32 bits but executes only once every 2 cycles, or a VPMULDQ that multiplies the even-numbered 32-bit sub-fields of two 256-bit registers and saves the four 64-bit results in an output register. Either way, that is 8 adds plus 4 multiplies per cycle.
For Xeon Phi the peak 32-bit integer performance is 16 ops/cycle if you can use the VPMADD instructions. These discard the upper 32 bits of the result. Also note that you must be running at least 2 threads per physical core if you want to issue instructions every cycle. Xeon Phi also supports an ordinary packed 32-bit add (VPADDD) and separate instructions for packed 32-bit multiplication that store the high-order and low-order 32 bits of the result. There is not a lot of documentation on latency and throughput for Xeon Phi vector instructions, but these are all very likely to be fully pipelined.
Of course I might have gotten confused in there somewhere...
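For a roofline chart, the per-cycle figures above can be scaled to a chip-level peak as sketched below. The per-core numbers are the ones quoted in this post; the core count and clock frequency in the example are made-up placeholders that you would replace with the values for your actual part.

```python
def ops_per_cycle(lanes, issue_interval_cycles=1):
    """Average results per cycle for one instruction type."""
    return lanes / issue_interval_cycles

# Per-core 32-bit integer ops/cycle, following the reasoning in the post above.
per_core = {
    "Ivy Bridge": ops_per_cycle(4) + ops_per_cycle(4),     # 4-wide add + 4-wide multiply
    "Haswell":    ops_per_cycle(8) + ops_per_cycle(8, 2),  # 8-wide add + 8-wide multiply every 2 cycles
    "Xeon Phi":   16.0,                                    # figure quoted above for VPMADD
}

def peak_giops(per_cycle, cores, freq_ghz):
    """Chip-level peak in giga integer ops per second."""
    return per_cycle * cores * freq_ghz

# Hypothetical configuration: 10 cores at 2.0 GHz (NOT a real part's spec).
for name, rate in per_core.items():
    print(name, rate, peak_giops(rate, cores=10, freq_ghz=2.0))
```

These are upper bounds that assume every add and multiply slot is filled on every cycle, which a real matrix-multiplication kernel will only approach.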
Gilles,
Integer FMA is only available on Intel Xeon Phi. For exploring the Intel instruction set, I like the interactive intrinsics guide.
The latency and throughput of instructions are described in Appendix C of the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
Kind regards
Thomas
Wow, thank you very much, that's quite the answer. In addition to being very detailed and answering all of my questions, it also makes me feel less ashamed that I didn't find it all by myself despite my long efforts.
Thanks again, I really appreciate it.
Gilles