gilles_c_1 · ‎06-11-2015

Hi,

In order to play with roof-line charts for 32b integer-based code, I am struggling to find the theoretical peak integer performance of the Ivy Bridge, Haswell and Knights Corner processors and co-processors.

For floating point, that's "easy": vector length / type length * 2 (for FMA) * #cores * freq
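The floating-point formula above can be written out as a small calculator. This is only a sketch: the function name `peak_gflops` and the example core count and clock frequency are made up for illustration and do not describe any specific part.

```python
def peak_gflops(vector_bits, type_bits, ops_per_lane, cores, freq_ghz):
    """Peak = (vector length / type length) * ops per lane * #cores * freq."""
    lanes = vector_bits // type_bits          # elements per vector register
    return lanes * ops_per_lane * cores * freq_ghz

# Hypothetical machine (assumed numbers): 256-bit vectors, 32-bit floats,
# FMA counted as 2 ops per lane, 8 cores at 2.5 GHz.
example = peak_gflops(vector_bits=256, type_bits=32, ops_per_lane=2,
                      cores=8, freq_ghz=2.5)
print(example)
```

With frequency in GHz the result is in Gop/s; setting `ops_per_lane=1` gives the rate for a machine without FMA.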

Now, for integers, that's another story:

Ivy Bridge's 256b AVX doesn't support integer operations, but 128b SSE does support some... But which ones exactly? I saw an integer FMA for 16b integers, a 32b add with a 0.5 cycle throughput, and a 32b multiply with a 1 cycle throughput. Does that mean that I can on average expect a 1.5 cycle multiply / add throughput (for a typical matrix multiplication)?
For Haswell, 256b AVX2 does support some integer operations. But again, I didn't find any FMA for 32b data, only the 0.5 cycle add and the 1 cycle multiply. So basically, same question here...
For Xeon Phi Knights Corner, apparently we do have a 512b vector FMA for 32b integers. However, the throughput isn't given (I assume it's 1 cycle). So I can go for a "512 / 32 * 2 (for FMA) * freq * #cores" for the peak, right?

So altogether, what should my theoretical 32b integer peak performance be for these 3 architectures (and possibly others) for a 32b integer matrix-matrix multiplication kind of workload? And why?

Thank you very much for any help on that

Gilles

McCalpinJohn · ‎06-12-2015

It is difficult to discuss peak integer performance without being more specific about what types of multiplication are required and whether the integers are signed or unsigned.

For Ivy Bridge the peak 32-bit integer performance looks like 8 ops/cycle: a 4-wide add (128-bit SSE or 128-bit AVX with packed doublewords) plus a 4-wide multiply (SSE4.1 PMULLD or AVX VPMULLD 128-bit with signed packed doublewords and only keeping the low half of the results). If you need to keep all 64 bits of the multiply result, then the multiplication rate is halved. You can use the PMULDQ/PMULUDQ instructions to multiply 2 of the 4 elements in a 128-bit register and store the 2 64-bit products in the output register. It looks like all of these are fully pipelined with single-cycle throughput.
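To make the low-half versus full-product distinction concrete, here is a rough Python emulation of the lane semantics. The helper names `mullo32` and `mul_even_u64` are invented for this sketch, and the code models only what the instructions compute, not their timing. The 6 ops/cycle figure for the full-product case is derived from the "rate is halved" statement above, not quoted from it.

```python
MASK32 = 0xFFFFFFFF

def mullo32(a, b):
    """PMULLD-style: multiply 4 lanes, keep only the low 32 bits of each product."""
    return [(x * y) & MASK32 for x, y in zip(a, b)]

def mul_even_u64(a, b):
    """PMULUDQ-style: full 64-bit unsigned products of lanes 0 and 2 only."""
    return [(a[i] & MASK32) * (b[i] & MASK32) for i in (0, 2)]

a = [3, 0x10000, 0xFFFFFFFF, 7]
b = [5, 0x10000, 2, 9]

low = mullo32(a, b)         # 4 results per instruction, high bits discarded
full = mul_even_u64(a, b)   # 2 results per instruction, all 64 bits kept

# Ops per cycle if one 4-wide add and one multiply instruction issue each cycle.
ops_low_half = 4 + len(low)    # 4 adds + 4 multiplies
ops_full = 4 + len(full)       # 4 adds + 2 multiplies
print(low, full, ops_low_half, ops_full)
```

Lane 1 shows the truncation: 0x10000 * 0x10000 overflows 32 bits, so the low-half result is 0.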

For Haswell the peak 32-bit integer performance looks like 12 ops/cycle: an 8-wide AVX2 packed integer add and either an 8-wide 32-bit packed integer multiply (VPMULLD) saving the low-order 32 bits (but executing only once every 2 cycles) or a VPMULDQ that multiplies the even-numbered 32-bit sub-fields of two 256-bit registers and saves the 4 64-bit results in an output register.
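The 12 ops/cycle figure follows from adding lanes divided by issue interval for each instruction. A minimal sketch of that accounting, assuming (as the figure above implies) that the add and the multiply issue on independent ports and that VPMULDQ issues once per cycle; `ops_per_cycle` is a made-up helper name:

```python
from fractions import Fraction

def ops_per_cycle(instructions):
    """Sum lanes / issue interval over instructions assumed to issue in parallel."""
    return sum(Fraction(lanes, cycles) for lanes, cycles in instructions)

# (lanes produced, cycles between issues) for each instruction
add = (8, 1)        # 8-wide packed 32-bit add, one per cycle
vpmulld = (8, 2)    # 8 low-half products, one instruction every 2 cycles
vpmuldq = (4, 1)    # 4 full 64-bit products, assumed one per cycle

haswell_low = ops_per_cycle([add, vpmulld])    # 8 + 8/2
haswell_full = ops_per_cycle([add, vpmuldq])   # 8 + 4
print(haswell_low, haswell_full)
```

Both multiply variants contribute 4 products per cycle, which is why either choice leads to the same total.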

For Xeon Phi the peak 32-bit integer performance is also 16 ops/cycle if you can use the VPMADD instructions. These discard the upper 32 bits of the result. Also note that you must be running at least 2 threads per physical core if you want to issue instructions every cycle. Xeon Phi also supports an ordinary packed 32-bit ADD (VPADDD) and separate instructions for packed 32-bit multiplication that store the high-order and low-order 32-bits of the result. There is not a lot of documentation on latency and throughput for Xeon Phi vector instructions, but these are all very likely to be fully pipelined.
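The thread-count caveat can be sketched as follows. This model assumes that a single thread on a Knights Corner core can issue a vector instruction only every other cycle, so two or more threads are needed to fill every cycle. The function name is hypothetical. Each multiply-add lane is counted once, which matches the 16 ops/cycle figure (512 / 32 lanes); counting the multiply and the add separately would double it.

```python
def knc_lane_ops_per_cycle(threads_per_core, vector_bits=512, type_bits=32):
    """Vector lanes completed per cycle per core under the assumed issue model."""
    lanes = vector_bits // type_bits               # 16 lanes of 32-bit integers
    # Assumption: one thread issues at most every other cycle;
    # two or more threads can keep the vector unit busy every cycle.
    issue_rate = min(1.0, threads_per_core / 2)
    return lanes * issue_rate

for t in (1, 2, 4):
    print(t, knc_lane_ops_per_cycle(t))
```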

Of course I might have gotten confused in there somewhere...

Thomas_W_Intel · ‎06-12-2015

Gilles,

Integer FMA is only available on Intel Xeon Phi. For exploring the Intel instruction set, I like the interactive Intrinsics Guide.

The latency and throughput of instructions are described in Appendix C of the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

Kind regards

Thomas

gilles_c_1 · ‎06-12-2015

Wow, thank you very much, that's quite the answer: in addition to being very detailed and answering all of my questions, it also makes me feel less ashamed that I didn't find it all by myself despite my long efforts.

Thanks again, I really appreciate it.

Gilles