Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Xeon-D DP flops calculation

Nooraini
Beginner
1,017 Views

Hi, 

 

I'm new to computation and am looking into CPU performance benchmarking tools such as LINPACK. I'm planning to use the Intel optimized LINPACK benchmark to measure floating-point performance. How can we determine the CPU's peak floating-point performance? For example, what is the peak DP FLOPS of the Xeon D (D-2776NT)? From the D-2776NT product info here, I can see a base clock of 2.10 GHz, 16 cores, and AVX-512 (1 FMA).

https://ark.intel.com/content/www/us/en/ark/products/226239/intel-xeon-d-2776nt-processor-25m-cache-up-to-3-20-ghz.html

 

Thank you.

Regards,

Nooraini 

McCalpinJohn
Honored Contributor III
901 Views

The "peak" performance depends on the average frequency, which varies from die to die and may also vary with the effectiveness of the cooling system.

For the Xeon D-2776NT, using either AVX2/FMA or AVX512 code the peak performance is 16 64-bit floating-point operations per cycle per core.

  • With AVX2 the limit is 2 VFMA instructions per cycle.  These operate on 256-bit registers --> 4x64 bits, and perform a multiplication and an addition on each element.  2 units * unit width of 4 * 2 FP operations = 16 FP64 operations per cycle.
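  • For AVX512 the number is the same, but it comes from 1 512-bit FMA unit performing the 2 operations on each element of an 8-wide register.  1 unit * unit width of 8 * 2 FP operations = 16 FP64 operations per cycle.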
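
The arithmetic in the two bullets above can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are hypothetical, and the unit counts (2 x 256-bit FMA, 1 x 512-bit FMA) are the ones stated in this post.

```python
def fp64_ops_per_cycle(fma_units: int, vector_bits: int) -> int:
    """FP64 operations per cycle per core for a given FMA configuration."""
    lanes = vector_bits // 64      # 64-bit elements per vector register
    ops_per_fma = 2                # one multiply + one add per element
    return fma_units * lanes * ops_per_fma

# Values taken from the post: 2 x 256-bit FMA units (AVX2),
# or 1 x 512-bit FMA unit (AVX512) on the Xeon D-2776NT.
avx2 = fp64_ops_per_cycle(fma_units=2, vector_bits=256)
avx512 = fp64_ops_per_cycle(fma_units=1, vector_bits=512)
print(avx2, avx512)  # 16 16
```

Passing fma_units=2 with vector_bits=512 gives 32, which is the per-core figure for processors with two AVX512 FMA units.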

The frequency is the awkward part to specify.  Each processor has an internal table of minimum and maximum frequencies for each of three "Power License Levels" (that correspond to different instruction types and vector widths: "Non-AVX", "AVX2", "AVX512") and for each count of "active cores".  These used to be documented in the "specification update" documents, but Intel quit doing that after Skylake/Cascade Lake, and I have been unable to find such tables in any of the usual public places.     

In your case the numbers you want are the "base AVX2 frequency for all cores active" and the "max AVX2 Turbo frequency for all cores active".  The actual average frequency when running HPL should be between those two numbers, and the corresponding Peak FP64 rate will be between:

  • 16 cores * 16 FP ops/core/cycle * (base AVX2 frequency for all cores active)

and

  • 16 cores * 16 FP ops/core/cycle * (max AVX2 Turbo frequency for all cores active)

The numbers that exist on the product page are the "base frequency for non-AVX code for 1 core active" and the "max non-AVX Turbo frequency for 1 core active", which are only minimally useful here.

For a processor with 2 AVX-512 units, the average frequency running HPL is usually somewhat below the nominal frequency ("base frequency for non-AVX code for 1 core active").  With only 1 AVX512 unit the performance is probably best with AVX2 code and the average frequency is probably pretty close to the nominal frequency.  That would give an approximate peak FP64 rate of

  • 16 cores * 16 FP ops/core/cycle * 2.1 GHz = 537.6 GFLOPS
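
For anyone who wants to redo this estimate for another part, here is a minimal sketch of the same formula. The function name is hypothetical, and the 2.1 GHz value is the assumption made above (average frequency under HPL close to nominal), not a measured number.

```python
def peak_gflops(cores: int, ops_per_core_per_cycle: int, ghz: float) -> float:
    """Peak FP64 rate in GFLOPS = cores * FP ops/core/cycle * frequency (GHz)."""
    return cores * ops_per_core_per_cycle * ghz

# Xeon D-2776NT: 16 cores, 16 FP64 ops/core/cycle, and an ASSUMED average
# frequency of 2.1 GHz (the nominal frequency from the product page).
nominal = peak_gflops(cores=16, ops_per_core_per_cycle=16, ghz=2.1)
print(f"{nominal:.1f} GFLOPS")  # 537.6 GFLOPS
```

The same function gives the lower and upper bounds described earlier once the all-core AVX2 base and Turbo frequencies are known for the processor.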

The actual performance varies by chip -- I saw a range of almost 13% in single-node (2-socket) HPL results across 1736 Xeon Platinum 8160 nodes (Skylake Xeon), and an almost identical range across a set of 4200 (1-socket) Xeon Phi x200 nodes.

Nooraini
Beginner
875 Views

Hi @McCalpinJohn ,

 

This is very helpful; I appreciate the thorough explanation and example. At least this gives an estimate of what I should observe when running the LINPACK test.  I was looking at the link below, which shows the GFLOPS and APP values for each CPU. However, I'm not sure how these values were derived, or whether these are the GFLOPS values we should use as the benchmark for CPU floating-point performance.

https://www.intel.com/content/www/us/en/support/articles/000057415/processors.html

 

Regards,

Nooraini 

McCalpinJohn
Honored Contributor III
846 Views

It looks like the GFLOPS values in the export control document were based on the "base AVX2 frequency" for the processor.  I was mis-remembering yesterday when I suggested that there was a distinction between the "base frequency" on 1 core and the "base frequency" on all cores -- those two values are the same, it is only the "maximum Turbo frequency" that changes with the number of cores (and the width of the SIMD instructions used).   

For the Xeon D-2776NT the export control document reports 460.8 GFLOPS, which corresponds to a "base AVX2 frequency" (or "base AVX512 frequency") of 1.80 GHz.  I can't find the Turbo tables for the Xeon D-27xx processors, but 1.8 GHz seems like a plausible AVX2 base frequency.
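
The 1.80 GHz figure comes from inverting the peak formula. A minimal sketch, with a hypothetical function name and assuming the same 16 cores and 16 FP64 ops/core/cycle used earlier in the thread:

```python
def implied_ghz(reported_gflops: float, cores: int, ops_per_core_per_cycle: int) -> float:
    """Frequency (GHz) implied by a reported peak GFLOPS value."""
    return reported_gflops / (cores * ops_per_core_per_cycle)

# Export control document value for the Xeon D-2776NT: 460.8 GFLOPS.
ghz = implied_ghz(460.8, cores=16, ops_per_core_per_cycle=16)
print(f"{ghz:.2f} GHz")  # 1.80 GHz
```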

Double-checking using other processors for which I have the full turbo frequency tables, I see:

  • Xeon Platinum 8160, 1075.2 GFLOPS --> matches my expectation:
    • "base AVX512 frequency" (1.40 GHz) *  24 cores * 32 FP Ops/cycle/core = 1075.2 GFLOPS
  • Xeon Platinum 8380, 1152 GFLOPS --> exactly 1/2 of the value I expected for AVX512
    • AVX512: 40 cores * 32 FP Ops/cycle/core * 1.8 GHz = 2304 GFLOPS
    • AVX2:      40 cores * 16 FP Ops/cycle/core * 2.1 GHz = 1344 GFLOPS
  • Xeon Gold 5120, 537.6 GFLOPS --> also does not match my expectations
    • AVX512:  14 cores * 16 FP Ops/cycle/core * 1.2 GHz = 268.8 GFLOPS
    • AVX2:       14 cores * 16 FP Ops/cycle/core * 1.8 GHz = 403.2 GFLOPS
    • 537.6 GFLOPS corresponds to 2.40 GHz for 14 cores, which does not match *any* of the maximum Turbo frequencies for any SIMD width or core count for this processor, but is exactly twice my computed AVX512 value.  The calculation in the table probably assumed 2 AVX512 FMA units, but this processor definitely only has one.  (Both https://ark.intel.com/content/www/us/en/ark/products/120474/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html and my testing confirm the single AVX512 FMA unit.)
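
The cross-check above can be reproduced with a short script. This is only a sketch: the table layout and names are hypothetical, and every number in it is copied from the bullets above (reported GFLOPS, core counts, FP ops/cycle/core, and base frequencies).

```python
# (processor, reported GFLOPS, cores, FP64 ops/core/cycle, base GHz, SIMD width)
CASES = [
    ("Xeon Platinum 8160", 1075.2, 24, 32, 1.4, "AVX512"),
    ("Xeon Platinum 8380", 1152.0, 40, 32, 1.8, "AVX512"),
    ("Xeon Platinum 8380", 1152.0, 40, 16, 2.1, "AVX2"),
    ("Xeon Gold 5120",      537.6, 14, 16, 1.2, "AVX512"),
    ("Xeon Gold 5120",      537.6, 14, 16, 1.8, "AVX2"),
]

def ratio(reported: float, cores: int, ops: int, ghz: float) -> float:
    """Reported GFLOPS divided by cores * ops/core/cycle * GHz."""
    return reported / (cores * ops * ghz)

ratios = {
    (name, simd): ratio(reported, cores, ops, ghz)
    for name, reported, cores, ops, ghz, simd in CASES
}

for (name, simd), r in ratios.items():
    # 1.00 means the table matches the expected peak for that SIMD width.
    print(f"{name:20s} {simd:7s} reported/expected = {r:.2f}")
```

A ratio of 1.00 appears only for the Xeon Platinum 8160; the 8380 comes out at 0.50 of the AVX512 value and the Gold 5120 at 2.00 times the AVX512 value, which is the mismatch described above.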

I don't know whether the values in Intel's export control documentation are incorrect, or if the export control regulations have some funny special cases in the formulas, but it does not appear that this table is a reliable way to get the "Peak GFLOPs" that an ordinary user would expect....

Nooraini
Beginner
734 Views

@McCalpinJohn, thank you again so much for the explanation.
