Beginner

Calculate the Max Flops on Skylake

 

I calculate the max FLOPS on Skylake as cpu-frequency * 16.

Is it cpu-frequency * 32 on the Gold version of Skylake?

 

25 Replies
Black Belt

If I recall correctly, all of the Gold 6000 processors have two AVX512 units, so they are capable of 32 DP FLOPS/cycle.  The Gold 5000 processors have one AVX512 unit (except for the Gold 5122, which has two), so they are capable of 16 DP FLOPS/cycle.

The frequency that you will get when running AVX512 instructions will be lower than the nominal frequency in most cases.  The minimum and maximum values for each processor model are included in the Xeon Scalable Processor Specification Update (document 336065).

Beginner

Gold 5000 processors: 16 DP = 512 / 64 * 2

How should I understand the "2"?

Gold 6000 processors: 32 DP = 512 / 64 * 2 * 2

How should I understand the two "2"s?

Does that mean 2 compute units or 4 compute units?

Beginner

There are fp arith 128-bit and 256-bit on Skylake.

There are fp arith 128-bit, 256-bit, and 512-bit on Skylakex.

Does that mean that only the E3 v5 is Skylake, and the Scalable processors are Skylakex?

 

New Contributor I

GHui, the *2 is due to the peak FLOPS being achieved with FMA instructions that do 2 flops (multiply and add) together in 1 instruction.

Black Belt

Starting with the Xeon E5 v3 (Haswell) core, each floating-point vector unit supports the Fused Multiply-Add instructions, which perform two operations on each element.  So for AVX512, each unit performs 2 operations on each of 8 elements in one cycle.

The use of the "Skylake" label is quite confusing. 

  • When used by itself, "Skylake" refers to the "Skylake client" core.  This uses the new core architecture, but does not support the AVX512 instruction set.  The AVX2/FMA instruction set provides the highest FP operation rate, with 2 256-bit functional units providing a total of: 2 functional units * 4 64-bit elements/functional unit * 2 operations/element = 16 FP ops/cycle.
  • "Skylake Xeon" refers to cores that include support for the AVX512 instruction set.  These may have either one or two 512-bit functional units for AVX512 instructions, depending on the model.
  • Server processors (i.e., "Xeon" processors) can be built using either the "Skylake client" core or the "Skylake Xeon" core.
    • Xeon D-21xx processors appear to use the "Skylake Xeon" core with one AVX512 unit.
    • Xeon E3-12xx v5 processors use the "Skylake client" core and do not support AVX512.
    • Xeon E3-15xx v5 processors use the "Skylake client" core, and all include integrated graphics engines.
    • Xeon W-21xx processors use the "Skylake Xeon" core with either one or two AVX512 units.
    • Xeon Scalable processors (Bronze, Silver, Gold, Platinum)  use the "Skylake Xeon" core with either one or two AVX512 units.
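
The per-cycle arithmetic in the list above can be written out as a small sketch. The function name and parameters here are only illustrative; the calculation is units * elements per vector * 2 operations per FMA, as described in the post.

```python
def peak_flops_per_cycle(vector_bits, element_bits, fma_units):
    """Peak FP operations per cycle per core, assuming every unit issues one FMA per cycle."""
    elements = vector_bits // element_bits  # e.g. 512 / 64 = 8 doubles per vector
    ops_per_element = 2                     # an FMA counts as a multiply plus an add
    return fma_units * elements * ops_per_element


# "Skylake client" core: two 256-bit FMA units, double precision
print(peak_flops_per_cycle(256, 64, 2))  # 16
# "Skylake Xeon" core with one AVX512 unit
print(peak_flops_per_cycle(512, 64, 1))  # 16
# "Skylake Xeon" core with two AVX512 units
print(peak_flops_per_cycle(512, 64, 2))  # 32
```

This is a theoretical peak only; it says nothing about the frequency the core will actually sustain while running these instructions.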

No doubt things will become even more confusing in the future....

The information above comes from https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors and from checking out the links to specific processor information pages on https://ark.intel.com/

Beginner

I ran "cat /proc/cpuinfo" to get the model.

I get model 85 on the Xeon Gold 6133, and model name "Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz".

Do the Xeon D-21xx, Xeon E3-12xx, Xeon E3-15xx, Xeon W-21xx, Gold 6000, Gold 5000, and so on, all have the same model number?

Can I use the model to tell whether a CPU has one or two AVX512 units?

Black Belt

The only way I know to obtain this information is to look at the specific processor product page under https://ark.intel.com/#@Processors

Examples:

 

Beginner

I got the CPU model name "Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz", but I'm not sure whether the 4th word is the "Processor Number".

Black Belt

The Xeon Gold 6133 processor does not appear in Intel's list of Xeon Scalable Processors at https://ark.intel.com/products/series/125191/Intel-Xeon-Scalable-Processors, but it may be a special "OEM" version.  There are a number of these listed in the section on "Skylake SP" processors at https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors -- they either say "OEM" or have a blank in the "Release Price" column.

The only way to be sure of the number of AVX-512 units is to run a benchmark test -- it does not look like the number of AVX-512 units is available from the CPUID instruction or through any other hardware reference.

Given that every other Xeon Gold 6000 processor has 2 AVX-512 FMA units, I would guess that this one does as well, but if it is an OEM part it could have been specially requested to only have one AVX-512 FMA unit.

Beginner

Which benchmark test can do that? Where can I get it?

Beginner

If there is no AVX512 (as on the E3-1585 v5) and I set the AVX512 performance counter, what will happen?

Black Belt

On a Linux system, the command "cat /proc/cpuinfo" will include a list of "flags" that show which features the processor supports.   If AVX512 is supported, then the AVX512 subsets that are supported will be listed.  On a Xeon Platinum 8160, for example, I get:

# head -26 /proc/cpuinfo | grep 512
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req

This shows that the processor supports the AVX-512F (Foundation) instruction set, as well as the "DQ", "CD", "BW", and "VL" subsets.  This is the expected set for a Skylake Xeon processor.
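
For scripting the same check, a minimal sketch that pulls the AVX512 subsets out of a "flags" line might look like the following. The helper name and the shortened sample line are made up for illustration; on a real system the line would come from /proc/cpuinfo.

```python
def avx512_subsets(flags_line):
    """Return the avx512* flags from a /proc/cpuinfo 'flags' line, in their original order."""
    # Drop the "flags :" label if it is present, then split on whitespace.
    flags = flags_line.split(":", 1)[-1].split()
    return [flag for flag in flags if flag.startswith("avx512")]


# Shortened sample line (hypothetical excerpt of the output shown above).
sample = "flags : fpu fma avx avx2 avx512f avx512dq clwb avx512cd avx512bw avx512vl xsaveopt"
print(avx512_subsets(sample))

# On Linux the real line could be read like this (left commented out so the
# sketch also runs on other systems):
# with open("/proc/cpuinfo") as f:
#     line = next(l for l in f if l.startswith("flags"))
#     print(avx512_subsets(line))
```

An empty list means no AVX512 support was reported. Note that the flags say nothing about whether the core has one or two AVX-512 FMA units.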

Assuming that the processor supports AVX512, the performance of Intel's optimized LINPACK benchmark should make it very clear whether the processor has 1 or 2 AVX-512 FMA units.  For Linux, the description of the benchmark is at https://software.intel.com/en-us/mkl-linux-developer-guide-intel-optimized-linpack-benchmark-for-lin... There is also a Windows version of the benchmark that is easy to find.

Beginner

I ran runme_xeon64, and its output is as follows. Which part shows the number of AVX-512 FMA units?

[root@centos71611 linpack]# ./runme_xeon64 
This is a SAMPLE run script for running a shared-memory version of
Intel(R) Distribution for LINPACK* Benchmark. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
*Other names and brands may be claimed as the property of others.
Fri Jun  1 15:06:49 CST 2018
Sample data file lininput_xeon64.

Current date/time: Fri Jun  1 15:06:49 2018

CPU frequency:    2.493 GHz
Number of CPUs: 2
Number of cores: 12
Number of threads: 12

Parameters are set to:

Number of tests: 15
Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    

Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.010      69.5946  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.008      88.2280  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.009      78.1615  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.008      88.9556  8.724688e-13 2.975343e-02   pass
2000   2000   4      0.045      119.8063 4.565348e-12 3.971294e-02   pass
2000   2000   4      0.041      131.0556 4.565348e-12 3.971294e-02   pass
5000   5008   4      0.506      164.7312 2.416245e-11 3.369259e-02   pass
5000   5008   4      0.499      167.1883 2.416245e-11 3.369259e-02   pass
10000  10000  4      3.743      178.1716 8.700884e-11 3.068020e-02   pass

 

Black Belt

None of the sizes shown in this output are big enough to see asymptotic performance on this system....

The "runme_xeon64" script launches the "xlinpack_xeon64" binary.   Running "xlinpack_xeon64 -e" prints out the extended help for the benchmark.

If you set OMP_NUM_THREADS to 1 before running the test, it will limit the execution to a single core and you should get close to asymptotic performance for the problem sizes in the 10000 to 15000 range.   Since we don't know what the minimum AVX512 or maximum AVX512 frequencies are for this processor, I would wrap the command in "perf stat" to get the average frequency.

On my Xeon Platinum 8160, modifying the input file to run a problem size of 10000 four times, I get

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
10000  10000  4       79.0453  79.4813

The output of "perf stat" showed an average of 3.36 GHz.  

Dividing 79 GFLOPS by 3.36 GHz gives 23.5 FP operations per cycle, which is much higher than the peak of 16 FP operations per cycle that would be appropriate for a processor with only one AVX512 FMA unit.
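
The same check can be sketched in a few lines. The function names are illustrative, and the threshold of 16 is the single-unit double-precision peak discussed above; the inputs are the single-thread LINPACK result and the "perf stat" frequency from this run.

```python
SINGLE_UNIT_PEAK = 16  # DP FLOPs/cycle for one AVX-512 FMA unit (8 elements * 2 ops)


def measured_flops_per_cycle(gflops, avg_ghz):
    """Single-core GFLOPS divided by the average core frequency in GHz."""
    return gflops / avg_ghz


def exceeds_single_unit_peak(gflops, avg_ghz):
    """True if the measurement is only possible with more than one FMA unit."""
    return measured_flops_per_cycle(gflops, avg_ghz) > SINGLE_UNIT_PEAK


# Values from the Xeon Platinum 8160 run above (OMP_NUM_THREADS=1).
rate = measured_flops_per_cycle(79.0453, 3.36)
print(f"{rate:.1f} FLOPs/cycle")                 # 23.5 FLOPs/cycle
print(exceeds_single_unit_peak(79.0453, 3.36))   # True
```

A result at or below 16 would be consistent with a single unit, but it could also come from a run that simply did not reach asymptotic performance, so only the "greater than 16" case is conclusive.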

Beginner

 

How should I understand the "Base AVX-512 Core Frequency (GHz)"?  Does it affect the max GFLOPS?

 

a8cf-444d-b178-84013c1c37a3.png

Black Belt

The "Base AVX-512 Core Frequency (GHz)" is the frequency that the processor will use when running 512-bit SIMD instructions in the absence of any Turbo boost.  The columns to the right show the maximum frequency that the processor will use when running 512-bit SIMD instructions with various numbers of active cores.

The maximum frequency is only available if the temperature does not exceed the temperature limits, the package power does not exceed the package power limits, and the electrical current does not exceed the current limits.    If any of these limits are exceeded, the frequencies will be reduced until none of the limits are exceeded.

The "Base AVX-512 Core Frequency (GHz)" is also intended to represent the minimum frequency that will ever be seen for any power-limited workload (assuming a correctly configured cooling system).    So this frequency can be used to compute a lower bound on the peak GFLOPS.  For example, the Xeon Platinum 8180 has a "Base AVX-512 Core Frequency" of 1.7 GHz, with 28 cores and two AVX-512 units per core, giving a lower bound

Lower Bound:     28 cores * 1.7 GHz * 32 FLOPS/Hz = 1523.2 GFLOPS (per socket).

The maximum 28-core AVX-512 frequency of 2.3 GHz provides an upper bound on the peak performance

Upper Bound:     28 cores * 2.3 GHz * 32 FLOPS/Hz = 2060.8 GFLOPS (per socket)
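
As a quick way to reproduce these two bounds for other models, here is a minimal sketch. The function name is illustrative; the core count and frequencies are the Xeon Platinum 8180 values quoted above, and 32 FLOPs/cycle assumes two AVX-512 FMA units per core.

```python
def peak_gflops(cores, avx512_ghz, flops_per_cycle=32):
    """Per-socket peak DP GFLOPS at a given all-core AVX-512 frequency."""
    return cores * avx512_ghz * flops_per_cycle


# Xeon Platinum 8180: 28 cores, 1.7 GHz base AVX-512, 2.3 GHz max 28-core AVX-512 turbo
lower = peak_gflops(28, 1.7)
upper = peak_gflops(28, 2.3)
print(f"lower bound: {lower:.1f} GFLOPS")  # lower bound: 1523.2 GFLOPS
print(f"upper bound: {upper:.1f} GFLOPS")  # upper bound: 2060.8 GFLOPS
```

For a model with a single AVX-512 unit, pass flops_per_cycle=16 instead.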

The actual frequency when running compute-intensive AVX512 workloads depends on the unique characteristics of the specific piece of silicon  (particularly leakage current), as well as the characteristics of the cooling system (ambient temperature, heat sink thermal conductivity, air flow rate, etc).

We have 3472 Xeon Platinum 8160 (24-core) processors in 1736 two-socket nodes.   The Base AVX-512 Core Frequency for these processors is 1.4 GHz and the maximum 24-core AVX-512 frequency is 2.0 GHz.  When running Intel's optimized LINPACK benchmark, we see that the average frequency of these processors varies between about 1.52 GHz and about 1.73 GHz, with sustained (LINPACK) performance varying by the same proportions.

Beginner

For example, for the Xeon Platinum 8180, the Processor Base Frequency is 2.5 GHz and the Max Turbo Frequency is 3.8 GHz (from ark.intel.com). Should I use 2.5 or 3.8 to calculate the max GFLOPS?

Beginner

You should use 2.5 to calculate the max GFLOPS. You can't reach the Max Turbo frequency if you are using all the cores.

Black Belt

The rightmost column of the first row of the table shows that 2.3 GHz is the maximum Turbo frequency that the Xeon Platinum 8180 allows when using all cores and running AVX512 code.  2.3*32*28 = 2060.8 GFLOPS for double-precision FMA operations on 512-bit vectors.
