If I recall correctly, all of the Gold 6000 processors have two AVX512 units, so they are capable of 32 DP FLOPS/cycle. The Gold 5000 processors have one AVX512 unit (except for the Gold 5122, which has two), so they are capable of 16 DP FLOPS/cycle.
The frequency that you will get when running AVX512 instructions will be lower than the nominal frequency in most cases. The minimum and maximum values for each processor model are included in the Xeon Scalable Processor Specification Update (document 336065).
Gold 5000 processors: 16 DP = 512 / 64 * 2
How can I understand the "2"?
Gold 6000 processor: 32 DP = 512 / 64 * 2 * 2
How can I understand the two "2"?
Does that mean 2 compute units or 4 compute units?
Starting with the Xeon E5 v3 (Haswell) core, each floating-point vector unit supports the Fused Multiply-Add instructions, which perform two operations on each element. So for AVX512, each unit performs 2 operations on each of 8 elements in one cycle.
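The arithmetic behind the "2" factors can be written out as a small Python sketch. The function name and its parameters are illustrative only (they do not come from any Intel tool), and the sketch assumes each FMA unit can issue one full-width FMA every cycle.

```python
def dp_flops_per_cycle(fma_units, vector_bits=512, element_bits=64):
    """Peak double-precision FLOPs per cycle per core (illustrative helper)."""
    lanes = vector_bits // element_bits  # 512 / 64 = 8 doubles per vector
    ops_per_fma = 2                      # an FMA is one multiply plus one add
    return lanes * ops_per_fma * fma_units


print(dp_flops_per_cycle(1))  # one AVX512 FMA unit per core
print(dp_flops_per_cycle(2))  # two AVX512 FMA units per core
```

So the first "2" is the two operations of a fused multiply-add, and the second "2" (where present) is the number of AVX512 FMA units in the core.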
The use of the "Skylake" label is quite confusing.
No doubt things will become even more confusing in the future....
The information above comes from https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors and from checking out the links to specific processor information pages on https://ark.intel.com/
I ran "cat /proc/cpuinfo" to get the model.
I get model 85 on a Xeon Gold 6133, with model name "Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz".
Do the Xeon D-21xx, Xeon E3-12xx, Xeon E3-15xx, Xeon W-21xx, Gold 6000, Gold 5000, and so on all have the same model number?
Can I use the model number to tell whether a CPU has one or two AVX512 units?
The only way I know to obtain this information is to look at the specific processor product page under https://ark.intel.com/#@Processors
The Xeon Gold 6133 processor does not appear in Intel's list of Xeon Scalable Processors at https://ark.intel.com/products/series/125191/Intel-Xeon-Scalable-Processors, but it may be a special "OEM" version. There are a number of these listed in the section on "Skylake SP" processors at https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors -- they either say "OEM" or have a blank in the "Release Price" column.
The only way to be sure of the number of AVX-512 units is to run a benchmark test -- it does not look like the number of AVX-512 units is available from the CPUID instruction or through any other hardware reference.
Given that every other Xeon Gold 6000 processor has 2 AVX-512 FMA units, I would guess that this one does as well, but if it is an OEM part it could have been specially requested to only have one AVX-512 FMA unit.
On a Linux system, the command "cat /proc/cpuinfo" will include a list of "flags" that show which features the processor supports. If AVX512 is supported, then the AVX512 subsets that are supported will be listed. On a Xeon Platinum 8160, for example, I get:
# head -26 /proc/cpuinfo | grep 512
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req
This shows that the processor supports the AVX-512F (Foundation) instruction set, as well as the "DQ", "CD", "BW", and "VL" subsets. This is the expected set for a Skylake Xeon processor.
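For anyone who would rather not pick the AVX512 entries out of the flags line by eye, here is a minimal Python sketch that does the same filtering. The function name avx512_flags is hypothetical, and the sketch assumes the Linux /proc/cpuinfo layout shown above ("flags : ..." lines). It only reports which AVX512 subsets are supported; it says nothing about the number of FMA units.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX512 feature flags found in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Each logical CPU repeats the flags line, so collect into a set.
        if sep and key.strip() == "flags":
            found.update(f for f in value.split() if f.startswith("avx512"))
    return sorted(found)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print(avx512_flags(fh.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

An empty list means the kernel does not report any AVX512 support for the processor.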
Assuming that the processor supports AVX512, the performance of Intel's optimized LINPACK benchmark should make it very clear whether the processor has 1 or 2 AVX-512 FMA units. For Linux, the description of the benchmark is at https://software.intel.com/en-us/mkl-linux-developer-guide-intel-optimized-linpack-benchmark-for-lin...; There is also a Windows version of the benchmark that is easy to find.
I ran runme_xeon64, and its output is below. Which part of it shows the number of AVX-512 FMA units?
[root@centos71611 linpack]# ./runme_xeon64
This is a SAMPLE run script for running a shared-memory version of
Intel(R) Distribution for LINPACK* Benchmark. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
*Other names and brands may be claimed as the property of others.
Fri Jun 1 15:06:49 CST 2018
Sample data file lininput_xeon64.
Current date/time: Fri Jun 1 15:06:49 2018
CPU frequency: 2.493 GHz
Number of CPUs: 2
Number of cores: 12
Number of threads: 12
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array : 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run : 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
Data alignment value (in Kbytes) : 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
Maximum memory requested that can be used=16200901024, at the size=45000
=================== Timing linear equation system solver ===================
Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.010      69.5946  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.008      88.2280  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.009      78.1615  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.008      88.9556  8.724688e-13 2.975343e-02   pass
2000   2000   4      0.045      119.8063 4.565348e-12 3.971294e-02   pass
2000   2000   4      0.041      131.0556 4.565348e-12 3.971294e-02   pass
5000   5008   4      0.506      164.7312 2.416245e-11 3.369259e-02   pass
5000   5008   4      0.499      167.1883 2.416245e-11 3.369259e-02   pass
10000  10000  4      3.743      178.1716 8.700884e-11 3.068020e-02   pass
None of the sizes shown in this output are big enough to see asymptotic performance on this system....
The "runme_xeon64" script launches the "xlinpack_xeon64" binary. Running "xlinpack_xeon64 -e" prints out the extended help for the benchmark.
If you set OMP_NUM_THREADS to 1 before running the test, it will limit the execution to a single core and you should get close to asymptotic performance for the problem sizes in the 10000 to 15000 range. Since we don't know what the minimum AVX512 or maximum AVX512 frequencies are for this processor, I would wrap the command in "perf stat" to get the average frequency.
On my Xeon Platinum 8160, modifying the input file to run a problem size of 10000 four times, I get
Performance Summary (GFlops)
Size LDA Align. Average Maximal
10000 10000 4 79.0453 79.4813
The output of "perf stat" showed an average of 3.36 GHz.
Dividing 79 GFLOPS by 3.36 GHz gives 23.5 FP operations per cycle, which is much higher than the peak of 16 FP operations per cycle that would be appropriate for a processor with only one AVX512 FMA unit.
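The same check can be scripted. This is only a sketch: the function names are made up for illustration, the inputs are the single-core LINPACK result and the "perf stat" average frequency quoted above, and the 16 FLOPs/cycle per-unit peak assumes double-precision AVX512 FMA.

```python
import math


def flops_per_cycle(gflops, avg_ghz):
    """Measured FP operations per cycle for a single-core run."""
    return gflops / avg_ghz


def likely_fma_units(gflops, avg_ghz, per_unit_peak=16):
    """Smallest number of AVX512 FMA units whose peak covers the measurement."""
    return max(1, math.ceil(flops_per_cycle(gflops, avg_ghz) / per_unit_peak))


# Single-core numbers from the Xeon Platinum 8160 run above.
print(round(flops_per_cycle(79.0453, 3.36), 1))
print(likely_fma_units(79.0453, 3.36))
```

A single-core result that stays below 16 FLOPs per cycle does not prove there is only one unit (the run may simply be inefficient), but a result well above 16 can only come from a core with two units.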
The "Base AVX-512 Core Frequency (GHz)" is the frequency that the processor will use when running 512-bit SIMD instructions in the absence of any Turbo boost. The columns to the right show the maximum frequency that the processor will use when running 512-bit SIMD instructions with various numbers of active cores.
The maximum frequency is only available if the temperature does not exceed the temperature limits, the package power does not exceed the package power limits, and the electrical current does not exceed the current limits. If any of these limits are exceeded, the frequencies will be reduced until none of the limits are exceeded.
The "Base AVX-512 Core Frequency (GHz)" is also intended to represent the minimum frequency that will ever be seen for any power-limited workload (assuming a correctly configured cooling system). So this frequency can be used to compute a lower bound on the peak GFLOPS. For example, the Xeon Platinum 8180 has a "Base AVX-512 Core Frequency" of 1.7 GHz, with 28 cores and two AVX-512 units per core, giving a lower bound:
Lower Bound: 28 cores * 1.7 GHz * 32 FLOPS/Hz = 1523.2 GFLOPS (per socket).
The maximum 28-core AVX-512 frequency of 2.3 GHz provides an upper bound on the peak performance
Upper Bound: 28 cores * 2.3 GHz * 32 FLOPS/Hz = 2060.8 GFLOPS (per socket)
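Both bounds come from the same product of cores, frequency, and FLOPs per cycle, so they are easy to recompute for other models. A minimal Python sketch, with a hypothetical helper name and the Xeon Platinum 8180 figures from above as inputs:

```python
def peak_gflops(cores, ghz, flops_per_cycle=32):
    """Peak double-precision GFLOPS per socket at a given AVX-512 frequency."""
    return cores * ghz * flops_per_cycle


# Xeon Platinum 8180: 28 cores, two AVX-512 FMA units per core.
lower = peak_gflops(28, 1.7)  # Base AVX-512 Core Frequency
upper = peak_gflops(28, 2.3)  # maximum 28-core AVX-512 Turbo frequency
print(f"lower bound: {lower:.1f} GFLOPS, upper bound: {upper:.1f} GFLOPS")
```

For a processor with a single AVX-512 FMA unit per core, pass flops_per_cycle=16 instead.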
The actual frequency when running compute-intensive AVX512 workloads depends on the unique characteristics of the specific piece of silicon (particularly leakage current), as well as the characteristics of the cooling system (ambient temperature, heat sink thermal conductivity, air flow rate, etc).
We have 3472 Xeon Platinum 8160 (24-core) processors in 1736 two-socket nodes. The Base AVX-512 Core Frequency for these processors is 1.4 GHz and the maximum 24-core AVX-512 frequency is 2.0 GHz. When running Intel's optimized LINPACK benchmark, we see that the average frequency of these processors varies between about 1.52 GHz and about 1.73 GHz, with sustained (LINPACK) performance varying by the same proportions.
The rightmost column of the first row of the table shows that 2.3 GHz is the maximum Turbo frequency that the Xeon Platinum 8180 allows when using all cores and running AVX512 code. 2.3*32*28 = 2060.8 GFLOPS for double-precision FMA operations on 512-bit vectors.
The processor's Max Turbo Frequency is 3.8 GHz, but the AVX-512 Max Turbo Frequency is only 2.3 GHz. Why is this?
Also, until now I have always calculated the maximum GFLOPS as (CPU base frequency * cores * 32 (or 16, 8, 4)).