I am not sure whether this is the right subforum; however, I couldn't find a better fit.
In our group, we are parallelizing some C++ source code. For our experiments we use an Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz.
To better understand our results, we need deeper insight into the hardware architecture.
So we are looking for documents that describe Hyper-Threading on this specific processor.
Are there any documents that tell us which execution units (e.g., the floating-point unit) are shared between hyper-threads?
I believe Ivy Bridge was the same as Sandy Bridge in these respects, covered without much detail in the Intel architecture docs. The FPU and fill buffers are shared between hyperthreads. Experts on this might be found on the ISA forum.
The best general reference is "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Intel document 248966, revision 033 from June 2016 is the most recent, but any version from the last 2 years will have the latest information on the Xeon E3 v2 ("Ivy Bridge"). Sometimes you will find low-level details in other presentations (e.g., from the Hot Chips conferences), but these are typically isolated low-level details rather than overviews.
At a high level, the easiest way to think about Intel's HyperThreading implementations is that the threads have their own registers, but share everything else -- particularly all of the processor's functional units and all levels of the processor's cache hierarchy.
- HyperThreading is well suited for heterogeneous workloads in which the two threads are often using (and waiting on) different resources.
- For homogeneous workloads, HyperThreading will often help throughput, but usually by a smaller ratio than with heterogeneous workloads.
- Homogeneous threads can often overlap stalls due to cache misses and can sometimes overlap stalls due to functional unit latencies.
- But it is easier to find cases for which the extra competition for cache capacity (or cache associativity, or DRAM banks, or other shared resources) results in an overall reduction in performance.
- Enabling HyperThreading also allows for scheduling errors when using fewer threads than "logical processors". Unless you explicitly control the placement of user threads to "logical processors", it won't take long to find cases in which the OS schedules two processes on one physical core and zero processes on another. This does not have to happen very often to completely negate the performance advantages obtained when the processes are scheduled properly.
The AVX instruction set extensions include scalar instructions and packed instructions for both 128-bit and 256-bit SIMD register width.
For scalar and 128-bit packed operations, many of the AVX instructions are similar to previous SSE instructions, but allow a 3-operand format instead of the 2-operand format used by SSE. This can significantly reduce the number of extra register-to-register copies required.
AVX only includes support for a subset of data types for packed 256-bit SIMD registers -- I think it only supports 32-bit and 64-bit floating point. Packed byte/word/doubleword/quadword operations are supported on 128-bit SIMD registers.
AVX2 adds support for packed byte/word/doubleword/quadword operations on 256-bit SIMD registers.
With most compilers, selecting an AVX target implies 256-bit instructions, possibly mixed with 128-bit instructions where those are considered optimal (for example, for unknown alignment on early AVX CPUs) or where no 256-bit form of an instruction exists.
An exception is Visual Studio 2012, which uses only AVX 128-bit instructions for auto-vectorization under /arch:AVX. Later versions of Visual Studio use 256-bit instructions for /arch:AVX and support also /arch:AVX2.
Hello Jim, John, and Tim,
thanks for all of your comments. I browsed the web a little and found this overview of Intel's parallel computing:
So my understanding is that AVX is automatically 256-bit,
and AVX2 just has more computation units to support more data types in parallel.
Is this statement correct, or is it too simplified?
Intel AVX instructions promoted floating-point SIMD instructions to 256-bit.
Intel AVX2 instructions promoted most integer SIMD instructions to 256-bit.
AVX2 introduced the fused multiply-add (FMA), along with the following:
- additional functionalities for broadcast/permute operations on data elements
- vector shift instructions with variable-shift count per data element
- instructions to fetch non-contiguous data elements from memory
For a complete description, see the Intel(R) Advanced Vector Extensions Programming Reference
Equating AVX and "256-bit" is very likely to cause confusion.
AVX is the first SIMD instruction set to *support* 256-bit operations, but it also includes many new instructions that operate on scalar values or on packed values in 128-bit registers. It is easy to generate code that will only run on machines that support AVX, but that will not use 256-bit registers.
AVX2 adds support for packed integer types in 256-bit registers, and is also associated with the Fused Multiply-Add (FMA) instructions in all existing products. (Some documentation will refer to FMA as a separate instruction set extension, but I don't think that there are any Intel processors that support "AVX2 but not FMA" or "FMA but not AVX2".)
An example of the confusion caused by equating AVX with "256-bit" relates to the Turbo frequencies on Xeon E5 v3 and newer products. The maximum Turbo frequency of the processor depends on the number of cores in use (as in previous products) *and* on whether they are using 256-bit registers (in the last millisecond). Unfortunately this is called the "AVX Turbo" frequency, but it is not controlled by whether you are running AVX instructions -- it is controlled by whether those instructions are using 256-bit registers. Code that uses the AVX instruction set but that never touches a 256-bit register will run at the higher Turbo frequencies, not at the "AVX Turbo" frequencies.
This will all get more confusing over time, e.g.,
- Skylake ISA support -- "Skylake client" vs "Skylake Xeon", and
- AVX-512 ISA support in "Xeon Phi 72xx" vs "Skylake Xeon".
The CPUID instruction returns "False" for FMA support on Xeon Phi x100 (Knights Corner/KNC). This is the right answer because the FMA implementation in the Xeon Phi x100 is not compatible with the FMA implementation in any other processor.
Volume 2 of the SWDM (Table 3-20) says that the FMA feature "supports FMA extensions using YMM state", so it has to be an add-on to AVX or AVX2 just to match the SIMD register width. There may be good low-level reasons for not adding FMA to an AVX implementation, but it is not obvious that an AVX2 implementation without FMA would be a problem. There are almost certainly application segments that can exploit the 256-bit packed integer capability of AVX2 but which would not care if FMA were absent. On the other hand, I am sure that test engineers and compiler writers are very happy not to have to deal with all possible instruction set permutations....
Just a question about AVX2 / FMA3 and HT (on Haswell, sorry !)
As far as I understand, FMA3 provides 2 FMA operations on 2×256-bit operand registers. That leads to 16 Flop/cycle (at best ;) ), right?
My question is: when HT is enabled, is it possible for each thread to perform AVX2 operations at the same time?
I think a clearer way to say it is that the FMA (a.k.a. FMA3) instruction set extension provides support for FMA instructions that operate on 3 registers, with support for scalar float or double, packed 128-bit float or double data, and packed 256-bit float or double data. The number of FMA instructions that can be executed in one cycle is an implementation detail that is independent of support for the instructions themselves.
Haswell/Broadwell/Skylake processors all have two "execution ports" that can execute these FMA instructions concurrently. For 64-bit data, a 256-bit register holds 4 values, so each FMA instruction performs 8 Floating-Point operations. Executing 2 of these instructions in a single cycle gives a peak floating-point execution rate of 16 double-precision floating-point operations per cycle.
Given the many public descriptions of HyperThreading, it seems clear that in any cycle it is possible for one "logical processor" to issue an FMA to one of the execution ports while the other "logical processor" issues an FMA to the other execution port. Instructions from different "logical processors" should also be able to interleave in the different pipeline stages of any single execution unit. (I am not sure about the complex functions like FP divide, but the single-pass pipelines should be able to accept instructions from either logical processor in any cycle, so they should support arbitrary interleaving across the stages of the pipeline.)
Not all processors that support FMA have two FMA units. The Xeon Phi x200 (a.k.a., Knights Landing, KNL) has support for the FMA extensions to AVX2 in only one of the two floating-point functional units, so it can only execute one 256-bit FMA per cycle. Of course, KNL also supports the AVX-512F instruction set, which includes 3-operand FMA operations on 512-bit registers. Both floating-point functional units support the AVX-512F instruction set, giving a peak floating-point performance of 2 FPUs * 8 elements/register * 2 FP ops/instruction = 32 double-precision floating-point operations per cycle. It is harder to get close to this peak value on KNL because KNL is a 2-issue core -- so peak performance can only be obtained if there are no other instructions being executed, such as loads, pointer increments, compares, branches, etc. (The AVX-512F FMA instructions can include loads, but most algorithms require more than one load per FMA.)
The AVX turbo frequency is affected by any instruction executed in the vector unit. This includes all x87, MMX, SSE, AVX, AVX2 and AVX-512 instructions, stores, and masked loads, but excludes vector (non-masked) loads. No distinction is made between instructions of different types, classes or sizes (e.g. vector vs. scalar, or xmm vs. ymm vs. zmm).
In regard to the question John brought up about the FP divide (which stalled the Sandy Bridge pipeline for a relatively large number of cycles), my tests showed that the other hyperthread could run effectively using the other execution ports.
Perhaps this digresses from the interests of the others on this thread: the no-prec-div method (the Intel compilers' default) of avoiding the long-latency IEEE divide instructions did not show the performance gains from hyperthreading that were possible when using the IEEE divide. These effects could be significant even on Ivy Bridge with AVX-256 divide (and sqrt), since the operands were split in half and fed to the execution port in sequence, thus doubling the pipeline delay of a 128-bit divide.
From the comments I've seen, the Broadwell CPU might resolve the performance question for scalar divide, but not until the Skylake server CPU would one expect a full-speed parallel floating-point divide.
So the question of which divide situations are helped by hyperthreading depends on CPU version as well as on the compile options.
KNC had no satisfactorily performing alternative to the no-prec-div no-prec-sqrt scheme, as there was not even instruction-level support for IEEE divide/sqrt. KNL seems to open a new group of questions.