
AVX-512 is a big step forward - but repeating past mistakes!

AFog0
Beginner

AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems, and AVX512 is better, so I am quite happy that AVX512 seems to be replacing it. But there are still some shortsighted issues that are likely to cause problems for later extensions.

We have to learn from history. When the 64-bit mmx registers were replaced by the 128-bit xmm registers, nobody thought about preparing for the predictable next extension. The consequence of this lack of foresight is that we now have the complication of two versions of all xmm instructions and three states of the ymm register file. We have to issue a vzeroupper instruction before every call to, and return from, ABI-compliant functions, or alternatively make two versions of all library functions, with and without VEX.
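To illustrate the boilerplate this forces on compilers and library writers, here is a minimal sketch with C intrinsics (legacy_sse_function is a hypothetical routine compiled without VEX, not a real library call):

    #include <immintrin.h>

    /* Hypothetical routine compiled without VEX (legacy SSE). */
    extern void legacy_sse_function(void);

    void mixed_code(__m256 *v) {
        *v = _mm256_add_ps(*v, *v);  /* VEX.256 op: dirties the upper ymm state */
        _mm256_zeroupper();          /* vzeroupper: return to the clean state */
        legacy_sse_function();       /* safe now, no VEX/SSE transition penalty */
    }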

Such lack of foresight can be disastrous. Unfortunately, it looks like the AVX512 design is similarly lacking foresight. I want to point out two issues here that are particularly problematic:

  1. AVX512 does not provide for clean extensions of the mask registers
     
  2. The overloading of the register extension bits will mess up possible future expansions of the general purpose register space

First, the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.
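A minimal sketch of the problem with C intrinsics, assuming a compiler that maps __mmask16 moves to kmovw (the only mask load/store instruction in the base AVX512F specification):

    #include <immintrin.h>
    #include <stdint.h>

    /* AVX512F can only move a mask 16 bits at a time (kmovw), so a generic
       save/restore sequence silently truncates any upper bits that a
       future extension might define. */
    void spill_mask(uint16_t *slot, __mmask16 k) {
        *slot = (uint16_t)k;        /* kmovw: stores bits 15:0 only */
    }

    __mmask16 reload_mask(const uint16_t *slot) {
        return (__mmask16)*slot;    /* kmovw: any future upper bits come back as zero */
    }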

It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet we can already predict that 64 bits will be insufficient within a few years. There seem to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024-bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will need another clumsy patch at that time.

Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.
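There is already a precedent for this in the existing instruction set: the SSE4.1 blend instructions use an ordinary xmm register as a byte-granular mask. A minimal sketch:

    #include <immintrin.h>

    /* pblendvb: each result byte is taken from b where the top bit of the
       corresponding mask byte is set, otherwise from a. The mask lives in
       an ordinary xmm register, not in a separate mask register file. */
    __m128i select_bytes(__m128i a, __m128i b, __m128i mask) {
        return _mm_blendv_epi8(a, b, mask);
    }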

So let me propose: Drop the new mask registers and the instructions for manipulating them. Allow seven of the vector registers (e.g. xmm1 - xmm7 or xmm25 - xmm31) to be used as mask registers. All mask functionality will be the same as currently specified by AVX512. This will make future extensions problem-free and allow the synergy of using the same instructions for manipulating vectors and manipulating masks.

The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact, it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension.

Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extensions for their original purpose. We need two more bits (B' and X') to make a clean extension of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and using bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.
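For reference, here is a sketch of the overloading based on the publicly documented EVEX layout (the 0x62 byte followed by payload bytes P0-P2, with R/X/B/R' stored in inverted form in P0). The decoder helper is illustrative only:

    #include <stdint.h>

    typedef struct { uint8_t p0, p1, p2; } evex_payload;

    /* For a register operand in modrm.rm, EVEX.X is pressed into service
       as a fifth register bit on top of EVEX.B - which is exactly why X
       is no longer free to play its original index-extension role in a
       future 32-GPR scheme. */
    static unsigned evex_rm_vector_reg(evex_payload e, unsigned modrm_rm) {
        unsigned B = (~e.p0 >> 5) & 1u;   /* P0 bit 5, stored inverted */
        unsigned X = (~e.p0 >> 6) & 1u;   /* P0 bit 6, stored inverted */
        return (X << 4) | (B << 3) | (modrm_rm & 7u);   /* register 0..31 */
    }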

There are other less attractive solutions in case the Knights Corner bit cannot be used, but anyway I think it is important to keep the possibility open for future extensions of the register space instead of messing up everything with short-sighted patches.

I will repeat what I have argued before: instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these.

 

Bernard
Valued Contributor I

>>>What I meant is that all 64 bits could be used to simplify signal routing and add support for 8/16-bit vector elements. I.e. in case of 8-bit elements every bit of the mask is in effect, in case of 16-bit elements - bits 1, 3, 5 and so on, 32-bit - 3, 7, 11 and so on, 64-bit - 7, 15, 23 and so on.>>>

Signal routing will probably be implemented at the control micro-instruction (uop?) level, and given the 8/16/32/64-bit mask granularity a 2-bit field will be needed to represent the particular granularity (here I suppose that operands are not encoded in the micro-instruction and are simply decoupled). At the hardware level, masking could probably be performed by the ALU (which will probably need a specific control-signal input), and it is interesting whether the same execution-port ALU will be responsible for performing the masking operation on AVX-512 vectors, thus possibly staying busy when vector integer code is being dispatched for execution.
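If I read the quoted scheme correctly, the controlling bit for a given lane would simply be the top bit of that lane's byte group. A small sketch of that mapping (my reading, not anything from a specification):

    /* Mask bit controlling lane 'lane' at element width 'width_bits'
       under the byte-granular scheme quoted above. */
    static unsigned mask_bit_for_lane(unsigned lane, unsigned width_bits) {
        unsigned bytes = width_bits / 8;   /* bytes per element */
        return (lane + 1) * bytes - 1;     /* w=16: 1,3,5...; w=32: 3,7,11...; w=64: 7,15,23... */
    }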

capens__nicolas
New Contributor I

Agner wrote:
Whether it is too expensive to mask with 8-bit or 16-bit granularity is difficult to argue when the people who know the hardware details seem to be gagged.

GPU designers aren't doing it, and AVX-512 doesn't support it either. I think that tells us something about the significant cost of supporting fine-grained predication. And I have yet to come across a valuable use case for which 32-bit lanes or legacy blend instructions and vector-within-vector operations would not be acceptable.

Agner wrote:
Even if it is expensive to do low-granularity masking today, it may be cheap tomorrow, or they may have to do it anyway because of demand from SW people. My point is that the design should be extensible because we can't predict the future. The history of x86 instruction sets is full of examples of very shortsighted solutions with no concern for extensibility. They are doing the same mistake again with AVX512 by making mask registers with no plan for extension beyond 64 bits.

Can't you see the irony in that? You say we can't predict the future, but you also say we should keep the mask registers extensible for low-granularity masking, which potentially hampers widening the SIMD width itself. That seems to me like the shortsightedness you intended to avoid in the first place. You have no basis for this other than an expectation that it might be demanded in the distant future. So the reality is that trying not to make a prediction results in making a prediction as well. Engineers face this dilemma all the time, so in the end you just have to recognize that all the past 'mistakes' stem from the same limited foresight that we are facing now. So you can't call them mistakes unless you're willing to admit that your proposal might very well be the next one.

So in my opinion we have to try to predict the future to the best of our abilities and design for that instead of designing for something we may or may not need. Worst case you're creating something that's really powerful but for a slightly limited range of applications. Looking at GPUs and how lots of people are trying to use them for generic computing, it's not going to be bad at all to use them as the guideline. Bringing that technology into CPU cores in a homogeneous manner in and of itself widens the range of applications. And GPUs have proven that having a minimal lane width of 32-bit is not a significant limitation. For the few cases where you need maximum 8-bit performance, vector-within-vector instructions offer a practical solution. So why take any unnecessary risks?
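For what it's worth, here is a minimal AVX2 sketch of what I mean by vector-within-vector: byte arithmetic at full width, with predication applied at 32-bit granularity (mask32 is assumed to be all-ones or all-zeros per 32-bit lane):

    #include <immintrin.h>

    /* The byte adds run across the whole register; the 32-bit-granular
       mask then enables or disables each dword, i.e. each 4-byte subvector. */
    __m256i masked_byte_add(__m256i a, __m256i b, __m256i mask32) {
        __m256i sum = _mm256_add_epi8(a, b);        /* 32 byte adds at once */
        return _mm256_blendv_epi8(a, sum, mask32);  /* sum where enabled, else a */
    }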

andysem
New Contributor III

bronxzv wrote:

andysem wrote:
movmsk* are slower than just cmp* instructions, especially on older CPUs.

they are 1 clock throughput / 2 clock latency since Conroe or even before IIRC

As per the Intrinsics Guide, on Sandy/Ivy Bridge, pmovmskb has a latency of 2 clocks and a throughput of 1 clock; pcmpeqb and pcmpgtb - 1/0.5. On Conroe/Wolfdale the timings are [unknown] (probably 1?)/1 and 1/0.33 respectively. On NetBurst CPUs the difference is more pronounced: 7/2 and 2/2. I'm not sure about AMD CPUs, but I think pmovmskb is very slow on some models.
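For context, the two idioms being compared, as a minimal sketch:

    #include <immintrin.h>

    /* Extracting a bit mask to a general purpose register (pmovmskb)
       vs. keeping the predicate in a vector register (the pcmpeqb result). */
    int any_byte_equal(__m128i a, __m128i b) {
        __m128i eq = _mm_cmpeq_epi8(a, b);    /* pcmpeqb */
        return _mm_movemask_epi8(eq) != 0;    /* pmovmskb */
    }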

AFog0
Beginner

Well, now they have announced 8-bit and 16-bit granularity after all in a subset of AVX512 named AVX512BW, see https://software.intel.com/en-us/blogs/additional-avx-512-instructions

Apparently, AVX512BW will be supported already in Skylake in 2015, but not in the Knights Landing multiprocessor.

Now, this already exhausts the capacity of the 64-bit mask registers. So now the big question is: Will future extensions to 1024 or 2048 bits have only 32 and 64 bit granularity, or will the mask registers be redesigned to make them bigger? Or will there be no future extensions?
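To make the arithmetic concrete: with AVX512BW, a byte compare over a 512-bit vector already produces a full 64-bit mask, so a 1024-bit byte vector would need a 128-bit mask type that does not exist today. A minimal sketch:

    #include <immintrin.h>

    /* Requires AVX512BW: one mask bit per byte, 64 bytes per vector,
       so the 64-bit mask register is completely used up. */
    __mmask64 bytes_equal(__m512i a, __m512i b) {
        return _mm512_cmpeq_epi8_mask(a, b);
    }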

bronxzv
New Contributor II

Agner wrote:

Well, now they have announced 8-bit and 16-bit granularity after all in a subset of AVX512 named AVX512BW, see https://software.intel.com/en-us/blogs/additional-avx-512-instructions

Apparently, AVX512BW will be supported already in Skylake in 2015, but not in the Knights Landing multiprocessor.

Now, this already exhausts the capacity of the 64-bit mask registers. So now the big question is: Will future extensions to 1024 or 2048 bits have only 32 and 64 bit granularity, or will the mask registers be redesigned to make them bigger? Or will there be no future extensions?

The notation dst[MAX:512] := 0 used throughout the Intrinsics Guide https://software.intel.com/sites/landingpage/IntrinsicsGuide/ hints at a future extension to 1024 bits. IMHO the granularities will be kept but the masks will simply grow larger, i.e. a new __mmask128 type will be introduced.

andysem
New Contributor III

I must say I'm really happy that Intel decided to introduce AVX512BW and AVX512VL, although the announcement only mentions Xeon CPUs and doesn't say anything explicitly about desktop CPUs. I hope that the support is implied.

 

Bernard
Valued Contributor I

It could be interesting to compare the performance of a CPU's 512-bit SIMD vector unit to GPU CUDA shader processors (on large matrix computations).

diamantis__nikos

I think Intel and people using/developing on SIMD extensions should pay attention to thermal/power restrictions and performance penalties that new SIMD extensions introduce in the actual implementation.

For example, the new Xeon E5 v3 (Haswell core) drops its speed (CPU clock) to new lower limits called the "AVX base clock", which is lower than the nominal CPU clock, and the "AVX Turbo clock", which is lower than the nominal CPU Turbo clock.

So, the new E5-2699 v3 Xeon using all cores has a frequency of 2.3GHz (Base) to 2.8GHz (Turbo) with all workloads but AVX.

When it encounters AVX code the CPU frequency drops to 1.9GHz (Base) with an upper limit of 2.3GHz (Turbo) for heavy AVX usage.

This kind of performance loss in the actual implementation of a SIMD architecture could be more significant than even the design decisions of the architecture.

Bernard
Valued Contributor I

>>>When it encounters AVX code the CPU frequency drops to 1.9GHz (Base) with an upper limit of 2.3GHz (Turbo) for heavy AVX usage.>>>

Interesting information. May I ask what the source of that information is?

On the other hand, it is understandable that some power consumption restrictions are in place. A heavy AVX workload will utilize many of the 168 available floating-point physical registers while transferring data over 256-bit data paths to and from the registers themselves and the execution units.

Bernard
Valued Contributor I

Thank you, Nikos.

McCalpinJohn
Honored Contributor III

I have not been able to find the presentation(s) on the Intel site, but these features are described at:

http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/5

I have only done a little bit of testing on the Haswell EP systems here, but it is clear that the Turbo behavior is quite different from what I see on the Sandy Bridge EP (Xeon E5-2680).

One thing I did note is that the reduction in frequency appears to be triggered by 256-bit operations, not by AVX encodings -- so if your code uses AVX encodings for scalar or 128-bit packed operations, you will not see the clock reduction. This makes for increasingly subtle performance tradeoffs...
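To illustrate the distinction, a minimal sketch; both functions use AVX (VEX) encodings, but only the second touches 256-bit state:

    #include <immintrin.h>

    /* vmulps xmm: VEX-encoded but 128-bit - per the observation above,
       this does not trigger the reduced "AVX" frequency. */
    __m128 scale128(__m128 a, __m128 b) { return _mm_mul_ps(a, b); }

    /* vmulps ymm: VEX-encoded and 256-bit - this is what appears to
       trigger the lower clock. */
    __m256 scale256(__m256 a, __m256 b) { return _mm256_mul_ps(a, b); }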

McCalpinJohn
Honored Contributor III

Aha -- using the first reference in Nikos' note, I found that the frequency limits for the Xeon E5 v3 parts are all documented in the "Intel Xeon Processor E5 v3 Product Families: Specification Update" (document 330785-002, October 2014), which is currently available at http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf

The "standard" Turbo boost frequencies for each processor model are in Table 2 and the "AVX" turbo boost frequencies are in Table 3.

angus-hewlett
Beginner

Does this AVX-specific Turbo restriction affect desktop and mobile parts, or only Xeons?

TimP
Honored Contributor III

angus-hewlett wrote:

Does this AVX-specific Turbo restriction affect desktop and mobile parts, or only Xeons?

Did you try it?  

On the i5-4200U, I continue to observe performance cut-backs which persist well into single-thread regions when running 2 hyperthreads per CPU, but I don't observe a connection with AVX or even 128-bit parallel SIMD. The effect is much more pronounced when running with the Intel OpenMP library, as compared with GNU OpenMP. Performance with Intel OpenMP depends on limiting threads to 1 per CPU.

I haven't seen documentation as to why there is no BIOS option to disable hyperthreading on some HSW CPUs, or, on others, why tinkering with that option prevents the OS from starting up.

diamantis__nikos

Using my Haswell Core i7-4790, I see no restrictions in turbo mode when running AVX applications.

BUT the huge amount of power needed to run the task (depending on the AVX optimizations) can eliminate Turbo mode by triggering CPU thermal/power throttling, or even lead to a lower CPU clock, especially if you don't have proper cooling for the CPU.

For example, running Prime95 v28.5, which is a heavily AVX2 (FMA) optimized application, without proper cooling I never saw the max Turbo clock of 3.8GHz; instead, the clock was 3.6GHz (the max non-Turbo clock) from the beginning, and after a few minutes the CPU throttled to 3.5GHz, 3.4GHz, etc.

So, there are no actual, strict and direct restrictions, BUT in real life you could see CPU throttling when using AVX-optimized apps.

Bernard
Valued Contributor I

Probably a lot of energy is consumed by the AVX register file, hence the CPU throttling when AVX code is scheduled to run.

Royi
Novice

Has Intel changed anything in response to the above feedback?

AFog0
Beginner

Royi wrote:

Has Intel changed anything in response to the above feedback?

No. As I wrote, by the time they publish a new ISA extension it is too late to change anything.

This problem can only be overcome with an open development process and open standards. I am working on a project to demonstrate the advantages of such an open process. This project includes a new instruction set, designed completely from scratch, but taking advantage of the experience we have from the history of existing instruction sets. It has vector registers with variable length, a feature for making vector code compatible with future extensions, and more efficient handling of vector loops, among many other advantages. You can see it all at www.forwardcom.info if you are interested.
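As a rough illustration of the vector-loop idea, here is a plain C sketch; HW_MAX stands in for the hardware's maximum vector length and is not actual ForwardCom syntax:

    #include <stddef.h>

    enum { HW_MAX = 16 };   /* hypothetical elements per vector register */

    /* Length-agnostic loop: each pass processes min(remaining, HW_MAX)
       elements, so the same code automatically speeds up on hardware
       with longer vectors. */
    void add_arrays(float *a, const float *b, size_t n) {
        while (n > 0) {
            size_t len = n < HW_MAX ? n : HW_MAX;  /* vector length this pass */
            for (size_t i = 0; i < len; i++)       /* stands in for one vector op */
                a[i] += b[i];
            a += len; b += len; n -= len;
        }
    }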

 

Talisin__Rus
Beginner

andysem wrote:

c0d1f1ed wrote:

Quote:

If mask registers were designed with byte granularity in mind, all that complexity would be unnecessary. Just let k registers be 64-bit from the start, every bit would mask the corresponding byte in the vector. Larger-granularity masking would just use more than one bit per lane or just one (e.g. highest) bit to mask out the operation on the element. Comparison instructions would also set bits corresponding to the operation granularity (like pcmpgt/pcmpeq do now with xmm registers). No need for wiring different bits to different lanes at all.

That would be an excellent suggestion if not for the fact that AVX-512 would already use all 64-bit of the k registers. Extending to 1024-bit would get very messy. I don't think that's a smart move for a predication granularity that's not going to be of great value anyway.

Quote:

Vector width extension then comes naturally with k registers width extension. I suppose, at that point the integer PRF wouldn't be enough to store the extended k registers, unless one k register is multiplexed from two physical registers in the file.

Then you need two register file accesses per predication mask (up to a total of six per cycle, without counting any pointer and index registers). Also, you'd easily run out of physical registers. This sounds like a costly hack to me, and again I don't think it's worth the effort.

 

Well, it's hard for me to judge how difficult such an extension would be, but it looks worthy to me. There are solutions for this problem besides multiplexing k registers. An x86-128 IA32 extension, for example :-D. Seriously though, k registers could be extracted to a separate file, which could be dormant most of the time, .. On that suggestion, why not create a register stack .. Seriously though, a register workspace pointer to k sets of 32 registers

All in all, I'm not arguing that full support for 8/16-bit operations and predication comes without a cost. My opinion is that the demand for it is significant enough to justify the possible complication. Vectors won't continue to grow much after 512 bits (1024 - yes, 2048? don't know), so the amount of complication is limited and predictable...

LOOKING at the PHI .. I'm seriously replacing the E5 2650 cluster with 16 PHI vector processors .. it's not until reading the vector extensions on this "failed arch" that it hit me: I was searching for firmware control of event loops (CPU offload) and deep correlation (requiring GPU CUDA), when here it lay hidden (at least on an Intel compute chip) in the form of a 32K L1 data cache .. Here was a 32K array neatly laid out in serial form, in 64-byte (512-bit) width .. my requirement NOW, not some future target .. 61 cores with 4 hardware threads / core, that's 244 threads / CPU

" a 64-byte cache line implies that it can contain 512 bits of data. In terms of single precision and double-precision floating-point data, the 64-byte cache lines can hold 16 single-precision floating-point objects or 8 double-precision floating-point objects. In associating the 64-byte (512 bits) cache-line-width architecture with the VPU on the Intel Xeon Phi coprocessor [6], the vector instructions can process 512 bits of data "

https://software.intel.com/en-us/articles/memory-management-optimizations-on-the-intel-xeon-phi-coprocessor-using-abstract-vector

All I need now is a few diligent Cilk pundits https://www.cilkplus.org/node/78#block-views-cilk-tools-block-1 who would enjoy the results of a nanoprobe Tx/Rx array sitting in an Auger dual-beam electron microscope, receiving Doppler signatures via a -140dBm noise-floor A/D front end

Good programmers are hard to find .. when bioscience requires WFFT .. life savings on equipment .. http://spaa.acm.org/

 

 
