Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

AVX-512 expectations

capens__nicolas
New Contributor I
2,087 Views

Announcement: http://software.intel.com/en-us/blogs/2013/avx-512-instructions

What isn't clear from this announcement is whether the future Xeon processor with AVX-512 support will actually be a socketed MIC or a mainstream CPU (more precisely, Skylake). Is it coming to consumer CPUs in a similar timeframe? Developers might want to know, to decide whether to adopt AVX2+ or heterogeneous computing. The latter would benefit the competition more than it would benefit Intel.
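For developers who want to hedge that bet in the meantime, a minimal runtime-dispatch sketch could look like the following (assuming GCC or Clang with support for the "avx512f" feature name in __builtin_cpu_supports; the kernel names are just placeholders, not anything Intel provides):

#include <stdio.h>

/* Stub kernels standing in for real AVX-512 / AVX2 / scalar code paths
   (placeholder names, not Intel-provided APIs). */
static void kernel_avx512(void) { puts("using the AVX-512 path"); }
static void kernel_avx2(void)   { puts("using the AVX2 path"); }
static void kernel_scalar(void) { puts("using the scalar path"); }

int main(void)
{
    /* GCC/Clang builtin that queries CPUID at run time. */
    if (__builtin_cpu_supports("avx512f"))
        kernel_avx512();
    else if (__builtin_cpu_supports("avx2"))
        kernel_avx2();
    else
        kernel_scalar();
    return 0;
}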

0 Kudos
29 Replies
Bernard
Valued Contributor I
481 Views

I confused the MIC VPU instructions with Xeon AVX, which are probably different at the front-end machine-code level. The newest, wider 512-bit extension will probably be able to reduce the raw FLOP/sec speed difference compared to today's GPUs. It will be quite interesting to vectorize a custom math function library to work on 8-element double precision vectors.
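As a rough illustration of what such an 8-wide double-precision kernel could look like with AVX-512F intrinsics (a minimal sketch, assuming the array length is a multiple of 8 and compilation with -mavx512f; the function name is just a placeholder):

#include <immintrin.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i], processed 8 doubles per iteration with AVX-512F.
   Assumes n is a multiple of 8; a real library would add a remainder loop. */
void axpy512(double a, const double *x, double *y, size_t n)
{
    __m512d va = _mm512_set1_pd(a);              /* broadcast the scalar a    */
    for (size_t i = 0; i < n; i += 8) {
        __m512d vx = _mm512_loadu_pd(x + i);     /* load 8 doubles from x     */
        __m512d vy = _mm512_loadu_pd(y + i);     /* load 8 doubles from y     */
        vy = _mm512_fmadd_pd(va, vx, vy);        /* fused multiply-add        */
        _mm512_storeu_pd(y + i, vy);             /* store 8 results back to y */
    }
}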

 

0 Kudos
Bernard
Valued Contributor I
481 Views

>>>In my opinion AVX-512 is even a power efficiency feature for high DLP workloads>>>

But at the cost of more transistors, and thus more gate logic needed to implement the new instructions.

0 Kudos
capens__nicolas
New Contributor I
481 Views

iliyapolak wrote:
>>>In my opinion AVX-512 is even a power efficiency feature for high DLP workloads>>>

But at the cost of more transistors, and thus more gate logic needed to implement the new instructions.

Not really. You only need half the number of cores for the same (peak) throughput, so it saves lots of transistors for perfectly vectorizable workloads. Even more typical code contains many loops with independent iterations that can be vectorized, and in recent years many new parallel algorithms have been developed. So a balance is required between scalar and vector processing for the optimal average performance/transistor and performance/Watt. AVX-512 would definitely help improve that. The cost is fully compensated by the average gains.

Also note that 14 nm technology and beyond greatly increases the transistor budget, so no compromise has to be made. Any other way those transistors would be spent is not likely to offer the same net benefit.

0 Kudos
Bernard
Valued Contributor I
481 Views

Regarding needing half the cores to do the same job, that is true. Bear in mind, though, that some of the newest machine-code instructions, while being implemented at the hardware level by micro-ops, could use more CPU resources such as adders, multipliers, and the other logic units, and thus consume more power. Also, while working on larger vectors, more energy could be dissipated in a small amount of time (presumably measured in a single clock cycle), because more raw data needs to be operated on.

 

 

0 Kudos
Bernard
Valued Contributor I
481 Views

Forgot to add that a greater number of transistors generates more heat. So it's all about finding the proper balance between higher computational power and heat dissipation, and trying to minimize the heat being generated.

0 Kudos
perfwise
Beginner
481 Views

I agree with c0d1f1ed. The transistor budget is there, and you have more opportunity than just adding cores. If you go wider with vectors, you amortize the power cost of handling a uop: you have 1 cache, 1 fetch unit, 1 uop cache, 1 LS (which has 1 LDQ), etc. Going wider makes sense, so long as you can use it, which is up to the app, the ISA, and the compiler used to leverage it. That's preferable to having many cores with duplicate power burned for no return. Lastly, the performance of scalar code is critical. Since most "highly vectored" apps average 30-45% vector code, you need good scalar performance as well, because algorithms aren't all just sitting in some highly vectorizable loop. There are transitions, conditional control flow, etc., which necessitate a well-rounded performance scheme. Just my 2 cents..
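As a rough back-of-the-envelope illustration of why the scalar side matters (an Amdahl's law estimate with assumed numbers, not a measurement): if 40% of the runtime is vectorizable and the vector portion speeds up 8x, the overall speedup is only 1 / (0.60 + 0.40/8), or about 1.54x, so the 60% of scalar work still dominates the result.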

Perfwise

0 Kudos
Bernard
Valued Contributor I
481 Views

It could be interesting to compare, in terms of heat dissipation or the energy needed, performing for example a trigonometric calculation with a hardware-implemented algorithm (a vectorised version of the scalar fsin machine-code instruction) against the same algorithm implemented with AVX-512 instructions. My guess is that at the front-end stage the cost in cycles to fetch and decode the 6 or 7 terms of a Horner scheme could be greater than decoding one complex instruction.
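For reference, a minimal sketch of such a Horner-scheme sine polynomial over 8 doubles with AVX-512F intrinsics could look like this (assuming the arguments are already range-reduced to a small interval and using plain Taylor coefficients; a real math library would use minimax coefficients and proper range reduction):

#include <immintrin.h>

/* sin(x) ~ x*(1 + x^2*(c3 + x^2*(c5 + x^2*(c7 + x^2*c9)))) for small |x|.
   Taylor coefficients; the accuracy is only illustrative. */
__m512d sin512_poly(__m512d x)
{
    const __m512d c3 = _mm512_set1_pd(-1.0 / 6.0);
    const __m512d c5 = _mm512_set1_pd( 1.0 / 120.0);
    const __m512d c7 = _mm512_set1_pd(-1.0 / 5040.0);
    const __m512d c9 = _mm512_set1_pd( 1.0 / 362880.0);

    __m512d x2 = _mm512_mul_pd(x, x);                  /* x^2          */
    __m512d p  = _mm512_fmadd_pd(c9, x2, c7);          /* c9*x^2 + c7  */
    p = _mm512_fmadd_pd(p, x2, c5);                    /* ..*x^2 + c5  */
    p = _mm512_fmadd_pd(p, x2, c3);                    /* ..*x^2 + c3  */
    p = _mm512_fmadd_pd(p, x2, _mm512_set1_pd(1.0));   /* ..*x^2 + 1   */
    return _mm512_mul_pd(p, x);                        /* multiply by x */
}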

0 Kudos
SergeyKostrov
Valued Contributor II
481 Views
>>Any details on AVX-512 products being presented there?

No. Even if it is not related to the subject of the thread, here is a very short list of the different things I remember:

- 7nm technology by 2017
- Haswell goes mobile
- SoC ( System On Chip )
- Quark
- Two-in-One solutions ( Desktop / Laptop / Tablet / Mobile ). Note: This is a compromise between PC and Tablet systems
- Processing very large Data Sets ( petabytes and so on )

I didn't hear anything about MKL and IPP.
0 Kudos
Bernard
Valued Contributor I
481 Views

How quickly lithography has progressed. I still remember 130 nm chips.

0 Kudos