Announcement: http://software.intel.com/en-us/blogs/2013/avx-512-instructions
What isn't clear from this announcement is whether the future Xeon processor with AVX-512 support will actually be a socketed MIC or a regular CPU (more precisely, Skylake). Is it coming to consumer CPUs in a similar timeframe? Developers might want to know, to determine whether to adopt AVX2 and beyond or heterogeneous computing. The latter would benefit the competition more than it would benefit Intel.
I confused MIC VPU instructions with Xeon AVX, which probably differ at the front-end machine-code level. The newest, wider 512-bit extension will probably be able to narrow the raw FLOP/s gap with today's GPUs. It will be quite interesting to vectorize a custom math function library to work on 8-element double-precision vectors.
>>>In my opinion AVX-512 is even a power efficiency feature for high DLP workloads>>>
But at the cost of more transistors, and thus more gate logic, to implement the new instructions.
iliyapolak wrote:
>>>In my opinion AVX-512 is even a power efficiency feature for high DLP workloads>>>
But at the cost of more transistors, and thus more gate logic, to implement the new instructions.
Not really. You only need half the number of cores for the same (peak) throughput, so it saves lots of transistors for perfectly vectorizable workloads. Even more typical code contains many loops with independent iterations that can be vectorized, and in recent years many new parallel algorithms have been developed. So a balance between scalar and vector processing is required for the optimal average performance/transistor and performance/watt, and AVX-512 would definitely help improve that. The cost is fully compensated by the average gains.
Also note that 14 nm technology and beyond greatly increases the transistor budget, so no compromise has to be made. Any other way those transistors could be spent is unlikely to offer the same net benefit.
Regarding needing half the cores to do the same job: that is true. Bear in mind, though, that some of the newest machine-code instructions, while implemented at the hardware level as micro-ops, may use more CPU resources (adders, multipliers, and other logic units) and thus consume more power. Also, while working on larger vectors, more energy may be dissipated over a short interval (presumably a single clock cycle) because more raw data has to be operated on.
I forgot to add that a greater number of transistors generates more heat. So it is all about finding the proper balance: gaining computational power as a function of heat dissipation while trying to minimize the heat being generated.
I agree with c0d1f1ed. The transistor budget is there, and you have more options than just adding cores. If you make the vectors wider, you amortize the power cost of handling a uop: you have one cache, one fetch unit, one uop cache, one load/store unit (with one load queue), etc. Going wider makes sense, so long as you can use it, which is up to the app, the ISA, and the compiler used to leverage it. That's preferable to having many cores burning duplicated power for no return. Lastly, scalar performance is critical: since most "highly vectorized" apps average only 30-45% vector code, you need good scalar performance, because algorithms aren't all just sitting in some highly vectorizable loop. There are transitions, conditional control flow, etc., which necessitate a well-rounded performance scheme. Just my 2 cents..
Perfwise
It could be interesting to compare the heat dissipation, or energy, needed to perform, for example, a trigonometric calculation with a hardware-implemented algorithm (a vectorized version of the scalar fsin machine-code instruction) against the same algorithm implemented with AVX-512 instructions. My guess is that at the front-end stage, the cost in cycles to fetch and decode the 6 or 7 terms of a Horner scheme could be greater than decoding one complex instruction.
Lithography has progressed very quickly. I still remember 130 nm chips.