Will AVX-512 replace the need for dedicated GPUs?

Christopher_H_
Beginner

I do not expect it to replace high-end graphics cards, and it will likely be less power-efficient than a dedicated GPU (integrated or discrete). As far as I can tell, performance-wise it will easily make a CPU on par with a mid-range GPU, which is far beyond what the majority of people need. A 3 GHz 4-core Skylake will have 768 GFLOPS (3 GHz × 4 cores × 2 × 16-wide FMA × 2 FLOPs per FMA). The integrated GPU takes up enough die space to allow for 8-core chips, which would double the peak FLOPS. Intel already has the OpenGL and DirectX software renderers from Larrabee. The only thing really lacking is memory bandwidth, although DDR4 and Crystalwell should help with this.
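
To spell out that arithmetic (a quick sketch in C++; it assumes two 16-wide FMA ports per core and counts each FMA as two FLOPs, per the figures above):

    #include <cstdio>

    int main() {
        const double clock_ghz     = 3.0;  // clock speed
        const int    cores         = 4;
        const int    fma_ports     = 2;    // "2x16FMA": two FMA units per core
        const int    simd_lanes    = 16;   // 512 bits / 32-bit floats
        const int    flops_per_fma = 2;    // a fused multiply-add = 2 FLOPs
        double gflops = clock_ghz * cores * fma_ports * simd_lanes * flops_per_fma;
        std::printf("Peak: %.0f GFLOPS\n", gflops);  // Peak: 768 GFLOPS
        return 0;
    }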

Bernard
Valued Contributor I

A GPU can still outperform such a CPU in highly parallel applications like pixel processing, where adjacent pixels are not interdependent. Moreover, the CPU also needs to run OS kernel and user code, manage context switching, and service interrupts. While an interrupt service routine is executing, the CPU puts the rendering thread on hold for a short period. And considering that UI objects like desktops, windows, and font rendering are probably rendered by the GPU in Windows 8, loading the CPU back up with work that has already been offloaded to the GPU will not increase performance.

capens__nicolas
New Contributor I

Yes, CPUs and GPUs have been converging for many years, and will eventually unify into a homogeneous architecture: Why the Future of 3D Graphics is in Software.

However, I don't think AVX-512F will be sufficient for that to happen. It will just move some of the work back to the CPU: CPU Onloading. GPUs have fixed-function hardware, and implementing that functionality efficiently in software requires a set of new instructions. While I believe that can be achieved while keeping the instructions highly generic, those additional instructions still have to be executed, so less processing power is available for the shaders. Hence we need higher total processing power.

Doubling the number of cores does not make sense and would consume too much power. Instead, each core could have two SIMD clusters, each dedicated to one thread. To keep utilization high on both clusters, we could have AVX-1024 instructions that issue over two cycles.

capens__nicolas
New Contributor I

iliyapolak wrote:
A GPU can still outperform such a CPU in highly parallel applications like pixel processing, where adjacent pixels are not interdependent. Moreover, the CPU also needs to run OS kernel and user code, manage context switching, and service interrupts. While an interrupt service routine is executing, the CPU puts the rendering thread on hold for a short period. And considering that UI objects like desktops, windows, and font rendering are probably rendered by the GPU in Windows 8, loading the CPU back up with work that has already been offloaded to the GPU will not increase performance.

I believe he's talking about replacing the GPU with more CPU cores. Hence there is no bottleneck for running user code. Note that GPUs used to have separate cores for vertex and pixel processing, so you were always bottlenecked either by vertex processing or by pixel processing. But then they unified, so it no longer matters how your vertex and pixel workload is distributed. The same thing will happen with CPU and GPU cores.

Bernard
Valued Contributor I

>>>I believe he's talking about replacing the GPU with more CPU cores. Hence there is no bottleneck for running user code>>>

I agree with you.

Christopher_H_
Beginner
c0d1f1ed, which new instructions would be necessary to achieve parity with GPUs in terms of performance? Something specific for rasterization? The processing sound cards used to do has effectively been taken over by the CPU these days. It seems logical that, the way CPUs are heading (Intel's at least), there will be no need for dedicated 3D hardware.
Richard_Nutman
New Contributor I

I agree that the paths of CPUs and GPUs are merging. CPUs struggled for a long time with parallel operations, whereas GPUs struggled with programmable flexibility. Now both are gaining ground in both respects.

A GPU will always have the advantage, however, that it is built for a more specialized task and can therefore devote more die space to its specialized parallel operations.

Audio processing uses so little data that it was no longer worth dedicating hardware to it. I don't see the same applying to graphics for many years, however.

Bernard
Valued Contributor I

Regarding audio processing: in terms of speed and resources there is no need for dedicated hardware for this specific task. However, when one aims for high-fidelity sound reproduction, offloading the audio stream to dedicated external hardware is better than relying on the noisy PC environment.

Christopher_H_
Beginner
All I meant with audio was that CPUs previously were not fast enough to process sound and perform other tasks at the same time, so we had dedicated hardware (many years ago). Graphics seems close to the same redundancy now, maybe a few years away. It depends on how core counts and memory bandwidth scale up on CPUs.
Bernard
Valued Contributor I

Now I understand your point.

capens__nicolas
New Contributor I

Christopher H. wrote:
c0d1f1ed, which new instructions would be necessary to achieve parity with GPUs in terms of performance? Something specific for rasterization?

AVX-512F gives graphics developers a 'carte blanche' to design their software any way they feel like. So we shouldn't seek to achieve parity with GPUs for things they are currently highly specialized at. Just like with sound cards, we merely need adequate support for the legacy features. Where things get exciting is for algorithms the dedicated hardware is not designed for. Developers go to great lengths to shoehorn their algorithms into the graphics pipeline supported by the GPU, but we can often do much better when given total freedom.

To illustrate this: when the GPU's pixel and vertex processing became programmable, developers were concerned about keeping parity with the performance of legacy fixed-function GPUs. But nowadays we use shaders in ways completely unimaginable back then, and having dedicated logic for things like bump mapping and alpha testing is the very least of our concerns. Likewise, when vertex and pixel processing became unified, some were worried that these less specialized computing units would be less efficient. And while that might have been the case for legacy applications tuned to use separate vertex and pixel units, unifying them has created new possibilities we would never want to part with now.

So when given that 'carte blanche', and sufficient time to forget about the legacy ways of doing things, graphics code would look largely indistinguishable from any other computing code. I believe it is this generic code that AVX extensions should cater to. Any attempt at imitating current GPUs too closely would result in instructions that become dead weight as practices change over time. AVX should merely help extract the data-level parallelism (DLP) in generic code in an efficient manner.

So with that in mind, I've identified two (classes of) instructions which I think would be of lasting value:

  • A vector equivalent of the BMI instructions. Aside from floating-point operations, graphics works with a lot of small bitfields, and these instructions would also help video codecs, compression, encryption, etc. (see the first sketch below).
  • Conversion to/from fixed-point. I encounter a lot of code where there's a floating-point multiplication followed by a conversion to integer, or a conversion to floating-point followed by a division (or rather a multiplication by the reciprocal). Note that this can't be a shift operation; it has to be a multiplication, because e.g. a byte value of 0xFF represents 1.0. Even better would be an FMA operation, to account for rounding (see the second sketch below).

Both of these are examples of a broader approach to achieving better performance for generic code: combining multiple operations into one. BMI instructions perform several binary or shift operations at once, while the fixed-point instructions combine an FMA with an integer conversion. Both would also be relatively cheap to implement: BMI operations use a butterfly network, which also serves as a shift unit, and FMA units already have to normalize/denormalize values, so working with integers shouldn't be too hard.
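
As a scalar illustration of the first (a sketch; compile with -mbmi2 on a BMI2-capable CPU, and note the pixel format and values are just examples):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // Unpacking an RGB565 pixel: PEXT pulls out each bitfield in one
        // operation where plain C needs a shift and a mask per field. A
        // vector equivalent would do this across a whole register of pixels.
        uint32_t pixel = 0xF81F;                   // red=31, green=0, blue=31
        uint32_t r = _pext_u32(pixel, 0xF800u);    // bits 15..11
        uint32_t g = _pext_u32(pixel, 0x07E0u);    // bits 10..5
        uint32_t b = _pext_u32(pixel, 0x001Fu);    // bits 4..0
        std::printf("r=%u g=%u b=%u\n", r, g, b);  // r=31 g=0 b=31
        return 0;
    }

And the fixed-point pattern, with UNORM8 semantics as described above (0xFF represents 1.0, so the scale is 255, not a power of two; the +0.5 is the rounding addend an FMA form could absorb):

    #include <cstdint>
    #include <cstdio>

    // Today: three dependent ops (multiply, add for rounding, convert).
    // The proposed instruction would fuse them into one.
    uint8_t float_to_unorm8(float x) {
        return (uint8_t)(x * 255.0f + 0.5f);
    }

    // The reverse: convert, then multiply by the reciprocal of the scale.
    float unorm8_to_float(uint8_t v) {
        return (float)v * (1.0f / 255.0f);
    }

    int main() {
        std::printf("%u\n", (unsigned)float_to_unorm8(1.0f));  // 255
        std::printf("%.3f\n", unorm8_to_float(0xFFu));         // 1.000
        return 0;
    }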

Bernard
Valued Contributor I

>>>I encounter a lot of code where there's a floating-point multiplication followed by a conversion to integer,>>>

Probably done before sending the pixel values to the RAMDAC (video DAC) for analogue output, or to the TMDS transmitter for digital output.

capens__nicolas
New Contributor I

iliyapolak wrote:
>>>I encounter a lot of code where there's a floating-point multiplication followed by a conversion to integer,>>>

Probably done before sending the pixel values to the RAMDAC (video DAC) for analogue output, or to the TMDS transmitter for digital output.

No, there is no value in doing that in the programmable cores. Just like with sound processing, we still need dedicated hardware for the purely I/O parts.

Instead, these fixed-point conversion instructions would be useful for vertex fetch, conversion to an index that can be used for a table lookup (gather) to approximate transcendentals, converting floating-point positions into pixel addresses with sub-pixel accuracy (part of rasterization), reading/writing certain depth-buffer and color-buffer formats, converting normalized texture coordinates into texel coordinates, converting texel values, etc. And that's just for legacy 3D graphics. In any code that uses both data types, it is extremely common for conversions between floating-point and integer values to be accompanied by a multiplication, and sometimes an addition for rounding.
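
For instance, the texture-coordinate case in scalar form (a sketch; the 256-texel width and the wrap-free addressing are illustrative assumptions):

    #include <cstdio>

    // Normalized coordinate u in [0,1) -> texel column. The multiply and
    // the float-to-int conversion are exactly the pair a fused fixed-point
    // conversion instruction would merge.
    int texel_x(float u, int width) {
        return (int)(u * (float)width);
    }

    int main() {
        std::printf("%d\n", texel_x(0.5f, 256));    // 128
        std::printf("%d\n", texel_x(0.999f, 256));  // 255
        return 0;
    }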

Bernard
Valued Contributor I

>>>conversion to an index that can be used for a table lookup (gather) to approximate transcendentals>>>

Btw, I thought that transcendentals in GPUs are approximated by the Horner scheme, where the coefficients are precalculated and stored in a LUT.
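
That scheme, in scalar form (a sketch; the quadratic coefficients below are a crude per-interval Taylor fit of exp(x) on [0,1), purely illustrative, not any GPU's actual tables, which would be minimax-fitted and wider):

    #include <cstdio>

    struct Coeffs { float c2, c1, c0; };

    // Precalculated per-interval coefficients, one entry per interval.
    static const Coeffs lut[4] = {
        {0.500f, 1.000f, 1.000f},   // [0.00, 0.25)
        {0.642f, 1.284f, 1.284f},   // [0.25, 0.50)
        {0.824f, 1.649f, 1.649f},   // [0.50, 0.75)
        {1.059f, 2.117f, 2.117f},   // [0.75, 1.00)
    };

    float approx_exp(float x) {               // x in [0, 1)
        int i = (int)(x * 4.0f);              // float->index for the LUT
        float t = x - (float)i * 0.25f;       // offset within the interval
        const Coeffs& c = lut[i];
        return (c.c2 * t + c.c1) * t + c.c0;  // Horner form: two FMAs
    }

    int main() {
        std::printf("%f\n", approx_exp(0.3f));  // ~1.3498 (exp(0.3) = 1.34986)
        return 0;
    }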
