Intel® ISA Extensions

PDEP/PEXT operations for AVX

capens__nicolas
New Contributor I

Hi,

I was wondering if Intel engineers would consider adding certain BMI instructions to AVX. In particular, I think vector variants of PDEP and PEXT would be of great value for various kinds of multimedia that often encode things in small bitfields.

I'm personally interested in it in light of the ongoing CPU-GPU convergence. AVX2's gather and FMA support will help a great deal to catch up with the GPU, but there's still a handful of legacy graphics operations that require a relatively large number of instructions to implement on the CPU. The great thing about PDEP/PEXT is that developers could also efficiently implement custom data formats. Recently there has been a lot of research into custom rasterization and anti-aliasing algorithms, but I'm sure the uses go far beyond (rasterization) graphics.

Thanks,
Nick

Bernard
Valued Contributor I

>>>I was wondering if Intel engineers would consider adding certain BMI instructions to AVX. In particular, I think vector variants of PDEP and PEXT would be of great value for various kinds of multimedia that often encode things in small bitfields>>>

You raised a very interesting question. It would be interesting to know what the cycle cost of the PEXT/PDEP instructions is.

Regarding CPU-GPU convergence, I think that at the far extreme a GPU could employ an auxiliary on-die CPU unit for managing general-purpose computation. This would be needed for its more efficient cache and memory management, out-of-order execution and ability to exploit parallelism.

 

capens__nicolas
New Contributor I

iliyapolak wrote:
You raised a very interesting question. It would be interesting to know what the cycle cost of the PEXT/PDEP instructions is.

According to this article, it's only marginally slower than a shift operation. Both scalar and vector shift instructions take one cycle today so I guess it's feasible to keep that latency (it would be problematic to change it due to writeback conflicts).

iliyapolak wrote:
Regarding CPU-GPU convergence, I think that at the far extreme a GPU could employ an auxiliary on-die CPU unit for managing general-purpose computation. This would be needed for its more efficient cache and memory management, out-of-order execution and ability to exploit parallelism.

That might be sensible/necessary in the shorter term for discrete GPUs, but integrated GPUs are gaining market share fast and they obviously already have CPU cores close by. Also, AMD is hard at work unifying the address space between them. So the convergence is happening from the bottom up, and it's starting to make sense to bring things even closer together.

Imagine extending AVX to 512-bit (cf. Xeon Phi), and replacing the integrated GPU with more CPU cores. This would offer up to 2 TFLOPS of computing power, on the same die size as today's mainstream CPUs! But most importantly, all this computing power would be under direct control of developers. No need for various complex APIs that have buggy drivers.
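(As one possible back-of-envelope for where a figure like that could come from, under my own assumed parameters rather than anything official: 8 cores × 2 FMA units per core × 16 single-precision floats per 512-bit register × 2 FLOPs per FMA × 4 GHz ≈ 2 TFLOPS.)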

Fully unifying the CPU and GPU architecture would clearly revolutionize computing as we know it. Of course this isn't without challenges, but for instance the latency hiding qualities of the GPU could be achieved with 1024-bit instructions that take two cycles (in combination with existing technologies like Hyper-Threading, out-of-order execution, prefetching, etc. this should suffice).

Bernard
Valued Contributor I

>>>According to this article, it's only marginally slower than a shift operation. Both scalar and vector shift instructions take one cycle today so I guess it's feasible to keep that latency (it would be problematic to change it due to writeback conflicts)>>>

Thanks for posting the link. It was a very interesting article.

>>>Also, AMD is hard at work unifying the address space between them>>>

Nvidia in its Fermi architecture implemented a flat 40-bit address space accessible from the driver, but it cannot be compared to a fully unified address space. Do you know at which stage of the GPU architecture x86 machine code is translated into the native GPU binary representation? There is hardly any relevant information regarding this. From what I have been able to understand, the R600 GPU implements some kind of front-end called the Command Processor, which is responsible for managing DMA (the circular buffer pointer), and probably this unit also translates vertex data and other graphics data fetched from the MM I/O address space.

>>>Imagine extending AVX to 512-bit (cf. Xeon Phi), and replacing the integrated GPU with more CPU cores. This would offer up to 2 TFLOPS of computing power, on the same die size as today's mainstream CPUs! But most importantly, all this computing power would be under direct control of developers>>>

But you also need special hardware like the various ROP units, texture mapping units and TMDS transmitters. I think that only the programmable cores (shader processing cores) could benefit from such a unification. Regarding the wide 1024-bit registers, they could improve data throughput, but the GPU also has so-called constant registers and various special buffers implemented in hardware (stencil buffer, color buffer, z-buffer), which also bear a silicon cost. Looking strictly at the processing power (shader cores only) needed to run today's 3D applications, it is quite possible to unify the CPU and GPU. But when you add all the complex hardware needed for rendering, you cannot mimic high-end GPU functionality within the same die size as today's mainstream CPUs.

Bernard
Valued Contributor I

Hi c0d1f1ed!

I posted a reply, but it is queued for admin approval.

capens__nicolas
New Contributor I

iliyapolak wrote:
Nvidia in its Fermi architecture implemented a flat 40-bit address space accessible from the driver, but it cannot be compared to a fully unified address space.

Indeed it's still a far cry from what a unified architecture would offer. Despite the hype, GPUs have had little success for general-purpose usage (GPGPU) in the consumer market. I believe that's because the APIs, the drivers, and the heterogeneous hardware throw up barriers that make it hard to develop applications, and they lower the effective performance. A unified architecture would increase ROI for software companies and in turn increase demand for such hardware without boundaries.

iliyapolak wrote:
But you also need special hardware like the various ROP units, texture mapping units and TMDS transmitters. I think that only the programmable cores (shader processing cores) could benefit from such a unification.

As explained in the article from my first post, ROP units really aren't necessary. Note that for instance a GTX 680 has 1536 stream processors but only 32 ROP units. So their role is negligible and still declining due to shader-based anti-aliasing techniques and things like programmable blending.

Texture mapping can also be achieved adequately without specialized units. It's basically a gather operation and some arithmetic for filtering. The filtering part is starting to diversify, as not every sample operation is from an actual texture (these days various generic data structures are used), and when higher quality filtering is required it has to be done in the shader. Texture mapping is one of the many things that would benefit from PDEP/PEXT operations for AVX.
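To make that concrete, here's a minimal scalar sketch of the "gather plus filtering arithmetic" idea (my own illustration, assuming an RGBA8 texture stored row-major and normalized (u,v) coordinates; none of the names come from an actual renderer):

#include <math.h>
#include <stdint.h>

/* Sketch of software texture sampling: a "gather" of four texels followed by
   bilinear filter arithmetic. Assumes an RGBA8 texture in row-major order and
   normalized (u,v) coordinates; names are illustrative only. */
static void bilinear_sample(const uint8_t *texels, int width, int height,
                            float u, float v, float rgba_out[4])
{
    float x = u * (float)width  - 0.5f;
    float y = v * (float)height - 0.5f;
    int   x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - (float)x0,  fy = y - (float)y0;   /* filter weights */

    /* Clamp the four neighbor coordinates to the texture edges. */
    int cx0 = x0 < 0 ? 0 : (x0 >= width  ? width  - 1 : x0);
    int cx1 = x0 + 1 >= width  ? width  - 1 : (x0 + 1 < 0 ? 0 : x0 + 1);
    int cy0 = y0 < 0 ? 0 : (y0 >= height ? height - 1 : y0);
    int cy1 = y0 + 1 >= height ? height - 1 : (y0 + 1 < 0 ? 0 : y0 + 1);

    for (int c = 0; c < 4; ++c) {
        /* Gather: fetch the same channel of the four neighboring texels. */
        float t00 = texels[(cy0 * width + cx0) * 4 + c];
        float t10 = texels[(cy0 * width + cx1) * 4 + c];
        float t01 = texels[(cy1 * width + cx0) * 4 + c];
        float t11 = texels[(cy1 * width + cx1) * 4 + c];
        /* Filter: two lerps in x, one in y. */
        float top    = t00 + fx * (t10 - t00);
        float bottom = t01 + fx * (t11 - t01);
        rgba_out[c]  = (top + fy * (bottom - top)) / 255.0f;
    }
}

A vectorized version would fetch the texels with AVX2 gather and keep the filter arithmetic in SIMD registers, but the structure stays the same.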

Indeed TMDS transmitters are indispensable, and that's obviously true for any pure I/O functionality. And it's not a limiting factor. Just compare this to sound processing. You used to need a discrete sound card to get any sound at all, but today all the computations are done on the CPU and the only specialized hardware is the I/O receivers/transmitters. The rest is all programmable and developers can achieve any sound effect they like without being limited by the hardware.

iliyapolak wrote:
Regarding the wide 1024-bit registers, they could improve data throughput, but the GPU also has so-called constant registers and various special buffers implemented in hardware (stencil buffer, color buffer, z-buffer), which also bear a silicon cost.

Constant registers are no longer actual physical registers. Modern GPUs just have shared memory / caches which hold these constant values. CPU architectures automatically achieve very high locality of reference for such values. Likewise the stencil, color and depth buffers themselves are generic memory buffers. It's the ROPs that read and write to them, but again that's perfectly achievable in software, even more so with vector PDEP/PEXT support. Note that most of the silicon cost on the GPU is a waste because at any given time many of the features are inactive. For instance having stencil operations active is rare, and even when active it may not be using all features, such as two-sided stencil. In software, dynamic code generation can be employed to include only the operations that are required at the time. The cost of inactive stenciling is zero, and single-sided stencil is cheaper than two-sided stencil.

iliyapolak wrote:
Looking strictly at the processing power (shader cores only) needed to run today's 3D applications, it is quite possible to unify the CPU and GPU. But when you add all the complex hardware needed for rendering, you cannot mimic high-end GPU functionality within the same die size as today's mainstream CPUs.

I'm not really talking about high-end GPUs. As I noted before, the convergence is happening from the bottom up. Replacing the integrated GPU with more CPU cores would practically double the GFLOPS directly available to any application, not just graphics. Also, I hope I've convinced you that the "complex" hardware you refer to is not that complex, is becoming of negligible importance, and is underutilized.

Bernard
Valued Contributor I

>>>Also, I hope I've convinced you that the "complex" hardware you refer to is not that complex, is becoming of negligible importance, and is underutilized>>>

Yes, it was a very interesting post. You know, I have not been following the progress that has been made towards unifying the GPU and CPU architectures, so I'm stuck in the past :) Regarding the Khronos project, IIRC I had a very insightful discussion with the forum member @bronxz, who is working on that project, and he told me that 3D rendering can be done completely on the CPU.

You mentioned that TMU functionality can be implemented in software without the need for special hardware. Yes, that is true. One example is bilinear interpolation done on the texels, which can be performed easily on the CPU.

SergeyKostrov
Valued Contributor II
>>...Imagine extending AVX to 512-bit...

It is already a reality; take a look at the zmmintrin.h header file.
Bernard
Valued Contributor I

Sergey Kostrov wrote:

>>...Imagine extending AVX to 512-bit...

It is already a reality; take a look at the zmmintrin.h header file.

You mean the Haswell microarchitecture?

capens__nicolas
New Contributor I

Sergey Kostrov wrote:
>>...Imagine extending AVX to 512-bit...

It is already a reality; take a look at the zmmintrin.h header file.

As far as I know that's only for the MIC architecture (hence why I mentioned Xeon Phi), and it is not referred to as AVX, even though the MIC's MVEX encoding format is fairly similar to AVX's VEX encoding. Of course it's equally intriguing that the 512-bit registers were renamed from v0-v31 in Larrabee to zmm0-zmm31 in Xeon Phi. Lastly, the first AVX documents mentioned extensibility up to 1024-bit, and the VEX encoding format indeed has a couple of unused bits that could be used for that purpose...

So I'm hopeful that Intel is considering revolutionizing computing by combining the flexibility of the CPU and the throughput of the GPU into a unified architecture. But I haven't seen any confirmation yet of this already being "reality". Extending AVX to 512-bit, while preserving the CPU's high IPC, won't be trivial. Of course they can do it gradually by first extending the register set to 512-bit and executing the instructions on the existing 256-bit execution units in two cycles. This extra storage helps hide latency, so it would be a useful intermediate step.

The other differences with the MIC ISA are that it has mask registers and that AVX only has 16 generic vector registers. But I think the vblend instructions can do any required masking operations just fine, and the lower number of registers is compensated by having two L1 read ports and store-to-load bypass. So once the desktop architecture features AVX-512 and the GPU is replaced with more CPU cores, the MIC and its MVEX encoding might quickly become obsolete.
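For what it's worth, a minimal sketch of that masking idea with the existing AVX intrinsics (my own assumed example of a predicated add, not code from any of the products discussed here):

#include <immintrin.h>

/* Compute a + b in every lane, but only commit the result in lanes where the
   comparison holds, using vblendvps instead of a dedicated mask register. */
static __m256 masked_add_where_less(__m256 dst, __m256 a, __m256 b,
                                    __m256 x, __m256 y)
{
    __m256 mask = _mm256_cmp_ps(x, y, _CMP_LT_OQ);  /* all-ones where x < y     */
    __m256 sum  = _mm256_add_ps(a, b);              /* computed unconditionally */
    return _mm256_blendv_ps(dst, sum, mask);        /* keep dst where mask is 0 */
}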

SergeyKostrov
Valued Contributor II
>>...Extending AVX to 512-bit, while preserving the CPU's high IPC, won't be trivial...

It is already supported by the Intel Parallel Studio XE 2013 (!) Initial Release. However, in zmmintrin.h there is no sign of a code name for the instruction set. This is what the description says: "* Definitions and declarations for use with 512-bit compiler intrinsics. ... Most 512-bit vector instructions have names..."
Thomas_W_Intel
Employee

Nick,

What are the use cases where you need the full power of PDEP/PEXT? I usually find that only a consecutive range of bits is needed, which can be accessed efficiently using a shift plus an AND/OR. Apart from this, the two-instruction sequence saves you from preparing the argument for PEXT/PDEP. With AVX2 you will get variable vector shifts and can do the same in vector space.
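For example (the 10-bit field at bit offset 10 below is just an illustrative assumption, not taken from Nick's use case):

#include <immintrin.h>
#include <stdint.h>

/* Two-instruction bitfield extract: shift, then mask. */
static uint32_t extract_field_scalar(uint32_t word)
{
    return (word >> 10) & 0x3FF;                        /* shr + and */
}

/* The same in vector space with the AVX2 variable shift (vpsrlvd);
   the shift count could even differ per lane. */
static __m256i extract_field_avx2(__m256i words)
{
    __m256i shifted = _mm256_srlv_epi32(words, _mm256_set1_epi32(10));
    return _mm256_and_si256(shifted, _mm256_set1_epi32(0x3FF));
}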

Kind regards

Thomas

capens__nicolas
New Contributor I

Hi Thomas,

Graphics and multimedia use a lot of data types with packed bitfields. For instance, A2B10G10R10 is a color format with 10 bits for the red, green and blue shades, and 2 bits for 'alpha'. PDEP can move each component into a 16-bit SIMD lane for arithmetic processing (four pixels in parallel if it were a 256-bit AVX instruction). PEXT can pack it back together. Since the "bandwidth wall" is becoming a bigger issue as CPUs scale up the number of cores and widen the SIMD units, I think using compact data representations like these will become quite important.
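As a concrete sketch with today's scalar BMI2 intrinsics (assuming R sits in the low 10 bits and A in the top 2, and a BMI2-capable CPU in 64-bit mode; a vector variant would simply do this per 64-bit lane):

#include <immintrin.h>
#include <stdint.h>

/* One field per 16-bit lane: bits 0-9 (R), 16-25 (G), 32-41 (B), 48-49 (A). */
#define A2B10G10R10_LANE_MASK 0x000303FF03FF03FFULL

/* Spread a packed A2B10G10R10 pixel into four 16-bit lanes (R, G, B, A). */
static uint64_t unpack_a2b10g10r10(uint32_t pixel)
{
    return _pdep_u64(pixel, A2B10G10R10_LANE_MASK);
}

/* Gather the low bits of each 16-bit lane back into the packed format. */
static uint32_t pack_a2b10g10r10(uint64_t lanes)
{
    return (uint32_t)_pext_u64(lanes, A2B10G10R10_LANE_MASK);
}

After the deposit, the four lanes can be processed with ordinary 16-bit SIMD arithmetic and then repacked with the same mask.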

The beauty of PDEP/PEXT is that they're generic. They can be used with various legacy formats but also many new ones. There are new video codecs every year, and they have different specifications for lots of bitfields. The use cases are endless and certainly not limited to graphics and multimedia. Basically anywhere people use bitfields in a loop with independent iterations, a SIMD version of PDEP/PEXT would likely have some use. And even for the case where you only need consecutive bits, it saves an AND/OR instruction. I'm not entirely sure what you mean by saving the preparation of the argument for PDEP/PEXT when using the two-instruction sequence. I assume you're referring to cases where the offset is variable? Yes, that's better handled by a shift, but in most cases the offset is known, so you can load the argument to PDEP/PEXT from memory.

Cheers,
Nick 
