Is this for real? Where could I find an Intel description of this with some background? The AVX intro document mentions this too, but only in passing.
With 1024-bit registers, how would memory supply enough data to feed this beast?
Obviously this field is there for future-proofing the AVX instructions, which is a first for x86 AFAIK, and a really good decision IMHO. One may assume that a 1024-bit vector CPU would use cache lines that are at least 1024 bits wide. Nowadays they're still 512 bits.
It doesn't necessarily have to process 1024 bits in one clock cycle. The Pentium III and 4 executed 128-bit SSE instructions on 64-bit execution units by splitting them into two uops.
This implicitly allows access to more physical register space. An AVX-1024 instruction would be the equivalent of "unrolling" it into four AVX-256 instructions, but without the risk of running out of ymm registers. This means it's easier to cover instruction latencies, and thus you can reach higher effective throughput.
A possibly even more compelling reason to implement AVX-1024 without widening the execution path is power consumption. Instead of splitting the instruction into four uops, I believe it could remain a single uop by performing the actual sequencing at the issue stage. This means the entire front-end and even part of the scheduler could be clock-gated due to this lower instruction rate.
Note that this is quite similar to how GPUs function. AMD processes 2048-bit vectors on 512-bit execution units in four cycles, while NVIDIA processes 1024-bit vectors on 512-bit execution units using a front-end clocked at half the frequency.
Together with AVX2's gather and FMA support, I believe this would make GPGPU processing obsolete. The CPU is much more flexible, and a homogeneous high-throughput architecture would eliminate the CPU-GPU communication overhead, offering even higher effective performance.
I've heard that for the last few years, and so far CUDA has proved to be much faster.
I tried using AVX on a dual-CPU system, each with 6 cores, and I continually get stuck on poor memory bandwidth.
It looks as if AVX was meant to improve the floating-point vector API, yet no one took the bandwidth problem into account. In some cases SSE or plain C code gets better results.
That's mainly due to the lack of gather support. A single gather instruction can replace 18 instructions!
FMA would also offer a significant increase in computing power. And because of this Haswell is also expected to double the cache bandwidth. Note that Sandy Bridge-E is said to already have twice the RAM bandwidth. So the combination of all these things would make the CPU highly effective at throughput computing.
I checked the Programming Reference and I actually couldn't locate this 2-bit field. There's only a 1-bit VEX.L field for indicating 128-bit or 256-bit operation... Or did I miss something?
There's a VEX.m-mmmm field in the 3-byte VEX format which has three reserved bits though.
I just found out that one of these bits is used by AMD's XOP instructions. Interestingly, it's the middle bit of the trio, which suggests that perhaps Intel already has specific plans for the first bit...
Never mind, I've finally taken the time to study the VEX encoding (in particular how it avoids collisions with legacy instructions), and I found out that while the XOP encoding uses the same format, it actually maps to a previously unused part of a 'group opcode', where the mod field overlaps with the VEX.mmmmm field and needs to have a fixed value.
But for VEX I believe the first three bits of the mmmmm field are all still available for future extensions, like 1024-bit AVX...